Let's develop a MapReduce program from scratch and run it on a Hadoop cluster.
Mapper code, map.py:
import sys

# Read lines from standard input, split each line into words,
# and emit a "word 1" pair for every word (Python 2 print syntax).
for line in sys.stdin:
    word_list = line.strip().split(' ')
    for word in word_list:
        print ' '.join([word.strip(), str(1)])
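Before chaining everything together, it helps to sanity-check the mapper by itself. A quick sketch, piping a hand-typed sample line through it:

# feed one sample line to the mapper; it should emit one "word 1" pair per word
echo "hello ni hao" | python map.py
# expected output:
# hello 1
# ni 1
# hao 1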
Reducer code, reduce.py:
import sys

cur_word = None
sum = 0

# Input arrives sorted by word, so identical words are adjacent.
# Accumulate the count for the current word, emit it whenever the
# word changes, and emit once more at the end for the last word.
for line in sys.stdin:
    ss = line.strip().split(' ')
    if len(ss) < 2:
        continue
    word = ss[0].strip()
    count = ss[1].strip()
    if cur_word == None:
        cur_word = word
    if cur_word != word:
        print ' '.join([cur_word, str(sum)])
        cur_word = word
        sum = 0
    sum += int(count)

print ' '.join([cur_word, str(sum)])
sum = 0
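Note that the reducer assumes its input is sorted by word, so that identical keys arrive consecutively; Hadoop's shuffle/sort guarantees this on the cluster, and the local sort -k 1 in the test pipeline below simulates it. A quick sketch of testing the reducer alone on hand-sorted input (the three sample lines are made up):

# hand-sorted "word count" pairs, as the reducer expects after the shuffle
printf 'hao 1\nhao 1\nni 1\n' | python reduce.py
# expected output:
# hao 2
# ni 1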
Input file src.txt (for testing; when running on the cluster, remember to upload it to HDFS first):
hello ni hao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni haoao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao Dad would get out his mandolin and play for the family Dad loved to play the mandolin for his family he knew we enjoyed singing I had to mature into a man and have children of my own before I realized how much he had sacrificed I had to,mature into a man and,have children of my own before.I realized how much he had sacrificed
First, debug locally to check that the results are correct by running:
cat src.txt | python map.py | sort -k 1 | python reduce.py
Output printed on the command line:
a 2
and 2
and,have 1
ao 1
before 1
before.I 1
children 2
Dad 2
enjoyed 1
family 2
for 2
get 1
had 4
hao 33
haoao 1
haoni 3
have 1
he 3
hello 1
his 2
how 2
I 3
into 2
knew 1
loved 1
man 2
mandolin 2
mature 1
much 2
my 2
ni 34
of 2
out 1
own 2
play 2
realized 2
sacrificed 2
singing 1
the 2
to 2
to,mature 1
we 1
would 1
Local debugging shows the code is OK. Next, let's run it on the cluster. For convenience I wrote a small script, run.sh, to save some manual work.
HADOOP_CMD="/home/hadoop/hadoop/bin/hadoop"
STREAM_JAR_PATH="/home/hadoop/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar"
INPUT_FILE_PATH="/home/input/src.txt"
OUTPUT_PATH="/home/output"

$HADOOP_CMD fs -rmr $OUTPUT_PATH

$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH \
    -output $OUTPUT_PATH \
    -mapper "python map.py" \
    -reducer "python reduce.py" \
    -file ./map.py \
    -file ./reduce.py
A quick walkthrough of the script:
HADOOP_CMD: path to the hadoop binary
STREAM_JAR_PATH: path to the Hadoop Streaming jar
INPUT_FILE_PATH: input path on the Hadoop cluster (HDFS)
OUTPUT_PATH: output path on the Hadoop cluster (HDFS)

The fs -rmr line deletes any previous output directory first, since the job fails if OUTPUT_PATH already exists.
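Once the job finishes, the results live under OUTPUT_PATH on HDFS. A minimal sketch of inspecting them, reusing the paths from run.sh (the part-00000 file name assumes the single default reducer, and result.txt is just an illustrative local file name):

# list the job output directory on HDFS
/home/hadoop/hadoop/bin/hadoop fs -ls /home/output
# print the reducer output (one reducer -> one part file)
/home/hadoop/hadoop/bin/hadoop fs -cat /home/output/part-00000
# optionally copy it back to the local file system
/home/hadoop/hadoop/bin/hadoop fs -get /home/output/part-00000 ./result.txt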
Run the following command to count the records produced after the reduce stage:
cat src.txt | python map.py | sort -k 1 | python reduce.py | wc -l

It prints 43.
Open master:50030 in a browser to check the details of the job.
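If you would rather stay on the command line than use the JobTracker web UI, Hadoop 1.x also exposes job status there; a sketch, where <job_id> stands for the id printed when the job was submitted:

# list currently running jobs
/home/hadoop/hadoop/bin/hadoop job -list
# show completion percentage and counters for a specific job
/home/hadoop/hadoop/bin/hadoop job -status <job_id>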
Kind      % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map       100.00%      2           0         0         2          0        0 / 0
reduce    100.00%      1           0         0         1          0        0 / 0
Under Map-Reduce Framework you can see this:
Counter                 Map   Reduce   Total
Reduce output records   0     0        43
The total of 43 reduce output records matches the local count, which shows the whole pipeline succeeded. That concludes the development of our first Hadoop program.