Date: 2013/10/13
System: Ubuntu 12.04 LTS
JDK: 1.7.0_21
Nutch: 2.2.1
MySQL: 5.5.32
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Pre1: Install and configure Oracle JDK
Pre2: Install and configure MySQL: sudo apt-get install mysql-server mysql-client
Pre3: Install and configure Apache Ant: sudo apt-get install ant
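Before moving on, it can help to confirm that the three prerequisites are actually on the PATH. The following is a minimal sketch, not part of the original setup; the helper name missing_tools is hypothetical, and it assumes the tools are invoked as java, mysql and ant.

```shell
#!/bin/sh
# Print the name of every listed command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the three prerequisites named above.
missing=$(missing_tools java mysql ant)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "java, mysql and ant are all available"
fi
```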
Start: Set up Nutch 2.2.1 on Ubuntu, with MySQL as the database and UTF-8 as the default encoding
Step1: MySQL configuration
First edit /etc/mysql/my.cnf and add the following under [mysqld]:
innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M
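A typo in my.cnf is easy to miss, so a quick check that every setting is present can save a failed restart. This is a sketch under assumptions: the helper name missing_mysqld_settings is hypothetical, each setting is expected on its own line in key=value form, and only presence is checked, not the value.

```shell
#!/bin/sh
# Print each setting from this step that does not appear in the given config file.
# Assumption: one key=value pair per line; leading whitespace is tolerated.
missing_mysqld_settings() {
    cnf_file=$1
    for key in innodb_file_format innodb_file_per_table innodb_large_prefix \
               character-set-server collation-server max_allowed_packet; do
        grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$cnf_file" || echo "$key"
    done
}

# Run against the file edited in this step, if it is readable.
if [ -r /etc/mysql/my.cnf ]; then
    missing_mysqld_settings /etc/mysql/my.cnf
fi
```

No output means all six keys were found. MySQL still has to be restarted for the settings to take effect.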
Then create the database and the table:
CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;
CREATE TABLE `webpage` (
  `id` varchar(767) NOT NULL,
  `headers` blob,
  `text` mediumtext DEFAULT NULL,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(767) DEFAULT NULL,
  `content` longblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(767) DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED DEFAULT CHARSET=utf8;
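The varchar(767) primary key is the reason innodb_large_prefix and ROW_FORMAT=COMPRESSED appear in this setup. InnoDB limits an index key to 767 bytes by default and to 3072 bytes with large prefixes on COMPRESSED or DYNAMIC rows. The arithmetic can be sketched as follows; index_bytes is a hypothetical helper, and the bytes-per-character figures are the maximums for utf8 (3) and utf8mb4 (4).

```shell
#!/bin/sh
# Bytes needed to index a VARCHAR(n) column in full:
# n characters times the maximum bytes per character of its character set.
index_bytes() {
    chars=$1
    bytes_per_char=$2
    echo $((chars * bytes_per_char))
}

DEFAULT_LIMIT=767        # InnoDB index key limit without innodb_large_prefix
LARGE_PREFIX_LIMIT=3072  # limit with innodb_large_prefix on COMPRESSED/DYNAMIC rows

echo "utf8    (3 bytes/char): $(index_bytes 767 3) of $LARGE_PREFIX_LIMIT bytes"
echo "utf8mb4 (4 bytes/char): $(index_bytes 767 4) of $LARGE_PREFIX_LIMIT bytes"
```

Both results exceed the 767-byte default but stay under 3072, so the key only fits when the large-prefix settings are in effect.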
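Step2: Nutch configuration
Download Nutch 2.2.1 from the official site, http://www.apache.org/dyn/closer.cgi/nutch/, and extract it to a local installation directory. The root of that directory is referred to below as ${APACHE_NUTCH_HOME}.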
Uncomment the following line:
default"/>
Modify the following line:
Edit ${APACHE_NUTCH_HOME}/conf/gora.properties: comment out the default database connection settings and add the following:
###############################
# MySQL configure #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
# Replace xxxx with your MySQL username
gora.sqlstore.jdbc.user=xxxx
# Replace xxxx with your MySQL password
gora.sqlstore.jdbc.password=xxxx
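To confirm which connection settings gora.properties actually holds after editing, the values can be read back from the file. This is a minimal sketch, assuming plain key=value lines with no line continuations; get_property is a hypothetical helper, not part of Nutch or Gora.

```shell
#!/bin/sh
# Print the value of one key from a Java-style .properties file.
# Assumption: simple key=value lines; if a key repeats, the last one wins.
get_property() {
    props_file=$1
    key=$2
    # Match the key literally at the start of the line and keep everything
    # after the first '=', so values containing '=' (JDBC URLs) stay intact.
    awk -v k="$key" 'index($0, k "=") == 1 { v = substr($0, length(k) + 2) }
                     END { if (v != "") print v }' "$props_file"
}

# Show the driver and URL from the file edited in this step, if it is readable.
props="${APACHE_NUTCH_HOME:-.}/conf/gora.properties"
if [ -r "$props" ]; then
    echo "driver: $(get_property "$props" gora.sqlstore.jdbc.driver)"
    echo "url:    $(get_property "$props" gora.sqlstore.jdbc.url)"
fi
```

Lines commented out with # are skipped because they do not start with the key, so the output reflects only active settings.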
Modify ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml. It is recommended to make this change following the automatic table-generation approach described earlier; online sources say the primarykey length must be changed from 512 to 767, that is:
Change to:
Step5: nutch-site.xml configuration
Add the following configuration:
(The ant commands are not explained here.) Simply change to ${APACHE_NUTCH_HOME} and run ant clean, then ant. When the build finishes, a runtime folder is generated under ${APACHE_NUTCH_HOME}.
Step7: Web crawling and seed configuration
Create the seed file.
java.lang.NullPointerException
at org.apache.avro.util.Utf8.
cd ${APACHE_NUTCH_HOME}/runtime/local
mkdir -p urls
echo 'http://www.sina.com.cn' > urls/seed.txt
echo 'http://www.ifeng.com' >> urls/seed.txt
bin/nutch crawl urls -depth 5 -topN 10
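A seed file holds one URL per line. Because > replaces a file while >> appends to it, writing every seed in a single pass avoids losing an earlier entry. The helper below is a hypothetical sketch of that idea, not something shipped with Nutch.

```shell
#!/bin/sh
# Write the given URLs to a seed file, one per line,
# replacing whatever the file held before.
write_seeds() {
    seed_file=$1
    shift
    # Create the parent directory (for example urls/) if it does not exist yet.
    mkdir -p "$(dirname "$seed_file")"
    printf '%s\n' "$@" > "$seed_file"
}
```

Run from ${APACHE_NUTCH_HOME}/runtime/local, a call such as write_seeds urls/seed.txt 'http://www.sina.com.cn' 'http://www.ifeng.com' would produce a two-line seed file.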
At this point, the basic configuration is complete.