2013-01-02

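Nutch crawl stuck at spinwaiting/active: how can I reduce the fetch cycle?

I am using Nutch 2.1 to crawl a website. The fetcher keeps reporting URLs as spinwaiting/active, and because the fetch takes so long, the connection to MySQL times out. How can I reduce the number of URLs fetched at one time so that MySQL does not time out? Is there a setting in Nutch that lets me fetch only 100 or 500 URLs, parse and store them in MySQL, and then fetch the next 100 or 500 URLs?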

Error message:

Unexpected error for http://www.example.com 
java.io.IOException: java.sql.BatchUpdateException: The last packet successfully received from the server was 36,928,172 milliseconds ago. The last packet sent successfully to the server was 36,928,172 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem. 
    at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340) 
    at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:65) 
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:587) 
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) 
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.output(FetcherReducer.java:663) 
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:534) 
Caused by: java.sql.BatchUpdateException: The last packet successfully received from the server was 36,928,172 milliseconds ago. The last packet sent successfully to the server was 36,928,172 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem. 
    at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028) 
    at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451) 
    at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328) 
    ... 5 more 
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: The last packet successfully received from the server was 36,928,172 milliseconds ago. The last packet sent successfully to the server was 36,928,172 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem. 
    at sun.reflect.GeneratedConstructorAccessor49.newInstance(Unknown Source) 
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) 
    at java.lang.reflect.Constructor.newInstance(Constructor.java:525) 
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:411) 
    at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1116) 
    at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3364) 
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1983) 
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163) 
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624) 
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127) 
    at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427) 
    at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1980) 
    ... 7 more 
Caused by: java.net.SocketException: Broken pipe 
    at java.net.SocketOutputStream.socketWrite0(Native Method) 
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) 
    at java.net.SocketOutputStream.write(SocketOutputStream.java:153) 
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) 
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) 
    at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3345) 
    ... 13 more 

Answer


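"I am using Nutch 2.1 to crawl a website. The fetcher keeps reporting URLs as spinwaiting/active, and because the fetch takes so long, the connection to MySQL times out. How can I reduce the number of URLs fetched at one time so that MySQL does not time out?"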

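To reduce the amount fetched at one time, add the following property to your nutch-site.xml and adjust the value to suit your needs. Please do not edit nutch-default.xml directly; instead, copy the property from there into nutch-site.xml and manage the value there: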

<property>
    <name>fetcher.threads.fetch</name>
    <value>20</value>
</property>

As for the timeout problem, you can add this property to your nutch-site.xml, with a value you consider sufficient for the load time you need.

<property> 
    <name>http.timeout</name> 
    <value>240000</value> 
    <description>The default network timeout, in milliseconds.</description> 
</property> 

"Is there a setting in Nutch that lets me fetch only 100 or 500 URLs, parse and store them in MySQL, and then fetch the next 100 or 500 URLs?"

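Nutch runs its steps in a loop: generate/fetch/parse/update, repeated for the number of iterations you specify as "depth" in your crawl command. If you want to control the crawl, you can run each step individually, as described in section 3.2 (Using Individual Commands for Whole-Web Crawling) of the tutorial at http://wiki.apache.org/nutch/NutchTutorial. This gives you clear direction and a precise picture of what is happening. Check the status as each segment is fetched so that you know how many URLs are being fetched in each segment.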
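
As a rough illustration of running the steps one at a time, one round of the cycle could be wrapped in a small shell function like the one below. This is only a sketch, not taken from the tutorial: the function name crawl_round and the NUTCH variable are made up for the example, and the command names and flags (generate -topN, fetch -all, parse -all, updatedb) are assumed to match the Nutch 2.x command-line tool, so check the usage text that bin/nutch prints for your version. Capping each round with -topN might keep each fetch job short enough that the MySQL connection does not sit idle past wait_timeout.

```shell
#!/bin/sh
# Sketch only: command names and flags are assumed from the Nutch 2.x
# command-line tool; verify them against the usage text of your version.

# Path to the nutch launcher (hypothetical default); override with NUTCH=...
NUTCH="${NUTCH:-bin/nutch}"

# crawl_round TOP_N
# Runs one generate/fetch/parse/update round, limited to TOP_N URLs.
# Stops at the first step that fails and returns a non-zero status.
crawl_round() {
    top_n="$1"
    "$NUTCH" generate -topN "$top_n" || return 1  # select at most TOP_N URLs
    "$NUTCH" fetch -all || return 1               # fetch the generated batch
    "$NUTCH" parse -all || return 1               # parse what was fetched
    "$NUTCH" updatedb || return 1                 # write results back to the store
}
```

For example, after injecting your seed URLs, calling crawl_round 500 three times in a row would correspond roughly to a crawl of depth 3 with at most 500 URLs per round.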