Nutch crawler unable to retrieve news article content

2016-08-04

I am trying to crawl news articles from the following links:

Article 1

Article 2

However, the article text from these pages does not reach the content field of the index (Elasticsearch).

The result of the crawl is:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.09492774,
    "hits": [
      {
        "_index": "news",
        "_type": "doc",
        "_id": "http://www.bloomberg.com/press-releases/2016-07-08/network-1-announces-settlement-of-patent-litigation-with-apple-inc",
        "_score": 0.09492774,
        "_source": {
          "tstamp": "2016-08-04T07:21:59.614Z",
          "segment": "20160804125156",
          "digest": "d583a81c0c4c7510f5c842ea3b557992",
          "host": "www.bloomberg.com",
          "boost": "1.0",
          "id": "http://www.bloomberg.com/press-releases/2016-07-08/network-1-announces-settlement-of-patent-litigation-with-apple-inc",
          "url": "http://www.bloomberg.com/press-releases/2016-07-08/network-1-announces-settlement-of-patent-litigation-with-apple-inc",
          "content": ""
        }
      },
      {
        "_index": "news",
        "_type": "doc",
        "_id": "http://www.bloomberg.com/press-releases/2016-07-05/apple-donate-life-america-bring-national-organ-donor-registration-to-iphone",
        "_score": 0.009845509,
        "_source": {
          "tstamp": "2016-08-04T07:22:05.708Z",
          "segment": "20160804125156",
          "digest": "2a94a32ffffd0e03647928755e055e30",
          "host": "www.bloomberg.com",
          "boost": "1.0",
          "id": "http://www.bloomberg.com/press-releases/2016-07-05/apple-donate-life-america-bring-national-organ-donor-registration-to-iphone",
          "url": "http://www.bloomberg.com/press-releases/2016-07-05/apple-donate-life-america-bring-national-organ-donor-registration-to-iphone",
          "content": ""
        }
      }
    ]
  }
}

As the response shows, the content field is empty. I tried different options in nutch-site.xml, but the result is still the same. Please help.
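When more than a couple of documents are involved, the same check can be done in code rather than by eye. The following is only a minimal sketch, assuming the search response is already available as a string. The class and method names (EmptyContentCheck, urlsWithEmptyContent) are hypothetical, and the regular expression is an assumption that works for flat _source objects like the ones shown here; it is not a general JSON parser and would break on values containing escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists the URLs of hits whose "content" field is empty.
public class EmptyContentCheck {

    // Matches a "url" value followed, inside the same JSON object
    // (no braces in between), by a "content" value.
    private static final Pattern HIT = Pattern.compile(
        "\"url\"\\s*:\\s*\"([^\"]*)\"[^{}]*?\"content\"\\s*:\\s*\"([^\"]*)\"");

    public static List<String> urlsWithEmptyContent(String responseJson) {
        List<String> urls = new ArrayList<>();
        Matcher m = HIT.matcher(responseJson);
        while (m.find()) {
            // group(1) is the url, group(2) is the content text
            if (m.group(2).trim().isEmpty()) {
                urls.add(m.group(1));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up sample data, shaped like the response above.
        String sample = "{\"hits\":{\"hits\":["
            + "{\"_source\":{\"url\":\"http://example.com/a\",\"content\":\"\"}},"
            + "{\"_source\":{\"url\":\"http://example.com/b\",\"content\":\"some text\"}}"
            + "]}}";
        System.out.println(urlsWithEmptyContent(sample));
    }
}
```

Run against the response above, this would report both Bloomberg URLs, since both hits carry an empty content string. For anything beyond a quick check, a real JSON library would be the safer choice.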

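Answers

I don't know why Nutch fails to extract the article content, but I found a workaround using Jsoup. I developed a custom parse filter plugin that parses the whole document and sets the parsed text in the ParseResult returned by the parse filter. I then used my custom parse filter in place of the parse-html plugin in parse-plugins.xml.

It looks something like this: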

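// Needs: java.nio.charset.StandardCharsets, org.jsoup.Jsoup, org.jsoup.nodes.Document, org.apache.nutch.parse.*
// Re-parse the raw page bytes with Jsoup, using the page URL as the base URI.
Document document = Jsoup.parse(new String(content.getContent(), StandardCharsets.UTF_8), content.getUrl());
Parse parse = parseResult.get(content.getUrl());
ParseData oldData = parse.getData();
// Keep the original status, outlinks and metadata; take the title from Jsoup.
ParseData parseData = new ParseData(oldData.getStatus(), document.title(), oldData.getOutlinks(),
    oldData.getContentMeta(), oldData.getParseMeta());
// Store the text of the <body> element as the parse text for this URL.
parseResult.put(content.getUrl(), new ParseText(document.body().text()), parseData);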

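This is a somewhat off-topic answer, but try Apache ManifoldCF. It provides a built-in Elasticsearch connector, along with better log history for figuring out why data was not indexed. The connector section in ManifoldCF lets you specify which field the content should be indexed into. It is a good open-source alternative.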

Thanks :). I will take a look at it. – Sachin

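I want to select the links inside a specific div or any other tag, fetch the content of those links, and index them. Can we do something like that with ManifoldCF? – Sachin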