How do I parse content located in specific HTML tags using a Nutch plugin?

I am using Nutch to crawl websites, and I want to parse specific sections of the HTML pages that Nutch fetches. For example:

<h><title> title to search </title></h> 
    <div id="abc"> 
     content to search 
    </div> 
    <div class="efg"> 
     other content to search 
    </div>

I want to parse the div elements with id "abc" and class="efg", and so on.

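I know that I have to create a custom parse plugin, because the htmlparser plugin provided by Nutch removes all HTML tags, CSS, and JavaScript content and leaves only the text content. I referred to this blog post, http://sujitpal.blogspot.in/2009/07/nutch-custom-plugin-to-parse-and-add.html, but I found that it parses by HTML tag, whereas I want to parse HTML tags whose attributes have specific values. I found that Jericho has been mentioned as useful for parsing specific HTML tags, but I could not find any example of a Nutch plugin that uses Jericho.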

I need some guidance on how to devise a strategy for parsing HTML pages based on tags whose attributes have specific values.
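One possible strategy, sketched under assumptions: as far as I know, Nutch 1.x hands a custom HtmlParseFilter the parsed page as an org.w3c.dom.DocumentFragment, so the core of such a plugin would be a DOM walk that checks attribute values. The class and method names below (AttributeTextExtractor, extract, parse) are hypothetical and not part of Nutch, and the JDK XML parser is only a stand-in for the DOM that Nutch would supply, so this sketch only accepts well-formed markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects the trimmed text of every
// element whose attribute contains a given whitespace-separated token.
class AttributeTextExtractor {

    static List<String> extract(Node root, String attr, String value) {
        List<String> out = new ArrayList<>();
        walk(root, attr, value, out);
        return out;
    }

    // Depth-first walk over the DOM, in document order.
    private static void walk(Node node, String attr, String value, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && hasToken((Element) node, attr, value)) {
            out.add(node.getTextContent().trim());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, attr, value, out);
        }
    }

    // Token match, so class="x efg" still matches "efg".
    private static boolean hasToken(Element e, String attr, String value) {
        for (String token : e.getAttribute(attr).trim().split("\\s+")) {
            if (token.equals(value)) {
                return true;
            }
        }
        return false;
    }

    // Stand-in for the DOM a crawler would provide; requires well-formed markup.
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div id=\"abc\"> content to search </div>"
                + "<div class=\"efg\"> other content to search </div>"
                + "</body></html>";
        Node dom = parse(page);
        System.out.println(extract(dom, "id", "abc"));    // [content to search]
        System.out.println(extract(dom, "class", "efg")); // [other content to search]
    }
}
```

Inside a real plugin, the extract call would presumably run on the DOM passed to the filter, and the collected strings would be stored in the parse metadata so that an indexing filter can pick them up later. That wiring is not shown here.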

Source

2013-07-31 abhijeet

You can use this plugin to extract data from web pages based on CSS rules:

https://github.com/BayanGroup/nutch-custom-search

For your example, you can configure it this way:

<config> 
    <fields> 
     <field name="custom_content" /> 
    </fields> 
    <documents> 
     <document url=".+" engine="css"> 
      <extract-to field="custom_content"> 
       <text> 
        <expr value="#abc" /> 
       </text> 
       <text> 
        <expr value=".efg" /> 
       </text> 
      </extract-to> 
     </document> 
    </documents> 
</config>
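To make clear what the two expressions in this configuration select, here is a minimal sketch that is independent of the plugin and does not show how the plugin works internally. It translates the selectors #abc and .efg into equivalent XPath 1.0 queries and runs them with the JDK's javax.xml.xpath API. SelectorDemo, toXPath, and select are hypothetical names, only these two selector forms are handled, and the input must be well-formed markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration, not the plugin's code.
class SelectorDemo {

    // Translates the two simplest CSS selector forms into XPath 1.0.
    // Names containing an apostrophe are not handled in this sketch.
    static String toXPath(String css) {
        if (css.startsWith("#")) {
            return "//*[@id='" + css.substring(1) + "']";
        }
        if (css.startsWith(".")) {
            // Pad with spaces so that "efg" does not match class="efgh".
            return "//*[contains(concat(' ', normalize-space(@class), ' '), ' "
                    + css.substring(1) + " ')]";
        }
        throw new IllegalArgumentException("unsupported selector: " + css);
    }

    // Returns the trimmed text of all matching elements, joined by spaces.
    static String select(String xhtml, String css) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(toXPath(css), doc, XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < hits.getLength(); i++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(hits.item(i).getTextContent().trim());
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not evaluate " + css, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div id=\"abc\"> content to search </div>"
                + "<div class=\"efg\"> other content to search </div>"
                + "</body></html>";
        System.out.println(select(page, "#abc")); // content to search
        System.out.println(select(page, ".efg")); // other content to search
    }
}
```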

Source

2013-12-18 12:08:42 tahagh

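When I tried the above example in 'extractors.xml', Nutch did not index to Solr. It works if I remove either one of the '' elements. Does the plugin not accept multiple '' elements?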

This plugin does not work with the latest Nutch version, i.e. 2.x – horro
