How can I prevent headers and footers from being added to Solr?

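I have a web crawler (NCrawler) that crawls website content, and I have added code to index the crawled data into Solr. My requirement is to keep each site's header, footer, and navigation pane out of the index. How can I prevent them from being added to Solr?

Is there a way to do this? Any help would be greatly appreciated.

Thanks, Anu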

2013-04-16 Anu

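You can use the HtmlDocumentProcessor class, whose constructor takes a filterTextRules parameter. This parameter must be passed as a Dictionary<string,string>, where each key and value are the start and end strings used to filter out markup.

As an example, say your HTML pages have a header and footer marked up as shown below: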

<!-- Begin Header --> 
all header markup is here 
<!-- End Header --> 

<!-- Begin Footer --> 
all footer markup is here 
<!-- End Footer -->

In that case, you can initialize HtmlDocumentProcessor in your pipeline as follows:

// Marker strings are written to match the sample markup above, including the space after "<!--".
var pipelines = new IPipelineStep[]
{
    new HtmlDocumentProcessor(
        new Dictionary<string, string>
        {
            {"<!-- Begin Header", "<!-- End Header"},
            {"<!-- Begin Footer", "<!-- End Footer"},
        },
        null),
    new PdfIFilterProcessor(),
    new TextDocumentProcessor(),
};

using (var crawler = new NCrawler.Crawler(new Uri("http://ncrawler.codeplex.com"),
    pipelines))
{
    // Processing here
}
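
To make the start/end idea concrete, here is a minimal standalone sketch of what filtering text between marker pairs can look like. It is not NCrawler's actual implementation: the MarkerFilter class and StripRegions method are hypothetical names made up for illustration, and the sketch assumes each region, markers included, should be dropped. Check the real behavior against the library source before relying on it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of NCrawler.
public static class MarkerFilter
{
    // Removes every region that begins at a rule's key and ends at its value,
    // markers included. Matching is case-insensitive (an assumption).
    public static string StripRegions(string html, IDictionary<string, string> rules)
    {
        foreach (var rule in rules)
        {
            // Skip empty markers, which would otherwise match everywhere.
            if (string.IsNullOrEmpty(rule.Key) || string.IsNullOrEmpty(rule.Value))
            {
                continue;
            }

            int start = html.IndexOf(rule.Key, StringComparison.OrdinalIgnoreCase);
            while (start >= 0)
            {
                int end = html.IndexOf(rule.Value, start + rule.Key.Length,
                                       StringComparison.OrdinalIgnoreCase);
                if (end < 0)
                {
                    break; // unmatched start marker: leave the remainder untouched
                }

                html = html.Remove(start, end + rule.Value.Length - start);
                start = html.IndexOf(rule.Key, start, StringComparison.OrdinalIgnoreCase);
            }
        }
        return html;
    }

    public static void Main()
    {
        var rules = new Dictionary<string, string>
        {
            { "<!-- Begin Header -->", "<!-- End Header -->" },
            { "<!-- Begin Footer -->", "<!-- End Footer -->" },
        };

        string page = "<!-- Begin Header -->nav<!-- End Header -->" +
                      "<p>Body</p>" +
                      "<!-- Begin Footer -->legal<!-- End Footer -->";

        Console.WriteLine(StripRegions(page, rules)); // prints "<p>Body</p>"
    }
}
```

The sketch uses the full comment text as markers so that nothing of the comment is left behind. Whatever markers you choose, they have to appear in the page source exactly as written in the dictionary.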

Hope this helps. For more details on the filterTextRules parameter and how it works, see the HtmlDocumentProcessor source.


2013-04-16 12:27:49

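Thank you very much :) @Paige Cook. Your answer really helped me, not only with this question but with my previous one as well. Are there any reference links or e-books worth reading on NCrawler-Solr integration? – Anu

Glad these helped. Unfortunately, there isn't any NCrawler-Solr integration reference; I learned all of this through trial and error. –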
