2013-03-20 30 views
1

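NCrawler with SOLR

I have managed to crawl a website using NCrawler. Is it possible to import the crawled data into SOLR so that I can search against the indexed data there?

If so, how do I push the crawled data into SOLR? Any help would be greatly appreciated.

Thanks in advance.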

+0

You can use Nutch with Solr; it provides crawling and Solr integration out of the box. – Jayendra 2013-03-20 08:29:41

+0

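Thanks @Jayendra. I tried Nutch but could not get a crawl to succeed, and found that it is not supported on Windows with Cygwin, since I got errors exactly like [this](http://stackoverflow.com/questions/15188050). So I plan to use NCrawler for the crawling. Is it possible to use NCrawler for crawling and SOLR for indexing, respectively? – Anu 2013-03-20 09:00:30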

Answer

4

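Yes, you can index the crawled data into Solr; I have done this before. You need to create a custom pipeline step that implements IPipelineStep and add it to your NCrawler setup. I used SolrNet as the client for connecting to Solr.

Here is some code to get you started.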

// Register the document type with SolrNet once, before crawling starts.
SolrNet.Startup.Init<IndexItem>("http://localhost:8983/solr");

// The last pipeline step pushes every crawled page to Solr.
using (Crawler c = new Crawler("http://ncrawler.codeplex.com/",
    new HtmlDocumentProcessor(), new AddCrawledItemToSolrIndex()))
{
    c.ThreadCount = 3;
    c.MaxCrawlDepth = 2;
    // Skip images, stylesheets and scripts.
    c.ExcludeFilter = new[] { new RegexFilter(
     new Regex(@"(\.jpg|\.css|\.js|\.gif|\.jpeg|\.png|\.ico)",
      RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase)) };
    c.Crawl();
}
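
One thing to watch in the exclude pattern above: it is unanchored, so `\.js` also matches URLs such as `page.jsp` or `data.json`. The following is a minimal sketch of an anchored alternative, assuming the filter is matched against the full URL string. `StaticAssetFilter` and its method names are hypothetical and exist only for this comparison; they are not part of NCrawler.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only (not part of NCrawler).
public static class StaticAssetFilter
{
    private const RegexOptions Options =
        RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase;

    // The pattern used in the crawler setup above, kept for comparison.
    private static readonly Regex Original = new Regex(
        @"(\.jpg|\.css|\.js|\.gif|\.jpeg|\.png|\.ico)", Options);

    // Anchored variant: the extension must be followed by '?', '#',
    // or the end of the URL, so ".jsp" and ".json" are not excluded.
    private static readonly Regex Anchored = new Regex(
        @"\.(jpg|jpeg|png|gif|ico|css|js)(\?|#|$)", Options);

    public static bool IsExcludedByOriginal(string url)
    {
        return Original.IsMatch(url);
    }

    public static bool IsExcludedByAnchored(string url)
    {
        return Anchored.IsMatch(url);
    }
}
```

If the anchored behaviour is what you want, the same pattern string can be passed to the `Regex` inside the `RegexFilter` shown above.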

The custom IPipelineStep:

using System;
using System.Collections.ObjectModel;
using System.Security.Cryptography; // MD5
using System.Text;                  // Encoding, StringBuilder
using Microsoft.Practices.ServiceLocation;
using MyCrawler.Index;
using NCrawler;
using NCrawler.Interfaces;
using SolrNet;

namespace MyCrawler.Crawler 
{ 
    public class AddCrawledItemToSolrIndex : IPipelineStep 
    { 
     public void Process(NCrawler.Crawler crawler, PropertyBag propertyBag) 
     { 
      if (string.IsNullOrWhiteSpace(propertyBag.Text)) 
       return; 

      var indexItem = new IndexItem 
      { 
       Id = propertyBag.Step.Uri.ToString(), 
       Url = propertyBag.Step.Uri.ToString(), 
       Host = propertyBag.Step.Uri.Host, 
       Content = propertyBag.Text, 
       Title = propertyBag.Title, 
       LastModified = Convert.ToInt64(DateTimeToUnixTimestamp(propertyBag.LastModified)), 
       Date = propertyBag.LastModified.ToString("yyyyMMdd"), 
       Keywords = ExtractKeywords(propertyBag.Headers), 
       Type = SplitString(propertyBag.ContentType, ';'), 
       Digest = CreateMD5Hash(propertyBag.Text), 
      }; 
      var solr = ServiceLocator.Current.GetInstance<ISolrOperations<IndexItem>>(); 
      solr.Add(indexItem, new AddParameters {CommitWithin = 10000}); 
     } 

     private Collection<string> SplitString(string input, char splitOn)
     {
      var valueCollection = new Collection<string>();
      // Guard against a missing value (for example, no content type).
      if (string.IsNullOrWhiteSpace(input)) return valueCollection;
      foreach (var value in input.Split(splitOn))
      {
       valueCollection.Add(value.Trim());
      }

      return valueCollection;
     }

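     private double DateTimeToUnixTimestamp(DateTime dateTime)
     {
      // The Unix epoch is defined in UTC, so do the subtraction in UTC.
      var epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
      return (dateTime.ToUniversalTime() - epoch).TotalSeconds;
     }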

     private string CreateMD5Hash(string input)
     {
      // Hash the UTF-8 bytes so that non-ASCII text is not collapsed to '?'.
      byte[] hashBytes;
      using (var md5 = MD5.Create())
      {
       hashBytes = md5.ComputeHash(Encoding.UTF8.GetBytes(input));
      }

      // Convert the byte array to a lower-case hexadecimal string.
      var sb = new StringBuilder(hashBytes.Length * 2);
      foreach (var b in hashBytes)
      {
       sb.Append(b.ToString("x2"));
      }
      return sb.ToString();
     }


     private Collection<string> ExtractKeywords(System.Net.WebHeaderCollection headers) 
     { 
      var keywords = headers["keywords"]; 
      if (string.IsNullOrWhiteSpace(keywords)) 
      { 
       return new Collection<string>(); 
      } 

      return SplitString(keywords, ','); 
     } 
    } 
} 

The pipeline step uses the following IndexItem.cs class to map to the fields in the Solr index.

using System.Collections.ObjectModel; 
using SolrNet.Attributes; 

namespace MyCrawler.Index 
{ 
    public class IndexItem 
    { 
     [SolrField("id")] 
     public string Id { get; set; } 
     [SolrField("url")] 
     public string Url { get; set; } 
     [SolrField("host")] 
     public string Host { get; set; } 
     [SolrField("content")] 
     public string Content { get; set; } 
     [SolrField("title")] 
     public string Title { get; set; } 
     [SolrField("description")] 
     public string Description { get; set; } 
     [SolrField("digest")] 
     public string Digest { get; set; } 
     [SolrField("keywords")] 
     public Collection<string> Keywords { get; set; } 
     [SolrField("date")] 
     public string Date { get; set; } 
     [SolrField("contentLength")] 
     public long ContentLength { get; set; } 
     [SolrField("lastModified")] 
     public long LastModified { get; set; } 
     [SolrField("type")] 
     public Collection<string> Type { get; set; } 
    } 
} 
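
Once documents shaped like this class are in the index, they can be searched through Solr's standard `/select` handler. The following is a minimal sketch that only builds the query URL, assuming the same base URL as in the setup code and the field names mapped above. `SolrSelectUrl` is a hypothetical helper, not part of SolrNet; `q`, `fl`, `rows` and `wt` are standard Solr request parameters.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only (not part of SolrNet).
public static class SolrSelectUrl
{
    // Builds a URL that searches one field for a phrase.
    public static string Build(string solrBaseUrl, string field, string term, int rows)
    {
        // Quote the term so Solr treats it as a phrase; escape backslashes
        // and embedded quotes so they cannot break out of the phrase.
        string phrase = "\"" + term.Replace("\\", "\\\\").Replace("\"", "\\\"") + "\"";

        var parameters = new List<string>
        {
            "q=" + Uri.EscapeDataString(field + ":" + phrase),
            // Only stored fields can be returned in results.
            "fl=" + Uri.EscapeDataString("id,url,title"),
            "rows=" + rows,
            "wt=json"
        };

        return solrBaseUrl.TrimEnd('/') + "/select?" + string.Join("&", parameters);
    }
}
```

For example, `SolrSelectUrl.Build("http://localhost:8983/solr", "content", "web crawler", 10)` gives a URL you can open in a browser or fetch with any HTTP client. If you already use SolrNet for indexing, its `ISolrOperations<IndexItem>.Query` method is the more natural way to run the same search from code.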

The Solr field definitions (schema.xml) are taken from the Nutch codebase.

<fields>
    <!-- core fields -->
    <field name="segment" type="string" stored="true" indexed="false"/> 
    <field name="digest" type="string" stored="true" indexed="false"/> 
    <field name="boost" type="float" stored="true" indexed="false"/> 

    <!-- meta-tag fields --> 
    <field name="keywords" type="text_general" stored="true" indexed="true" multiValued="true"/> 
    <field name="description" type="text_general" stored="true" indexed="true"/> 

    <!-- fields for index-basic plugin --> 
    <field name="host" type="url" stored="false" indexed="true"/> 
    <field name="site" type="string" stored="true" indexed="true"/> 
    <field name="url" type="url" stored="true" indexed="true" 
     required="true"/> 
    <field name="content" type="text_general" stored="true" indexed="true"/> 
    <field name="title" type="text_general" stored="true" indexed="true"/> 
    <field name="cache" type="string" stored="true" indexed="false"/> 
    <field name="tstamp" type="long" stored="true" indexed="true"/> 

    <!-- fields for index-anchor plugin --> 
    <field name="anchor" type="string" stored="true" indexed="true" 
     multiValued="true"/> 

    <!-- fields for index-more plugin --> 
    <field name="type" type="string" stored="true" indexed="true" 
     multiValued="true"/> 
    <field name="contentLength" type="long" stored="true" 
     indexed="false"/> 
    <field name="lastModified" type="long" stored="true" 
     indexed="true"/> 
    <field name="date" type="string" stored="true" indexed="true"/> 

    <!-- fields for languageidentifier plugin --> 
    <field name="lang" type="string" stored="true" indexed="true"/> 

    <!-- fields for subcollection plugin --> 
    <field name="subcollection" type="string" stored="true" 
     indexed="true"/> 

    <!-- fields for feed plugin --> 
    <field name="author" type="string" stored="true" indexed="true"/> 
    <field name="tag" type="string" stored="true" indexed="true"/> 
    <field name="feed" type="string" stored="true" indexed="true"/> 
    <field name="publishedDate" type="string" stored="true" 
     indexed="true"/> 
    <field name="updatedDate" type="string" stored="true" 
     indexed="true"/> 

    <!-- catchall field, containing all other searchable text fields (implemented 
    via copyField further on in this schema) --> 
    <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/> 

    <field name="_version_" type="long" indexed="true" stored="true"/> 

    <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/> 
</fields> 
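
The excerpt above does not show a field for the `id` that IndexItem maps, nor the copyField rules that the catchall comment refers to. The following is a sketch of what those pieces might look like, assuming `id` should be the unique key and that `title` and `content` should feed the `text` catchall. These lines are assumptions for illustration, not a copy of the Nutch schema.

```xml
<!-- Assumed addition, placed inside <fields>: receives IndexItem.Id -->
<field name="id" type="string" stored="true" indexed="true" required="true"/>

<!-- Assumed additions, placed after the closing </fields> tag -->
<!-- Re-indexing a page with the same id replaces the earlier document -->
<uniqueKey>id</uniqueKey>

<!-- Copy searchable text into the catchall "text" field -->
<copyField source="title" dest="text"/>
<copyField source="content" dest="text"/>
```

Because the step above uses the page URL as the id, a unique key on `id` means a re-crawl updates existing documents instead of adding duplicates.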

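Obviously you will want to modify this to suit your needs, and it could probably use some performance improvements, but it should be a good reference point.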
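
One possible performance improvement is to send documents to Solr in batches instead of one request per page. The following is a minimal sketch of a thread-safe buffer that could sit between the pipeline step and the Solr client. `BatchBuffer<T>` is a hypothetical helper written for this example, and the batch size is an arbitrary choice.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: collects items and hands them to a callback in batches.
public sealed class BatchBuffer<T>
{
    private readonly object _gate = new object();
    private readonly int _batchSize;
    private readonly Action<IReadOnlyList<T>> _flush;
    private List<T> _pending = new List<T>();

    public BatchBuffer(int batchSize, Action<IReadOnlyList<T>> flush)
    {
        if (batchSize < 1) throw new ArgumentOutOfRangeException("batchSize");
        if (flush == null) throw new ArgumentNullException("flush");
        _batchSize = batchSize;
        _flush = flush;
    }

    // Safe to call from several crawler threads at once.
    public void Add(T item)
    {
        List<T> full = null;
        lock (_gate)
        {
            _pending.Add(item);
            if (_pending.Count >= _batchSize)
            {
                // Swap in a fresh list so other threads can keep adding.
                full = _pending;
                _pending = new List<T>();
            }
        }

        // Run the callback outside the lock so slow I/O does not block other threads.
        if (full != null) _flush(full);
    }

    // Call once after the crawl finishes to send whatever is left.
    public void Flush()
    {
        List<T> rest;
        lock (_gate)
        {
            rest = _pending;
            _pending = new List<T>();
        }

        if (rest.Count > 0) _flush(rest);
    }
}
```

In the pipeline step, the buffer could be a shared static field created as `new BatchBuffer<IndexItem>(100, batch => solr.AddRange(batch))`, with `Process` calling `Add(indexItem)` and the crawler code calling `Flush()` after `c.Crawl()` returns.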

+0

Thank you very much, Paige Cook. Yes, this is definitely a good reference point for me! Thanks a lot, let me try it. :) – Anu 2013-03-20 12:55:28