C# URL crawler not getting enough links?

I have the code below, but when I run it I only get a few URLs back.

while (stopFlag != true)
{
    WebRequest request = WebRequest.Create(urlList[i]);
    using (WebResponse response = request.GetResponse())
    {
        using (StreamReader reader = new StreamReader
            (response.GetResponseStream(), Encoding.UTF8))
        {
            string sitecontent = reader.ReadToEnd();
            //add links to the list
            // process the content
            //clear the text box ready for the HTML code
            //Regex urlRx = new Regex(@"((https?|ftp|file)\://|www.)[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*", RegexOptions.IgnoreCase);
            Regex urlRx = new Regex(@"(?<url>(http:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)", RegexOptions.IgnoreCase);

            MatchCollection matches = urlRx.Matches(sitecontent);
            foreach (Match match in matches)
            {
                string cleanMatch = cleanUP(match.Value);
                urlList.Add(cleanMatch);

                updateResults(theResults, "\"" + cleanMatch + "\",\n");
            }
        }
    }
}

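I think the error is in the regular expression.

What I'm trying to achieve is to fetch a web page and grab all the links on that page, add those links to a list, then fetch the page for each list item and repeat the process.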
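
The process described above amounts to a breadth-first crawl. As posted, the loop never shows `i` advancing or `stopFlag` being set, so the following is only a minimal sketch of one way to structure it, using a queue of pending URLs and a set of already-seen URLs. The class name `CrawlSketch`, the `fetch` delegate, the `maxPages` limit, and the simplified URL pattern are all assumptions made for illustration; they are not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CrawlSketch
{
    // Hypothetical pattern: absolute http/https URLs only.
    // A match stops at whitespace, quotes, and angle brackets.
    private static readonly Regex UrlRx =
        new Regex(@"https?://[^\s""'<>]+", RegexOptions.IgnoreCase);

    // 'fetch' is injected so the loop can be exercised without network access.
    public static List<string> Crawl(string startUrl, Func<string, string> fetch, int maxPages)
    {
        var pending = new Queue<string>();   // URLs waiting to be downloaded
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var visited = new List<string>();    // URLs downloaded, in crawl order

        pending.Enqueue(startUrl);
        seen.Add(startUrl);

        while (pending.Count > 0 && visited.Count < maxPages)
        {
            string url = pending.Dequeue();
            string html;
            try { html = fetch(url); }
            catch (Exception) { continue; }  // skip pages that fail to download
            visited.Add(url);

            foreach (Match m in UrlRx.Matches(html))
            {
                // HashSet.Add returns false for duplicates, so each URL is queued once.
                if (seen.Add(m.Value)) pending.Enqueue(m.Value);
            }
        }
        return visited;
    }
}
```

For a real crawl, `fetch` could be something like `url => new System.Net.WebClient().DownloadString(url)`. The `maxPages` cap plays the role of the stop condition, and the seen-set keeps the same page from being fetched repeatedly.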

Source

2012-07-09 developer__c

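Rather than trying to use regex to parse HTML, I suggest using a good HTML parser. The HTML Agility Pack is a good option:

What exactly is the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you don't actually have to understand XPATH or XSLT to use it, don't worry). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).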
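
As a rough illustration of the suggestion above, the following sketch pulls the `href` of every anchor out of a page with the HTML Agility Pack and resolves relative links against the page's own address. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `LinkExtractor` and the method `ExtractLinks` are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

public static class LinkExtractor
{
    public static List<string> ExtractLinks(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", string.Empty);

            // Resolve relative links such as "/about" against the page address,
            // and keep only http/https targets (drops mailto:, javascript:, etc.).
            Uri absolute;
            if (Uri.TryCreate(baseUri, href, out absolute) &&
                (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                links.Add(absolute.AbsoluteUri);
            }
        }
        return links;
    }
}
```

Unlike a URL regex run over the raw page text, this also picks up relative links, which a pattern anchored on `http://` or `www.` would miss.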

Source

2012-07-09 20:35:50 Oded

What other HTML parsers are available? The HTML Agility Pack lacks any documentation. – 2012-07-09 20:41:13

@thatnerdoverthere - There are plenty of examples in the source code download. – Oded 2012-07-09 20:42:15
