Scraping content from a website in C#

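I'm new to C#, but I have used Java for years. I tried searching Google and got some answers that don't quite fit what I need. I want to fetch the (X)HTML from a website and then use the DOM (CSS selectors would actually be preferable, but whatever works) to get specific elements. How exactly is this done in C#?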

2011-06-29 Peter

Could you add some sample code so we have something to work with? – jp2code

It's too bad comments can't be downvoted. –

It sounds like you want HtmlAgilityPack for working with HTML documents. It gives you LINQ access, which is a Good Thing(tm). You can download the page with System.Net.WebClient.
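A minimal sketch of those two steps, assuming the HtmlAgilityPack NuGet package is referenced. The class name `LinkLister`, the helper `ExtractLinks`, the URL, and the UTF-8 encoding are all assumptions made for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public static class LinkLister
{
    // Hypothetical helper: returns the href of every <a> element that has one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser: does not require well-formed XML

        return doc.DocumentNode
                  .Descendants("a")
                  .Select(a => a.GetAttributeValue("href", ""))
                  .Where(href => href.Length > 0)
                  .ToList();
    }

    public static void Main()
    {
        using (var client = new WebClient())
        {
            client.Encoding = Encoding.UTF8; // assumption: the page is UTF-8
            // Assumed URL, for illustration only.
            string html = client.DownloadString("http://www.stackoverflow.com");
            foreach (string href in ExtractLinks(html))
                Console.WriteLine(href);
        }
    }
}
```

Keeping the parsing in its own method means it can be exercised against a literal HTML string without touching the network.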

2011-06-29 14:15:11

You can use the Html Agility Pack to load the HTML and find the elements you need.
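As a rough illustration of "load and find", the sketch below parses an in-memory string and selects elements with XPath, which can express roughly what a simple CSS selector would. It assumes the HtmlAgilityPack package is referenced; the class `XPathDemo`, the helper `SelectTitles`, and the sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathDemo
{
    // Hypothetical helper: text of every <p class="title"> inside <div id="posts">.
    public static List<string> SelectTitles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath counterpart of the CSS selector "#posts p.title"
        // (this simple form matches only when class is exactly "title").
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//div[@id='posts']//p[@class='title']");

        var result = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
                result.Add(node.InnerText);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical markup standing in for a downloaded page.
        string html =
            "<div id='posts'>" +
            "<p class='title'>First</p>" +
            "<p class='title'>Second</p>" +
            "<p class='note'>ignored</p>" +
            "</div>";

        foreach (string title in SelectTitles(html))
            Console.WriteLine(title);
    }
}
```

The null check matters in practice: iterating the result of `SelectNodes` directly throws when the page contains no match.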

2011-06-29 14:16:00 Giorgi

To get the HTML, you can use a WebClient object.

To parse the HTML, you can use the HtmlAgility library.

2011-06-29 14:16:53 Maxim

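To get you started, you can fairly easily use HttpWebRequest to fetch the contents of a URL. From there, you will have to parse the HTML somehow. This is where it starts to get tricky. You can't use a normal XML parser, because many (most?) websites' HTML pages are not 100% valid XML. Web browsers have specially implemented parsers that work around the invalid parts. In Ruby I would use something like Nokogiri to parse the HTML, so you might want to look for a .NET port of it, or for another parser designed specifically to read HTML.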
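To make the point about XML parsers concrete, here is a small sketch using only the standard library. The class `StrictParserDemo` and the helper `IsWellFormedXml` are names invented for this example.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class StrictParserDemo
{
    // Returns true when the markup parses as well-formed XML.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Well-formed XHTML: every element is closed.
        Console.WriteLine(IsWellFormedXml("<p>one<br/>two</p>")); // True

        // Typical HTML: <br> is never closed, so an XML parser rejects it.
        Console.WriteLine(IsWellFormedXml("<p>one<br>two</p>"));  // False
    }
}
```

A browser renders both snippets identically, which is why an HTML-specific parser is the safer choice for pages found in the wild.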

Edit:

Since the topic will probably come up: WebClient vs. HttpWebRequest/HttpWebResponse

Also, thanks to the others who answered and mentioned HtmlAgility. I didn't know it existed.

2011-06-29 14:17:12 CodingWithSpike

Look into the Html Agility Pack; it is one of the more common libraries for parsing HTML.

http://htmlagilitypack.codeplex.com/

2011-06-29 14:17:21 Tija

    using System.IO;
    using System.Net;
    using System.Text;

    // sb accumulates the page text; buf is the read buffer
    StringBuilder sb = new StringBuilder();
    byte[] buf = new byte[8192];

    // prepare the web page we will be asking for
    HttpWebRequest request = (HttpWebRequest)
        WebRequest.Create("http://www.stackoverflow.com");

    // execute the request
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();

    // we will read data via the response stream
    Stream resStream = response.GetResponseStream();

    string tempString = null;
    int count = 0;
    do
    {
        // fill the buffer with data
        count = resStream.Read(buf, 0, buf.Length);

        // make sure we read some data
        if (count != 0)
        {
            // translate from bytes to ASCII text
            // (non-ASCII characters are lost with this encoding)
            tempString = Encoding.ASCII.GetString(buf, 0, count);

            // continue building the string
            sb.Append(tempString);
        }
    }
    while (count > 0); // any more data to read?

Then use an XQuery expression or a regular expression to get the elements you need.
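A hedged sketch of the regular-expression route, using only the standard library. The class `TitleExtractor`, the helper `ExtractTitle`, and the sample markup are hypothetical. A pattern like this is workable for one simple, known element such as the page title; it becomes fragile quickly on nested or irregular markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Returns the text of the first <title> element, or null if none is found.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(
            html,
            @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Trim surrounding whitespace, then decode entities such as &amp;.
        return m.Success ? WebUtility.HtmlDecode(m.Groups[1].Value.Trim()) : null;
    }

    public static void Main()
    {
        // Hypothetical markup standing in for the downloaded page text.
        string html = "<html><head><TITLE> Q &amp; A </TITLE></head><body></body></html>";
        Console.WriteLine(ExtractTitle(html)); // Q & A
    }
}
```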

2011-06-29 14:19:01 jaywayco

You can use System.Net.WebClient or System.Net.HttpWebRequest to fetch the page, but those classes do not support parsing the elements.

Use HtmlAgilityPack (http://html-agility-pack.net/)

HtmlWeb htmlWeb = new HtmlWeb();
htmlWeb.UseCookies = true;

// url is the address of the page to load
HtmlDocument htmlDocument = htmlWeb.Load(url);

// after getting the document node
// you can do something like this
foreach (HtmlNode item in htmlDocument.DocumentNode.Descendants("input"))
{
    // item matches your requirement
    // take the item.
}

2011-06-29 14:22:03
