2011-06-09 76 views
-5

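I am using C# and want to fetch all of the content of a web page, excluding any images, scripts, or files that may be attached to the page. How can I do this with C# and ASP.NET? (Question title: Read only the HTML content from a web page.)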

+1

Do you want to read the page's HTML on the server side, or something else? – PSK 2011-06-09 10:50:03

+0

You need to provide more details; your question is unclear. – PSK 2011-06-09 10:54:13

+1

Do you want to extract only the text from the web page? – 2011-06-09 10:57:02

Answers

1

Hi, you can do this with the following code snippet, taken from HERE:

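// Requires: using System; using System.IO; using System.Net; using System.Text;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.your-url.com");

string html;

// Dispose the response and its stream once the body has been read.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (Stream resStream = response.GetResponseStream())
// Assumes the page is UTF-8 (Encoding.ASCII would turn every non-ASCII
// character into '?'). response.CharacterSet reports the charset the
// server declared, if you need to pick a different encoding.
using (StreamReader reader = new StreamReader(resStream, Encoding.UTF8))
{
    // StreamReader decodes across buffer boundaries, so a multi-byte
    // character is never split the way it can be in a manual read loop.
    html = reader.ReadToEnd();
}

Console.WriteLine(html);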
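
The request above downloads only the page's HTML document; linked images and external script files are not fetched. Inline `<script>` and `<style>` blocks are still part of that markup, though. If the goal is just the visible text, a rough sketch along these lines might help. `HtmlTextExtractor` and `ToPlainText` are hypothetical names of my own, and regular expressions are only an approximation for HTML, so a real parser such as HtmlAgilityPack is the safer choice for messy pages.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextExtractor
{
    // Whole <script>...</script> and <style>...</style> elements, contents included.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // HTML comments.
    private static readonly Regex Comment = new Regex(@"<!--.*?-->", RegexOptions.Singleline);

    // Any remaining tag, for example <p>, </div>, <img ... />.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    // Runs of whitespace left behind by the removals.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Returns an approximation of the visible text of an HTML string.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        string text = ScriptOrStyle.Replace(html, " ");
        text = Comment.Replace(text, " ");
        text = Tag.Replace(text, " ");
        text = WebUtility.HtmlDecode(text);   // turns &amp; into &, and so on
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

Calling `HtmlTextExtractor.ToPlainText("<p>Hello &amp; <b>world</b></p><script>var x = 1;</script>")` returns `Hello & world`.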
0

You can also get the HTML in the page's Render method, as follows.

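protected override void Render(System.Web.UI.HtmlTextWriter writer)
{
    // Requires: using System.IO; using System.Text; using System.Web.UI;
    StringBuilder sb = new StringBuilder();
    StringWriter sw = new StringWriter(sb);

    // Render into a separate in-memory writer. It needs its own name:
    // a local variable cannot reuse the name of the "writer" parameter.
    HtmlTextWriter bufferWriter = new HtmlTextWriter(sw);
    base.Render(bufferWriter);
    string markupText = sb.ToString();

    // markupText now contains the HTML of the page
    writer.Write(markupText);
}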
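
The idea in this answer is "render into a buffer, keep a copy, then forward it to the real output". Here is a small sketch of just that pattern using plain `TextWriter`, so it can be tried outside ASP.NET. `RenderCapture` and `CaptureAndForward` are hypothetical names for illustration, not part of any framework.

```csharp
using System;
using System.IO;
using System.Text;

public static class RenderCapture
{
    // Runs `render` against an in-memory writer, forwards the captured
    // text to `output`, and returns it so the caller can inspect it.
    public static string CaptureAndForward(Action<TextWriter> render, TextWriter output)
    {
        StringBuilder buffer = new StringBuilder();
        using (StringWriter capture = new StringWriter(buffer))
        {
            render(capture);
        }

        string markup = buffer.ToString();
        output.Write(markup);   // the real output still receives everything
        return markup;
    }
}
```

Since `HtmlTextWriter` derives from `TextWriter`, the writer passed to the page's `Render` override could play the role of `output` here.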