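Get data from the page that a URL points to

Given a URL, I'd like to be able to capture the title of the page that URL points to, as well as other info, e.g. a snippet of text from the first paragraph on the page, and maybe even an image from the page.

Digg.com does this nicely when you submit a URL.

How could something like this be done in .NET C#?

Source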

2011-01-06 raklos

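You are looking for the HTML Agility Pack, which can parse malformed HTML documents.
You can use its HtmlWeb class to download web pages over HTTP.

You can also use .NET's WebClient class to download text over HTTP.
However, it won't help you parse the HTML.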
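
As a rough illustration of the HTML Agility Pack approach, here is a minimal sketch that pulls the title, the first paragraph and the first image URL out of an HTML string. It assumes the HtmlAgilityPack NuGet package is referenced; the PageSummary class, its Describe method and the sample markup are hypothetical names made up for this example. It parses an in-memory string so that it runs without network access.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

static class PageSummary
{
    // Hypothetical helper: extracts a title, the first paragraph and the
    // first image URL from HTML that has already been downloaded.
    public static void Describe(string html, out string title,
                                out string firstParagraph, out string firstImage)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
        HtmlNode paragraphNode = doc.DocumentNode.SelectSingleNode("//p");
        HtmlNode imageNode = doc.DocumentNode.SelectSingleNode("//img[@src]");

        // SelectSingleNode returns null when nothing matches the XPath query.
        title = titleNode == null
            ? null
            : HtmlEntity.DeEntitize(titleNode.InnerText).Trim();
        firstParagraph = paragraphNode == null
            ? null
            : HtmlEntity.DeEntitize(paragraphNode.InnerText).Trim();
        firstImage = imageNode == null
            ? null
            : imageNode.GetAttributeValue("src", "");
    }
}

static class Program
{
    static void Main()
    {
        // Sample markup, invented for this example.
        string html = "<html><head><title> Example page </title></head>"
                    + "<body><p>First <b>paragraph</b>.</p>"
                    + "<img src=\"logo.png\"></body></html>";

        string title, paragraph, image;
        PageSummary.Describe(html, out title, out paragraph, out image);

        Console.WriteLine(title);
        Console.WriteLine(paragraph);
        Console.WriteLine(image);
    }
}
```

To work from a live URL instead, the document could be obtained with new HtmlWeb().Load(url) and the same XPath queries applied to it. Note that an img src is often a relative path, so it may need to be resolved against the page URL before use.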

Source

2011-01-06 13:33:15 SLaks

You could try something like this:

using System;
using System.IO;
using System.Net;
using System.Text;

namespace WebGet
{
    class progMain
    {
        static void Main(string[] args)
        {
            WebRequest wrq = WebRequest.Create("http://localhost");

            // Dispose the response and its stream when done.
            using (WebResponse wrp = wrq.GetResponse())
            using (Stream stream = wrp.GetResponseStream())
            // ContentLength can be -1 and a single Read call may return only
            // part of the body, so read to the end through a StreamReader.
            // UTF-8 is assumed here; the page's actual charset may differ.
            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
            {
                string html = reader.ReadToEnd();
                Console.WriteLine(html);
            }
        }
    }
}

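Once you have the response text, you can process it, looking for the HTML tags for paragraphs or images, to extract the parts of the returned data you want.

Source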

2011-01-06 13:55:59 user565494

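You can extract the title of the page with a function like the following. You would need to modify the regular expression to look for, say, the first paragraph of text, but since every page is different, that may prove difficult. You could, however, look for a meta description tag and take the value from that.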

// Required namespaces
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public static string GetWebPageTitle(string url)
{
    // Create a request to the url
    HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;

    // If the request wasn't an HTTP request (like a file), ignore it
    if (request == null) return null;

    // Use the user's credentials
    request.UseDefaultCredentials = true;

    try
    {
        // Obtain a response from the server; the using block disposes it
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            // If the Content-Type header does not say HTML, this is not a valid HTML page
            string contentType = response.ContentType;
            if (contentType == null ||
                !contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase))
                return null;

            // Read the body from this response instead of downloading the page a second time
            string page;
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                page = reader.ReadToEnd();

            // Extract the title; the lazy match stops at the first closing </title> tag
            Regex ex = new Regex(@"<title[^>]*>([\s\S]*?)</title>", RegexOptions.IgnoreCase);
            return ex.Match(page).Groups[1].Value.Trim();
        }
    }
    catch (WebException)
    {
        // If there was an error, return nothing
        return null;
    }
}
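
The meta description idea mentioned above could be sketched along the same lines. This is only a minimal example under stated assumptions: the MetaDescription class and its Extract method are hypothetical names, and the pattern assumes the name attribute comes directly before the content attribute, which many pages do not guarantee. An HTML parser is the more robust choice for real pages.

```csharp
using System;
using System.Text.RegularExpressions;

static class MetaDescription
{
    // Hypothetical helper: returns the content of <meta name="description" ...>,
    // or null when no such tag is found. Assumes name appears before content.
    public static string Extract(string html)
    {
        Match m = Regex.Match(
            html,
            @"<meta\s+name\s*=\s*[""']description[""']\s+content\s*=\s*([""'])(.*?)\1",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Group 1 is the opening quote character, group 2 is the description text.
        return m.Success ? m.Groups[2].Value.Trim() : null;
    }
}

static class Program
{
    static void Main()
    {
        // Sample markup, invented for this example.
        string html = "<head><meta name=\"description\" content=\"A short summary.\"></head>";
        Console.WriteLine(MetaDescription.Extract(html));
    }
}
```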

Source

2011-01-06 13:59:31 Scott

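You can use Selenium RC (open source, www.seleniumhq.org) to parse data out of pages, among other things. It is a web test automation tool with a C# .NET library.

Selenium has a complete API for reading specific items on an HTML page.

Source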

2011-01-06 13:59:49 StefanE
