Scraping a website with Html Agility Pack: GET response is not as expected

Using System.Net.HttpWebRequest, I want my code to mimic a user search on the following search engine:

http://www.scirus.com

An example search URL is as follows:

http://www.scirus.com/srsapp/search?q=core+facilities&t=all&sort=0&g=s

I have the following code to perform the HTTP GET. Note that I am using HtmlAgilityPack.

protected override HtmlDocument MakeRequestHtml(string requestUrl)
{
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(requestUrl);
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";

        // Dispose the response and its stream once the document has been parsed.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            HtmlDocument htmlDoc = new HtmlDocument();
            htmlDoc.Load(stream);
            return htmlDoc;
        }
    }
    catch (Exception e)
    {
        // Report the failure and wait for a key press before returning.
        Console.WriteLine(e.Message);
        Console.Read();
        return null;
    }
}

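where "requestUrl" is the example search URL shown above.

The content of htmlDoc.DocumentNode.InnerHtml does not contain any search results, and it looks nothing like the results page you get when you copy and paste the example search URL above into a browser.

I guess this is because you need to have a session before you can perform the request. Can anyone suggest a feasible way to replicate the user agent's behaviour? Or is there perhaps a better way to "scrape" the search results that I am not aware of? Suggestions please.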
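
If a session really were required, one way to carry cookies from one request to the next is to share a CookieContainer. The following is only a minimal sketch under that assumption: SessionRequests and CreateRequest are hypothetical names, and nothing in this thread confirms that the site depends on cookies.

```csharp
using System;
using System.Net;

static class SessionRequests
{
    // Hypothetical helper: builds a GET request that stores and resends
    // cookies through the shared jar. Nothing is sent until GetResponse().
    public static HttpWebRequest CreateRequest(string url, CookieContainer jar)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0";
        // Cookies set by one response are replayed on later requests using the same jar.
        request.CookieContainer = jar;
        return request;
    }
}

static class SessionDemo
{
    static void Main()
    {
        CookieContainer jar = new CookieContainer();

        // Request for the home page, where a server could set a session cookie.
        HttpWebRequest home = SessionRequests.CreateRequest("http://www.scirus.com", jar);

        // Request for the search page, built against the same jar.
        HttpWebRequest search = SessionRequests.CreateRequest(
            "http://www.scirus.com/srsapp/search?q=core+facilities&t=all&sort=0&g=s", jar);

        // Both requests share one jar; calling GetResponse() on each, in order,
        // would perform the two round trips.
        Console.WriteLine(ReferenceEquals(home.CookieContainer, search.CookieContainer));
    }
}
```

The demo only builds the requests and does not contact the server.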

Contents of robots.txt: reproduced further down this page.

Contents of htmlDoc.DocumentNode.InnerHtml: Response

2012-06-03 dior001

OK, I actually tested this with WebClient:

static void Main(string[] args)
{
    // WebClient is disposable; the using block releases it when done.
    using (WebClient client = new WebClient())
    {
        client.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0");
        string str = client.DownloadString("http://www.scirus.com/srsapp/search?q=core+facilities&t=all&sort=0&g=s");

        // Save the downloaded HTML; WriteAllText opens, writes, and closes the file.
        File.WriteAllText("test.txt", str);
    }
}

Here is the downloaded file: http://pastebin.com/qswtgC4n

2012-06-03 06:35:07 Lakis

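Thanks, that works. In fact, the original code works too. The problem was caused by an incorrectly formatted requestUrl argument passed to the MakeRequestHtml method. – dior001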
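
Since the root cause turned out to be a malformed requestUrl, it may help to assemble the URL in code rather than by hand. The thread does not say what exactly was malformed, so this is only a sketch: ScirusUrl and BuildSearchUrl are hypothetical names, and the parameters q, t, sort and g are simply copied from the sample URL in the question.

```csharp
using System;
using System.Net;

static class ScirusUrl
{
    // Hypothetical helper: encodes the search term and appends the fixed
    // parameters taken verbatim from the sample URL in the question.
    public static string BuildSearchUrl(string query)
    {
        // WebUtility.UrlEncode turns spaces into '+' and escapes reserved characters.
        string encoded = WebUtility.UrlEncode(query);
        return "http://www.scirus.com/srsapp/search?q=" + encoded + "&t=all&sort=0&g=s";
    }
}

static class UrlDemo
{
    static void Main()
    {
        string url = ScirusUrl.BuildSearchUrl("core facilities");
        Console.WriteLine(url);

        // A quick sanity check before handing the string to WebRequest.Create.
        Console.WriteLine(Uri.IsWellFormedUriString(url, UriKind.Absolute));
    }
}
```

For the search term "core facilities" this produces the same URL as the example in the question.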

Contents of robots.txt:

#/robots.txt file for http://www.scirus.com 

User-agent: NetMechanic 
Disallow: /srsapp/sciruslink 

User-agent: * 
Disallow: /srsapp/sciruslink 
Disallow: /srsapp/search 
Disallow: /srsapp/search_simple 
Disallow: /search_simple 
# for dev and accept server uncomment below line at Build time to disallow robots completely 
##Disallow:/

You may need to set a user agent, for example:

request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";

You should also check the site's robots.txt file to make sure you are welcome.
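
A rough way to automate that check is sketched below. It is deliberately simplified and is not a complete robots.txt parser: it ignores Allow lines and wildcard patterns, and it treats a path as blocked if either the named agent's group or the "*" group disallows it. RobotsCheck, its members, and the agent name "MyBot" are hypothetical; the Sample text is the rule set from the robots.txt quoted in this thread.

```csharp
using System;
using System.Collections.Generic;

static class RobotsCheck
{
    // Rules copied from the robots.txt quoted in this thread.
    public const string Sample =
        "User-agent: NetMechanic\nDisallow: /srsapp/sciruslink\n\n" +
        "User-agent: *\nDisallow: /srsapp/sciruslink\nDisallow: /srsapp/search\n" +
        "Disallow: /srsapp/search_simple\nDisallow: /search_simple\n##Disallow:/\n";

    // Collects the Disallow prefixes from groups addressed to userAgent or to "*".
    public static List<string> DisallowedPrefixes(string robotsTxt, string userAgent)
    {
        var prefixes = new List<string>();
        bool applies = false;
        foreach (string raw in robotsTxt.Split('\n'))
        {
            string line = raw;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash); // strip comments
            line = line.Trim();
            int colon = line.IndexOf(':');
            if (line.Length == 0 || colon < 0) continue;

            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();
            if (field == "user-agent")
                applies = value == "*" || value.Equals(userAgent, StringComparison.OrdinalIgnoreCase);
            else if (field == "disallow" && applies && value.Length > 0)
                prefixes.Add(value);
        }
        return prefixes;
    }

    // True when no collected Disallow prefix matches the start of the path.
    public static bool IsAllowed(string robotsTxt, string userAgent, string path)
    {
        foreach (string prefix in DisallowedPrefixes(robotsTxt, userAgent))
            if (path.StartsWith(prefix, StringComparison.Ordinal)) return false;
        return true;
    }
}

static class RobotsDemo
{
    static void Main()
    {
        Console.WriteLine(RobotsCheck.IsAllowed(RobotsCheck.Sample, "MyBot", "/srsapp/search"));
    }
}
```

With the rules quoted in this thread, /srsapp/search is listed under User-agent: *, so the check reports the search path as disallowed for a generic crawler.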

2012-06-03 02:38:45

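Thanks for your reply. The response I get through code is still completely different HTML from what the browser produces. I have posted updated code that includes the robots.txt. Do you have any further suggestions? – dior001

I tested it with the user agent changed and it runs perfectly – Lakis

Thanks for testing it. I have updated the post to show the response I get from the request. It differs from what I see when opening this link in a browser: http://www.scirus.com/srsapp/search?q=core+facilities&t=all&sort=0&g=s. Could you open the link above in a browser and confirm whether you get the same HTML as when running the code? If you do get the same HTML, then I am probably fetching it incorrectly on my side at the moment. – dior001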

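Make sure you are not pinging the server excessively, especially if the code that loads the document worked previously. You may have run into a server rule that sends you to robots.txt or a similar page.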
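
One way to avoid hammering the server is to enforce a minimum gap between requests. The sketch below assumes a single process making requests in sequence; RequestThrottle and WaitTurn are hypothetical names, and the 2-second gap is an arbitrary choice, not a documented limit of the site.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper: enforces a minimum gap between consecutive requests.
sealed class RequestThrottle
{
    private readonly TimeSpan _minInterval;
    private readonly Stopwatch _sinceLast = new Stopwatch();
    private readonly object _gate = new object();

    public RequestThrottle(TimeSpan minInterval)
    {
        _minInterval = minInterval;
    }

    // Blocks until at least _minInterval has passed since the previous call
    // returned. The first call returns immediately.
    public void WaitTurn()
    {
        lock (_gate)
        {
            if (_sinceLast.IsRunning)
            {
                TimeSpan remaining = _minInterval - _sinceLast.Elapsed;
                if (remaining > TimeSpan.Zero) Thread.Sleep(remaining);
            }
            _sinceLast.Restart();
        }
    }
}

static class ThrottleDemo
{
    static void Main()
    {
        // Assumed gap of 2 seconds between requests.
        RequestThrottle throttle = new RequestThrottle(TimeSpan.FromSeconds(2));
        string[] urls =
        {
            "http://www.scirus.com/srsapp/search?q=core+facilities&t=all&sort=0&g=s"
        };

        foreach (string url in urls)
        {
            throttle.WaitTurn();
            // A real scraper would call something like MakeRequestHtml(url) here.
            Console.WriteLine("Would request: " + url);
        }
    }
}
```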

2014-12-10 02:04:02 alec
