2012-10-13 36 views
4

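Selenium to build a list of 404s

Can Selenium crawl a top-level domain (TLD) and incrementally export a list of any 404s it finds?

I'm stuck on a Windows machine for a few hours and want to run some tests before getting back to the comfort of *nix...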

+0

How are you running your tests? NUnit can do something similar by exporting test data to SQL Server, but if you're not into Windows/MS/.NET, I can only give you a conceptual answer. – Izzy

+0

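The tests are run through Python and built on the unittest library. They do execute WebDriver tests on a Windows machine and can use a database to export test data. Please post your solution as an answer; as long as it manages to crawl a site and flag 404s, that will fit the bill. –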

+1

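Just did a quick search; would **[this](https://github.com/cmwslw/selenium-crawler)** work for you? It sounds like it would go through and get a list, and you could write some code to pick out the 404s. It does require the links to be exposed, though. Are you looking for something more like `wget -r`? – kgdesouz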

Answer

1

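I don't know Python very well, nor any of its commonly used libraries, but I would probably do something like this (the example is in C#, but the concept should apply):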

// WARNING! Untested code here. May not completely work, and 
// is not guaranteed to even compile. 

// Assume "driver" is a validly instantiated WebDriver instance 
// (browser used is irrelevant). This API is driver.get in Python, 
// I think. 
driver.Url = "http://my.top.level.domain/"; 

// Get all the links on the page and loop through them, 
// grabbing the href attribute of each link along the way. 
// (Python would be driver.find_elements_by_tag_name) 
List<string> linkUrls = new List<string>(); 
ReadOnlyCollection<IWebElement> links = driver.FindElements(By.TagName("a")); 
foreach(IWebElement link in links) 
{ 
    // Nice side effect of getting the href attribute using GetAttribute() 
    // is that it returns the full URL, not relative ones. 
    linkUrls.Add(link.GetAttribute("href")); 
} 

// Now that we have all of the link hrefs, we can test to 
// see if they're valid. 
List<string> validUrls = new List<string>(); 
List<string> invalidUrls = new List<string>(); 
foreach(string linkUrl in linkUrls) 
{ 
    HttpWebRequest request = WebRequest.Create(linkUrl) as HttpWebRequest; 
    request.Method = "GET"; 

    // GetResponse() throws a WebException for 4xx/5xx responses, so a
    // 404 arrives in the catch block rather than as a status code. The
    // null check covers a response type other than HttpWebResponse. For
    // Python, you would use whatever HTTP request library is common.

    // Note also that this is an extremely naive algorithm for determining
    // validity. You could just as easily check for the NotFound (404)
    // status code.
    try
    {
        using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
        {
            if (response != null && response.StatusCode == HttpStatusCode.OK)
            {
                validUrls.Add(linkUrl);
            }
            else
            {
                invalidUrls.Add(linkUrl);
            }
        }
    }
    catch (WebException)
    {
        invalidUrls.Add(linkUrl);
    }
} 

foreach(string invalidUrl in invalidUrls) 
{ 
    // Here is where you'd log out your invalid URLs 
} 
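
Since the question asks for an incremental export and the tests run in Python, the logging step in that last loop could look something like the sketch below. It appends one row per URL to a CSV file as soon as the URL is found, so the list survives an interrupted crawl. The function name `log_invalid_url` and the column layout are assumptions made for illustration, not part of any existing API.

```python
import csv
from datetime import datetime, timezone


def log_invalid_url(csv_path, url, status):
    """Append one row (UTC timestamp, URL, status) to csv_path.

    The file is opened in append mode on every call, so rows written
    earlier are kept and the file is created on first use.
    """
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow(
            [datetime.now(timezone.utc).isoformat(), url, status]
        )
```

Calling `log_invalid_url("invalid_urls.csv", url, 404)` for each bad link builds the list up row by row; the file name here is only a placeholder.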

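At this point, you have lists of valid and invalid URLs. You could wrap all of this up into a method that takes your TLD URL and calls itself recursively with each of the valid URLs. The key here is that you're not using Selenium to actually determine the validity of the links. And if you're truly doing a recursive crawl, you wouldn't want to "click" on links to navigate to the next page. Rather, you'd want to navigate directly to the links found on the page.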
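
For the Python side, the crawl described above might be sketched roughly as follows. This is an untested-against-a-real-site sketch under several assumptions: it uses an explicit queue instead of literal recursion (to stay clear of Python's recursion limit), it only follows links on the starting host, and it yields each 404 as soon as it is found. The names `find_404s`, `http_status`, and `selenium_links` are made up for this example; `selenium_links` assumes the Selenium 4 Python bindings.

```python
from collections import deque
from urllib.parse import urldefrag, urlparse
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urllib raises for 4xx/5xx; the code is on the exception
    except (urllib.error.URLError, ValueError, OSError):
        return None  # DNS failure, refused connection, malformed URL, ...


def selenium_links(driver):
    """Wrap an existing WebDriver in a get_links(url) callable (assumes Selenium 4)."""
    from selenium.webdriver.common.by import By  # lazy import: only needed here

    def get_links(url):
        driver.get(url)  # navigate directly; no clicking
        anchors = driver.find_elements(By.TAG_NAME, "a")
        return [anchor.get_attribute("href") for anchor in anchors]

    return get_links


def find_404s(start_url, get_links, get_status=http_status):
    """Crawl breadth-first from start_url and yield each linked URL that returns 404.

    get_links(url) returns the hrefs found on a page; get_status(url)
    returns a status code. Only pages on the starting host are crawled
    further, but links to other hosts are still checked once.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        page = queue.popleft()
        for href in get_links(page):
            if not href:
                continue  # anchors without an href
            url = urldefrag(href).url  # ignore #fragments
            if url in seen or urlparse(url).scheme not in ("http", "https"):
                continue
            seen.add(url)
            status = get_status(url)
            if status == 404:
                yield url
            elif status == 200 and urlparse(url).netloc == host:
                queue.append(url)
```

With a live browser this could be driven as `for url in find_404s("http://my.top.level.domain/", selenium_links(driver)): print(url)`, where `driver` is whatever WebDriver instance the tests already create. Keeping the status check in plain `urllib` reflects the point above: Selenium gathers the links, but an ordinary HTTP request decides whether each one is valid.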

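There are other approaches you could take, such as running everything through a proxy and capturing the response codes that way. It depends on how you want to structure your solution.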