C# URL crawler not getting enough links?

I have the code below, but when I run it, I only intermittently get a few URLs back.
while (stopFlag != true)
{
    WebRequest request = WebRequest.Create(urlList[i]);
    using (WebResponse response = request.GetResponse())
    {
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            string sitecontent = reader.ReadToEnd();

            // Add links to the list / process the content
            // Clear the text box ready for the HTML code
            //Regex urlRx = new Regex(@"((https?|ftp|file)\://|www.)[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*", RegexOptions.IgnoreCase);
            Regex urlRx = new Regex(@"(?<url>(http:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)", RegexOptions.IgnoreCase);
            MatchCollection matches = urlRx.Matches(sitecontent);
            foreach (Match match in matches)
            {
                string cleanMatch = cleanUP(match.Value);
                urlList.Add(cleanMatch);
                updateResults(theResults, "\"" + cleanMatch + "\",\n");
            }
        }
    }
}
I think the error is in the regular expression.
What I'm trying to achieve is to fetch a web page, grab all the links on that page, and add them to a list; then, for each item in the list, fetch the next page and repeat the process.
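For comparison, a more reliable way to pull links out of HTML text with a regex is to match the `href` attribute of anchor tags directly, rather than trying to describe every legal URL character. The sketch below is a minimal, self-contained illustration of that idea (the class and method names are hypothetical, not from the question's code), and it still inherits the usual caveats of parsing HTML with regular expressions:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class LinkExtractor
{
    // Match the value of an href attribute inside an <a> tag.
    // Handles double-quoted, single-quoted, and unquoted values.
    static readonly Regex HrefRx = new Regex(
        @"<a[^>]+href\s*=\s*[""']?(?<url>[^""'\s>]+)",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefRx.Matches(html))
            links.Add(m.Groups["url"].Value);
        return links;
    }

    static void Main()
    {
        string html = "<a href=\"http://example.com/a\">A</a> <a href='/relative'>B</a>";
        foreach (string link in ExtractLinks(html))
            Console.WriteLine(link);
        // prints:
        // http://example.com/a
        // /relative
    }
}
```

Note that this also captures relative links such as `/relative`, which the original `http://`-only pattern would silently drop; a crawler would need to resolve those against the current page's base URI (e.g. with `new Uri(baseUri, relative)`).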
What other HTML parsers are available? The HTML Agility Pack lacks any documentation. – 2012-07-09 20:41:13
@thatnerdoverthere - There are plenty of **examples** in the source download. – Oded 2012-07-09 20:42:15
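To illustrate the kind of usage those examples cover, here is a minimal sketch of extracting links with the HTML Agility Pack; it assumes the `HtmlAgilityPack` NuGet package is installed, and the sample HTML string is made up for the example:

```csharp
// Requires the HtmlAgilityPack NuGet package.
using System;
using HtmlAgilityPack;

class AgilityExample
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><a href=\"http://example.com\">link</a></body></html>");

        // SelectNodes returns null when nothing matches, so guard against that.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            foreach (var anchor in anchors)
                Console.WriteLine(anchor.GetAttributeValue("href", string.Empty));
        }
    }
}
```

Because the Agility Pack builds a real DOM from even malformed HTML, this avoids the regex pitfalls in the question entirely.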