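I am trying to parse the website "https://www.crunchbase.com", but the site has antibot protection and I don't know how to get any HTML elements from the page. I get an error when I try to parse the HTML.
First, I set up a secure "ssl" channel.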
ServicePointManager.Expect100Continue = true;
ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
Then I made an HttpWebRequest with a browser's user agent string.
var request = (HttpWebRequest)WebRequest.Create("https://www.crunchbase.com");
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0";
request.Timeout = 10000;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Console.WriteLine("Server status code: " + response.StatusCode);
And I read the page with a StreamReader:
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
    string result = sr.ReadToEnd();
    Console.WriteLine(result);
}
But the result is: [image]
Finally, I tried to get all the URLs from the page:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load(response.ResponseUri.AbsoluteUri);
string respUri = response.ResponseUri.ToString();
HtmlNode[] nodes = document.DocumentNode.SelectNodes("//a").ToArray();
foreach (var item in nodes)
{
    Console.WriteLine(item.InnerHtml);
}
But the application throws an unhandled exception.
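One possible source of the exception can be sketched as follows. This assumes the HtmlWeb and HtmlDocument types above come from HtmlAgilityPack. In that library, SelectNodes returns null rather than an empty collection when the XPath matches nothing, which is what would happen if the antibot page contains no links. Calling ToArray() on that null result then throws. The class and method names below (LinkExtractor, ExtractLinkTexts) are hypothetical and only serve the illustration; this is not a way around the protection itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper (not from the original post): returns the inner HTML
    // of every <a> element, or an empty list when the markup contains none.
    public static List<string> ExtractLinkTexts(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//a");
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes.Select(node => node.InnerHtml).ToList();
    }
}
```

Passing the string already read with the StreamReader to such a helper would also avoid the second download that web.Load performs, which goes out without the custom user agent set on the HttpWebRequest.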