
In a new class I have a method. First I am trying to extract all of the links from a website, but only some of the links get extracted. Why?

class MyClient : WebClient
{
    public bool HeadOnly { get; set; }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest req = base.GetWebRequest(address);
        if (HeadOnly && req.Method == "GET")
        {
            req.Method = "HEAD";
        }
        return req;
    }
}
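As a side note, this subclass lets you inspect a URL's response headers without downloading the body. A minimal usage sketch (the URL here is only an illustration):

    // With HeadOnly = true the GET is rewritten to a HEAD request,
    // so DownloadData returns an empty body but headers are populated.
    using (MyClient probe = new MyClient())
    {
        probe.HeadOnly = true;
        byte[] body = probe.DownloadData("https://example.com"); // 0-length
        string contentType = probe.ResponseHeaders["content-type"];
    }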

public static HtmlAgilityPack.HtmlDocument getHtmlDocumentWebClient(string url, bool useProxy, string proxyIp, int proxyPort, string username, string password)
{
    HtmlAgilityPack.HtmlDocument doc = null;
    try
    {
        using (MyClient clients = new MyClient())
        {
            // Probe the headers first: HeadOnly = true turns the GET
            // into a HEAD request, so no body is transferred yet.
            clients.HeadOnly = true;
            byte[] body = clients.DownloadData(url);
            // note: body should be 0-length
            string type = clients.ResponseHeaders["content-type"];
            clients.HeadOnly = false;
            // check 'tis not binary... we check for text/html,
            // but text/ alone would also work
            if (type == null || !type.StartsWith(@"text/html"))
            {
                return null;
            }

            doc = new HtmlAgilityPack.HtmlDocument();
            using (WebClient client = new WebClient())
            {
                //client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
                client.Credentials = CredentialCache.DefaultCredentials;
                client.Proxy = WebRequest.DefaultWebProxy;
                if (useProxy && !string.IsNullOrEmpty(proxyIp))
                {
                    WebProxy p = new WebProxy(proxyIp, proxyPort);
                    if (!string.IsNullOrEmpty(username))
                    {
                        if (password == null)
                            password = string.Empty;
                        p.Credentials = new NetworkCredential(username, password);
                    }
                    // Bug in the original: the proxy was created but
                    // never assigned to the client.
                    client.Proxy = p;
                }
                doc.Load(client.OpenRead(url));
            }
        }
    }
    catch (Exception err)
    {
        // Exception is swallowed; at minimum this should be logged.
    }
    return doc;
}

private static string GetUrl(string url)
{
    // Extracts the substring between the "Url: " and " ---" markers.
    string startTag = "Url: ";
    string endTag = " ---";
    int start = url.IndexOf(startTag) + startTag.Length;
    int end = url.IndexOf(endTag, start + 1);
    return url.Substring(start, end - start);
}
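For reference, GetUrl just pulls out the text between those two markers; a hypothetical example:

    // Hypothetical input string, for illustration only.
    string line = "Url: https://example.com/page --- Status: OK";
    string g = GetUrl(line); // "https://example.com/page"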

Then:

public List<string> test(string mainUrl, int levels)
{
    List<string> csFiles = new List<string>();
    wc = new System.Net.WebClient();
    HtmlWeb hw = new HtmlWeb();
    List<string> webSites;
    csFiles.Add("temp string to know that something is happening in level = " + levels.ToString());
    csFiles.Add("current site name in this level is : " + mainUrl);
    try
    {
        HtmlAgilityPack.HtmlDocument doc = TimeOut.getHtmlDocumentWebClient(mainUrl, false, "", 0, "", "");

        currentCrawlingSite.Add(mainUrl);
        webSites = getLinks(doc);

Here the variable doc comes from the getHtmlDocumentWebClient method of the TimeOut class, where I download the URL. In the same class I also have this method:

private List<string> getLinks(HtmlAgilityPack.HtmlDocument document)
{
    List<string> mainLinks = new List<string>();
    var linkNodes = document.DocumentNode.SelectNodes("//a[@href]");
    if (linkNodes != null)
    {
        foreach (HtmlNode link in linkNodes)
        {
            var href = link.Attributes["href"].Value;
            // filter for absolute http/https/www links only
            if (href.StartsWith("http://") || href.StartsWith("https://") || href.StartsWith("www"))
            {
                mainLinks.Add(href);
            }
        }
    }

    return mainLinks;
}

So, for example, say the main URL is:

https://github.com/jasonwupilly/Obsidian/tree/master/Obsidian 

There I can see more than 10 links. But in fact, when I put a breakpoint right after the line webSites = getLinks(doc); I see that the webSites list (of type List&lt;string&gt;) contains only 7 links.

Why do I see only 7 links if the main URL contains more than 10 links, all of which start with http, https, or www?

I suspect that something is wrong with the getLinks method. For some reason it does not get all the links.

Answer


I suspect that some of the links have a relative URL (e.g. href="/foo/bar/") and are being filtered out by your condition that href must start with "http://" or "https://". In those cases, you should combine the relative URL with the URL of the page:

Uri baseUri = new Uri(pageUrl); 
Uri fullUri = new Uri(baseUri, relativeUrl); 
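Putting that into practice, here is a sketch of getLinks that resolves relative links against the page URL instead of dropping them. The extra pageUrl parameter is my assumption; it is not in the original code:

    private List<string> getLinks(HtmlAgilityPack.HtmlDocument document, string pageUrl)
    {
        List<string> mainLinks = new List<string>();
        var linkNodes = document.DocumentNode.SelectNodes("//a[@href]");
        if (linkNodes == null)
            return mainLinks;

        Uri baseUri = new Uri(pageUrl); // assumes pageUrl is absolute
        foreach (HtmlNode link in linkNodes)
        {
            string href = link.Attributes["href"].Value;
            // Relative hrefs such as "/foo/bar/" are resolved against the
            // base URI; absolute http/https links pass through unchanged.
            Uri fullUri;
            if (Uri.TryCreate(baseUri, href, out fullUri) &&
                (fullUri.Scheme == Uri.UriSchemeHttp || fullUri.Scheme == Uri.UriSchemeHttps))
            {
                mainLinks.Add(fullUri.AbsoluteUri);
            }
        }
        return mainLinks;
    }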