2010-01-31 19 views
2

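Extracting URLs using regular expressions in .NET

I took inspiration from the example program shown at the csharp-online URL below, and I intend to retrieve all the URLs from this Alexa page.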

using System; 
using System.Collections; 
using System.Collections.Generic; 
using System.Linq; 
using System.Text; 
using System.Net; 
using System.Text.RegularExpressions; 
namespace ExtractingUrls 
{ 
    class Program 
    { 
     static void Main(string[] args) 
     { 
      WebClient client = new WebClient(); 
      const string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology"; 
      string source = client.DownloadString(url); 
      //Console.WriteLine(Getvals(source)); 
      string matchPattern = 
        @"<a.rel=""nofollow"".style=""font-size:0.8em;"".href=[""'](?<url>[^""^']+[.]*)[""'].class=""offsite"".*>(?<name>[^<]+[.]*)</a>"; 
      foreach (Hashtable grouping in ExtractGroupings(source, matchPattern, true)) 
      { 
       foreach (DictionaryEntry DE in grouping) 
       { 
        Console.WriteLine("Value = " + DE.Value); 
        Console.WriteLine(""); 
       } 
      } 
      // End. 
      Console.ReadLine(); 
     } 
     public static ArrayList ExtractGroupings(string source, string matchPattern, bool wantInitialMatch) 
     { 
      ArrayList keyedMatches = new ArrayList(); 
      int startingElement = 1; 
      if (wantInitialMatch) 
      { 
       startingElement = 0; 
      } 
      Regex RE = new Regex(matchPattern, RegexOptions.Multiline); 
      MatchCollection theMatches = RE.Matches(source); 
      foreach (Match m in theMatches) 
      { 
       Hashtable groupings = new Hashtable(); 
       for (int counter = startingElement; counter < m.Groups.Count; counter++) 
       { 
        // If we had just returned the MatchCollection directly, the 
        // GroupNameFromNumber method would not be available to use 
        groupings.Add(RE.GroupNameFromNumber(counter), 
        m.Groups[counter]); 
       } 
       keyedMatches.Add(groupings); 
      } 
      return (keyedMatches); 
     } 
    } 
} 

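But I am facing a problem here: when I run this, each URL is displayed three times. First the whole anchor tag is displayed, and then the URL is displayed twice. Can anyone suggest what I should correct so that each URL is displayed only once?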

+2

**Do _not_ parse HTML using regular expressions** http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – SLaks 2010-01-31 23:49:35

+0

@SLacks: "Sometimes it is appropriate to parse a limited, known set of HTML" – 2010-02-06 01:12:01

Answers

1
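You have two groupings in your regex, plus the entire match. If I'm reading it correctly, you should only want the URL part of each match, which is the second of the 3 groups...

So instead of this: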

for (int counter = startingElement; counter < m.Groups.Count; counter++) 
      { 
       // If we had just returned the MatchCollection directly, the 
       // GroupNameFromNumber method would not be available to use 
       groupings.Add(RE.GroupNameFromNumber(counter), 
       m.Groups[counter]); 
      } 

Don't you want this instead?

groupings.Add(RE.GroupNameFromNumber(1),m.Groups[1]); 
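
Because the pattern in the question names its groups, the URL can also be read by name rather than by index. The following is only a minimal sketch of that idea: the `UrlExtractor` class, the `ExtractUrls` helper, and the sample markup are made up for illustration, and the pattern is copied unchanged from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UrlExtractor
{
    // Pattern copied from the question; the named group "url" captures the href value.
    const string Pattern =
        @"<a.rel=""nofollow"".style=""font-size:0.8em;"".href=[""'](?<url>[^""^']+[.]*)[""'].class=""offsite"".*>(?<name>[^<]+[.]*)</a>";

    // Hypothetical helper: returns one string per matched anchor (the href only).
    public static List<string> ExtractUrls(string source)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(source, Pattern, RegexOptions.Multiline))
        {
            // Read the capture by group name instead of looping over every group.
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    static void Main()
    {
        // Made-up markup shaped like the anchors the pattern expects.
        string sample =
            @"<a rel=""nofollow"" style=""font-size:0.8em;"" href=""http://example.com/"" class=""offsite"">example.com</a>";

        foreach (string url in ExtractUrls(sample))
        {
            Console.WriteLine(url);
        }
    }
}
```

Returning plain strings instead of a `Hashtable` per match also sidesteps the fact that a `Hashtable` does not guarantee enumeration order.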
3

Use the HTML Agility Pack to parse the HTML. I think it will make your problem much easier to solve.

Here is one way to do it:

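// Requires the HtmlAgilityPack library (using HtmlAgilityPack; using System.Net;)
WebClient client = new WebClient(); 
string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology"; 
string source = client.DownloadString(url); 
HtmlDocument doc = new HtmlDocument(); 
doc.LoadHtml(source); 
// SelectNodes returns null when nothing matches, so guard before iterating.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href and @rel='nofollow']"); 
if (links != null) 
{ 
    foreach (HtmlNode link in links) 
    { 
        Console.WriteLine(link.Attributes["href"].Value); 
    } 
} 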
1
int startingElement = 1; 
if (wantInitialMatch) 
{ 
    startingElement = 0; 
} 

...

for (int counter = startingElement; counter < m.Groups.Count; counter++) 
{ 
    // If we had just returned the MatchCollection directly, the 
    // GroupNameFromNumber method would not be available to use 
    groupings.Add(RE.GroupNameFromNumber(counter), 
        m.Groups[counter]); 
} 

You are passing wantInitialMatch = true, so your for loop returns:

m.Groups[0] // entire match 
m.Groups[1] // (?<url>[^""^']+[.]*) href part 
m.Groups[2] // (?<name>[^<]+[.]*) link text 
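
To see that numbering in isolation, the sketch below runs the question's pattern against a single hand-written anchor and lists each group's number, name, and value. The `GroupDump` class, the `Describe` helper, and the sample markup are assumptions for illustration only; the real Alexa markup may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class GroupDump
{
    // Pattern copied from the question.
    const string Pattern =
        @"<a.rel=""nofollow"".style=""font-size:0.8em;"".href=[""'](?<url>[^""^']+[.]*)[""'].class=""offsite"".*>(?<name>[^<]+[.]*)</a>";

    // Hypothetical helper: one "number (name): value" line per group of the first match.
    public static List<string> Describe(string source)
    {
        var regex = new Regex(Pattern, RegexOptions.Multiline);
        var lines = new List<string>();
        Match m = regex.Match(source);
        if (!m.Success)
        {
            return lines; // no anchor matched, nothing to describe
        }
        // Group 0 is always the entire match; captures start at 1.
        for (int i = 0; i < m.Groups.Count; i++)
        {
            lines.Add(i + " (" + regex.GroupNameFromNumber(i) + "): " + m.Groups[i].Value);
        }
        return lines;
    }

    static void Main()
    {
        // Made-up markup shaped like the anchors the pattern expects.
        string sample =
            @"<a rel=""nofollow"" style=""font-size:0.8em;"" href=""http://example.com/"" class=""offsite"">example.com</a>";

        foreach (string line in Describe(sample))
        {
            Console.WriteLine(line);
        }
    }
}
```

On this sample, group 0 holds the whole anchor tag, group 1 the href, and group 2 the link text. If the link text on the real page looks like the site's address, that would presumably explain why the URL seemed to be printed twice.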
+0

Thanks Paul, now I understand where I went wrong. – Chaitanya 2010-01-31 23:54:23