2013-04-22 14 views
0

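LINQ to SQL: removing URLs that start with the same domain

I have a list of URLs in a DataTable. I want to remove the rows that start with the same domain, keeping only the first row for each domain. Right now I have this code: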

List<int> toRemove = new List<int>();
string initialDomain;
string compareDomainName;
for (int i = 0; i < UrlList.Rows.Count - 1; i++)
{
    // Skip rows already marked as duplicates of an earlier row.
    if (toRemove.Contains(i))
        continue;

    initialDomain = new Uri(UrlList.Rows[i][0] as String).Host;
    for (int j = i + 1; j < UrlList.Rows.Count; j++)
    {
        compareDomainName = new Uri(UrlList.Rows[j][0] as String).Host;
        // Case-insensitive host comparison.
        if (String.Compare(initialDomain, compareDomainName, true) == 0)
        {
            toRemove.Add(j);
        }
    }

    // percent, lastPercent, total and progress are declared elsewhere.
    percent = i * 100 / total;
    if (percent > lastPercent)
    {
        progress.EditValue = percent;
        Application.DoEvents();
        lastPercent = percent;
    }
}

// Indexes are collected out of order, so sort them before removing
// from the highest index down; otherwise later indexes would shift.
toRemove.Sort();
for (int i = toRemove.Count - 1; i >= 0; i--)
{
    UrlList.Rows.RemoveAt(toRemove[i]);
}

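It works fine for small amounts of data, but it is very slow when I load a long list of URLs. Now I want to move to LINQ, but I don't know how to achieve this with LINQ. Any help?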

Update: I am not asking how to remove exact duplicate rows; I already know how to do that. My question is this. I have a simple list of URLs:

http://centroid.steven.centricagency.com/forms/contact-us?page=1544 
http://chirp.wildcenter.org/poll 
http://itdiscover.com/links/ 
http://itdiscover.com/links/?page=132 
http://itdiscover.com/links/?page=2 
http://itdiscover.com/links/?page=3 
http://itdiscover.com/links/?page=4 
http://itdiscover.com/links/?page=6 
http://itdiscover.com/links/?page=8 

http://www.foreignpolicy.com/articles/2010/06/21/la_vie_en 
http://www.foreignpolicy.com/articles/2010/06/21/the_worst_of_the_worst 
http://www.foreignpolicy.com/articles/2011/04/25/think_again_dictators 
http://www.foreignpolicy.com/articles/2011/08/22/the_dictators_survival_guide 
http://www.gsioutdoors.com/activities/pdp/glacier_ss_nesting_wine_glass/gourmet_backpacking/ 
http://www.gsioutdoors.com/products/pdp/telescoping_foon_orange/ 
http://www.gsioutdoors.com/products/pdp/telescoping_spoon_blue/ 

Now I want to end up with this list:

http://centroid.steven.centricagency.com/forms/contact-us?page=1544
http://chirp.wildcenter.org/poll
http://itdiscover.com/links/
http://www.foreignpolicy.com/articles/2010/06/21/la_vie_en
http://www.gsioutdoors.com/activities/pdp/glacier_ss_nesting_wine_glass/gourmet_backpacking/
+0

Do you want to remove duplicate rows? – Tim 2013-04-22 06:23:29

Answers

2
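// Keeps the first URL seen for each host.
var result = urls.Distinct(new UrlComparer());

public class UrlComparer : IEqualityComparer<string>
{
    // Two URLs are considered equal when their hosts match.
    public bool Equals(string x, string y)
    {
        return new Uri(x).Host == new Uri(y).Host;
    }

    // The hash code must be derived from the same key that Equals compares.
    public int GetHashCode(string obj)
    {
        return new Uri(obj).Host.GetHashCode();
    }
}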

You can also implement a DistinctBy extension method:

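public static partial class MyExtensions
{
    public static IEnumerable<T> DistinctBy<T, TKey>(this IEnumerable<T> source, Func<T, TKey> keySelector)
    {
        // Written as an iterator so that each enumeration starts with a fresh,
        // empty set; a set captured by a Where lambda would be shared between
        // enumerations, and a second pass would yield nothing.
        HashSet<TKey> knownKeys = new HashSet<TKey>();
        foreach (T item in source)
        {
            if (knownKeys.Add(keySelector(item)))
                yield return item;
        }
    }
}

var result = urls.DistinctBy(url => new Uri(url).Host);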
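The snippets above work on a sequence of strings, while the question keeps the URLs in a DataTable. Below is a minimal sketch of one way to apply the same hash-set idea directly to the table, so that the nested loops are replaced by a single pass. It assumes column 0 holds an absolute URL string, as in the question's code; `UrlTableFilter` and `KeepFirstRowPerHost` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class UrlTableFilter
{
    // Hypothetical helper: removes every row whose host already appeared
    // in an earlier row. Assumes column 0 holds an absolute URL string.
    public static void KeepFirstRowPerHost(DataTable urlList)
    {
        // Case-insensitive set, matching String.Compare(..., true) in the question.
        var seenHosts = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var duplicates = new List<DataRow>();

        foreach (DataRow row in urlList.Rows)
        {
            string host = new Uri((string)row[0]).Host;
            if (!seenHosts.Add(host))
                duplicates.Add(row); // host seen before: mark the row for removal
        }

        // Remove after the scan so the Rows collection is not modified
        // while it is being enumerated.
        foreach (DataRow row in duplicates)
            urlList.Rows.Remove(row);
    }
}
```

Calling `UrlTableFilter.KeepFirstRowPerHost(UrlList);` would leave only the first row for each host in the table. Collecting `DataRow` references rather than indexes avoids the index-shifting problem when rows are removed.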
+0

Thanks, I have used the IEqualityComparer approach. When I have time I will try to compare the speed of the two examples. – 2013-04-22 06:45:39

-1

Hi, implement this function to eliminate the duplicate rows:

// Assumes the table has a string "Host" column and an int "ID" column.
public DataTable FilterURLS(DataTable urllist)
{
    return
        (from urlrow in urllist.Rows.OfType<DataRow>()
         group urlrow by urlrow.Field<string>("Host") into g
         select g
             .OrderBy(r => r.Field<int>("ID"))
             .First()).CopyToDataTable();
}
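The function above relies on "Host" and "ID" columns, which the table in the question may not have. A hedged sketch of the same grouping idea that derives the host from the URL itself follows. It assumes column 0 holds an absolute URL string and that the table is not empty; `UrlTableGrouping` and `FirstRowPerHost` are hypothetical names.

```csharp
using System;
using System.Data;
using System.Linq;

public static class UrlTableGrouping
{
    // Hypothetical helper: returns a new table containing the first row for each host.
    // Assumes column 0 holds an absolute URL string and the table is not empty
    // (CopyToDataTable throws InvalidOperationException when it receives no rows).
    public static DataTable FirstRowPerHost(DataTable urlList)
    {
        return urlList.AsEnumerable()
            // Group rows by host, ignoring case.
            .GroupBy(row => new Uri(row.Field<string>(0)).Host,
                     StringComparer.OrdinalIgnoreCase)
            // Rows inside a group keep their source order, so First() is the earliest row.
            .Select(group => group.First())
            .CopyToDataTable();
    }
}
```

Unlike removing rows in place, this builds a new DataTable and leaves the original untouched, which may or may not suit a grid that is bound to the original table.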
0

Try using this:

IEnumerable<string> DeleteDuplicates(IEnumerable<string> source)
{
    var hosts = new HashSet<string>();

    foreach (var s in source)
    {
        var host = new Uri(s).Host.ToLower();

        // Add returns false when the host is already in the set.
        if (!hosts.Add(host))
            continue;

        yield return s;
    }
}
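One caveat that applies to every snippet in this thread: `new Uri(...)` throws a UriFormatException when a string is not a valid absolute URL, so a single bad row stops the whole filter. The sketch below is a variant of the iterator above that uses `Uri.TryCreate` instead. Passing unparsable strings through unchanged is an assumption made for illustration (dropping or logging them would be equally reasonable); `UrlFilter` and `FirstUrlPerHost` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class UrlFilter
{
    // Hypothetical helper: yields the first URL seen for each host.
    public static IEnumerable<string> FirstUrlPerHost(IEnumerable<string> source)
    {
        // Case-insensitive set, so no explicit lower-casing is needed.
        var hosts = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        foreach (var s in source)
        {
            Uri uri;
            if (!Uri.TryCreate(s, UriKind.Absolute, out uri))
            {
                // Assumption: strings that do not parse as absolute URLs
                // are passed through unchanged rather than dropped.
                yield return s;
                continue;
            }

            // Add returns false when the host was already seen.
            if (hosts.Add(uri.Host))
                yield return s;
        }
    }
}
```

Usage would look like `var result = UrlFilter.FirstUrlPerHost(urls).ToList();` with `using System.Linq;` in scope.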