在.NET中有效地合併字符串數組，保持不同的值

我正在使用.NET 3.5。我有兩個字符串數組，這可能共享一個或多個值：在.NET中有效地合併字符串數組，保持不同的值

string[] list1 = new string[] { "apple", "orange", "banana" }; 
string[] list2 = new string[] { "banana", "pear", "grape" };

我想辦法將它們合併到一個數組沒有重複的值：

{ "apple", "orange", "banana", "pear", "grape" }

我可以做到這一點LINQ：

string[] result = list1.Concat(list2).Distinct().ToArray();

但我想這對大型數組不是很有效。

有沒有更好的方法？

來源

2008-09-28 Jason Anderson

string[] result = list1.Union(list2).ToArray();

從msdn：「此方法從設置返回排除重複這是不同的行爲的CONCAT（TSource）方法，該方法返回所有輸入序列中的元素包括重複項。「

來源

2008-09-29 01:04:37 Wonko

我回到這個話題發佈完全這個解決方案。我相信它在各方面都是理想的！ – 2008-09-29 14:20:54

可能創建一個散列表作爲鍵值（只添加那些不存在的值），然後將鍵轉換爲數組可能是一個可行的解決方案。

來源

2008-09-28 18:08:04

免責聲明這是不成熟的優化。對於您的示例數組，請使用3.5擴展方法。在你知道你在這個地區有性能問題之前，你應該使用庫代碼。

如果你能數組排序，或當你到這一點在代碼中他們排序，你可以使用下面的方法。

這些將從兩個項目中拉出一個項目，並生成「最低」項目，然後從相應的源中獲取新項目，直到兩個源都用盡。在從兩個來源獲取的當前項目相同的情況下，它將生成來自第一個來源的項目，並在兩個來源中跳過它們。

private static IEnumerable<T> Merge<T>(IEnumerable<T> source1, 
    IEnumerable<T> source2) 
{ 
    return Merge(source1, source2, Comparer<T>.Default); 
} 

private static IEnumerable<T> Merge<T>(IEnumerable<T> source1, 
    IEnumerable<T> source2, IComparer<T> comparer) 
{ 
    #region Parameter Validation 

    if (Object.ReferenceEquals(null, source1)) 
     throw new ArgumentNullException("source1"); 
    if (Object.ReferenceEquals(null, source2)) 
     throw new ArgumentNullException("source2"); 
    if (Object.ReferenceEquals(null, comparer)) 
     throw new ArgumentNullException("comparer"); 

    #endregion 

    using (IEnumerator<T> 
     enumerator1 = source1.GetEnumerator(), 
     enumerator2 = source2.GetEnumerator()) 
    { 
     Boolean more1 = enumerator1.MoveNext(); 
     Boolean more2 = enumerator2.MoveNext(); 

     while (more1 && more2) 
     { 
      Int32 comparisonResult = comparer.Compare(
       enumerator1.Current, 
       enumerator2.Current); 
      if (comparisonResult < 0) 
      { 
       // enumerator 1 has the "lowest" item 
       yield return enumerator1.Current; 
       more1 = enumerator1.MoveNext(); 
      } 
      else if (comparisonResult > 0) 
      { 
       // enumerator 2 has the "lowest" item 
       yield return enumerator2.Current; 
       more2 = enumerator2.MoveNext(); 
      } 
      else 
      { 
       // they're considered equivalent, only yield it once 
       yield return enumerator1.Current; 
       more1 = enumerator1.MoveNext(); 
       more2 = enumerator2.MoveNext(); 
      } 
     } 

     // Yield rest of values from non-exhausted source 
     while (more1) 
     { 
      yield return enumerator1.Current; 
      more1 = enumerator1.MoveNext(); 
     } 
     while (more2) 
     { 
      yield return enumerator2.Current; 
      more2 = enumerator2.MoveNext(); 
     } 
    } 
}

請注意，如果其中一個來源包含重複項，則可能會在輸出中看到重複項。如果您想在已經排序名單上刪除這些重複，使用下面的方法：

private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source) 
{ 
    return CheapDistinct<T>(source, Comparer<T>.Default); 
} 

private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source, 
    IComparer<T> comparer) 
{ 
    #region Parameter Validation 

    if (Object.ReferenceEquals(null, source)) 
     throw new ArgumentNullException("source"); 
    if (Object.ReferenceEquals(null, comparer)) 
     throw new ArgumentNullException("comparer"); 

    #endregion 

    using (IEnumerator<T> enumerator = source.GetEnumerator()) 
    { 
     if (enumerator.MoveNext()) 
     { 
      T item = enumerator.Current; 

      // scan until different item found, then produce 
      // the previous distinct item 
      while (enumerator.MoveNext()) 
      { 
       if (comparer.Compare(item, enumerator.Current) != 0) 
       { 
        yield return item; 
        item = enumerator.Current; 
       } 
      } 

      // produce last item that is left over from above loop 
      yield return item; 
     } 
    } 
}

注意，所有這些都將在內部使用的數據結構，以保持數據的副本，所以他們會很便宜，如果輸入被排序。如果你不能或不能保證，你應該使用你已經找到的3.5擴展方法。

這裏的示例代碼調用上述方法：

String[] list_1 = { "apple", "orange", "apple", "banana" }; 
String[] list_2 = { "banana", "pear", "grape" }; 

Array.Sort(list_1); 
Array.Sort(list_2); 

IEnumerable<String> items = Merge(
    CheapDistinct(list_1), 
    CheapDistinct(list_2)); 
foreach (String item in items) 
    Console.Out.WriteLine(item);

來源

2008-09-28 18:12:00

+1開箱思考：如果他們排序？還有很多代碼。再次，排序它們所需的時間可能會超出整個目的。因此，免責聲明:) – Lucas 2008-10-01 21:35:28

你不知道，直到你衡量哪種方式速度更快。 LINQ方式優雅而易於理解。

另一種方法是將一個集合實現爲一個哈希數組（Dictionary）並將這兩個數組的所有元素添加到該集合中。然後使用set.Keys.ToArray（）方法來創建結果數組。

來源

2008-09-28 18:14:55 danatel

.NET 3.5引入了HashSet的類，它可以這樣做：

IEnumerable<string> mergedDistinctList = new HashSet<string>(list1).Union(list2);

不確定的表現，但它應該擊敗你給LINQ的例子。編輯：我站在糾正。 Concat和Distinct的延遲實現具有關鍵的內存和速度優勢。Concat/Distinct的速度提高了大約10％，並節省了多份數據。

我通過代碼確認：

Setting up arrays of 3000000 strings overlapping by 300000 
Starting Hashset... 
HashSet: 00:00:02.8237616 
Starting Concat/Distinct... 
Concat/Distinct: 00:00:02.5629681

是輸出：

 int num = 3000000; 
     int num10Pct = (int)(num/10); 

     Console.WriteLine(String.Format("Setting up arrays of {0} strings overlapping by {1}", num, num10Pct)); 
     string[] list1 = Enumerable.Range(1, num).Select((a) => a.ToString()).ToArray(); 
     string[] list2 = Enumerable.Range(num - num10Pct, num + num10Pct).Select((a) => a.ToString()).ToArray(); 

     Console.WriteLine("Starting Hashset..."); 
     Stopwatch sw = new Stopwatch(); 
     sw.Start(); 
     string[] merged = new HashSet<string>(list1).Union(list2).ToArray(); 
     sw.Stop(); 
     Console.WriteLine("HashSet: " + sw.Elapsed); 

     Console.WriteLine("Starting Concat/Distinct..."); 
     sw.Reset(); 
     sw.Start(); 
     string[] merged2 = list1.Concat(list2).Distinct().ToArray(); 
     sw.Stop(); 
     Console.WriteLine("Concat/Distinct: " + sw.Elapsed);

來源

2008-09-28 18:22:27 TheSoftwareJedi

實際上，我期望它比Concat/Distinct方式少*效率，因爲Union需要形成第二個HashSet。 – 2008-09-28 18:29:03

爲什麼你可以想象，這將是低效？就我所知，Concat和Distinct都是懶惰評估的，在Distinct的幕後使用HashSet來跟蹤已經返回的元素。

我不知道你是如何管理，使其比一般的方法:)

編輯更高效：鮮明實際使用套裝（一個內部類），而不是HashSet的，但要點是仍然正確。這是一個很好的例子，說明LINQ是多麼的簡潔。最簡單的答案几乎與沒有更多領域知識的情況下一樣高效。

效果是等價的：

public static IEnumerable<T> DistinctConcat<T>(IEnumerable<T> first, IEnumerable<T> second) 
{ 
    HashSet<T> returned = new HashSet<T>(); 
    foreach (T element in first) 
    { 
     if (returned.Add(element)) 
     { 
      yield return element; 
     } 
    } 
    foreach (T element in second) 
    { 
     if (returned.Add(element)) 
     { 
      yield return element; 
     } 
    } 
}

來源

2008-09-28 18:27:36

在.NET中有效地合併字符串數組，保持不同的值

回答

相關問題