在Scala中並行迭代集合的最有效方式是什麼（TOP N模式）

我是Scala的新手，想要構建一個實時應用程序來匹配某些人。對於一個給定的人，我想獲得匹配得分最高的前50名的人。在Scala中並行迭代集合的最有效方式是什麼（TOP N模式）

成語如下：

val persons = new mutable.HashSet[Person]() // Collection of people 
/* Feed omitted */ 
val personsPar = persons.par // Make it parall 
val person = ... // The given person 

res = personsPar 
     .filter(...) // Some filters 
     .map{p => (p,computeMatchingScoreAsFloat(person, p))} 
     .toList 
     .sortBy(-_._2) 
     .take(50) 
     .map(t => t._1 + "=" + t._2).mkString("\n")

在上面的示例代碼，HashSet的使用，但它可以是任何類型的集合，因爲我敢肯定這是不是最佳

的問題是人員包含超過500萬個元素，computeMatchingScoreAsFloat方法計算一種具有200個浮點數的2個向量的相關值。這個計算在我的6核心電腦上花費約2秒。

我的問題是，在Scala中做這個TOPN模式的最快方法是什麼？

子問題： - 我應該使用什麼實現集合（或其他？）？ - 我應該使用期貨嗎？

注：它在平行於被計算，computeMatchingScoreAsFloat的純計算單獨（沒有排名/ TOP N）開超過一秒，我的計算機

上< 200毫秒，如果多線程

編輯：感謝紀堯姆，計算時間從2秒降低到700個毫秒

def top[B](n:Int,t: Traversable[B])(implicit ord: Ordering[B]):collection.mutable.PriorityQueue[B] = { 

    val starter = collection.mutable.PriorityQueue[B]()(ord.reverse) // Need to reverse for us to capture the lowest (of the max) or the greatest (of the min) 

    t.foldLeft(starter)(
    (myQueue,a) => { 
     if(myQueue.length <= n){ myQueue.enqueue(a);myQueue} 
     else if(ord.compare(a,myQueue.head) < 0 ) myQueue 
     else{ 
     myQueue.dequeue 
     myQueue.enqueue(a) 
     myQueue 
     } 
    } 
) 
}

感謝

來源

2016-11-04 ogen

你有這個基準嗎？ computeMatchingScoreAsFloat是最昂貴的部分？它足夠重要，可以進行並行處理嗎？ –

是的，它必須並行計算，computeMatchingScoreAsFloat（沒有排名/ TOP N）的純計算需要超過一秒，如果我的電腦上有多線程，則需要40 ms – ogen

我會提出一些更改：

1-我相信過濾器和映射步驟需要遍歷集合兩次。有一個懶惰的收集會減少到一個。要麼有一個懶惰的收集（如流）或將其轉換爲一個，例如對於一個列表：

myList.view

2-排序步驟需要排序所有元素。相反，您可以使用一個累加器存儲前N個記錄來執行FoldLeft。有關實現的示例，請參閱： Simplest way to get the top n elements of a Scala Iterable。如果你想獲得最高性能（真正落入其駕駛室），我可能會測試優先隊列而不是列表。舉例來說，這樣的事情：

def IntStream(n:Int):Stream[(Int,Int)] = if(n == 0) Stream.empty else (util.Random.nextInt,util.Random.nextInt) #:: IntStream(n-1) 

    def top[B](n:Int,t: Traversable[B])(implicit ord: Ordering[B]):collection.mutable.PriorityQueue[B] = { 

    val starter = collection.mutable.PriorityQueue[B]()(ord.reverse) // Need to reverse for us to capture the lowest (of the max) or the greatest (of the min) 

    t.foldLeft(starter)(
     (myQueue,a) => { 
     if(myQueue.length <= n){ myQueue.enqueue(a);myQueue} 
     else if(ord.compare(a,myQueue.head) < 0 ) myQueue 
     else{ 
      myQueue.dequeue 
      myQueue.enqueue(a) 
      myQueue 
     } 
     } 
    ) 
    } 

def diff(t2:(Int,Int)) = t2._2 
top(10,IntStream(10000))(Ordering.by(diff)) // select top 10

我真的認爲你的問題需要單收集橫移，所以你就能下來讓你的運行時間低於1秒

祝你好運！

來源

2016-11-04 15:53:23

感謝您的幫助，我會測試這個告訴你新的計算時間 – ogen

看來我不能在並行集合（並行和懶惰的集合）上查看視圖，我可以處理ParSeq或一個SeqView。你知道我能做到嗎？Scala中是否有ParSeqView？ – ogen

您是否嘗試過在非平行集合上運行它？我認爲它具有平行集合有很大的開銷，並且可能是不合理的（另外，你有一個「toList」，我懷疑它合併成一個不平行的集合）。否則，你可以運行一個flatMap來結合過濾器和map（見http://stackoverflow.com/questions/32234132/how-to-combine-filter-and-map-in-scala），這將有效地得到相同的結果作爲懶惰集合 –

在Scala中並行迭代集合的最有效方式是什麼（TOP N模式）

回答

相關問題