2014-01-30 34 views
0

我正在研究在Java堆中添加值的不同可能性。我正在使用PriorityHeap類。當我在應用程序中注意到運行時間緩慢時,我決定給這個看看。我添加了幾千個,有時是數百萬個自定義條目(我有一個自定義類,它有3個字段:int,LongWritable和Text,都來自hadoop.io; this儀器代理說我的記錄平均有200個字節)。將add()與addAll()插入到Java PriorityHeap中

是很明顯,使用的addAll()代替add()方法把條目堆將提高性能,只是因爲這會避免幾個heapify操作?

我試着用不同的策略使用以下例如:

package Sorting; 

import java.io.IOException; 
import java.util.ArrayList; 
import java.util.List; 
import java.util.PriorityQueue; 

public class Main { 

private static final int HEAP_SIZE = 1000000; 
private static final int BULK_LIST_SIZE = HEAP_SIZE/10; 

private static String normal; 
private static String bulk; 
private static String fullBulk; 

public static void main(String[] args) throws IOException { 
normal = ""; 
bulk = ""; 
fullBulk = ""; 
long time = 0; 

warmup(); 

normal = ""; 
bulk = ""; 
fullBulk = ""; 

for (int i = 0; i < 100; i++) { 

    // Normal add time 
    System.out.println("Normal starts..."); 
    time = normalExecution(); 
    System.out.println("Normal add time " + time); 

    // Bulk add time 
    System.out.println("Bulk starts..."); 
    time = bulk(); 
    System.out.println("Bulk add time " + time); 

    // Bulk add time with list and heap with same size 
    System.out.println("Full Bulk starts..."); 
    time = fullBulk(); 
    System.out.println("Full Bulk add time " + time); 
} 
System.out.println(normal); 
System.out.println(bulk); 
System.out.println(fullBulk); 

} 

private static long fullBulk() { 
long time; 
long start; 
List<Double> fullBulkList = new ArrayList<Double>(HEAP_SIZE); 
PriorityQueue<Double> fullBulkHeap = new PriorityQueue<Double>(HEAP_SIZE); 

start = System.nanoTime(); 
for (int j = 0; j < HEAP_SIZE; j++) { 
    if (fullBulkList.size() == HEAP_SIZE) { 
    fullBulkHeap.addAll(fullBulkList); 
    fullBulkList.clear(); 
    } 
} 
fullBulkHeap.addAll(fullBulkList); 
time = System.nanoTime() - start; 

fullBulk = fullBulk + "\t" + time; 
fullBulkList = null; 
fullBulkHeap = null; 
return time; 
} 

private static long bulk() { 
long time; 
long start; 
List<Double> bulkList = new ArrayList<Double>(BULK_LIST_SIZE); 
PriorityQueue<Double> bulkHeap = new PriorityQueue<Double>(HEAP_SIZE); 
start = System.nanoTime(); 
for (int j = 0; j < HEAP_SIZE; j++) { 
    if (bulkList.size() == BULK_LIST_SIZE) { 
    bulkHeap.addAll(bulkList); 
    bulkList.clear(); 
    } 
} 
bulkHeap.addAll(bulkList); 
time = System.nanoTime() - start; 
bulk = bulk + "\t" + time; 
bulkList = null; 
bulkHeap = null; 
return time; 
} 

private static long normalExecution() { 

long time; 
long start; 
PriorityQueue<Double> normalHeap = new PriorityQueue<Double>(HEAP_SIZE); 
start = System.nanoTime(); 
for (int j = 0; j < HEAP_SIZE; j++) { 
    normalHeap.add(Double.MAX_VALUE); 
} 
time = System.nanoTime() - start; 
normal = normal + "\t" + time; 
normalHeap = null; 
return time; 
} 

private static void warmup() { 
System.out.println("Starting warmup"); 
for (int i = 0; i < 1000; i++) { 
    normalExecution(); 
    bulk(); 
    fullBulk(); 
} 
for (int i = 0; i < 1000; i++) { 

    bulk(); 
    fullBulk(); 
    normalExecution(); 
} 
for (int i = 0; i < 1000; i++) { 

    fullBulk(); 
    normalExecution(); 
    bulk(); 
} 
System.out.println("Warmup finished"); 
} 

} 

這導致以下結果:

enter image description here

中的第11次迭代的巨大峯值正常添加方法由GC調用解釋:[GC 1347684K->31354K(1446400K), 0.0331610 secs]

Mediam值分別爲16049669,783724和800276。 ST開發者是3512492.89,244374.17和33344.17。

+0

僅從算法的角度來看,即使會有堆的批量插入,也需要篩選每個新添加的元素以保留堆屬性。你的基準也有缺陷(你的瓶頸是IO不是堆),請看http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java –

+0

非常感謝這個鏈接。 –

回答

2

PriorityQueue不覆蓋從AbstractQueue繼承的方法addAll

AbstractQueue這個方法看起來像這樣。

public boolean addAll(Collection<? extends E> c) { 
    if (c == null) 
     throw new NullPointerException(); 
    if (c == this) 
     throw new IllegalArgumentException(); 
    boolean modified = false; 
    for (E e : c) 
     if (add(e)) 
      modified = true; 
    return modified; 
} 

正如您所看到的,它只是循環並調用add

所以我不認爲addAll會比add更好。

+0

我同意你的看法,但是如果這個內部添加元素循環會導致更少的上下文切換,那麼它會比在對象之外嗎?我根據問題評論中的鏈接提供的一些建議重新運行測試,結果很明顯... –

+0

@PedroDusso你甚至知道上下文切換是什麼嗎?你的基準是假的,因爲你在代碼中擁有IO訪問權限。第一次調用可能從磁盤讀取,而其他調用則被緩存。這是差異來自哪裏。 –

+0

@ThomasJungblut是的,我有。這是一個糟糕的表達。我的意思是可能是緩存局部性問題。從類內部的addall添加的調用比從外部調用更加接近。不過,我同意你的算法觀點。 –