Parallel.For() can improve performance a lot by parallelizing your code, but it also has overhead (synchronization between threads, invoking the delegate on each iteration). And since in your code each iteration is very short (basically just a few CPU instructions), this overhead can become prominent. Because of this, I don't think using Parallel.For() is the right solution for you. Instead, if you parallelize the code manually (which is very simple in this case), you may see the performance improve.
To verify this, I performed some measurements: I ran different implementations of MultiplicateArray() on an array of 200,000,000 items (the code I used is below). On my machine, the serial version consistently took 0.21 s and Parallel.For() usually took around 0.45 s, but from time to time, it spiked to 8–9 s!
First, I'll try to improve the common case and come back to those spikes later. We want to process the array with N CPUs, so we split it into N equally sized parts and process each part separately. The result? 0.35 s. That's still worse than the serial version. But a for loop over each item of an array is one of the most optimized constructs there is. Can't we do something to help the compiler? Extracting the computation of the loop's upper bound could help. It turns out it does: 0.18 s. That's better than the serial version, but not by much. Interestingly, changing the degree of parallelism from 4 to 2 on my 4-core machine (no HyperThreading) doesn't change the result: still 0.18 s. This makes me conclude that the CPU is not the bottleneck here, memory bandwidth is.
Now, back to the spikes: my custom parallelization doesn't have them, but Parallel.For() does. Why is that? Parallel.For() does use range partitioning, which means each thread processes its own part of the array. But if one thread finishes early, it will try to help process the range of another thread that hasn't finished yet. If that happens, you'll get a lot of false sharing, which could slow down the code considerably. And my own test with forced false sharing seems to indicate that this could indeed be the problem. Forcing the degree of parallelism of Parallel.For() seems to help with the spikes a little.
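If you want to stay with the TPL instead of starting the tasks by hand, a middle ground worth trying is Parallel.ForEach over Partitioner.Create(0, array.Length): each worker then gets a contiguous (from, to) range and runs a plain inner for loop over it, so the per-element delegate invocation goes away. This is a minimal sketch, not part of the measured code below, and the names RangePartitionDemo and Multiply are mine:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class RangePartitionDemo
{
    public static void Multiply(double[] array, double factor)
    {
        // Partitioner.Create(0, length) hands each worker contiguous
        // (fromInclusive, toExclusive) ranges, so the delegate is invoked
        // once per range instead of once per element, and each worker runs
        // a tight inner for loop over its own slice.
        Parallel.ForEach(Partitioner.Create(0, array.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                array[i] = array[i] * factor;
            }
        });
    }

    static void Main()
    {
        double[] array = new double[1000];
        for (int i = 0; i < array.Length; i++)
            array[i] = 1;

        Multiply(array, 2);
        Console.WriteLine(array[0]); // prints 2
    }
}
```

Note that the ranges are still handed out dynamically, so this doesn't reproduce the strictly static one-slice-per-thread split of the custom versions below; it only removes the per-iteration overhead.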
Of course, all these measurements are specific to the hardware of my computer and will differ for you, so you should make your own measurements.
The code I used:
static void Main()
{
    double[] array = new double[200 * 1000 * 1000];
    for (int i = 0; i < array.Length; i++)
        array[i] = 1;

    for (int i = 0; i < 5; i++)
    {
        Stopwatch sw = Stopwatch.StartNew();
        Serial(array, 2);
        Console.WriteLine("Serial: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ParallelFor(array, 2);
        Console.WriteLine("Parallel.For: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ParallelForDegreeOfParallelism(array, 2);
        Console.WriteLine("Parallel.For (degree of parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallel(array, 2);
        Console.WriteLine("Custom parallel: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelExtractedMax(array, 2);
        Console.WriteLine("Custom parallel (extracted max): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelExtractedMaxHalfParallelism(array, 2);
        Console.WriteLine("Custom parallel (extracted max, half parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelFalseSharing(array, 2);
        Console.WriteLine("Custom parallel (false sharing): {0:f2} s", sw.Elapsed.TotalSeconds);
    }
}
static void Serial(double[] array, double factor)
{
    for (int i = 0; i < array.Length; i++)
    {
        array[i] = array[i] * factor;
    }
}
static void ParallelFor(double[] array, double factor)
{
    Parallel.For(
        0, array.Length, i => { array[i] = array[i] * factor; });
}
static void ParallelForDegreeOfParallelism(double[] array, double factor)
{
    Parallel.For(
        0, array.Length,
        new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
        i => { array[i] = array[i] * factor; });
}
static void CustomParallel(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;
    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}
static void CustomParallelExtractedMax(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;
    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < max;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}
static void CustomParallelExtractedMaxHalfParallelism(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount / 2;
    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < max;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}
static void CustomParallelFalseSharing(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;
    var tasks = new Task[degreeOfParallelism];

    int i = -1;
    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                int j = Interlocked.Increment(ref i);
                while (j < array.Length)
                {
                    array[j] = array[j] * factor;
                    j = Interlocked.Increment(ref i);
                }
            });
    }

    Task.WaitAll(tasks);
}
Sample output:
Serial: 0,20 s
Parallel.For: 0,50 s
Parallel.For (degree of parallelism): 8,90 s
Custom parallel: 0,33 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,53 s
Serial: 0,21 s
Parallel.For: 0,52 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,19 s
Custom parallel (false sharing): 7,59 s
Serial: 0,21 s
Parallel.For: 11,21 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,32 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,76 s
Serial: 0,21 s
Parallel.For: 0,46 s
Parallel.For (degree of parallelism): 0,35 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
Serial: 0,21 s
Parallel.For: 0,45 s
Parallel.For (degree of parallelism): 0,40 s
Custom parallel: 0,38 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
How big are your arrays? They must be huge if code like this takes minutes. Is there some other way to parallelize the code? – svick
What hardware do you have? Also, parallel tasks are very effective for I/O operations (writing to disk, calling web services, ...), not for "worker" (CPU-intensive) tasks. – Cybermaxs
@Cybermaxs I think you're wrong. Parallelizing CPU-intensive tasks can be very effective (if you have a multi-core machine). Parallelizing writes to disk, on the other hand, will most likely decrease performance rather than improve it, because you're forcing the disk to constantly seek between several locations. – svick