MPI communication latency - does the size of the message matter?

I have been timing (benchmarking) parallel algorithms, more specifically matrix multiplication. I am using the following algorithm:
if (taskid == MASTER) {
    /* Split the rows of A among the workers; the first `extra`
       workers get one extra row. */
    averow = NRA/numworkers;
    extra  = NRA%numworkers;
    offset = 0;
    mtype = FROM_MASTER;
    for (dest=1; dest<=numworkers; dest++)
    {
        rows = (dest <= extra) ? averow+1 : averow;
        MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
        MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
        /* Send this worker's block of rows of A and the whole of B. */
        MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
        MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
        offset = offset + rows;
    }
    /* Collect the partial results back into C. */
    mtype = FROM_WORKER;
    for (i=1; i<=numworkers; i++)
    {
        source = i;
        MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype,
                 MPI_COMM_WORLD, &status);
        printf("Results received from process %d\n", source);
    }
}
else {
    /* Worker: receive a block of rows of A and all of B ... */
    mtype = FROM_MASTER;
    MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
    MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
    MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
    MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
    /* ... multiply the local block ... */
    for (k=0; k<NCB; k++)
        for (i=0; i<rows; i++)
        {
            c[i][k] = 0.0;
            for (j=0; j<NCA; j++)
                c[i][k] = c[i][k] + a[i][j] * b[j][k];
        }
    /* ... and return the partial result to the master. */
    mtype = FROM_WORKER;
    MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
    MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
    MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
}
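For reference, the communication and the computation phases can be timed separately by bracketing each one with MPI_Wtime(); a minimal sketch (the t0/t1 names are only illustrative, they are not in the code above):

    /* Sketch: time one phase of the program above, e.g. the master's
       send loop or the worker's multiply loop. */
    double t0, t1;
    MPI_Barrier(MPI_COMM_WORLD);   /* optional: start all ranks together */
    t0 = MPI_Wtime();
    /* ... phase to be measured ... */
    t1 = MPI_Wtime();
    printf("phase took %f seconds\n", t1 - t0);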
I noticed that square matrices take less time than rectangular ones. For example: using 4 nodes (one of them as master), with A 500x500 and B 500x500, each node performs about 41.5 million iterations, whereas with A 2,400,000x6 and B 6x6 each node performs about 28.8 million iterations. Even though the second case does fewer iterations, it takes about 1.00 s, while the first takes only about 0.46 s.
Logically, the second case should be faster, since it performs fewer iterations per node. After doing some math, I realized that in the first case MPI sends and receives about 83,000 elements per message, while in the second case it sends and receives 4,800,000 elements per message.
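Since an MPI_DOUBLE is 8 bytes, that is roughly 0.66 MB per message in the first case versus about 38 MB per message in the second. To see how the raw transfer time grows with the message size, a simple ping-pong test can be used; the sketch below is a separate toy program (the counts 83000 and 4800000 are just the two per-message sizes estimated above), to be run with at least two processes:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Toy ping-pong benchmark (separate from the program above): rank 0 sends
       `count` doubles to rank 1 and receives them back, so the one-way
       transfer time is roughly half of the round-trip time. */
    int main(int argc, char *argv[])
    {
        int rank, i, rep, reps = 10;
        int counts[2] = { 83000, 4800000 };  /* per-message sizes of the two cases */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < 2; i++) {
            double *buf = malloc(counts[i] * sizeof(double)); /* contents irrelevant */
            double t0, t1;

            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
            for (rep = 0; rep < reps; rep++) {
                if (rank == 0) {
                    MPI_Send(buf, counts[i], MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, counts[i], MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status);
                } else if (rank == 1) {
                    MPI_Recv(buf, counts[i], MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
                    MPI_Send(buf, counts[i], MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
                }
            }
            t1 = MPI_Wtime();

            if (rank == 0)
                printf("%d doubles: %.6f s per one-way message\n",
                       counts[i], (t1 - t0) / (2.0 * reps));
            free(buf);
        }

        MPI_Finalize();
        return 0;
    }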
Does the size of the messages explain the latency?