Fastest way to read data from a large number of ASCII files

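For a university exercise that I have already handed in, I needed to read a .txt file containing a large number of image names (one per line). I then needed to open each image as an ASCII file, read its data (the images are in PPM format), and do a series of things with it. The thing is, I noticed that my program spends 70% of its time reading the data from the files, rather than in the other computations I'm doing (using a hash table to count how many times each pixel repeats, finding the pixels that differ between two images, etc.), which I find odd, to say the least.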

This is what the PPM format looks like:

P3 //This value can be ignored when reading the file, because all image will be correctly formatted 
4 4 
255 //This value can be also ignored, will be always 255. 
0 0 0 0 0 0 0 0 0 15 0 15 
0 0 0 0 15 7 0 0 0 0 0 0 
0 0 0 0 0 0 0 15 7 0 0 0 
15 0 15 0 0 0 0 0 0 0 0 0

This is how I read the data from the files:

ifstream fdatos; 
fdatos.open(argv[1]); //Open file with the name of all the images 

const int size = 128; 
char file[size]; //Where I'll get the image name 

Image *img; 
while (fdatos >> std::setw(size) >> file) { //While there are still image names left, continue (setw keeps a long name from overflowing the buffer) 
    ifstream fimagen; 
    fimagen.open(file); //Open image file 
    img = new Image(fimagen); //Create a new image object with its data file 
    ……… 
    //Rest of the calculations with that image 
    ……… 
    delete img; //Delete image object after done 
    fimagen.close(); //Close image file after done 
} 

fdatos.close();

And the Image object reads the data internally like this:

const int tallafirma = 100; 
char firma[tallafirma]; 
fich_in >> std::setw(100) >> firma; // Read the P3 part, can be ignored 

int maxvalue, numpixels; 
fich_in >> width >> height >> maxvalue; // Read the next three values (a PPM header stores width first, then height) 
numpixels = height*width; 
datos = new Pixel[numpixels]; 

int r,g,b; //Don't need to be ints, max value is 255, so an unsigned char would be ok. 
for (int i=0; i<numpixels; i++) { 
    fich_in >> r >> g >> b; 
    datos[i] = Pixel(r, g ,b); 
} 
//This last part is the slow one. 
//I think I should be able to read all this data in one single read 
//into a buffer or something, stored in an array of unsigned chars, 
//and then I'd only need to do: 
//buffer[0] -> //Pixel 1 - Red data 
//buffer[1] -> //Pixel 1 - Green data 
//buffer[2] -> //Pixel 1 - Blue data

So, any ideas? I think I could improve this by reading everything into an array in a single call, I just don't know how that's done.

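Also, is it possible to know how many images there are in the "index file"? Is it possible to know the number of lines in the file? (Since there's one file name per line.)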

Thanks!

Edit: this is how I measure the time.

#include <sys/time.h> 
#include <sys/resource.h> 
double get_time() 
{ 
    struct timeval t; 
    struct timezone tzp; 
    gettimeofday(&t, &tzp); 
    return t.tv_sec + t.tv_usec*1e-6; 
} 

double start = get_time(); 
//Everything to be measured here. 
double end = get_time(); 

cout << end-start << endl;

2011-02-12 asendra

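Hmm, I am one of the lecturers for this course and also an organizer of the programming contest this student is taking part in. Feel free to help him, but students are supposed to solve the contest on their own, or at most by reading the source of other programs, rather than by actively asking the community. In any case, just as I found this question, any other participant can too, and copying is strictly forbidden........ – 2011-03-06 10:11:10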

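You are allocating memory and deleting it on every loop iteration. If you care that much about performance, I don't think that's a good idea.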

So one improvement you can make is to allocate the memory once and reuse it for the rest of the program.

void *memory = malloc(sizeof(Image)); //reusable memory. 

//placement new to construct the object in the already allocated memory! 
img = new (memory) Image(fimagen); 

//... 

img->~Image(); //calling the destructor 

//when you're done free the memory 
free(memory); //use free, as we had used malloc when allocating!

Similarly, you can reuse memory inside the Image class, especially on this line:

datos = new Pixel[numpixels];

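Finally, reading the RGB values into local variables and then copying them into the image data is not very elegant, so a small improvement can be made here too: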

//this is yours : temporaries, and copying! 
fich_in >> r >> g >> b; 
datos[i] = Pixel(r, g ,b); 

//this is mine : no temporaries, no copying. directly reading into image data! 
fich_in >> datos[i].r >> datos[i].g >> datos[i].b;

Other than that, I don't think there is much room left to improve the performance of your code.

2011-02-12 15:28:47 Nawaz

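Hmm, I didn't know I could actually do that, I'm pretty new to all this. I'll try it and see if it helps. Also, I'm not that worried, it's just that I noticed this, and I've seen people read a whole file into a buffer, which I thought would improve things because you only need to access the disk once?? I don't know. – asendra 2011-02-12 15:48:45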

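Yep, @Nawaz, that was it! The temporaries and the copying through the RGB variables. Tested with a small set (30 images of 256x256), the time dropped from 0.23 to 0.098. Holy smokes xD – asendra 2011-02-12 16:22:47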
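
The whole-file buffer idea mentioned in the comments can be sketched as follows. This is only an illustration under assumptions: the helper names `slurp` and `parse_p3` are made up here, the file is assumed to start with the magic token and to contain no comment lines, and whether it is actually faster than `operator>>` has to be measured on the real data.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Reads a whole file into memory with a single stream traversal.
std::string slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Parses the text of a P3 file and returns the channel values in file
// order (r, g, b, r, g, b, ...). Returns an empty vector on malformed input.
std::vector<unsigned char> parse_p3(const std::string& text) {
    const char* p = text.c_str();  // null-terminated, so strtol stops safely
    char* end = nullptr;

    // Skip leading whitespace and the magic token ("P3").
    while (std::isspace(static_cast<unsigned char>(*p))) ++p;
    while (*p != '\0' && !std::isspace(static_cast<unsigned char>(*p))) ++p;

    long header[3] = {0, 0, 0};  // width, height, maxvalue
    for (long& v : header) {
        v = std::strtol(p, &end, 10);
        if (end == p) return {};  // no number found
        p = end;
    }
    if (header[0] <= 0 || header[1] <= 0) return {};

    const std::size_t count = static_cast<std::size_t>(header[0]) *
                              static_cast<std::size_t>(header[1]) * 3;
    std::vector<unsigned char> data;
    data.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const long v = std::strtol(p, &end, 10);
        if (end == p) return {};  // fewer values than the header promised
        data.push_back(static_cast<unsigned char>(v));
        p = end;
    }
    return data;
}
```

With this layout, `data[0]`, `data[1]` and `data[2]` are the red, green and blue values of the first pixel, which matches the buffer indexing described in the question. A typical call would be `parse_p3(slurp(file))`.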

It is not possible to count the number of lines in a text file without reading the whole file.

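For other optimizations, use the time command (if you are on Unix/Linux) to check whether your program uses a lot of "user" time (actual computation) compared with its "wall clock" time (the total time between the start and the end of the process). If it doesn't, you are probably waiting on the disk or the network.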

2011-02-12 15:37:50
