C++ Search Performance

Possible duplicate:
C++ Search Performance

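I have two text files. One contains a list of roughly 70,000 names (~1.5MB). The other contains text obtained from miscellaneous sources, so its contents change on every run (~0.5MB). Essentially, I want to be able to paste some text into that file and see which names from my list are found in it. It is a bit like the find function (Ctrl+F), but with 70,000 keywords.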

In any case, what I have so far is:

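#include <cstdio>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() 
{ 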

ifstream namesfile("names.txt"); //names list 
ifstream miscfile("misc.txt");  //misc text 
vector<string> vecnames;   //vector to hold names 
vector<string> vecmisc;   //vector to hold misc text 
size_t found; 

string s; 
string t; 

while (getline(namesfile,s))  
    vecnames.push_back(s); 

while (getline(miscfile,t))   
    vecmisc.push_back(t); 

//outer loop iterates through names list 
for (vector<string>::size_type i = 0; i != vecnames.size(); ++i) { 
    //inner loop iterates through the lines of the misc text file 
    for (vector<string>::size_type j = 0;j != vecmisc.size(); ++j) { 
     found=vecmisc[j].find(vecnames[i]); 
     if (found!=string::npos) { 
      cout << vecnames[i] << endl; 
      break; 
     } 
    } 
} 

cout << "SEARCH COMPLETE"; 

//to keep console application from exiting 
getchar(); 

return 0; 
}

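This works fine as far as extracting the data I need, but it is terribly slow and obviously inefficient: each name potentially requires searching the entire file again, which gives (75000 x number of lines in the misc text file) iterations. If anyone could help, I would certainly appreciate it. Some sample code is most welcome. Also, in case it makes any difference, I am using Dev C++.

Someone suggested that I implement a hash set over my data, but I have no idea how to go about that. If anyone understands how I could apply this approach, I would appreciate a pointer in the right direction. Sincere thanks.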

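Source

2011-05-22 Dom

Your code sample is missing the definitions of veccomp and vectenk, which you use in the code. – 2011-05-22 13:21:06

Fixed it for you – sehe 2011-05-22 14:03:55

Please do not repost the same question. If you want to add more information, edit the original. – finnw 2011-05-22 14:12:31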

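You are probably better off reading the big file into memory, sorting it with qsort(), then reading the second file line by line and looking up each entry of the second file with bsearch().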
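A minimal C++ sketch of this idea, using std::sort and std::binary_search in place of C's qsort() and bsearch(). The helper name findListedNames is hypothetical, and the sketch assumes each entry of the second input is a complete name on its own line, so names embedded in longer lines are not found.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: returns the entries of `lines` that appear verbatim in `names`.
// `names` is taken by value so the caller's list is left untouched by the sort.
std::vector<std::string> findListedNames(std::vector<std::string> names,
                                         const std::vector<std::string>& lines) {
    std::sort(names.begin(), names.end());  // one-time cost: O(m log m)
    std::vector<std::string> hits;
    for (const std::string& line : lines) {
        // Each lookup takes O(log m) comparisons instead of a scan over all m names.
        if (std::binary_search(names.begin(), names.end(), line))
            hits.push_back(line);
    }
    return hits;
}
```

With the files from the question, names would hold the lines of names.txt and lines the lines of misc.txt.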

Source

2011-05-22 13:43:23

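You can construct a trie from all the names and mark the nodes that are endpoints, so that you know when you have a match (or you can wait for a mismatch and emit the substring from the previous endpoint up to that point as a match). Then you match the input against the trie one char at a time, which should give O(n) performance.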

trieRoot = preprocessedListOfNames 

trieCursor = trieRoot 
for each character in text 
    if character in trieCursor.neighbors 
     trieCursor = trieCursor.neighbors[character] 
    else 
     if matchSize > 1 and trieCursor.isEndpoint 
      emit match 
     trieCursor = trieRoot

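If the list of names is relatively static, you could even preprocess it and store the result, so that you do not have to build it every time you want to run a search.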
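The pseudocode above could be fleshed out in C++ roughly as follows. This is a sketch under stated assumptions, not the answerer's implementation: TrieNode, insertName and findNames are hypothetical names, and the search restarts at every text position so that overlapping names are not missed. That makes the worst case O(n x longest name) rather than strictly O(n). Matching is by raw substring, so word boundaries are ignored.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <set>
#include <string>

// One trie node: children keyed by character, plus an end-of-name flag.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool isEndpoint = false;
};

// Inserts one name into the trie rooted at `root`.
void insertName(TrieNode& root, const std::string& name) {
    TrieNode* node = &root;
    for (char c : name) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child.reset(new TrieNode());
        node = child.get();
    }
    node->isEndpoint = true;
}

// Returns every name stored in the trie that occurs somewhere in `text`.
std::set<std::string> findNames(const TrieNode& root, const std::string& text) {
    std::set<std::string> found;
    for (std::size_t start = 0; start < text.size(); ++start) {
        const TrieNode* node = &root;
        for (std::size_t pos = start; pos < text.size(); ++pos) {
            auto it = node->children.find(text[pos]);
            if (it == node->children.end()) break;  // no name continues this way
            node = it->second.get();
            if (node->isEndpoint)
                found.insert(text.substr(start, pos - start + 1));
        }
    }
    return found;
}
```

Because the walk follows whole names character by character, a multi-word name is reported only when it appears in full, while a shorter name contained in it is reported as well.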

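Source

2011-05-22 13:43:31 Asgeir

Also look up radix trees. They are basically a specialized form of trie, although some code just uses the two terms interchangeably. – 2011-05-22 14:09:03

I would also like to recommend the [Aho-Corasick string matching algorithm](http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm), which is a specialization of this that falls back to the node for the suffix on a failed match, giving linear-time search. – Hasturkun 2011-05-22 15:44:34

Prefix, not suffix. (In case my comment did not make sense.) – Hasturkun 2011-05-22 15:59:31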
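For reference, the Aho-Corasick idea from these comments can be sketched in C++ as below. The class name NameMatcher and its members are hypothetical, and the code is an illustrative sketch rather than a tuned implementation. Each node's failure link points to the node for the longest proper suffix of the current match that is also a prefix of some name, which is how the text can be scanned in a single pass.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Minimal Aho-Corasick automaton (illustrative sketch, not tuned for speed).
class NameMatcher {
public:
    explicit NameMatcher(const std::vector<std::string>& names) : names_(names) {
        nodes_.push_back(Node());  // node 0 is the root
        for (std::size_t i = 0; i < names_.size(); ++i) addName(i);
        buildFailureLinks();
    }

    // Returns every name that occurs as a substring of `text`, in one pass.
    std::set<std::string> search(const std::string& text) const {
        std::set<std::string> found;
        std::size_t state = 0;
        for (char c : text) {
            state = step(state, c);
            for (std::size_t id : nodes_[state].matches) found.insert(names_[id]);
        }
        return found;
    }

private:
    struct Node {
        std::map<char, std::size_t> next;  // trie edges
        std::size_t fail = 0;              // failure link (root by default)
        std::vector<std::size_t> matches;  // indices of names ending here
    };

    void addName(std::size_t id) {
        if (names_[id].empty()) return;  // an empty name can never match
        std::size_t state = 0;
        for (char c : names_[id]) {
            auto it = nodes_[state].next.find(c);
            if (it != nodes_[state].next.end()) {
                state = it->second;
            } else {
                nodes_.push_back(Node());
                std::size_t created = nodes_.size() - 1;
                nodes_[state].next[c] = created;
                state = created;
            }
        }
        nodes_[state].matches.push_back(id);
    }

    // Breadth-first pass: a child's failure link is derived from its parent's.
    void buildFailureLinks() {
        std::queue<std::size_t> pending;
        for (const auto& edge : nodes_[0].next) pending.push(edge.second);
        while (!pending.empty()) {
            std::size_t state = pending.front();
            pending.pop();
            for (const auto& edge : nodes_[state].next) {
                std::size_t child = edge.second;
                std::size_t target = step(nodes_[state].fail, edge.first);
                nodes_[child].fail = target;
                // Inherit the shorter names that end at the failure target.
                nodes_[child].matches.insert(nodes_[child].matches.end(),
                                             nodes_[target].matches.begin(),
                                             nodes_[target].matches.end());
                pending.push(child);
            }
        }
    }

    // Follows failure links until an edge labelled `c` exists or the root is reached.
    std::size_t step(std::size_t state, char c) const {
        for (;;) {
            auto it = nodes_[state].next.find(c);
            if (it != nodes_[state].next.end()) return it->second;
            if (state == 0) return 0;
            state = nodes_[state].fail;
        }
    }

    std::vector<std::string> names_;
    std::vector<Node> nodes_;
};
```

The automaton is built once from the name list and can then be reused for every new text, which suits a list that rarely changes.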

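Change vecnames from a vector to a set, and change the calls to push_back into calls to insert. Then, instead of looping over it, just loop over vecmisc and call vecnames.find(...) to check whether each input is one of the names. This turns your O(n m) system into O(n log m). You could also use a hash_set and achieve O(n) (which may or may not be faster in practice).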
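A short sketch of the hash-based variant, assuming a C++11 compiler so that std::unordered_set can stand in for the non-standard hash_set. The helper name matchExact is hypothetical. As with std::set, only inputs that are exactly equal to a stored name are found.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the entries of `inputs` that are exactly equal to a known name.
std::vector<std::string> matchExact(const std::unordered_set<std::string>& names,
                                    const std::vector<std::string>& inputs) {
    std::vector<std::string> hits;
    for (const std::string& input : inputs) {
        // Hash lookup: O(1) on average, compared with O(log m) for std::set.
        if (names.find(input) != names.end())
            hits.push_back(input);
    }
    return hits;
}
```

The set itself can be filled with names.insert(s) inside the existing getline loop, in place of vecnames.push_back(s).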

Source

2011-05-22 13:44:20

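You can use the STL's map, an associative array data structure. A map does not necessarily store data in a linear fashion, so a lookup generally takes less than linear time, i.e. less than O(n); for std::map it is O(log n).

For your case you can use a map of type map<string,bool>. Example usage: http://www.cprogramming.com/tutorial/stl/stlmap.html

Replace vector<string> vecmisc; with map<string,bool> vecmisc;.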

for (vector<string>::size_type i = 0; i != vecnames.size(); ++i) { 
    // No inner loop needed: find() looks up an exact key 
    // and returns end() when the key is absent 
    if (vecmisc.find(vecnames[i]) != vecmisc.end()) { 
        cout << vecnames[i] << endl; 
    } 
}

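Source

2011-05-22 13:45:55 Abhijith

"A map does not necessarily store data in a linear fashion"? A map necessarily does not store data in a linear fashion; put that way, it is correct. – leftaroundabout 2011-05-22 18:29:35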

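Some slight improvements.

Benchmark before: ~~7 minutes 56 seconds and counting (will update)~~ Update: it finally completed in 15m25s.

Benchmark after: about 0.3 seconds (see the updated figures below), a performance improvement of roughly 3000x.

Code: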

#include <set>
#include <string>
#include <iostream>
#include <iterator>
#include <fstream>

template <class It> It readInto(std::istream& is, It OI);

std::set<std::string> readnames(const std::string& filename)
{
    std::set<std::string> result;
    std::ifstream namesfile(filename.c_str()); // names list
    readInto(namesfile, std::inserter(result, result.end()));
    return result;
}

int main()
{
    std::set<std::string> vecnames = readnames("names.txt");

    // loop over the lines of the misc text file
    std::ifstream miscfile("misc.txt"); // misc text
    std::string line;
    while (std::getline(miscfile, line))
        if (vecnames.end() != vecnames.find(line))
            std::cout << line << std::endl;

    return 0;
}

// helper to read linewise into output iterator
template <class It> It readInto(std::istream& is, It OI)
{
    std::string line;
    while (std::getline(is, line))
    {
        if (line.size() > 0) // TODO you may want to trim/normalize these
            *OI++ = line;
    }
    return OI;
}

Data:

$ cp /etc/dictionaries-common/words names.txt
$ wc names.txt misc.txt
  98569  98568  931708 names.txt
 166634 529910 4283592 misc.txt

This results in 151486 lines of output (containing 3968 unique values, when checked :)

$ ./t2 | wc -l
859
$ ./t2 | sort -u | wc -l
2

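Because this is a rather skewed benchmark, I benchmarked the other extreme as well:

$ cp names.txt misc.txt
$ time ./t | wc -l
98568

real 0m0.365s
user 0m0.372s
sys 0m0.228s

Tested with g++ 4.4.5 on i386 (32-bit) with optimizations enabled, with standard output redirected to /dev/null and the getchar() call at the end removed.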

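Source

2011-05-22 14:42:49 sehe

First of all, I sincerely appreciate your help. Unfortunately, names are sometimes more than one word (e.g. John Paul Jr. III). That is, if that entire name occurs within the misc. text, it should be sent to standard output. With the implementation above, however, every John, Paul, Jr. and III that appears anywhere in the misc. text would be sent to the output as well. Is there any way to tweak this so that it works for that case? Thanks again. – Dom 2011-05-22 15:27:20

Simple. Go back to reading the lines in a loop. **Just do not store them in a vector for no reason**. I will leave that as an exercise, as it is dinner time :) – sehe 2011-05-22 15:31:37

Cool. I will give that a try shortly. Thanks. – Dom 2011-05-22 15:47:34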
