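Checking for duplicates in a large vector of strings

I'm trying to find duplicate instances of strings in a vector of ~2.5 million strings.

At the moment I'm using something like: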

std::vector<std::string> concatVec; // Holds all of the concatenated strings containing columns C,D,E,J and U. 
std::vector<std::string> dupecheckVec; // Holds all of the unique instances of concatenated columns 
std::vector<unsigned int> linenoVec; // Holds the line numbers of the unique instances only 

// Copy first element across, it cannot be a duplicate yet 
dupecheckVec.push_back(concatVec[0]); 
linenoVec.push_back(0); 

// Copy across and do the dupecheck 
for (unsigned int i = 1; i < concatVec.size(); i++) 
{ 
    bool exists = false; 

    for (unsigned int x = 0; x < dupecheckVec.size(); x++) 
    { 
     if (concatVec[i] == dupecheckVec[x]) 
     { 
      exists = true; 
     } 
    } 

    if (exists == false) 
    { 
     dupecheckVec.push_back(concatVec[i]); 
     linenoVec.push_back(i); 
    } 
    else 
    { 
     exists = false; 
    } 
}

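This works fine for small files, but it obviously ends up taking a very long time as the file size grows, because of the nested for loop and the ever-increasing number of strings held in dupecheckVec.

What would be a less horrific way to do this for a large file?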

2011-03-30 rbj

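Have you thought about using the 'unique' algorithm together with 'erase'? – lrm29 2011-03-30 13:50:27

@lrm29: unique requires the vector to be sorted, which may or may not be a problem here. – 2011-03-30 14:10:49

That's why I didn't post it as an answer. Using the algorithms may simply not have occurred to the OP. – lrm29 2011-03-30 14:12:22

If you don't mind reordering the vector, then this should do it in O(n*log(n)) time: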

std::sort(vector.begin(), vector.end()); 
vector.erase(std::unique(vector.begin(), vector.end()), vector.end());
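
For anyone who wants to try this in isolation, here is a minimal self-contained sketch of the same sort/unique/erase idiom. The function name `sortedUnique` is made up for illustration, and taking the vector by value is just one possible choice so that the caller's data stays untouched.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: returns the distinct strings of `input` in sorted order.
// The vector is taken by value, so the caller's copy is not reordered.
std::vector<std::string> sortedUnique(std::vector<std::string> input)
{
    // Sorting puts equal strings next to each other: O(n*log(n)) comparisons.
    std::sort(input.begin(), input.end());

    // std::unique keeps the first element of each run of equal neighbours and
    // returns an iterator to the new logical end; erase then drops the tail.
    input.erase(std::unique(input.begin(), input.end()), input.end());
    return input;
}
```

Note that the original positions are lost here, so this only fits if the line numbers are not needed.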

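To preserve the order, you could instead use a vector of (line number, string*) pairs: sort by string, uniquify using a comparator that compares the string contents, and finally sort by line number, along the lines of: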

struct pair {int line; std::string const * string;}; 

struct OrderByLine { 
    bool operator()(pair const & x, pair const & y) { 
     return x.line < y.line; 
    } 
}; 

struct OrderByString { 
    bool operator()(pair const & x, pair const & y) { 
     return *x.string < *y.string; 
    } 
}; 

struct StringEquals { 
    bool operator()(pair const & x, pair const & y) { 
     return *x.string == *y.string; 
    } 
}; 

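// stable_sort keeps equal strings in their incoming order, so if the vector
// starts out in line order, unique retains the first occurrence of each string
std::stable_sort(vector.begin(), vector.end(), OrderByString()); 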
vector.erase(std::unique(vector.begin(), vector.end(), StringEquals()), vector.end()); 
std::sort(vector.begin(), vector.end(), OrderByLine());
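
Put together, the three steps might look roughly like the sketch below. The names `LineRef` and `firstOccurrences` are hypothetical, and the lambdas assume a C++11 compiler. Because `std::sort` is not guaranteed to be stable, this version breaks ties by line number so that the first occurrence of each string is the one that survives `std::unique`.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper type: a line number plus a pointer to that line's string.
struct LineRef {
    std::size_t line;
    const std::string* text;
};

// Returns the line numbers of the first occurrence of each distinct string,
// in ascending order. `lines` must outlive the call, since only pointers into
// it are sorted, not the strings themselves.
std::vector<std::size_t> firstOccurrences(const std::vector<std::string>& lines)
{
    std::vector<LineRef> refs;
    refs.reserve(lines.size());
    for (std::size_t i = 0; i < lines.size(); ++i) {
        LineRef r = { i, &lines[i] };
        refs.push_back(r);
    }

    // 1. Sort by string contents; ties are ordered by line number so the
    //    earliest line leads each run of equal strings.
    std::sort(refs.begin(), refs.end(), [](const LineRef& a, const LineRef& b) {
        const int c = a.text->compare(*b.text);
        return c != 0 ? c < 0 : a.line < b.line;
    });

    // 2. Drop every entry whose string equals that of its predecessor.
    refs.erase(std::unique(refs.begin(), refs.end(),
                           [](const LineRef& a, const LineRef& b) {
                               return *a.text == *b.text;
                           }),
               refs.end());

    // 3. Restore the original order by sorting on line number.
    std::sort(refs.begin(), refs.end(), [](const LineRef& a, const LineRef& b) {
        return a.line < b.line;
    });

    std::vector<std::size_t> result;
    result.reserve(refs.size());
    for (const LineRef& r : refs) {
        result.push_back(r.line);
    }
    return result;
}
```

Called on something like `concatVec` from the question, the result would play the role of `linenoVec`, and the unique strings can be read back through those indices.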

2011-03-30 13:57:49

Thanks very much for the help and the code examples! – rbj 2011-03-30 14:10:38