如何快速散列非常大的子串而不會發生衝突？

我有一個應用程序，它的一部分可以找到輸入字符串的所有迴文子字符串。輸入字符串長度可以達到100,000，所以子字符串可以非常大。例如，一個應用程序的輸入導致超過300,000個子字符串迴文超過10,000個。該應用程序後來計算所有迴文的平等和計數的唯一的哈希使用標準哈希是在函數中找到迴文。哈希值被存儲在一個向量中，後來在應用程序中計入唯一性。這種輸入和輸出條件的問題是非常大的子串的散列需要太長的時間，並且會在散列中產生衝突。所以我想知道是否有算法（散列），可以快速和唯一地散列一個非常大的子字符串（最好通過索引範圍的速度子字符串，但與唯一性的準確性）。哈希是在函數get_palins的末尾完成的。代碼如下。如何快速散列非常大的子串而不會發生衝突？

#include <iostream> 
#include <string> 
#include <cstdlib> 
#include <time.h> 
#include <vector> 
#include <algorithm> 
#include <unordered_map> 
#include <map> 
#include <cstdio> 
#include <cmath> 
#include <ctgmath> 

using namespace std; 

#define MAX 100000 
#define mod 1000000007 

vector<long long> palins[MAX+5]; 

// Finds all palindromes for the string 
void get_palins(string &s) 
{ 
    int N = s.length(); 
    int i, j, k, // iterators 
    rp,  // length of 'palindrome radius' 
    R[2][N+1]; // table for storing results (2 rows for odd- and even-length palindromes 

    s = "@" + s + "#"; // insert 'guards' to iterate easily over s 

    for(j = 0; j <= 1; j++) 
    { 
     R[j][0] = rp = 0; i = 1; 

     while(i <= N) 
     { 
      while(s[i - rp - 1] == s[i + j + rp]) { rp++; } 
      R[j][i] = rp; 
      k = 1; 
      while((R[j][i - k] != rp - k) && (k < rp)) 
      { 
       R[j][i + k] = min(R[j][i - k],rp - k); 
       k++; 
      } 
      rp = max(rp - k,0); 
      i += k; 
     } 
    } 

    s = s.substr(1,N); // remove 'guards' 

    for(i = 1; i <= N; i++) 
    { 
     for(j = 0; j <= 1; j++) 
      for(rp = R[j][i]; rp > 0; rp--) 
      { 
       int begin = i - rp - 1; 
       int end_count = 2 * rp + j; 
       int end = begin + end_count - 1; 
       if (!(begin == 0 && end == N -1)) 
       { 
        string ss = s.substr(begin, end_count); 
        long long hsh = hash<string>{}(ss); 
        palins[begin].push_back(hsh); 

       } 
      } 
    } 
} 
unordered_map<long long, int> palin_counts; 
unordered_map<char, int> end_matches; 

// Solve when at least 1 character in string is different 
void solve_all_not_same(string &s) 
{ 
    int n = s.length(); 
    long long count = 0; 

    get_palins(s); 

    long long palin_count = 0; 

    // Gets all palindromes into unordered map 
    for (int i = 0; i <= n; i++) 
    { 
     for (auto& it : palins[i]) 
     { 
      if (palin_counts.find(it) == palin_counts.end()) 
      { 
       palin_counts.insert({it,1}); 
      } 
      else 
      { 
       palin_counts[it]++; 
      } 
     } 
    } 

    // From total palindromes, get proper border count 
    // minus end characters of substrings 
    for (auto it = palin_counts.begin(); it != palin_counts.end(); ++it) 
    { 
     int top = it->second - 1; 

     palin_count += (top * (top + 1))/2; 
     palin_count %= mod; 
    } 

    // Store string character counts in unordered map 
    for (int i = 0; i <= n; i++) 
    { 
     char c = s[i]; 

     //long long hsh = hash<char>{}(c); 

     if (end_matches[c] == 0) 
      end_matches[c] = 1; 
     else 
      end_matches[c]++; 

    } 

    // From substring end character matches, get proper border count 
    // for end characters of substrings 
    for (auto it = end_matches.begin(); it != end_matches.end(); it++) 
    { 
     int f = it->second - 1; 
     count += (f * (f + 1))/2; 
    } 

    cout << (count + palin_count) % mod << endl; 

    for (int i = 0; i < MAX+5; i++) 
     palins[i].clear(); 
} 

int main() 
{ 

    string s; 
    cin >> s; 

    solve_all_not_same(s); 

    return 0; 
}

來源

2016-11-02 te7

你確定它只是哈希，這是瓶頸嗎？僅僅通過掃描上面的代碼，我發現了很多低效的事情。例如：爲一個已經很大的字符串加上後綴和前綴加上大量額外的子字符串副本，我不確定，但如果使用一對值來指示字符串中的開始和結束位置，那麼可以肯定地避免。 – Arunmu

另外，'R [2] [N + 1]'不是標準的C++。它可能適合你的平臺... – Arunmu

http://stackoverflow.com/questions/98153/whats-the-best-hashing-algorithm-to-use-on-a-stl-string-when-using-哈希映射可能是一個可能的解決方案。此外，如果你可以添加一些智能（更新散列，我們在rabin-karp中做的），你可能會得到一個巨大的加速。 – Arunmu

面對的問題X（找到所有迴文個子串），你問（如何快速散子）解決Ÿ：The XY Problem。
對於迴文檢測，請考慮後綴數組（一個用於輸入的反轉或附加到輸入的數組）。
對於重疊字符串的快速哈希，請查看rolling hashes。

來源

2016-11-04 08:09:39 greybeard

遲來，我發現te7的評論包含一個HackerRank問題的鏈接：在約束條件中不相似，並且第一個示例輸出的解釋與該輸出的第一條語句相矛盾。 – greybeard

感謝您的回覆。我很感激。我使用Murmur3 hash 64x128解決了這個問題。我會研究這些鏈接。謝謝 – te7

對於字符串匹配，請嘗試_Rabin-Karp滾動hash_。 – greybeard

如何快速散列非常大的子串而不會發生衝突？

回答

相關問題