算法從字符串列表中找到相同的子字符串

我在這裏有點迷路，歡迎一些幫助。這個想法是從一個字符串列表中找到一個匹配的子字符串。它不一定是完美的。讓我們用一個例子來說明這一點：算法從字符串列表中找到相同的子字符串

「COUNTRY_NAME」「COUNTRY_ID」「Rankcountry」「ThisWillNotMatch」

將返回「國」

它要的東西有效，爲「畜生武力「阿爾戈看起來有點可怕。

來源

2015-01-14 ic3

什麼是「不完美」的比賽平均值（即，什麼是比賽的標準）？ – kraskevich

這是一個盡力而爲的策略，這個想法提出了最終用戶可能會改變的「某些事情」。 – ic3

你是否至少從一些代碼開始？你可以從一個「蠻力」算法開始，這個算法對於不那麼大的字符串來說可以是有效的，並在稍後轉向更高效的字符串。 –

靈感來自Dr. Dobb's的Jon Bentley's "Algorithm Alley"專欄。

構建每個後綴的索引。對索引進行排序將常見的子字符串結合在一起。對相鄰的子串進行排序後的索引，您可以輕鬆找到最長的子串（或最常見的子串）。

#include <algorithm> 
#include <cstddef> 
#include <iostream> 
#include <string> 
#include <vector> 

std::size_t LengthInCommon(const char *left, const char *right) { 
    std::size_t length_of_match = 0; 
    while (*left == *right && *left != '\0') { 
    ++length_of_match; 
    ++left; 
    ++right; 
    } 
    return length_of_match; 
} 

std::string FindLongestMatchingSubstring(const std::vector<std::string> &strings) { 
    // Build an index with a pointer to each possible suffix in the array. O(n) 
    std::vector<const char *> index; 
    for (const auto &s : strings) { 
    for (const auto &suffix : s) { 
     index.push_back(&suffix); 
    } 
    } 

    // Sort the index using the underlying substrings. O(n log_2 n) 
    std::sort(index.begin(), index.end(), [](const char *left, const char *right) { 
    return std::strcmp(left, right) < 0; 
    }); 

    // Common strings will now be adjacent to each other in the index. 
    // Walk the index to find the longest matching substring. 
    // O(n * m) where m is average matching length of two adjacent strings. 
    std::size_t length_of_longest_match = 0; 
    std::string match; 
    for (std::size_t i = 1; i < index.size(); ++i) { 
    const char *left = index[i - 1]; 
    const char *right = index[i]; 
    std::size_t length_of_match = LengthInCommon(left, right); 
    if (length_of_longest_match < length_of_match) { 
     length_of_longest_match = length_of_match; 
     match.assign(index[i], index[i] + length_of_longest_match); 
    } 
    } 

    return match; 
} 

int main() { 
    std::vector<std::string> strings; 
    strings.push_back("Country_Name"); 
    strings.push_back("Country_id"); 
    strings.push_back("RankCountry"); 
    strings.push_back("ThisWillNotMatch"); 
    std::cout << FindLongestMatchingSubstring(strings) << std::endl; 
    return 0; 
}

打印：

Country_

來源

2015-01-14 19:17:50

我還是不明白爲什麼「c」不能成爲答案。我想你喜歡更長的字符串。需要直接獲得您的優化功能！

無論如何，你可以用Tries來解決這個問題。爲每個字符串創建一個Trie。爲每個節點計數1。並通過總結計數來合併所有嘗試。這樣你可以得到所有的子串和它們的計數。現在，使用你的優化函數來選擇最優的函數。

來源

2015-01-14 16:58:31 ElKamina

在我的回答中，你稱之爲「嘗試」，我稱之爲「步伐」。我會被槍殺嗎？ :) – BitTickler

不確定它是否「有效」或被認爲是蠻力......我把它留給其他人來判斷。

輸入=字符串列表
對於每個s在輸入做：computeAllStrides（串 - >字符串列表）（見下面的代碼）
創建空的，可變的字典，字符串類型，的值的關鍵類型int
所有步幅=步驟2中字符串列表的列表 - >更新字典更新詞典（跨步）當字典中存在跨步時 - >增加相應條目的值更新跨步不存在時的詞典（跨步）在詞典 - >添加（步幅，1）到詞典
查找得到stride.Length *頻率的最大值的字典條目
找到最大報告值。

如果不區分大小寫，請首先對每個輸入字符串執行toLowercase操作。

open System.Collections.Generic 

let input = ["Country_Name"; "Country_id"; "RankCountry"; "ThisWillNotMatch"; ] 

let rec getAllStrides text = 
    let length = String.length text 
    match length with 
    | 0 -> [] 
    | 1 -> [text] 
    | _ -> [ for i = 1 to length do yield text.Substring(0, i) ] @ getAllStrides (text.Substring(1)) 


type HashTable = System.Collections.Generic.Dictionary<string,int> 

let update (ht : HashTable) strides = 
    List.iter (fun s -> 
      if ht.ContainsKey(s) then ht.[s] <- ht.[s] + 1 else ht.Add(s, 1) 
      ) strides 

let computeStrideFrequencies input = 
    let ht = new HashTable() 
    input |> List.iter (fun i -> update ht (getAllStrides i)) 
    ht 


let solve input = 
    let theBest = input |> computeStrideFrequencies |> Seq.maxBy (fun (KeyValue(k,v)) -> k.Length * v) 
    theBest.Key

解決輸入;; val it：string =「國家」

來源

2015-01-14 17:02:51 BitTickler

算法從字符串列表中找到相同的子字符串

回答

相關問題