從字符串中生成子串的組合

我試圖爲給定的單詞生成所有可能的音節組合。識別什麼是音節的過程在這裏是不相關的，但它是產生所有組合的問題。我認爲這可能是以我認爲的幾行的遞歸方式進行遞歸（儘管其他方式都可以），但我無法正常工作。誰能幫忙？從字符串中生成子串的組合

// how to test a syllable, just for the purpose of this example 
    bool IsSyllable(string possibleSyllable) 
    { 
     return Regex.IsMatch(possibleSyllable, "^(mis|und|un|der|er|stand)$"); 
    } 

    List<string> BreakIntoSyllables(string word) 
    { 
     // the code here is what I'm trying to write 
     // if 'word' is "misunderstand" , I'd like this to return 
     // => {"mis","und","er","stand"},{ "mis","un","der","stand"} 
     // and for any other combinations to be not included 
    }

來源

2011-10-06 mikel

您應該使用Trie爲您的音節編目。或者你可以使用naiver解決方案:-) – xanatos

開始嘗試用這樣的：

var word = "misunderstand"; 

Func<string, bool> isSyllable = 
    t => Regex.IsMatch(t, "^(mis|und|un|der|er|stand)$"); 

var query = 
    from i in Enumerable.Range(0, word.Length) 
    from l in Enumerable.Range(1, word.Length - i) 
    let part = word.Substring(i, l) 
    where isSyllable(part) 
    select part;

這將返回：

misunderstand-results

這是否有助於開頭至少？

編輯：我想遠一點關於這一問題的與這對夫妻的查詢上來：

Func<string, IEnumerable<string[]>> splitter = null; 
splitter = 
    t => 
     from n in Enumerable.Range(1, t.Length - 1) 
     let s = t.Substring(0, n) 
     let e = t.Substring(n) 
     from g in (new [] { new [] { e } }).Concat(splitter(e)) 
     select (new [] { s }).Concat(g).ToArray(); 

var query = 
    from split in (new [] { new [] { word } }).Concat(splitter(word)) 
    where split.All(part => isSyllable(part)) 
    select split;

現在query回報這樣的：

misunderstanding-results2

讓我知道如果那是現在釘了它。

來源

2011-10-06 13:27:49 Enigmativity

謝謝，這看起來很有希望，但它目前給我一個編譯錯誤 - 「'System.Collections.Generic.IEnumerable '不包含'StartWith'的定義，也沒有擴展方法'從...開始'」。 – mikel

@mikel - 啊，對不起。我使用了Reactive Framework擴展方法。當我回到我的電腦時，我會解決它。 – Enigmativity

@mikel - 我已將查詢更改爲使用標準運算符。 – Enigmativity

通常這種類型的問題是使用Tries解決。我將基於我的實施Trie How to create a trie in c#（但請注意，我已將其重寫）。

var trie = new Trie(new[] { "un", "que", "stio", "na", "ble", "qu", "es", "ti", "onable", "o", "nable" }); 
//var trie = new Trie(new[] { "u", "n", "q", "u", "e", "s", "t", "i", "o", "n", "a", "b", "l", "e", "un", "qu", "es", "ti", "on", "ab", "le", "nq", "ue", "st", "io", "na", "bl", "unq", "ues", "tio", "nab", "nqu", "est", "ion", "abl", "que", "stio", "nab" }); 

var word = "unquestionable"; 

var parts = new List<List<string>>(); 

Split(word, 0, trie, trie.Root, new List<string>(), parts); 

// 

public static void Split(string word, int index, Trie trie, TrieNode node, List<string> currentParts, List<List<string>> parts) 
{ 
    // Found a syllable. We have to split: one way we take that syllable and continue from it (and it's done in this if). 
    // Another way we ignore this possible syllable and we continue searching for a longer word (done after the if) 
    if (node.IsTerminal) 
    { 
     // Add the syllable to the current list of syllables 
     currentParts.Add(node.Word); 

     // "covered" the word with syllables 
     if (index == word.Length) 
     { 
      // Here we make a copy of the parts of the word. This because the currentParts list is a "working" list and is modified every time. 
      parts.Add(new List<string>(currentParts)); 
     } 
     else 
     { 
      // There are remaining letters in the word. We restart the scan for more syllables, restarting from the root. 
      Split(word, index, trie, trie.Root, currentParts, parts); 
     } 

     // Remove the syllable from the current list of syllables 
     currentParts.RemoveAt(currentParts.Count - 1); 
    } 

    // We have covered all the word with letters. No more work to do in this subiteration 
    if (index == word.Length) 
    { 
     return; 
    } 

    // Here we try to find the edge corresponding to the current character 

    TrieNode nextNode; 

    if (!node.Edges.TryGetValue(word[index], out nextNode)) 
    { 
     return; 
    } 

    Split(word, index + 1, trie, nextNode, currentParts, parts); 
} 

public class Trie 
{ 
    public readonly TrieNode Root = new TrieNode(); 

    public Trie() 
    { 
    } 

    public Trie(IEnumerable<string> words) 
    { 
     this.AddRange(words); 
    } 

    public void Add(string word) 
    { 
     var currentNode = this.Root; 

     foreach (char ch in word) 
     { 
      TrieNode nextNode; 

      if (!currentNode.Edges.TryGetValue(ch, out nextNode)) 
      { 
       nextNode = new TrieNode(); 
       currentNode.Edges[ch] = nextNode; 
      } 

      currentNode = nextNode; 
     } 

     currentNode.Word = word; 
    } 

    public void AddRange(IEnumerable<string> words) 
    { 
     foreach (var word in words) 
     { 
      this.Add(word); 
     } 
    } 
} 

public class TrieNode 
{ 
    public readonly Dictionary<char, TrieNode> Edges = new Dictionary<char, TrieNode>(); 
    public string Word { get; set; } 

    public bool IsTerminal 
    { 
     get 
     { 
      return this.Word != null; 
     } 
    } 
}

word是你有興趣，parts將包含可能的音節的名單列表（它可能會更正確，使之成爲List<string[]>字符串，但它是很容易做到這一點。而不是parts.Add(new List<string>(currentParts));寫parts.Add(currentParts.ToArray());和更改所有的List<List<string>>到List<string[]>。

我會添加Enigmativity響應的變體全髖關節置換是theretically快於他的，因爲它摒棄錯誤的音節立即代替濾波後後他們。如果你喜歡它，你應該給他+1，因爲沒有他的想法，這種差異t是不可能的。但請注意，它仍然是一個黑客。「正確」的解決方案是使用特里（S）:-)

Func<string, bool> isSyllable = t => Regex.IsMatch(t, "^(un|que|stio|na|ble|qu|es|ti|onable|o|nable)$"); 

Func<string, IEnumerable<string[]>> splitter = null; 
splitter = 
    t => 
     (
     from n in Enumerable.Range(1, t.Length - 1) 
     let s = t.Substring(0, n) 
     where isSyllable(s) 
     let e = t.Substring(n) 
     let f = splitter(e) 
     from g in f 
     select (new[] { s }).Concat(g).ToArray() 
     ) 
     .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]); 

var parts = splitter(word).ToList();

的解釋：

 from n in Enumerable.Range(1, t.Length - 1) 
     let s = t.Substring(0, n) 
     where isSyllable(s)

我們計算一個字的所有可能的音節，從長度爲1至的長度字 - 1並檢查它是否是音節。我們直接刪除了非音節。整個單詞作爲一個音節將在稍後檢查。

 let e = t.Substring(n) 
     let f = splitter(e)

我們搜索的字符串

 from g in f 
     select (new[] { s }).Concat(g).ToArray()

的剩餘部分的音節和我們鏈「當前」音節找到音節。請注意，我們正在創建許多無用的數組。如果我們接受IEnumerable<IEnumerable<string>>作爲結果，我們可以拿走ToArray。

（我們可以多排一起改寫，刪除許多let，像

 from g in splitter(t.Substring(n)) 
     select (new[] { s }).Concat(g).ToArray()

但我們不會爲清楚起見做）

我們串接「當前」的音節與音節發現。

 .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]);

在這裏，我們可以重建查詢，以便於一點不使用此Concat和創建空數組，但是這將是一個有點複雜（我們可以把整個lambda函數作爲isSyllable(t) ? new[] { new string[] { t } }.Concat(oldLambdaFunction) : oldLambdaFunction）

在最後，如果整個單詞是一個音節，我們將整個單詞作爲一個音節添加。否則，我們Concat一個空數組（所以沒有Concat）

來源

2011-10-06 13:26:29 xanatos

你可能會遇到擴展這個問題，老實說，我不確定你的數據集有多大，但是基於一個簡單的「這是一個音節嗎？」的解決方案。您需要爲每個單詞調用大約0（n * n）的「音節檢測」例程，其中n =單詞中的字符數（如果這沒有意義，這意味着對於大型數據集它可能會很慢！）。這不會考慮檢測算法的可擴展性，當您添加更多音節時，檢測算法的可擴展性也可能會變慢。。

我知道你已經說過你識別什麼是音節或不是相關的過程，但是讓我們說你可以改變它使它更像自動完成，即通過音節的開頭並讓它告訴你所有從這一點來說，可能的音節將更具可擴展性。如果性能失控，請用trie替換它。

來源

2011-10-06 13:53:39

從字符串中生成子串的組合

回答

相關問題