2013-03-19 33 views
1

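How to split a string and match its substrings to a list of substrings? - Python

I need to split a string in all possible ways without changing the order of the characters. I understand that this task can be seen as tokenization or lemmatization in NLP, but I am attempting it from a pure string-search perspective, which is simpler and more robust. Given: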

dictionary = ['train','station', 'fire', 'a','trainer','in'] 
str1 = "firetrainstation" 

Task 1: How do I generate all possible substring splits, so that I get:

all_possible_substrings = [['f','iretrainstation'],
['fi','retrainstation'], ...
['firetrainstatio','n'],
['f','i','retrainstation'], ... , ...
['fire','train','station'], ... , ...
['fire','tr','a','instation'], ... , ...
['fire','tr','a','in','station'], ... , ...
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']]

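Task 2: Then, from all_possible_substrings, how can I check through the candidates and decide that the split made up entirely of elements from the dictionary is the correct output? The desired output is a list of substrings from the dictionary that matches the largest number of characters from left to right. The desired output is the one where:

"".join(desire_substring_list) == str1 and \
len([i for i in desire_substring_list if i in dictionary]) == len(desire_substring_list)
# (let's assume the above condition can be true for any input string, since my English
# language dictionary is very big and all my strings are human language
# just written without spaces)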

Desired output:

'fire','train','station' 

What have I done?

For task 1, I have tried this, but I know it will not give me all possible whitespace insertions:

all_possible_substrings.append(" ".join(str1)) 

I have also tried this, but it only handles task 2:

import re

seed = ['train', 'station', 'fire', 'a', 'trainer', 'in']
str1 = "firetrainstation"
all_possible_string = [['f','iretrainstation'],
['fi','retrainstation'],
['firetrainstatio','n'],
['f','i','retrainstation'],
['fire','train','station'],
['fire','tr','a','instation'],
['fire','tr','a','in','station'],
['f','i','r','e','t','r','a','i','n','s','t','a','t','i','o','n']]
# Match any dictionary word that appears as a whole, space-delimited token
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
highest_match = ""
for candidate in all_possible_string:
    x = pattern.findall(" ".join(candidate))
    # Accept the split only if its dictionary words rebuild the whole string
    if "".join(x) == str1 and len([w for w in x if w in seed]) == len(x):
        print(" ".join(x))
+0

Note that your dictionary is actually a `list`. – mgilson 2013-03-19 01:24:19

+0

Also, I'm pretty sure you need to explain more. Why is 'foo', 'bar', 'bar', 'str' the desired output? – mgilson 2013-03-19 01:25:39

+0

Updated the desired output. – alvas 2013-03-19 01:35:28

Answers

3

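For the first part, you could write a recursive generator like this:

>>> def all_substr(string):
        for i in range(len(string)):
            # Once the prefix covers the whole string, yield it uncut
            if i == len(string) - 1:
                yield string

            first_part = string[0:i+1]
            second_part = string[i+1:]

            # Prepend the current prefix to every split of the remainder
            for j in all_substr(second_part):
                yield ','.join([first_part, j])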


>>> for x in all_substr('apple'): 
    print(x) 


a,p,p,l,e 
a,p,p,le 
a,p,pl,e 
a,p,ple 
a,pp,l,e 
a,pp,le 
a,ppl,e 
a,pple 
ap,p,l,e 
ap,p,le 
ap,pl,e 
ap,ple 
app,l,e 
app,le 
appl,e 
apple 
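
For the second part, one possible approach (a minimal sketch, not part of the answer above) is to keep each split as a list rather than a comma-joined string and then keep only the splits whose pieces are all dictionary words. The function names `all_splits` and `dictionary_splits` are hypothetical, and turning the word list into a `set` is an assumption made for faster membership tests. The second function recurses only on prefixes found in the dictionary, so it avoids enumerating all 2 ** (n - 1) splits of an n-character string:

```python
def all_splits(s):
    """Yield every way to cut s into contiguous, non-empty pieces, as lists."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        for rest in all_splits(s[i:]):
            yield [s[:i]] + rest


def dictionary_splits(s, words):
    """Yield only the splits of s whose pieces are all members of words."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        head = s[:i]
        # Recurse into the remainder only when the prefix is a known word
        if head in words:
            for rest in dictionary_splits(s[i:], words):
                yield [head] + rest


# Assumption: the question's word list, stored as a set
words = {'train', 'station', 'fire', 'a', 'trainer', 'in'}
str1 = "firetrainstation"

# Task 2 by filtering the output of task 1
filtered = [p for p in all_splits(str1) if all(w in words for w in p)]
# Task 2 directly, following only dictionary prefixes
direct = list(dictionary_splits(str1, words))

print(filtered)  # [['fire', 'train', 'station']]
print(direct)    # [['fire', 'train', 'station']]
```

If several splits survive, some tie-breaking rule (for example, preferring the split with the fewest pieces) would still be needed; the question does not specify one.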