Python：找到所有可能的字符組合與字符序列（分詞）

我正在做一些分詞實驗，如下所示。Python：找到所有可能的字符組合與字符序列（分詞）

lst是一個字符序列，而output是所有可能的單詞。

lst = ['a', 'b', 'c', 'd'] 

def foo(lst): 
    ... 
    return output 

output = [['a', 'b', 'c', 'd'], 
      ['ab', 'c', 'd'], 
      ['a', 'bc', 'd'], 
      ['a', 'b', 'cd'], 
      ['ab', 'cd'], 
      ['abc', 'd'], 
      ['a', 'bcd'], 
      ['abcd']]

我在itertools庫檢查combinations和permutations，
也試過combinatorics。
但是，似乎我看着錯誤的一面，因爲這不是純粹的置換和組合...

看來我可以通過使用大量循環來實現這一點，但效率可能很低。所以像['ba', 'dc']或['cd', 'ab']組合是無效的

編輯

詞序是很重要的。

從左至右的順序應始終爲。

編輯

@斯圖爾特的解決方案並不在Python 2.7.6工作

編輯

@斯圖爾特的解決方案確實在Python 2.7.6工作，請參見下面的評論。

來源

2014-12-03 amigcamel

請參閱我的代碼在Python 2.7.3中使用[here]（http://ideone.com/ufVuEm）和Python 3.2.3中的[here]（http://ideone.com/N4y9t7） – Stuart 2014-12-03 10:33:23

itertools.product的確應該能夠幫助你。

這個想法是這樣的： - 考慮A1，A2，...，由板塊分隔的AN。將會有N-1塊平板。如果有板塊存在分段。如果沒有平板，則有一個連接。因此，對於給定的長度爲N的序列，應該有2 ^（N-1）個這樣的組合。

就像下面

import itertools 
lst = ['a', 'b', 'c', 'd'] 
combinatorics = itertools.product([True, False], repeat=len(lst) - 1) 

solution = [] 
for combination in combinatorics: 
    i = 0 
    one_such_combination = [lst[i]] 
    for slab in combination: 
     i += 1 
     if not slab: # there is a join 
      one_such_combination[-1] += lst[i] 
     else: 
      one_such_combination += [lst[i]] 
    solution.append(one_such_combination) 

print solution

來源

2014-12-03 04:23:24 Sriram

我選擇了這個答案，因爲花費最少的時間，但其他的解決方案也有很大的 – amigcamel 2014-12-03 15:02:26

有8個選項，每個鏡像二進制數0到7：

每個0和1代表是否該索引處的2個字母「膠合」在一起。 0表示否，1表示是。

>>> lst = ['a', 'b', 'c', 'd'] 
... output = [] 
... formatstr = "{{:0{}.0f}}".format(len(lst)-1) 
... for i in range(2**(len(lst)-1)): 
...  output.append([]) 
...  s = "{:b}".format(i) 
...  s = str(formatstr.format(float(s))) 
...  lstcopy = lst[:] 
...  for j, c in enumerate(s): 
...   if c == "1": 
...    lstcopy[j+1] = lstcopy[j] + lstcopy[j+1] 
...   else: 
...    output[-1].append(lstcopy[j]) 
...  output[-1].append(lstcopy[-1]) 
... output 
[['a', 'b', 'c', 'd'], 
['a', 'b', 'cd'], 
['a', 'bc', 'd'], 
['a', 'bcd'], 
['ab', 'c', 'd'], 
['ab', 'cd'], 
['abc', 'd'], 
['abcd']] 
>>>

來源

2014-12-03 04:15:05 twasbrillig

您的代碼isn不爲lst = ['a'，'b'，'c'，'d'，'e']工作。 – 2014-12-03 04:23:36

謝謝！修復它以適用於更多字母的情況。即使我沒有得到接受，這是一個有趣的練習:) – twasbrillig 2014-12-03 04:34:04

^哈哈！我贊成你的。我們都有相同的解決方案。你快了！ :-) – Sriram 2014-12-03 04:55:52

#!/usr/bin/env python 
from itertools import combinations 
a = ['a', 'b', 'c', 'd'] 
a = "".join(a) 
cuts = [] 
for i in range(0,len(a)): 
    cuts.extend(combinations(range(1,len(a)),i)) 
for i in cuts: 
    last = 0 
    output = [] 
    for j in i: 
     output.append(a[last:j]) 
     last = j 
    output.append(a[last:]) 
    print(output)

輸出：

zsh 2419 % ./words.py 
['abcd'] 
['a', 'bcd'] 
['ab', 'cd'] 
['abc', 'd'] 
['a', 'b', 'cd'] 
['a', 'bc', 'd'] 
['ab', 'c', 'd'] 
['a', 'b', 'c', 'd']

來源

2014-12-03 04:17:25

我不能upvote你。我在你的回答下寫下評論的理由。你的代碼我小那麼我的，這是很好的，但是，我的機器上的代碼不到風度很好地工作:( – 2014-12-03 04:55:16

這實際上是斯圖爾特的回答，不是我的，但Sririam upvoted礦山所以這一切都很好。 – twasbrillig 2014-12-03 05:06:14

您可以使用遞歸發生器：

def split_combinations(L): 
    for split in range(1, len(L)): 
     for combination in split_combinations(L[split:]): 
      yield [L[:split]] + combination 
    yield [L] 

print (list(split_combinations('abcd')))

編輯。我不確定這會如何擴展長字符串，以及它達到Python遞歸限制的程度。與其他一些答案類似，您也可以使用combinations從itertools開始處理每個可能的組合點。

這些似乎都按預期在Python 2.7（see here）和Python 3.2（here）工作。正如@twasbrillig所說，確保你縮進如圖所示。

來源

2014-12-03 04:26:06 Stuart

我沒有看到！[‘ABC’‘d’]在輸出 – 2014-12-03 04:28:35

我真的看到它 – twasbrillig 2014-12-03 04:37:22

'在文獻[1]：？DEF split_combinations（L）： ...：在範圍（，LEN（L）1）拆分： ...：用於組合在split_combinations（L [分裂：]）： ...：產率[L [：分割]] +組合 ...：收益率[L] ...： In [2]：print（list（split_combinatio ns（'abcd'））） [['a'，'b'，'cd']，['a'，'bcd']，['a'，'bcd']，['abcd']， ['ab'，'cd']，['abcd']，['abcd']]' – 2014-12-03 04:41:17

Python：找到所有可能的字符組合與字符序列（分詞）

回答

相關問題