Tokenizing a string gives some words merged

I am using the following code to tokenize a string read from stdin.

d=[] 
cur = '' 
for i in sys.stdin.readline(): 
    if i in ' .': 
     if cur not in d and (cur != ''): 
      d.append(cur) 
      cur = '' 
    else: 
     cur = cur + i.lower()

This gives me an array of words without repetitions. However, in my output, some words don't get split.

My input is

Dan went to the north pole to lead an expedition during summer.

and the output array d is

['dan', 'went', 'to', 'the', 'north', 'pole', 'tolead', 'an', 'expedition', 'during', 'summer']

Why is tolead together?


2013-07-29 rjv

Try this

import sys

d = []
cur = ''
for i in sys.stdin.readline():
    if i in ' .':
        if cur not in d and (cur != ''):
            d.append(cur)
        cur = ''  # note the different indentation: reset at every separator
    else:
        cur = cur + i.lower()


2013-07-29 17:50:20 tohava

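Try this:

import sys

for line in sys.stdin:  # iterate over lines, not over the characters of one line
    # strip the trailing newline, then drop the final period with [:-1]
    res = set(word.lower() for word in line.strip()[:-1].split(" "))
    print res

Example: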

line = "Dan went to the north pole to lead an expedition during summer." 
res = set(word.lower() for word in line[:-1].split(" ")) 
print res 

set(['north', 'lead', 'expedition', 'dan', 'an', 'to', 'pole', 'during', 'went', 'summer', 'the'])

Edit after the comments: this solution preserves the input order and filters out the separators.

import re 
from collections import OrderedDict 
line = "Dan went to the north pole to lead an expedition during summer." 
list(OrderedDict.fromkeys(re.findall(r"[\w']+", line))) 
# ['Dan', 'went', 'to', 'the', 'north', 'pole', 'lead', 'an', 'expedition', 'during', 'summer']


2013-07-29 17:55:33

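You should probably split on '.' as well, just to make sure the OP's problem is really solved.

Done with line[:-1] :)

Hmm, not quite, because you might have multiple sentences. Just because it works on the OP's example doesn't mean it works in the wild.

"to" is already in d. So when your loop reaches the space between the second "to" and "lead", it neither appends cur nor resets it, and it keeps concatenating. Once it reaches the next space, it sees that "tolead" is not in d, so it appends it.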

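To make the explanation above concrete, here is a minimal sketch that runs the question's loop next to the corrected one. It assumes Python 3 and takes a string argument instead of reading stdin; the function names `tokenize_buggy` and `tokenize_fixed` are made up for illustration.

```python
def tokenize_buggy(text):
    """Loop from the question: cur is reset only when a word is appended."""
    d, cur = [], ''
    for ch in text:
        if ch in ' .':
            if cur not in d and cur != '':
                d.append(cur)
                cur = ''  # never reached when cur is already in d
        else:
            cur += ch.lower()
    return d


def tokenize_fixed(text):
    """Same loop, but cur is reset at every separator."""
    d, cur = [], ''
    for ch in text:
        if ch in ' .':
            if cur not in d and cur != '':
                d.append(cur)
            cur = ''  # reset even if the word was a duplicate
        else:
            cur += ch.lower()
    return d


line = "Dan went to the north pole to lead an expedition during summer."
print(tokenize_buggy(line))  # contains 'tolead'
print(tokenize_fixed(line))  # contains 'lead', and 'to' only once
```

The only difference between the two functions is the indentation of `cur = ''`.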
A simpler solution; it also strips all forms of punctuation:

>>> import string 
>>> set("Dan went to the north pole to lead an expedition during summer.".translate(None, string.punctuation).lower().split()) 
set(['summer', 'north', 'lead', 'expedition', 'dan', 'an', 'to', 'pole', 'during', 'went', 'the'])
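Note that the two-argument form `translate(None, ...)` only works on Python 2 byte strings. On Python 3, a roughly equivalent sketch builds a deletion table with `str.maketrans` instead (the variable names `table` and `words` are just illustrative):

```python
import string

line = "Dan went to the north pole to lead an expedition during summer."

# str.maketrans('', '', chars) builds a table that deletes every character in chars
table = str.maketrans('', '', string.punctuation)

# strip punctuation, lowercase, split on whitespace, and deduplicate
words = set(line.translate(table).lower().split())

# sets are unordered, so sort only to get a stable display
print(sorted(words))
```

Like the original one-liner, this loses the input order; it also removes apostrophes, so a word like "don't" would become "dont".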


2013-07-29 18:01:32 2rs2ts
