2013-07-29 37 views
2

Tokenizing a string gives some words merged

I'm using the code below to tokenize a string read from standard input.

import sys

d = []
cur = ''
for i in sys.stdin.readline():
    if i in ' .':
        if cur not in d and (cur != ''):
            d.append(cur)
            cur = ''
    else:
        cur = cur + i.lower()

This gives me an array of words with no duplicates. However, in my output some words are not split.

My input is

Dan went to the north pole to lead an expedition during summer. 

and the output array d is

['dan', 'went', 'to', 'the', 'north', 'pole', 'tolead', 'an', 'expedition', 'during', 'summer']

Why are "to" and "lead" joined together as "tolead"?

Answers

3

Try this:

import sys

d = []
cur = ''
for i in sys.stdin.readline():
    if i in ' .':
        if cur not in d and (cur != ''):
            d.append(cur)
        cur = ''  # note the different indentation: reset at every separator
    else:
        cur = cur + i.lower()
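
To see why the placement of `cur = ''` matters, here is a minimal sketch that runs both variants on the question's sentence. The function name `tokenize` and its `reset_only_on_append` flag are made up for this illustration, and the input is passed as a string instead of being read from stdin.

```python
def tokenize(line, reset_only_on_append):
    """Collect unique lowercase words, splitting on spaces and periods."""
    d, cur = [], ''
    for ch in line:
        if ch in ' .':
            appended = cur != '' and cur not in d
            if appended:
                d.append(cur)
            # The question's code clears cur only after an append;
            # the fix clears it at every separator.
            if appended or not reset_only_on_append:
                cur = ''
        else:
            cur += ch.lower()
    return d

line = "Dan went to the north pole to lead an expedition during summer."
buggy = tokenize(line, reset_only_on_append=True)
fixed = tokenize(line, reset_only_on_append=False)
print(buggy)
print(fixed)
```

With the reset tied to the append, the second "to" is never cleared, so the following word is glued onto it.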
1

Try this:

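import sys

for line in sys.stdin:  # iterate over lines, not over the characters of one line
    # strip() drops the newline, then [:-1] drops the trailing period
    res = set(word.lower() for word in line.strip()[:-1].split(" "))
    print res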

Example:

line = "Dan went to the north pole to lead an expedition during summer." 
res = set(word.lower() for word in line[:-1].split(" ")) 
print res 

set(['north', 'lead', 'expedition', 'dan', 'an', 'to', 'pole', 'during', 'went', 'summer', 'the']) 

Edit after the comments: this solution preserves the input order and filters out the separators.

import re 
from collections import OrderedDict 
line = "Dan went to the north pole to lead an expedition during summer." 
list(OrderedDict.fromkeys(re.findall(r"[\w']+", line))) 
# ['Dan', 'went', 'to', 'the', 'north', 'pole', 'lead', 'an', 'expedition', 'during', 'summer'] 
+1

It should probably split on periods as well, just to make sure the OP's problem is really covered.

+0

Done with line[:-1] :)

+0

Hmm, not quite, because you could have multiple sentences. Just because it works on the OP's example doesn't mean it works in the wild.
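
The multi-sentence concern raised in the comments can be illustrated with a short sketch. It assumes Python 3.7+ (where a plain `dict` keeps insertion order), and the helper name `unique_words` is hypothetical, not something from the answers above.

```python
import re

def unique_words(text):
    """Return lowercase words in first-seen order, across any number of sentences."""
    # re.findall pulls out runs of word characters and apostrophes, so periods,
    # spaces and newlines all act as separators wherever they occur.
    words = re.findall(r"[\w']+", text.lower())
    # dict.fromkeys removes duplicates while keeping the first occurrence of each word.
    return list(dict.fromkeys(words))

result = unique_words("Dan went north. Dan went to lead an expedition.")
print(result)
```

Because nothing depends on the period being the last character, the same call works for one sentence or several.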

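1

"to" is already in "d". So at the space between "to" and "lead" your loop neither appends nor resets cur, and it keeps concatenating; once it reaches the next space, it sees that "tolead" is not in d, so it appends that.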

A simpler solution; it also strips all forms of punctuation:

>>> import string 
>>> set("Dan went to the north pole to lead an expedition during summer.".translate(None, string.punctuation).lower().split()) 
set(['summer', 'north', 'lead', 'expedition', 'dan', 'an', 'to', 'pole', 'during', 'went', 'the']) 
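
The `translate(None, ...)` call above is the Python 2 form. In Python 3, `str.translate` takes a single table argument, so a rough equivalent (a sketch, with the variable names `table` and `words` chosen here for illustration) would be:

```python
import string

line = "Dan went to the north pole to lead an expedition during summer."

# str.maketrans('', '', chars) builds a table that deletes every character in chars.
table = str.maketrans('', '', string.punctuation)

# Remove punctuation, lowercase, split on whitespace, and deduplicate with a set.
words = set(line.translate(table).lower().split())
print(sorted(words))
```

As with the original one-liner, a set does not preserve the order in which the words appeared.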