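Split a string into words and punctuation

I'm trying to split a string into words and punctuation, adding the punctuation marks to the list produced by the split.

For example: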

>>> c = "help, me"
>>> print(c.split())
['help,', 'me']

What I really want the list to look like is:

['help', ',', 'me']

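So I want the string split on whitespace, with the punctuation split off from the words.

I've tried parsing the string first and then running the split: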

>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...         outputCharacter = " %s" % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
...
>>> print(separatedPunctuation)
help , me
>>> print(separatedPunctuation.split())
['help', ',', 'me']

This produces the result I want, but it is painfully slow on large files.

Is there a way to do this more efficiently?

2008-12-14 David A

For this example (not the general case): `c.replace(' ', '').partition(',')` – 2016-11-21 08:59:51

This is more or less the way to do it:

>>> import re 
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!") 
['Hello', ',', "I'm", 'a', 'string', '!']

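The trick is not to think about where to split the string, but about what to include in the tokens.

Caveats:

The underscore (_) is considered an inner-word character. Replace \w if you don't want that.
This will not work with (single) quotes in the string.
Put any additional punctuation marks you want to use in the right half of the regular expression.
Anything not explicitly mentioned in the regular expression is silently dropped.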
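
A minimal sketch of what these caveats look like in practice. The helper name `tokenize` and the `NO_UNDERSCORE_RE` variant are illustrative assumptions, not part of the answer above:

```python
import re

# Pattern from the answer: runs of word characters/apostrophes, or one listed punctuation mark.
TOKEN_RE = re.compile(r"[\w']+|[.,!?;]")

# Hypothetical variant: [^\W_] means "any word character except the underscore".
NO_UNDERSCORE_RE = re.compile(r"(?:[^\W_]|')+|[.,!?;]")


def tokenize(text, pattern=TOKEN_RE):
    """Return every token the pattern finds; unmatched characters are skipped."""
    return pattern.findall(text)


print(tokenize("snake_case: it's (fine)"))
# ['snake_case', "it's", 'fine']   ':' '(' and ')' are silently dropped
print(tokenize("snake_case", NO_UNDERSCORE_RE))
# ['snake', 'case']
print(tokenize("say 'hi'"))
# ['say', "'hi'"]   the quotes stay glued to the word
```

If the dropped characters matter, they have to be added to the character class on the right-hand side of the alternation.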

2008-12-15 01:53:18 hop

Have you tried using a regular expression?

http://docs.python.org/library/re.html#re-syntax

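By the way, why do you need the "," as a second element? You already know that one follows each piece of text, i.e.

[0]

","

[1]

","

So if you want to add the ",", you can just do it after each iteration when you use the array.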

2008-12-14 23:34:49

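In Perl-style regular expression syntax, \b matches a word boundary. This should come in handy for doing a regex-based split.

Edit: hop has informed me that "empty matches" do not work in the split function of Python's re module. I will leave this here as information for anyone else who gets stumped by this "feature".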

2008-12-15 00:25:08 Svante

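Only it doesn't, because re.split doesn't work with r'\b'... – hop 2008-12-15 01:09:10

What on earth? Is this a bug in re.split? In Perl, `split /\b\s*/` works without any problem. – Svante 2008-12-15 01:29:34

It is sort of documented that re.split() won't split on empty matches... so, no, not /really/ a bug. – hop 2008-12-15 01:51:26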
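
For readers on a newer interpreter: since Python 3.7, re.split() does split on patterns that can match the empty string, so the \b idea can be tried directly. A rough sketch, assuming Python 3.7 or later; the function name `split_on_boundaries` is made up for illustration:

```python
import re


def split_on_boundaries(text):
    """Split at word boundaries, then drop the whitespace around each piece.

    Assumes Python 3.7+, where re.split() accepts a pattern such as r"\\b"
    that only ever matches the empty string.
    """
    pieces = re.split(r"\b", text)
    # Pieces between words look like ', ' or ' '; strip them and skip the empty ones.
    return [piece.strip() for piece in pieces if piece.strip()]


print(split_on_boundaries("Words, words, words."))
# ['Words', ',', 'words', ',', 'words', '.']
```

Adjacent punctuation marks are not separated by \b, so a run such as `",` still comes back as a single token.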

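I think you can find all the help you can imagine in NLTK, especially since you are using Python. Its tutorial has a comprehensive discussion of this issue.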

2008-12-15 00:34:08 dkretz

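Here is a minor update to your implementation. If you are trying to do anything more detailed, I suggest looking into NLTK, as dorfier suggested.

This might only be a little faster, since ''.join() is used in place of +=, which is known to be faster.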

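import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # Anything that is not a letter or apostrophe: flush the pending word, then emit the character.
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        # Whitespace ends the current word.
        if word:
            result.append(word)
            word = ''

# Flush the last word when the string does not end in punctuation or whitespace.
if word:
    result.append(word)

print(result)
# ['Hello', ',', "I'm", 'a', 'string', '!']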
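
The usual way to get the benefit of join over += is to collect the pieces in a list and join once at the end. A minimal sketch of that idea applied to the approach from the question; the function name `separate_punctuation` and the choice to pad punctuation on both sides are assumptions made for illustration:

```python
def separate_punctuation(text, punctuation=".,;!?"):
    """Pad each punctuation mark with spaces, then split on whitespace."""
    pieces = []
    for char in text:
        if char in punctuation:
            # Padding both sides also handles input such as "help,me".
            pieces.append(" " + char + " ")
        else:
            pieces.append(char)
    # A single join at the end instead of repeated string concatenation.
    return "".join(pieces).split()


print(separate_punctuation("help, me"))
# ['help', ',', 'me']
```

Whether this is noticeably faster than += depends on the interpreter and the input size, so it is worth timing on real data.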

2008-12-15 01:05:11 monkut

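Here is my entry.

I have my doubts about how well this holds up in terms of efficiency, or whether it catches all cases (note that "!!!" is grouped together; this may or may not be a good thing).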

>>> import re
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item.strip() for item in re.split(r"(\W+)", s) if item.strip()]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']

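One obvious optimization would be to compile the regular expression beforehand (using re.compile) if you are going to do this on a line-by-line basis.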
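
A rough sketch of that optimization, assuming the input arrives as an iterable of lines; `TOKEN_SPLIT` and `tokenize_lines` are illustrative names:

```python
import re

# Compiled once at import time and reused for every line.
TOKEN_SPLIT = re.compile(r"(\W+)")


def tokenize_lines(lines):
    """Yield one token list per input line."""
    for line in lines:
        # The capturing group keeps the separators; strip them and skip the empty ones.
        yield [piece.strip() for piece in TOKEN_SPLIT.split(line) if piece.strip()]


for tokens in tokenize_lines(["help, me", "i live!!! in a button; factory:"]):
    print(tokens)
```

Because `tokenize_lines` is a generator, it also works on an open file object without loading the whole file into memory.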

2008-12-15 01:30:32

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

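The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.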
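
A short sketch of both points, assuming Python 3, where str patterns are already Unicode-aware and re.ASCII switches that off; the helper name `tokenize` is illustrative:

```python
import re


def tokenize(text):
    """Words (Unicode-aware) or single non-word, non-space characters."""
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)


# Accented letters count as word characters.
print(tokenize("r\u00e9sum\u00e9, done!"))
# The apostrophe is emitted as its own token.
print(tokenize("I'm"))
# ['I', "'", 'm']

# For contrast: with ASCII-only matching the accented letters break the word apart.
print(re.findall(r"\w+", "r\u00e9sum\u00e9", re.ASCII))
# ['r', 'sum']
```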

2012-01-19 17:58:09 LaC

I came up with a way to tokenize all words and \W+ patterns using \b, which doesn't need any hard-coding:

>>> import re
>>> sentence = 'Hello, world!'
>>> tokens = [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', sentence)]
>>> tokens
['Hello', ',', 'world', '!']

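Here .*?\S.*? is a pattern matching any run that contains at least one non-space character, and $ is added to match the last token in the string if it is a punctuation symbol.

Beware of the following, though: this groups punctuation that consists of more than one symbol: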

>>> print([t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"Oh no", she said')])
['Oh', 'no', '",', 'she', 'said']

Of course, you can find and split such groups with:

>>> for token in [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"You can", she said')]:
...     print(re.findall(r'(?:\w+|\W)', token))
...
['You']
['can']
['"', ',']
['she']
['said']
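
The two passes can be folded into one helper that returns a flat token list. This is only a sketch; `CHUNK_RE`, `PIECE_RE` and `tokenize` are names invented here, and the patterns are the two from the answer:

```python
import re

# First pass: chunks delimited by word boundaries (pattern from the answer).
CHUNK_RE = re.compile(r"\b.*?\S.*?(?:\b|$)")
# Second pass: split a chunk into whole words and single non-word characters.
PIECE_RE = re.compile(r"\w+|\W")


def tokenize(sentence):
    """Return a flat list of words and individual punctuation marks."""
    tokens = []
    for chunk in CHUNK_RE.findall(sentence):
        tokens.extend(PIECE_RE.findall(chunk.strip()))
    return tokens


print(tokenize('"You can", she said'))
# ['You', 'can', '"', ',', 'she', 'said']
```

As in the outputs above, punctuation before the first word (here the opening quote) is skipped, because the first \b only matches where the first word starts.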

2014-04-15 19:11:22 FrauHahnhen

Try this:

string_big = "One of Python's coolest features is the string format operator This operator is unique to strings"
my_list = []
x = len(string_big)
position_of_space = 0
while position_of_space < x:
    # Scan forward to the next space (or to the last character of the string).
    for i in range(position_of_space, x):
        if string_big[i] == ' ':
            break
    # Each chunk keeps its trailing space; punctuation is not separated from the word.
    print(string_big[position_of_space:(i + 1)])
    my_list.append(string_big[position_of_space:(i + 1)])
    position_of_space = i + 1

print(my_list)

2017-04-18 09:03:02
