Need help splitting a string in Python

I want to tokenize a string using the pattern below.

>>> splitter = re.compile(r'((\w*)(\d*)\-\s?(\w*)(\d*)|(?x)\$?\d+(\.\d+)?(\,\d+)?|([A-Z]\.)+|(Mr)\.|(Sen)\.|(Miss)\.|.$|\w+|[^\w\s])') 
>>> splitter.split("Hello! Hi, I am debating this predicament called life. Can you help me?")

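I get the following output. Can someone point out what I need to correct? I'm confused by all the None entries. Also, if there is a better way to tokenize a string, I'd really appreciate the extra help.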

['', 'Hello', None, None, None, None, None, None, None, None, None, None, '', '!', None, None, None, None, None, None, None, None, None, None, ' ', 'Hi', None,None, None, None, None, None, None, None, None, None, '', ',', None, None, None, None, None, None, None, None, None, None, ' ', 'I', None, None, None, None, None, None, None, None, None, None, ' ', 'am', None, None, None, None, None, None,None, None, None, None, ' ', 'debating', None, None, None, None, None, None, None, None, None, None, ' ', 'this', None, None, None, None, None, None, None, None, None, None, ' ', 'predicament', None, None, None, None, None, None, None, None, None, None, ' ', 'called', None, None, None, None, None, None, None, None, None, None, ' ', 'life', None, None, None, None, None, None, None, None, None, None, '', '.', None, None, None, None, None, None, None, None, None, None, ' ', 'Can', None, None, None, None, None, None, None, None, None, None, ' ', 'you', None, None, None, None, None, None, None, None, None, None, ' ', 'help', None, None,None, None, None, None, None, None, None, None, ' ', 'me', None, None, None, None, None, None, None, None, None, None, '', '?', None, None, None, None, None, None, None, None, None, None, '']

The output I want is:

['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life', '.', 'Can', 'you', 'help', 'me', '?']

Thanks.


2010-08-03 leba-lev

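What do you think the correct output should be? – 2010-08-03 01:15:40

Instead of a single example that is both simplistic and overly long, please tell us **what rules define what the correct output should be**. – 2010-08-03 01:51:12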

I recommend NLTK's tokenizers. Then you don't have to worry about tedious regular expressions yourself:

>>> import nltk 
>>> nltk.word_tokenize("Hello! Hi, I am debating this predicament called life. Can you help me?") 
['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life.', 'Can', 'you', 'help', 'me', '?']


2010-08-03 02:12:04

I may be missing something, but I believe something like the following would work:

s = "Hello! Hi, I am debating this predicament called life. Can you help me?" 
s.split(" ")

This assumes you want to split on spaces. You should get something along the lines of:

['Hello!', 'Hi,', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life.', 'Can', 'you', 'help', 'me?']

With this, if you need a particular piece, you could loop through the list and pick out what you need.

Hope this helps.


2010-08-03 01:19:37

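Sorry, I didn't specify the output I was aiming for. I have edited my question. Apologies for any inconvenience. – 2010-08-03 01:22:25

I wasn't clear on exactly what you needed, but I should have given you enough to move forward. :-) Cheers! – 2010-08-03 01:24:51

Whitespace is the default separator, so you could just call s.split(). – GreenMatt 2010-08-03 01:59:41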

re.split rapidly runs out of puff when used as a tokeniser. It is preferable to use findall (or match in a loop) with a pattern of alternatives: this|that|another|more

>>> s = "Hello! Hi, I am debating this predicament called life. Can you help me?" 
>>> import re 
>>> re.findall(r"\w+|\S", s) 
['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life', '.', 'Can', 'you', 'help', 'me', '?'] 
>>>

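This defines a token as either one or more "word" characters, or a single character that is not whitespace. You may prefer [A-Za-z] or [A-Za-z0-9] or something else instead of \w (which allows underscores). You could even use something like r"[A-Za-z]+|[0-9]+|\S"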
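To make the difference concrete, here is a small sketch comparing the two patterns on a made-up sample string (the string is an assumption chosen for illustration, not part of the question):

```python
import re

# Hypothetical sample containing an underscore and mixed letters/digits.
sample = "snake_case x2 costs $5!"

# \w+ treats letters, digits and underscores as one kind of character.
word_tokens = re.findall(r"\w+|\S", sample)
print(word_tokens)    # ['snake_case', 'x2', 'costs', '$', '5', '!']

# Separate letter and digit classes split those runs apart;
# the underscore falls through to the single-character \S alternative.
strict_tokens = re.findall(r"[A-Za-z]+|[0-9]+|\S", sample)
print(strict_tokens)  # ['snake', '_', 'case', 'x', '2', 'costs', '$', '5', '!']
```

Which behaviour is "right" depends on the rules you want for your tokens.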

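If things like Sen., Mr. and Miss (what happened to Mrs and Ms?) are significant to you, your regex should not list them out one by one. It should just define a token that ends in a period, and you should keep a dictionary or set of probable abbreviations.

Splitting text into sentences is complicated. You may want to look at the nltk package instead of trying to reinvent the wheel.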
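One possible way to sketch the abbreviation-set idea is shown below. The set contents, the pattern, and the name `tokenize` are assumptions made for illustration. A real abbreviation list would be much longer.

```python
import re

# Hypothetical set of abbreviations that should keep their trailing period.
ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Miss.", "Sen."}

# A run of letters with an optional trailing period, a run of digits,
# or any single non-whitespace character.
TOKEN_RE = re.compile(r"[A-Za-z]+\.?|[0-9]+|\S")

def tokenize(text):
    tokens = []
    for tok in TOKEN_RE.findall(text):
        if len(tok) > 1 and tok.endswith(".") and tok not in ABBREVIATIONS:
            # Not a known abbreviation: treat the period as its own token.
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens

print(tokenize("Mr. Smith met Sen. Jones. Done."))
# ['Mr.', 'Smith', 'met', 'Sen.', 'Jones', '.', 'Done', '.']
```

Keeping the abbreviations in data rather than in the pattern means the list can grow without touching the regex.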

Update: if you need or want to distinguish between token types, you can get an index or a name like this, without a (possibly long) if/elif/elif/.../else chain:

>>> s = "Hello! Hi, I we 0 1 987?" 

>>> pattern = r"([A-Za-z]+)|([0-9]+)|(\S)" 
>>> list((m.lastindex, m.group()) for m in re.finditer(pattern, s)) 
[(1, 'Hello'), (3, '!'), (1, 'Hi'), (3, ','), (1, 'I'), (1, 'we'), (2, '0'), (2,  '1'), (2, '987'), (3, '?')] 

>>> pattern = r"(?P<word>[A-Za-z]+)|(?P<number>[0-9]+)|(?P<other>\S)" 
>>> list((m.lastgroup, m.group()) for m in re.finditer(pattern, s)) 
[('word', 'Hello'), ('other', '!'), ('word', 'Hi'), ('other', ','), ('word', 'I'), ('word', 'we'), ('number', '0'), ('number', '1'), ('number', '987'), ('other', '?')]
>>>


2010-08-03 01:48:01

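It seems a bit ironic to denigrate regular expressions in a comment on another answer, and then use them here. – GreenMatt 2010-08-04 01:26:40

The reason you're getting all those Nones is that your regular expression has a lot of parenthesized groups separated by |. Every time the regex finds a match, it matches only one of the alternatives given by the |'s. The parenthesized groups in the other, unused alternatives are set to None. By definition, re.split reports the values of all parenthesized groups every time it gets a match, hence the many Nones in your result.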
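A minimal sketch of this behaviour, using a deliberately tiny two-group pattern (the pattern and input are illustrative assumptions, not the asker's regex):

```python
import re

# Two capturing groups separated by |: only one can take part in any match,
# so the other group is reported as None for that match.
pieces = re.split(r"(a)|(b)", "xaybz")
print(pieces)  # ['x', 'a', None, 'y', None, 'b', 'z']

# Dropping falsy entries removes the None values (and any empty strings).
tokens = [p for p in pieces if p]
print(tokens)  # ['x', 'a', 'y', 'b', 'z']
```

With ten or so groups, as in the question, each match contributes one real value and a long run of Nones.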

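You could filter these out pretty easily (e.g. tokens = [t for t in tokens if t] or something similar), but I don't think split is really the tool you want for tokenizing. split is meant just for throwing away whitespace. If you really want to use regular expressions to tokenize something, here is a toy example of another approach. (I'm not even going to try to unwrap the monster you're using. For the love of Ned, use the re.VERBOSE option. Hopefully this toy example will give you the idea.)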

import re

tokenpattern = re.compile(r"""
(?P<words>[A-Za-z_]+)  # Things with just letters and underscores
|(?P<numbers>\d+)      # Things with just digits
|(?P<other>.+?)        # Anything else, one character at a time
""", re.VERBOSE)

The (?P<something>...) business lets you identify, by name, the type of token you found in the code that follows:

for match in tokenpattern.finditer("99 bottles of beer"):
    if match.group('words'):
        # This token is a word
        word = match.group('words')
        #...
    elif match.group('numbers'):
        # This token is a run of digits
        number = int(match.group('numbers'))
    else:
        # Anything else (here: whitespace)
        other = match.group('other')

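Note that this still makes heavy use of parenthesized groups separated by |, so the same thing happens as in your code: for each match, one group is defined and the others are set to None. This approach checks for that explicitly.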


2010-08-03 01:49:35

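Perhaps he didn't mean it that way, but John Machin's comment that "str.split is not a place to start" (part of the exchange following Frank V's answer) came across as a bit of a challenge. So...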

the_string = "Hello! Hi, I am debating this predicament called life. Can you help me?"
tokens = the_string.split()
punctuation = ['!', ',', '.', '?']
output_list = []
for token in tokens:
    if token[-1] in punctuation:
        # Split trailing punctuation off into its own token
        output_list.append(token[:-1])
        output_list.append(token[-1])
    else:
        output_list.append(token)
print(output_list)

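This seems to provide the requested output.

Granted, John's answer is simpler in terms of lines of code. However, I have a couple of points in support of this kind of solution.

I don't completely agree with Jamie Zawinski's 'Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.' (Neither did he, from what I've read.) My point in citing the quote is that regular expressions can be a pain if you aren't used to them.

Also, while it normally won't be an issue, the solution above consistently outperformed the regex solution when measured with timeit. The solution above (with the print statement removed) came in at about 8.9 seconds; John's regex solution came in at about 11.8 seconds. Each figure involved 10 tries of 1 million iterations on a quad-core, dual-processor system running at 2.4 GHz.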
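For anyone wanting to repeat the comparison, a rough sketch of how it could be set up with timeit follows. The function names, the precompiled pattern, and the reduced default iteration counts are assumptions for illustration, not the exact script used for the figures above. Absolute timings will differ by machine and Python version.

```python
import re
import timeit

TEXT = "Hello! Hi, I am debating this predicament called life. Can you help me?"
PUNCTUATION = {"!", ",", ".", "?"}
TOKEN_RE = re.compile(r"\w+|\S")  # precompiled here; an assumption


def split_tokenize(text):
    """str.split-based approach: peel one trailing punctuation mark off each token."""
    output = []
    for token in text.split():
        if token[-1] in PUNCTUATION:
            output.append(token[:-1])
            output.append(token[-1])
        else:
            output.append(token)
    return output


def regex_tokenize(text):
    """Regex-based approach using findall."""
    return TOKEN_RE.findall(text)


def time_both(number=10_000, repeat=3):
    """Return the best (minimum) total time in seconds for each approach."""
    split_best = min(timeit.repeat(lambda: split_tokenize(TEXT),
                                   number=number, repeat=repeat))
    regex_best = min(timeit.repeat(lambda: regex_tokenize(TEXT),
                                   number=number, repeat=repeat))
    return split_best, regex_best
```

Calling time_both(number=1_000_000, repeat=10) would mirror the 10 tries of 1 million iterations described above.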


2010-08-04 02:06:22 GreenMatt
