Splitting a sentence in Python

I want to split a sentence into words.

words = content.lower().split()

This gives me a list of words like

'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'

and with this code:

def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
    return clean_word_list

I get this:

'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'

Notice the word 'morningthe' in the list: it originally had a "--" between the two words. Is there any way to split it into two words, like "morning", "the"?

Source

2017-01-27 Yun Tae Hwang

You need to split on all separators, not just whitespace. This has been covered in other StackOverflow questions. – Prune

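Possible duplicate: http://stackoverflow.com/q/13209288/3865495 – CoconutBandit

You need to use the strip() method to remove unwanted symbols from the ends of each word, i.e. 'x-'.strip(',:-') -> 'x', but 'x-y'.strip(',:-') -> 'x-y'. However, if you want to work with real text, you will need a more sophisticated approach... maybe NLTK would be a good start? – myaut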

I would suggest a regex-based solution:

import re 

def to_words(text): 
    return re.findall(r'\w+', text)

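This finds all the words, i.e. groups of word characters (letters, digits and underscores), ignoring symbols, separators and whitespace.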

>>> to_words("The morning-the evening") 
['The', 'morning', 'the', 'evening']

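Note that if you are only iterating over the words, re.finditer is probably the better choice: it returns an iterator, so you don't have to store the whole list of words at once.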
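As a rough sketch of the re.finditer variant mentioned above (the name iter_words is made up for illustration, it is not part of the original answer), each match object is unwrapped with group() and yielded lazily:

```python
import re

def iter_words(text):
    """Yield words one at a time instead of building a full list."""
    # re.finditer returns an iterator of match objects;
    # match.group() gives the matched text.
    for match in re.finditer(r'\w+', text):
        yield match.group()

for word in iter_words("The morning-the evening"):
    print(word)
```

For a short sentence the difference is negligible; it should only start to matter when the input is a large document and you process one word at a time.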

Source

2017-01-27 22:02:14 FlipTack

Alternatively, you can use itertools.groupby together with str.isalpha to extract the alphabetic-only words from the string:

>>> from itertools import groupby 
>>> sentence = 'evening, and there was morning--the first day.' 

>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i] 
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']

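PS: The regex-based solution is much cleaner. I mention this only as a possible alternative.

Specific to the OP: if all you want is for the resulting list to also be split on "--", you can replace the hyphen '-' with a space ' ' before performing the split. Your code would then be: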

words = content.lower().replace('-', ' ').split()

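where words will hold the words split on both whitespace and hyphens (any other punctuation, such as the trailing comma in 'evening,', still has to be removed afterwards).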
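Putting the pieces together, here is a minimal sketch of that approach, assuming it is enough to strip punctuation from the two ends of each token. The function name split_words and the use of string.punctuation are my own choices for illustration, not something from the question:

```python
import string

def split_words(content):
    """Lowercase, treat hyphens as separators, strip edge punctuation."""
    words = []
    for token in content.lower().replace('-', ' ').split():
        # strip() only removes characters from the two ends of the token,
        # so an apostrophe inside a word (e.g. "don't") is left alone.
        cleaned = token.strip(string.punctuation)
        if cleaned:
            words.append(cleaned)
    return words

print(split_words("Evening, and there was morning--the first day."))
# ['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
```

Note that this also splits genuinely hyphenated words such as "mother-in-law", which may or may not be what you want.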

Source

2017-01-27 22:05:44

Trying to do this with regular expressions will drive you crazy, e.g.

>>> re.findall(r'\w+', "Don't read O'Rourke's books!") 
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']

Definitely take a look at the nltk package.
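If pulling in nltk is not an option, the apostrophe case shown above can be patched with a slightly longer pattern. This is only a sketch under the assumption that plain ASCII apostrophes inside a word are the one case you care about; the pattern and the function name are made up for illustration and this is not what nltk does internally:

```python
import re

# Hypothetical pattern: a run of word characters, optionally followed by
# apostrophe-plus-word-characters groups, so "Don't" stays in one piece.
WORD_WITH_APOSTROPHE = re.compile(r"\w+(?:'\w+)*")

def to_words_keep_apostrophes(text):
    return WORD_WITH_APOSTROPHE.findall(text)

print(to_words_keep_apostrophes("Don't read O'Rourke's books!"))
# ["Don't", 'read', "O'Rourke's", 'books']
```

It still ignores typographic apostrophes, leading apostrophes, abbreviations and so on, which is rather the point of the answer: every fix to the regex uncovers another case.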

Source

2017-01-27 22:23:26

In addition to the solutions already given, you could also improve your clean_up_list function so that it does a better job.

def clean_up_list(word_list):
    clean_word_list = []
    # Define the symbols once, outside the loop, so the string
    # isn't rebuilt for every word.
    symbols = "!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"

    for word in word_list:
        current_word = ''
        for char in word:
            if char in symbols:
                # A symbol ends the current word, if there is one.
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += char

        if current_word:
            # Append the last word, if any.
            clean_word_list.append(current_word)

    return clean_word_list

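In fact, you could apply the body of the for word in word_list: loop to the whole sentence and get the same result, provided whitespace is also treated as a separator.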
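A sketch of what that might look like, assuming whitespace is treated as one more separator (the symbols string has no space in it, so without this the words would run together). The name split_on_symbols is hypothetical:

```python
def split_on_symbols(sentence, symbols="!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"):
    """Split a whole sentence on punctuation and whitespace in one pass."""
    words = []
    current_word = ''
    for char in sentence:
        # str.isspace() covers spaces, tabs and newlines, which the
        # symbols string itself does not contain.
        if char in symbols or char.isspace():
            if current_word:
                words.append(current_word)
                current_word = ''
        else:
            current_word += char
    if current_word:
        # Append the last word, if any.
        words.append(current_word)
    return words

print(split_on_symbols("evening, and there was morning--the first day."))
# ['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
```

This removes the need for the initial content.lower().split() step, apart from the lowercasing.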

Source

2017-01-27 22:33:06

You could also do this:

import re 

def word_list(text): 
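    # Raw string avoids an invalid-escape warning for \W;
    # filter(None, ...) drops the empty strings re.split leaves at the ends.
    return list(filter(None, re.split(r'\W+', text)))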

print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))

['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']

Source

2017-01-28 03:45:08
