Cutting off a sentence after a trigger word

I have the following class methods:

class Trigger():

    def getRidOfTrashPerSentence(self, line, stopwords):
        countWord = 0
        words = line.split()
        for word in words:
            if countWord == 0:
                if word in stopwords:
                    sep = word
                    lineNew = line.split(sep, 1)[0]
                    countWord = countWord + 1
                    return(lineNew)

    stopwords = ['regards', 'Regards']

    def getRidOfTrash(self, aTranscript):
        result = [self.getRidOfTrashPerSentence(line, self.stopwords) for line in aTranscript]
        return(result)

I want it to cut off the "trash" that follows certain trigger words in a sentence, such as ['regards', 'Regards'].

So when I pass in a block like this:

aTranScript = [ "That's fine, regards Henk", "Allright great"]

I am looking for this output:

aTranScript = [ "That's fine, regards", "Allright great"]

However, when I do this:

newFile = Trigger() 
newContent = newFile.getRidOfTrash(aTranScript)

I only get "That's fine".

Any ideas on how I can get both strings?

Source

2017-02-14 Henk Straten

How would you append the delimiter after splitting? Here is a similar question - http://stackoverflow.com/questions/7866128/python-split-without-removing-the-delimiter – Vinay

I don't understand what you mean, Vinay, can you elaborate? –

You can do this - 'lineNew = line.split(sep, 1)[0]' 'lineNew += sep' – Vinay
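
As a rough sketch of Vinay's suggestion applied to the per-sentence method: split once on the trigger word, then put the trigger word back. The function name `cut_after_trigger` is made up for illustration, and returning the line unchanged when no trigger word is found is an assumption about the desired behaviour (the original method returns None in that case).

```python
def cut_after_trigger(line, stopwords):
    """Return `line` truncated right after the first stopword it contains."""
    for word in line.split():
        if word in stopwords:
            # Split once on the trigger word, then re-append it,
            # because str.split drops the separator.
            return line.split(word, 1)[0] + word
    # Assumption: a line without any trigger word is returned unchanged.
    return line


transcript = ["That's fine, regards Henk", "Allright great"]
result = [cut_after_trigger(line, ["regards", "Regards"]) for line in transcript]
print(result)
```

Note that a trigger word followed directly by punctuation, such as "regards,", is not matched by this word-by-word comparison.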

Here is a simple solution:

yourString = 'Hello thats fine, regards Henk' 
yourString.split(', regards')[0]

This code returns: 'Hello thats fine'

If you like, you can concatenate ', regards' at the end:

yourString.split(', regards')[0] + ', regards'

Source

2017-02-14 08:52:51 Ika8

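@EricDuminil You're right, I changed ', regards' for 'Henk' ;) – Ika8

If you are missing a specific word, you can concatenate it.. – Ika8

How would you adapt this to multiple trigger words? –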

You can scan the words of the line and remove a word if the previous word is a stopword:

class Trigger():

    stopwords = ['regards', 'Regards']

    def getRidOfTrashPerSentence(self, line):
        words = line.split()
        if not words:  # empty or whitespace-only line: nothing to scan
            return line
        new_words = [words[0]]
        for i in range(1, len(words)):
            # keep a word only if the word before it is not a stopword
            if words[i-1] not in self.stopwords:
                new_words.append(words[i])
        return " ".join(new_words)  # reconstruct line

    def getRidOfTrash(self, aTranscript):
        # apply the per-sentence cleanup to every line of the transcript
        result = [self.getRidOfTrashPerSentence(line) for line in aTranscript]
        return result

aTranScript = ["That's fine, regards Henk", "Allright great"]
newFile = Trigger()
newContent = newFile.getRidOfTrash(aTranScript)
print(newContent)
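
The loop above drops only the single word that directly follows a stopword, so a longer signature such as "regards Henk Straten" would keep "Straten". If everything after the trigger word should go, one possible variant is to stop collecting words as soon as a stopword has been kept. This is only a sketch; the function name `keep_up_to_stopword` is hypothetical.

```python
def keep_up_to_stopword(line, stopwords):
    """Keep words up to and including the first stopword, drop the rest."""
    kept = []
    for word in line.split():
        kept.append(word)
        if word in stopwords:
            # First trigger word reached: ignore everything after it.
            break
    # Joining on single spaces normalises any repeated whitespace.
    return " ".join(kept)


lines = ["That's fine, regards Henk Straten", "Allright great"]
trimmed = [keep_up_to_stopword(line, ["regards", "Regards"]) for line in lines]
print(trimmed)
```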

Source

2017-02-14 09:07:42

A regular expression makes the replacement easier.

import re

stop_words = ['regards', 'cheers']

def remove_text_after_stopwords(text, stop_words):
    # match the first stop word (kept as group 1) and everything after it
    pattern = r"(%s).*$" % '|'.join(stop_words)
    remove_trash = re.compile(pattern, re.IGNORECASE)
    # replace the whole match with just the stop word
    return remove_trash.sub(r'\g<1>', text)

print(remove_text_after_stopwords("That's fine, regards, Henk", stop_words))
# That's fine, regards
print(remove_text_after_stopwords("Good, cheers! Paul", stop_words))
# Good, cheers
print(remove_text_after_stopwords("No stop word here", stop_words))
# No stop word here

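If you have a list of strings, you can apply this function to each string with a list comprehension. As a bonus, the match is case-insensitive, so you don't have to put both 'regards' and 'Regards' in your list.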
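
A minimal sketch of that list comprehension, using the transcript from the question. The function is restated here so the snippet runs on its own; the `re.escape` call and the `\b` word boundaries are additions that the answer does not include, made on the assumption that stop words should match only as whole words.

```python
import re


def remove_text_after_stopwords(text, stop_words):
    # Assumption: escape each stop word and require word boundaries,
    # so that e.g. "disregards" does not count as "regards".
    pattern = r"\b(%s)\b.*$" % "|".join(re.escape(w) for w in stop_words)
    # Replace the stop word plus everything after it with the stop word alone.
    return re.sub(pattern, r"\g<1>", text, flags=re.IGNORECASE)


transcript = ["That's fine, Regards Henk", "Allright great"]
cleaned = [remove_text_after_stopwords(line, ["regards", "cheers"]) for line in transcript]
print(cleaned)
```

Because of `re.IGNORECASE`, the capitalised "Regards" is matched even though only 'regards' is in the list, and the original capitalisation is kept in the result.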

Source

2017-02-14 09:26:38
