2017-02-14 82 views
1

I have the following class methods for cutting sentences:

class Trigger():

    def getRidOfTrashPerSentence(self, line, stopwords):
        countWord = 0
        words = line.split()
        for word in words:
            if countWord == 0:
                if word in stopwords:
                    sep = word
                    lineNew = line.split(sep, 1)[0]
                    countWord = countWord + 1
                    return(lineNew)

    stopwords = ['regards', 'Regards']

    def getRidOfTrash(self, aTranscript):
        result = [self.getRidOfTrashPerSentence(line, self.stopwords) for line in aTranscript]
        return(result)

I want it to cut the "trash" from a sentence after certain trigger words, such as ['regards', 'Regards'].

So, when I pass in a block like this:

aTranScript = [ "That's fine, regards Henk", "Allright great"] 

I am looking for output like this:

aTranScript = [ "That's fine, regards", "Allright great"] 

However, when I do this:

newFile = Trigger() 
newContent = newFile.getRidOfTrash(aTranScript) 

I only get "That's fine"

Any ideas how I can get both strings?

+0

How about appending the delimiter after the split? Here is a similar question - http://stackoverflow.com/questions/7866128/python-split-without-removing-the-delimiter – Vinay

+0

I don't understand what you mean, Vinay. Could you elaborate? –

+0

You can do this - 'lineNew = line.split(sep, 1)[0]' 'lineNew += sep' – Vinay
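Applied to the method from the question, the suggestion in the comment above might look like the following sketch. It assumes that lines without a trigger word should be returned unchanged (the question's code returns None for them), and it only matches trigger words that stand alone between whitespace.

```python
class Trigger:

    stopwords = ['regards', 'Regards']

    def getRidOfTrashPerSentence(self, line, stopwords):
        for word in line.split():
            if word in stopwords:
                # Keep the text before the first trigger word, then re-append the trigger
                return line.split(word, 1)[0] + word
        # Assumption: a line without any trigger word is returned as-is
        return line

    def getRidOfTrash(self, aTranscript):
        return [self.getRidOfTrashPerSentence(line, self.stopwords) for line in aTranscript]


aTranScript = ["That's fine, regards Henk", "Allright great"]
newContent = Trigger().getRidOfTrash(aTranScript)
print(newContent)
```

Because the line is split on whitespace, a trigger followed directly by punctuation (for example "regards,") would not be recognised by this version.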

Answers

2

Here is a simple solution:

yourString = 'Hello thats fine, regards Henk' 
yourString.split(', regards')[0] 

This code will return: 'Hello thats fine'

If you want, you can concatenate ', regards' at the end:

yourString.split(', regards')[0] + ', regards'
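One possible way to extend this idea to several trigger words while keeping the trigger itself is `str.partition`, which returns the text before the separator, the separator, and the text after it. The sketch below is only an illustration: the function name `cut_after_trigger` and its `triggers` parameter are hypothetical and not part of the answer.

```python
def cut_after_trigger(line, triggers=("regards", "Regards")):
    """Return line cut right after the earliest trigger, or unchanged if none occurs."""
    best = None  # (text before the trigger, the trigger itself)
    for trigger in triggers:
        head, sep, _tail = line.partition(trigger)
        # sep is empty when the trigger does not occur in the line
        if sep and (best is None or len(head) < len(best[0])):
            best = (head, sep)
    return line if best is None else best[0] + best[1]


print(cut_after_trigger("Hello thats fine, regards Henk"))
print(cut_after_trigger("Allright great"))
```

Note that this matches substrings, so a word such as "disregards" would also count as a trigger.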

+0

@EricDuminil you're right, changed ', regards' for 'Henk' ;) – Ika8

+0

If you are missing a specific word, you can concatenate it.. – Ika8

+0

How would you adapt this to multiple trigger words? –

0

You can scan the words in the line and remove a word if the previous word is a stopword:

class Trigger():

    stopwords = ['regards', 'Regards']

    def getRidOfTrashPerSentence(self, line):
        words = line.split()
        if not words:  # empty or whitespace-only line: nothing to scan
            return line
        new_words = [words[0]]
        for i in range(1, len(words)):
            # keep a word only if the word before it is not a stopword
            if words[i-1] not in self.stopwords:
                new_words.append(words[i])
        return " ".join(new_words) # reconstruct line

    def getRidOfTrash(self, aTranscript):
        result = [self.getRidOfTrashPerSentence(line) for line in aTranscript]
        return(result)

aTranScript = [ "That's fine, regards Henk", "Allright great"]
newFile = Trigger()
newContent = newFile.getRidOfTrash(aTranScript)
print(newContent)

1

Regular expressions make the replacement easier.

import re

stop_words = ['regards', 'cheers']

def remove_text_after_stopwords(text, stop_words):
    # match the first stop word (group 1) plus everything after it
    pattern = "(%s).*$" % '|'.join(re.escape(word) for word in stop_words)
    remove_trash = re.compile(pattern, re.IGNORECASE)
    # replace the whole match with just the stop word
    return remove_trash.sub(r'\g<1>', text)

print(remove_text_after_stopwords("That's fine, regards, Henk", stop_words))
# That's fine, regards
print(remove_text_after_stopwords("Good, cheers! Paul", stop_words))
# Good, cheers
print(remove_text_after_stopwords("No stop word here", stop_words))
# No stop word here

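As a bonus, the match is case-insensitive, so you don't have to write both 'regards' and 'Regards' in your list. If you have a list of strings, you can simply apply this method to each string with a list comprehension.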
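A minimal sketch of that list comprehension, applied to a transcript like the one in the question. The function is repeated so the snippet runs on its own; the names `transcript` and `cleaned` are only illustrative.

```python
import re


def remove_text_after_stopwords(text, stop_words):
    # Match the first stop word (group 1) and everything that follows it
    pattern = re.compile("(%s).*$" % "|".join(map(re.escape, stop_words)), re.IGNORECASE)
    # Keep only the stop word as it appeared in the text
    return pattern.sub(r"\g<1>", text)


transcript = ["That's fine, Regards Henk", "Allright great"]
cleaned = [remove_text_after_stopwords(line, ["regards"]) for line in transcript]
print(cleaned)
```

The stop word keeps the capitalisation it had in the input line, because the replacement reuses the matched group rather than the entry from the list.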