2013-09-30 67 views
2

Grouping sentences recursively based on conjunctions

I have a list of sentences like this:

Sentence 1. 
And Sentence 2. 
Or Sentence 3. 
New Sentence 4. 
New Sentence 5. 
And Sentence 6. 

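I am trying to group these sentences by a common criterion. For example, if a sentence starts with a conjunction (currently only "And" or "Or"), I want to put it in the same group as the sentence before it, so that: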

Group 1: 
    Sentence 1. 
    And Sentence 2. 
    Or Sentence 3. 

Group 2: 
    New Sentence 4. 

Group 3: 
    New Sentence 5. 
    And Sentence 6. 

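I wrote the code below. It detects consecutive sentences to some extent, but not all of them.

How can I code this recursively? I tried to code it iteratively, but there are cases where it does not work, and I cannot figure out how to write it recursively.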

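import nltk  # word_tokenize requires NLTK's tokenizer data to be installed

tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.",
          "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]
conjucture_list = ["and", "or"]  # not shown in the original post; assumed from the description above
already_selected = []  # indices of sentences already attached to an earlier sentence
attachlist = {}        # maps each sentence to the sentences attached to it
for i in tokens:
    attachlist[i] = []

for i in range(len(tokens)):
    if i in already_selected:
        continue
    for j in range(i + 1, len(tokens)):
        if j not in already_selected:
            first_word = nltk.tokenize.word_tokenize(tokens[j].lower())[0]
            if first_word in conjucture_list:
                attachlist[tokens[i]].append(tokens[j])
                already_selected.append(j)
            else:
                break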

Why do you need it to be recursive? Honestly, that's a fool's errand. – Veedrac

Answers

3
tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.",
          "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]
result = []
for token in tokens:
    # The trailing space keeps words like "Andy ..." and "Orwell ..." from matching.
    if not token.startswith(("And ", "Or ")):
        result.append([token])    # start a new group
    else:
        result[-1].append(token)  # attach to the most recent group

Result:

[['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'], 
['New Sentence 4.'], 
['New Sentence 5.', 'And Sentence 6.']] 

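@unutbu I updated the code based on your point about cases like "Andy ..." or "Orwell ...". However, I don't think using `result = [tokens[:1]]` or handling the IndexError is a good idea. I am not concerned with malformed text here, because the question gives no indication of how such sentences should be handled. Depending on the application, we may need to raise an error or ignore those sentences entirely. –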
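The two options mentioned in this comment (raising an error, or ignoring a conjunction sentence that has nothing to attach to) could be sketched roughly as follows. The function name `group_by_conjunction` and the `on_orphan` parameter are hypothetical and appear nowhere in the thread; this is only one way such handling might look.

```python
def group_by_conjunction(tokens, on_orphan="error"):
    """Group sentences; those starting with "And " / "Or " join the previous group.

    on_orphan is a hypothetical parameter: it decides what happens when a
    conjunction sentence comes first and has no group to join.
    "error" raises ValueError; any other value drops the sentence.
    """
    result = []
    for token in tokens:
        if not token.startswith(("And ", "Or ")):
            result.append([token])      # start a new group
        elif result:
            result[-1].append(token)    # attach to the current group
        elif on_orphan == "error":
            raise ValueError("no preceding sentence for: %r" % token)
        # otherwise: silently skip the orphan sentence
    return result


print(group_by_conjunction(["And Sentence 0.", "Sentence 1.", "And Sentence 2."],
                           on_orphan="ignore"))
# [['Sentence 1.', 'And Sentence 2.']]
```

Which behavior is appropriate depends on the application, as the comment notes.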

0

I have a thing for nested iterators and generics, so here is a super-generic approach:

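import re

class split_by:
    """Lazily split an iterable into sections; a new section starts at each item matching the predicate."""

    def __init__(self, iterable, predicate=None):
        self.iter = iter(iterable)
        self.predicate = predicate or bool

        try:
            self.head = next(self.iter)
        except StopIteration:
            self.finished = True
        else:
            self.finished = False

    def __iter__(self):
        return self

    def _section(self):
        yield self.head

        for self.head in self.iter:
            if self.predicate(self.head):
                break  # self.head now holds the first item of the next section

            yield self.head

        else:
            self.finished = True

    def __next__(self):
        if self.finished:
            raise StopIteration

        # Each section must be consumed fully before the next one is requested.
        section = self._section()
        return section

# `tokens` is the list from the question; \b keeps "Orwell" and "Andy" from matching.
[list(x) for x in split_by(tokens, lambda sentence: not re.match(r"(?i)(?:or|and)\b", sentence))]
#>>> [['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'], ['New Sentence 4.'], ['New Sentence 5.', 'And Sentence 6.']]

This is longer, but it needs only O(1) extra space because sections are produced lazily, and it lets you choose the predicate.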

0

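Since the output only requires a single level of grouping, it is much better to solve this problem iteratively rather than recursively. If you are looking for a recursive solution, please give an example with arbitrary levels of grouping.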

def is_conjunction(sentence):
    return sentence.startswith('And') or sentence.startswith('Or')

tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.",
          "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]

def group_sentences_by_conjunction(sentences):
    result = []
    for s in sentences:
        if result and not is_conjunction(s):
            yield result  # flush the last group
            result = []
        result.append(s)
    if result:
        yield result  # flush the rest of the result buffer

>>> groups = group_sentences_by_conjunction(tokens) 

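Using yield is preferable when the result might not fit in memory, for example when reading all the sentences of a book stored in a file. If you need the result as a list for some reason, convert it with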

>>> groups_list = list(groups) 

Result:

[['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'], ['New Sentence 4.'], ['New Sentence 5.', 'And Sentence 6.']] 

If you need group numbers, use enumerate(groups).
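As a small illustration of the enumerate suggestion, the sketch below numbers the groups starting from 1, matching the "Group 1", "Group 2" layout in the question. It repeats the generator from this answer so that it runs on its own; the variable name `numbered` is introduced here only for the example.

```python
def is_conjunction(sentence):
    return sentence.startswith('And') or sentence.startswith('Or')

def group_sentences_by_conjunction(sentences):
    result = []
    for s in sentences:
        if result and not is_conjunction(s):
            yield result  # flush the previous group
            result = []
        result.append(s)
    if result:
        yield result      # flush the final group

tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.",
          "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]

# start=1 makes the numbering begin at 1 instead of 0
numbered = list(enumerate(group_sentences_by_conjunction(tokens), start=1))
for number, group in numbered:
    print("Group %d:" % number)
    for sentence in group:
        print("    " + sentence)
```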

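is_conjunction has the same problem mentioned in the other answers (it also matches words such as "Andy" or "Orwell"). Modify it as needed to fit your criteria.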
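One possible modification, offered only as a sketch, is to require a word boundary after the conjunction. It assumes that a conjunction sentence is one whose whole first word is "and" or "or" in any letter case; the name `_CONJUNCTION` is hypothetical.

```python
import re

# Assumption: the sentence counts as a conjunction sentence only when its
# entire first word is "and" or "or", regardless of letter case.
_CONJUNCTION = re.compile(r"(?:and|or)\b", re.IGNORECASE)

def is_conjunction(sentence):
    # re.match anchors at the start of the string; \b rejects "Andy", "Orwell"
    return _CONJUNCTION.match(sentence) is not None

print(is_conjunction("And Sentence 2."))      # True
print(is_conjunction("Andy went home."))      # False
print(is_conjunction("Orwell wrote books."))  # False
```

Unlike the `startswith("And ")` check with a trailing space, this also accepts a conjunction followed directly by punctuation, such as "And, finally ...".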