Grouping sentences recursively based on conjunctions

I have a list of sentences such as:

Sentence 1. 
And Sentence 2. 
Or Sentence 3. 
New Sentence 4. 
New Sentence 5. 
And Sentence 6.

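I am trying to group these sentences by a "common criterion". For example, if a sentence starts with a conjunction (currently only "and" or "or"), I want to attach it to the preceding sentence, so that: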

Group 1: 
    Sentence 1. 
    And Sentence 2. 
    Or Sentence 3. 

Group 2: 
    New Sentence 4. 

Group 3: 
    New Sentence 5. 
    And Sentence 6.

I wrote the code below. It detects consecutive sentences to some extent, but not all of them.

How can I code this recursively? I tried to code it iteratively, but it fails in some cases, and I can't figure out how to write it recursively.

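import nltk  # word_tokenize needs NLTK's Punkt tokenizer data to be installed

tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.", "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]
conjucture_list = ["and", "or"]  # not shown in the original post; assumed from the description above
already_selected = []  # indices of sentences already attached to an earlier sentence
attachlist = {}        # maps each sentence to the sentences attached to it
for i in tokens:
    attachlist[i] = []

for i in range(len(tokens)):
    if i in already_selected:
        pass
    else:
        # Attach every directly following sentence that starts with a conjunction.
        for j in range(i+1, len(tokens)):
            if j not in already_selected:
                first_word = nltk.tokenize.word_tokenize(tokens[j].lower())[0]
                if first_word in conjucture_list:
                    attachlist[tokens[i]].append(tokens[j])
                    already_selected.append(j)
                else:
                    break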

2013-09-30 user2830018

Why do you need it to be recursive? Honestly, that's a fool's errand. – Veedrac
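To make the point about recursion concrete, here is a minimal sketch of what a recursive version could look like. The function name `group_recursive` and the `CONJUNCTION_PREFIXES` tuple are made up for this illustration and do not come from the question. Each call handles one sentence, so a long input would hit Python's recursion limit, which is one reason a plain loop is the better fit here.

```python
# Hypothetical prefixes; the trailing space avoids matching "Andy" or "Orwell".
CONJUNCTION_PREFIXES = ("And ", "Or ")

def group_recursive(sentences, groups=None):
    """Group sentences recursively, one sentence per call (illustrative only)."""
    if groups is None:
        groups = []
    if not sentences:
        return groups
    head, rest = sentences[0], sentences[1:]
    if groups and head.startswith(CONJUNCTION_PREFIXES):
        # A conjunction continues the most recent group.
        groups[-1].append(head)
    else:
        # Anything else (or a leading conjunction) starts a new group.
        groups.append([head])
    return group_recursive(rest, groups)

tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.",
          "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]
print(group_recursive(tokens))
```

Note that this sketch treats a conjunction at the very start of the input as the beginning of its own group, which is an assumption rather than something the question specifies.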

tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.",
          "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]
result = list()
for token in tokens:
    # Trailing space in the prefixes because of cases like "Andy ..." and "Orwell ..."
    if not token.startswith("And ") and not token.startswith("Or "):
        result.append([token])     # start a new group
    else:
        result[-1].append(token)   # attach to the current group

Result:

[['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'], 
['New Sentence 4.'], 
['New Sentence 5.', 'And Sentence 6.']]

2013-10-01 00:07:12

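@unutbu I updated the code based on your point about cases like "Andy ..." or "Orwell ...". However, I don't think using `result = [tokens[:1]]` or handling the IndexError is a good idea. I am not concerned with malformed text here, because the question gives no indication of how such sentences should be handled. Depending on the application, we may need to raise an error or ignore such sentences entirely. –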
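The two options mentioned in this comment, raising an error or ignoring a conjunction that has nothing to attach to, could be sketched as follows. The function name `group_sentences` and the `strict` flag are hypothetical and only serve to illustrate the choice; they are not part of the answer's code.

```python
def group_sentences(tokens, strict=True):
    """Group sentences; `strict` (a made-up flag) controls malformed input."""
    result = []
    for token in tokens:
        if token.startswith(("And ", "Or ")):
            if result:
                result[-1].append(token)
            elif strict:
                # Nothing to attach to: report the malformed input.
                raise ValueError("input starts with a conjunction: %r" % token)
            # In non-strict mode the orphaned sentence is silently skipped.
        else:
            result.append([token])
    return result

print(group_sentences(["And orphan.", "Sentence 1.", "Or Sentence 2."], strict=False))
```

Which behavior is appropriate depends on the application, as the comment says; this sketch just makes the decision explicit instead of relying on an IndexError.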

I have a thing for nested iterators and generics, so here's a super-generic approach:

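import re

tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.",
          "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]

class split_by:
    """Lazily split an iterable into sections.

    A new section starts at every item for which predicate(item) is true.
    """
    def __init__(self, iterable, predicate=None):
        self.iter = iter(iterable)
        self.predicate = predicate or bool

        try:
            self.head = next(self.iter)
        except StopIteration:
            self.finished = True
        else:
            self.finished = False

    def __iter__(self):
        return self

    def _section(self):
        yield self.head

        for self.head in self.iter:
            if self.predicate(self.head):
                break  # self.head now holds the first item of the next section

            yield self.head

        else:
            self.finished = True

    def __next__(self):
        if self.finished:
            raise StopIteration

        # Each section must be fully consumed before the next one is requested.
        return self._section()

# \b keeps the pattern from matching words such as "Orwell" or "Andy".
[list(x) for x in split_by(tokens, lambda sentence: not re.match(r"(?i)(?:or|and)\b", sentence))]
#>>> [['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'], ['New Sentence 4.'], ['New Sentence 5.', 'And Sentence 6.']]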

It's longer, but it has O(1) space complexity and accepts a predicate of your choice.

2013-10-01 00:12:30 Veedrac

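Since the output only requires a single level of grouping, it is much better to solve this problem iteratively rather than recursively. If you are looking for a recursive solution, please give an example with arbitrary levels of grouping.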

def is_conjunction(sentence):
    # Naive prefix check: it is also true for words such as "Andy" or "Orwell".
    return sentence.startswith('And') or sentence.startswith('Or')

tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.",
          "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]

def group_sentences_by_conjunction(sentences):
    result = []
    for s in sentences:
        if result and not is_conjunction(s):
            yield result  # flush the last group
            result = []
        result.append(s)
    if result:
        yield result  # flush the rest of the result buffer

>>> groups = group_sentences_by_conjunction(tokens)

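Using the yield statement is the better choice when the result might not fit in memory, for example when reading all the sentences of a book stored in a file. If you need the result as a list for some reason, convert it with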

>>> groups_list = list(groups)

Result:

[['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'], ['New Sentence 4.'], ['New Sentence 5.', 'And Sentence 6.']]

If you need group numbers, use enumerate(groups).
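As a small illustration of the enumerate suggestion, the sketch below numbers the groups from 1 and prints them in the layout used in the question. The helper name `format_groups` is made up for this example; the generator is repeated so that the snippet runs on its own.

```python
def is_conjunction(sentence):
    # Same naive prefix check as in the answer.
    return sentence.startswith('And') or sentence.startswith('Or')

def group_sentences_by_conjunction(sentences):
    result = []
    for s in sentences:
        if result and not is_conjunction(s):
            yield result
            result = []
        result.append(s)
    if result:
        yield result

def format_groups(groups):
    """Render groups as 'Group N:' headers followed by indented sentences."""
    lines = []
    for number, group in enumerate(groups, start=1):
        lines.append("Group %d:" % number)
        lines.extend("    " + sentence for sentence in group)
    return "\n".join(lines)

tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.",
          "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]
print(format_groups(group_sentences_by_conjunction(tokens)))
```

Passing `start=1` makes the numbering begin at 1 rather than the default 0, and the generator is still consumed lazily, one group at a time.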

is_conjunction has the same problem mentioned in the other answers. Modify it as needed to fit your criteria.
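One possible way to adapt the check, shown here only as a sketch, is a regular expression with a word boundary, so that "Andy" and "Orwell" are not mistaken for conjunctions. The `CONJUNCTIONS` tuple and the choice to ignore case and leading whitespace are assumptions for this example, not requirements stated in the question.

```python
import re

# Hypothetical list of conjunctions; extend it to match your own criteria.
CONJUNCTIONS = ("and", "or")

# \b requires a word boundary right after the conjunction.
_CONJUNCTION_RE = re.compile(
    r"(?:%s)\b" % "|".join(re.escape(word) for word in CONJUNCTIONS),
    re.IGNORECASE,
)

def is_conjunction(sentence):
    """Return True if the sentence starts with one of CONJUNCTIONS as a whole word."""
    return _CONJUNCTION_RE.match(sentence.lstrip()) is not None

print(is_conjunction("And Sentence 2."))   # conjunction followed by a space
print(is_conjunction("Andy went home."))   # "And" is only a prefix of a longer word
```

Because the pattern is built from the tuple, adding another word such as "but" only requires changing `CONJUNCTIONS`.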

2013-10-01 08:48:42 IceArdor
