How do I merge sentence objects?

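So I have built a sentence tokenizer that breaks paragraphs up into sentences, words, and characters... each of which is a data type. But the sentence system is a two-stage system, because things like ' . . . ' throw it off, since it goes through one letter at a time; it works fine if it is ' ... ' without spaces.

So the output comes out a bit spliced, but if I could do some secondary processing on it, it would work perfectly. This is where my question comes in... I am not sure how to write a system that lets me append every sentence that has no sentence-ending punctuation to its neighbouring sentence without losing anything.

Here is an example of what the output looks like and what I need it to look like: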

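Some sentence that is spliced...

and has a continuation.

this cannot be confused by U.S.A.

In that

last sentence...

an abbreviation ended the sentence!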

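So the sentence objects that do not end with a normal sentence-ending delimiter, i.e. '.', '?', '!', need to be appended to the next sentence... until there is a sentence with a real sentence-ending delimiter. The other thing that makes this tough is that ' . . . ' counts as a continuation, not as the end of a sentence. So that needs to be appended too.

This is how it needs to be: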

Some sentence that is spliced... and has a continuation.

this cannot be confused by U.S.A.

In that last sentence... an abbreviation ended the sentence!
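The merging rule described above can be sketched on plain strings. This is only an illustration under stated assumptions: the names `ends_sentence` and `merge_fragments` are made up here, strings stand in for the real sentence objects, and fragments are joined with a single space.

```python
TERMINATORS = ('.', '?', '!')


def ends_sentence(text):
    """True if text ends with '.', '?' or '!' but not with an ellipsis."""
    compact = text.replace(' ', '')  # so that '. . .' reads as '...'
    return compact.endswith(TERMINATORS) and not compact.endswith('...')


def merge_fragments(fragments):
    """Merge each unterminated fragment with the fragments that follow it."""
    merged = []
    pending = None  # fragment still waiting for its terminator
    for fragment in fragments:
        pending = fragment if pending is None else pending + ' ' + fragment
        if ends_sentence(pending):
            merged.append(pending)
            pending = None
    if pending is not None:  # keep an unterminated tail instead of dropping it
        merged.append(pending)
    return merged


fragments = [
    'Some sentence that is spliced...',
    'and has a continuation.',
    'this cannot be confused by U.S.A.',
    'In that',
    'last sentence...',
    'an abbreviation ended the sentence!',
]
for sentence in merge_fragments(fragments):
    print(sentence)
```

Because `pending` keeps growing until a real terminator shows up, any number of consecutive fragments can be merged, and nothing is discarded at the start or the end of the input.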

Here is the code I am working with:

last = []
merge = []
for s in stream:
    if last:
        old = last.pop()
        if '.' not in old.as_utf8 and '?' not in old.as_utf8 and '!' not in old.as_utf8:
            # No terminator anywhere in the previous sentence: merge it with this one.
            new = old + s
            merge.append(new)
        else:
            merge.append(s)
    last.append(s)

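So there are a few problems with this approach...

It only appends one sentence to another; if there are two or three that need to be added, it does not append them.
It drops the first sentence if it does not have any punctuation in it.
It does not handle ' . . . ' as a continuation. I know I have not done anything about that here; that is because I am not entirely sure how to approach sentences that end in an abbreviation. I could count how many '.' there are in the sentence, but that would be really thrown off by 'U.S.A.', because that counts as 3 periods.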
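On the period-counting concern in the last point: one option is to look only at the dots that end the text rather than at every dot in the sentence. The sketch below is an assumption about how that could be done, not part of the original code; `TRAILING_DOTS` and `trailing_dot_count` are hypothetical names, and plain strings stand in for the sentence objects.

```python
import re

# A run of dots at the very end of the text, with optional spaces between them.
TRAILING_DOTS = re.compile(r'(?:\.\s*)+$')


def trailing_dot_count(text):
    """Count the dots that end text, ignoring any dots earlier in the sentence."""
    match = TRAILING_DOTS.search(text)
    return match.group(0).count('.') if match else 0


# Print the total dot count next to the trailing dot count for comparison.
for text in ['this cannot be confused by U.S.A.',
             'Some sentence that is spliced. . .',
             'last sentence...']:
    print(text.count('.'), trailing_dot_count(text), repr(text))
```

All three samples contain three dots in total, but only the two ellipsis cases end in a run of three, so the trailing count separates 'U.S.A.' from a continuation where the total count does not.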

So, I have written an __add__ method for the sentence class, so you can do sentence + sentence and that works as a way to append one to the other.
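The real classes were not posted, so the following is only a guess at their shape: a minimal sketch of a sentence type that holds a list of character objects and supports `+`. The names `Char`, `Sentence`, `value`, and `from_text` are hypothetical, and `as_utf8` is modelled as a property returning a plain string, which may differ from the actual implementation.

```python
class Char(object):
    """Hypothetical stand-in for the character class: wraps one character."""

    def __init__(self, value):
        self.value = value


class Sentence(object):
    """Hypothetical stand-in for the sentence class: a list of Char objects."""

    def __init__(self, chars):
        self.chars = list(chars)

    @classmethod
    def from_text(cls, text):
        return cls(Char(c) for c in text)

    @property
    def as_utf8(self):
        # Assumption: the text form is simply the characters joined together.
        return ''.join(c.value for c in self.chars)

    def __add__(self, other):
        if not isinstance(other, Sentence):
            return NotImplemented
        # Build a new Sentence; neither operand is modified.
        return Sentence(self.chars + other.chars)


merged = Sentence.from_text('In that ') + Sentence.from_text('last sentence...')
print(merged.as_utf8)
```

Since `__add__` returns a new object, `a += b` also works: Python falls back to `a = a + b` when no `__iadd__` is defined.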

Any help with this would be greatly appreciated. Let me know if anything is unclear and I will do my best to clarify it.

Source

2013-06-21 AlexW.H.B.

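Can you clarify what you want to be recursive? Do you want a recursive function, or anything that gets the job done? – Wolph

It does not necessarily need to be recursive... I may have used the word too loosely... I meant that I want to keep merging into the same sentence until it reaches a real sentence break. The way of doing that does not have to be recursive. I updated the title so that it is not misleading. –

@WoLpH I think he means recursion –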

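OK, here is some working code. Is this roughly what you need? I am not quite happy with it yet, it looks a bit ugly, but I would like to know whether it is the right direction.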

words = '''Some sentence that is spliced...
and has a continuation.
this cannot be confused by U.S.A.
In that
last sentence...
an abbreviation ended the sentence!'''.split()


def format_sentence(words):
    output = []

    for word in words:
        if word.endswith('...') or not word.endswith(('.', '?', '!')):
            # An ellipsis or a plain word: the sentence continues.
            output.append(word)
            output.append(' ')
        else:
            # A real terminator: the sentence ends here.
            output.append(word)
            output.append('\n')

    return ''.join(output)


print(format_sentence(words))

Output:

Some sentence that is spliced... and has a continuation. 
this cannot be confused by U.S.A. 
In that last sentence... an abbreviation ended the sentence!

Source

2013-06-21 22:36:18 Wolph

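It would work, but... maybe I was a bit unclear about this: I am using a class structure, in that there is a char class, a word class, and a sentence class, and this hierarchy adds some difficulty to the system. One difficulty is that the only way you can append is with the + operator. But other than that, I believe you are on the right track. I really appreciate the help. Here is an example of what the sentence data type looks like: [<__main__.char object at 0x024FB350>, <__main__.char object at 0x024FB370>] .. it is basically a list of chars. –

The reason I have to keep the char objects, rather than just making this simpler, is that I am parsing an EPUB doc with this and I have to keep the HTML intact, so in order to do that I had to build a class structure. –

Dang, your code works really well... maybe I can try to make some modifications to it to fit my system. I really appreciate your help. –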

This "algorithm" tries to make sense of the input without relying on line endings, so that it should work correctly with input like

born in the U. 
S.A.

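The code lends itself to being integrated into a state machine: the loop only remembers the current phrase, "pushes" completed phrases onto a list, and swallows one word at a time. Splitting on whitespace is fine.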
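The state-machine shape described above might look roughly like the following. It is a deliberately simplified sketch: the class name `PhraseMachine` and its methods are hypothetical, it assumes an ellipsis arrives as a single token such as 'sentence...', and it leaves out the abbreviation rules, so an input split as 'U.' followed by 'S.A.' is not handled.

```python
class PhraseMachine(object):
    """Hypothetical word-at-a-time machine; its only state is the current phrase."""

    TERMINATORS = ('.', '?', '!')

    def __init__(self):
        self.phrases = []  # completed phrases are pushed here
        self.phrase = ''   # phrase currently being built

    def _is_complete(self):
        # A terminator ends the phrase unless it is part of an ellipsis.
        return (self.phrase.endswith(self.TERMINATORS)
                and not self.phrase.endswith('...'))

    def feed(self, word):
        """Swallow one word, closing the current phrase first if it is complete."""
        if self._is_complete():
            self.phrases.append(self.phrase)
            self.phrase = ''
        self.phrase += (' ' if self.phrase else '') + word

    def finish(self):
        """Flush whatever is left and return all phrases."""
        if self.phrase:
            self.phrases.append(self.phrase)
            self.phrase = ''
        return self.phrases


machine = PhraseMachine()
for word in 'born in the U.S.A. In that last sentence... it ended!'.split():
    machine.feed(word)
print(machine.finish())
```

Keeping the decision inside `feed` means the caller can stream words from any source without building the full list first.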

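Notice the ambiguity in case #5: it cannot be reliably resolved (and the same kind of ambiguity can arise with line endings too. Maybe combining both...)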

# Sample decoded data
decoded = ['Some', 'sentence', 'that', 'is', 'spliced.', '.', '.',
           'and', 'has', 'a', 'continuation.',
           'this', 'cannot', 'be', 'confused', 'by', 'U.', 'S.', 'A.', 'or', 'U.S.A.',
           'In', 'that', 'last', 'sentence...',
           'an', 'abbreviation', 'ended', 'the', 'sentence!']

# List of phrases
phrases = []

# Current phrase
phrase = ''

while decoded:
    word = decoded.pop(0)
    # Possibilities:
    # 1. phrase has no terminator. Then we surely add word to phrase.
    if not phrase[-1:] in ('.', '?', '!'):
        phrase += ('' if '' == phrase else ' ') + word
        continue
    # 2. There was a terminator. Which?
    #    Say phrase is dot-terminated...
    if '.' == phrase[-1:]:
        # BUT it is terminated by several dots.
        if '..' == phrase[-2:]:
            if '.' == word:
                phrase += '.'
            else:
                phrase += ' ' + word
            continue
        # ...and word is dot-terminated. "by U." and "S.", or "the." and ".".
        if '.' == word[-1:]:
            phrase += word
            continue
        # Do we have an abbreviation?
        if len(phrase) > 3:
            if '.' == phrase[-3:-2]:
                # 5. We have an ambiguity, we solve using capitals.
                if word[:1].upper() == word[:1]:
                    phrases.append(phrase)
                    phrase = word
                    continue
                phrase += ' ' + word
                continue
        # Something else. Then phrase is completed and restarted.
        phrases.append(phrase)
        phrase = word
        continue
    # 3. Another terminator.
    phrases.append(phrase)
    phrase = word
    continue

phrases.append(phrase)

for p in phrases:
    print(">> " + p)

Output:

>> Some sentence that is spliced... and has a continuation. 
>> this cannot be confused by U.S.A. or U.S.A. 
>> In that last sentence... an abbreviation ended the sentence!

Source

2013-06-21 22:51:25 LSerni
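The ambiguity flagged as case #5 in the answer above can be isolated into a small predicate to see where the capital-letter guess goes wrong. This is a sketch, not the answer's code: `looks_like_abbreviation` and `starts_new_sentence` are hypothetical names, and `isupper()` is used here in place of the answer's `word[:1].upper() == word[:1]` comparison.

```python
def looks_like_abbreviation(phrase):
    # Same test as in the answer: a dot two characters before the final dot, as in 'S.A.'.
    return len(phrase) > 3 and phrase.endswith('.') and phrase[-3:-2] == '.'


def starts_new_sentence(phrase, word):
    """Guess whether word begins a new sentence after a dot-terminated phrase."""
    if not looks_like_abbreviation(phrase):
        return True
    # Ambiguous: fall back on capitalisation, which is only a guess.
    return word[:1].isupper()


print(starts_new_sentence('this cannot be confused by U.S.A.', 'or'))
print(starts_new_sentence('this cannot be confused by U.S.A.', 'In'))
# A capitalised word right after 'e.g.' is treated as a new sentence,
# even though a reader would see it as the same sentence.
print(starts_new_sentence('Use a scripting language, e.g.', 'Python'))
```

The last call shows why the case cannot be settled from capitalisation alone: a proper noun following an abbreviation looks exactly like the start of a new sentence.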

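I really appreciate you taking the time to help me out. I will play around with your code. Thanks a lot again. –

Here is the code I ended up using, and it works great... it is mainly based on WoLpH's code, so thanks a lot!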

def merge_sentences(stream):  # function name assumed; the original def line was not posted
    output = stream[:1]
    for line in stream[1:]:  # the first sentence is already in output
        text = output[-1].as_utf8.replace(' ', '')
        if text.endswith('...'):
            # An ellipsis is a continuation.
            output[-1] += line
        elif not text.endswith(('.', '?', '!', '"')) and not text[-1:].isdigit():
            # No terminator yet: keep merging.
            if output[-1] != line:
                output[-1] += line
        else:
            if output[-1] != line:
                output.append(line)
    return output

Source

2013-06-24 20:14:50
