Normalizing text with incorrect separators and wrongly joined words

Suppose I have a bunch of similar strings with noise, mostly words that are wrongly joined or split. For example:

"Once more unto the breach, dear friends. Once more!" 
"Once more unto the breach , dearfriends. Once more!" 
"Once more unto the breach, de ar friends. Once more!" 
"Once more unto the breach, dear friends. Once more!"

How can I normalize each of them to the same list of words? That is:

["once" "more" "unto" "the" "breach" "dear" "friends" "once" "more"]

Thanks!

Source

2012-11-21 konr

Do you always know what the intended sentence should look like? – RonaldBarzell

Unfortunately not. – konr

Well, what have you tried so far? – RonaldBarzell

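Here are a few pointers. I think you will ultimately have to write a set of routines/functions to fix the various irregularities you encounter.

The good news is that you can add to your collection of "fixes" incrementally and keep improving your parser.

I had to do something similar, and I found this post by Peter Norvig very useful. (Note that it is written in Python.)

This function in particular has the ideas you need: split, delete, transpose, and insert characters to "correct" irregular words.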

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # Every string that is one edit away from `word`.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

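The above is a snippet from Norvig's spelling corrector.

Even if you can't use the code as is, the core idea applies to your case: you take a token (a "word"), which in your case is an irregular word, try different tweaks, and check whether the result belongs to a large dictionary of known and accepted words.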
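To make the "tweak, then look it up" idea concrete for joined and split words, here is a minimal sketch. It is not Norvig's code. `KNOWN_WORDS`, `tokenize`, `split_in_two` and `normalize` are hypothetical names, and the tiny vocabulary is only an assumption standing in for a real dictionary.

```python
import re

# Hypothetical toy vocabulary; a real one would come from a large word list.
KNOWN_WORDS = {"once", "more", "unto", "the", "breach", "dear", "friends"}

def tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

def split_in_two(token):
    """Return [left, right] if the token is two known words glued together."""
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in KNOWN_WORDS and right in KNOWN_WORDS:
            return [left, right]
    return None

def normalize(text):
    tokens = tokenize(text)
    result = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in KNOWN_WORDS:
            result.append(token)
            i += 1
            continue
        # Tweak 1: join with the next token ("de" + "ar" -> "dear").
        if i + 1 < len(tokens) and token + tokens[i + 1] in KNOWN_WORDS:
            result.append(token + tokens[i + 1])
            i += 2
            continue
        # Tweak 2: split into two known words ("dearfriends" -> "dear", "friends").
        parts = split_in_two(token)
        result.extend(parts if parts else [token])  # keep unknown tokens as-is
        i += 1
    return result

print(normalize("Once more unto the breach , dearfriends. Once more!"))
print(normalize("Once more unto the breach, de ar friends. Once more!"))
```

This only handles one join or one split per token, so further "fixes" would have to be added as new irregularities show up.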

Hope that helps.

Source

2012-11-21 19:02:39

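A slightly crazy idea, and I am only suggesting it because I am teaching the algorithm I am about to propose to my students this week.

Remove all whitespace from the sentence, e.g. de ar friends becomes dearfriends. There is a quadratic-time, linear-space dynamic programming algorithm that splits a whitespace-free string into the most probable sequence of words. The algorithm is discussed here and here; the second solution is my own. The assumption here is that you have a good model of what a word is, and that querying that model takes constant time.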
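A rough sketch of such a dynamic program is below. It is not the linked solution. The unigram counts in `WORD_COUNTS`, the penalty for unknown words, and the names `word_logprob`, `segment` and `normalize` are all assumptions made for illustration.

```python
import math
import re

# Hypothetical toy unigram counts; a real model would be estimated from a corpus.
WORD_COUNTS = {"once": 2, "more": 2, "unto": 1, "the": 1,
               "breach": 1, "dear": 1, "friends": 1}
TOTAL = sum(WORD_COUNTS.values())

def word_logprob(word):
    """Log-probability of a word; unknown words get a length-based penalty (an assumption)."""
    count = WORD_COUNTS.get(word)
    if count is None:
        return -math.log(TOTAL) - len(word) * math.log(10)
    return math.log(count / TOTAL)

def segment(text):
    """Split a whitespace-free string into its most probable word sequence."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best score for text[:i]
    back = [0] * (n + 1)             # back[i]: start of the last word in text[:i]
    for end in range(1, n + 1):      # quadratic number of (start, end) pairs
        for start in range(end):
            score = best[start] + word_logprob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    words = []
    end = n
    while end > 0:                   # follow the back-pointers from the end
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]

def normalize(sentence):
    """Drop everything except letters, lowercase, then segment."""
    return segment(re.sub(r"[^a-z]", "", sentence.lower()))

print(normalize("Once more unto the breach, de ar friends. Once more!"))
```

Only the two arrays `best` and `back` are stored, which is where the linear space comes from. Capping the candidate word length would reduce the number of inner iterations.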

Source

2012-11-22 20:41:20 mbatchkarov
