How can I best determine the correct capitalization of a word?

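I have a database of sentences written entirely in uppercase letters. The database is technical and contains medical terms, and I want to normalize it so that the capitalization is (close to) what a user would expect. What is the best way to achieve this? Is there a freely available dataset I can use to help with the process?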


2011-10-09 Mike

Medical terms are going to be tough.

This is language-specific, by the way. Is your data in English?

@Alex Yep, all English. – Mike

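Search for work on truecasing: http://en.wikipedia.org/wiki/Truecasing

It would be very easy to generate your own dataset if you have access to similar medical data with normal capitalization. Uppercase everything and use the mapping back to the original text to train and test your algorithm.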
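A minimal sketch of that idea, assuming you have some normally cased text from a similar domain. The `corpus` string is a made-up placeholder, and `build_case_model`, `truecase_tokens` and `accuracy` are hypothetical helper names, not part of any library. The sketch uppercases each token to simulate the database, learns the most frequent original casing for each uppercased form, and measures how often that casing is restored:

```python
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # words, or single punctuation marks


def build_case_model(cased_text):
    """Map each uppercased token to its most frequent original casing."""
    counts = defaultdict(Counter)
    # Naive sentence split; sentence-initial tokens are skipped because
    # their capital letter says nothing about the word itself.
    for sentence in re.split(r"(?<=[.!?])\s+", cased_text):
        tokens = TOKEN_RE.findall(sentence)
        for tok in tokens[1:]:
            counts[tok.upper()][tok] += 1
    return {upper: forms.most_common(1)[0][0] for upper, forms in counts.items()}


def truecase_tokens(upper_tokens, model):
    """Restore casing token by token; unknown tokens fall back to lowercase."""
    restored = [model.get(tok, tok.lower()) for tok in upper_tokens]
    if restored:
        # Uppercase only the first character of the first token.
        restored[0] = restored[0][:1].upper() + restored[0][1:]
    return restored


def accuracy(cased_tokens, model):
    """Fraction of tokens whose original casing is restored exactly."""
    if not cased_tokens:
        return 0.0
    predicted = truecase_tokens([t.upper() for t in cased_tokens], model)
    hits = sum(p == t for p, t in zip(predicted, cased_tokens))
    return hits / len(cased_tokens)


# Placeholder training text; replace with real, normally cased medical text.
corpus = (
    "The patient was given Klonopin. Roche makes Klonopin in 2mg tablets. "
    "The MRI scan was normal."
)
model = build_case_model(corpus)
print(truecase_tokens("THE MRI SCAN WAS NORMAL .".split(), model))
print(accuracy(["The", "MRI", "scan", "was", "normal", "."], model))
```

In practice you would build the model on one part of the cased data and compute `accuracy` on a held-out part, so the score reflects words the model has not memorized.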


2011-10-10 08:57:05 aab

One way would be to use the Python Natural Language Toolkit (NLTK) to infer capitalization from POS tags, for example:

import re

import nltk  # needs the NLTK tokenizer and POS-tagger models (install them with nltk.download)

def truecase(text):
    # lowercase every token, then apply POS-tagging
    tagged_sent = nltk.pos_tag([word.lower() for word in nltk.word_tokenize(text)])
    # infer capitalization from POS-tags: capitalize tokens tagged as nouns (NN, NNS)
    normalized_sent = [w.capitalize() if t in ["NN", "NNS"] else w for (w, t) in tagged_sent]
    # capitalize first word of the text (guarded so empty input does not raise)
    if normalized_sent:
        normalized_sent[0] = normalized_sent[0].capitalize()
    # use regular expression to remove the space before punctuation
    pretty_string = re.sub(r" (?=[.,'!?:;])", "", ' '.join(normalized_sent))
    return pretty_string

This won't be perfect, especially since I don't know exactly what your data looks like, but maybe you can get the idea:

>>> text = "Clonazepam Has Been Approved As An Anticonvulsant To Be Manufactured In 0.5mg, 1mg And 2mg Tablets. It Is The Generic Equivalent Of Roche Laboratories' Klonopin." 
>>> truecase(text) 
"Clonazepam has been approved as an anticonvulsant to be manufactured in 0.5mg, 1mg and 2mg Tablets. It is the generic Equivalent of Roche Laboratories' Klonopin."


2011-10-10 10:32:48 tobigue

Great solution. You might also find this API interesting: [textacy](https://pypi.python.org/pypi/textacy) – Pramit

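The easiest way is to use a spelling correction algorithm based on n-grams.

You can use, for example, the LingPipe SpellChecker. You can find source code for predicting the spaces between words, which is similar to what you would do to predict case.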
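To illustrate the n-gram idea, here is a plain Python sketch. It is not LingPipe's API; `train_bigram_caser` and `restore_case` are hypothetical names, and the training sentences are invented. It counts bigrams over correctly cased tokens, then uses a Viterbi search to pick, for each input token, the casing that best fits its neighbors:

```python
import math
from collections import Counter, defaultdict


def train_bigram_caser(sentences):
    """sentences: list of lists of correctly cased tokens."""
    variants = defaultdict(set)  # lowercased token -> casings seen in training
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        prev = "<s>"  # sentence-start marker
        unigrams[prev] += 1
        for tok in sent:
            variants[tok.lower()].add(tok)
            unigrams[tok] += 1
            bigrams[(prev, tok)] += 1
            prev = tok
    return variants, unigrams, bigrams


def restore_case(tokens, model):
    """Viterbi search for the most likely casing of each token."""
    variants, unigrams, bigrams = model
    vocab = max(len(unigrams), 1)

    def logp(prev, cur):
        # Add-one smoothed bigram log-probability.
        return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))

    # best[c] = (score, path) of the best path ending in casing c
    best = {"<s>": (0.0, [])}
    for tok in tokens:
        # Unknown tokens get a single lowercase candidate.
        candidates = sorted(variants.get(tok.lower(), {tok.lower()}))
        step = {}
        for cand in candidates:
            score, path = max(
                (s + logp(prev, cand), p) for prev, (s, p) in best.items()
            )
            step[cand] = (score, path + [cand])
        best = step
    return max(best.values())[1]


# Invented training data in which "aids" has two context-dependent casings.
training = [
    ["The", "patient", "has", "AIDS", "."],
    ["She", "has", "AIDS", "."],
    ["He", "wears", "hearing", "aids", "."],
    ["The", "clinic", "sells", "hearing", "aids", "."],
]
model = train_bigram_caser(training)
print(restore_case("HE HAS AIDS .".split(), model))
print(restore_case("SHE WEARS HEARING AIDS .".split(), model))
```

Unlike a per-word lookup, the bigram context lets the same uppercase token come out differently depending on the preceding word, which matters for abbreviations that collide with ordinary words.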


2011-10-10 13:55:13 yura
