NLP process for combining common collocations

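I have a corpus that I am processing with the tm package in R (and mirroring the same script with NLTK in Python). I am working with unigrams, but I would like some kind of parser that merges words that commonly occur together into a single token. That is, I would like to stop seeing "New" and "York" separately in my data set when they occur together, and instead see that pair represented as "New York", as if it were a single word, alongside the other unigrams.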

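What is this process called, of putting meaningful, common n-grams on the same footing as unigrams? Or is it not a thing? Finally, what would the tm_map call look like?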

mydata.corpus <- tm_map(mydata.corpus, fancyfunction)

And/or in Python?


2013-12-20 Mittenchops

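This is called collocation finding. The typical approach first filters by POS tags, then computes mutual information and reports all bigrams whose MI exceeds some threshold. *Foundations of Statistical Natural Language Processing* has a chapter devoted to this problem.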
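The scoring step described in this comment can be sketched in plain Python. This is only an illustration under stated assumptions: it scores bigrams with pointwise mutual information (PMI), it replaces the POS-tag filter with a simple minimum-count filter, and the function name `pmi_bigrams`, the parameters `min_count` and `threshold`, and the sample sentence are all made up for the example.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2, threshold=1.0):
    """Return (bigram, pmi) pairs whose PMI exceeds `threshold`, highest first."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scored = []
    for (w1, w2), count in bigrams.items():
        # Assumption: a frequency cutoff stands in for the POS-tag filter.
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        if pmi > threshold:
            scored.append(((w1, w2), pmi))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy input, whitespace-tokenized.
tokens = "i flew to new york and then left new york for a new job".split()
result = pmi_bigrams(tokens)
print(result)
```

On this toy input only ("new", "york") survives the count filter. On a real corpus both cutoffs would need tuning, since PMI tends to favor rare pairs.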

Maybe [named-entity recognition](http://en.wikipedia.org/wiki/Named-entity_recognition)? – arturomp

NLTK link on collocations: http://nltk.org/howto/collocations.html and the chapter @larsman mentioned: http://nlp.stanford.edu/fsnlp/promo/colloc.pdf – arturomp

I recently had a similar question and played around with collocations. This is the solution I chose for identifying pairs of collocated words:

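from nltk import word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Placeholder: replace with a long text read in as a string.
text = "<a long text read in as a string>"

tokenized_text = word_tokenize(text)

# The measures object takes no arguments; the tokens go to the finder.
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokenized_text)
scored = finder.score_ngrams(bigram_measures.raw_freq)

# Bigrams ranked by raw frequency, most frequent first.
sorted(scored, key=lambda s: s[1], reverse=True)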
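Once a set of bigrams has been chosen from a ranking like this, the merge the question asks for could look roughly like the sketch below. It is plain Python rather than an NLTK API. The helper name `merge_collocations`, the `_` separator, and the sample tokens are assumptions made for illustration.

```python
def merge_collocations(tokens, collocations, sep="_"):
    """Join adjacent tokens that form a known bigram into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in collocations:
            # Emit the pair as a single token and skip both words.
            merged.append(sep.join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical input: in practice the set would come from the top-scoring bigrams.
tokens = ["I", "moved", "to", "New", "York", "last", "year"]
top_bigrams = {("New", "York")}
merged = merge_collocations(tokens, top_bigrams)
print(merged)  # ['I', 'moved', 'to', 'New_York', 'last', 'year']
```

The merged tokens can then be counted like any other unigram. NLTK's `MWETokenizer` in `nltk.tokenize` provides similar behavior if a library solution is preferred.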


2017-03-20 17:41:30 Sylvia
