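For examples of using NLTK for collocation extraction, take a look at the following guide: A how-to guide by nltk on collocations extraction
As for the BNC corpus reader, all the information is in the documentation.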
from nltk.corpus.reader.bnc import BNCCorpusReader
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
# Instantiate the reader like this
bnc_reader = BNCCorpusReader(root="/path/to/BNC/Texts", fileids=r'[A-K]/\w*/\w*\.xml')
# Say you want to extract all bigram collocations and then
# sort them by their frequency; this is what you would do.
# Again, take a look at the link to the nltk guide on collocations for more examples.
list_of_fileids = ['A/A0/A00.xml', 'A/A0/A01.xml']
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(bnc_reader.words(fileids=list_of_fileids))
scored = finder.score_ngrams(bigram_measures.raw_freq)
print(scored)
The output will look like this:
[(('of', 'the'), 0.004902261167963723), (('in', 'the'),0.003554139346773699),
(('.', 'The'), 0.0034315828175746064), (('Gift', 'Aid'), 0.0019609044671854894),
((',', 'and'), 0.0018996262025859428), (('for', 'the'), 0.0018383479379863962), ... ]
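To make the scores above less opaque, here is a minimal pure-Python sketch of what a raw-frequency score amounts to: the number of times a bigram occurs divided by the total number of words. This assumes NLTK's `raw_freq` normalises by the word count, which is my reading of its behaviour rather than something stated in this answer. The function name `raw_freq_scores` and the toy sentence are made up for illustration.

```python
from collections import Counter


def raw_freq_scores(words):
    """Score each adjacent word pair as (bigram count) / (total word count)."""
    words = list(words)
    bigram_counts = Counter(zip(words, words[1:]))
    total = len(words)
    scored = [(bigram, count / total) for bigram, count in bigram_counts.items()]
    # Highest score first; ties are broken alphabetically by bigram.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical toy input standing in for bnc_reader.words(...)
tokens = "the cat sat on the mat and the cat slept".split()
toy_scored = raw_freq_scores(tokens)
print(toy_scored[:3])
```

On a real corpus you would still use the finder from the answer; this only shows where a number like 0.0049 for ('of', 'the') comes from.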
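Note that score_ngrams already returns the bigrams ordered from highest to lowest score. If you instead want just the bigrams, without their scores, in alphabetical order, you can try something like this: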
sorted_bigrams = sorted(bigram for bigram, score in scored)
print(sorted_bigrams)
Resulting in:
[('!', 'If'), ('!', 'Of'), ('!', 'Once'), ('!', 'Particularly'), ('!', 'Raising'),
('!', 'YOU'), ('!', '‘'), ('&', 'Ealing'), ('&', 'Public'), ('&', 'Surrey'),
('&', 'TRAINING'), ("'", 'SPONSORED'), ("'S", 'HOME'), ("'S", 'SERVICE'), ... ]
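The `sorted()` call above compares the bigram tuples themselves, which is why the result is alphabetical and starts with punctuation. If you want an explicit ordering by score, sort on the second element of each pair. A rough sketch follows, using a hand-copied sample of the scores printed earlier (the names `scored_sample`, `by_score`, and `words_only` are illustrative); the punctuation filter is an optional extra, not part of the original answer.

```python
# Sample copied from the output shown earlier; the real list is much longer.
scored_sample = [
    (('of', 'the'), 0.004902261167963723),
    (('in', 'the'), 0.003554139346773699),
    (('.', 'The'), 0.0034315828175746064),
    (('Gift', 'Aid'), 0.0019609044671854894),
    ((',', 'and'), 0.0018996262025859428),
    (('for', 'the'), 0.0018383479379863962),
]

# Explicit sort by score, highest first.
by_score = sorted(scored_sample, key=lambda item: item[1], reverse=True)

# Optionally keep only bigrams whose tokens are purely alphabetic,
# which drops pairs such as ('.', 'The') and (',', 'and').
words_only = [
    (bigram, score)
    for bigram, score in by_score
    if all(token.isalpha() for token in bigram)
]

top_bigrams = [bigram for bigram, _ in words_only[:3]]
print(top_bigrams)
```

With the full `scored` list from the finder, replace `scored_sample` with `scored`; the rest stays the same.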
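What is your goal? Do you have to use NLTK? I'm not very familiar with Python and have never used NLTK, but I have processed the BNC in Java using Stanford Core NLP. My goal was to build a proper corpus to parse in order to get the dependencies between word pairs. So, starting from the BNC XML files, I re-created each sentence with an XML parser, and then processed each sentence with Core NLP. If your goal is only to import the corpus, I honestly can't answer you, but as a last resort you could still convert the XML texts to txt format, pass them to Python, and process them as strings.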
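@s.dallapalma Hi. I don't have to use NLTK, but I do need to be able to use some library to find "collocations" of words. I looked at Stanford Core NLP, but was told it doesn't have a collocations feature. – jason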