How to find collocations in text, Python

How do you find collocations in text? A collocation is a sequence of words that occur together unusually often. Python has a built-in func bigrams that returns word pairs.

>>> bigrams(['more', 'is', 'said', 'than', 'done']) 
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')] 
>>>

What's left is to find the bigrams that occur more often, based on word frequency. Any ideas how to put that in code?

2010-11-08 Gusto

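You will have to define *more often*. Do you mean statistical significance? – 2010-11-08 22:12:55

Python has no such built-in, nor anything by that name in the standard library. – 2010-11-08 22:17:35

Please use the nltk library: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.collocations-module.html – 2010-11-08 22:17:59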
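
For reference, the bigrams function in the question is presumably NLTK's nltk.bigrams rather than anything built into Python. A minimal pure-Python sketch of the same idea, assuming the input is already a list of words (the name bigrams here only mirrors the question and is not a standard-library function):

```python
def bigrams(words):
    """Return the adjacent word pairs of a list of words, in order."""
    # zip stops at the shorter input, so the last word has no partner
    return list(zip(words, words[1:]))


pairs = bigrams(['more', 'is', 'said', 'than', 'done'])
print(pairs)
```

This only produces the pairs; deciding which of them count as collocations still needs some frequency or association measure.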

Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:

>>> import nltk
>>> def tokenize(sentences):
...     for sent in nltk.sent_tokenize(sentences.lower()):
...         for word in nltk.word_tokenize(sent):
...             yield word
...

>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.')) 
<Text: mary had a little lamb ....> 
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))

There are no collocations in this small segment, but here is how you would ask for them:

>>> text.collocations(num=20) 
Building collocations list
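
A minimal sketch of using BigramCollocationFinder directly, as mentioned at the top of this answer. It assumes NLTK is installed; the sample word list and the frequency threshold of 2 are made up for illustration, and PMI is only one of the scoring functions that BigramAssocMeasures offers.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Illustrative input only; a real corpus should be far larger.
words = ("the red herring was a red herring and the clue was a red herring "
         "but the herring was red").split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)           # drop bigrams seen fewer than 2 times
top = finder.nbest(measures.pmi, 10)  # up to 10 bigrams, best PMI first

print(top)
print(len(top))
```

The result is a plain list of word pairs, so slicing it and taking its length both work as usual.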

2010-11-08 23:10:44

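Can it work on unicode text? I get an error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-8: ordinal not in range(128) – Gusto 2010-11-09 23:03:58

Unicode works fine for most operations. 'nltk.Text' may have problems, because it is just a helper class written to help linguistics students and it sometimes gets caught out. It is mainly intended for demonstration purposes. – 2010-11-10 18:10:58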

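Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists.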

words = ["more", "is", "said", "than", "done", "is", "said"]
words_iter = iter(words)
next(words_iter, None)  # skip the first word so each word pairs with its successor
count = {}
for bigram in zip(words, words_iter):
    count[bigram] = count.get(bigram, 0) + 1
print(sorted(((c, b) for b, c in count.items()), reverse=True))

(words_iter is introduced to avoid copying the whole list of words, as you would with zip(words, words[1:]).)

2010-11-08 22:36:55

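Nice work, but your code serves another purpose - I only need the collocations (without any counts or the like). In the end I need to return 10 colloc-s ('collocations[:10]') and their total number, using 'len(collocations)' – Gusto 2010-11-08 22:52:07

You haven't really defined what you actually need. Maybe give some example input and some example output. – 2010-11-08 22:54:37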

from collections import Counter

words = ['more', 'is', 'said', 'than', 'done']
nextword = iter(words)
next(nextword)  # advance by one so zip pairs each word with the next one
freq = Counter(zip(words, nextword))  # bigram -> number of occurrences
print(freq)
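
If the goal is what the comment above describes, the ten most frequent bigrams plus the total number of distinct bigrams, a Counter already covers it. A minimal sketch, assuming raw frequency is an acceptable ranking (the variable names top_ten and total are only illustrative):

```python
from collections import Counter

words = ["more", "is", "said", "than", "done", "is", "said"]
freq = Counter(zip(words, words[1:]))  # bigram -> number of occurrences

# most_common(10) returns (bigram, count) pairs; keep only the bigrams
top_ten = [bigram for bigram, _ in freq.most_common(10)]
total = len(freq)  # number of distinct bigrams

print(top_ten)
print(total)
```

Note that words[1:] copies the list, which is fine for small inputs but wasteful for large ones.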

2010-11-08 23:09:50

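A collocation is a sequence of tokens that is better treated as a single token when parsing; for example, the meaning of "red herring" cannot be derived from its components. Deriving a useful set of collocations from a corpus involves ranking the n-grams by some statistic (n-gram frequency, mutual information, log-likelihood, etc.), followed by judicious manual editing.

Points you appear to be overlooking: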

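(1) The corpus must be fairly large ... trying to get collocations from a single sentence, as you seem to suggest, is pointless.

(2) n can be greater than 2 ... for example, analysing texts about 20th-century Chinese history will throw up "significant" bigrams such as "Mao Tse" and "Tse Tung".

What exactly are you trying to achieve? What code have you written so far?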
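
The ranking step described above can be sketched in plain Python. This is only an illustration under stated assumptions: it uses pointwise mutual information as the statistic, estimates probabilities from raw counts, and the function name rank_bigrams_by_pmi and its min_count parameter are hypothetical, not part of any library.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(words, min_count=2):
    """Return (bigram, pmi) pairs sorted by PMI, highest first."""
    unigram_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    n_words = len(words)
    n_bigrams = max(n_words - 1, 1)

    scored = []
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # rare pairs get inflated PMI, so filter them out
        p_pair = count / n_bigrams
        p_w1 = unigram_counts[w1] / n_words
        p_w2 = unigram_counts[w2] / n_words
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy input only; as point (1) says, a real corpus must be much larger.
corpus = ("the red herring was a red herring and the clue was a red herring "
          "but the herring was red").split()
ranked = rank_bigrams_by_pmi(corpus)
collocations = [bigram for bigram, _ in ranked[:10]]
print(collocations)
```

On an input this small the ordering is not meaningful, which is exactly why a large corpus and manual review of the candidates are needed.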

2010-11-09 00:11:14

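I agree with Tim McNamara about using nltk and about the unicode problems. However, I like the Text class a lot - there is a hack you can use to get the collocations as a list, which I found by looking at the source code. Apparently, whenever you call the collocations method, it stores the result as an attribute of the object!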

import nltk

def tokenize(sentences):
    for sent in nltk.sent_tokenize(sentences.lower()):
        for word in nltk.word_tokenize(sent):
            yield word

text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
text.collocations(num=20)
# _collocations is a private attribute filled in by the collocations() call above
collocations = [" ".join(el) for el in list(text._collocations)]

Enjoy!

2017-12-18 12:13:56
