2017-02-24 44 views

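Python - Ignoring numbers and symbols in bigram frequencies

I'm trying to find bigram frequencies in the text of a txt file. So far it works, but it also counts numbers and symbols. Here is my code: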

import nltk
import prettytable

# Read the tweets and split them into tokens
with open('tweets.txt') as f:
    tokens = nltk.word_tokenize(f.read())

pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'

# Count every pair of adjacent tokens
bgs = nltk.bigrams(tokens)
fdist = nltk.FreqDist(bgs)

for row in fdist.most_common(100):
    pt.add_row(row)
print(pt)


Below is the code output: 
+------------------------------------+--------+
| Words                              | Counts |
+------------------------------------+--------+
| ('https', ':')                     |   1615 |
| ('!', '#')                         |    445 |
| ('Thank', 'you')                   |    386 |
| ('.', '``')                        |    358 |
| ('.', 'I')                         |    354 |
| ('.', 'Thank')                     |    337 |
| ('``', '@')                        |    320 |
| ('&', 'amp')                       |    290 |

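Is there a way to ignore numbers and symbols (like !, ., ?, :)? Since the text is tweets, I want to ignore numbers and symbols, except for the #'s and @'s.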

Answer


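Calling most_common() on the bigram fdist returns a sequence of (bigram, count) pairs, where each bigram is itself a tuple of two tokens. So we need to look inside each bigram tuple and keep only the pairs we want, together with their counts. Try: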

import nltk
from nltk.probability import FreqDist
from nltk.util import ngrams
from pprint import pprint

def filter_most_common_bigrams(mc_bigrams_counts):
    """Keep only the (bigram, count) pairs whose bigram passes the test below."""
    filtered_mc_bigrams_counts = []
    for bigram, count in mc_bigrams_counts:
        # Keep word-word bigrams, plus '#' or '@' followed by a word
        if all([gram.isalpha() for gram in bigram]) or bigram[0] in "#@" and bigram[1].isalpha():
            filtered_mc_bigrams_counts.append((bigram, count))
    return tuple(filtered_mc_bigrams_counts)

text = """Is there a way to ignore numbers and symbols (like !,.,?,:)?
Since the text are tweets, I want to ignore numbers and symbols, except for the #'s and @'s
https: !# . Thank you . `` 12 hi . 1st place 1 love 13 in @twitter # twitter"""

tokenized_text = nltk.word_tokenize(text)
bigrams = ngrams(tokenized_text, 2)
fdist = FreqDist(bigrams)
mc_bigrams_counts = fdist.most_common(100)
pprint(filter_most_common_bigrams(mc_bigrams_counts))

The key part of the code is:

if all([gram.isalpha() for gram in bigram]) or bigram[0] in "#@" and bigram[1].isalpha(): 
    filtered_mc_bigrams_counts.append((bigram, count)) 

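This checks that both tokens in the bigram consist only of letters or, alternatively, that the first token is a # or @ symbol and the second token consists of letters. Because "and" binds more tightly than "or", the condition reads as (both alphabetic) or (symbol followed by a word). Only the pairs that meet it are appended, and each bigram is kept together with its count from fdist.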
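To see which bigrams the condition lets through, here is a minimal sketch that applies the same test to a few hand-written bigrams taken from the output in the question. The helper name keep_bigram and the sample list are made up for illustration, and the helper compares against the tuple ("#", "@") instead of the string "#@" to avoid substring matching, which is a small deviation from the original condition.

```python
def keep_bigram(bigram):
    """Hypothetical helper mirroring the condition in the answer."""
    first, second = bigram
    # Case 1: both tokens are purely alphabetic
    both_alpha = first.isalpha() and second.isalpha()
    # Case 2: '#' or '@' followed by an alphabetic token
    tagged_word = first in ("#", "@") and second.isalpha()
    return both_alpha or tagged_word


# Hand-written samples (assumption: these resemble word_tokenize output)
samples = [
    ("Thank", "you"),
    ("#", "twitter"),
    ("@", "twitter"),
    ("https", ":"),
    ("&", "amp"),
    ("1st", "place"),
    (".", "I"),
]

kept = [bg for bg in samples if keep_bigram(bg)]
print(kept)
```

Note that ('&', 'amp') and ('1st', 'place') are rejected even though one token is a word, because str.isalpha() is False for any token containing a digit or symbol.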

Result:

((('to', 'ignore'), 2), 
(('and', 'symbols'), 2), 
(('ignore', 'numbers'), 2), 
(('numbers', 'and'), 2), 
(('for', 'the'), 1), 
(('@', 'twitter'), 1), 
(('Is', 'there'), 1), 
(('text', 'are'), 1), 
(('a', 'way'), 1), 
(('Thank', 'you'), 1), 
(('want', 'to'), 1), 
(('Since', 'the'), 1), 
(('I', 'want'), 1), 
(('#', 'twitter'), 1), 
(('the', 'text'), 1), 
(('are', 'tweets'), 1), 
(('way', 'to'), 1), 
(('except', 'for'), 1), 
(('there', 'a'), 1))
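One caveat with the approach above: the filter runs after most_common(100), so the final list can hold fewer than 100 bigrams. A possible variation, sketched below under stated assumptions, is to filter the bigrams before counting them. To stay self-contained, the sketch uses collections.Counter as a stand-in for nltk.FreqDist (which subclasses Counter in NLTK 3) and a made-up token list in place of word_tokenize output; keep_bigram is a hypothetical helper name.

```python
from collections import Counter


def keep_bigram(bigram):
    """Hypothetical helper: word-word, or '#'/'@' followed by a word."""
    first, second = bigram
    return (first.isalpha() and second.isalpha()) or (
        first in ("#", "@") and second.isalpha()
    )


# Made-up tokens standing in for nltk.word_tokenize(...) output
tokens = ["Thank", "you", "!", "#", "nlp", ".",
          "Thank", "you", "@", "nltk", "12", "times"]

# Pair adjacent tokens first, then drop unwanted pairs, then count.
# Pairing before filtering avoids creating bigrams from tokens that
# were not actually next to each other in the text.
all_bigrams = zip(tokens, tokens[1:])
counts = Counter(bg for bg in all_bigrams if keep_bigram(bg))

top = counts.most_common(3)
print(top)
```

With real data, the same generator expression could be passed to FreqDist instead of Counter, so that most_common(100) returns up to 100 bigrams that all pass the filter.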