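Python - Ignore numbers and symbols in bigram frequency
I am trying to find bigram frequencies in the text of a txt file. So far it works, but it also counts numbers and symbols. Here is my code: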
import nltk
from nltk.collocations import *
import prettytable

# Read the whole tweet file into one string and tokenize it
file = open('tweets.txt').read()
tokens = nltk.word_tokenize(file)

pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'

# Count every pair of adjacent tokens
bgs = nltk.bigrams(tokens)
fdist = nltk.FreqDist(bgs)
for row in fdist.most_common(100):
    pt.add_row(row)
print(pt)
Below is the code output:
+------------------------------------+--------+
| Words                              | Counts |
+------------------------------------+--------+
| ('https', ':')                     |   1615 |
| ('!', '#')                         |    445 |
| ('Thank', 'you')                   |    386 |
| ('.', '``')                        |    358 |
| ('.', 'I')                         |    354 |
| ('.', 'Thank')                     |    337 |
| ('``', '@')                        |    320 |
| ('&', 'amp')                       |    290 |
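Is there a way to ignore numbers and symbols (such as , : ! ?)? Since the text consists of tweets, I want to ignore numbers and symbols, except for #'s.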
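One possible approach, sketched here with only the standard library, is to filter the tokens before the bigrams are built. The helper names `tweet_tokens` and `bigram_counts` and the `keep_prefixes` parameter are hypothetical, not part of NLTK. The sketch also assumes that URLs should be dropped and that HTML entities such as `&amp;` should be decoded first, since both show up in the output above.

```python
import html
import re
from collections import Counter

def tweet_tokens(text, keep_prefixes="#"):
    """Return word tokens only; a word may keep one leading symbol from keep_prefixes."""
    text = html.unescape(text)                  # "&amp;" becomes "&", which is then ignored
    text = re.sub(r"https?://\S+", " ", text)   # assumption: URLs are noise
    prefix = "[%s]?" % re.escape(keep_prefixes) if keep_prefixes else ""
    # Letters only (apostrophes allowed inside a word). The lookarounds reject
    # fragments of mixed tokens such as "abc123" and words whose "#" or "@"
    # prefix is not in keep_prefixes.
    pattern = r"(?<![\w#@])" + prefix + r"[^\W\d_]+(?:'[^\W\d_]+)*(?!\w)"
    return re.findall(pattern, text)

def bigram_counts(text, keep_prefixes="#"):
    """Count adjacent pairs of the filtered tokens."""
    tokens = tweet_tokens(text, keep_prefixes)
    return Counter(zip(tokens, tokens[1:]))

if __name__ == "__main__":
    sample = "Thank you! #MAGA https://t.co/abc123 Thank you &amp; 2016 @user"
    for pair, count in bigram_counts(sample).most_common(3):
        print(pair, count)
```

Because punctuation is removed before pairing, a bigram can span a sentence boundary ("you. I" yields `('you', 'I')`). If that is not wanted, the bigrams could instead be built first and any pair containing a non-matching token discarded. Passing `keep_prefixes="#@"` would also keep mentions. The filtered token list can be handed to `nltk.bigrams` and `nltk.FreqDist` exactly as in the question, and NLTK's `TweetTokenizer` may be worth a look as a tokenizer that keeps hashtags and mentions as single tokens.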