Word frequency calculation for a 1 GB text file in Python

I am trying to calculate word frequencies for a 1.2 GB text file containing roughly 120 million words. I am using the Python code below, but it raises a MemoryError. Is there a solution?

Here is my code:

import re 
# this one in honor of 4th July, or pick text file you have!!!!!!! 
filename = 'inputfile.txt' 
# create list of lower case words, \s+ --> match any whitespace(s) 
# you can replace file(filename).read() with given string 
word_list = re.split('\s+', file(filename).read().lower()) 
print 'Words in text:', len(word_list) 
# create dictionary of word:frequency pairs 
freq_dic = {} 
# punctuation marks to be removed 
punctuation = re.compile(r'[.?!,":;]') 
for word in word_list: 
    # remove punctuation marks 
    word = punctuation.sub("", word) 
    # form dictionary 
    try:
        freq_dic[word] += 1
    except:
        freq_dic[word] = 1

print 'Unique words:', len(freq_dic) 
# create list of (key, val) tuple pairs 
freq_list = freq_dic.items() 
# sort by key or word 
freq_list.sort() 
# display result 
for word, freq in freq_list: 
    print word, freq

Here is the error I receive:

Traceback (most recent call last):
  File "count.py", line 6, in <module>
    word_list = re.split('\s+', file(filename).read().lower())
  File "/usr/lib/python2.7/re.py", line 167, in split
    return _compile(pattern, flags).split(string, maxsplit)
MemoryError

2013-02-03 Jyotiska

The problem starts right here:

file(filename).read()

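This reads the entire file into a single string. If you instead process the file line by line or chunk by chunk, you will not run into the memory problem.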

with open(filename) as f:
    for line in f:
        pass  # process one line at a time here
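
The loop above covers the line-by-line case. For the chunk-by-chunk case (for example, a file with very long lines or no newlines at all), a rough sketch might look like the following. It reads a fixed number of characters at a time and carries any word cut off at a chunk boundary over to the next chunk. The name iter_words and the chunk_size default are illustrative assumptions, not part of the original answer, and the code is written for Python 3, where the file() builtin used in the question no longer exists.

```python
import re


def iter_words(path, chunk_size=1 << 20):
    """Yield lower-cased, whitespace-separated words from a text file.

    Only about chunk_size characters are held in memory at a time.
    iter_words and chunk_size are hypothetical names for this sketch.
    """
    leftover = ''  # partial word carried over from the previous chunk
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            pieces = re.split(r'\s+', leftover + chunk.lower())
            # The last piece may be an incomplete word, so hold it back
            # until the next chunk (or the end of the file) completes it.
            leftover = pieces.pop()
            for word in pieces:
                if word:  # skip empty strings from leading whitespace
                    yield word
    if leftover:
        yield leftover
```

Because this is a generator, the words can be fed straight into whatever does the counting without ever building the full word list in memory.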

You may also benefit from using collections.Counter to count the word frequencies.

In [1]: import collections 

In [2]: freq = collections.Counter() 

In [3]: line = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod' 

In [4]: freq.update(line.split()) 

In [5]: freq 
Out[5]: Counter({'ipsum': 1, 'amet,': 1, 'do': 1, 'sit': 1, 'eiusmod': 1, 'consectetur': 1, 'sed': 1, 'elit,': 1, 'dolor': 1, 'Lorem': 1, 'adipisicing': 1})

And to count some more words:

In [6]: freq.update(line.split()) 

In [7]: freq 
Out[7]: Counter({'ipsum': 2, 'amet,': 2, 'do': 2, 'sit': 2, 'eiusmod': 2, 'consectetur': 2, 'sed': 2, 'elit,': 2, 'dolor': 2, 'Lorem': 2, 'adipisicing': 2})

A collections.Counter is a subclass of dict, so you can use it in the ways you are already familiar with. It also has some useful counting methods, such as most_common.
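
Putting the two suggestions together, one possible sketch of the whole task is shown here: iterate over the file line by line and feed each line's words into a single Counter. The function name count_words is an assumption made for illustration, the punctuation set is copied from the question's code, and the code targets Python 3. Unlike the question's code, tokens that consist only of punctuation are dropped rather than counted as an empty word.

```python
import collections
import re

# Same punctuation set as in the question's code.
PUNCTUATION = re.compile(r'[.?!,":;]')


def count_words(path):
    """Return a Counter of lower-cased words, reading one line at a time.

    count_words is a hypothetical helper name for this sketch.
    """
    freq = collections.Counter()
    with open(path) as f:
        for line in f:
            # Strip punctuation first, then split on whitespace;
            # str.split() with no arguments never returns empty strings.
            words = PUNCTUATION.sub('', line.lower()).split()
            freq.update(words)
    return freq
```

With this in place, count_words('inputfile.txt').most_common(10) would return the ten most frequent words as (word, count) pairs. Memory use then grows with the number of unique words rather than with the size of the file.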

2013-02-03 15:57:28 unutbu

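The problem is that you are trying to read the entire file into memory. The solution is to read the file line by line, count the words on each line, and sum the results.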
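
As a minimal sketch of that approach using only plain dictionaries, the per-line counts can be computed separately and then added into a running total. The helper names count_line and count_file are assumptions for illustration and the code targets Python 3. Punctuation handling is left out to keep the example short.

```python
def count_line(line):
    """Count the words on a single line (hypothetical helper)."""
    counts = {}
    for word in line.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def count_file(path):
    """Sum the per-line counts over the whole file, one line at a time."""
    totals = {}
    with open(path) as f:
        for line in f:
            for word, n in count_line(line).items():
                totals[word] = totals.get(word, 0) + n
    return totals


def print_sorted(totals):
    """Print 'word count' pairs sorted by word, as in the question."""
    for word, n in sorted(totals.items()):
        print(word, n)
```

Using dict.get with a default of 0 avoids the bare try/except from the question's code, and only one line plus the running totals is held in memory at any time.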

2013-02-03 15:56:00 ChrisBlom
