As Antti mentioned, you should prefer python3 and leave all that annoying python2 crap behind you. The following script works with both python2 and python3.
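To read/write files, use the `open` function from the `io` module, which is python2/python3 compatible. Always use the `with` statement to open resources such as files. `with` wraps the execution of a block in a Python context manager. File objects implement the context manager protocol and are closed automatically when the `with` block is left.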
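A minimal sketch of that behaviour, assuming a throwaway temporary file; the helper name `closed_after_with` is made up for this illustration and is not part of the script further down:

```python
import io
import os
import tempfile


def closed_after_with():
    """Return (closed_inside, closed_after) for a file opened via `with`."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)  # only the path is needed; io.open reopens the file below
    try:
        with io.open(path, "w", encoding="utf-8") as f:
            f.write(u"hello\n")
            closed_inside = f.closed  # still open inside the block
        closed_after = f.closed       # closed by the context manager on exit
    finally:
        os.remove(path)               # clean up the temporary file
    return closed_inside, closed_after


print(closed_after_with())
```

The file is also closed when the block is left through an exception, which is the main reason to prefer `with` over a manual `close()` call.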
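Independent of Python, if you want to read a text file you need to know its encoding in order to read it correctly (if you are unsure, try `utf-8` first). Aside from that, the correct codec name for UTF-8 is `utf-8`, and mode `U` (universal newlines) is the default, so it does not need to be passed.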
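To see why the encoding matters, here is a small sketch of the "try utf-8 first" advice. The helper `decode_utf8_first` and the choice of `latin-1` as fallback are assumptions made for this example only:

```python
def decode_utf8_first(raw, fallback="latin-1"):
    """Decode bytes as UTF-8; fall back to another codec if that fails."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never raises,
        # but the result is only correct if the guess was right.
        return raw.decode(fallback), fallback


euro_bytes = u"100 \u20ac".encode("utf-8")   # the Euro sign takes 3 bytes in UTF-8
mojibake = euro_bytes.decode("latin-1")      # no error, but 7 garbled characters

print(decode_utf8_first(euro_bytes))
print(len(mojibake))
```

Note that decoding with the wrong codec does not always raise an error; it can silently produce garbage, which is why knowing the real encoding beats guessing.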
#!/usr/bin/env python
# -*- coding: utf-8; mode: python -*-

from nltk.util import ngrams
import collections
import io, sys

def main(inFile, outFile):

    # Read the whole file as text, split it on whitespace and build the
    # n-grams (n=2 in this demo).
    with io.open(inFile, encoding="utf-8") as i:
        bigrams = ngrams(i.read().split(), 2)

    # Count how often each n-gram (a tuple of words) occurs.
    result = collections.Counter(bigrams)

    # Left-aligned count column, 10 characters wide, followed by the words.
    templ = u"%-10s %s\n"

    with io.open(outFile, "w", encoding="utf-8") as o:
        o.write(templ % (u"count", u"words"))
        o.write(templ % (u"-" * 10, u"-" * 30))

        # Sorting might be expensive. Before sorting, filter out the items
        # you don't want to handle; btw. place *count* in front of the tuple
        # so that the sort orders by count.
        filtered = [(c, w) for w, c in result.items() if c > 1]
        filtered.sort(reverse=True)

        for count, item in filtered:
            o.write(templ % (count, u" ".join(item)))

if __name__ == '__main__':
    sys.exit(main("text.txt", "out_text.txt"))
With the input file `text.txt`:
At eight o'clock on Thursday morning and Arthur didn't feel very good
he missed 100 € on Thursday morning. The Euro symbol of 100 € is here
to test the encoding of non ASCII characters, because encoding errors
do occur only on Thursday morning.
I get the following `out_text.txt`:
count      words
---------- ------------------------------
3          on Thursday
2          Thursday morning.
2          100 €
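If NLTK is not at hand, the counting step can be sketched with the standard library alone. This is only an illustration under stated assumptions: `bigram_counts` is a hypothetical helper, `zip(words, words[1:])` stands in for `ngrams(words, 2)`, and the sample text is embedded as a string instead of being read from `text.txt`:

```python
import collections


def bigram_counts(text, min_count=2):
    """Return (count, bigram) rows, most frequent first."""
    words = text.split()
    # Pair each word with its successor; this replaces ngrams(words, 2).
    counts = collections.Counter(zip(words, words[1:]))
    # Filter before sorting and put the count first so it drives the order.
    rows = [(c, w) for w, c in counts.items() if c >= min_count]
    rows.sort(reverse=True)
    return rows


sample = (
    u"At eight o'clock on Thursday morning and Arthur didn't feel very good\n"
    u"he missed 100 \u20ac on Thursday morning. The Euro symbol of 100 \u20ac is here\n"
    u"to test the encoding of non ASCII characters, because encoding errors\n"
    u"do occur only on Thursday morning.\n"
)

for count, words in bigram_counts(sample):
    print(u"%-10s %s" % (count, u" ".join(words)))
```

Because the text is split on whitespace only, `morning` and `morning.` are different tokens, which is why `Thursday morning.` is counted twice rather than three times.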
If you are a complete beginner to Python, and especially since it seems you are doing NLP, I suggest you switch to Python 3 entirely!