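Python code to determine the tf-idf of each tweet in a .txt file
I would like help with reading the contents of a .txt file (treating the tweets as individual documents) and determining the tf-idf of each tweet.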
# -*- coding: utf-8 -*-
from __future__ import division, unicode_literals
import math
from textblob import TextBlob as tb
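def tf(word, blob):
    # Term frequency: fraction of the words in this document equal to `word`.
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    # Count documents containing `word` as a token (blob.words), not as a substring.
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    # Inverse document frequency; the 1 + avoids division by zero.
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)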
document1 = tb("""RT @brides: These are 5 hidden jobs no one one tells about one maids-of-honor one about. You're welcome: jobs http://t.co/qybBewFDre
This brides week on brides twitter: One new brides follower via http://t.co/0NP5Wz70Op""")
document2 = tb("""Python, from the Greek word (πύθων/πύθωνας), is a genus of
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are
recognised.[2] A member of this genus, P. reticulatus, is among the longest
snakes known.""")
document3 = tb("""The Colt Python is a .357 Magnum caliber revolver formerly
manufactured by Colt's Manufacturing Company of Hartford, Connecticut.
It is sometimes referred to as a "Combat Magnum".[1] It was first introduced
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment. Some firearm
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy
Thompson, Renee Smeets and Martin Dougherty have described the Python as the
finest production revolver ever made.""")
bloblist = [document1, document2, document3]
for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))
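Thanks for the suggestion, but it doesn't solve my problem. Let me try again. I have a .txt file containing a Twitter feed that looks like this: Line 1 Twitter Feed 1 - @gracieataylor come find me; Line 2 Twitter Feed 2 - #TempeChen FYI // whooping cough is making a comeback; and 1000+ more tweets like these. I want to be able to determine the tf-idf of each tweet, line by line. – Deepayan 2015-02-05 20:34:06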
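The line-by-line requirement described here could be sketched roughly as follows, assuming each non-empty line of the file is one tweet. This is only an illustration, not a tested solution for the actual feed: it uses plain whitespace splitting instead of TextBlob, and the names `tfidf_per_line` and `tfidf_from_file` are hypothetical helpers introduced for this sketch.

```python
# -*- coding: utf-8 -*-
import io
import math


def tf(word, words):
    # Term frequency of `word` within one tweet (a list of tokens).
    return words.count(word) / float(len(words))


def n_containing(word, docs):
    # Number of tweets that contain `word` as a token.
    return sum(1 for words in docs if word in words)


def idf(word, docs):
    # Same smoothed formula as in the question: log(N / (1 + df)).
    return math.log(len(docs) / float(1 + n_containing(word, docs)))


def tfidf_per_line(lines):
    """Treat every non-empty line as one document (one tweet).

    Returns one list per tweet of (word, score) pairs, highest score first.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    docs = [line.lower().split() for line in lines if line.strip()]
    results = []
    for words in docs:
        scores = {w: tf(w, words) * idf(w, docs) for w in set(words)}
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        results.append(ranked)
    return results


def tfidf_from_file(path):
    # `path` is whatever .txt file holds the feed, one tweet per line.
    with io.open(path, encoding="utf-8") as f:
        return tfidf_per_line(f)
```

Calling `tfidf_from_file("tweets.txt")` (the file name is a placeholder) would then give one ranked list per tweet, and slicing each list with `[:3]` reproduces the "top words" output from the question.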
Look here: https://docs.python.org/2/library/stdtypes.html#string-methods. If Python's standard string methods aren't enough for you, you can use regular expressions and Python's re module: https://docs.python.org/2/library/re.html – grzgrzgrz3 2015-02-06 09:24:01
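As an illustration of the `re` suggestion, here is a small sketch of a tweet tokenizer. The patterns and the function name `tokenize_tweet` are assumptions made for this example, not something from the thread: it lowercases the text, drops URLs, and keeps @mentions, #hashtags, and words with inner apostrophes or hyphens as single tokens.

```python
import re

# Assumed patterns for illustration; adjust them to the actual feed.
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[@#]?\w+(?:['-]\w+)*")


def tokenize_tweet(text):
    """Split one tweet into lowercase tokens, ignoring URLs and punctuation."""
    text = URL_RE.sub(" ", text.lower())
    return TOKEN_RE.findall(text)
```

The resulting token lists can be fed to a tf-idf computation in place of a plain `str.split()`, so that trailing punctuation such as `@brides:` does not produce a separate term.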