2015-02-05 29 views

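Python code to determine the tf-idf of each tweet in a .txt file

I would like help reading the contents of a .txt file (treating each tweet as a single document) and determining the tf-idf of each tweet.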

# -*- coding: utf-8 -*- 
from __future__ import division, unicode_literals 
import math 
from textblob import TextBlob as tb 

def tf(word, blob): 
    return blob.words.count(word)/len(blob.words) 

def n_containing(word, bloblist): 
    # compare against the tokenized words; "word in blob" would also match substrings
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist): 
    return math.log(len(bloblist)/(1 + n_containing(word, bloblist))) 

def tfidf(word, blob, bloblist): 
    return tf(word, blob) * idf(word, bloblist) 

document1 = tb("""RT @brides: These are 5 hidden jobs no one one tells about one maids-of-honor one about. You're welcome: jobs http://t.co/qybBewFDre 
This brides week on brides twitter: One new brides follower via http://t.co/0NP5Wz70Op""") 

document2 = tb("""Python, from the Greek word (πύθων/πύθωνας), is a genus of 
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are 
recognised.[2] A member of this genus, P. reticulatus, is among the longest 
snakes known.""") 

document3 = tb("""The Colt Python is a .357 Magnum caliber revolver formerly 
manufactured by Colt's Manufacturing Company of Hartford, Connecticut. 
It is sometimes referred to as a "Combat Magnum".[1] It was first introduced 
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued 
Colt Python targeted the premium revolver market segment. Some firearm 
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy 
Thompson, Renee Smeets and Martin Dougherty have described the Python as the 
finest production revolver ever made.""") 

bloblist = [document1, document2, document3] 
for i, blob in enumerate(bloblist): 
    print("Top words in document {}".format(i + 1)) 
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words} 
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True) 
    for word, score in sorted_words[:3]: 
        print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))

Answers


I'm not sure I understood you correctly.

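file_names = ['file1.txt', 'file2.txt']
# read each file as one document; "with" closes the file when the block ends
documents = []
for file_name in file_names:
    with open(file_name) as f:
        documents.append(f.read())
# create one blob per document (tb is TextBlob, imported as in the question)
bloblist = [tb(document) for document in documents]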

You can find more about reading and writing files here: https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
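If all the tweets live in a single file with one tweet per line, as the comments below describe, each line can be treated as its own document. A minimal sketch, assuming a UTF-8 text file; `read_tweets` and the file name `tweets.txt` are hypothetical names used only for illustration:

```python
import io

def read_tweets(path):
    """Return the non-empty lines of a text file, one tweet per list item."""
    # io.open accepts an encoding argument on both Python 2 and Python 3
    with io.open(path, encoding="utf-8") as f:
        # strip surrounding whitespace and skip blank lines
        return [line.strip() for line in f if line.strip()]
```

Each line could then be wrapped the same way as the documents in the question, for example `bloblist = [tb(tweet) for tweet in read_tweets('tweets.txt')]`.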

You can parse your strings from a file like this:

example_string ="""Twitter feed 1: foo 
Twitter feed 2: bar 
Twitter feed 3: foobar 
""" 

# parsing using Python string methods:
for line in example_string.splitlines():
    # the message is everything after the first colon
    msg_start_pos = line.find(':') + 1
    tweet_msg = line[msg_start_pos:].strip()
    print(tweet_msg)

# using regular expressions
import re
pattern = re.compile(r'^Twitter feed [0-9]+:(.*?)$', re.MULTILINE)
for tweet in pattern.finditer(example_string):
    print(tweet.group(1).strip())

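Thank you for the suggestion, but it does not solve my problem. Let me try again. I have a .txt file containing a Twitter feed that looks like this: Line 1: Twitter Feed 1 - @gracieataylor come find me; Line 2: Twitter Feed 2 - #TempeChen FYI // whooping cough is bouncing back; followed by more than 1000 similar tweets. I would like to determine the tf-idf of each tweet, line by line. – Deepayan 2015-02-05 20:34:06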


Look here: https://docs.python.org/2/library/stdtypes.html#string-methods. If Python's standard string methods are not enough for you, you can use regular expressions and Python's re module: https://docs.python.org/2/library/re.html – grzgrzgrz3 2015-02-06 09:24:01
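The line-by-line case from the comments can be sketched with the `re` module and the formulas from the question. This is only an illustration under stated assumptions: the prefix format ("Twitter Feed <n> - ..." or "Twitter feed <n>: ...") is guessed from the examples in this thread, the regex tokenizer is a simple stand-in for TextBlob's `blob.words`, and `PREFIX`, `tokenize`, and `tfidf_per_tweet` are hypothetical names.

```python
from __future__ import division
import math
import re

# Assumed line prefix, based on the examples in this thread
PREFIX = re.compile(r'^\s*(?:Line \d+\s+)?Twitter Feed \d+\s*[-:]\s*', re.IGNORECASE)

def tokenize(tweet):
    """Drop the assumed prefix, lowercase the tweet, and split it into tokens."""
    text = PREFIX.sub('', tweet).lower()
    # keep @mentions and #hashtags as single tokens (stand-in for blob.words)
    return re.findall(r"[@#]?\w+", text)

def tfidf_per_tweet(tweets):
    """Return one {word: score} dict per tweet, each tweet being one document."""
    docs = [tokenize(t) for t in tweets]
    n_docs = len(docs)
    # document frequency: the number of tweets in which each word occurs
    df = {}
    for words in docs:
        for word in set(words):
            df[word] = df.get(word, 0) + 1
    scores = []
    for words in docs:
        tweet_scores = {}
        for word in set(words):
            tf = words.count(word) / len(words)
            idf = math.log(n_docs / (1 + df[word]))  # same smoothing as the question
            tweet_scores[word] = tf * idf
        scores.append(tweet_scores)
    return scores

tweets = [
    "Twitter Feed 1 - foo bar",
    "Twitter Feed 2 - bar baz",
    "Twitter Feed 3 - foobar",
    "Twitter Feed 4 - qux",
]
for i, tweet_scores in enumerate(tfidf_per_tweet(tweets)):
    top = sorted(tweet_scores.items(), key=lambda x: x[1], reverse=True)[:3]
    print("Tweet {}: {}".format(i + 1, top))
```

Note that with this smoothed idf, a word that occurs in all but one of the tweets scores zero, and a word that occurs in every tweet scores below zero, so very common words sink to the bottom of each tweet's ranking.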