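Finding document frequency using Python
Hey everyone, I know this has been asked a few times, but I'm having trouble finding document frequency using Python. I'm trying to compute TF-IDF and then the cosine score between each document and a query, but I'm stuck on finding the document frequency. This is what I have so far: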
#includes
import re
import os
import operator
import glob
import sys
import math
from collections import Counter
#number of command line argument checker
if len(sys.argv) != 3:
    print 'usage: ./part3_soln2.py "path to folder in quotation marks" query.txt'
    sys.exit(1)
#Read in the directory to the files
path = sys.argv[1]
#Read in the query
y = sys.argv[2]
querystart = re.findall(r'\w+', open(y).read().lower())
query = [Z for Z in querystart]
Query_vec = Counter(query)
print Query_vec
#counts total number of documents in the directory
doccounter = len(glob.glob1(path,"*.txt"))
if os.path.exists(path) and os.path.isfile(y):
    word_TF = []
    word_IDF = {}
    TFvec = []
    IDFvec = []
    #this is my attempt at finding IDF
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_IDF = re.findall(r'\w+', open(filename).read().lower())
        doc_IDF = [A for A in words_IDF if len(A) >= 3 and A.isalpha()]
        word_IDF = doc_IDF
        #psudocode!!
        """
        for key in word_idf:
            if key in word_idf:
                word_idf[key] =+1
            else:
                word_idf[key] = 1
        print word_IDF
        """
    #goes to that directory and reads in the files there
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_TF = re.findall(r'\w+', open(filename).read().lower())
        #scans each document for words greater or equal to 3 in length
        doc_TF = [A for A in words_TF if len(A) >= 3 and A.isalpha()]
        #this assigns values to each term this is my TF for each vector
        TFvec = Counter(doc_TF)
        #weighing the Tf with a log function
        for key in TFvec:
            TFvec[key] = 1 + math.log10(TFvec[key])
    #placed here so I dont get a command line full of text
    print TFvec
#Error checker
else:
    print "That path does not exist"
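I'm using Python 2, and so far I really have no idea how to count how many documents a term appears in. I can find the total number of documents, but I'm stuck on finding the number of documents that contain a given term. My plan was to build one large dictionary holding every term from every document, so the values can be looked up later when the query needs them. Thanks for any help you can give me.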
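One possible way to get the document frequency is to convert each document's tokens to a set before counting, so that a term adds at most one to the count per document. The following is only a minimal sketch under stated assumptions: the helper names `tokenize`, `document_frequency` and `inverse_document_frequency` are hypothetical, the in-memory strings in `docs` stand in for the `*.txt` files read from `path`, and `log10(N / df)` is just one common idf variant, not necessarily the one required here. The code avoids `print` so it should run on both Python 2 and Python 3.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of length >= 3,
    mirroring the filter used in the question."""
    return [w for w in re.findall(r'\w+', text.lower())
            if len(w) >= 3 and w.isalpha()]


def document_frequency(docs):
    """Map each term to the number of documents that contain it."""
    df = Counter()
    for text in docs:
        # set() makes sure each term is counted once per document
        df.update(set(tokenize(text)))
    return df


def inverse_document_frequency(df, num_docs):
    """Assumed idf variant: log10(N / df). float() avoids Python 2 integer division."""
    return dict((term, math.log10(float(num_docs) / count))
                for term, count in df.items())


# Hypothetical sample data standing in for the text files on disk
docs = ["the cat sat on the mat", "the dog chased the cat", "birds sing"]
df = document_frequency(docs)
idf = inverse_document_frequency(df, len(docs))
```

With real files, `docs` could be built by reading each path returned by `glob.glob(os.path.join(path, '*.txt'))`. Keeping `df` and `idf` as plain dictionaries means the query terms can be looked up later without rescanning the documents.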
Is there a reason you're trying to implement this yourself instead of using a library: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html –
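I read about that one, but I have to record the tf and idf values, and I figured that would be easier if I implemented it myself. Also, I'll be reading in a directory of about 100 text files, so again I thought it would be easier than using scikit – Sean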
Also, I'll have to compute the cosine on the tf-idf vectors later. Does scikit have that functionality too? – Sean
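On the cosine question: scikit-learn does provide `sklearn.metrics.pairwise.cosine_similarity`. For a hand-rolled approach that keeps the tf-idf weights in dictionaries, the cosine score could be computed roughly as below. This is a sketch rather than a definitive implementation: the function name `cosine_score` and the sample vectors are hypothetical, and each vector is assumed to be a `{term: weight}` dictionary.

```python
import math


def cosine_score(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as
    {term: weight} dictionaries. Returns 0.0 if either vector is empty."""
    # Only terms present in both vectors contribute to the dot product
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical tf-idf weights for one document and one query
doc_vec = {"cat": 3.0, "mat": 4.0}
query_vec = {"cat": 1.0}
score = cosine_score(doc_vec, query_vec)
```

Ranking the documents would then amount to calling `cosine_score` once per document vector against the query vector and sorting by the result.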