Python 2.7: Creating a tf:idf script using dictionaries


I want to write a script that uses dictionaries to get the tf:idf (ratio?).

The idea is to have the script find all .txt files in a directory and its subdirectories using os.walk:

import fnmatch
import os

files = []
for root, dirnames, filenames in os.walk(directory):
    for filename in fnmatch.filter(filenames, '*.txt'):
        files.append(os.path.join(root, filename))

It then uses that list to find all the words and how many times each one appears:

import re
from collections import Counter

def word_sort(filename3):
    # Count every word in the file, upper-cased, skipping the stop words.
    with open(filename3) as f3:
        passage = f3.read()
    stop_words = "THE OF A TO AND IS IN YOU THAT IT THIS YOUR AS AN BUT FOR".split()
    words = re.findall(r'\w+', passage)
    cap_words = [word.upper() for word in words if word.upper() not in stop_words]
    word_sort = Counter(cap_words)
    return word_sort

term_freq_per_file = {}
for file in files:
    term_freq_per_file[file] = word_sort(file)

It ends up with a dictionary like this:

'/home/seb/Learning/ex15_sample.txt': Counter({'LOTS': 2, 'STUFF': 2, 'HAVE': 1, 
            'I': 1, 'TYPED': 1, 'INTO': 1, 'HERE': 1, 
             'FILE': 1, 'FUN': 1, 'COOL': 1,'REALLY': 1}),

In my mind, this gives me the word frequency per file.

How would I go about finding the actual tf?

And how would I find the idf?

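By tf I mean term frequency: how many times a word (term) appears in a document.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

By idf I mean inverse document frequency, where the document frequency is the number of documents in which the word appears.

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).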
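As a worked illustration of the two definitions above, here is a minimal sketch that plugs numbers into them. The counts are the ones shown for ex15_sample.txt; the document totals used for the idf line (3 documents, 2 of them containing the term) are made-up values for illustration only.

```python
from __future__ import division  # so / is true division on Python 2.7 too

import math
from collections import Counter

# Counts copied from the ex15_sample.txt entry shown above.
counts = Counter({'LOTS': 2, 'STUFF': 2, 'HAVE': 1, 'I': 1, 'TYPED': 1,
                  'INTO': 1, 'HERE': 1, 'FILE': 1, 'FUN': 1, 'COOL': 1,
                  'REALLY': 1})

# Total number of terms in the document: the sum of all the counts.
total_terms = sum(counts.values())

# TF(t) for every term in this one document.
tf = dict((term, n / total_terms) for term, n in counts.items())

# IDF(t) with assumed numbers: 3 documents in total, 2 containing the term.
total_docs = 3          # hypothetical
docs_with_term = 2      # hypothetical
idf = math.log(total_docs / docs_with_term)
```

Here total_terms is 13, so tf['LOTS'] is 2/13. The stop words were filtered out before counting, so they are not part of that total.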

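To clarify, my question is how to extract these values and put them into the formulas. I know they are there, but I don't know how to extract them and use them further.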

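I have decided to make another dictionary that records which files each word has been used in, like this:

{word : (file1, file2, file3)}

by iterating through the first dictionary like this: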

for file in tfDic:
    word = tfDic[file][Counter]
    for word in tfDic:
        if word not in dfDic.keys():
            dfDic.setdefault(word,[]).append(file)
        if word in dfDic.keys():
            dfDic[word].append(file)

The problem is on this line:

word = tfDic[file][Counter]

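I thought it would 'navigate' to the words, but I noticed that the words are keys in the Counter dictionary, which is itself the value of tfDic[file].

My question is: how do I tell it to iterate through the words (the keys of the 'Counter' dictionary)?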
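One way to reach the words, sketched here under the assumption that tfDic maps each path to a Counter as shown earlier: index by file to get the inner Counter, then iterate over it, which yields its keys. The sample data and the name files_by_word are hypothetical.

```python
from collections import Counter

# Hypothetical sample shaped like tfDic: {path: Counter({word: count})}.
tfDic = {
    'one.txt': Counter({'HERE': 1, 'STUFF': 2}),
    'two.txt': Counter({'HERE': 3}),
}

files_by_word = {}
for path in tfDic:
    counts = tfDic[path]          # the inner Counter for this file
    for word in counts:           # iterating a dict or Counter yields its keys
        files_by_word.setdefault(word, []).append(path)

# Sort the lists so the result does not depend on dict iteration order.
for paths in files_by_word.values():
    paths.sort()
```

There is no need to index with the name Counter itself; it is only the type of the inner value, not a key.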


2014-08-27 Sebastian

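Could you explain more clearly what you expect 'tf' and 'idf' to be, and what they mean to you... – 2014-08-27 14:01:39

Are they weighted by certain words? – 2014-08-27 14:03:14

Looking at the dictionary, you already have "the number of times term t appears in a document", "the total number of documents" and "the number of documents with term t in it". So is your question "How do I get the total number of terms in a document?" – Kevin 2014-08-27 14:15:48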

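If you want to stick with your current data structure, you have to go through the whole structure, checking every file, for each word whose idf you want to calculate.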

import math

# assume the term you are looking for is in the variable term
df = 0
for file in files:
    if term in term_freq_per_file[file]:
        df += 1
# float() avoids integer division on Python 2.7
idf = math.log(float(len(files)) / df)

An earlier version of this answer contained a sketch of an alternative data structure, but this is probably good enough.
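The snippet above assumes that term already holds the word being looked up. As a rough sketch of one way to get the idf of every word in a single pass instead, the document frequencies can be tallied first and the logarithm applied afterwards. The sample data and the names doc_freq and idf are illustrative assumptions, not part of the original answer.

```python
from __future__ import division  # avoid integer division on Python 2.7

import math
from collections import Counter

# Hypothetical stand-in for term_freq_per_file: {path: Counter of words}.
term_freq_per_file = {
    'a.txt': Counter({'LOTS': 2, 'STUFF': 2}),
    'b.txt': Counter({'STUFF': 1, 'COOL': 3}),
    'c.txt': Counter({'STUFF': 4}),
}

# Document frequency: in how many files does each word occur at least once?
doc_freq = Counter()
for counts in term_freq_per_file.values():
    doc_freq.update(counts.keys())   # each word adds 1 per file, not its count

total_docs = len(term_freq_per_file)
idf = dict((word, math.log(total_docs / df)) for word, df in doc_freq.items())
```

Every word in doc_freq occurs in at least one file, so the division cannot hit zero here; looking up a word that appears in no file would need separate handling.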


2014-08-27 15:00:58 tripleee

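I have replaced my answer with something completely different. Please refresh. – tripleee 2014-08-28 10:53:56

You may want to delete your now-obsolete comments, as I did with mine. (Click the small gray X on the right, which becomes visible when you hover over the comment.) – tripleee 2014-08-28 10:54:21

Thanks. How do I know what to put after the 'if'? I get an error saying 'term' is not defined. This is one of the things that confuses me. Do I need to change my function so that it 'reacts' to 'term' or 'word'? – Sebastian 2014-08-28 11:24:41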

(Finally)

I decided to go back and change my word-counting code so that instead of:

word_sort = Counter(cap_words)

I iterate through the list of words and build my own dictionary of how many times each one appears:

word_sort = {} 
for term in cap_words: 
    word_sort[term] = cap_words.count(term)

So instead of a sub-dictionary (Counter) each time, I end up with this for tfDic:

'/home/seb/Learning/ex17output.txt': {'COOL': 1, 
            'FILE': 1, 
            'FUN': 1, 
            'HAVE': 1, 
            'HERE': 1, 
            'I': 1, 
            'INTO': 1, 
            'LOTS': 2, 
            'REALLY': 1, 
            'STUFF': 2, 
            'TYPED': 1},

Then I iterate through the keys of tfDic[file] to build another dictionary that records which files a given word has been used in:

for file in tfDic:
    word = tfDic[file].keys()
    for word in tfDic[file]:
        if word not in dfDic.keys():
            dfDic.setdefault(word,[]).append(file)
        if word in dfDic.keys():
            dfDic[word].append(file)

The end result is, for example:

'HERE': ['/home/seb/Learning/ex15_sample.txt', 
     '/home/seb/Learning/ex15_sample.txt', 
     '/home/seb/Learning/ex17output.txt'],
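Note that the first path appears twice in that list. With two separate if statements, a word seen for the first time is appended by the first branch and then, since it is now a key, appended again by the second. A possible sketch that avoids this, assuming the same tfDic shape, collects the paths in a set. The sample data and the use of defaultdict(set) are illustrative, not the author's code.

```python
from collections import defaultdict

# Hypothetical sample shaped like tfDic: {path: {word: count}}.
tfDic = {
    'ex15_sample.txt': {'HERE': 1, 'LOTS': 2},
    'ex17output.txt': {'HERE': 1, 'COOL': 1},
}

dfDic = defaultdict(set)
for path in tfDic:
    for word in tfDic[path]:
        dfDic[word].add(path)   # a set stores each path at most once

# The document frequency of a word is then the size of its set.
here_df = len(dfDic['HERE'])
```

Keeping the list and replacing the second if with else would also remove the duplicate; the set simply makes the "at most once" rule explicit.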

Now I plan to just 'extract' these values and put them into the formulas mentioned earlier.
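One possible way to do that extraction, sketched under the assumption that tfDic maps each path to a {word: count} dictionary and dfDic maps each word to the list of paths containing it. The sample data and the name tfidf are hypothetical; set() is applied to each list so that a path listed twice is counted once.

```python
from __future__ import division  # true division on Python 2.7 as well

import math

# Hypothetical data in the two shapes described above.
tfDic = {
    'one.txt': {'HERE': 1, 'LOTS': 3},
    'two.txt': {'HERE': 2, 'COOL': 2},
}
dfDic = {
    'HERE': ['one.txt', 'one.txt', 'two.txt'],  # duplicate kept on purpose
    'LOTS': ['one.txt'],
    'COOL': ['two.txt'],
}

total_docs = len(tfDic)
tfidf = {}
for path, counts in tfDic.items():
    total_terms = sum(counts.values())      # denominator of TF(t)
    tfidf[path] = {}
    for word, count in counts.items():
        tf = count / total_terms
        idf = math.log(total_docs / len(set(dfDic[word])))
        tfidf[path][word] = tf * idf
```

A word that occurs in every file gets an idf of 0 and therefore a tf-idf of 0, as HERE does in this sample.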


2014-08-29 10:37:37 Sebastian

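'Counter' is just a subclass of 'dict', so it has the same methods. I agree that having 'Counter' in the output is a bit misleading; for your purposes it really is just a dictionary, and you should ignore the 'Counter' identifier. – tripleee 2014-08-29 11:20:12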
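A small sketch of what that comment means in practice: a Counter can be iterated, indexed and converted like a plain dict. The word list here is made up for illustration.

```python
from collections import Counter

cap_words = ['LOTS', 'STUFF', 'LOTS', 'FUN', 'STUFF']  # hypothetical input

as_counter = Counter(cap_words)

# The hand-rolled equivalent from the answer above.
as_dict = {}
for term in cap_words:
    as_dict[term] = cap_words.count(term)

is_dict = isinstance(as_counter, dict)         # Counter subclasses dict
same_contents = (dict(as_counter) == as_dict)  # same keys and same counts
words = sorted(as_counter.keys())              # dict methods work unchanged
```

Both approaches hold the same data; the hand-rolled loop calls list.count once per word and so rescans the list each time, while Counter tallies in a single pass.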

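Unless this is a learning exercise in how tf-idf works, I'd recommend using the classes built into scikit-learn to do this.

First, create an array of count dictionaries, one per file. Then feed that array of count dictionaries to DictVectorizer, and feed the resulting sparse matrix to TfidfTransformer.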

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

dv = DictVectorizer()
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
X = dv.fit_transform(D)  # sparse count matrix, one row per document
tv = TfidfTransformer()
tfidf = tv.fit_transform(X)
print(tfidf.toarray())  # SciPy sparse matrices expose toarray(), not to_array()


2017-01-15 16:24:05
