
I'm very new to Python, so I'm sure this is something simple that I'm just not doing right, but I can't figure it out. I've created a similarity matrix for each document in my corpus, and I want to assign these back to a dictionary keyed by document name, to keep track of the similarities between each document: dynamically assigning each document's similarity matrix so it can be exported to JSON.

However, it keeps assigning the last matrix to every key, rather than each key's corresponding matrix.

import pandas as pd 
import numpy as np 
import nltk 
import string 
from collections import Counter 
from nltk.corpus import stopwords 
from nltk.stem.porter import PorterStemmer 

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import cosine_similarity 
import json 
import os 

path = "stories/" 
token_dict = {} 
stemmer = PorterStemmer() 

def tokenize(text): 
    tokens = nltk.word_tokenize(text) 
    stems = stem_tokens(tokens, stemmer) 
    return stems 

def stem_tokens(tokens, stemmer): 
    stemmed_words = [] 
    for token in tokens: 
        stemmed_words.append(stemmer.stem(token)) 
    return stemmed_words 


for subdir, dirs, files in os.walk(path): 
    for file in files: 
        file_path = subdir + os.path.sep + file 
        with open(file_path, "r", encoding="utf-8") as story: 
            text = story.read() 
            lowers = text.lower() 
            # avoid shadowing the built-in map() 
            table = str.maketrans('', '', string.punctuation) 
            no_punctuation = lowers.translate(table) 
            token_dict[story.name.split("\\", 1)[1]] = no_punctuation 

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english') 
tfs = tfidf.fit_transform(token_dict.values()) 

termarray = tfs.toarray() 
nparray = np.array(termarray) 
rows, cols = nparray.shape 

similarity = [] 
for document in docdict: 
    for row in range(0, rows-1): 
        similarity = cosine_similarity(tfs[row:row+1], tfs) 
        docdict[document] = similarity 
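The inner loop above is the culprit: for every `document` key it recomputes `similarity` over all rows and keeps only the last value, so every key ends up with the same matrix. A minimal sketch of the fix, pairing each key with its own row of a precomputed similarity matrix (the names and values here are toy stand-ins, not the real corpus):

```python
import numpy as np

# Toy stand-in for cosine_similarity(tfs): a 3x3 pairwise similarity matrix
sim = np.array([[1.0, 0.2, 0.3],
                [0.2, 1.0, 0.4],
                [0.3, 0.4, 1.0]])

names = ["story1", "story2", "story3"]  # stand-in for token_dict keys

# One assignment per document: zip keys with rows instead of
# overwriting every key inside a nested loop
docdict = {name: row for name, row in zip(names, sim)}
```

Each key now gets the row that belongs to it, because the iteration over keys and the iteration over rows advance together.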

Everything works fine up until the assignment back into the dictionary.

This produces the following dictionary:

{'98ststory1.txt': array([[ 0.10586559, 0.04742287, 0.02478352, 0.06587952, 0.12907377, 
     0.07661095, 0.06941533, 0.05443182, 0.06616549, 0.0266565 , 
     0.04640984, 0.03356339, 0.02529364, 0.08210173, 0.16172138, 
     0.05594719, 0.10231466, 0.03556236, 0.18374215, 0.0588386 , 
     0.16857304, 0.08866461, 0.12510476, 0.07107058, 0.0751615 , 
     0.06371055, 0.16820855, 0.07926561, 0.02590006, 0.03690054, 
     0.01513446, 0.04677632, 0.11693509, 1.  , 0.06086615]]), 
'alfredststory1.txt': array([[ 0.10586559, 0.04742287, 0.02478352, 0.06587952, 0.12907377, 
     0.07661095, 0.06941533, 0.05443182, 0.06616549, 0.0266565 , 
     0.04640984, 0.03356339, 0.02529364, 0.08210173, 0.16172138, 
     0.05594719, 0.10231466, 0.03556236, 0.18374215, 0.0588386 , 
     0.16857304, 0.08866461, 0.12510476, 0.07107058, 0.0751615 , 
     0.06371055, 0.16820855, 0.07926561, 0.02590006, 0.03690054, 
     0.01513446, 0.04677632, 0.11693509, 1.  , 0.06086615]]), 
'alfredststory2.txt': array([[ 0.10586559, 0.04742287, 0.02478352,  0.06587952, 0.12907377, 
     0.07661095, 0.06941533, 0.05443182, 0.06616549, 0.0266565 , 
     0.04640984, 0.03356339, 0.02529364, 0.08210173, 0.16172138, 
     0.05594719, 0.10231466, 0.03556236, 0.18374215, 0.0588386 , 
     0.16857304, 0.08866461, 0.12510476, 0.07107058, 0.0751615 , 
     0.06371055, 0.16820855, 0.07926561, 0.02590006, 0.03690054, 
     0.01513446, 0.04677632, 0.11693509, 1.  , 0.06086615]])} 

Every one of the documents is assigned the matrix of the second-to-last document. That by itself would be a simple problem, but the real issue is that they are all assigned the same matrix.

For a single document, the matrix I get looks like this:

array([[ 1.  , 0.07015725, 0.01593837, 0.05618977, 0.03892873, 
     0.02434279, 0.06029888, 0.02261425, 0.03531677, 0.02975444, 
     0.01835854, 0.02145624, 0.00985163, 0.03645598, 0.0497407 , 
     0.04482995, 0.06677013, 0.03153055, 0.10919878, 0.12029462, 
     0.07255828, 0.05499581, 0.06330188, 0.04719668, 0.08909685, 
     0.04484428, 0.06725359, 0.04453039, 0.02381673, 0.02639529, 
     0.01012012, 0.0218679 , 0.09989828, 0.10586559, 0.01535069]]) 

This is the first document's corresponding similarity to each of the other documents. What I want is a dictionary that looks like this:

{ 
    story1: 
      { 
       story1: 1., 
       story2: 0.07015725, 
       story3: 0.01593837, 
       story4: 0.05618977... 
      }, 
    story2: 
      { 
       story1: ... 
      } 
} 

..and so on.

A sample dataset looks like this:

story1 = """Four other streets were renamed in Cork at the turn of the last century to celebrate this event: Wolfe Tone St. (Previously Fair Lane), John Philpot Curran St. (Philpot’s Lane), Emmet (Nelson’s) Place and Sheare’s (Nile) St.""" 
story2 = """Oliver Plunkett Street was originally named George's Street after George I, the then reigning King of Great Britain and Ireland. In 1920, during the Burning of Cork, large parts of the street were destroyed by British troops.""" 
story3 = """Alfred Street is a connecting Street between Kent Train Station and MacCurtain Street. Present Cork city centre signage uses letters inspired by the book of Kells. This has been an inspiration for many typefaces in the past, including the Petrie's 'B' typface, and Monotype's 'Column Cille', which was widely used for school textbooks.""" 

Running these through the script produces the following similarity matrices:

[[ 1.   0.05814422 0.06032458]] 
[[ 0.05814422 1.   0.21323354]] 
[[ 0.06032458 0.21323354 1.  ]] 

Each of these is a 1*n matrix of that document's similarities. I want to turn this into a dictionary that lets me look up a given document's similarity to each of the other documents, like this:

{ 
    story1: { 
       story1: 1., 
       story2: 0.05814422, 
       story3: 0.06032458 
      }, 
    story2: { 
       story1: 0.05814422, 
       story2: 1., 
       story3: 0.21323354 
      }, 
    story3: { 
       story1: 0.06032458, 
       story2: 0.21323354, 
       story3: 1. 
      } 
} 
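One way to build that nested structure without pandas is a dict comprehension over the document names and the stacked matrix. This is a sketch, assuming the three 1*n rows are stacked into one n*n array and the document names are known:

```python
import numpy as np

names = ["story1", "story2", "story3"]
sim = np.array([[1.0,        0.05814422, 0.06032458],
                [0.05814422, 1.0,        0.21323354],
                [0.06032458, 0.21323354, 1.0]])

# Outer key: a document; inner dict: its similarity to every document
similarity_dict = {
    outer: {inner: float(sim[i, j]) for j, inner in enumerate(names)}
    for i, outer in enumerate(names)
}
```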

I'm sure this is a basic question, but my knowledge of Python data structures is lacking; any help would be greatly appreciated!


Please provide a small (3-5 lines) but reproducible dataset (text/CSV format) and the desired result set. This will help people reproduce your problem and understand it better. See [How to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – MaxU


I've just added a bit more, which I hope clarifies what I'm trying to do! – thetrainfiasco


Yes, it's clear now, thanks! I've added an answer - please check... – MaxU

Answer


Assuming you have the following similarity matrix:

sim = cosine_similarity(tfs) 

In [261]: sim 
Out[261]: 
array([[ 1.  , 0.09933054, 0.08911641], 
     [ 0.09933054, 1.  , 0.27252107], 
     [ 0.08911641, 0.27252107, 1.  ]]) 

Note: we don't need a loop to compute the similarity matrix.

Using the Pandas module, we can do the following:

In [262]: df = pd.DataFrame(sim, 
          columns=list(token_dict.keys()), 
          index=list(token_dict.keys())) 

Resulting DataFrame:

In [263]: df 
Out[263]: 
      story1 story2 story3 
story1 1.000000 0.099331 0.089116 
story2 0.099331 1.000000 0.272521 
story3 0.089116 0.272521 1.000000 

Now we can easily convert the DataFrame to a dict:

In [264]: df.to_dict() 
Out[264]: 
{'story1': {'story1': 1.0000000000000009, 
    'story2': 0.099330538266243495, 
    'story3': 0.089116410701360893}, 
'story2': {'story1': 0.099330538266243495, 
    'story2': 0.99999999999999911, 
    'story3': 0.27252107037687257}, 
'story3': {'story1': 0.089116410701360893, 
    'story2': 0.27252107037687257, 
    'story3': 1.0}} 

or write it straight to JSON:

df.to_json('/path/to/file.json') 
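If the file needs to be written without pandas at that point, the nested dict returned by `df.to_dict()` can also be serialized with the standard `json` module. A sketch, with a hand-built dict standing in for the real `to_dict()` result:

```python
import json

# Stand-in for df.to_dict(): outer keys are documents,
# inner dicts hold their similarities to every document
similarity_dict = {
    "story1": {"story1": 1.0, "story2": 0.099331, "story3": 0.089116},
    "story2": {"story1": 0.099331, "story2": 1.0, "story3": 0.272521},
}

text = json.dumps(similarity_dict, indent=2)
# or json.dump(similarity_dict, fh) with an open file handle to write to disk
```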

Man, absolutely amazing. Thank you so much, I knew it was something basic. I'd been banging my head against it for hours – thetrainfiasco


@JonnyO'Mahony, glad I could help :) Once you added a sample dataset and the desired/produced result sets (and explained what you want to do with your data - I mean your code) - it became very clear. So I'd suggest always following this rule in the future when asking SciPy/sklearn/numpy/pandas/machine-learning questions - it greatly increases the probability of getting an answer ;-) Have a nice day! – MaxU