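Dynamically assign each document's similarity matrix to an array for export to JSON
I'm very new to Python, so I'm sure this is something simple that I'm missing, but I can't figure it out. I've created a similarity matrix for each document in my corpus, and I want to assign them back to a dictionary keyed by document name, to keep track of the similarities between each pair of documents.
However, it keeps assigning the last matrix to every key, rather than the matrix that corresponds to that key.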
import pandas as pd
import numpy as np
import nltk
import string
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import json
import os
path = "stories/"
token_dict = {}
stemmer = PorterStemmer()
def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
def stem_tokens(tokens, stemmer):
    stemmed_words = []
    for token in tokens:
        stemmed_words.append(stemmer.stem(token))
    return stemmed_words
for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        with open(file_path, "r", encoding="utf-8") as file:
            story = file
            text = story.read()
            lowers = text.lower()
            map = str.maketrans('', '', string.punctuation)
            no_punctuation = lowers.translate(map)
            token_dict[file.name.split("\\", 1)[1]] = no_punctuation
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())
termarray = tfs.toarray()
nparray = np.array(termarray)
rows, cols = nparray.shape
similarity = []
for document in docdict:
    for row in range(0, rows-1):
        similarity = cosine_similarity(tfs[row:row+1], tfs)
        docdict[document] = similarity
Everything works fine up until the assignment back to the dictionary.
This is the dictionary it produces:
{'98ststory1.txt': array([[ 0.10586559, 0.04742287, 0.02478352, 0.06587952, 0.12907377,
0.07661095, 0.06941533, 0.05443182, 0.06616549, 0.0266565 ,
0.04640984, 0.03356339, 0.02529364, 0.08210173, 0.16172138,
0.05594719, 0.10231466, 0.03556236, 0.18374215, 0.0588386 ,
0.16857304, 0.08866461, 0.12510476, 0.07107058, 0.0751615 ,
0.06371055, 0.16820855, 0.07926561, 0.02590006, 0.03690054,
0.01513446, 0.04677632, 0.11693509, 1. , 0.06086615]]),
'alfredststory1.txt': array([[ 0.10586559, 0.04742287, 0.02478352, 0.06587952, 0.12907377,
0.07661095, 0.06941533, 0.05443182, 0.06616549, 0.0266565 ,
0.04640984, 0.03356339, 0.02529364, 0.08210173, 0.16172138,
0.05594719, 0.10231466, 0.03556236, 0.18374215, 0.0588386 ,
0.16857304, 0.08866461, 0.12510476, 0.07107058, 0.0751615 ,
0.06371055, 0.16820855, 0.07926561, 0.02590006, 0.03690054,
0.01513446, 0.04677632, 0.11693509, 1. , 0.06086615]]),
'alfredststory2.txt': array([[ 0.10586559, 0.04742287, 0.02478352, 0.06587952, 0.12907377,
0.07661095, 0.06941533, 0.05443182, 0.06616549, 0.0266565 ,
0.04640984, 0.03356339, 0.02529364, 0.08210173, 0.16172138,
0.05594719, 0.10231466, 0.03556236, 0.18374215, 0.0588386 ,
0.16857304, 0.08866461, 0.12510476, 0.07107058, 0.0751615 ,
0.06371055, 0.16820855, 0.07926561, 0.02590006, 0.03690054,
0.01513446, 0.04677632, 0.11693509, 1. , 0.06086615]])
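Every one of the documents is assigned the matrix for the second-to-last document. That is only a minor issue, though; the real problem is that they are all assigned the same matrix.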
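The behaviour described above can be reproduced without scikit-learn. The following is only a minimal sketch of the looping pattern, not the original script: `names` and `rows` are hypothetical stand-ins for the document names and the per-document similarity rows, filled with the three-story sample values that appear later in this question.

```python
# Hypothetical stand-ins for the document names and their similarity rows.
names = ["story1", "story2", "story3"]
rows = [
    [1.0, 0.05814422, 0.06032458],
    [0.05814422, 1.0, 0.21323354],
    [0.06032458, 0.21323354, 1.0],
]

# Same shape as the loop in the question: for every key, the inner loop runs
# to completion, so each key keeps only the value from its final iteration.
# range(0, len(rows) - 1) also stops one short, hence "second to last".
overwritten = {}
for name in names:
    for row in range(0, len(rows) - 1):
        overwritten[name] = rows[row]

# Pairing each key with exactly one row index avoids the overwrite.
paired = {}
for index, name in enumerate(names):
    paired[name] = rows[index]
```

In the first loop every key ends up holding `rows[1]`, which matches the symptom of all keys sharing the second-to-last matrix; in the second, each key holds its own row.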
For a single document, I get a matrix like this:
array([[ 1. , 0.07015725, 0.01593837, 0.05618977, 0.03892873,
0.02434279, 0.06029888, 0.02261425, 0.03531677, 0.02975444,
0.01835854, 0.02145624, 0.00985163, 0.03645598, 0.0497407 ,
0.04482995, 0.06677013, 0.03153055, 0.10919878, 0.12029462,
0.07255828, 0.05499581, 0.06330188, 0.04719668, 0.08909685,
0.04484428, 0.06725359, 0.04453039, 0.02381673, 0.02639529,
0.01012012, 0.0218679 , 0.09989828, 0.10586559, 0.01535069]])
where this holds the first document's similarity to each of the documents. What I want is a dictionary that looks like this:
{
    story1:
    {
        story1: 1.,
        story2: 0.07015725,
        story3: 0.01593837,
        story4: 0.05618977...
    }
    story2:
    {
        story1: ...
    }
}
...and so on.
A sample data set looks like this:
story1 = """Four other streets were renamed in Cork at the turn of the last century to celebrate this event: Wolfe Tone St. (Previously Fair Lane), John Philpot Curran St. (Philpot’s Lane), Emmet (Nelson’s) Place and Sheare’s (Nile) St."""
story2 = """Oliver Plunkett Street was originally named George's Street after George I, the then reigning King of Great Britain and Ireland. In 1920, during the Burning of Cork, large parts of the street were destroyed by British troops."""
story3 = """Alfred Street is a connecting Street between Kent Train Station and MacCurtain Street. Present Cork city centre signage uses letters inspired by the book of Kells. This has been an inspiration for many typefaces in the past, including the Petrie's 'B' typface, and Monotype's 'Column Cille', which was widely used for school textbooks."""
Run through the script, this produces the following similarity matrices:
[[ 1. 0.05814422 0.06032458]]
[[ 0.05814422 1. 0.21323354]]
[[ 0.06032458 0.21323354 1. ]]
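where each of these is a 1 x n matrix holding one document's similarity to every document. I want to turn this into a dictionary that lets me look up one document's similarity to each of the other documents, like this: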
{
    story1: {
        story1: 1.,
        story2: 0.05814422,
        story3: 0.06032458
    },
    story2: {
        story1: 0.05814422,
        story2: 1.,
        story3: 0.21323354
    },
    story3: {
        story1: 0.06032458,
        story2: 0.21323354,
        story3: 1.
    }
}
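One possible way to get that shape, sketched with plain Python lists so it runs on its own: `names` and `similarity_rows` are hypothetical variables holding the three-story sample above. With the real corpus, the rows could presumably come from a single `cosine_similarity(tfs)` call converted with `.tolist()`, and the names from `token_dict.keys()`; that mapping is an assumption here, not something tested against the original script.

```python
import json

# Hypothetical stand-ins, using the sample values shown above.
names = ["story1", "story2", "story3"]
similarity_rows = [
    [1.0, 0.05814422, 0.06032458],
    [0.05814422, 1.0, 0.21323354],
    [0.06032458, 0.21323354, 1.0],
]

# Outer zip pairs each document with its own row; inner zip pairs each
# score in that row with the document it was compared against.
nested = {
    name: {other: float(score) for other, score in zip(names, row)}
    for name, row in zip(names, similarity_rows)
}

# Plain dicts of floats serialize directly; NumPy arrays would not.
as_json = json.dumps(nested, indent=2)
```

Storing plain floats rather than arrays is what makes the `json.dumps` call work, since the `json` module cannot serialize a NumPy array directly.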
I'm sure this is a basic question, but my knowledge of Python data structures is lacking, and any help would be greatly appreciated!
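Please provide a small (3-5 rows) but reproducible data set (in text/CSV format) and the desired data set. This will help people reproduce your problem and understand it better. Please see [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – MaxU
I've just added a bit more, which I hope clarifies what I'm trying to do! – thetrainfiasco
Yes, it's clear now, thanks! I've added an answer - please check... – MaxU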