
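Document vectorization representation in Python

I am trying my hand at sentiment analysis in Python 3, using a TF-IDF vectorizer with a bag-of-words model to vectorize the documents.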

So, to anyone familiar with this, it is obvious that the resulting matrix representation is sparse.

Here is my code snippet. First, the documents.

tweets = [('Once you get inside you will be impressed with the place.',1),('I got home to see the driest damn wings ever!',0),('An extensive menu provides lots of options for breakfast.',1),('The flair bartenders are absolutely amazing!',1),('My first visit to Hiro was a delight!',1),('Poor service, the waiter made me feel like I was stupid every time he came to the table.',0),('Loved this place.',1),('This restaurant has great food',1), 
     ('Honeslty it did not taste THAT fresh :(',0),('Would not go back.',0), 
     ('I was shocked because no signs indicate cash only.',0), 
     ('Waitress was a little slow in service.',0), 
     ('did not like at all',0),('The food, amazing.',1), 
     ('The burger is good beef, cooked just right.',1), 
     ('They have horrible attitudes towards customers, and talk down to each one when customers do not enjoy their food.',0), 
     ('The cocktails are all handmade and delicious.',1),('This restaurant has terrible food',0), 
     ('Both of the egg rolls were fantastic.',1),('The WORST EXPERIENCE EVER.',0), 
     ('My friend loved the salmon tartar.',1),('Which are small and not worth the price.',0), 
     ('This is the place where I first had pho and it was amazing!!',1), 
     ('Horrible - do not waste your time and money.',0),('Seriously flavorful delights, folks.',1), 
     ('I loved the bacon wrapped dates.',1),('I dressed up to be treated so rudely!',0), 
     ('We literally sat there for 20 minutes with no one asking to take our order.',0), 
     ('you can watch them preparing the delicious food! :)',1),('In the summer, you can dine in a charming outdoor patio - so very delightful.',1)] 

X_train, y_train = zip(*tweets) 

And the following code to vectorize the documents.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer(lowercase=True)
vectorized = tfidfvec.fit_transform(X_train)

print(vectorized) 

When I print vectorized, it does not output a normal matrix. Instead, it prints this: [image: printed output of vectorized]

If I am not mistaken, this must be a sparse matrix representation. However, I cannot make out its format or what each term means.

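Also, there are 30 documents, which explains the 0-29 in the first column. If that is the pattern, then I am guessing the second column is the index of the word and the last value is the tf-idf? This only struck me while I was typing my question, so please correct me if I am wrong.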

Could someone with experience help me understand it better?

Answer


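Yes. Technically, the first two numbers form a tuple giving the (row, column) position, and the third number is the value at that position. So the output basically shows the positions and values of the non-zero entries: the row is the document index, the column is the index of the term in the vectorizer's vocabulary, and the value is the tf-idf weight.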
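To make the (row, column) value reading concrete, here is a minimal sketch that turns the stored entries back into (document, term, weight) triples. It assumes scikit-learn is installed and, to keep the output short, uses only three of the sentences from the question; the names docs, index_to_term and entries are introduced here for illustration and are not part of the original code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three short documents taken from the question's data set.
docs = [
    'Loved this place.',
    'This restaurant has great food',
    'This restaurant has terrible food',
]

tfidfvec = TfidfVectorizer(lowercase=True)
vectorized = tfidfvec.fit_transform(docs)

# vocabulary_ maps each term to its column index; invert it for lookups.
index_to_term = {col: term for term, col in tfidfvec.vocabulary_.items()}

# COO format exposes the stored entries as parallel row/col/data arrays,
# which is the same information print(vectorized) shows line by line.
coo = vectorized.tocoo()
entries = [
    (int(r), index_to_term[int(c)], float(v))
    for r, c, v in zip(coo.row, coo.col, coo.data)
]
for doc_idx, term, weight in entries:
    print(f'doc {doc_idx}  term {term!r}  tf-idf {weight:.3f}')

# The dense form has one row per document and one column per vocabulary term.
dense = vectorized.toarray()
print(dense.shape)
```

Calling toarray() is fine for a handful of documents like this, but on a large corpus the dense matrix is mostly zeros, which is why the vectorizer returns a sparse matrix in the first place.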