K-means on structured data with Python: more than one column

How do I run k-means on multiple columns of structured data?

The example below works on a single column (name):

tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])

Only the name is used here. Suppose I want to use both name and country: should I append the country to the same column, as follows?

df_new['name'] = df_new['name'] + " " + df_new['country'] 
tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])

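It works from a code point of view, and I am still trying to make sense of the results (I actually have tons of columns of data), but I don't know whether this is the right way to go when there is more than one column.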

import os
import pandas as pd
import re
import numpy as np

df = pd.read_csv('sample-data.csv')


def split_description(string):
    # The name is everything before the first ' - ' in the description.
    string_split = string.split(' - ', 1)
    name = string_split[0]

    return name


df_new = pd.DataFrame()
df_new['name'] = df['description'].apply(split_description)
df_new['id'] = df['id']


def remove(name):
    # Drop digits, then collapse repeated whitespace.
    new_name = re.sub(r"[0-9]", '', name)
    new_name = ' '.join(new_name.split())
    return new_name

df_new['name'] = df_new['name'].apply(remove)


from sklearn.feature_extraction.text import TfidfVectorizer


tfidf_vectorizer = TfidfVectorizer(
            use_idf=True,
            stop_words='english',
            ngram_range=(1, 4), min_df=0.01, max_df=0.8)


tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])

print(tfidf_matrix.shape)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(tfidf_vectorizer.get_feature_names_out())


from sklearn.metrics.pairwise import cosine_similarity
# Cosine distance between every pair of rows.
dist = 1.0 - cosine_similarity(tfidf_matrix)
print(dist)


from sklearn.cluster import KMeans
num_clusters = range(1, 20)

# Fit one model per candidate number of clusters.
KM = [KMeans(n_clusters=k, random_state=1).fit(tfidf_matrix) for k in num_clusters]

Source

2017-10-05 Naresh MG

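KMeans works on 2-D data. Have you tried running KMeans on the original dataset, without merging the columns into a single one, after converting them to numeric columns (e.g. with one-hot encoding or binarization)?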

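Thanks for your comment. I haven't tried that yet, but I have a lot of columns. If I end up using 30+ columns, do you think that is the way to go? (Some of them are descriptions, where encoding won't work.)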

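For columns that contain text, tf-idf is fine; for categorical columns, one-hot encoding is fine. It doesn't matter how many columns you have, unless you have very little data (rows). If the number of rows is large enough, this is the basic approach. Once you have analyzed the data, you can apply other, more advanced feature selection and engineering techniques.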
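
To illustrate the suggestion above, here is a minimal sketch that encodes each column separately and then stacks the results side by side. The toy rows and the use of scipy's hstack are assumptions made for the example, not something taken from the question's data.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the real df_new would have many more rows and columns.
df_new = pd.DataFrame({
    "name": ["red cotton shirt", "blue cotton shirt",
             "wool winter hat", "wool summer hat"],
    "country": ["US", "US", "FR", "FR"],
})

# Text column: tf-idf (8 distinct words in the toy data).
text_features = TfidfVectorizer().fit_transform(df_new["name"])

# Categorical column: one-hot (2 distinct countries). Note the 2-D selection.
country_features = OneHotEncoder(handle_unknown="ignore").fit_transform(
    df_new[["country"]]
)

# Stack the two sparse blocks column-wise into a single feature matrix.
features = hstack([text_features, country_features]).tocsr()
print(features.shape)
```

The resulting matrix can be passed to KMeans in place of tfidf_matrix. The two blocks are on different scales, so some weighting or scaling may be needed in practice.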

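No, this is the wrong way to handle multiple columns. You are basically just jamming several features together and expecting it to behave correctly, as if k-means were being applied to those columns as separate features.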
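
One way to see the problem is a small sketch with made-up rows (both the data and the column values are hypothetical). After concatenation, a word in the name and the same word in the country end up as one and the same feature:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical rows: "china" is a material in one name and a country in another row.
df_new = pd.DataFrame({
    "name": ["china tea cup", "glass tea cup"],
    "country": ["india", "china"],
})

# The concatenation approach from the question.
combined = df_new["name"] + " " + df_new["country"]

vectorizer = TfidfVectorizer()
vectorizer.fit(combined)

# The vocabulary has a single "china" entry, shared by both meanings.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

The vectorizer cannot tell which column a token came from, so the clustering cannot either.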

You need to use other methods to do this on multiple columns, such as a vectorizer and Pipelines together with TfidfVectorizer. You can check out this link for more information.
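
As a rough sketch of what that could look like, the snippet below uses scikit-learn's ColumnTransformer inside a Pipeline. This is one possible setup and not necessarily what the linked page describes; the toy data, the step names and n_clusters=2 are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for df_new.
df_new = pd.DataFrame({
    "name": ["red cotton shirt", "blue cotton shirt", "green cotton shirt",
             "wool winter hat", "wool summer hat", "wool ski hat"],
    "country": ["US", "US", "US", "FR", "FR", "FR"],
})

# Each column gets its own transformer; the outputs are concatenated column-wise.
preprocess = ColumnTransformer([
    # A plain string selects a 1-D column, which is what text vectorizers expect.
    ("name_tfidf", TfidfVectorizer(), "name"),
    # A list selects a 2-D frame, which is what OneHotEncoder expects.
    ("country_onehot", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])

model = Pipeline([
    ("features", preprocess),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=1)),
])

# One cluster label per row.
labels = model.fit_predict(df_new)
print(labels)
```

More columns can be added as extra entries in the ColumnTransformer list, each with a transformer suited to its type.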

Also, you can check out this answer for a possible alternative solution to your problem.

Source

2017-10-05 07:12:54
