2017-01-27 156 views
0

Computing IDF (inverse document frequency) on a pandas DataFrame

I have a DataFrame (df) with three columns, as shown below:

DocumentID  Words            Region
1           ['A','B','C']    ['Canada']
2           ['A','X','D']    ['India', 'USA', 'Canada']
3           ['B','C','X']    ['Canada']

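I want to compute the IDF for each word in the Words column, i.e. I want to generate an output that lists every word ('A', 'B', 'C', etc.) together with its corresponding IDF value.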

+1

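There are several well-documented, maintained, and widely used NLP libraries out there. You probably already have a couple installed. To be honest, why you are using a `DataFrame` this way makes no sense to me. DataFrames of lists are almost always a sign that you are approaching this the wrong way. –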

Answers

-1
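import pandas as pd

# Assumes `df` is the DataFrame from the question, with list-valued
# 'Words' and 'Region' columns.
list_words = []
list_regions = []

# Flatten each list-valued column into one flat list.
for words in df['Words']:
    for word in words:
        list_words.append(word)

for regions in df['Region']:
    for region in regions:
        list_regions.append(region)

IDF_words = pd.DataFrame([], columns=['words','IDF'])
IDF_regions = pd.DataFrame([], columns=['regions','IDF'])

# One row per distinct word / region.
IDF_words['words'] = sorted(set(list_words))
IDF_regions['regions'] = sorted(set(list_regions))

# Note: this is each item's share of all occurrences (a relative frequency),
# not an inverse document frequency.
IDF_words['IDF'] = IDF_words['words'].map(lambda x: list_words.count(x)/float(len(list_words)))
IDF_regions['IDF'] = IDF_regions['regions'].map(lambda x: list_regions.count(x)/float(len(list_regions)))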

Hope it helps, bro!
If it does, please upvote/mark as answered :)
Peace

+0

Perhaps one for the OP: what does Region have to do with idf[w]? – gerowam

+0

@epattaro TypeError: unhashable type: 'list' – ComplexData

+0

It runs perfectly here. Did you change anything that might cause this? It is important to note that there is no equals sign before list.append(...). – epattaro

0

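Here is a slightly less specific version. Assuming you want the standard 1/DF definition of IDF, you can iterate over each "document" in the Words column: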

from collections import defaultdict

# Assuming the Words column is represented as you presented it:
words = [['A','B','C'],
         ['A','X','D'],
         ['B','C','X']]

# To store intermediate counts (document frequency per word):
df_counts = defaultdict(float)
for doc in words:
    for w in set(doc):  # set() counts a word at most once per document
        df_counts[w] += 1

# Compute IDF as 1/df:
idf = {k: 1 / v for k, v in df_counts.items()}  # <- {'A': 0.5, 'B': 0.5, 'C': 0.5, 'D': 1.0, 'X': 0.5}
vocab = list(idf)  # Note that the vocab is also accessible now.
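
If the logarithmic form of IDF is wanted instead of 1/df, the same counting idea carries over. The following is only a sketch under assumptions: it uses the common textbook definition log(N / df), where N is the number of documents, and the function name `inverse_document_frequency` is hypothetical, not something from the question or from pandas.

```python
import math
from collections import Counter


def inverse_document_frequency(docs):
    """Return {term: log(N / df)} for an iterable of token lists.

    Assumes the textbook definition: N is the number of documents and
    df is the number of documents that contain the term.
    """
    # Deduplicate within each document so a repeated word counts once.
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)
    doc_freq = Counter(term for doc in doc_sets for term in doc)
    return {term: math.log(n_docs / count) for term, count in doc_freq.items()}


# Same sample data as the Words column in the question.
words = [['A', 'B', 'C'],
         ['A', 'X', 'D'],
         ['B', 'C', 'X']]

idf_log = inverse_document_frequency(words)
```

Since iterating over a pandas Series yields its values, passing `df['Words']` in place of `words` should work as well, and `pd.Series(idf_log)` would turn the result into a word-indexed Series.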