Adding bigrams to a pandas DataFrame

I have a list of bigrams like this:

[['a','b'],['e','f']]

Now I want to add these bigrams, together with their frequencies, to a DataFrame like this:

  b f
a|1 0
e|0 1

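I tried to do this with the code below, but it raises an error because the index does not exist yet. Is there a fast way to do this for really large data (for example, 200,000 bigrams)?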

matrixA = pd.DataFrame() 

# Put the counts in a matrix 
for elem in grams: 
    tag1, tag2 = elem[0], elem[1] 
    matrixA.loc[tag1, tag2] += 1


2016-03-01 maxmijn

from collections import Counter

import pandas as pd

bigrams = [[['a','b'],['e', 'f']], [['a','b'],['e', 'g']]] 
pairs = [] 
for bg in bigrams: 
    pairs.append((bg[0][0], bg[0][1])) 
    pairs.append((bg[1][0], bg[1][1])) 
c = Counter(pairs) 

>>> pd.Series(c).unstack() # optional: .fillna(0) 
    b f g 
a 2 NaN NaN 
e NaN 1 1

The code above shows the intuition. It can be wrapped up in a one-line generator expression, as follows:

pd.Series(Counter((bg[i][0], bg[i][1]) for bg in bigrams for i in range(2))).unstack()
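The list in the question is flat (one pair per element) rather than nested as in the example above. Here is a minimal sketch of the same Counter approach for that shape. It assumes the list is named grams, as in the question, and uses made-up sample data. Passing fill_value=0 to unstack should give zeros instead of NaN for pairs that never occur:

```python
from collections import Counter

import pandas as pd

# Flat list of bigrams, shaped like the one in the question (sample data)
grams = [['a', 'b'], ['e', 'f'], ['a', 'b']]

# Lists are not hashable, so convert each bigram to a tuple before counting
counts = Counter(map(tuple, grams))

# A Series keyed by tuples gets a two-level MultiIndex; unstack pivots the
# second level into columns, and fill_value=0 fills absent pairs with zeros
matrix = pd.Series(counts).unstack(fill_value=0)
print(matrix)
```

The counting is a single pass over the list, so no row or column has to exist in the DataFrame beforehand.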


2016-03-01 17:36:20 Alexander

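You can use Counter from the collections module. Note that I changed the contents of the list to tuples instead of lists. This is because Counter keys (like dictionary keys) must be hashable.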

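from collections import Counter

import pandas as pd

l = [('a','b'),('e', 'f')]
index, cols = zip(*l)
# Deduplicate the labels so repeated bigrams do not create duplicate rows or columns
df = pd.DataFrame(0, index=sorted(set(index)), columns=sorted(set(cols)))
counts = Counter(l)

# Use distinct loop names so the Counter itself is not shadowed
for (row, col), count in counts.items():
    df.loc[row, col] = count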
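For large inputs, the cell-by-cell assignment through df.loc can be avoided altogether. One possible alternative, which is not part of the original answer, is pandas.crosstab, which builds the frequency table directly from two parallel sequences. The names grams, first and second are illustrative, and the data is made up:

```python
import pandas as pd

# Sample bigrams as tuples (illustrative data)
grams = [('a', 'b'), ('e', 'f'), ('a', 'b')]

# Split the pairs into two parallel sequences: first tokens and second tokens
first, second = zip(*grams)

# crosstab counts how often each (first, second) combination occurs;
# combinations that never occur are reported as 0
matrix = pd.crosstab(
    pd.Series(first, name='first'),
    pd.Series(second, name='second'),
)
print(matrix)
```

Whether this beats the Counter approach on 200,000 bigrams would need to be measured on the actual data.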


2016-03-01 16:36:12 Alex
