Iterating over and modifying a DataFrame using pandas groupby

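I am working with a large array of 1s and need to systematically replace parts of it with 0s. The large array is made up of many smaller arrays, and for each smaller array I need to replace its upper and lower triangles with 0s. For example, here is an array made up of 5 sub-arrays, identified by the index value (all sub-arrays have the same number of columns):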

 0 1 2 
0 1.0 1.0 1.0 
1 1.0 1.0 1.0 
1 1.0 1.0 1.0 
2 1.0 1.0 1.0 
2 1.0 1.0 1.0 
2 1.0 1.0 1.0 
3 1.0 1.0 1.0 
3 1.0 1.0 1.0 
3 1.0 1.0 1.0 
3 1.0 1.0 1.0 
4 1.0 1.0 1.0 
4 1.0 1.0 1.0 
4 1.0 1.0 1.0 
4 1.0 1.0 1.0 
4 1.0 1.0 1.0
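
For reference, the example frame above can be reproduced with a short snippet. This is only a sketch: the names `lengths` and `n_cols` are illustrative and do not appear in the original post.

```python
import numpy as np
import pandas as pd

# Illustrative names (not from the original post)
lengths = np.array([1, 2, 3, 4, 5])   # number of rows in each sub-array
n_cols = 3                            # columns shared by every sub-array

# Repeat each group label once per row of its sub-array: 0, 1, 1, 2, 2, 2, ...
index = np.repeat(np.arange(len(lengths)), lengths)
df = pd.DataFrame(np.ones((lengths.sum(), n_cols)), index=index)
print(df)
```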

I would like to modify the upper and lower triangles within each group of rows so that the resulting matrix is:

 0 1 2 
0 1.0 1.0 1.0 
1 1.0 1.0 0.0 
1 0.0 1.0 1.0 
2 1.0 0.0 0.0 
2 0.0 1.0 0.0 
2 0.0 0.0 1.0 
3 1.0 0.0 0.0 
3 1.0 1.0 0.0 
3 0.0 1.0 1.0 
3 0.0 0.0 1.0 
4 1.0 0.0 0.0 
4 1.0 1.0 0.0 
4 1.0 1.0 1.0 
4 0.0 1.0 1.0 
4 0.0 0.0 1.0

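At the moment I produce this result using only numpy, but I think I could speed things up with a pandas groupby. In reality my dataset is very large, with almost 500,000 rows. The numpy code is below: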

import numpy as np 

candidateLengths = np.array([1,2,3,4,5]) 
centroidLength =3 

smallPaths = [min(l,centroidLength) for l in candidateLengths] 

# Diagonal offset k for each sub-array, passed to np.tri below
k_vals = list(map(lambda smallPath: centroidLength - (smallPath), smallPaths)) 
maskArray = np.ones((np.sum(candidateLengths), centroidLength)) 

startPos = 0 
endPos = 0 
for canNo, canLen in enumerate(candidateLengths): 
    a = np.ones((canLen, centroidLength)) 
    a *= np.tri(*a.shape, dtype=bool, k=k_vals[canNo])  # builtin bool; the np.bool alias is not available in all NumPy versions
    b = np.fliplr(np.flipud(a)) 
    c = a*b 

    endPos = startPos + canLen 

    maskArray[startPos:endPos, :] = c 

    startPos = endPos 

print(maskArray)
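
To see what the `k` offset passed to `np.tri` does inside the loop above, here is a small standalone check for a sub-array with 2 rows and 3 columns, where the question's logic gives `k = 1`. It is a sketch of a single loop iteration, not a replacement for the code above.

```python
import numpy as np

# np.tri(N, M, k) is 1 where column <= row + k and 0 elsewhere
a = np.tri(2, 3, k=1)            # [[1, 1, 0], [1, 1, 1]]
b = np.fliplr(np.flipud(a))      # same mask rotated by 180 degrees: [[1, 1, 1], [0, 1, 1]]
c = a * b                        # a cell stays 1 only if both masks keep it
print(c)
# [[1. 1. 0.]
#  [0. 1. 1.]]
```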

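When I run it on my real dataset it takes nearly 5-7 seconds to execute. I think this comes down to the huge loop. How can I use a pandas groupby to get a higher speed? Thanks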

Asked by kPow989, 2017-06-03

New answer

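import numpy as np

def tris(n, m):
    # Lower-triangular mask of shape (n, m); built transposed when n < m
    if n < m:
        a = np.tri(m, n, dtype=int).T
    else:
        a = np.tri(n, m, dtype=int)
    # Multiply by the 180-degree rotation so both corners are zeroed
    return a * a[::-1, ::-1]

# df is the DataFrame from the question
idx = np.append(df.index.values, -1)
w = np.append(-1, np.flatnonzero(idx[:-1] != idx[1:]))
c = np.diff(w)  # number of rows in each group
df * np.vstack([tris(n, 3) for n in c])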

    0 1 2 
0 1.0 1.0 1.0 
1 1.0 1.0 0.0 
1 0.0 1.0 1.0 
2 1.0 0.0 0.0 
2 0.0 1.0 0.0 
2 0.0 0.0 1.0 
3 1.0 0.0 0.0 
3 1.0 1.0 0.0 
3 0.0 1.0 1.0 
3 0.0 0.0 1.0 
4 1.0 0.0 0.0 
4 1.0 1.0 0.0 
4 1.0 1.0 1.0 
4 0.0 1.0 1.0 
4 0.0 0.0 1.0
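
The three lines that compute `c` recover the length of each run of equal index labels without a groupby. Below is a worked trace on the example index, as a sketch. It assumes that equal labels are contiguous and that -1 never occurs as a real label; the intermediate name `change` is added here for readability.

```python
import numpy as np

index = np.array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4])

idx = np.append(index, -1)                    # sentinel so the last run also ends in a change
change = np.flatnonzero(idx[:-1] != idx[1:])  # last position of each run: [0, 2, 5, 9, 14]
w = np.append(-1, change)                     # prepend -1 so the first difference is run 0's length
c = np.diff(w)                                # run lengths
print(c)  # [1 2 3 4 5]
```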

Old answer

I defined some helper triangle functions

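import numpy as np
import pandas as pd

def tris(n, m):
    if n < m:
        a = np.tri(m, n, dtype=int).T
    else:
        a = np.tri(n, m, dtype=int)
    return a * a[::-1, ::-1]

def tris_df(df):
    # Build the mask for one group, keeping that group's index and columns
    n, m = df.shape
    return pd.DataFrame(tris(n, m), df.index, df.columns)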

Then

df * df.groupby(level=0, group_keys=False).apply(tris_df) 

    0 1 2 
0 1.0 1.0 1.0 
1 1.0 1.0 0.0 
1 0.0 1.0 1.0 
2 1.0 0.0 0.0 
2 0.0 1.0 0.0 
2 0.0 0.0 1.0 
3 1.0 0.0 0.0 
3 1.0 1.0 0.0 
3 0.0 1.0 1.0 
3 0.0 0.0 1.0 
4 1.0 0.0 0.0 
4 1.0 1.0 0.0 
4 1.0 1.0 1.0 
4 0.0 1.0 1.0 
4 0.0 0.0 1.0

Answered by piRSquared, 2017-06-03 18:13:38

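Hi @piRSquared, thanks for this. I think the solution you provided is slower than the for loop I originally wrote. I believe apply behaves very much like a for loop. If you try it with candidateLengths = np.random.randint(1, 7, size=300000), I find that my code executes in 6 seconds. Thanks! – kPow989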

@user3063482 Give it a try. – piRSquared

Hi, I appreciate your time. Your new function returns in 3.74s versus 5.34s for mine! This works great. Thanks for the help! – kPow989
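
Both approaches in this thread still build one small mask per sub-array in Python. One possible way to drop that loop is to compute, for every row, the first and last column that should stay 1, and then compare against the column numbers with broadcasting. This is not from the thread: it is a sketch that assumes the mask follows the pattern shown in the expected output, and `band_mask` is a hypothetical helper name. No timing is claimed for it.

```python
import numpy as np

def band_mask(lengths, n_cols):
    """Stack the per-sub-array masks without a Python-level loop (hypothetical helper)."""
    lengths = np.asarray(lengths)
    starts = np.cumsum(lengths) - lengths                       # first row of each sub-array
    n = np.repeat(lengths, lengths)                             # sub-array length, per row
    i = np.arange(lengths.sum()) - np.repeat(starts, lengths)   # row position inside its sub-array
    lo = i - np.maximum(n - n_cols, 0)                          # first column kept in this row
    hi = i + np.maximum(n_cols - n, 0)                          # last column kept in this row
    j = np.arange(n_cols)
    return ((j >= lo[:, None]) & (j <= hi[:, None])).astype(float)

mask = band_mask([1, 2, 3, 4, 5], 3)
print(mask)
```

Multiplying the question's frame by this array (for example `df * mask`) would apply it, assuming the rows of `df` are ordered by sub-array.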
