Running count of consecutive identical values

How do I get a running count of consecutive 1s in a pandas Series? For example, s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4]). I want to get pd.Series([0, 1, 0, 1, 2, 0, 0, 1, 2, 3, 0]).

(pandas 0.18.0)

2016-03-26 max

You can try groupby with cumcount, grouping on the cumsum of (s1 != 1):

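import pandas as pd

s1 = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])  # the Series from the question
print(s1.groupby((s1 != 1).cumsum()).cumcount())
0     0
1     1
2     0
3     1
4     2
5     0
6     0
7     1
8     2
9     3
10    0
dtype: int64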

A more detailed explanation:

# This walkthrough uses a 12-element variant of s1 (see the orig column below)
s1 = pd.Series([5, 1, 3, 4, 1, 1, 2, 3, 1, 1, 1, 4])
df = pd.DataFrame(s1, columns=['orig'])
df['not1'] = s1 != 1
# Every value that is not 1 starts a new group
df['cumsum'] = (s1 != 1).cumsum()
df['cumcount'] = s1.groupby((s1 != 1).cumsum()).cumcount()
# s1.groupby((s1 != 1).cumsum()).cumcount() is the same as:
df['cumcount1'] = df.groupby('cumsum')['orig'].cumcount()
print(df)
    orig   not1  cumsum  cumcount  cumcount1
0      5   True       1         0          0
1      1  False       1         1          1
2      3   True       2         0          0
3      4   True       3         0          0
4      1  False       3         1          1
5      1  False       3         2          2
6      2   True       4         0          0
7      3   True       5         0          0
8      1  False       5         1          1
9      1  False       5         2          2
10     1  False       5         3          3
11     4   True       6         0          0
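
As a minimal, self-contained sketch of the grouping trick above, the same steps can be wrapped in a small helper. The function name count_consecutive and its value parameter are illustrative assumptions and do not appear in the original answer.

```python
import pandas as pd

def count_consecutive(s, value=1):
    """Running count of consecutive `value` entries (hypothetical helper)."""
    # Every element that is not `value` starts a new group, so each group is
    # one non-matching element followed by the run of `value` right after it.
    group_id = (s != value).cumsum()
    # cumcount() numbers the rows of each group 0, 1, 2, ...: the leading
    # non-matching element gets 0 and the run behind it gets 1, 2, 3, ...
    return s.groupby(group_id).cumcount()

s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])
print(count_consecutive(s).tolist())
# [0, 1, 0, 1, 2, 0, 0, 1, 2, 3, 0]
```

One thing to keep in mind with this sketch: it relies on each run being preceded by a non-matching element. A Series that begins with the target value would have its first run numbered from 0 instead of 1.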

Or:

# s1 is again the 11-element Series from the question
print((s1 == 1) * (s1.groupby((s1 != s1.shift()).cumsum()).cumcount() + 1))
0     0
1     1
2     0
3     1
4     2
5     0
6     0
7     1
8     2
9     3
10    0
dtype: int64

Explanation:

# s1 is the 12-element Series shown in the orig column below
df = pd.DataFrame(s1, columns=['orig'])
# True wherever a value differs from the one before it, i.e. a new run starts
df['compare_shift'] = s1 != s1.shift()
df['cumsum'] = (s1 != s1.shift()).cumsum()
df['cumcount'] = s1.groupby((s1 != s1.shift()).cumsum()).cumcount() + 1
df['cumcount1'] = df.groupby('cumsum')['orig'].cumcount() + 1
df['is1'] = (s1 == 1)
# True is converted to 1, False to 0
df['fin'] = (s1 == 1) * (s1.groupby((s1 != s1.shift()).cumsum()).cumcount() + 1)
print(df)
    orig  compare_shift  cumsum  cumcount  cumcount1    is1  fin
0      5           True       1         1          1  False    0
1      1           True       2         1          1   True    1
2      3           True       3         1          1  False    0
3      4           True       4         1          1  False    0
4      1           True       5         1          1   True    1
5      1          False       5         2          2   True    2
6      2           True       6         1          1  False    0
7      3           True       7         1          1  False    0
8      1           True       8         1          1   True    1
9      1          False       8         2          2   True    2
10     1          False       8         3          3   True    3
11     4           True       9         1          1  False    0
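
The shift-based variant generalises naturally to any target value, or to no target at all. The sketch below assumes a hypothetical helper named run_position with an optional value parameter. Neither name comes from the original answer.

```python
import pandas as pd

def run_position(s, value=None):
    """1-based position of each element inside its run of equal values.

    If `value` is given, positions outside runs of `value` are set to 0.
    (Hypothetical helper, written for illustration only.)
    """
    # A new run starts wherever an element differs from its predecessor.
    run_id = (s != s.shift()).cumsum()
    pos = s.groupby(run_id).cumcount() + 1
    if value is None:
        return pos
    # Keep the position where s equals `value`, use 0 everywhere else.
    return pos.where(s == value, 0)

s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])
print(run_position(s, value=1).tolist())
# [0, 1, 0, 1, 2, 0, 0, 1, 2, 3, 0]
```

Because runs are detected by comparing each element with its predecessor, this version also counts correctly when the Series starts with the target value.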

2016-03-26 06:11:15 jezrael

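I assume it needs one full pass over the rows for each of (s1 != 1), cumsum, groupby and cumcount. Does the fact that it takes 4 passes slow it down compared with a (hypothetical) pandas method that could do everything in one go? (Of course, I realise that even so it is still far faster than a pure Python loop.) – max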

I think it is still the faster/better option because it uses pandas functions, even with 4 passes. – jezrael
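
On the question of passes, one possible way to skip the groupby step is to work directly on the underlying NumPy array. The sketch below still makes a handful of vectorised passes rather than a single one, and no timings were measured for it. The function name run_count_numpy is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

def run_count_numpy(s, value=1):
    """Running count of consecutive `value` entries without groupby (sketch)."""
    is_val = s.values == value
    idx = np.arange(len(is_val))
    # For every position, the index of the most recent element that is NOT
    # `value` (or -1 if there has been none yet).
    last_break = np.maximum.accumulate(np.where(is_val, -1, idx))
    # Distance to that element is the position inside the current run.
    return pd.Series(np.where(is_val, idx - last_break, 0), index=s.index)

s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])
print(run_count_numpy(s).tolist())
# [0, 1, 0, 1, 2, 0, 0, 1, 2, 3, 0]
```

Whether this is actually faster than the groupby version depends on the data and the pandas/NumPy versions in use, so it is worth timing both on a realistic Series before choosing.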

Not the prettiest way (and possibly not optimal), but the following gets the job done (about 4.5 times faster than the other, loop-based answer):

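import numpy as np
import pandas as pd

s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])

def consecutive_n(s, n=1):
    # a: cumulative count of n where s == n, 0 elsewhere
    a = s[s == n].cumsum().reindex(s.index).fillna(0) / n
    # b: value of a at the start of every run after the first
    b = a[a.diff() > 1]
    # c: offset that resets the count at each run start, carried forward
    c = (1 - b).reindex(s.index).ffill().fillna(0)
    return (a + c).apply(lambda x: np.where(x < 0, 0, x)).astype(int)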

>>> consecutive_n(s, n=1)
0     0
1     1
2     0
3     1
4     2
5     0
6     0
7     1
8     2
9     3
10    0
dtype: int64

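Some explanation of the intermediate values:
a: running count of the occurrences of 1 (or n) over the whole series, with 0 wherever the value is something else.
b: the value of a at the start of each run after the first, i.e. wherever a jumps by more than 1.
c: the "reset" amount that has to be added to a each time a different number appears between the 1s (or n).
Return value: the lambda replaces the negative numbers produced by a + c with 0.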
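
To make the intermediate values concrete, the sketch below prints them side by side for the Series from the question. It uses reindex to align on the original index, and computes c by carrying 1 - b forward from each run start. Treat that expression as one possible way to obtain the reset offset, not the only one. The column names in the steps frame are illustrative.

```python
import pandas as pd

s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])
n = 1

# a: cumulative count of n at the positions where s == n, 0 elsewhere
a = s[s == n].cumsum().reindex(s.index).fillna(0) / n
# b: value of a at the start of every run after the first
b = a[a.diff() > 1]
# c: reset offset, carried forward from each run start (0 before the second run)
c = (1 - b).reindex(s.index).ffill().fillna(0)

steps = pd.DataFrame({'orig': s, 'a': a, 'c': c, 'a_plus_c': a + c})
# Negative sums mark positions outside a run; they become 0 in the result.
steps['result'] = steps['a_plus_c'].clip(lower=0).astype(int)
print(steps)
```

Here a is [0, 1, 0, 2, 3, 0, 0, 4, 5, 6, 0] and c is [0, 0, 0, -1, -1, -1, -1, -3, -3, -3, -3], so a + c is negative exactly at the non-matching positions that follow the second run start.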

Edit: changed the code slightly so that it works for any positive integer. For example:

>>> t = pd.Series([1, 2, 3, 1, 4, 2, 2, 3, 2, 2, 2, 1])
>>> consecutive_n(t, 2)
0     0
1     1
2     0
3     0
4     0
5     1
6     2
7     0
8     1
9     2
10    3
11    0
dtype: int64

2016-03-26 05:16:07
