Python pandas: get index boundaries from a Series of booleans

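I am trying to cut a video based on certain characteristics. My current strategy produces a pandas Series of booleans, one per frame, indexed by timestamp: True to keep the frame, False to dump it.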

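Since I plan to cut the video, I need to extract the boundaries from this list so that I can tell ffmpeg the beginning and end of each part I want to extract from the main video.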

To sum up:

I have a pandas Series that looks like this:

acquisitionTs 
0.577331  False 
0.611298  False 
0.645255  False 
0.679218  False 
0.716538  False 
0.784453  True 
0.784453  True 
0.818417  True 
0.852379  True 
0.886336  True 
0.920301  True 
0.954259  False 
      ... 
83.393376 False 
83.427345 False 
dtype: bool

(truncated for presentation purposes, but the timestamps usually begin at 0)

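I need to get the boundaries of the True sequences, so in this example I should get [[t_0, t_1], [t_2, t_3], ..., [t_2n-2, t_2n-1]], with t_0 = 0.784453 and t_1 = 0.920301, if there are n separate sequences of True in my pandas Series.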

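The problem looks very simple. In fact, you can shift the sequence by one and XOR it with the original to get a list of booleans in which True marks a boundary: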

e = df.shift(periods=1, freq=None, axis=0)^df 
print(e[e].index)

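(with df being a pandas Series). There is still some work left to do, such as determining whether the first element is a rising or a falling edge, but this hack works.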
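A minimal sketch of how the shift idea could be turned directly into [start, end] pairs. Instead of a plain XOR it compares each frame with its previous and next neighbour, and it assumes that positions outside the Series count as False (via `fill_value`), so a run touching the first or last frame still gets both edges. The function name `true_runs` and the sample data are hypothetical.

```python
import pandas as pd

def true_runs(keep):
    """Return [[start, end], ...] index labels for each run of True."""
    keep = keep.astype(bool)
    # Neighbouring values; anything outside the Series is treated as False.
    prev = keep.shift(1, fill_value=False)
    nxt = keep.shift(-1, fill_value=False)
    # Rising edge: True whose predecessor is False (first frame of a run).
    starts = keep.index[(keep & ~prev).to_numpy()]
    # Last frame of a run: True whose successor is False.
    ends = keep.index[(keep & ~nxt).to_numpy()]
    return [[s, e] for s, e in zip(starts, ends)]

# Hypothetical sample: timestamps as index, True means "keep this frame".
frames = pd.Series(
    [True, True, False, False, True, True, True, False, True],
    index=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
)
print(true_runs(frames))
```

Here each end is the timestamp of the last True frame of a run, matching the t_1 = 0.920301 convention in the question, not the first False frame after it.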

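However, this does not seem very pythonic. The problem is so simple that I believe there must be a prebuilt function somewhere in pandas, numpy or even plain Python that would fit in a single function call, instead of a hack like the one above. The groupby function looks promising, but I have never used it.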

What would be the best way of doing this?

Source

2016-08-12 Clément Pinard

I would use a DataFrame instead of a Series (it actually works with a Series too).

df 
    acquisitionTs Value 
0  0.577331 False 
1  0.611298 False 
2  0.645255 False 
3  0.679218 False 
4  0.716538 False 
5  0.784453 True 
6  0.784453 True 
7  0.818417 False 
8  0.852379 True 
9  0.886336 True 
10  0.920301 True 
11  0.954259 False

and I would do:

df[df.Value.diff().fillna(False)] 
    acquisitionTs Value 
5  0.784453 True 
7  0.818417 False 
8  0.852379 True 
11  0.954259 False

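So, since you know the first value is False here, you know that rows 0-4 are False, and the value then switches at each of the returned indices (5, 7, 8, 11).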

I don't think the groupby function will help you, because it loses the order of your True/False values (in my example you would get 2 groups instead of 5).
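A short sketch illustrating the point about groupby, using the same values as the table above. Grouping on the Value column alone does give only 2 groups. As an assumption beyond this answer, grouping on a run id that increments at every switch keeps the 5 consecutive runs apart. The column name `run` and the variable names are hypothetical.

```python
import pandas as pd

# Same values as the example table above.
df = pd.DataFrame({
    "acquisitionTs": [0.577331, 0.611298, 0.645255, 0.679218, 0.716538, 0.784453,
                      0.784453, 0.818417, 0.852379, 0.886336, 0.920301, 0.954259],
    "Value": [False, False, False, False, False, True,
              True, False, True, True, True, False],
})

# Grouping on the raw values merges all True rows into a single group.
n_value_groups = df.groupby("Value").ngroups

# A run id that increases by 1 whenever Value differs from the previous row.
df["run"] = df["Value"].ne(df["Value"].shift()).cumsum()
n_runs = df["run"].nunique()

# First and last timestamp of each run that consists of True rows.
kept = (df[df["Value"]]
        .groupby("run")["acquisitionTs"]
        .agg(start="first", end="last"))

print(n_value_groups, n_runs)
print(kept.values.tolist())
```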

Source

2016-08-12 12:37:46 jrjc

Nice: this uses what is already available rather than introducing an extraneous dependency.

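Thanks for your answer! However, your code does not seem to account for the first element, which can be either True or False, so you could end up with the opposite of what you wanted. A simple fix would be to insert the first row into the result if it is True (and likewise for the last row). Thanks for the help! Edit: actually, we can just look at the value of the first (and last) element of the result, which tells us whether the edge is rising or falling, so there is no problem after all.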

You could use scipy.ndimage.label to identify the clusters of True values:

In [102]: ts 
Out[102]: 
0.069347 False 
0.131956 False 
0.143948 False 
0.224864 False 
0.242640  True 
0.372599 False 
0.451989 False 
0.462090 False 
0.579956  True 
0.588791  True 
0.603638 False 
0.625107 False 
0.642565 False 
0.708547 False 
0.730239 False 
0.741652 False 
0.747126  True 
0.783276  True 
0.896705  True 
0.942829  True 
Name: keep, dtype: bool 

In [103]: groups, nobs = ndimage.label(ts); groups 
Out[103]: array([0, 0, 0, 0, 1, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3], dtype=int32)

Once you have the groups array, you can find the associated times using groupby/agg:

result = (df.loc[df['group'] != 0] 
       .groupby('group')['times'] 
       .agg(start='first', end='last'))

For example,

import numpy as np 
import pandas as pd 
import scipy.ndimage as ndimage 
np.random.seed(2016) 

def make_ts(N, ngroups): 
    times = np.random.random(N) 
    times = np.sort(times) 
    idx = np.sort(np.random.randint(N, size=(ngroups,))) 
    arr = np.zeros(N) 
    arr[idx] = 1 
    arr = arr.cumsum() 
    arr = (arr % 2).astype(bool) 
    ts = pd.Series(arr, index=times, name='keep') 
    return ts 

def find_groups(ts): 
    groups, nobs = ndimage.label(ts) 
    df = pd.DataFrame({'times': ts.index, 'group': groups}) 
    result = (df.loc[df['group'] != 0] 
       .groupby('group')['times'] 
       .agg(start='first', end='last')) 
    return result 

ts = make_ts(20, 5) 
result = find_groups(ts)

yields

  start  end 
group      
1  0.242640 0.242640 
2  0.579956 0.588791 
3  0.747126 0.942829

To obtain the start and end times as a list of lists, you could use:

In [125]: result.values.tolist() 
Out[125]: 
[[0.24264034406127022, 0.24264034406127022], 
[0.5799564094638113, 0.5887908182432907], 
[0.747126, 0.9428288694956402]]

Using ndimage.label is convenient, but note that it is also possible to compute this without scipy:

def find_groups_without_scipy(ts): 
    df = pd.DataFrame({'times': ts.index, 'group': (ts.diff() == True).cumsum()}) 
    result = (df.loc[df['group'] % 2 == 1] 
       .groupby('group')['times'] 
       .agg(start='first', end='last')) 
    return result

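The main idea here is to find labels for the clusters of True values using (ts.diff() == True).cumsum(). ts.diff() == True gives the same result as ts.shift() ^ ts, but is a bit faster. Taking the cumulative sum (i.e. calling cumsum) treats True as 1 and False as 0, so the cumulative sum increases by 1 each time a True is encountered. Each cluster is therefore labeled with a different number: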

In [111]: (ts.diff() == True).cumsum() 
Out[111]: 
0.069347 0 
0.131956 0 
0.143948 0 
0.224864 0 
0.242640 1 
0.372599 2 
0.451989 2 
0.462090 2 
0.579956 3 
0.588791 3 
0.603638 4 
0.625107 4 
0.642565 4 
0.708547 4 
0.730239 4 
0.741652 4 
0.747126 5 
0.783276 5 
0.896705 5 
0.942829 5 
Name: keep, dtype: int64
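The filter `group % 2 == 1` relies on the series starting with False: if the first value is True, the first True cluster gets label 0 and the parity flips. The following is a sketch of a variant that avoids this, assuming it is acceptable to select clusters by their value instead of by label parity. The function name `find_true_runs` and the sample series are hypothetical.

```python
import pandas as pd

def find_true_runs(ts):
    """Start/end index label of each run of True, whatever the first value is."""
    keep = ts.to_numpy(dtype=bool)
    # Run id: increases by 1 whenever a value differs from the previous one.
    run_id = (ts != ts.shift()).cumsum().to_numpy()
    df = pd.DataFrame({'times': ts.index, 'group': run_id, 'keep': keep})
    # Select runs by their value rather than by odd/even label.
    return (df.loc[df['keep']]
              .groupby('group')['times']
              .agg(start='first', end='last'))

# Hypothetical series that starts with True.
ts = pd.Series([True, True, False, True, False],
               index=[0.1, 0.2, 0.3, 0.4, 0.5], name='keep')
print(find_true_runs(ts).values.tolist())
```

The named-aggregation form `agg(start='first', end='last')` is used because recent pandas versions no longer accept a renaming dict on a single grouped column.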

Source

2016-08-12 12:34:19 unutbu
