2017-08-16 24 views

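Pandas way to find discontinuous data

I want to know which columns of a pandas DataFrame contain discontinuous data. By "discontinuous" I mean that the values drop from a nonzero value back to zero and then become nonzero again.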

[0,0,0,1,2,3,4,5,0,0,0] # continuous 
[0,0,0,1,2,0,4,5,0,0,0] # not continuous 

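I have managed to implement some code that does this by looping over each column of the DataFrame. I made the following working snippet to illustrate what I mean: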

import numpy as np 
import pandas as pd 

def find_discontinuous(series): 
    # switch: 0 = leading zeros, 1 = inside the nonzero run, 2 = zeros after the run
    switch = 0 
    for index, val in series.items():  # iteritems() was removed in pandas 2.0
        if switch == 0 and val == 0: 
            continue 
        elif switch == 0 and val != 0: 
            switch = 1 
        if switch == 1 and val == 0: 
            switch = 2 
            continue 
        if switch == 2 and val != 0: 
            # a nonzero value appeared after the run had already ended
            return "not continuous" 
    return "continuous" 

data = np.array([[0,1,2,3,4,5,0], 
                 [0,1,2,0,4,5,0]]) 
df = pd.DataFrame(data,columns=list(range(7)),index=list(range(2))).transpose() 

for column in df.columns: 
    series = df.loc[:,column] 
    res = find_discontinuous(series) 
    print(column,res) 

Output:

0 continuous 
1 not continuous 

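I read somewhere that using a for loop to go through a pandas DataFrame may not be the right approach, because iterating over it is slow. What is the pandas way to achieve the same thing?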


So anything that is not discontinuous counts as continuous? Would a column of all zeros be continuous, for example? – Divakar

Answers

1

You only need to check that there are no zeros between the first nonzero value and the last nonzero value:

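def is_continuous(series): 
    nonzero = series != 0                    # the question tests val != 0, not val > 0
    id_first_true = nonzero.idxmax()         # label of the first nonzero value
    id_last_true = nonzero[::-1].idxmax()    # label of the last nonzero value
    # continuous if every value between those two labels is nonzero
    return bool(nonzero.loc[id_first_true:id_last_true].all())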
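
To try this on the question's sample data, the function can be passed to `df.apply`, which calls it once per column. The following is a minimal sketch rather than a definitive version: it repeats the function with `!= 0` as the nonzero test (taken from the question's loop), and it adds an early return so that an all-zero column counts as continuous. That edge case is an assumption; it matches what the question's `find_discontinuous` returns for all zeros.

```python
import numpy as np
import pandas as pd

def is_continuous(series):
    nonzero = series != 0
    if not nonzero.any():
        # Assumption: a column with no nonzero values counts as continuous,
        # which is what the question's loop returns for that case.
        return True
    first = nonzero.idxmax()              # label of the first nonzero value
    last = nonzero.iloc[::-1].idxmax()    # label of the last nonzero value
    # Continuous if there is no zero between the first and last nonzero value.
    return bool(nonzero.loc[first:last].all())

# Same sample data as in the question: one column per original row.
data = np.array([[0, 1, 2, 3, 4, 5, 0],
                 [0, 1, 2, 0, 4, 5, 0]])
df = pd.DataFrame(data).T

result = df.apply(is_continuous)   # boolean Series indexed by column name
print(result)
```

Note that `.loc[first:last]` is a label slice, so this assumes the index labels are unique and ordered, as they are with the default integer index.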
1

You can use apply to turn df into a Series indexed by column name, with a Boolean value that says whether each column is continuous:

df.apply(lambda y: not(any(map(lambda x: x[1] == 0 and x[0]>0 and x[2]>0, zip(reversed(y), reversed(y[:-1]), reversed(y[:-2])))))) 
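
The one-liner above only flags a single zero that has a positive value directly on each side, so a gap of two or more zeros (for example `1, 0, 0, 4`) is not detected. One possible alternative, sketched here under the assumption that "continuous" means "at most one run of nonzero values", is to count where nonzero runs start. The helper name `count_nonzero_runs` and the extra column `c` are hypothetical and only serve the illustration.

```python
import pandas as pd

def count_nonzero_runs(column):
    """Count the runs of consecutive nonzero values in a Series (hypothetical helper)."""
    nonzero = column.ne(0)
    # A run starts where a nonzero value follows a zero or opens the column.
    starts = nonzero & ~nonzero.shift(1, fill_value=False)
    return int(starts.sum())

df = pd.DataFrame({
    "a": [0, 1, 2, 3, 4, 5, 0],   # one run of nonzero values
    "b": [0, 1, 2, 0, 4, 5, 0],   # two runs, split by a single zero
    "c": [0, 1, 0, 0, 4, 5, 0],   # two runs, split by two zeros
})

# A column is treated as continuous when it has at most one nonzero run.
is_cont = df.apply(count_nonzero_runs) <= 1
print(is_cont)
```

Because the check works on a Boolean mask and a shift, it does not depend on the index being the default integer range.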

Or you can use your own function with apply:

df.apply(find_discontinuous) 
#0  continuous 
#1 not continuous
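
Passing `find_discontinuous` to `apply` still runs the Python loop over every value. If that turns out to be too slow, one option is to work on the underlying NumPy array for all columns at once. The sketch below is an illustration under assumptions, not part of the original answers: the function name `continuity_labels` is hypothetical, and an all-zero column is labelled "continuous" to match what the question's loop returns. The idea is that a column is continuous exactly when its number of nonzero values equals the distance from its first to its last nonzero value.

```python
import numpy as np
import pandas as pd

def continuity_labels(df):
    """Label each column 'continuous' or 'not continuous' (hypothetical helper)."""
    mask = df.to_numpy() != 0                        # True where a value is nonzero
    n_rows = mask.shape[0]
    count = mask.sum(axis=0)                         # nonzero values per column
    first = mask.argmax(axis=0)                      # position of the first nonzero value
    last = n_rows - 1 - mask[::-1].argmax(axis=0)    # position of the last nonzero value
    # Length of the stretch from first to last nonzero; 0 for all-zero columns.
    span = np.where(count > 0, last - first + 1, 0)
    # No zero inside the stretch means every position in it is nonzero.
    labels = np.where(count == span, "continuous", "not continuous")
    return pd.Series(labels, index=df.columns)

data = np.array([[0, 1, 2, 3, 4, 5, 0],
                 [0, 1, 2, 0, 4, 5, 0]])
df = pd.DataFrame(data).T

labels = continuity_labels(df)
print(labels)
```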