2017-04-13 71 views
1

Merge DataFrames with an ordering criterion

In a previous question, I asked how to match values from this DataFrame, source:

 car_id  lat  lon 
0 100  10.0 15.0 
1 100  12.0 10.0 
2 100  13.0 09.0 
3 110  23.0 08.0 
4 110  13.0 09.0 
5 110  12.0 10.0 
6 110  12.0 02.0 
7 120  11.0 11.0 
8 120  12.0 10.0 
9 120  13.0 09.0 
10 120  14.0 08.0 
11 130  12.0 10.0 

keeping only the rows whose coordinates appear in this second DataFrame, coords:

 lat  lon 
0 12.0 10.0 
1 13.0 09.0 

But this time I would like to keep each car_id that has:

  • all the values from coords

  • in the same order

so that the resulting DataFrame, result, is:

     car_id 
    1 100 
    2 120 
    
    # 110 has all the values from coords, but not in the same order 
    # 130 doesn't have all the values from coords 
    

Is there a way to achieve this result in a vectorized way, without going through a lot of loops and conditionals?

  • Answers

    1

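    Plan

    • We'll groupby 'car_id' and evaluate each subset.
    • After an inner merge with coords, we should see two things:
      1. The merged DataFrame has the same values as coords (in the same order).
      2. The merged DataFrame contains everything in coords.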



    import pandas as pd

    # source and coords are the two DataFrames from the question
    def duper(df):
        m = df.merge(coords) 
        c = pd.concat([m, coords]) 
        # we put the merged rows first and those are 
        # the ones we'll keep after `drop_duplicates(keep='first')` 
        # `keep='first'` is the default, so I don't pass it 
        c1 = (c.drop_duplicates().values == coords.values).all() 
    
        # if `keep=False` then I drop all duplicates. If I got 
        # everything in `coords` this should be empty 
        c2 = c.drop_duplicates(keep=False).empty 
        return c1 & c2 
    
    source.set_index('car_id').groupby(level=0).filter(duper).index.unique().values 
    
    array([100, 120]) 
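
As a rough illustration of the two checks inside duper, the sketch below rebuilds the question's source and coords and reports both flags per car_id. The helper name checks and the variable trace are made up for this example; the logic mirrors duper, only with the two flags returned separately instead of combined.

```python
import pandas as pd

# Data copied from the question.
source = pd.DataFrame({
    "car_id": [100, 100, 100, 110, 110, 110, 110, 120, 120, 120, 120, 130],
    "lat": [10.0, 12.0, 13.0, 23.0, 13.0, 12.0, 12.0, 11.0, 12.0, 13.0, 14.0, 12.0],
    "lon": [15.0, 10.0, 9.0, 8.0, 9.0, 10.0, 2.0, 11.0, 10.0, 9.0, 8.0, 10.0],
})
coords = pd.DataFrame({"lat": [12.0, 13.0], "lon": [10.0, 9.0]})


def checks(df):
    """Hypothetical helper: return (same_order, has_all) for one car's rows."""
    # Inner merge keeps only the rows found in coords, in the car's own row order.
    m = df[["lat", "lon"]].merge(coords)
    c = pd.concat([m, coords])
    # Merged rows come first, so the de-duplicated frame follows the car's order.
    same_order = bool((c.drop_duplicates().values == coords.values).all())
    # Any row surviving keep=False is a coords row the car never visited.
    has_all = bool(c.drop_duplicates(keep=False).empty)
    return same_order, has_all


trace = {int(car_id): checks(group) for car_id, group in source.groupby("car_id")}
for car_id, flags in trace.items():
    print(car_id, flags)
```

With this data, car 110 fails only the order check and car 130 fails only the completeness check, which matches the two comments under the expected result in the question.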
    

    Slight alternative

    def duper(df): 
        # drop the grouping column so the merge uses only lat and lon
        m = df.drop(columns='car_id').merge(coords)
        c = pd.concat([m, coords]) 
        c1 = (c.drop_duplicates().values == coords.values).all() 
        c2 = c.drop_duplicates(keep=False).empty 
        return c1 & c2 
    
    source.groupby('car_id').filter(duper).car_id.unique() 
    
    1

    This isn't pretty, but what if you did something like this:

    # df is the question's source DataFrame (built explicitly in the general case)
    df2 = pd.DataFrame(df, copy=True)
    df2[['lat2', 'lon2']] = df[['lat', 'lon']].shift(-1) 
    df2.set_index(['lat', 'lon', 'lat2', 'lon2'], inplace=True) 
    print(df2.loc[(12, 10, 13, 9)].reset_index(drop=True)) 
    
        car_id 
    0  100 
    1  120 
    

    and this would be the general case:

    raw_data = {'car_id': [100, 100, 100, 110, 110, 110, 110, 120, 120, 120, 120, 130], 
          'lat': [10, 12, 13, 23, 13, 12, 12, 11, 12, 13, 14, 12], 
          'lon': [15, 10, 9, 8, 9, 10, 2, 11, 10, 9, 8, 10], 
          } 
    df = pd.DataFrame(raw_data, columns = ['car_id', 'lat', 'lon']) 
    
    raw_data = { 
          'lat': [10, 12, 13], 
          'lon': [15, 10, 9], 
          } 
    
    coords = pd.DataFrame(raw_data, columns = ['lat', 'lon']) 
    
    def submatch(df, match):
        n = match.shape[0]
        df2 = pd.DataFrame(df['car_id'])
        # add one (lat, lon) column pair per offset, shifted up by that offset
        for x in range(n):
            df2[['lat{}'.format(x), 'lon{}'.format(x)]] = df[['lat', 'lon']].shift(-x)

        # flattened column names: lat0, lon0, lat1, lon1, ...
        cols = [item for x in range(n)
                for item in ('lat{}'.format(x), 'lon{}'.format(x))]

        df2.set_index(cols, inplace=True)
        # look up the flattened match sequence as a single MultiIndex key
        return df2.loc[tuple(match.stack().values)].reset_index(drop=True)
    
    print(submatch(df, coords)) 
    
        car_id 
    0  100 
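
One thing to watch in the shift-based approach: shift is applied to the whole frame, so a window can pair the last row of one car with the first row of the next. Below is a minimal sketch of a variation that shifts within each car_id instead. It is not the answer author's code; the function name cars_with_consecutive_match is invented here, and it assumes "in the same order" means the coords rows appear as consecutive rows for a car.

```python
import pandas as pd

# Data copied from the question.
source = pd.DataFrame({
    "car_id": [100, 100, 100, 110, 110, 110, 110, 120, 120, 120, 120, 130],
    "lat": [10.0, 12.0, 13.0, 23.0, 13.0, 12.0, 12.0, 11.0, 12.0, 13.0, 14.0, 12.0],
    "lon": [15.0, 10.0, 9.0, 8.0, 9.0, 10.0, 2.0, 11.0, 10.0, 9.0, 8.0, 10.0],
})
coords = pd.DataFrame({"lat": [12.0, 13.0], "lon": [10.0, 9.0]})


def cars_with_consecutive_match(source, coords):
    """Hypothetical helper: car_ids whose rows contain coords as a consecutive run."""
    grouped = source.groupby("car_id")[["lat", "lon"]]
    # mask[i] stays True only if rows i, i+1, ... of the same car equal coords rows 0, 1, ...
    mask = pd.Series(True, index=source.index)
    for k, (lat, lon) in enumerate(coords[["lat", "lon"]].itertuples(index=False)):
        # Shifting inside each group leaves NaN past the end of a car's rows,
        # and NaN never compares equal, so windows cannot cross car boundaries.
        shifted = grouped.shift(-k)
        mask &= (shifted["lat"] == lat) & (shifted["lon"] == lon)
    return source.loc[mask, "car_id"].unique()


print(cars_with_consecutive_match(source, coords))
```

The loop runs once per row of coords rather than once per row of source, so the per-row work stays vectorized. If gaps between the matching rows should be allowed, this sketch would not apply as written.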
    
    +0

    What is the original df for this answer? –