Pandas comparison

I'm trying to simplify the pandas and Python syntax for a basic pandas operation.

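I have 4 columns:

a_id
a_score
b_id
b_score

I create a new label called doc_type based on the following:

a >= b, doc_type: a
b > a, doc_type: b

I'm struggling with how to handle the case in pandas where a exists but b doesn't; in that case a needs to be the label. Right now it falls through to the else branch and returns b. I needed to add 2 extra comparisons, which may be inefficient at scale since I have already compared the data before. I'm looking for a way to improve this.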

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
    'a_score': [1, 2, 3, 4, '', 6, 7],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
print(df)
# Replace empty strings with NaN
df = df.replace('', np.nan)
# Pick the id with the higher score
df['doc_id'] = df.apply(lambda row: row['a_id'] if row['a_score'] >= row['b_score'] else row['b_id'], axis=1)
# Select type based on higher score
m_score = df['a_score'] >= df['b_score']
m_doc = df['a_id'].isnull() & df['b_id'].isnull()
# None rather than np.nan, so the missing label is not turned into the string 'nan'
df['doc_type'] = np.where(m_score, 'a', np.where(m_doc, None, 'b'))

# Additional lines looking for improvement:
df.loc[df['a_id'].isnull() & df['b_id'].notnull(), 'doc_type'] = 'b'
df.loc[df['a_id'].notnull() & df['b_id'].isnull(), 'doc_type'] = 'a'
print(df)

Source

2017-02-17 spicyramen

Do you actually need doc_id, or is it just part of your processing code? – Psidom

It's just part of the processing code; we can ignore it for now. – spicyramen

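You can use numpy.where, assuming your logic is:

both exist, doc_type will be the one with the higher score;
one is missing, doc_type will be the one that is not null;
both are missing, doc_type will be null;

An extra edge case has been added as the last row: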

import numpy as np

df = df.replace('', np.nan)
df['doc_type'] = np.where(df.b_id.isnull() | (df.a_score >= df.b_score),
                          np.where(df.a_id.isnull(), None, 'a'), 'b')
df
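
As a self-contained sketch of the snippet above, the following rebuilds the sample frame from the question and applies the same nested np.where. The pd.to_numeric conversion and the a_wins name are assumptions added here so the comparison runs on numeric columns; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
    'a_score': [1, 2, 3, 4, '', 6, 7],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})

# Treat empty strings as missing values.
df = df.replace('', np.nan)
# Assumption (not in the original answer): force numeric dtypes so that
# any comparison involving NaN simply evaluates to False.
df['a_score'] = pd.to_numeric(df['a_score'])
df['b_score'] = pd.to_numeric(df['b_score'])

# a is chosen when b is absent or when a's score is at least b's score.
a_wins = df['b_id'].isnull() | (df['a_score'] >= df['b_score'])
df['doc_type'] = np.where(a_wins,
                          np.where(df['a_id'].isnull(), None, 'a'),
                          'b')
print(df[['a_id', 'b_id', 'doc_type']])
```

The row with only a b entry (index 4) falls to 'b' because the comparison against NaN is False, and the row with only an a entry (index 6) is caught by the b_id.isnull() test, so no follow-up .loc assignments are needed.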

Source

2017-02-17 19:50:59 Psidom

Use the apply method in pandas with a custom function. Try this on your DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
    'a_score': [1, 2, 3, 4, '', 6, 7],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})

# Treat empty strings as missing values
df = df.replace('', np.nan)

def func(row):
    # Both scores missing: no label
    if np.isnan(row.a_score) and np.isnan(row.b_score):
        return np.nan
    # Only a has a score
    elif np.isnan(row.b_score) and not np.isnan(row.a_score):
        return 'a'
    # Only b has a score
    elif not np.isnan(row.b_score) and np.isnan(row.a_score):
        return 'b'
    elif row.a_score >= row.b_score:
        return 'a'
    elif row.b_score > row.a_score:
        return 'b'

df['doc_type'] = df.apply(func, axis=1)

You can make the function as complex as you need, include any number of comparisons, and add more conditions later if required.
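
If the list of conditions keeps growing, one possible alternative to a row-wise apply is np.select, which checks a list of boolean masks in priority order. This is only a sketch under the rules stated in the question; the sample scores and the names a_missing, b_missing, conditions and choices are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: the last row has neither score.
df = pd.DataFrame({
    'a_score': [1, 2, 3, 4, np.nan, 6, 7, np.nan],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, np.nan, np.nan],
})

a_missing = df['a_score'].isna()
b_missing = df['b_score'].isna()

# The first matching condition wins, so order encodes priority.
conditions = [
    b_missing,                       # only a present (or neither; masked below)
    a_missing,                       # only b present
    df['a_score'] >= df['b_score'],  # both present, a wins ties
]
choices = ['a', 'b', 'a']

labels = np.select(conditions, choices, default='b')
# Blank out the label where there is nothing to compare at all.
df['doc_type'] = pd.Series(labels, index=df.index).mask(a_missing & b_missing)
print(df)
```

Adding a new rule then means adding one mask and one label to the two lists, rather than another branch inside the function.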

Source

2017-02-17 19:30:38

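Hi Gaurav, your logic doesn't work for row 7 (index 6): it returns None, but it should return 'a', because there are values for a_id and a_score. It's the same problem described above. – spicyramen

I'm not sure I fully understand all the conditions, or whether there are any particular edge cases, but I think you can simply run np.argmax over the score columns and then swap the resulting values for 'a' or 'b':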

In [21]: import numpy as np

In [22]: df['doc_type'] = pd.Series(np.argmax(df[["a_score", "b_score"]].values, axis=1)).replace({0: 'a', 1: 'b'})

In [23]: df
Out[23]:
  a_id a_score b_id  b_score doc_type
0    A       1    a     0.10        a
1    B       2    b     0.20        a
2    C       3    c     3.10        b
3    D       4    d     4.10        b
4            2    e     5.00        b
5    F            f     5.99        a
6    G       7               NaN        a
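
One caveat worth checking on your own data: np.argmax does not skip missing values, so rows with a NaN score may not be labeled the way the question wants. A possible workaround, sketched below as an assumption rather than as part of the original answer, is to fill missing scores with -inf before taking the argmax and to blank out rows where both scores are missing. The sample scores and the names scores, winner and labels are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data: the last row has neither score.
df = pd.DataFrame({
    'a_score': [1, 2, 3, 4, np.nan, 6, 7, np.nan],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, np.nan, np.nan],
})

scores = df[['a_score', 'b_score']]
# A missing score can never win once it is replaced by -inf.
# On ties argmax returns the first column, which matches the a >= b rule.
winner = np.argmax(scores.fillna(-np.inf).to_numpy(), axis=1)
labels = pd.Series(winner, index=df.index).map({0: 'a', 1: 'b'})
# Rows where both scores are missing get no label at all.
df['doc_type'] = labels.mask(scores.isna().all(axis=1))
print(df)
```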

Source

2017-02-17 19:47:48
