Pandas: create columns based on criteria in rows of another DataFrame

Given the following DataFrames:

import pandas as pd 
import numpy as np 
pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'], 
        'Position':['Contractor','President'], 
        'Site(s)':['A,B','A'], 
        'Item(s)':['1','1,2'] 
        }) 

pos = pos[['Position','Site(s)','Station(s)','Item(s)']] 

pos 

    Position Site(s)  Station(s) Item(s) 
0 Contractor A,B   ,1,2,,  1 
1 President A   0,1,2,3,4 1,2

and

sd = pd.DataFrame({'Site':['A','B','B','C','A','A'], 
        'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'], 
        'Item 1':[1,1,0,0,1,0], 
        'Item 2':[1,0,0,1,1,1]}) 
sd = sd[['Site','Station(s)','Item 1','Item 2']] 

sd 

    Site Station(s) Item 1 Item 2 
0 A  ,1,2,,  1 1 
1 B  ,1,2,,  1 0 
2 B  ,,,,   0 0 
3 C  ,1,2,,  0 1 
4 A  0,1,2,,  1 1 
5 A  ,,2,,   0 1

I would like to end up with this:

Contractor President Site(s)  Station(s) Item 1 Item 2 
0  1   1   A   ,1,2,,  1  1 
1  1   0   B   ,1,2,,  1  0 
2  0   0   B   ,,,,   0  0 
3  0   0   C   ,1,2,,  0  1 
4  0   1   A   0,1,2,,  1  1 
5  1   1   A   ,,2,,  0  1 

results = pd.DataFrame({'Contractor':[1,1,0,0,0,1], 
        'President':[1,0,0,0,1,1], 
        'Site(s)':['A','B','B','C','A','A'], 
        'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'], 
        'Item 1':[1,1,0,0,1,0], 
        'Item 2':[1,0,0,1,1,1]}) 
results = results[['Contractor','President','Site(s)','Station(s)','Item 1','Item 2']]

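based on this logic:

For each position:

Create a new column in sd named after the position.
Set its value to 1 for every row that meets both of the following conditions (and to 0 for all other rows):

a. sd['Site'] contains at least one value found in pos['Site(s)'].

b. sd['Station(s)'] contains at least one number found in pos['Station(s)'], but no extra numbers.

I started with this, but quickly got stuck: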

for i in pos['Position']: 
    sd[i]= 1 if lambda x: 'x' if x for x in pos['Site(s)'] if x in sd['Site']

Source

2016-05-16 Dance Party

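Because of the way the data is stored (in strings of comma-separated values), the code needs to iterate through the rows of one DataFrame, pick apart the values, iterate through the rows of the other DataFrame and pick apart its values, then compare the two, and so on. I don't see a way to really improve on this as long as the input keeps its comma-separated values.

Given the constraints, I think su79eu7k's answer is pretty good.

However, if you buy into the idea that "tidy data" (PDF) is better, and if you allow us to change the starting point to DataFrames in tidy format, then there is a different approach which might be more performant, particularly when sd has many rows. The problem with using sd.apply(check, axis=1) is that under the hood it uses a Python loop to iterate through the rows of sd. Calling check once for each row can be relatively slow compared to the faster vectorized Pandas methods such as merge or groupby. To use merge and groupby, however, you need the data in tidy format.

So suppose that instead of pos and sd we start with tidypos and tidysd. (At the end of this post you'll find a runnable example which converts pos and sd to their tidy equivalents.)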

In [238]: tidypos 
Out[238]: 
    Position Site Station 
0 Contractor A  1 
1 Contractor A  2 
2 Contractor B  1 
3 Contractor B  2 
4 President A  0 
5 President A  1 
6 President A  2 
7 President A  3 
8 President A  4 

In [239]: tidysd 
Out[239]: 
    index Site Station 
0  0 A  1 
1  0 A  2 
2  1 B  1 
3  1 B  2 
4  3 C  1 
5  3 C  2 
6  4 A  0 
7  4 A  1 
8  4 A  2 
9  5 A  2

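tidypos and tidysd contain the same information as pos and sd (ignoring the Items, since they play no role in this problem). The main difference is that each row in tidypos and tidysd corresponds to one "observation", and each observation is independent of the others. Basically, this boils down to splitting the comma-separated values so that each value ends up on a separate row.

Now we can join the two DataFrames on their common columns, Site and Station: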

In [241]: merged = pd.merge(tidysd, tidypos, how='left'); merged 
Out[241]: 
    index Site Station Position 
0  0 A  1 Contractor 
1  0 A  1 President 
2  0 A  2 Contractor 
3  0 A  2 President 
4  1 B  1 Contractor 
5  1 B  2 Contractor 
6  3 C  1   NaN 
7  3 C  2   NaN 
8  4 A  0 President 
9  4 A  1 Contractor 
10  4 A  1 President 
11  4 A  2 Contractor 
12  4 A  2 President 
13  5 A  2 Contractor 
14  5 A  2 President
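As a side note on how='left': rows of tidysd with no counterpart in tidypos are kept, with NaN in Position. The sketch below uses pd.merge's indicator=True option to make that visible. The two tiny frames are hand-made stand-ins for illustration, not the full tidysd and tidypos from this post.

```python
import pandas as pd

# Hand-made miniature versions of tidysd and tidypos (illustrative only).
tidysd = pd.DataFrame({'index': [0, 3],
                       'Site': ['A', 'C'],
                       'Station': ['1', '1']})
tidypos = pd.DataFrame({'Position': ['Contractor', 'President'],
                        'Site': ['A', 'A'],
                        'Station': ['1', '1']})

# indicator=True adds a '_merge' column telling where each row came from.
merged = pd.merge(tidysd, tidypos, how='left', indicator=True)
print(merged)
```

The row for Site C is marked left_only, which is why it shows up with Position NaN rather than disappearing.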

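Now each row in merged with a non-NaN Position represents a match between a row of tidysd and a row of tidypos. So the existence of such a row means there is a match between sd['Site'] and pos['Site(s)'] and, moreover, a match between tidysd['Station'] and tidypos['Station']. In other words, for that row, sd['Station(s)'] must contain a number found in pos['Station(s)']. The only criterion we are not yet sure of is whether sd['Station(s)'] contains extra numbers which do not appear in pos['Station(s)'].

We can find that out by counting the number of rows in merged for each index and Position, since each such row corresponds to a different Station. If this number equals the total number of Stations for that index, then sd['Station(s)'] contains no "extra numbers".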
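To make the counting argument concrete, here is a small plain-Python sketch for sd row 4 only. The station sets are copied by hand from the merged and tidysd output above; the names matched, total and flags are made up for this illustration.

```python
# Stations matched per (index, Position), hand-copied from `merged` for index 4.
matched = {
    (4, 'Contractor'): {'1', '2'},
    (4, 'President'): {'0', '1', '2'},
}

# Every station listed for sd row 4, hand-copied from `tidysd`.
total = {4: {'0', '1', '2'}}

# A position gets 1 only when every station of the row was matched,
# i.e. the matched count equals the total count.
flags = {key: int(len(stations) == len(total[key[0]]))
         for key, stations in matched.items()}
print(flags)
```

Contractor matches only 2 of the 3 stations (station 0 is "extra"), so it gets 0, while President matches all 3 and gets 1.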

We can use groupby/nunique to count the number of Stations for each index and Position:

In [256]: pos_count = merged.groupby(['index', 'Position'])['Station'].nunique().unstack(); pos_count 
Out[256]: 
Position Contractor President 
index       
0    2.0  2.0 
1    2.0  NaN 
4    2.0  3.0 
5    1.0  1.0

Similarly, we can count the total number of Stations for each index:

In [243]: total_count = tidysd.groupby(['index'])['Station'].nunique(); total_count 
Out[243]: 
index 
0 2 
1 2 
3 2 
4 3 
5 1 
Name: Station, dtype: int64

Finally, we can assign 1's and 0's to the Contractor and President columns, based on the criterion (pos_count[col] == total_count):

pos_count = pos_count.reindex(total_count.index, fill_value=0) 
for col in pos_count: 
    pos_count[col] = (pos_count[col] == total_count).astype(int) 
pos_count = pos_count.reindex(sd.index, fill_value=0) 
# Position Contractor President 
# 0     1   1 
# 1     1   0 
# 2     0   0 
# 3     0   0 
# 4     0   1 
# 5     1   1
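One detail of the snippet above that is easy to miss: reindex(..., fill_value=0) only fills labels that are new to the index. NaNs already present (such as President for index 1) stay NaN and simply compare unequal to the total. A minimal sketch with toy Series (counts and totals are illustrative names, not from the code above):

```python
import numpy as np
import pandas as pd

# Toy stand-in for one column of pos_count: label 1 exists but holds NaN.
counts = pd.Series([2.0, np.nan], index=[0, 1])

# Label 3 is new, so it receives fill_value; the existing NaN at label 1 stays.
reindexed = counts.reindex([0, 1, 3], fill_value=0)

# Toy stand-in for total_count on the same labels.
totals = pd.Series([2, 2, 2], index=[0, 1, 3])

# NaN == 2 and 0 == 2 are both False, so both end up as 0.
flags = (reindexed == totals).astype(int)
print(flags.tolist())
```

Either way the flag comes out as 0, which is the behavior the criterion needs.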

If you really want to, you can then concatenate this result to the original sd to produce exactly the desired result:

In [246]: result = pd.concat([sd, pos_count], axis=1); result 
Out[246]: 
    Item 1 Item 2 Site Station(s) Contractor President 
0  1  1 A  ,1,2,,   1   1 
1  1  0 B  ,1,2,,   1   0 
2  0  0 B  ,,,,   0   0 
3  0  1 C  ,1,2,,   0   0 
4  1  1 A 0,1,2,,   0   1 
5  0  1 A  ,,2,,   1   1

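But again, if you buy into the idea that data should be tidy, you should avoid packing multiple rows of data into comma-separated strings.

How to tidy up pos and sd:

You can use the vectorized string methods .str.findall and .str.split to convert the comma-separated strings into lists of values. Then use list comprehensions to iterate through the rows and the lists to build tidypos and tidysd.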
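As an alternative sketch, assuming a pandas version that provides DataFrame.explode (0.25 or later, which did not exist when this answer was written), the list-valued columns can be expanded without an explicit list comprehension. This is not the method used in this post, just another way to reach the same tidy frames:

```python
import pandas as pd

pos = pd.DataFrame({'Station(s)': [',1,2,,', '0,1,2,3,4'],
                    'Position': ['Contractor', 'President'],
                    'Site(s)': ['A,B', 'A']})
sd = pd.DataFrame({'Site': ['A', 'B', 'B', 'C', 'A', 'A'],
                   'Station(s)': [',1,2,,', ',1,2,,', ',,,,',
                                  ',1,2,,', '0,1,2,,', ',,2,,']})

# One row per (Position, Site, Station); reset the index between the two
# explode calls so the second one works on unique row labels.
tidypos = (pos.assign(Site=pos['Site(s)'].str.split(','),
                      Station=pos['Station(s)'].str.findall(r'\d+'))
              .explode('Site').reset_index(drop=True)
              .explode('Station').reset_index(drop=True)
              [['Position', 'Site', 'Station']])

# One row per (index, Site, Station); rows with no stations explode to NaN
# and are dropped, and reset_index() keeps the original row label as 'index'.
tidysd = (sd.assign(Station=sd['Station(s)'].str.findall(r'\d+'))
            .explode('Station')
            .dropna(subset=['Station'])
            .reset_index()
            [['index', 'Site', 'Station']])

print(tidypos)
print(tidysd)
```

The station values stay strings in both frames, so they can be merged on Site and Station as they are.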

Putting it all together,

import itertools as IT 
import pandas as pd 

pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'], 
        'Position':['Contractor','President'], 
        'Site(s)':['A,B','A'], 
        'Item(s)':['1','1,2']}) 

sd = pd.DataFrame({'Site':['A','B','B','C','A','A'], 
        'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'], 
        'Item 1':[1,1,0,0,1,0], 
        'Item 2':[1,0,0,1,1,1]}) 

mypos = pos.copy() 
mypos['Station(s)'] = mypos['Station(s)'].str.findall(r'(\d+)') 
mypos['Site(s)'] = mypos['Site(s)'].str.split(r',') 
tidypos = pd.DataFrame(
    [(row['Position'], site, station) 
    for index, row in mypos.iterrows() 
    for site, station in IT.product(
      *[row[col] for col in ['Site(s)', 'Station(s)']])], 
    columns=['Position', 'Site', 'Station']) 

mysd = sd[['Site', 'Station(s)']].copy() 
mysd['Station(s)'] = mysd['Station(s)'].str.findall(r'(\d+)') 

tidysd = pd.DataFrame(
    [(index, row['Site'], station) 
    for index, row in mysd.iterrows() 
    for station in row['Station(s)']], 
    columns=['index', 'Site', 'Station']) 

merged = pd.merge(tidysd, tidypos, how='left') 
pos_count = merged.groupby(['index', 'Position'])['Station'].nunique().unstack() 
total_count = tidysd.groupby(['index'])['Station'].nunique() 
pos_count = pos_count.reindex(total_count.index, fill_value=0) 
for col in pos_count: 
    pos_count[col] = (pos_count[col] == total_count).astype(int) 
pos_count = pos_count.reindex(sd.index, fill_value=0) 
result = pd.concat([sd, pos_count], axis=1) 
print(result)

yields

Item 1 Item 2 Site Station(s) Contractor President 
0  1  1 A  ,1,2,,   1   1 
1  1  0 B  ,1,2,,   1   0 
2  0  0 B  ,,,,   0   0 
3  0  1 C  ,1,2,,   0   0 
4  1  1 A 0,1,2,,   0   1 
5  0  1 A  ,,2,,   1   1

Source

2016-05-17 00:13:05 unutbu

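Incredible. Thanks again. The information on tidy data is much appreciated. Unfortunately, I receive the data in its "untidy" state from the organization that provides it, but I will definitely pass the information about tidy data along to them. –

I tried it roughly; you may be able to improve on the code below.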

sd['Contractor'] = 0 
sd['President'] = 0 

def check(x): 
    # For each position, flag the row when its site is allowed and all of 
    # its (non-empty) stations appear among the position's stations. 
    for p in pos['Position'].tolist(): 
        if x['Site'] in pos.set_index('Position').loc[p, 'Site(s)'].split(','): 
            ss = pd.Series(x['Station(s)'].split(',')).replace('', np.nan).dropna() 
            ps = pd.Series(pos.set_index('Position').loc[p, 'Station(s)'].split(',')).replace('', np.nan).dropna() 
            if not ss.empty and ss.isin(ps).all(): 
                x[p] = 1 

    return x 

print(sd.apply(check, axis=1)) 


    Item 1 Item 2 Site Station(s) Contractor President 
0  1  1 A  ,1,2,,   1   1 
1  1  0 B  ,1,2,,   1   0 
2  0  0 B  ,,,,   0   0 
3  0  1 C  ,1,2,,   0   0 
4  1  1 A 0,1,2,,   0   1 
5  0  1 A  ,,2,,   1   1

Source

2016-05-16 02:03:06 su79eu7k

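For better performance, 'ps' could be hoisted out of 'check' (by making it a global or a default argument of 'check') so that it is computed only once rather than every time 'check' is called. – unutbu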
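A minimal sketch of that suggestion, under the assumption that plain Python sets are acceptable in place of the pd.Series objects: the per-position sites and stations are parsed once into a dict (lookup and as_set are hypothetical names introduced here) and passed to check as a default argument.

```python
import pandas as pd

pos = pd.DataFrame({'Station(s)': [',1,2,,', '0,1,2,3,4'],
                    'Position': ['Contractor', 'President'],
                    'Site(s)': ['A,B', 'A']})
sd = pd.DataFrame({'Site': ['A', 'B', 'B', 'C', 'A', 'A'],
                   'Station(s)': [',1,2,,', ',1,2,,', ',,,,',
                                  ',1,2,,', '0,1,2,,', ',,2,,']})

def as_set(text):
    """Split a comma-separated string into a set, dropping empty entries."""
    return {value for value in text.split(',') if value}

# Parsed once, outside check: Position -> (allowed sites, allowed stations).
lookup = {row['Position']: (as_set(row['Site(s)']), as_set(row['Station(s)']))
          for _, row in pos.iterrows()}

def check(row, lookup=lookup):
    stations = as_set(row['Station(s)'])
    # 1 when the site is allowed, the row has at least one station,
    # and none of its stations fall outside the position's stations.
    return pd.Series({
        position: int(row['Site'] in sites
                      and bool(stations)
                      and stations <= allowed)
        for position, (sites, allowed) in lookup.items()})

result = pd.concat([sd, sd.apply(check, axis=1)], axis=1)
print(result)
```

This still calls check once per row, so it only removes the repeated parsing of pos, not the Python-level loop itself.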
