Pandas: rearranging a DataFrame

I have a DataFrame as follows:

Honda [edit] 
Accord (4 models) 
Civic (4 models) 
Pilot (3 models) 
Toyota [edit] 
Prius (4 models) 
Highlander (3 models) 
Ford [edit] 
Explorer (2 models)

I would like to reshape it so that I end up with a two-column DataFrame like this:

Honda  Accord 
Honda  Civic 
Honda  Pilot 
Toyota Prius 
Toyota Highlander

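and so on. I tried str.split to split on "edit", but without success. Any suggestions are much appreciated! Python newbie here, so apologies if this has been answered before. Thanks!

So far I have tried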

 maker = car['T'].str.extract(r'(.*\[edit\])', expand=False).str.replace(r'\[edit\]', '', regex=True)

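This gives me the list of makers: Honda, Toyota and Ford. However, I am stuck on finding a way to extract the models listed between the makers in order to build the two-column DataFrame.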

Source

2017-01-04 Karthik

Can you show us what you have already tried? Put the code in the question. – Lucas

The trick is to extract the car (model) column first and then derive the maker.

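import pandas as pd
import numpy as np

# Model rows contain '(': keep the text before it, NaN on every other row
df['model'] = df['T'].apply(
    lambda x: x.split('(')[0].strip() if '(' in x else np.nan)

# Maker rows contain '[': keep the text before it, then forward-fill onto the model rows
df['maker'] = df['T'].apply(
    lambda x: x.split('[')[0].strip() if '[' in x else np.nan).ffill()

# Drop the maker-only rows and the raw column, order the columns, reset the index
df = df.dropna().drop('T', axis=1).reindex(
    columns=['maker', 'model']).reset_index(drop=True)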

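The first statement uses the string operations split and strip to extract the car name when the entry contains '('; otherwise it assigns NaN. We use NaN so that these rows can be dropped once the makers have been found. At this stage the DataFrame df will be: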

+----+-----------------------+------------+
|    | T                     | model      |
|----+-----------------------+------------|
|  0 | Honda [edit]          | nan        |
|  1 | Accord (4 models)     | Accord     |
|  2 | Civic (4 models)      | Civic      |
|  3 | Pilot (3 models)      | Pilot      |
|  4 | Toyota [edit]         | nan        |
|  5 | Prius (4 models)      | Prius      |
|  6 | Highlander (3 models) | Highlander |
|  7 | Ford [edit]           | nan        |
|  8 | Explorer (2 models)   | Explorer   |
+----+-----------------------+------------+

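The second statement does the same for the records containing '['. Here the NaNs are then replaced by a forward fill, which populates the empty maker cells. At this stage the DataFrame df will be: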

+----+-----------------------+------------+---------+
|    | T                     | model      | maker   |
|----+-----------------------+------------+---------|
|  0 | Honda [edit]          | nan        | Honda   |
|  1 | Accord (4 models)     | Accord     | Honda   |
|  2 | Civic (4 models)      | Civic      | Honda   |
|  3 | Pilot (3 models)      | Pilot      | Honda   |
|  4 | Toyota [edit]         | nan        | Toyota  |
|  5 | Prius (4 models)      | Prius      | Toyota  |
|  6 | Highlander (3 models) | Highlander | Toyota  |
|  7 | Ford [edit]           | nan        | Ford    |
|  8 | Explorer (2 models)   | Explorer   | Ford    |
+----+-----------------------+------------+---------+

The third statement drops the redundant records, rearranges the columns and resets the index:

|    | maker   | model      |
|----+---------+------------|
|  0 | Honda   | Accord     |
|  1 | Honda   | Civic      |
|  2 | Honda   | Pilot      |
|  3 | Toyota  | Prius      |
|  4 | Toyota  | Highlander |
|  5 | Ford    | Explorer   |
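For anyone who wants to run the three steps end to end, here is a minimal sketch on the sample data from the question. It is a vectorised variant of the idea, not the exact apply-based code above. The helper name split_makers_models is made up for illustration, and the column name 'T' is assumed from the answer.

```python
import pandas as pd

# Sample data copied from the question; the column name 'T' is an assumption.
df = pd.DataFrame({'T': [
    'Honda [edit]', 'Accord (4 models)', 'Civic (4 models)', 'Pilot (3 models)',
    'Toyota [edit]', 'Prius (4 models)', 'Highlander (3 models)',
    'Ford [edit]', 'Explorer (2 models)',
]})


def split_makers_models(frame, col='T'):
    """Hypothetical helper: build a maker/model frame from the raw column."""
    text = frame[col]
    is_maker = text.str.contains('[', regex=False)   # header rows
    is_model = text.str.contains('(', regex=False)   # model rows
    # Text before the bracket, kept only on rows of the matching kind.
    maker = text.str.split('[').str[0].str.strip().where(is_maker).ffill()
    model = text.str.split('(').str[0].str.strip().where(is_model)
    out = pd.DataFrame({'maker': maker, 'model': model})
    # Header rows have no model, so dropping NaN models removes them.
    return out.dropna(subset=['model']).reset_index(drop=True)


result = split_makers_models(df)
print(result)
```

Passing subset=['model'] to dropna keeps the intent explicit: only rows without a model are discarded, even if other columns are added later.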

Edit:

A more "pandorable" version (I like one-liners):

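df = (df['T'].str.extractall(r'(.+)\[|(.+)\(')              # column 0: makers, column 1: models
      .apply(lambda x: x.ffill() if x.name == 0 else x)    # forward-fill the maker column only
      .dropna(subset=[1])                                  # drop the maker header rows
      .reset_index(drop=True)                              # discard the MultiIndex
      .rename(columns={1: 'Model', 0: 'Maker'}))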

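The above works as follows. extractall returns a DataFrame with two columns: column 0 corresponds to the first group in the regular expression, '(.+)\[', i.e. the maker records ending with '[', and column 1 corresponds to the second group, '(.+)\('. apply is used to iterate over the columns: the column named 0 is modified so that the maker values are propagated forward via ffill, while column 1 is left unchanged. dropna is then used with subset 1 to remove every row whose value in column 1 is NaN, and reset_index is used to drop the MultiIndex that extractall generates.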
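To make the intermediate result visible, here is a small sketch that runs the same chain step by step on a shortened copy of the sample data. The variable names (s, parts, filled, tidy) are illustrative only. The final strip is an addition of mine: the groups capture the space before the bracket, so the raw values carry trailing whitespace.

```python
import pandas as pd

s = pd.Series(['Honda [edit]', 'Accord (4 models)',
               'Toyota [edit]', 'Prius (4 models)'])

# One row per regex match; unnamed groups become integer columns 0 and 1,
# and the index has two levels (original row, match number).
parts = s.str.extractall(r'(.+)\[|(.+)\(')
print(parts)

# Forward-fill only the maker column (named 0); leave the model column alone.
filled = parts.apply(lambda c: c.ffill() if c.name == 0 else c)

tidy = (filled.dropna(subset=[1])
              .reset_index(drop=True)
              .rename(columns={0: 'Maker', 1: 'Model'}))

# Remove the whitespace captured before '[' and '('.
tidy = tidy.apply(lambda c: c.str.strip())
print(tidy)
```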

(df['T']
 .apply(lambda line: [line.split('[')[0].strip(), None] if '[' in line
        else [None, line.split('(')[0].strip()])          # [maker, model] per row
 .apply(pd.Series)                                        # expand the lists into two columns
 .rename(columns={0: 'Maker', 1: 'Model'})
 .apply(lambda col: col.ffill() if col.name == 'Maker' else col)
 .dropna(subset=['Model'])
 .reset_index(drop=True))

Source

2017-01-04 07:43:46 sgDysregulation

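I think you can use 'None' instead of 'np.NaN' in the 'else' part of the 'if' expression in the 'lambda'. I haven't tested it, though. – sgDysregulation

Perfect, thank you all! Both work like a charm :) – Karthik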

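You can use extract and ffill. Then remove the rows that contain [edit] by boolean indexing with a mask built by str.contains, call reset_index to create a unique index, and finally drop the original column col: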

# Maker name from the '[edit]' rows, forward-filled onto the rows below
df['model'] = df.col.str.extract(r'(.*?)\s*\[edit\]', expand=False).ffill()
# First word of every row
df['type'] = df.col.str.extract(r'([A-Za-z]+)', expand=False)
# Keep only the rows that are not '[edit]' headers
df = df[~df.col.str.contains(r'\[edit\]')].reset_index(drop=True).drop('col', axis=1)
print(df)
    model        type
0   Honda      Accord
1   Honda       Civic
2   Honda       Pilot
3  Toyota       Prius
4  Toyota  Highlander
5    Ford    Explorer
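Since the third statement packs several operations into one line, here is a rough step-by-step sketch of what it does, on a shortened copy of the sample data. The names is_header and models_only are illustrative, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'col': ['Honda [edit]', 'Accord (4 models)', 'Civic (4 models)',
                           'Toyota [edit]', 'Prius (4 models)']})

# Boolean mask: True on the maker header rows, False on the model rows.
is_header = df['col'].str.contains(r'\[edit\]')
print(is_header.tolist())

# ~ inverts the mask, so boolean indexing keeps only the model rows.
models_only = df[~is_header]
print(models_only.index.tolist())   # the original labels survive, with gaps

# reset_index(drop=True) renumbers from 0 and discards the old labels.
models_only = models_only.reset_index(drop=True)
print(models_only)
```

In the answer the final drop('col', axis=1) then removes the raw text column, leaving only the two derived columns.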

Another solution uses extract and where to create the new column from a condition, and finally boolean indexing again:

# First word of every row
df['type'] = df.col.str.extract(r'([A-Za-z]+)', expand=False)
# Keep that word only on the '[edit]' rows, then forward-fill it
df['model'] = df['type'].where(df.col.str.contains(r'\[edit\]')).ffill()
# Header rows are the ones where both columns hold the same value
df = df[df.type != df.model].reset_index(drop=True).drop('col', axis=1)
print(df)
         type   model
0      Accord   Honda
1       Civic   Honda
2       Pilot   Honda
3       Prius  Toyota
4  Highlander  Toyota
5    Explorer    Ford
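The where step is the least obvious part of this variant, so here is a small sketch of it in isolation. The variable names are illustrative, and the data is a shortened copy of the sample.

```python
import pandas as pd

col = pd.Series(['Honda [edit]', 'Accord (4 models)',
                 'Toyota [edit]', 'Prius (4 models)'])

first_word = col.str.extract(r'([A-Za-z]+)', expand=False)
is_header = col.str.contains(r'\[edit\]')

# where keeps a value when the condition is True and puts NaN everywhere else,
# so only the maker names survive.
makers_only = first_word.where(is_header)
print(makers_only.tolist())

# ffill then copies each maker down over the NaNs that follow it.
maker = makers_only.ffill()
print(maker.tolist())
```

One caveat with the final filter df.type != df.model: a model whose first word equals its maker (for example "Ford Expedition XL" listed under Ford) would be dropped along with the header rows. Filtering on the [edit] mask instead avoids that.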

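Edit:

If type needs to keep the spaces in the text, use replace to remove everything from ( to the end of the value; the leading \s+ also removes the whitespace before the bracket: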

print(df)
                             col
0                   Honda [edit]
1              Accord (4 models)
2               Civic (4 models)
3               Pilot (3 models)
4                  Toyota [edit]
5               Prius (4 models)
6          Highlander (3 models)
7                    Ford [edit]
8  Ford Expedition XL (2 models)

# Maker name from the '[edit]' rows, forward-filled onto the rows below
df['model'] = df.col.str.extract(r'(.*?)\s*\[edit\]', expand=False).ffill()
# Strip the trailing ' (N models)' part; regex=True is required in recent pandas
df['type'] = df.col.str.replace(r'\s+\(.+$', '', regex=True)
df = df[~df.col.str.contains(r'\[edit\]')].reset_index(drop=True).drop('col', axis=1)
print(df)
    model                type
0   Honda              Accord
1   Honda               Civic
2   Honda               Pilot
3  Toyota               Prius
4  Toyota          Highlander
5    Ford  Ford Expedition XL
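To see what the replace pattern does on its own, here is a minimal sketch on three sample strings (the names Series is illustrative). It passes regex=True explicitly, on the assumption that a recent pandas release is in use, where str.replace treats the pattern as a literal string by default.

```python
import pandas as pd

names = pd.Series(['Accord (4 models)',
                   'Ford Expedition XL (2 models)',
                   'Honda [edit]'])

# \s+  the whitespace before the bracket
# \(   a literal opening parenthesis
# .+$  everything up to the end of the string
cleaned = names.str.replace(r'\s+\(.+$', '', regex=True)
print(cleaned.tolist())
# ['Accord', 'Ford Expedition XL', 'Honda [edit]']
```

Rows without a parenthesis are left untouched, which is why the [edit] headers still have to be filtered out afterwards.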

Source

2017-01-04 06:23:11 jezrael

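Exactly, col is the name of the column. I will try to explain it in more detail, give me a second. – jezrael

Thanks for the info. A few questions: 1) Could you elaborate a bit more on the third statement? Also, I assume col refers to the column named col in the DataFrame df, is that correct? Finally, if we have a model such as Ford Expedition XL, how do I account for the spaces? Thanks! – Karthik

Sorry for the repeated comments; for some reason your reply did not show up when I refreshed :) – Karthik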
