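Need to improve the efficiency of data manipulation with Python and pandas

I have three datasets that need to be merged according to the instructions given.

The first dataset is 'Energy Indicators.xls', a list of indicators of energy supply and renewable electricity production from the United Nations for the year 2013. It should be put into a DataFrame named 'energy'.

Before loading it into the DataFrame, the footer and header information must be excluded from the data file, along with the first two columns, since they are unnecessary.

The remaining column labels should be changed to: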

['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']

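Missing data should be represented as np.NaN values.

The following countries must be renamed:

"Republic of Korea": "South Korea",

"United States of America": "United States",

"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",

"China, Hong Kong Special Administrative Region": "Hong Kong".

Several countries also have digits and/or parentheses in their names. These need to be removed as well.

This part is done as follows: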

import pandas as pd
import numpy as np

# Skip the header and footer rows and keep only the four data columns.
# Current pandas spells these options skipfooter and usecols
# (the older skip_footer / parse_cols names have been removed).
energy = pd.read_excel('Energy Indicators.xls', skiprows=17, skipfooter=38,
                       usecols=[2, 3, 4, 5])
energy.columns = ['Country', 'Energy Supply', 'Energy Supply per Capita',
                  '% Renewable']
energy.set_index('Country', inplace=True)
# Missing values are marked '...' in the sheet
energy.replace('...', np.nan, inplace=True)
# Strip parenthesised text and digits from the country names
energy.index = (energy.index.str.replace(r'\s*\(.*?\)\s*', '', regex=True)
                            .str.replace(r'\d+', '', regex=True))

energy.rename(index={"Republic of Korea": "South Korea",
                     "United States of America": "United States",
                     "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
                     "China, Hong Kong Special Administrative Region": "Hong Kong"},
              inplace=True)
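To make the name-cleaning step easier to check in isolation, here is a minimal sketch run on a handful of made-up names (the sample values below are assumptions for illustration and are not taken from the actual spreadsheet). It applies the same two regular expressions as above, plus a final strip, which is an addition of this sketch.

```python
import pandas as pd

# Hypothetical sample names imitating parentheses and trailing digits
names = pd.Index(["Bolivia (Plurinational State of)", "Japan10",
                  "Switzerland17", "France"])

cleaned = (
    names.str.replace(r"\s*\(.*?\)\s*", "", regex=True)  # drop parenthesised text
         .str.replace(r"\d+", "", regex=True)            # drop digits
         .str.strip()                                     # trim leftover whitespace
)
print(list(cleaned))
```

Because the string methods are vectorised over the whole index, this step should not need a Python-level loop even on a much longer list of names.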

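The next dataset comes from the file 'world_bank.csv', a CSV from the World Bank containing countries' GDP from 1960 to 2015.

The header must be skipped, and the following countries must be renamed:

"Korea, Rep.": "South Korea",

"Iran, Islamic Rep.": "Iran",

"Hong Kong SAR, China": "Hong Kong".

The code for this part is given below.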

GDP = pd.read_csv('world_bank.csv', skiprows=4)
GDP.replace({'Country Name': {'Korea, Rep.': 'South Korea',
                              'Iran, Islamic Rep.': 'Iran',
                              'Hong Kong SAR, China': 'Hong Kong'}}, inplace=True)
GDP.set_index('Country Name', inplace=True)
# Rename the index itself; rename(index=...) would only relabel a row called 'Country Name'
GDP.index.name = 'Country'
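Since only the 2006 to 2015 columns are needed later, one possible way to cut memory and parsing time on a larger file is to ask `read_csv` for just those columns. The sketch below uses a tiny in-memory CSV as a hypothetical stand-in for 'world_bank.csv' (the values are made up, and the real file would still need `skiprows=4`).

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for world_bank.csv; values are invented
raw = io.StringIO(
    'Country Name,1960,2006,2007\n'
    '"Korea, Rep.",1.0,2.0,3.0\n'
    'Japan,4.0,5.0,6.0\n'
)

years = ["2006", "2007"]  # the real task would list '2006' through '2015'

# Parse only the columns that are needed and index by country straight away
gdp = pd.read_csv(raw, usecols=["Country Name"] + years, index_col="Country Name")
gdp = gdp.rename(index={"Korea, Rep.": "South Korea"}).rename_axis("Country")
print(gdp)
```

Whether this helps noticeably depends on how wide the real file is; it is a sketch of the idea, not a measured optimisation.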

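The last dataset is 'scimagojr-3.xlsx', which ranks countries based on their journal contributions. It needs no extra processing, and the code is written as follows: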

ScimEn=pd.read_excel('scimagojr-3.xlsx') 
ScimEn.set_index('Country',inplace=True)

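Only the last 10 years of GDP data (2006-2015) should be used, and only the top 15 countries by Scimagojr 'Rank' (ranks 1 through 15). The three datasets are joined on the intersection of country names.

The index of the resulting DataFrame should be the country name, and the columns should be:

['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015'].

This part is done as follows: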

df=pd.merge(ScimEn.iloc[0:15], 
    pd.merge(energy,GDP[['2006', '2007', '2008', '2009', '2010', '2011' 
    ,'2012','2013','2014','2015']] 
    ,left_index=True, right_index=True),left_index=True 
    ,right_index=True)
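The nested `pd.merge` calls above can also be written as a single index join. This is a sketch on toy frames (the frame names and all numbers are invented stand-ins for ScimEn, energy and GDP), assuming every frame has a unique country index.

```python
import pandas as pd

# Hypothetical toy frames, each indexed by country; all values are invented
scim = pd.DataFrame({"Rank": [1, 2, 3]},
                    index=pd.Index(["China", "United States", "Japan"], name="Country"))
energy = pd.DataFrame({"Energy Supply": [10.0, 20.0]},
                      index=pd.Index(["China", "United States"], name="Country"))
gdp = pd.DataFrame({"2014": [1.0, 2.0, 3.0], "2015": [4.0, 5.0, 6.0]},
                   index=pd.Index(["China", "United States", "Japan"], name="Country"))

# Filter rows and columns first so the join only touches what is needed
top = scim[scim["Rank"] <= 2]
df = top.join([energy, gdp[["2015"]]], how="inner")
print(df)
```

Reducing each frame to the required rows and columns before joining is usually the cheaper order, since the join then works on smaller inputs.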

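My concern is that, although this works, I need to find a more efficient approach for larger datasets in the future. Is there any way to do that?

Thanks.

Source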

2017-10-08 Gökhan Kesler

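This is a well-written question, but it's too long! Skipping to the end, it looks like you're just asking how to do a 3-way merge more efficiently. There may or may not be a better way - sometimes merges on big data are slow and there's not much you can do. But if you want a good chance of receiving a useful answer, you need to cut this question down considerably and focus on the core of the problem (which here is just a three-way merge, as far as I can tell at a glance) – JohnE

Here is how you can do a three-way merge in one line of code: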

df1 = data1.set_index('country') 
df2 = data2.set_index('country') 
df3 = data3.set_index('country') 

new_df = pd.concat([df1, df2, df3], axis=1)
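One caveat worth noting: `pd.concat` with `axis=1` defaults to an outer join, so labels missing from any frame show up as NaN rows. If only the intersection of country names is wanted, as in the question, `join="inner"` can be passed. A small sketch with made-up frames (names and values are illustrative only):

```python
import pandas as pd

# Hypothetical frames that share only the label "Y"
df1 = pd.DataFrame({"a": [1, 2]}, index=["X", "Y"])
df2 = pd.DataFrame({"b": [3, 4]}, index=["Y", "Z"])
df3 = pd.DataFrame({"c": [5, 6]}, index=["Y", "W"])

outer = pd.concat([df1, df2, df3], axis=1)                # union of labels, NaN where missing
inner = pd.concat([df1, df2, df3], axis=1, join="inner")  # only labels present in all three

print(outer)
print(inner)
```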

Source

2017-10-09 03:38:56 thecheech
