2015-06-18 142 views
2

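Splitting a DataFrame column in Python

I want to drop some rows from my pandas DataFrame df. It looks like the output below and has 180 rows and 2745 columns. I want to get rid of the rows that have a curv_typ of PYC_RT and YCIF_RT. I also want to get rid of the geo\time column. I am extracting this data from a CSV file and have realised that curv_typ,maturity,bonds,geo\time and the values below it, such as PYC_RT,Y1,GBAAA,EA, are all in a single column: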

curv_typ,maturity,bonds,geo\time 2015M06D16 2015M06D15 2015M06D11 \ 
0     PYC_RT,Y1,GBAAA,EA  -0.24  -0.24  -0.24 
1    PYC_RT,Y1,GBA_AAA,EA  -0.02  -0.03  -0.10 
2    PYC_RT,Y10,GBAAA,EA   0.94   0.92   0.99 
3    PYC_RT,Y10,GBA_AAA,EA   1.67   1.70   1.60 
4    PYC_RT,Y11,GBAAA,EA   1.03   1.01   1.09 

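I decided to try splitting this column and then dropping the resulting individual columns, but I am stuck at the line df_new = pd.DataFrame(df['curv_typ,maturity,bonds,geo\time'].str.split(',').tolist(), df[1:]).stack() in the code below: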

import os 
import urllib2 
import gzip 
import StringIO 
import pandas as pd 

baseURL = "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=" 
filename = "data/irt_euryld_d.tsv.gz" 
outFilePath = filename.split('/')[1][:-3] 

response = urllib2.urlopen(baseURL + filename) 
compressedFile = StringIO.StringIO() 
compressedFile.write(response.read()) 

compressedFile.seek(0) 

decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb') 

with open(outFilePath, 'w') as outfile: 
    outfile.write(decompressedFile.read()) 

#Now have to deal with tsv file 
import csv 

with open(outFilePath,'rb') as tsvin, open('ECB.csv', 'wb') as csvout: 
    tsvin = csv.reader(tsvin, delimiter='\t') 
    writer = csv.writer(csvout) 
    for data in tsvin:
        writer.writerow(data)


csvout = 'C:\Users\Sidney\ECB.csv' 
#df = pd.DataFrame.from_csv(csvout) 
df = pd.read_csv('C:\Users\Sidney\ECB.csv', delimiter=',', encoding="utf-8-sig") 
print df 
df_new = pd.DataFrame(df['curv_typ,maturity,bonds,geo\time'].str.split(',').tolist(), df[1:]).stack() 

The last line raises KeyError: 'curv_typ,maturity,bonds,geo\time'.

EDIT: Following reptilicus's answer, I used the code below:

#Now have to deal with tsv file 
import csv 

outFilePath = filename.split('/')[1][:-3] #As in the code above, just put here for reference 
csvout = 'C:\Users\Sidney\ECB.tsv' 
outfile = open(csvout, "w") 
with open(outFilePath, "rb") as f: 
    for line in f.read():
        line.replace(",", "\t")
        outfile.write(line)
outfile.close() 

df = pd.DataFrame.from_csv("ECB.tsv", sep="\t", index_col=False) 

I still get exactly the same output as before.

Thank you

+0

It looks like you need to read it in differently. It looks like curv_typ, maturity, bonds and geo\time should each have their own column. Try DataFrame.from_csv() as well – reptilicus

+0

@reptilicus Thank you. However, I get the same error when using 'df = pd.DataFrame.from_csv(csvout)' instead of 'pd.read_csv'. I am lost as to how to deal with this. – user131983

+0

Oh, I think the problem may be in geo\time; it might be messing up that column when you read it in – reptilicus
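The KeyError in the question is consistent with this guess: in an ordinary Python string literal, `\t` is the escape sequence for a tab character, so `'geo\time'` does not spell the text geo\time. A minimal sketch of the difference, using a plain dict as a stand-in for the DataFrame's columns (the dict and its single value are illustrative assumptions, not the real data):

```python
# Column name as it appears in the file: a literal backslash followed by "time".
column_name = r"curv_typ,maturity,bonds,geo\time"

# Hypothetical stand-in for the DataFrame's columns, for illustration only.
columns = {column_name: ["PYC_RT,Y1,GBAAA,EA"]}

# In a normal string literal "\t" is a tab, so this key is a different string.
wrong_key = "curv_typ,maturity,bonds,geo\time"

print("\t" in wrong_key)         # True: the key contains a tab character
print(wrong_key == column_name)  # False: the two strings differ

try:
    columns[wrong_key]
except KeyError:
    print("KeyError, as in the question")

# A raw string or an escaped backslash both spell the real column name.
escaped_key = "curv_typ,maturity,bonds,geo\\time"
print(columns[escaped_key] == columns[column_name])  # True
```

If this is the cause, writing the key as a raw string (`r'...'`) or doubling the backslash should make the lookup find the column.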

Answers

1

Yeah, that CSV format is horrible; there is both comma-delimited and tab-delimited data in there.

I would get rid of the commas first:

tr ',' '\t' < irt_euryld_d.tsv > test.tsv

If you can't use tr, you can do it in Python:

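# Read line by line and write the result of replace(): str.replace returns
# a new string and does not modify the line in place.
# The output is named test.tsv so that it matches the file loaded below.
with open("irt_euryld_d.tsv", "r") as f, open("test.tsv", "w") as outfile:
    for line in f:
        outfile.write(line.replace(",", "\t"))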

Then it loads nicely in pandas:

In [9]: df = DataFrame.from_csv("test.tsv", sep="\t", index_col=False) 

In [10]: df 
Out[10]: 
    curv_typ maturity bonds geo\time 2015M06D17 2015M06D16 \ 
0  PYC_RT  Y1 GBAAA  EA  -0.23  -0.24 
1  PYC_RT  Y1 GBA_AAA  EA  -0.05  -0.02 
2  PYC_RT  Y10 GBAAA  EA   0.94   0.94 
3  PYC_RT  Y10 GBA_AAA  EA   1.66   1.67 

In [11]: df[df["curv_typ"] != "PYC_RT"] 
Out[11]: 
    curv_typ maturity bonds geo\time 2015M06D17 2015M06D16 \ 
60 YCIF_RT  Y1 GBAAA  EA  -0.22  -0.23 
61 YCIF_RT  Y1 GBA_AAA  EA   0.04   0.08 
62 YCIF_RT  Y10 GBAAA  EA   2.00   1.97 
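
The same clean-up can also be sketched without an intermediate file. The example below uses only the standard library: it splits every line on both commas and tabs, skips the rows whose curv_typ is PYC_RT, and leaves out the geo\time column. The sample text is a two-row excerpt built from the values shown in Out[10] and Out[11]; the function name `load_rows` and its parameters are made up for this illustration.

```python
import re

# Small sample in the file's mixed layout: commas inside the first field,
# tabs between the remaining fields.
SAMPLE = (
    "curv_typ,maturity,bonds,geo\\time\t2015M06D17\t2015M06D16\n"
    "PYC_RT,Y1,GBAAA,EA\t-0.23\t-0.24\n"
    "YCIF_RT,Y1,GBAAA,EA\t-0.22\t-0.23\n"
)


def load_rows(text, drop_curve="PYC_RT", drop_column="geo\\time"):
    """Split on commas and tabs, filter out one curve type, drop one column."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [name.strip() for name in re.split(r"[,\t]", lines[0])]
    keep = [i for i, name in enumerate(header) if name != drop_column]
    curve_idx = header.index("curv_typ")
    rows = []
    for line in lines[1:]:
        fields = [field.strip() for field in re.split(r"[,\t]", line)]
        if fields[curve_idx] == drop_curve:
            continue  # skip the unwanted curve type
        rows.append({header[i]: fields[i] for i in keep})
    return rows


rows = load_rows(SAMPLE)
print(rows)
```

In pandas itself, `read_csv` accepts a regular expression as `sep` when `engine="python"` is used, so `sep=r"[,\t]"` should produce the same column split in one step; that variant has not been run against this file here.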
+0

Thanks. But is there a way to replace the commas with tabs from within the script, since I need to automate the whole process? – user131983

+1

Edited; just do it in the Python script – reptilicus

+0

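Thanks. I used the code you provided, changing only the file names, but I still get the output in exactly the same format as before. I have edited the question to show the code I used. Do you know why this might be? – user131983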
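
A likely explanation, judging from the code in the question's edit: `f.read()` returns the whole file as one string, so the `for` loop walks over it one character at a time, and `line.replace(",", "\t")` returns a new string that is never used. Every character is therefore written back unchanged. Below is a minimal sketch of a version that writes the replaced text, with in-memory text standing in for the real files (the helper name `commas_to_tabs` is made up for this illustration):

```python
import io


def commas_to_tabs(src, dst):
    """Copy src to dst line by line, turning every comma into a tab."""
    for line in src:  # iterating a file object yields whole lines
        # str.replace returns a new string, so write that result
        dst.write(line.replace(",", "\t"))


# In-memory stand-ins for the input and output files.
src = io.StringIO(
    "curv_typ,maturity,bonds,geo\\time\t2015M06D16\n"
    "PYC_RT,Y1,GBAAA,EA\t-0.24\n"
)
dst = io.StringIO()
commas_to_tabs(src, dst)
converted = dst.getvalue()
print(repr(converted))

# The pattern from the question, for comparison: the result is discarded.
unchanged = io.StringIO()
for ch in "a,b":           # iterating a string yields single characters
    ch.replace(",", "\t")  # return value thrown away
    unchanged.write(ch)
print(repr(unchanged.getvalue()))  # 'a,b'
```

With real files, the same function can be called with two open file objects, for example the input opened from `outFilePath` and the output opened from `csvout` in write mode, as in the question's code.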