2013-10-18 86 views
1

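pandas read_csv file import error

I am trying to import a csv file in pandas, but it raises an error. When opened in Notepad++, the data looks like this, with the first row holding the column names: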

"End Customer Organization ID,End Customer Organization Name,End Customer Top Parent Organization ID,End Customer Top Parent Organization Name,Reseller Top Parent ID,Reseller Top Parent Name,Business,Rev Sum Division,Rev Sum Category,Product Family,Version,Pricing Level,Summary Pricing Level,Detail Pricing Level,MS Sales Amount,MS Sales Licenses,Fiscal Year,Sales Date" 
"11027676,Baroda Western Uttar Pradesh Gramin Bankgfhgfnjgfnmjmhgmghmghmghmnghnmghnmhgnmghnghngh,4078446,Bank Of Barodadfhhgfjyjtkyukujkyujkuhykluiluilui;iooi';po'fserwefvegwegf,1809012,""Hcl Infosystems Ltd - Partnerdghftrutyhb frhywer5y5tyu6ui7iukluyj,lgjmfgnhfrgweffw"",Server & CALsdgrgrfgtrhytrnhjdgthjtyjkukmhjmghmbhmgfngdfbndfhtgh,SQL Server & CALdfhtrhtrgbhrghrye5y45y45yu56juhydsgfaefwe,SQL CALdhdfthtrutrjurhjethfdehrerfgwerweqeadfawrqwerwegtrhyjuytjhyj,SQL CALdtrye45y3t434tjkabcjkasdhfhasdjkcbaksmjcbfuigkjasbcjkasbkdfhiwh,2005,Openfkvgjesropiguwe90fujklascnioawfy98eyfuiasdbcvjkxsbhg,Open Lklbjdfoigueroigbjvwioergyuiowerhgosdhvgfoisdhyguiserhguisrh,""Open Stddfm,vdnoghioerivnsdflierohgushdfovhsiodghuiohdbvgsjdhgouiwerho"",125.85,1,FY07,12/28/2006" 
"12835756,Uttam Strips Pvt Ltd,12835756,Uttam Strips Pvt Ltd,12565538,Redington C/O Fortis Financial Services Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,9/15/2008" 
"12233135,Bhagwan Singh Tondon,12233135,Bhagwan Singh Tondon,2652941,H B S Systems Pvt Ltd,Server & CAL,SQL Server & CAL,SQL CAL,SQL CAL,Non-specific,Open,Open L&SA,Deferred Open L&SA - New,0,0,FY09,9/15/2008" 
"11602305,Maya Academy Of Advanced Cinematics,9750934,Maya Entertainment Ltd,336146,Embee Software Pvt Ltd,Server & CAL,Windows Server & CAL,Windows Server HPC,Windows Compute Cluster Server,Non-specific,Open,Open V/MYO - Rec,OLV Perpet L&SA Recur-Def,0,0,FY09,9/25/2008" 
"13336009,Remiel Softech Solution Pvt Ltd,13336009,Remiel Softech Solution Pvt Ltd,13335482,Redington C/O Remiel Softech Solutions Pvt Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,12/23/2008" 
"7872800,Science Application International Corporation,2839760,GOVERNMENT OF KARNATAKA,10237455,Cubic Computing P.L,Server & CAL,SQL Server & CAL,SQL Server Standard,SQL Server Standard Edition,Non-specific,Open,Open SA/UA,Deferred Open SA - Renewal,0,0,FY09,1/15/2009" 
"13096361,Pratham Software Pvt Ltd,13096361,Pratham Software Pvt Ltd,10133086,Krap Computer,Information Worker,Office,Office Standard/Basic,Office Standard,2007,Open,Open L,Open Std,7132.44,28,FY09,9/24/2008" 
"12192276,Texmo Precision Castings,12192276,Texmo Precision Castings,4059430,Quadra Systems. - Partner,Server & CAL,Windows Server & CAL,Windows Standard Server,Windows Server Standard,Non-specific,Open,Open L&SA,Deferred Open L&SA - New,0,0,FY09,11/15/2008" 

Note: when I double-click the same file, it opens in Excel as comma-separated values, but without the quotes around each row that Notepad++ shows.

I first used UTF-8 as the encoding, which gave the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 13: invalid start byte 

I then tried encoding='cp1252', and after that latin1:

df=pd.read_csv(filename,encoding='cp1252') 

or 

df=pd.read_csv(filename,encoding='latin1') 

With either encoding no error is raised and the data is imported, but as a single column rather than as separate columns.

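Does this have to do with the quotation marks around each row in the data? I have a similar comma-separated csv file without double quotes around each row, and it imports correctly with both cp1252 and latin1. It does not work with UTF-8, though, even when the file is saved as UTF-8 in Notepad++. In this case UTF-8 fails as usual, and the other two import the file as a single column.

Please advise.

Thanks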

+1

Why is each line wrapped in quotes? Just remove them and the import should work fine. – sloth

+0

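@DominicKexel Yes, that is it. When I open the file in Notepad++, I see quotes around each line. Maybe that is what is getting in the way. So my questions are: i) is there an encoding that can take care of the quotes? ii) If not, how do I remove the quotes from each line? Thanks – Baktaawar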

+0

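@DominicKexel Hi, I would appreciate your help with this. I have tried almost every option I could find in pandas' read_csv function, but none of them solves the quoting problem above: the file is still read as a single column with encoding='latin1' or encoding='cp1252'. Please help! – Baktaawar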

Answers

0

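I'm pretty sure the wrapping quotes make the parser treat every comma inside them as escaped, so each line is read as one quoted field. You need to strip those quotes out. That is relatively simple to do, but because of the unicode issues I'm going to go a bit crazy and suggest that you read the file in, strip the quotes, and then write the result to a new file to use with read_csv (because that simplifies the encoding issues).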
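The effect of the wrapping quotes can be seen with the standard library's csv module. This is a small illustration only: the row is an abbreviated, made-up version of one sample line, and csv.reader stands in for the pandas parser on the assumption that both follow the same quoting convention.

```python
import csv

# Abbreviated illustrative row, wrapped in quotes like the lines in the question.
wrapped = '"12835756,Uttam Strips Pvt Ltd,MBS,0,FY09"'
# The same row with the wrapping quotes removed.
plain = wrapped.strip('"')

wrapped_fields = next(csv.reader([wrapped]))
plain_fields = next(csv.reader([plain]))

print(len(wrapped_fields))  # 1: the whole line is a single quoted field
print(len(plain_fields))    # 5: the commas now act as delimiters
```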

Here is how to read the file, strip the quotes, write a new file, and then read that file with read_csv:

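import pandas as pd

# Text mode with an explicit encoding; cp1252 is the one that decoded the asker's file.
with open(filename, encoding='cp1252') as infile, open(tmpfile, 'w', encoding='cp1252') as outfile:
    for line in infile:
        # Remove the wrapping quotes, then turn doubled inner quotes back into single ones.
        unwrapped = line.strip().strip('"').replace('""', '"')
        outfile.write(unwrapped + '\n')

result = pd.read_csv(tmpfile, encoding='cp1252')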

You'll also want to delete the temporary file once you have read it in.

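The reason I suggest doing it this way is that you avoid dealing with encoding/decoding when passing the data to a StringIO buffer, which both Python and pandas can be picky about.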
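For comparison, under Python 3 the decoding can be fixed once in open(..., encoding=...), so an in-memory variant without a temporary file may also work. The following is a minimal sketch and not the approach recommended above: read_wrapped_csv is a hypothetical helper name, and it assumes that every row is wrapped in exactly one pair of double quotes, with inner quotes doubled as in the sample data.

```python
import csv
import io

import pandas as pd


def read_wrapped_csv(path, encoding="cp1252"):
    """Read a CSV file in which every row is wrapped in one pair of double quotes."""
    with open(path, newline="", encoding=encoding) as infile:
        # First pass: each wrapped row parses as a single quoted field, and the
        # csv module turns doubled quotes ("") back into single quotes.
        # strip() drops any stray whitespace that followed the closing quote.
        unwrapped = [row[0].strip() for row in csv.reader(infile) if row]
    # Second pass: the unwrapped lines are ordinary CSV that pandas can split.
    return pd.read_csv(io.StringIO("\n".join(unwrapped)))
```

It would be called as df = read_wrapped_csv(filename). Parsing twice, rather than stripping characters, keeps commas inside inner quoted fields from being split into extra columns.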