2013-10-23
5

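Equivalent of R's read.table in Python

I'm trying to move some processing work from R to Python. In R, I use read.table() to read really messy CSV files, and it automatically splits the records into the correct format. For example: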

391788,"HP Deskjet 3050 scanner always seems to break","<p>I'm running a Windows 7 64 blah blah blah........ake this work permanently?</p> 

<p>Update: It might have something to do with my computer. It seems to work much better on another computer, windows 7 laptop. Not sure exactly what the deal is, but I'm still looking into it...</p> 
","windows-7 printer hp" 

is correctly split into 4 columns. A single record can span many lines, and there are commas all over the place. In R I just do this:

read.table(infile, header = FALSE, nrows=chunksize, sep=",", stringsAsFactors=FALSE) 

Is there anything in Python that can do this equally well?

Thanks!

Answers

3

You can use the csv module.

from csv import reader 

# newline="" lets the csv module handle line breaks inside quoted fields itself
with open("C:/text.txt", "r", newline="") as f: 
    for row in reader(f, quotechar="\""): 
        print(row) 

['391788', 'HP Deskjet 3050 scanner always seems to break', "<p>I'm running a Windows 7 64 blah blah blah........ake this work permanently?</p>\n\n<p>Update: It might have something to do with my computer. It seems to work much better on another computer, windows 7 laptop. Not sure exactly what the deal is, but I'm still looking into it...</p>\n", 'windows-7 printer hp'] 

The length of the output row is 4.
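If you want to check this behaviour without a file on disk, here is a minimal sketch that parses a shortened stand-in for the record above from an in-memory string. The `SAMPLE` text and the `parse_records` helper are made up for illustration; only `csv.reader` and `io.StringIO` come from the standard library.

```python
import csv
import io

# Hypothetical sample mirroring the record in the question: one record,
# four fields, with a comma and line breaks inside a quoted field.
SAMPLE = (
    '391788,"HP Deskjet 3050 scanner always seems to break",'
    '"<p>First paragraph, with a comma.</p>\n\n<p>Second paragraph.</p>\n",'
    '"windows-7 printer hp"\n'
)


def parse_records(text):
    """Return every record in text as a list of string fields."""
    # newline="" keeps embedded line breaks exactly as they appear in the input.
    buffer = io.StringIO(text, newline="")
    return list(csv.reader(buffer, quotechar='"'))


rows = parse_records(SAMPLE)
print(len(rows), len(rows[0]))
```

The quoted third field keeps its comma and its line breaks, so the four physical lines come back as one record.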

+0

But this only returns strings. It does not infer the type of each column the way read.table does.
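One way around that with the csv module is to convert the fields yourself. The sketch below is only an approximation: `convert` and `read_typed` are hypothetical helpers, and they decide the type value by value, whereas read.table decides per column, so a column with mixed content will end up with mixed Python types.

```python
import csv
import io


def convert(value):
    """Try int, then float; fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def read_typed(text):
    """Parse CSV text and convert each field independently."""
    buffer = io.StringIO(text, newline="")
    return [[convert(field) for field in row] for row in csv.reader(buffer)]


typed = read_typed('391788,"HP Deskjet, broken",3.5\n')
print(typed)
```

Note that `float` also accepts strings such as "nan" and "inf", so a text field containing exactly one of those would be converted as well.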

2

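The pandas module also provides many R-like functions and data structures, including read_csv. The advantage here is that the data is read in as a pandas DataFrame, which is easier to manipulate than standard Python lists or dicts (especially if you're used to R). Here is an example: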

>>> from pandas import read_csv 
>>> ugly = read_csv("ugly.csv",header=None) 
>>> ugly 
     0            1 \ 
0 391788 HP Deskjet 3050 scanner always seems to break 

                2      3 
0 <p>I'm running a Windows 7 64 blah blah blah..... windows-7 printer hp
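The R call in the question also passes nrows=chunksize. read_csv has an `nrows` parameter for reading only the first N records, and a `chunksize` parameter for iterating over the whole file in pieces. Here is a rough sketch of the second option on in-memory data; the `SAMPLE` text and the `read_in_chunks` helper are made up for illustration, and the exact dtypes reported for the text columns may depend on your pandas version.

```python
import io

import pandas as pd

# Hypothetical two-record sample; the first record spans several lines.
SAMPLE = (
    '391788,"HP Deskjet 3050 scanner always seems to break",'
    '"<p>First paragraph, with a comma.</p>\n\n<p>Second paragraph.</p>\n",'
    '"windows-7 printer hp"\n'
    '391789,"Another title","<p>Body</p>","windows-7"\n'
)


def read_in_chunks(source, chunksize):
    """Return a list of DataFrames, each holding at most chunksize records."""
    return list(pd.read_csv(source, header=None, chunksize=chunksize))


chunks = read_in_chunks(io.StringIO(SAMPLE), chunksize=1)
for chunk in chunks:
    # Column 0 holds only digits, so pandas should infer an integer dtype.
    print(chunk.shape, chunk[0].dtype)
```

For a real file you would pass the path instead of the StringIO object and process each chunk inside the loop rather than collecting them all in a list.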