2017-05-05 16 views
3

Reading a text file with variable spacing in Python

I have the following data in a text file, which I want to load into Python:

    pclass  survived                                             name
0        1         1                    Allen, Miss. Elisabeth Walton
1        1         1                   Allison, Master. Hudson Trevor
2        1         0                     Allison, Miss. Helen Loraine
3        1         0             Allison, Mr. Hudson Joshua Creighton
4        1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5        1         1                              Anderson, Mr. Harry
6        1         1                Andrews, Miss. Kornelia Theodosia
7        1         0                           Andrews, Mr. Thomas Jr
8        1         1    Appleton, Mrs. Edward Dale (Charlotte Lamson)
9        1         0                          Artagaveytia, Mr. Ramon
10       1         0                           Astor, Col. John Jacob

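I can't parse it, because the amount of whitespace between fields is not constant and because the last field (name) itself contains spaces. I tried the following: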

pd.read_csv("test.csv", sep=r"\s+", header=0, index_col=0)

But it raises an error:

CParserError: Error tokenizing data. C error: Expected 7 fields in line 5, saw 8 
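
The numbers in the error message can be reproduced without pandas. The sketch below only approximates what sep="\s+" does (it uses re.split rather than the pandas tokenizer), the whitespace in the sample lines is shortened, and the variable names are made up for illustration. The first data rows split into 7 fields, while line 5 of the file splits into 8 because that name has one more word:

```python
import re

# Header plus the first four data rows from the question (whitespace shortened).
lines = [
    "    pclass  survived  name",
    "0  1  1  Allen, Miss. Elisabeth Walton",
    "1  1  1  Allison, Master. Hudson Trevor",
    "2  1  0  Allison, Miss. Helen Loraine",
    "3  1  0  Allison, Mr. Hudson Joshua Creighton",
]

# Split every line on runs of whitespace and count the resulting fields.
field_counts = [len(re.split(r"\s+", line.strip())) for line in lines]
print(field_counts)  # [3, 7, 7, 7, 8]
```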

Answers

2

You can use pandas.read_fwf (fixed-width format) to do this:

Code:

df = pd.read_fwf(StringIO(data), header=0, index_col=0)

Test code:

from io import StringIO
import pandas as pd

# The header is the first line of the string, which matches header=0.
data = """\
    pclass  survived                                             name
0        1         1                    Allen, Miss. Elisabeth Walton
1        1         1                   Allison, Master. Hudson Trevor
2        1         0                     Allison, Miss. Helen Loraine
3        1         0             Allison, Mr. Hudson Joshua Creighton
4        1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5        1         1                              Anderson, Mr. Harry
6        1         1                Andrews, Miss. Kornelia Theodosia
7        1         0                           Andrews, Mr. Thomas Jr
8        1         1    Appleton, Mrs. Edward Dale (Charlotte Lamson)
9        1         0                          Artagaveytia, Mr. Ramon
10       1         0                           Astor, Col. John Jacob"""

# Column boundaries are inferred from the whitespace shared by all rows.
df = pd.read_fwf(StringIO(data), header=0, index_col=0)
print(df)

Result:

    pclass  survived                                             name
0        1         1                    Allen, Miss. Elisabeth Walton
1        1         1                   Allison, Master. Hudson Trevor
2        1         0                     Allison, Miss. Helen Loraine
3        1         0             Allison, Mr. Hudson Joshua Creighton
4        1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5        1         1                              Anderson, Mr. Harry
6        1         1                Andrews, Miss. Kornelia Theodosia
7        1         0                           Andrews, Mr. Thomas Jr
8        1         1    Appleton, Mrs. Edward Dale (Charlotte Lamson)
9        1         0                          Artagaveytia, Mr. Ramon
10       1         0                           Astor, Col. John Jacob
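
If the inferred column boundaries ever come out wrong, read_fwf also accepts explicit boundaries through its colspecs parameter. A minimal sketch, assuming the layout built below: the sample text is generated with format specifiers so the character positions are known exactly, and the widths (2, 6, 8, 47) and the resulting colspecs are assumptions about this sample, not values taken from the original file.

```python
from io import StringIO

import pandas as pd

# Three rows from the question: (index, pclass, survived, name).
rows = [
    (0, 1, 1, "Allen, Miss. Elisabeth Walton"),
    (4, 1, 0, "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)"),
    (10, 1, 0, "Astor, Col. John Jacob"),
]

# Build right-aligned, fixed-width text so every column position is known.
header = f"{'':<2}  {'pclass':>6}  {'survived':>8}  {'name':>47}"
body = [f"{i:<2}  {p:>6}  {s:>8}  {name:>47}" for i, p, s, name in rows]
data = "\n".join([header] + body)

# Half-open [start, end) character ranges, one per column (assumed layout).
colspecs = [(0, 2), (4, 10), (12, 20), (22, 69)]

df = pd.read_fwf(StringIO(data), colspecs=colspecs, header=0, index_col=0)
print(df)
```

Explicit colspecs are more typing, but they do not depend on how the first rows of the file happen to be padded.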
3

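'\s+' treats any run of one or more whitespace characters as a separator, so it also splits your last column. Use a regular expression that requires two or more whitespace characters instead.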

pd.read_csv("test.csv", sep=r"\s{2,}", header=0, index_col=0, engine='python')
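
The difference between the two patterns is easy to see on a single row. This is only an illustration with Python's re module, not the pandas parser itself, and the variable names are made up:

```python
import re

line = "0        1         1                    Allen, Miss. Elisabeth Walton"

# One or more whitespace characters: the name is broken into words.
one_or_more = re.split(r"\s+", line)
print(one_or_more)
# ['0', '1', '1', 'Allen,', 'Miss.', 'Elisabeth', 'Walton']

# Two or more whitespace characters: single spaces inside the name survive.
two_or_more = re.split(r"\s{2,}", line)
print(two_or_more)
# ['0', '1', '1', 'Allen, Miss. Elisabeth Walton']
```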

Complete working example

from io import StringIO
import pandas as pd

txt = """    pclass  survived                                             name
0        1         1                    Allen, Miss. Elisabeth Walton
1        1         1                   Allison, Master. Hudson Trevor
2        1         0                     Allison, Miss. Helen Loraine
3        1         0             Allison, Mr. Hudson Joshua Creighton
4        1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5        1         1                              Anderson, Mr. Harry
6        1         1                Andrews, Miss. Kornelia Theodosia
7        1         0                           Andrews, Mr. Thomas Jr
8        1         1    Appleton, Mrs. Edward Dale (Charlotte Lamson)
9        1         0                          Artagaveytia, Mr. Ramon
10       1         0                           Astor, Col. John Jacob
"""

# A regex separator requires the Python parsing engine.
df = pd.read_csv(StringIO(txt), sep=r"\s{2,}", header=0, index_col=0, engine='python')
print(df)

    pclass  survived                                             name
0        1         1                    Allen, Miss. Elisabeth Walton
1        1         1                   Allison, Master. Hudson Trevor
2        1         0                     Allison, Miss. Helen Loraine
3        1         0             Allison, Mr. Hudson Joshua Creighton
4        1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5        1         1                              Anderson, Mr. Harry
6        1         1                Andrews, Miss. Kornelia Theodosia
7        1         0                           Andrews, Mr. Thomas Jr
8        1         1    Appleton, Mrs. Edward Dale (Charlotte Lamson)
9        1         0                          Artagaveytia, Mr. Ramon
10       1         0                           Astor, Col. John Jacob
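
One caveat: the two-or-more pattern depends on every field being separated by at least two whitespace characters. If a row has only a single space before the name, a plain-Python fallback is to split on the first three whitespace runs only and keep the rest of the line as the name. This is a sketch of an alternative, not something from the answers above; the sample text is shortened, the single space in the second data row is deliberate, and the temporary column name "idx" is made up.

```python
import pandas as pd

txt = """\
    pclass  survived  name
0  1  1  Allen, Miss. Elisabeth Walton
4  1  0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
10  1  0  Astor, Col. John Jacob
"""

lines = txt.splitlines()
columns = lines[0].split()  # ['pclass', 'survived', 'name']

records = []
for line in lines[1:]:
    # At most three splits: index, pclass, survived, then the whole name.
    idx, pclass, survived, name = line.strip().split(maxsplit=3)
    records.append((int(idx), int(pclass), int(survived), name))

df = pd.DataFrame(records, columns=["idx"] + columns).set_index("idx")
df.index.name = None  # match the unnamed index in the original output
print(df)
```

This only works because the name is the last column; a free-text column in the middle of the line would need fixed-width parsing instead.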