2016-11-09

Pandas: reading a text file generated by dataframe.to_string

I have a text file containing this table:

    Ion TheoWavelength   Blended_Set 
Line_Label                                    
H1_4340A Hgamma_5_2  4340.471    None 
He1_4472A  HeI_4471  4471.479    None 
He2_4686A HeII_4686  4685.710    None 
Ar4_4711A  [ArIV]  4711.000    None 
Ar4_4740A  [ArIV]  4740.000    None 
H1_4861A  Hbeta_4_2  4862.683    None 

The table was generated from a pandas DataFrame by calling dataframe.to_string and saving the resulting unicode variable.

I would like to use a pandas function to create a DataFrame from this file:

import pandas as pd 
df = pd.read_csv('my_table_file.txt', delim_whitespace = True, header = 0, index_col = 0) 

However, I get this error:

Traceback (most recent call last): 
    File 
    df = pd.read_csv(table, delim_whitespace = True, header = 0, index_col = 0) 
    File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f 
    return _read(filepath_or_buffer, kwds) 
    File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 325, in _read 
    return parser.read() 
    File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 815, in read 
    ret = self._engine.read(nrows) 
    File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1314, in read 
    data = self._reader.read(nrows) 
    File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748) 
    File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003) 
    File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731) 
    File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602) 
    File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325) 
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4 

I suspect this happens because the name of the index column sits on its own row.

Is there any way to avoid this issue, or to export the table without this label?

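P.S. I tried using dataframe.to_csv for the table, but as far as I know it does not let you control the formatting of individual columns when they have different dtypes.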
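As a minimal sketch of the "export without the label" option: clearing the index name before calling to_string keeps the header on a single line, and read_csv can then take the first field of each row as the index because the header has one name fewer than the rows have fields. The sample frame below is a shortened stand-in for the real table, and the sketch assumes no value contains whitespace.

```python
from io import StringIO

import pandas as pd

# Shortened stand-in for the table in the question.
df = pd.DataFrame(
    {
        "Ion": ["Hgamma_5_2", "HeI_4471", "[ArIV]"],
        "TheoWavelength": [4340.471, 4471.479, 4711.0],
        "Blended_Set": [None, None, None],
    },
    index=pd.Index(["H1_4340A", "He1_4472A", "Ar4_4711A"], name="Line_Label"),
)

# Drop the index name so to_string does not emit the extra "Line_Label" row.
out = df.copy()
out.index.name = None
text = out.to_string()

# The header now has 3 names and every row has 4 fields, so pandas
# treats the first field as the index.
restored = pd.read_csv(StringIO(text), sep=r"\s+")
restored.index.name = "Line_Label"  # put the label back after reading
print(restored)
```

The index name is lost in the file itself, so it has to be restored by hand after reading, as in the last step.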

Answers

1

I would use the HDF5 format in this case - it will take care of your index.

Besides that, it is faster than CSV, lets you select data conditionally (as you would with an SQL database), supports compression, etc.

Demo:

In [2]: df 
Out[2]: 
        Ion TheoWavelength Blended_Set 
Line_Label 
H1_4340A Hgamma_5_2  4340.471  None 
He1_4472A  HeI_4471  4471.479  None 
He2_4686A HeII_4686  4685.710  None 
Ar4_4711A  [ArIV]  4711.000  None 
Ar4_4740A  [ArIV]  4740.000  None 
H1_4861A  Hbeta_4_2  4862.683  None 

In [3]: df.to_hdf('d:/temp/myhdf.h5', 'df', format='t', data_columns=True) 

In [4]: x = pd.read_hdf('d:/temp/myhdf.h5', 'df') 

In [5]: x 
Out[5]: 
        Ion TheoWavelength Blended_Set 
Line_Label 
H1_4340A Hgamma_5_2  4340.471  None 
He1_4472A  HeI_4471  4471.479  None 
He2_4686A HeII_4686  4685.710  None 
Ar4_4711A  [ArIV]  4711.000  None 
Ar4_4740A  [ArIV]  4740.000  None 
H1_4861A  Hbeta_4_2  4862.683  None 

You can even query your HDF5 file like an SQL DB:

In [20]: x2 = pd.read_hdf('d:/temp/myhdf.h5', 'df', where="TheoWavelength > 4500 and Ion == '[ArIV]'") 

In [21]: x2 
Out[21]: 
       Ion TheoWavelength Blended_Set 
Line_Label 
Ar4_4711A [ArIV]   4711.0  None 
Ar4_4740A [ArIV]   4740.0  None 

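Thank you very much for your reply. The SQL-like functionality is really interesting and it works nicely... However, in this case it has to be a text file. I managed to make it work by adding a comment option to "read_csv" so that any line starting with "L" is treated as a comment (which is not an issue in this data). I tried to use ignore_rows, but it does not work if you set the index column... which is weird... – Delosari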
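The workaround described in this comment might look like the sketch below. It assumes the option meant is read_csv's comment parameter set to "L"; the exact call used by the commenter is not shown. The index_col argument is left out here, so pandas picks the first field as the index on its own because the header has one name fewer than the data rows.

```python
from io import StringIO

import pandas as pd

# Sample text copied from the question (whitespace normalised).
text = """\
                  Ion  TheoWavelength Blended_Set
Line_Label
H1_4340A   Hgamma_5_2        4340.471        None
He1_4472A    HeI_4471        4471.479        None
He2_4686A   HeII_4686        4685.710        None
Ar4_4711A      [ArIV]        4711.000        None
Ar4_4740A      [ArIV]        4740.000        None
H1_4861A    Hbeta_4_2        4862.683        None
"""

# comment="L" makes the parser drop everything from a capital "L" onwards,
# so the lone "Line_Label" row becomes an empty line and is skipped.
# Assumption: no other capital "L" occurs anywhere in the data.
df = pd.read_csv(StringIO(text), sep=r"\s+", header=0, comment="L")

# The header has three names but each row has four fields, so pandas
# uses the first field as the index; restore its name by hand.
df.index.name = "Line_Label"
print(df)
```

This only works because "L" appears nowhere else in the table; a capital "L" inside any value would truncate that row.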

0

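Consider Python's built-in StringIO, a class in the io module in Python 3 (in Python 2, StringIO is its own module), which reads text from a scalar string. Call it inside pandas' read_table(), then take the headers from the first lines of the string content: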

from io import StringIO 
import pandas as pd 

data = ''' 
        Ion TheoWavelength   Blended_Set 
Line_Label 
H1_4340A Hgamma_5_2  4340.471    None 
He1_4472A  HeI_4471  4471.479    None 
He2_4686A HeII_4686  4685.710    None 
Ar4_4711A  [ArIV]  4711.000    None 
Ar4_4740A  [ArIV]  4740.000    None 
H1_4861A  Hbeta_4_2  4862.683    None 
''' 

df = pd.read_table(StringIO(data), sep=r"\s+", header=None, skiprows=3, index_col=0) 

headers = [item for line in data.split('\n')[0:3] for item in line.split()][0:4] 
df.columns = headers[0:3] 
df.index.name = headers[3] 

If you need to read from a file, use read_table to read the data from the file, then read the text file again to extract the headers:

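# skiprows=3 assumes the file starts with a blank line, like the string above;
# use skiprows=2 if the column header is on the first line of the file
df = pd.read_table("DataframeToString.txt", sep=r"\s+", header=None, skiprows=3, index_col=0) 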

data = [] 
with open("DataframeToString.txt", 'r') as f: 
    data.append(f.read().split()) 

df.index.name = data[0][3] 
df.columns = data[0][0:3] 

print(df) 
#     Ion TheoWavelength Blended_Set 
# Line_Label           
# H1_4340A Hgamma_5_2  4340.471  None 
# He1_4472A  HeI_4471  4471.479  None 
# He2_4686A HeII_4686  4685.710  None 
# Ar4_4711A  [ArIV]  4711.000  None 
# Ar4_4740A  [ArIV]  4740.000  None 
# H1_4861A  Hbeta_4_2  4862.683  None 

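Thank you very much for your answer, but one question: if the "data" is in a text file, do you need to open the file twice (for example with readlines), or can it be done directly? – Delosari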
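One way to open the file only once, sketched under assumptions: read all the lines into memory, take the column names and the index name from the first two non-blank lines, and hand the remaining lines to read_csv through StringIO. The helper name read_to_string_table and the temporary file are hypothetical, and the sketch assumes that no value contains whitespace and that the index name sits alone on the second non-blank line.

```python
import os
import tempfile
from io import StringIO

import pandas as pd

SAMPLE = """\
                  Ion  TheoWavelength Blended_Set
Line_Label
H1_4340A   Hgamma_5_2        4340.471        None
He1_4472A    HeI_4471        4471.479        None
He2_4686A   HeII_4686        4685.710        None
Ar4_4711A      [ArIV]        4711.000        None
Ar4_4740A      [ArIV]        4740.000        None
H1_4861A    Hbeta_4_2        4862.683        None
"""


def read_to_string_table(path):
    """Parse a table written by DataFrame.to_string, opening the file once."""
    with open(path, "r") as f:
        # Keep only non-blank lines so a leading empty line does not matter.
        lines = [line for line in f.read().splitlines() if line.strip()]
    columns = lines[0].split()      # e.g. Ion, TheoWavelength, Blended_Set
    index_name = lines[1].strip()   # e.g. Line_Label
    body = "\n".join(lines[2:])     # the data rows only
    return pd.read_csv(
        StringIO(body),
        sep=r"\s+",
        header=None,
        names=[index_name] + columns,
        index_col=0,
    )


# Write the sample to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_table_file.txt")
    with open(path, "w") as f:
        f.write(SAMPLE)
    df = read_to_string_table(path)

print(df)
```

Passing the names directly to read_csv avoids setting df.columns and df.index.name afterwards.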