Pandas: reading a table generated by dataframe.to_string

I have a text file containing this table:

                   Ion  TheoWavelength  Blended_Set
Line_Label
H1_4340A    Hgamma_5_2        4340.471         None
He1_4472A     HeI_4471        4471.479         None
He2_4686A    HeII_4686        4685.710         None
Ar4_4711A       [ArIV]        4711.000         None
Ar4_4740A       [ArIV]        4740.000         None
H1_4861A     Hbeta_4_2        4862.683         None

The table was generated from a pandas DataFrame by calling dataframe.to_string and saving the resulting unicode variable.

I would like to use a pandas function to create a DataFrame from this file:

import pandas as pd 
df = pd.read_csv('my_table_file.txt', delim_whitespace = True, header = 0, index_col = 0)

But I get this error:

Traceback (most recent call last): 
    File 
    df = pd.read_csv(table, delim_whitespace = True, header = 0, index_col = 0) 
    File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f 
    return _read(filepath_or_buffer, kwds) 
    File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 325, in _read 
    return parser.read() 
    File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 815, in read 
    ret = self._engine.read(nrows) 
    File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1314, in read 
    data = self._reader.read(nrows) 
    File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748) 
    File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003) 
    File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731) 
    File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602) 
    File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325) 
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

I suspect this happens because the index name sits on a line of its own.
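
The diagnosis above can be checked by counting the whitespace-separated fields on each line of `to_string()` output. This is only a minimal sketch on a small made-up frame (the two rows and the variable names are illustrative, not the original data file):

```python
import pandas as pd

# Small illustrative frame with a named index, mimicking the table above
df = pd.DataFrame(
    {"Ion": ["Hgamma_5_2", "[ArIV]"], "TheoWavelength": [4340.471, 4711.0]},
    index=pd.Index(["H1_4340A", "Ar4_4711A"], name="Line_Label"),
)

lines = df.to_string().splitlines()

# Number of whitespace-separated fields on each line of the text output
field_counts = [len(line.split()) for line in lines]
print(field_counts)  # [2, 1, 3, 3]
```

The column names, the index name, and the data rows each have a different field count, which is consistent with the tokenizer error: the index name occupies a line by itself, so the parser meets a short row before the wider data rows.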

Is there any way to avoid this problem, or to export the table without this label?

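P.S. I tried using dataframe.to_csv for the table but, as far as I know, it does not let you control the formatting of individual columns when they have different dtypes.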
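
One possible way to export without the separate index-name line, sketched here under the assumption that no cell or column name contains whitespace, is to move the index into an ordinary column before calling `to_string`. The frame and the `formatters` entry below are illustrative; `formatters` is the `to_string` parameter that gives per-column control over formatting:

```python
from io import StringIO

import pandas as pd

# Small illustrative frame with a named index
df = pd.DataFrame(
    {
        "Ion": ["Hgamma_5_2", "HeI_4471", "[ArIV]"],
        "TheoWavelength": [4340.471, 4471.479, 4711.0],
    },
    index=pd.Index(["H1_4340A", "He1_4472A", "Ar4_4711A"], name="Line_Label"),
)

# reset_index() turns the index into a regular column, so its name ends up
# on the single header line; index=False drops the new integer index.
text = df.reset_index().to_string(
    index=False,
    formatters={"TheoWavelength": "{:.3f}".format},  # per-column format
)

# The text now has one header line and parses as whitespace-separated data
restored = pd.read_csv(StringIO(text), sep=r"\s+", index_col=0)
print(restored)
```

With this layout every line has the same number of fields, so the plain `read_csv` call from the question should no longer trip over a short second line.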

Source

2016-11-09 Delosari

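I would use the HDF5 format in this case; it takes care of your index.

Besides that, it is faster than CSV, lets you select data conditionally (as with a SQL database), supports compression, etc.

Demo: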

In [2]: df 
Out[2]: 
                   Ion  TheoWavelength  Blended_Set
Line_Label
H1_4340A    Hgamma_5_2        4340.471         None
He1_4472A     HeI_4471        4471.479         None
He2_4686A    HeII_4686        4685.710         None
Ar4_4711A       [ArIV]        4711.000         None
Ar4_4740A       [ArIV]        4740.000         None
H1_4861A     Hbeta_4_2        4862.683         None

In [3]: df.to_hdf('d:/temp/myhdf.h5', 'df', format='t', data_columns=True) 

In [4]: x = pd.read_hdf('d:/temp/myhdf.h5', 'df') 

In [5]: x 
Out[5]: 
                   Ion  TheoWavelength  Blended_Set
Line_Label
H1_4340A    Hgamma_5_2        4340.471         None
He1_4472A     HeI_4471        4471.479         None
He2_4686A    HeII_4686        4685.710         None
Ar4_4711A       [ArIV]        4711.000         None
Ar4_4740A       [ArIV]        4740.000         None
H1_4861A     Hbeta_4_2        4862.683         None

You can even query your HDF5 file like a SQL DB:

In [20]: x2 = pd.read_hdf('d:/temp/myhdf.h5', 'df', where="TheoWavelength > 4500 and Ion == '[ArIV]'") 

In [21]: x2 
Out[21]: 
               Ion  TheoWavelength  Blended_Set
Line_Label
Ar4_4711A   [ArIV]          4711.0         None
Ar4_4740A   [ArIV]          4740.0         None

Source

2016-11-09 22:42:48 MaxU

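Thank you very much for your reply. The SQL-like feature is very interesting and works nicely... However, in this case it has to be a text file. I managed to make it work by adding a comment setting in "read_csv" so that any line starting with "L" is treated as a comment (which is not an issue with this data). I tried to use ignore_rows, but it does not work if you set the index column... which is odd... – Delosari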
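
The comment-based workaround described above might look roughly like the sketch below. The exact call used in the comment is not shown in the thread, so `comment="L"` and the small frame are assumptions made for illustration. The sketch relies on pandas using the first column as the index when the header has one field fewer than the data rows:

```python
from io import StringIO

import pandas as pd

# Small illustrative frame with a named index
df = pd.DataFrame(
    {"Ion": ["Hgamma_5_2", "[ArIV]"], "TheoWavelength": [4340.471, 4711.0]},
    index=pd.Index(["H1_4340A", "Ar4_4711A"], name="Line_Label"),
)
text = df.to_string()

# comment="L" makes the parser ignore the "Line_Label" line. It also cuts off
# anything after an uppercase "L" elsewhere, so this only suits data without one.
restored = pd.read_csv(StringIO(text), sep=r"\s+", header=0, comment="L")

# The index name was dropped together with the commented line; put it back.
restored = restored.rename_axis("Line_Label")
print(restored)
```

This is fragile by design: a single label or value containing "L" would be silently truncated, so it is best treated as a quick fix for this particular table.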

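Consider Python's built-in StringIO (part of the io module in Python 3; its own StringIO module in Python 2) to read text from a scalar string. Pass it to pandas' read_table() and then parse the first lines of the string content to get the headers: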

from io import StringIO 
import pandas as pd 

data = ''' 
        Ion TheoWavelength   Blended_Set 
Line_Label 
H1_4340A Hgamma_5_2  4340.471    None 
He1_4472A  HeI_4471  4471.479    None 
He2_4686A HeII_4686  4685.710    None 
Ar4_4711A  [ArIV]  4711.000    None 
Ar4_4740A  [ArIV]  4740.000    None 
H1_4861A  Hbeta_4_2  4862.683    None 
''' 

df = pd.read_table(StringIO(data), sep=r"\s+", header=None, skiprows=3, index_col=0)

headers = [item for line in data.split('\n')[0:3] for item in line.split()][0:4] 
df.columns = headers[0:3] 
df.index.name = headers[3]

If you need to read from a file, use read_table to read the file, then read the text file again to extract the headers:

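path = "DataframeToString.txt"  # use the same file for both reads

# skiprows=3 matches the StringIO example, where the text starts with an
# empty line; use skiprows=2 if the file starts with the column names
df = pd.read_table(path, sep=r"\s+", header=None, skiprows=3, index_col=0)

# Read the file a second time to pull the names out of the header lines
with open(path, 'r') as f:
    tokens = f.read().split()

df.index.name = tokens[3]
df.columns = tokens[0:3]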

print(df) 
#                    Ion  TheoWavelength  Blended_Set
# Line_Label
# H1_4340A    Hgamma_5_2        4340.471         None
# He1_4472A     HeI_4471        4471.479         None
# He2_4686A    HeII_4686        4685.710         None
# Ar4_4711A       [ArIV]        4711.000         None
# Ar4_4740A       [ArIV]        4740.000         None
# H1_4861A     Hbeta_4_2        4862.683         None

Source

2016-11-10 01:46:15 Parfait

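Thank you very much for your reply, but one question: if the "data" is in a text file, do you need to open the file twice (e.g. with readlines), or can it be done directly? – Delosari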
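
On the question of opening the file only once, one option is to read all lines into memory, take the names from the first two non-empty lines, and hand the rest to the parser through StringIO. The sketch below is not from the thread: `read_to_string_table` is a hypothetical helper name, and it assumes a named index, single-level columns, and no whitespace inside names or values:

```python
import os
import tempfile
from io import StringIO

import pandas as pd


def read_to_string_table(path):
    """Hypothetical helper: parse DataFrame.to_string() output in one file read."""
    with open(path) as f:
        # Drop blank lines so a leading empty line does not shift the header
        lines = [line for line in f if line.strip()]

    columns = lines[0].split()     # first line: column names
    index_name = lines[1].strip()  # second line: the index name on its own
    body = "".join(lines[2:])      # remaining lines: the data rows

    return pd.read_csv(
        StringIO(body),
        sep=r"\s+",
        header=None,
        names=[index_name] + columns,
        index_col=0,
    )


# Usage on a throwaway file holding a small illustrative table
df = pd.DataFrame(
    {"Ion": ["Hgamma_5_2", "[ArIV]"], "TheoWavelength": [4340.471, 4711.0]},
    index=pd.Index(["H1_4340A", "Ar4_4711A"], name="Line_Label"),
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.txt")
    with open(path, "w") as f:
        f.write(df.to_string())
    restored = read_to_string_table(path)

print(restored)
```

Passing the names explicitly avoids both the second file read and the dependence on a fixed `skiprows` value.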
