2017-05-31 19 views
2

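Parse a text file in Python and convert it to a pandas DataFrame

I want to parse a text file and convert it into a pandas DataFrame. The file (including the blank lines):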

HEADING1 
value 1 

HEADING2 
value 2 

HEADING1
value 11 

HEADING2 
value 12 

should be converted into this DataFrame:

HEADING1, HEADING2 
value 1, value 2 
value 11, value 12 

I tried the code below, but I am not sure whether converters can be used for this.

df = pd.read_table(textfile, header=None, skip_blank_lines=True, delimiter='\n', 
        # converters= 'what should I use?', 
        names= 'HEADING1, HEADING2'.split()) 
+0

You could try importing the whole thing as a Series and then finding the rows that belong to heading one with series[series == "header1"].index.tolist() + 1 – Adam

+0

@Adam I'm not sure I understand your suggestion (other than in principle). What would the code look like? – Andreuccio
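One possible reading of the suggestion above, as a minimal sketch rather than the commenter's actual code: load the non-blank lines into a Series, locate each heading, and take the line that follows it. The names txt, lines and columns are hypothetical, and the use of np.flatnonzero is an assumption (adding 1 directly to the result of index.tolist() would fail, because that is a plain list).

```python
import numpy as np
import pandas as pd

# Sample input, assumed to match the file shown in the question
txt = "HEADING1\nvalue 1\n\nHEADING2\nvalue 2\n\nHEADING1\nvalue 11\n\nHEADING2\nvalue 12"

# Keep only the non-blank lines, one per Series element
lines = pd.Series([ln.strip() for ln in txt.splitlines() if ln.strip()])

columns = {}
for heading in ["HEADING1", "HEADING2"]:
    # Integer positions of the rows that hold this heading
    heading_pos = np.flatnonzero((lines == heading).to_numpy())
    # The value sits on the line right after each heading
    columns[heading] = lines.iloc[heading_pos + 1].tolist()

df = pd.DataFrame(columns)
print(df)
```

This assumes every heading is followed by exactly one value line and that each heading occurs the same number of times.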

Answers

5

You can parse this by splitting on '\n\n'

# split file by `'\n\n'` to get rows 
# split again by `'\n'` to get columns 
# `zip` to get convenient lists of headers and values 
cols, vals = zip(
    *[line.split('\n') for line in open(textfile).read().split('\n\n')] 
) 

# construct a `pd.Series` 
# note: your index contained in the `cols` list will not be unique 
s = pd.Series(vals, cols) 

# we'll need to enumerate the duplicated index values so that we can unstack 
# we do this by creating a `pd.MultiIndex` with `cumcount` then the header values 
s.index = [s.groupby(level=0).cumcount(), s.index] 

# finally, `unstack` 
s.unstack() 

   HEADING1  HEADING2
0   value 1   value 2
1  value 11  value 12
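If the real file has trailing spaces, Windows line endings, or blank lines that contain whitespace, a plain split on '\n\n' may not find the block boundaries. The following is a sketch under that assumption; parse_blocks and raw are hypothetical names, and the regular expression is one possible choice, not part of the answer above.

```python
import re
import pandas as pd

def parse_blocks(text):
    """Return (heading, value) pairs from blank-line-separated two-line blocks."""
    # Split on any run of blank (or whitespace-only) lines
    blocks = re.split(r'\n\s*\n', text.strip())
    pairs = []
    for block in blocks:
        # Each block is expected to hold exactly two lines: heading, then value
        heading, value = [part.strip() for part in block.splitlines()]
        pairs.append((heading, value))
    return pairs

# Messy sample input: trailing spaces, a whitespace-only line, \r\n, extra blank line
raw = "HEADING1 \nvalue 1 \n \nHEADING2\r\nvalue 2\n\nHEADING1\nvalue 11\n\n\nHEADING2\nvalue 12\n"

cols, vals = zip(*parse_blocks(raw))
s = pd.Series(vals, index=cols)
# Same idea as above: number the repeats of each heading, then unstack
s.index = pd.MultiIndex.from_arrays([s.groupby(level=0).cumcount().to_numpy(), s.index])
df = s.unstack()
print(df)
```

A block with more or fewer than two lines raises a ValueError at the unpacking step, which is probably preferable to silently misaligning values.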

Breaking down the splitting, step by step

list comprehension

[line.split('\n') for line in StringIO(txt).read().split('\n\n')] 

[['HEADING1', 'value 1'], 
['HEADING2', 'value 2'], 
['HEADING1', 'value 11'], 
['HEADING2', 'value 12']] 

zip

list(zip(*[line.split('\n') for line in StringIO(txt).read().split('\n\n')])) 

[('HEADING1', 'HEADING2', 'HEADING1', 'HEADING2'), 
('value 1', 'value 2', 'value 11', 'value 12')] 

Set cols and vals

cols, vals = zip(*[line.split('\n') for line in StringIO(txt).read().split('\n\n')]) 

print(cols) 
print() 
print(vals) 

('HEADING1', 'HEADING2', 'HEADING1', 'HEADING2') 

('value 1', 'value 2', 'value 11', 'value 12') 

Make a Series

s = pd.Series(vals, cols) 
s 

HEADING1     value 1
HEADING2     value 2
HEADING1    value 11
HEADING2    value 12
dtype: object

Enumerate the index values

s.index = [s.groupby(level=0).cumcount(), s.index] 
s 

0  HEADING1     value 1
   HEADING2     value 2
1  HEADING1    value 11
   HEADING2    value 12
dtype: object

unstack

s.unstack() 

   HEADING1  HEADING2
0   value 1   value 2
1  value 11  value 12
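The enumerate-then-unstack steps can also be written with a long-format DataFrame and pivot, which some readers may find easier to follow. This is a sketch of an equivalent reshape, not the answer's own code; the names long_df, wide, row, heading and value are hypothetical.

```python
import pandas as pd

cols = ('HEADING1', 'HEADING2', 'HEADING1', 'HEADING2')
vals = ('value 1', 'value 2', 'value 11', 'value 12')

# One row per (heading, value) pair, in file order
long_df = pd.DataFrame({'heading': cols, 'value': vals})

# Number the repeats of each heading: 0 for the first record, 1 for the second, ...
long_df['row'] = long_df.groupby('heading').cumcount()

# Spread the headings out into columns, one output row per record number
wide = long_df.pivot(index='row', columns='heading', values='value')
print(wide)
```

Here cumcount plays the same role as in the MultiIndex version: it supplies the row label that makes each (row, heading) pair unique, which pivot requires.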

Full demo

import pandas as pd 
from io import StringIO 

txt = """HEADING1 
value 1 

HEADING2 
value 2 

HEADING1 
value 11 

HEADING2 
value 12""" 

cols, vals = zip(*[line.split('\n') for line in StringIO(txt).read().split('\n\n')]) 

s = pd.Series(vals, cols) 
s.index = [s.groupby(level=0).cumcount(), s.index] 

s.unstack() 

   HEADING1  HEADING2
0   value 1   value 2
1  value 11  value 12
+1

That's clever! – MaxU

0

Using defaultdict

from collections import defaultdict 
from io import StringIO 
import pandas as pd 

txt = """HEADING1 
value 1 

HEADING2 
value 2 

HEADING1 
value 11 

HEADING2 
value 12""" 

d = defaultdict(list)
# each block is "HEADING\nvalue"; collect the values under their heading
for block in StringIO(txt).read().split('\n\n'):
    k, v = block.split('\n')
    d[k].append(v)

pd.DataFrame(d)

   HEADING1  HEADING2
0   value 1   value 2
1  value 11  value 12
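Note that pd.DataFrame(d) raises a ValueError when the lists in d have different lengths, for example when the last record is missing a heading. A minimal sketch of one way around that, assuming it is acceptable to pad the shorter columns with NaN; the truncated sample text is made up for illustration.

```python
from collections import defaultdict
import pandas as pd

# Hypothetical input where the second record has no HEADING2 block
txt = "HEADING1\nvalue 1\n\nHEADING2\nvalue 2\n\nHEADING1\nvalue 11"

d = defaultdict(list)
for block in txt.split('\n\n'):
    k, v = block.split('\n')
    d[k].append(v)

# Wrapping each list in a Series lets pandas align on the index and fill gaps with NaN
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})
print(df)
```

The values are aligned by order of appearance, not by record, so a heading missing in the middle of the file would shift later values up by one row.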