2015-12-18 36 views
4

read_csv with a missing/incomplete header or an irregular number of columns

I have ~15000 rows that look like this:

SAMPLE_TIME,   POS,  OFF, HISTOGRAM 
2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0, 
2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0, 
2015-07-15 16:43:55, 0-0-0-0-3, 1, 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0, 
2015-07-15 16:44:56, 0-0-0-0-3, 1, 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0 

I would like to import file.csv into a pandas.DataFrame, with any arbitrary name given to the columns that have no header, something like this:

SAMPLE_TIME,   POS,  OFF, HISTOGRAM 1 2 3 4 5 6 
2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,   0, 5, 59, 4, 0, 0, 
2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,   0, 5, 0, 6, 0, nan 
2015-07-15 16:43:55, 0-0-0-0-3, 1, 0,   0, 5, 0, 7, nan nan 
2015-07-15 16:44:56, 0-0-0-0-3, 1, 2,   0, 5, 0, 0, 2, nan 

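So far this has been impossible to import. I tried different solutions, such as specifying a header, but still no joy. The only way I was able to make it work was to add a header manually in the .csv file, which rather defeats the purpose of automation!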


Then I tried this solution, doing this:

lines=list(csv.reader(open('file.csv')))  
header, values = lines[0], lines[1:] 

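It reads the file correctly and gives me a list, values, of ~15000 elements. Each element is a list of strings, and each string is a correctly parsed data field from the file. But when I try to do this: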

data = {h:v for h,v in zip (header, zip(*values))} 
df = pd.DataFrame.from_dict(data) 

or this:

data2 = {h:v for h,v in zip (str(xrange(16)), zip(*values))} 
df2 = pd.DataFrame.from_dict(data) 

then the columns without a header disappear and the order of the columns is completely mixed up. Any ideas for a possible solution?

Answers

4

You can create the columns based on the length of the first actual data row:

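import pandas as pd

with open("out.txt") as f:
    # read the header line and count the fields in the first data row
    h, ln = next(f), len(next(f).split(","))
    header = h.strip().split(",")
    # rewind and skip the header line so pandas starts at the first data row
    f.seek(0), next(f)
    # pad the header with integer names for the unnamed columns
    header += range(ln)
    print(pd.read_csv(f, names=header))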

which will give you:

  SAMPLE_TIME   POS   OFF HISTOGRAM 0 1 2 3 \ 
0 2015-07-15 16:41:56  0-0-0-0-3   1   2 0 5 59 0 
1 2015-07-15 16:42:55  0-0-0-0-3   1   0 0 5 9 0 
2 2015-07-15 16:43:55  0-0-0-0-3   1   0 0 5 5 0 
3 2015-07-15 16:44:56  0-0-0-0-3   1   2 0 5 0 0 

    4 5 ... 13 14 15 16 17 18 19 20 21 22 
0 0 0 ... 0 0 0 0 0 NaN NaN NaN NaN NaN 
1 0 0 ... 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 
2 0 0 ... 4 0 0 0 NaN NaN NaN NaN NaN NaN 
3 0 0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN 

[4 rows x 27 columns] 
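
The snippet above sizes the header from the first data row only. If a later row might be wider, one option is to scan every row first. The following is only a sketch under that assumption, using the standard csv module; the helper name `padded_names` and the shortened sample data are made up for illustration.

```python
import csv
import io

def padded_names(lines):
    """Return the header fields plus integer names up to the widest row."""
    # skipinitialspace drops the blanks that follow each comma
    rows = list(csv.reader(lines, skipinitialspace=True))
    header = [name.strip() for name in rows[0]]
    widest = max(len(row) for row in rows[1:])
    # integer names for every column beyond the named ones
    return header + list(range(widest - len(header)))

# hypothetical, shortened sample in the same layout as the question
sample = io.StringIO(
    "SAMPLE_TIME,   POS,  OFF, HISTOGRAM\n"
    "2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5\n"
    "2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5,9,0\n"
)
names = padded_names(sample)
print(names)
```

The resulting list could then be passed to pd.read_csv through names=, together with skiprows=1 so that the original header line is not read as data.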

Or you can clean the file before passing it to pandas:

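import pandas as pd

from tempfile import TemporaryFile
with open("in.csv") as f, TemporaryFile("w+") as t:
    # copy the file with all spaces removed (this also removes the space inside the timestamp)
    for line in f:
        t.write(line.replace(" ", ""))
    t.seek(0)
    # count the fields in the last line that was read
    ln = len(line.strip().split(","))
    header = t.readline().strip().split(",")
    # pad the header with integer names for the unnamed columns
    header += range(ln)
    print(pd.read_csv(t, names=header))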

which gives you:

  SAMPLE_TIME  POS OFF HISTOGRAM 0 1 2 3 4 5 ... 11 \ 
0 2015-07-1516:41:56 0-0-0-0-3 1   2 0 5 59 0 0 0 ... 0 
1 2015-07-1516:42:55 0-0-0-0-3 1   0 0 5 9 0 0 0 ... 0 
2 2015-07-1516:43:55 0-0-0-0-3 1   0 0 5 5 0 0 0 ... 0 
3 2015-07-1516:44:56 0-0-0-0-3 1   2 0 5 0 0 0 0 ... 0 

    12 13 14 15 16 17 18 19 20 
0 0 0 0 0 0 0 NaN NaN NaN 
1 50 0 NaN NaN NaN NaN NaN NaN NaN 
2 0 4 0 0 0 NaN NaN NaN NaN 
3 6 0 0 0 0 NaN NaN NaN NaN 

[4 rows x 25 columns] 

Or drop the columns that are all NaN:

print(pd.read_csv(f, names=header).dropna(axis=1,how="all")) 

which gives you:

  SAMPLE_TIME   POS   OFF HISTOGRAM 0 1 2 3 \ 
0 2015-07-15 16:41:56  0-0-0-0-3   1   2 0 5 59 0 
1 2015-07-15 16:42:55  0-0-0-0-3   1   0 0 5 9 0 
2 2015-07-15 16:43:55  0-0-0-0-3   1   0 0 5 5 0 
3 2015-07-15 16:44:56  0-0-0-0-3   1   2 0 5 0 0 

    4 5 ... 8 9 10 11 12 13 14 15 16 17 
0 0 0 ... 2 0 0 0 0 0 0 0 0 0 
1 0 0 ... 2 0 0 0 50 0 NaN NaN NaN NaN 
2 0 0 ... 2 0 0 0 0 4 0 0 0 NaN 
3 0 0 ... 2 0 0 0 6 0 0 0 0 NaN 

[4 rows x 22 columns] 
-2

Assuming your data is in a file called foo.csv, you can do the following. This was tested against pandas 0.17.

df = pd.read_csv('foo.csv', names=['sample_time', 'pos', 'off', 'histogram', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17'], skiprows=1) 
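
Typing out every name by hand gets tedious as the column count grows. As a small sketch, the same list can be built programmatically. The count of 17 unnamed columns is taken from the call above, and the variable names `named` and `n_extra` are only illustrative.

```python
# the four names that appear in the file's own header
named = ['sample_time', 'pos', 'off', 'histogram']

# assumption: 17 unnamed columns, matching the hard-coded list above
n_extra = 17

# '1' .. '17' as strings, the same labels as in the hand-written list
names = named + [str(i) for i in range(1, n_extra + 1)]
print(names)
```

This list can be passed as names=names in place of the literal list.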
-1

Well, how about this. I made a csv out of your sample data.

First I read in the lines:

import csv

# 'rb' is the Python 2 idiom; on Python 3 use open('test.csv', newline='') instead
with open('test.csv','rb') as f:
    lines = list(csv.reader(f))
headers, values = lines[0], lines[1:]

To produce nice header names, use this line:

headers = [i or ind for ind, i in enumerate(headers)] 

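This works because of how (I assume) csv works: the header row should contain a bunch of empty-string values. An empty string evaluates to False, so this comprehension returns the column number for every column that has no header.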
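
To make the `i or ind` trick concrete, here is a minimal illustration on a hand-written header list. The list is hypothetical and assumes the header row really is padded with empty strings.

```python
# hypothetical header row padded with empty strings
headers = ['SAMPLE_TIME', 'POS', 'OFF', 'HISTOGRAM', '', '', '']

# '' is falsy, so `name or index` falls back to the column position
headers = [name or index for index, name in enumerate(headers)]
print(headers)  # ['SAMPLE_TIME', 'POS', 'OFF', 'HISTOGRAM', 4, 5, 6]
```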

Then just make a DataFrame:

df = pd.DataFrame(values,columns=headers) 

which looks like:

      SAMPLE_TIME   POS   OFF HISTOGRAM 4 5 6 7 8 9 \
0 15/07/2015 16:41  0-0-0-0-3   1   2 0 5 59 0 0 0 
1 15/07/2015 16:42  0-0-0-0-3   1   0 0 5 9 0 0 0 
2 15/07/2015 16:43  0-0-0-0-3   1   0 0 5 5 0 0 0 
3 15/07/2015 16:44  0-0-0-0-3   1   2 0 5 0 0 0 0 

    ... 12 13 14 15 16 17 18 19 20 21 
0 ... 2 0 0 0 0 0 0 0 0 0 
1 ... 2 0 0 0 50 0    
2 ... 2 0 0 0 0 4 0 0 0  
3 ... 2 0 0 0 6 0 0 0 0  

[4 rows x 22 columns] 
+0

Python 2.7.10, Anaconda 2.1.0 on 64-bit Windows 7. Pandas 0.17.1, csv 1.0. I don't understand your disbelief. https://gist.github.com/gregroberts/a6e6040c045ea9130fee

+0

So the input has all of those values in a single cell. I see my mistake.

+0

Yes, the first example is the input that has a lot of problems.

3

You can split the HISTOGRAM column into a new DataFrame and concat it to the original.

print(df)
     SAMPLE_TIME,  POS, OFF, \ 
0 2015-07-15 16:41:56 0-0-0-0-3, 1, 
1 2015-07-15 16:42:55 0-0-0-0-3, 1, 
2 2015-07-15 16:43:55 0-0-0-0-3, 1, 
3 2015-07-15 16:44:56 0-0-0-0-3, 1, 

           HISTOGRAM 
0 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0, 
1   0,0,5,9,0,0,0,0,0,2,0,0,0,50,0, 
2  0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0, 
3  2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0 
# create a new DataFrame from the HISTOGRAM column
h = pd.DataFrame([x.split(',') for x in df['HISTOGRAM'].tolist()])
print(h)
    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 
0 2 0 5 59 0 0 0 0 0 2 0 0 0 0 0 0  0  0  0  
1 0 0 5 9 0 0 0 0 0 2 0 0 0 50 0  None None None None 
2 0 0 5 5 0 0 0 0 0 2 0 0 0 0 4 0  0  0  None 
3 2 0 5 0 0 0 0 0 0 2 0 0 0 6 0 0  0  0 None None 

# append to the original and rename column 0
df = pd.concat([df, h], axis=1).rename(columns={0: 'HISTOGRAM'})
print(df)
           HISTOGRAM HISTOGRAM 1 2 3 4 5 ... 10 \ 
0 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,   2 0 5 59 0 0 ... 0 
1   0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,   0 0 5 9 0 0 ... 0 
2  0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,   0 0 5 5 0 0 ... 0 
3  2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0   2 0 5 0 0 0 ... 0 

    11 12 13 14 15 16 17 18 19 
0 0 0 0 0 0  0  0  0   
1 0 0 50 0  None None None None 
2 0 0 0 4 0  0  0  None 
3 0 0 6 0 0  0  0 None None 

[4 rows x 24 columns]
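
The same split can also be written with the string accessor, which pads shorter rows with missing values. This is only a sketch, assuming a pandas version where `str.split` accepts `expand=True`. The two-row frame and the `HISTOGRAM_` prefix are made up for illustration.

```python
import pandas as pd

# hypothetical two-row frame in the same layout as above
df = pd.DataFrame({
    "SAMPLE_TIME": ["2015-07-15 16:41:56", "2015-07-15 16:42:55"],
    "HISTOGRAM": ["2,0,5,59", "0,0,5"],
})

# expand=True returns one column per piece; shorter rows get missing values
h = df["HISTOGRAM"].str.split(",", expand=True)

# prefix the new columns so they cannot clash with existing names
h = h.add_prefix("HISTOGRAM_")

# replace the raw string column with the expanded columns
result = pd.concat([df.drop(columns="HISTOGRAM"), h], axis=1)
print(result)
```

Dropping the raw string column before the concat avoids ending up with two columns named HISTOGRAM, as in the output above.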