0

Ignoring bad lines of data that break the header= keyword in pandas.read_csv()

I have a series of very messy *.csv files that are being read in to pandas. An example csv is:

Instrument 35392 
"Log File Name : station" 
"Setup Date (MMDDYY) : 031114" 
"Setup Time (HHMMSS) : 073648" 
"Starting Date (MMDDYY) : 031114" 
"Starting Time (HHMMSS) : 090000" 
"Stopping Date (MMDDYY) : 031115" 
"Stopping Time (HHMMSS) : 235959" 
"Interval (HHMMSS) : 010000" 
"Sensor warmup (HHMMSS) : 000200" 
"Circltr warmup (HHMMSS) : 000200" 


"Date","Time","","Temp","","SpCond","","Sal","","IBatt","" 
"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","Volts","" 

"Random message here 031114 073721 to 031114 083200" 
03/11/14,09:00:00,"",15.85,"",1.408,"",.74,"",6.2,"" 
03/11/14,10:00:00,"",15.99,"",1.96,"",1.05,"",6.3,"" 
03/11/14,11:00:00,"",14.2,"",40.8,"",26.12,"",6.2,"" 
03/11/14,12:00:01,"",14.2,"",41.7,"",26.77,"",6.2,"" 
03/11/14,13:00:00,"",14.5,"",41.3,"",26.52,"",6.2,"" 
03/11/14,14:00:00,"",14.96,"",41,"",26.29,"",6.2,"" 
"message 3" 
"message 4"** 

I have been using this code to import the *.csv files, process the double headers, pull out the empty columns, and then strip the offending rows with bad data:

# combine the first two columns (date + time) into one datetime column,
# and read the doubled header rows as a two-level header
DF = pd.read_csv(BADFILE, parse_dates={'Datetime_(ascii)': [0, 1]}, sep=",",
                 header=[10, 11], na_values=['', 'na', 'nan nan'],
                 skiprows=[10], encoding='cp1252')

# drop the empty columns created by the doubled delimiters
DF = DF.dropna(how="all", axis=1)
# drop rows that don't have at least 2 real values
DF = DF.dropna(thresh=2)
# strip the stray message rows
droplist = ['message', 'Random']
DF = DF[~DF['Datetime_(ascii)'].str.contains('|'.join(droplist))]

DF.head() 

    Datetime_(ascii)  (Temp, øC)  (SpCond, mS/cm)  (Sal, ppt)  (IBatt, Volts)
0  03/11/14 09:00:00       15.85            1.408        0.74             6.2
1  03/11/14 10:00:00       15.99            1.960        1.05             6.3
2  03/11/14 11:00:00       14.20           40.800       26.12             6.2
3  03/11/14 12:00:01       14.20           41.700       26.77             6.2
4  03/11/14 13:00:00       14.50           41.300       26.52             6.2

This was working fine and dandy until I got a file that had an erroneous one-column line after the header: "Random message here 031114 073721 to 031114 083200"

The error I receive is:

C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _do_date_conversions(self, names, data)
   1554    data, names = _process_date_conversion(
   1555        data, self._date_conv, self.parse_dates, self.index_col,
-> 1556        self.index_names, names, keep_date_col=self.keep_date_col)
   1557
   1558    return names, data

C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col)
   2975    if not keep_date_col:
   2976        for c in list(date_cols):
-> 2977            data_dict.pop(c)
   2978            new_cols.remove(c)
   2979

KeyError: ('Time', 'HHMMSS')

If I delete that line, the code works fine. Similarly, if I delete the header= rows, the code works fine. However, I want to be able to preserve this, because I am reading in hundreds of these files.

Difficulty: I would prefer to not open each file before the call to pandas.read_csv(), as these files can be rather large - thus I don't want to read and save them multiple times! Also, I would prefer a real pandas/pythonic solution that doesn't involve first opening the file as a stringIO buffer to remove the offending lines.

+0

Can you post the offending lines? In each case where you get the error, does it occur on the same kind of bad line, or could there be other kinds of problems on other lines in some files? –

+0

The bad line creating the error is: "Random message here 031114 073721 to 031114 083200". This line may or may not exist in all the files, so I can't just add skiprows=index. Also, if I change the actual text of that line, the error still persists - it doesn't matter what the text is, only that it is a one-column line after the header. –

Answers

1

Here is one approach that builds on the fact that skiprows accepts a callable. The callable only receives the row index being considered, which is a built-in limitation of that parameter. As such, the callable skip_test() first checks whether the current index is in the set of known indices to skip. If not, it opens the actual file and checks the corresponding line to see if its contents match.

The skip_test() function is a little hacky in the sense that it does inspect the actual file, although it only reads up until the row index it is currently evaluating. It also assumes that the bad lines always begin with the same string ("foo" in the example), but that seems to be a safe assumption given the OP.

# example data 
""" foo.csv 
uid,a,b,c 
0,1,2,3 
skip me 
1,11,22,33 
foo 
2,111,222,333 
""" 

import pandas as pd 

def skip_test(r, fn, fail_on, known):
    if r in known:  # we know we always want to skip these
        return True
    # check if the line at row index r starts with the bad-line marker;
    # for efficiency, stop reading once we pass that row index
    with open(fn, "r") as f:
        for i, line in enumerate(f):
            if i == r:
                return line.startswith(fail_on)
            if i > r:
                break
    return False

fname = "foo.csv" 
fail_str = "foo" 
known_skip = [2] 
pd.read_csv(fname, sep=",", header=0, 
      skiprows=lambda x: skip_test(x, fname, fail_str, known_skip)) 
# output 
   uid    a    b    c
0    0    1    2    3
1    1   11   22   33
2    2  111  222  333

If you know exactly which line the random message will appear on when it does appear, then this will be much faster, as you can tell it not to check the file contents for any index prior to the potential offending line.
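For instance, here is a minimal sketch of that shortcut, where first_possible (a hypothetical parameter, not part of the answer above) is the earliest row index at which a stray message could appear:

def skip_test_fast(r, fn, fail_on, known, first_possible=4):
    if r in known:  # always skip the known-bad indices
        return True
    if r < first_possible:  # hypothetical cutoff: never touch the file for early rows
        return False
    # otherwise check the actual line, stopping once we pass row index r
    with open(fn, "r") as f:
        for i, line in enumerate(f):
            if i == r:
                return line.startswith(fail_on)
            if i > r:
                break
    return False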

+0

Thank you! Yes, I know what messages will appear throughout my files, so I can parse for them. –

+0

You're welcome! –

0

After some tinkering yesterday, I found a solution, and what the potential underlying issue may be.

I tried the skip_test() function answer above, but I was still getting errors related to the size of the table:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)() 

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)() 

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)() 

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)() 

pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)() 

ParserError: Error tokenizing data. C error: Expected 1 fields in line 14, saw 11 

So after playing around with skiprows=, I discovered that I was just not getting the behavior I wanted when using engine='c'. read_csv() was still determining the size of the file from those first few rows, and some of those single-column lines were still being passed through. It may be that I have more bad single-column lines in my csv set than I had planned on.
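As an aside, this field-count behavior is easy to reproduce in isolation. Here is a minimal sketch (the inline data is made up for illustration) showing how the C engine fixes the expected field count from the first row:

import io
import pandas as pd

# the first row has 1 field, so the C tokenizer expects 1 field everywhere
bad = io.StringIO("only one field\n1,2,3,4,5,6,7,8,9,10,11\n")
try:
    pd.read_csv(bad, engine='c', header=None)
except pd.errors.ParserError as e:
    print(e)  # Error tokenizing data. C error: Expected 1 fields in line 2, saw 11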

Instead, I create an arbitrarily sized DataFrame as a template. I pull in the entire .csv file, then use logic to strip out the NaN rows.

For example, I know that the largest table I will encounter with my data will be 10 columns wide. So my call to pandas is:

# name all 10 columns up front so read_csv never infers the width
# from the first (possibly single-column) rows
DF = pd.read_csv(csv_file, sep=',',
                 parse_dates={'Datetime_(ascii)': [0, 1]},
                 na_values=['', 'na', '999999', '#'], engine='c',
                 encoding='cp1252', names=list(range(0, 10)))

Then I use these two lines to drop the NaN rows and columns from the DataFrame:

# drop the null columns created by the doubled delimiters
DF = DF.dropna(how="all", axis=1) 
DF = DF.dropna(thresh=2) # drop if we don't have at least 2 cells with real values
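Putting it all together on the example file from the question, a minimal end-to-end sketch of this template approach might look like the following (the file name, the 11-column width, and the extended droplist are assumptions based on the example above, reusing the filtering idea from my original code):

import pandas as pd

MAX_COLS = 11  # assumption: the widest row in the example file has 11 fields

# read every row against a fixed set of column names so that
# single-column message lines cannot change the inferred table width
DF = pd.read_csv('station.csv', sep=',',
                 parse_dates={'Datetime_(ascii)': [0, 1]},
                 na_values=['', 'na', '999999', '#'], engine='c',
                 encoding='cp1252', names=list(range(MAX_COLS)))

# drop the null columns created by the doubled delimiters
DF = DF.dropna(how="all", axis=1)
# drop rows without at least 2 real values (the one-column message lines)
DF = DF.dropna(thresh=2)

# strip the residual header/unit rows that survive the thresh filter
droplist = ['message', 'Random', 'MMDDYY', 'Date']
DF = DF[~DF['Datetime_(ascii)'].astype(str).str.contains('|'.join(droplist))]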