2016-03-27 54 views
1

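ValueError in Pandas read_csv with a "::" separator

When I run the code below on the file "ratings.dat", I get a "ValueError". I have tried the same code on another file that uses "," as the separator, without any problems. However, Pandas seems to fail when the separator is "::".

Is there something wrong with my code?

Code: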

import pandas as pd 
import numpy as np 

r_cols = ['userId', 'movieId', 'rating'] 
r_types = {'userId': np.str, 'movieId': np.str, 'rating': np.float64} 
ratings = pd.read_csv(
     r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\' 
     r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\ratings.dat', 
     sep='::', names=r_cols, usecols=range(3), dtype=r_types 
     ) 

m_cols = ['movieId', 'title'] 
m_types = {'movieId': np.str, 'title': np.str} 
movies = pd.read_csv(
     r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\' 
     r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\movies.dat', 
     sep='::', names=m_cols, usecols=range(2), dtype=m_types 
     ) 

ratings = pd.merge(movies, ratings) 
ratings.head() 

"ratings.dat":

1::1287::5::978302039 
1::2804::5::978300719 
1::594::4::978302268 
1::919::4::978301368 
1::595::5::978824268 

Error output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-a2649e528fb9> in <module>()
     7   r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\' 
     8   r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\ratings.dat', 
----> 9   sep='::', names=r_cols, usecols=range(3), dtype=r_types 
    10  ) 
    11 
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines) 
    496      skip_blank_lines=skip_blank_lines) 
    497 
--> 498   return _read(filepath_or_buffer, kwds) 
    499 
    500  parser_f.__name__ = name 
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds) 
    273 
    274  # Create the parser. 
--> 275  parser = TextFileReader(filepath_or_buffer, **kwds) 
    276 
    277  if (nrows is not None) and (chunksize is not None): 
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds) 
    584 
    585   # might mutate self.engine 
--> 586   self.options, self.engine = self._clean_options(options, engine) 
    587   if 'has_index_names' in kwds: 
    588    self.options['has_index_names'] = kwds['has_index_names'] 
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _clean_options(self, options, engine) 
    663       msg += " (Note the 'converters' option provides"\ 
    664        " similar functionality.)" 
--> 665      raise ValueError(msg) 
    666     del result[arg] 
    667 
ValueError: Falling back to the 'python' engine because the 'c' engine does not support regex separators, but this causes 'dtype' to be ignored as it is not supported by the 'python' engine. (Note the 'converters' option provides similar functionality.) 

Answer

3

If you read the last line of the traceback carefully, you may see why it fails. I have split it into two lines:

ValueError: Falling back to the 'python' engine because the 'c' engine does not support regex separators,

but this causes 'dtype' to be ignored as it is not supported by the 'python' engine. (Note the 'converters' option provides similar functionality.)

So the separator '::' is interpreted as a regular expression. As the Pandas documentation says about sep:

Regular expressions are accepted and will force use of the python parsing engine

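(emphasis mine)

Thus, Pandas will use the Python engine to read the data. The second part of the error message then says that, because the Python engine is used, dtype is ignored. (Presumably the C engine means numpy, which can use dtypes; the Python engine apparently does not deal with dtypes.)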


How to solve it

You can remove the dtype argument from your call to read_csv (you will still get a warning), or do something about the separator.
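The error message itself points at converters as a substitute for dtype. A minimal sketch of that first option, assuming the five sample rows from the question as inline data and an assumed column name 'timestamp' for the fourth field (neither the inline buffer nor that name is from the original code). Passing engine='python' explicitly should also avoid the fallback warning. Later pandas versions may accept dtype with the Python engine as well, so check the behaviour of your version.

```python
import io

import pandas as pd

# Inline stand-in for ratings.dat: the five rows quoted in the question.
sample = io.StringIO(
    "1::1287::5::978302039\n"
    "1::2804::5::978300719\n"
    "1::594::4::978302268\n"
    "1::919::4::978301368\n"
    "1::595::5::978824268\n"
)

# 'timestamp' is an assumed name for the fourth field; it is dropped via usecols.
all_cols = ['userId', 'movieId', 'rating', 'timestamp']

# converters play the role that dtype plays with the C engine.
r_converters = {'userId': str, 'movieId': str, 'rating': float}

ratings = pd.read_csv(
    sample,
    sep='::',
    engine='python',  # chosen explicitly, so no fallback is needed
    header=None,
    names=all_cols,
    usecols=['userId', 'movieId', 'rating'],
    converters=r_converters,
)
print(ratings.head())
```

For the real file, replace the io.StringIO object with the path to ratings.dat.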

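The second option seems tricky: escaping the separator or using a raw string does not help. Apparently, any separator longer than one character is interpreted by Pandas as a regular expression. That may be an unfortunate decision on Pandas' side.

One way to avoid all this is to use a single ':' as the separator and skip every other (empty) column. For example: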

ratings = pd.read_csv(filename, sep=':', names=r_cols, 
         usecols=[0, 2, 4], dtype=r_types) 

(Or use usecols=range(0, 5, 2) if you are set on using range.)
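To see why the column indices are 0, 2 and 4, here is a small self-contained sketch that runs the same call on the five sample rows from the question, held in an in-memory buffer (the buffer is an assumption made for illustration). Splitting "1::1287::5::978302039" on a single ':' gives seven fields, of which every odd-numbered one is empty. The built-in str and float are used in place of np.str and np.float64, since np.str is not available in recent NumPy releases.

```python
import io

import pandas as pd

# Inline stand-in for ratings.dat: the five rows quoted in the question.
sample = io.StringIO(
    "1::1287::5::978302039\n"
    "1::2804::5::978300719\n"
    "1::594::4::978302268\n"
    "1::919::4::978301368\n"
    "1::595::5::978824268\n"
)

r_cols = ['userId', 'movieId', 'rating']
r_types = {'userId': str, 'movieId': str, 'rating': float}

# A single-character separator keeps the C engine, so dtype is honoured.
# Fields 1, 3 and 5 are the empty strings between the doubled colons.
ratings = pd.read_csv(
    sample,
    sep=':',
    header=None,
    names=r_cols,
    usecols=[0, 2, 4],
    dtype=r_types,
)
print(ratings.head())
```

This only works as long as no field contains a ':' of its own.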


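Addendum

The OP correctly raised a point about fields that contain a single : character. Perhaps there is a way around that, but otherwise you could make it a two-step approach, using numpy's genfromtxt instead: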

# genfromtxt requires a proper numpy dtype, not a dict 
# for Python 3, we need U10 for strings 
dtype = np.dtype([('userId', 'U10'), ('movieId', 'U10'),
                  ('rating', np.float64)])
data = np.genfromtxt(filename, dtype=dtype, names=r_cols, 
        delimiter='::', usecols=list(range(3))) 
ratings = pd.DataFrame(data) 
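
Another two-step variant, offered only as a sketch: read the text yourself, replace every '::' with a single character that does not occur in the data, and then pass the result to read_csv. A lone ':' inside a field is left untouched. The tab used as the temporary separator and the two sample rows are assumptions made up for illustration, not taken from the MovieLens files.

```python
import io

import pandas as pd

# Made-up rows in the style of movies.dat; the second title contains a single ':'.
raw = (
    "1::First Example (2000)::Drama\n"
    "2::Second Example: Part Two (2001)::Action\n"
)

TMP_SEP = '\t'  # assumption: tabs never occur in the data
if TMP_SEP in raw:
    raise ValueError("temporary separator already occurs in the data")

# Only the doubled colons are replaced; the single ':' in the title survives.
cleaned = raw.replace('::', TMP_SEP)

m_cols = ['movieId', 'title']
m_types = {'movieId': str, 'title': str}

# A one-character separator keeps the C engine, so dtype works as usual.
movies = pd.read_csv(
    io.StringIO(cleaned),
    sep=TMP_SEP,
    header=None,
    names=m_cols,
    usecols=[0, 1],
    dtype=m_types,
)
print(movies)
```

For a real file, raw would come from reading the file's contents (for example with open(filename).read()), which is fine for files that fit in memory.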

One of the data records has a ":" inside a data field, so Python keeps throwing a C error: "Expected 5 fields in line 12, saw 6". Is there any way to handle this? – Cloud


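At that point, I would probably open the data file in a text editor, check whether the file contains, for example, any commas or semicolons, and if not, replace every '::' with ','. That assumes, of course, that I have access to the file. – Evert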


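@Cloud You may want to ask about your case on the Pandas mailing list (how to avoid '::' being interpreted as a regular expression; backslashes did not work when I tested this), or raise an issue on the Pandas GitHub page. – Evert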
