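ValueError in pandas read_csv with a "::" separator
When I run the code below on the file "ratings.dat", I get a "ValueError". I have run the same code on another file that uses "," as the separator without any problem. However, pandas seems to fail when the separator is "::".
Did I type the code wrong?
Code: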
import pandas as pd
import numpy as np
r_cols = ['userId', 'movieId', 'rating']
# np.str was only an alias for the builtin str and was removed in NumPy 1.24
r_types = {'userId': str, 'movieId': str, 'rating': np.float64}
ratings = pd.read_csv(
    r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\'
    r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\ratings.dat',
    sep='::', names=r_cols, usecols=range(3), dtype=r_types
)
m_cols = ['movieId', 'title']
m_types = {'movieId': str, 'title': str}
movies = pd.read_csv(
    r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\'
    r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\movies.dat',
    sep='::', names=m_cols, usecols=range(2), dtype=m_types
)
ratings = pd.merge(movies, ratings)
ratings.head()
Sample of "ratings.dat":
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368
1::595::5::978824268
Error output:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-a2649e528fb9> in <module>()
7 r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\'
8 r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\ratings.dat',
----> 9 sep='::', names=r_cols, usecols=range(3), dtype=r_types
10 )
11
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
496 skip_blank_lines=skip_blank_lines)
497
--> 498 return _read(filepath_or_buffer, kwds)
499
500 parser_f.__name__ = name
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
273
274 # Create the parser.
--> 275 parser = TextFileReader(filepath_or_buffer, **kwds)
276
277 if (nrows is not None) and (chunksize is not None):
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
584
585 # might mutate self.engine
--> 586 self.options, self.engine = self._clean_options(options, engine)
587 if 'has_index_names' in kwds:
588 self.options['has_index_names'] = kwds['has_index_names']
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _clean_options(self, options, engine)
663 msg += " (Note the 'converters' option provides"\
664 " similar functionality.)"
--> 665 raise ValueError(msg)
666 del result[arg]
667
ValueError: Falling back to the 'python' engine because the 'c' engine does not support regex separators, but this causes 'dtype' to be ignored as it is not supported by the 'python' engine. (Note the 'converters' option provides similar functionality.)
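The error message itself hints at one way out: request the 'python' engine explicitly and move the type conversions from dtype into converters. Below is a minimal sketch of that idea, not a confirmed fix for the setup above. It assumes the four-field layout shown in the sample, and it reads a few sample rows from an in-memory buffer (io.StringIO) instead of the real path so that it is self-contained.

```python
import io

import pandas as pd

# A few rows copied from the ratings.dat sample above; the in-memory buffer
# stands in for the real file path (an assumption made for this sketch).
sample = io.StringIO(
    "1::1287::5::978302039\n"
    "1::2804::5::978300719\n"
    "1::594::4::978302268\n"
)

r_cols = ['userId', 'movieId', 'rating']

ratings = pd.read_csv(
    sample,
    sep='::',             # multi-character separator, handled as a regex
    engine='python',      # ask for the python engine instead of falling back
    header=None,
    names=r_cols,
    usecols=range(3),     # skip the fourth field
    # converters take over the role dtype played in the original call
    converters={'userId': str, 'movieId': str, 'rating': float},
)

print(ratings)
```

As far as I know, more recent pandas releases also accept dtype with the python engine, so on an up-to-date install adding engine='python' to the original call may already be enough.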
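One of the data records has a ":" inside a data field, so Python keeps throwing a C error: "Expected 5 fields in line 12, saw 6". Is there any way to deal with this? – Cloud
At that point, I would probably open the data file in a text editor, check whether the file contains e.g. any commas or semicolons, and then replace every '::' with ','. Provided I have access to the file, of course. – Evert
@Cloud You may want to ask about your case on the Pandas mailing list (how to avoid '::' being interpreted as a regular expression; backslashes did not work when I tested this), or raise an issue on the Pandas GitHub page. – Evert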
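The replace-the-separator idea can also be done in Python rather than in a text editor. The following is only a sketch under the assumption that the data contains no commas: it refuses to continue if it finds one, swaps every "::" for ",", and then lets the default C engine parse the result with dtype. The sample text stands in for the contents of the real file.

```python
import io

import pandas as pd

# Stand-in for the text of ratings.dat (rows taken from the sample above).
raw = (
    "1::1287::5::978302039\n"
    "1::2804::5::978300719\n"
)

# The replacement is only safe if ',' does not already occur in the data.
if ',' in raw:
    raise ValueError("data already contains ','; pick another separator")

converted = raw.replace('::', ',')

ratings = pd.read_csv(
    io.StringIO(converted),
    header=None,
    names=['userId', 'movieId', 'rating'],
    usecols=range(3),
    # dtype is accepted here because a single-character separator
    # keeps pandas on the C engine
    dtype={'userId': str, 'movieId': str, 'rating': float},
)

print(ratings)
```

This check matters for a file like movies.dat, where free-text fields could plausibly contain commas; in that case a different single character that does not occur in the data would be needed.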
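Another way to keep "::" from being treated as a regular expression is to skip read_csv for the splitting step and split each line on the literal string with str.split. A rough sketch follows; the two rows are hypothetical examples in a movies.dat-like layout (one with a single ":" inside the title), not data taken from the question.

```python
import io

import pandas as pd

# Hypothetical rows; the second title contains a single ':' on purpose.
lines = io.StringIO(
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Star Trek: Generations (1994)::Action|Adventure|Sci-Fi\n"
)

# str.split treats '::' as a literal string, so a lone ':' inside a field
# does not start a new column.
rows = [line.rstrip('\n').split('::') for line in lines if line.strip()]

# Keep only the first two fields, like usecols=range(2) in the question.
movies = pd.DataFrame([row[:2] for row in rows], columns=['movieId', 'title'])

print(movies)
```

Both columns come out as strings here, which matches the str types requested in the question; numeric columns would need an explicit astype afterwards.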