2015-07-13 100 views

I'm looping through files with pandas, using code that I got from: Comparing and replacing values inside DataFrames

import pandas as pd

main_df = pd.read_csv('main.txt', sep='|', encoding='utf-8')
data_df = pd.read_csv('data.csv', encoding='utf-8')

main_df_part = main_df[['PRIM_LAT_DEC', 'PRIM_LONG_DEC', 'FEATURE_NAME', 'STATE_ALPHA']] 
main_df_part.columns = ['LAT', 'LONG', 'CITY', 'STATE'] 
main_df_part = main_df_part.set_index(['CITY', 'STATE']) 
data_df = data_df.set_index(['CITY', 'STATE']) 

data_df.update(main_df_part) 

data_df.to_csv('data/new.csv', sep=',', mode='a') 

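I have about 60 files that I need to run through as main_df. In short, I tried the following:

  1. Concatenating the files, but I keep getting pandas.parser.CParserError: Error tokenizing data. C error: out of memory
  2. Using chunksize, but that turns the DataFrame into a pandas.io.parsers.TextFileReader, which breaks the code I had before
  3. Lastly, iterating through each file and putting its name in place of main.txt, but I keep getting Exception: cannot handle a non-unique multi-index!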

This is what I used for the third method:

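import os

files = [f for f in os.listdir('./data') if os.path.isfile(os.path.join('./data', f))]

for w in files:
    # os.listdir returns bare file names, so join each one with its directory
    main_df = pd.read_csv(os.path.join('./data', w), sep='|', low_memory=False, encoding='utf-8')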

Any ideas on how to fix the multi-index error?

Additional information

Error from method 1:

Traceback (most recent call last): 
    File "C:/Users/Leb/Desktop/Python/py-script/geo_pandas.py", line 6, in <module> 
    main_df = pd.read_csv('data.txt', sep='|', low_memory=False, encoding='utf-8') 
    File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 474, in parser_f 
    return _read(filepath_or_buffer, kwds) 
    File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 260, in _read 
    return parser.read() 
    File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 721, in read 
    ret = self._engine.read(nrows) 
    File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 1170, in read 
    data = self._reader.read(nrows) 
    File "pandas\parser.pyx", line 772, in pandas.parser.TextReader.read (pandas\parser.c:7581) 
    File "pandas\parser.pyx", line 858, in pandas.parser.TextReader._read_rows (pandas\parser.c:8532) 
    File "pandas\parser.pyx", line 1742, in pandas.parser.raise_parser_error (pandas\parser.c:20715) 
pandas.parser.CParserError: Error tokenizing data. C error: out of memory 

Error from method 2:

Traceback (most recent call last): 
    File "C:/Users/Leb/Desktop/Python/py-script/geo_pandas.py", line 11, in <module> 
    main_df_part = main_df[['PRIM_LAT_DEC', 'PRIM_LONG_DEC','FEATURE_NAME', 'STATE_ALPHA']] 
TypeError: 'TextFileReader' object is not subscriptable 
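
For reference, when chunksize is passed, pd.read_csv returns an iterator that yields one DataFrame per chunk, which is why subscripting the reader itself fails. A minimal sketch of iterating over it, assuming a small in-memory sample in place of the real main.txt (the rows and coordinate values are made up for illustration):

```python
import io
import pandas as pd

# Made-up pipe-delimited sample standing in for main.txt.
sample = io.StringIO(
    "FEATURE_NAME|STATE_ALPHA|PRIM_LAT_DEC|PRIM_LONG_DEC\n"
    "Albany|NY|1.0|2.0\n"
    "Syracuse|NY|3.0|4.0\n"
    "Columbia|MO|5.0|6.0\n"
)

cols = ['PRIM_LAT_DEC', 'PRIM_LONG_DEC', 'FEATURE_NAME', 'STATE_ALPHA']
parts = []

# With chunksize set, read_csv yields one DataFrame per chunk.
for chunk in pd.read_csv(sample, sep='|', chunksize=2):
    parts.append(chunk[cols])  # each chunk is an ordinary DataFrame

# Only the four needed columns are kept from every chunk.
main_df_part = pd.concat(parts, ignore_index=True)
```

Selecting the needed columns inside the loop keeps only a slice of each chunk in memory, which may also help with the out-of-memory error from method 1, though that depends on how large the files are.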

Error from method 3:

Traceback (most recent call last): 
    File "C:/Users/Leb/Desktop/Python/py-script/geo_pandas.py", line 32, in <module> 
    data_df.update(main_df_part) 
    File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 3416, in update 
    other = other.reindex_like(self) 
    File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1564, in reindex_like 
    return self.reindex(**d) 
    File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2511, in reindex 
    **kwargs) 
    File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1773, in reindex 
    method, fill_value, copy).__finalize__(self) 
    File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2470, in _reindex_axes 
    fill_value, limit) 
    File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2477, in _reindex_index 
    limit=limit) 
    File "C:\Python34\lib\site-packages\pandas\core\index.py", line 4929, in reindex 
    "cannot handle a non-unique multi-index!") 
Exception: cannot handle a non-unique multi-index! 
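Please post the exact traceback for each of the things you tried. – Manhattan

It will take a second, but I'll work on it. – Leb

I'd hazard a guess on method 3: 'main_df_part' has two rows with exactly the same index. Scan it. You probably have a city-state combination that appears more than once in one of your files. – Manhattan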
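
One way to do the scan suggested above is DataFrame.duplicated on the two key columns, run before the index is set. A minimal sketch, assuming made-up rows in which one city-state pair is repeated on purpose:

```python
import pandas as pd

# Made-up rows; ('Springfield', 'MO') appears twice on purpose.
main_df_part = pd.DataFrame({
    'LAT': [100, 300, 500, 700],
    'LONG': [200, 400, 600, 800],
    'CITY': ['Columbia', 'Kansas City', 'Springfield', 'Springfield'],
    'STATE': ['MO', 'MO', 'MO', 'MO'],
})

# keep=False flags every row whose (CITY, STATE) pair occurs more than once.
mask = main_df_part.duplicated(subset=['CITY', 'STATE'], keep=False)
dupes = main_df_part[mask]
print(dupes)
```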

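Answer

As noted in the comments, it is likely that one of the files you are using to update data.csv has duplicates in its index. I ran some sample code below. It is rather long, but I hope it illustrates this particular case.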

In [1]: import pandas as pd 

In [2]: main = pd.read_csv('Main.csv') 
    ...: target1 = pd.read_csv('Target1.csv') 
    ...: target2 = pd.read_csv('Target2.csv') 

In [3]: main 
Out[3]: 
          City State  Lat  Long
0           NY    NY  NaN   NaN
1       Albany    NY  NaN   NaN
2     Syracuse    NY  NaN   NaN
3     Columbia    MO  NaN   NaN
4  Kansas City    MO  NaN   NaN
5  Springfield    MO  NaN   NaN

In [4]: target1 
Out[4]: 
   Lat  Long      City State
0  100   200        NY    NY
1  300   400    Albany    NY
2  500   600  Syracuse    NY

In [5]: target2 
Out[5]: 
   Lat  Long         City State
0  100   200     Columbia    MO
1  300   400  Kansas City    MO
2  500   600  Springfield    MO
3  700   800  Springfield    MO

In [6]: m = main.set_index(['City','State']) 
    ...: t1 = target1.set_index(['City','State']) 
    ...: t2 = target2.set_index(['City','State']) 

In [7]: m 
Out[7]: 
                   Lat  Long
City        State
NY          NY     NaN   NaN
Albany      NY     NaN   NaN
Syracuse    NY     NaN   NaN
Columbia    MO     NaN   NaN
Kansas City MO     NaN   NaN
Springfield MO     NaN   NaN

In [8]: t1 
Out[8]: 
                Lat  Long
City     State
NY       NY     100   200
Albany   NY     300   400
Syracuse NY     500   600

In [9]: t2 
Out[9]: 
                   Lat  Long
City        State
Columbia    MO     100   200
Kansas City MO     300   400
Springfield MO     500   600
            MO     700   800

Pay special attention to the last output above, Out[9]. Notice how Springfield, MO now has two rows of values assigned to it.

In [12]: m.update(t1) 

In [13]: m 
Out[13]: 
                   Lat  Long
City        State
NY          NY     100   200
Albany      NY     300   400
Syracuse    NY     500   600
Columbia    MO     NaN   NaN
Kansas City MO     NaN   NaN
Springfield MO     NaN   NaN

In [14]: m.update(t2) 
--------------------------------------------------------------------------- 
Exception         Traceback (most recent call last) 
<ipython-input-14-f5f30165a245> in <module>() 
----> 1 m.update(t2) 

C:\Anaconda\Lib\site-packages\pandas\core\frame.pyc in update(self, other, join, overwrite, filter_func, raise_conflict) 
    3414    other = DataFrame(other) 
    3415 
-> 3416   other = other.reindex_like(self) 
    3417 
    3418   for col in self.columns: 

C:\Anaconda\Lib\site-packages\pandas\core\generic.pyc in reindex_like(self, other, method, copy, limit) 
    1562     method=method, copy=copy, limit=limit) 
    1563 
-> 1564   return self.reindex(**d) 
    1565 
    1566  def drop(self, labels, axis=0, level=None, inplace=False, errors='raise'): 

C:\Anaconda\Lib\site-packages\pandas\core\frame.pyc in reindex(self, index, columns, **kwargs) 
    2509  def reindex(self, index=None, columns=None, **kwargs): 
    2510   return super(DataFrame, self).reindex(index=index, columns=columns, 
-> 2511            **kwargs) 
    2512 
    2513  @Appender(_shared_docs['reindex_axis'] % _shared_doc_kwargs) 

C:\Anaconda\Lib\site-packages\pandas\core\generic.pyc in reindex(self, *args, **kwargs) 
    1771   # perform the reindex on the axes 
    1772   return self._reindex_axes(axes, level, limit, 
-> 1773         method, fill_value, copy).__finalize__(self) 
    1774 
    1775  def _reindex_axes(self, axes, level, limit, method, fill_value, copy): 

C:\Anaconda\Lib\site-packages\pandas\core\frame.pyc in _reindex_axes(self, axes, level, limit, method, fill_value, copy) 
    2468   if index is not None: 
    2469    frame = frame._reindex_index(index, method, copy, level, 
-> 2470           fill_value, limit) 
    2471 
    2472   return frame 

C:\Anaconda\Lib\site-packages\pandas\core\frame.pyc in _reindex_index(self, new_index, method, copy, level, fill_value, limit) 
    2475      limit=None): 
    2476   new_index, indexer = self.index.reindex(new_index, method, level, 
-> 2477             limit=limit) 
    2478   return self._reindex_with_indexers({0: [new_index, indexer]}, 
    2479           copy=copy, fill_value=fill_value, 

C:\Anaconda\Lib\site-packages\pandas\core\index.pyc in reindex(self, target, method, level, limit) 
    4927     else: 
    4928      raise Exception(
-> 4929       "cannot handle a non-unique multi-index!") 
    4930 
    4931   if not isinstance(target, MultiIndex): 

Exception: cannot handle a non-unique multi-index! 

This raises the same error as yours.
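
One possible way around the error is to make the index of the updating frame unique before calling update. A minimal sketch on the Missouri rows from the session above, rebuilt in code here; keeping the first row per key is an arbitrary assumption, and the right row to keep depends on the data:

```python
import numpy as np
import pandas as pd

# The Missouri part of `m` from the session, rebuilt by hand.
m = pd.DataFrame({
    'City': ['Columbia', 'Kansas City', 'Springfield'],
    'State': ['MO', 'MO', 'MO'],
    'Lat': [np.nan] * 3,
    'Long': [np.nan] * 3,
}).set_index(['City', 'State'])

# `t2` from the session, including the repeated Springfield row.
t2 = pd.DataFrame({
    'Lat': [100, 300, 500, 700],
    'Long': [200, 400, 600, 800],
    'City': ['Columbia', 'Kansas City', 'Springfield', 'Springfield'],
    'State': ['MO', 'MO', 'MO', 'MO'],
}).set_index(['City', 'State'])

print(t2.index.is_unique)  # False: Springfield, MO appears twice

# Keep only the first row for each (City, State) key (an arbitrary choice).
t2_unique = t2[~t2.index.duplicated(keep='first')]

# With a unique index on the updating frame, update can align the rows.
m.update(t2_unique)
print(m)
```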

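I'm considering using 'drop_duplicates' as a solution to get rid of the extra rows. Until I get a better set of 'main.txt' files, I'll have to work with what I've got. That would be done before assigning the index. – Leb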
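
The drop_duplicates idea could be combined with the per-file loop roughly as follows. This is only a sketch under stated assumptions: load_part is a hypothetical helper, the demo file and its values are made up and written to a temporary directory, and keeping the first occurrence of each city-state pair is an arbitrary choice.

```python
import os
import tempfile
import pandas as pd

def load_part(path):
    """Hypothetical helper: read one pipe-delimited file and return a
    LAT/LONG frame with a unique (CITY, STATE) index."""
    df = pd.read_csv(path, sep='|', encoding='utf-8')
    part = df[['PRIM_LAT_DEC', 'PRIM_LONG_DEC', 'FEATURE_NAME', 'STATE_ALPHA']].copy()
    part.columns = ['LAT', 'LONG', 'CITY', 'STATE']
    # Drop repeated (CITY, STATE) pairs before the index is assigned.
    part = part.drop_duplicates(subset=['CITY', 'STATE'])
    return part.set_index(['CITY', 'STATE'])

# Made-up demo file written to a temporary directory.
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, 'MO.txt'), 'w', encoding='utf-8') as fh:
    fh.write("FEATURE_NAME|STATE_ALPHA|PRIM_LAT_DEC|PRIM_LONG_DEC\n"
             "Columbia|MO|100|200\n"
             "Springfield|MO|500|600\n"
             "Springfield|MO|700|800\n")

# Stand-in for data.csv: known cities with missing coordinates.
data_df = pd.DataFrame({
    'CITY': ['Columbia', 'Springfield'],
    'STATE': ['MO', 'MO'],
    'LAT': [float('nan')] * 2,
    'LONG': [float('nan')] * 2,
}).set_index(['CITY', 'STATE'])

for name in sorted(os.listdir(data_dir)):
    path = os.path.join(data_dir, name)  # listdir returns bare file names
    if os.path.isfile(path):
        data_df.update(load_part(path))

print(data_df)
```

Processing one file at a time this way also avoids holding all 60 files in memory at once.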