I'm working with two large dataset files for a project. I managed to clean the files line by line. However, when I tried to apply the same logic to merge the two files on a common column, it failed. The problem is that the inner loop runs to completion first, and only then does the outer loop advance (I don't know why this happens). I first tried using numpy; I need to merge two large CSV files line by line in Python.
import numpy as np

buys = np.genfromtxt('buys_dtsep.dat', delimiter=",", dtype='str')
clicks = np.genfromtxt('clicks_dtsep.dat', delimiter=",", dtype='str')
f = open('combined.dat', 'w')
for s in clicks:
    for s2 in buys:
        # process data
But loading a file with 33,000,000 entries into an array is not feasible, because of both the memory limit and the time it would take to load the data into an array before processing it. So I'm trying to process the files line by line to avoid running out of memory.
import csv

buys = open('buys_dtsep.dat')
clicks = open('clicks_dtsep.dat')
f = open('combined.dat', 'w')
csv_buys = csv.reader(buys)
csv_clicks = csv.reader(clicks)
for s in csv_clicks:
    print 'file 1 row x'  # to check when it loops
    for s2 in csv_buys:
        print s2[0]  # check looped data
        # do merge op
The expected output is:
file 1 row 0
file 2 row 0
...
file 2 row x
file 1 row 1
and so on
The output I actually get is:
file 2 row 0
file 2 row 1
...
file 2 row x
file 1 row 0
...
file 1 row z
If the loop problem above can be solved, I'll be able to merge the files row by row.
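The output above is consistent with the inner reader being exhausted: `csv.reader` iterates over the underlying file object only once, so after the first pass through `csv_buys` the inner loop yields nothing on every later outer iteration. A minimal sketch of one workaround is to rewind the inner file with `seek(0)` and recreate its reader on each outer iteration (the in-memory `StringIO` data here is a made-up stand-in for the real files):

```python
import csv
import io

# Hypothetical small stand-ins for the two files from the question.
buys = io.StringIO("a,1\nb,2\n")
clicks = io.StringIO("x,9\ny,8\n")

pairs = []
for s in csv.reader(clicks):
    buys.seek(0)                  # rewind so the inner file starts over
    for s2 in csv.reader(buys):   # recreate the reader after rewinding
        pairs.append((s[0], s2[0]))

# pairs now holds every click row paired with every buy row.
```

Note that this still rescans the whole inner file once per outer row, which is expensive for files of this size; it only fixes the looping behaviour.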
Update: sample data
Sample from the buys file:
420374,2014-04-06,18:44:58.314,214537888,12462,1
420374,2014-04-06,18:44:58.325,214537850,10471,1
281626,2014-04-06,09:40:13.032,214535653,1883,1
420368,2014-04-04,06:13:28.848,214530572,6073,1
420368,2014-04-04,06:13:28.858,214835025,2617,1
140806,2014-04-07,09:22:28.132,214668193,523,1
140806,2014-04-07,09:22:28.176,214587399,1046,1
Sample from the clicks file:
420374,2014-04-06,18:44:58,214537888,0
420374,2014-04-06,18:41:50,214537888,0
420374,2014-04-06,18:42:33,214537850,0
420374,2014-04-06,18:42:38,214537850,0
420374,2014-04-06,18:43:02,214537888,0
420374,2014-04-06,18:43:10,214537888,0
420369,2014-04-07,19:39:43,214839373,0
420369,2014-04-07,19:39:56,214684513,0
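Given these sample rows, one alternative to the nested scan is a dict-based join: index the smaller file by its join key in one pass, then stream the larger file once. This is a sketch, not the questioner's code; it assumes the column layout seen in the samples (session id, date, time, item id, ...) and that one file's keys fit in memory:

```python
import csv
import io

# Hypothetical in-memory stand-ins for the two files, using rows
# from the question's samples.
buys_data = (
    "420374,2014-04-06,18:44:58.314,214537888,12462,1\n"
    "281626,2014-04-06,09:40:13.032,214535653,1883,1\n"
)
clicks_data = (
    "420374,2014-04-06,18:44:58,214537888,0\n"
    "420369,2014-04-07,19:39:43,214839373,0\n"
)

# Pass 1: index buys by (session_id, item_id).
buys_by_key = {}
for row in csv.reader(io.StringIO(buys_data)):
    buys_by_key.setdefault((row[0], row[3]), []).append(row)

# Pass 2: stream clicks once and emit a merged row for each match.
merged = []
for row in csv.reader(io.StringIO(clicks_data)):
    for buy in buys_by_key.get((row[0], row[3]), []):
        merged.append(row + buy[4:])  # append the remaining buy columns
```

The choice of join key here is an assumption; adjust the column indices to whatever the real common columns are.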
Is there a reason you can't use `pandas`? If you can, you could consider something like `read_csv`'s chunk parameter (http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk), for example. –
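The chunked-reading idea from this comment can be sketched as follows (a minimal illustration; the in-memory data and chunk size are made up, and in practice you would pass a file path):

```python
import io
import pandas as pd

# Hypothetical stand-in for a large clicks file.
clicks_data = "420374,2014-04-06,18:44:58,214537888,0\n" * 5

total_rows = 0
# chunksize makes read_csv yield fixed-size DataFrames instead of
# loading the whole file into memory at once.
for chunk in pd.read_csv(io.StringIO(clicks_data), header=None, chunksize=2):
    total_rows += len(chunk)  # each chunk is an ordinary DataFrame
```

Each chunk can then be merged against an indexed version of the other file before moving on, keeping peak memory bounded by the chunk size.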
Could you add some sample rows from both dat files? –