Best algorithm to compare huge data

I have a large dataset stored as a CSV file (334 MB). It looks like this:

month, output 
1,"['23482394','4358309','098903284'....(total 2.5 million entries)]" 
2,"['92438545','23482394','323103404'....(total 2.2 million entries)]"
3,"[...continue

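Now I need to work out what percentage of one month's output overlaps with the previous month's output.

For example, when I compare month 1 and month 2, I want a result like "Month2 output has 90% overlap against Month1", and then "Month3 has 88% overlap against Month2".

What is the best way to solve this in Python 3?

Source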

2017-08-08 K.K.

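Are the values unique, and are they always integers? –

334 MB is going to fit into the RAM of your average computer, so make sure not to overengineer this. Please clarify the following: Are these always integers? Does the "0" prefix matter? Are they unique? Is the order relevant? Please also add some code showing how you would compare two short, simple example strings in Python; that would make things much simpler. – reto

@IvanSivak The values are all unique, and they are always integers. –

You can use the intersection method to extract the elements common to two arrays or lists. The average-case complexity of a set intersection is O(min(len(a), len(b))).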
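As a minimal sketch of that idea on a tiny input, the snippet below uses the first three values of each month shown in the question. The helper name `overlap_percent` is hypothetical, and the values are kept as strings so that a leading zero such as in '098903284' is not lost.

```python
def overlap_percent(previous, current):
    """Percentage of the items in `current` that also appear in `previous`."""
    previous_set = set(previous)
    current_set = set(current)
    if not current_set:
        # Avoid dividing by zero when the current month is empty.
        return 0.0
    common = previous_set & current_set  # same as previous_set.intersection(current_set)
    return 100 * len(common) / len(current_set)


# Truncated sample values from the question, kept as strings.
month1 = ['23482394', '4358309', '098903284']
month2 = ['92438545', '23482394', '323103404']

print('Month 2 has {:.1f}% overlap against month 1'.format(
    overlap_percent(month1, month2)))
# prints: Month 2 has 33.3% overlap against month 1
```

Dividing by the size of the later month matches the "Month2 against Month1" wording in the question; divide by the earlier month instead if that is the figure you need.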

# generate random numpy array with unique elements 
import numpy as np 

month1 = np.random.choice(range(10**5, 10**7), size=25*10**5, replace=False) 
month2 = np.random.choice(range(10**5, 10**7), size=22*10**5, replace=False) 
month3 = np.random.choice(range(10**5, 10**7), size=21*10**5, replace=False) 

print('Month 1, 2, and 3 contains {}, {}, and {} elements respectively'.format(len(month1), len(month2), len(month3))) 

Month 1, 2, and 3 contains 2500000, 2200000, and 2100000 elements respectively 

# Compare month arrays for overlap 

import time 

startTime = time.time() 
# These are intersections (elements present in both months), not unions
common_m1m2 = set(month1).intersection(month2)
common_m2m3 = set(month2).intersection(month3)

print('Percent of elements in both month 1 & 2: {}%'.format(round(100*len(common_m1m2)/len(month2),2)))
print('Percent of elements in both month 2 & 3: {}%'.format(round(100*len(common_m2m3)/len(month3),2)))

print('Process time:{:.2f}s'.format(time.time()-startTime)) 

Percent of elements in both month 1 & 2: 25.3% 
Percent of elements in both month 2 & 3: 22.24% 
Process time:2.46s

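The arrays above are random, so the overlap is low. With your actual data you will probably see a much higher overlap between months.

Source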
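To apply the same comparison to the real file, something along these lines might work. It is only a sketch, assuming the CSV has a header row and that each `output` cell holds the text of a Python list, as in the sample in the question. The function names `read_month_sets` and `consecutive_overlaps` are hypothetical, and the small sample file written here contains made-up values so that the snippet runs on its own.

```python
import ast
import csv
import os
import tempfile


def read_month_sets(path):
    """Yield (month, set_of_ids) for each data row of the CSV."""
    # A single cell holds millions of IDs, far more than the csv module's
    # default field size limit allows, so raise the limit first.
    csv.field_size_limit(2**31 - 1)
    with open(path, newline='') as f:
        reader = csv.reader(f, skipinitialspace=True)
        next(reader)  # skip the "month, output" header
        for row in reader:
            if not row:
                continue  # ignore blank lines
            month, output = row
            # The cell is the text of a Python list literal; IDs stay strings.
            yield int(month), set(ast.literal_eval(output.strip()))


def consecutive_overlaps(path):
    """Return (month, previous_month, percent) for each pair of adjacent rows."""
    results = []
    previous = None
    for month, ids in read_month_sets(path):
        if previous is not None and ids:
            prev_month, prev_ids = previous
            percent = 100 * len(ids & prev_ids) / len(ids)
            results.append((month, prev_month, percent))
        previous = (month, ids)  # only two months are held in memory at once
    return results


# Made-up sample data in the same layout as the question's file.
SAMPLE = """month, output
1,"['11','22','33','044']"
2,"['22','33','55']"
3,"['55','66']"
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.csv')
    with open(path, 'w', newline='') as f:
        f.write(SAMPLE)
    results = consecutive_overlaps(path)

for month, prev_month, percent in results:
    print('Month {} has {:.2f}% overlap against month {}'.format(
        month, percent, prev_month))
```

Parsing a cell with millions of entries through `ast.literal_eval` can use a lot of memory for a short time. If that becomes a problem, splitting the cell text on commas and stripping the quotes by hand is a lighter alternative.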

2017-08-08 08:25:42 orugantn

Hi, is there any reason to use np.random.choice instead of random.sample? Thanks. –
