在python中，加速數據流的字數近似算法

這是我正在做的一項家庭作業。我有一個工作版本的代碼，但它目前需要大約1小時才能運行我們提供的文件。我將分享這些文件的示例，以及我的代碼（和高級描述），然後可以使用想法來說明爲什麼我的代碼運行得儘可能慢。下面的第一個文件是字文件，其中我近似的倍每個字（表示爲數字）的數量已經出現：在python中，加速數據流的字數近似算法

the_words.txt 
1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
13 
16 
17 
6 
18 
19 
20 
21 
22 
23 
24 
25 
6 
26 
27 
28 
29 
30 
9 
31 
32 
33 
34 
15 
35 
36 
37 
9 
38 
39 
11 
40 
13 
41 
42

第二文件包括參數在我的腳本中使用5個散列函數：

the_hashes.txt 
3 1561 
17 277 
38 394 
61 13 
78 246

這裏是我的代碼版本。在高層次上，我（1）做我的導入和設置變量，（2）創建一個散列函數，（3）遍歷the_words.txt中的單詞（這是一個int，我知道會混淆） 5個散列函數，並且在C矩陣中將相應索引中的值遞增1。我的代碼：

# imports 
import numpy as np 
import matplotlib.pyplot as plt 
import math 


# variables used throughout the program 
dlt = math.exp(-5) 
eps = math.exp(1) * math.pow(10, -4) 
my_p = 123457 

the_hashes = map(str.split, open('the_hashes.txt', 'r')) 
the_hashes = [[int(float(j)) for j in i] for i in the_hashes] 
end = len(the_hashes) 

rows = math.ceil(math.log(1/dlt)) 
cols = math.ceil(math.exp(1)/eps) 
C = np.zeros((rows,cols)) 


# Returns hash(x) for hash function 
# given by parameters a, b, p and n_buckets 
def hash_fun(a, b, p, n_buckets, x): 
    y = x % p 
    hash_val = (a*y + b) % p 
    output = hash_val % n_buckets 
    return(output) 


# read the file line by line, implementing the algorithm 
counter = 0 
with open("the_words.txt", "r") as file: 
    for word in file: 
     counter = counter + 1 
     my_x = int(word) 

     # loop over the 5 different pairs of (a,b) values for the hashes 
     for i in range(0,end): 
      my_a = the_hashes[i][0] 
      my_b = the_hashes[i][1] 

      my_output = hash_fun(my_a, my_b, my_p, cols, my_x) 
      C[i,my_output] += 1 

     if(counter % 10000 == 0): 
      print counter

但是，對於200M字的文件，這對我來說目前需要很長時間。有什麼顯而易見的是導致我的代碼運行緩慢？我知道可能需要一段時間才能流出200多萬字，但我希望從當前正在採用的小時數開始減少。

謝謝！

來源

2017-03-08 Canovice

將代碼分成幾塊，然後做時間測量測試（如[this]（http://stackoverflow.com/questions/7370801/measure-time-elapsed -in-python）），每個塊都有合理的單詞樣本。然後你可以找出哪個區塊花費最多的時間。 –

您是否嘗試在您的代碼上運行探查器，以瞭解大部分時間都花在哪裏？假設你調用'hash_fun'多少次，這可能值得在C中實現，但我會認爲這不是你在這裏練習的目的。一些想法：加載內存中的整個文件以避免循環中的磁盤訪問，內聯hash_fun保存函數調用，在pypy中運行代碼... –

其中一條指令是不將the_words.txt文件加載到內存中。最花時間的是有200M行代碼，但是我猜是看一次迭代/行輸入，看到最慢的是一個聰明的開始。 – Canovice

如果您不能加載在內存中的數據，也有一些地方，你可以內嵌和分解出：

my_range = range(0, end) # python 2 only, see note below 
with open("the_words.txt", "r") as file: 
    for word in file: 
     counter = counter + 1 
     y = int(word) % p # factor this out: save 160 million calculations 
     # loop over the 5 different pairs of (a,b) values for the hashes 
     for i in my_range: 
      my_a = the_hashes[i][0] 
      my_b = the_hashes[i][1] 

      # save a function call by inlining 
      # my_output = hash_fun(my_a, my_b, my_p, cols, my_x) 

      hash_val = (a*y + b) % p 
      my_output = hash_val % n_buckets 
      C[i,my_output] += 1 

     if(counter % 10000 == 0): 
      print counter

我也想看看hash_val = ...算算看你是否能分解出一些計算。

對於range(0, end)取決於您使用的是哪個版本的Python，您可能需要緩存該調用。見https://stackoverflow.com/a/40294986/1138710）。（我懷疑你的打印語句中的python 2）。

此外，我建議您閱讀Python performance characteristics以獲得一些有趣的方式來提高性能，或至少更好地理解您正在做的事情。

以上只是猜測。查看How can you profile a script?瞭解如何分析您的代碼並確定瓶頸在哪裏。

我的另一個猜測，因爲你使用numpy，將依靠它的矩陣計算函數，我認爲這將會得到更好的優化。 (a*y + b) % p看起來像不錯的向量數學對我:)

來源

2017-03-08 17:16:35

這絕對是非常棒的，所有真正直觀的事情都需要調整。我會看看你發佈的消息來源。 – Canovice

在python中，加速數據流的字數近似算法

回答

相關問題