2013-09-05 83 views
1

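Re-read a CSV file in Python without loading it again

I wrote the code below, but I would like to improve it. I don't want to re-read the file, but if I remove sales_input.seek(0), it does not iterate through every line in sales. How can I improve this?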

def computeCritics(mode, cleaned_sales_input = "data/cleaned_sales.csv"): 
    if mode == 1: 
     print "creating customer.critics.recommendations" 
     critics_output = open("data/customer/customer.critics.recommendations", 
           "wb") 
     ID = getCustomerSet(cleaned_sales_input) 
     sales_dict = pickle.load(open("data/customer/books.dict.recommendations", 
             "r")) 
    else: 
     print "creating books.critics.recommendations" 
     critics_output = open("data/books/books.critics.recommendations", 
           "wb") 
     ID = getBookSet(cleaned_sales_input) 
     sales_dict = pickle.load(open("data/books/users.dict.recommendations", 
             "r")) 
    critics = {} 
    # make critics dict and pickle it 
    for i in ID: 
     with open(cleaned_sales_input, 'rb') as sales_input: 
      sales = csv.reader(sales_input) # read new 
      for j in sales: 
       if mode == 1: 
        if int(i) == int(j[2]): 
         sales_dict[int(j[6])] = 1 
       else: 
        if int(i) == int(j[6]): 
         sales_dict[int(j[2])] = 1 
      critics[int(i)] = sales_dict 
    pickle.dump(critics, critics_output) 
    print "done" 

cleaned_sales_input looks like

6042772,2723,3546414,9782072488887,1,9.99,314968 
6042769,2723,3546414,9782072488887,1,9.99,314968 
... 

where number 6 is the book and number 0 is the customer ID.

I would like to get a dictionary which looks like

critics = { 
    CustomerID1: { 
     BookID1: 1, 
     BookID2: 0, 
     ........ 
     BookIDX: 0 
    }, 
    CustomerID2: { 
     BookID1: 0, 
     BookID2: 1, 
     ... 
    } 
} 

critics = { 
    BookID1: { 
     CustomerID1: 1, 
     CustomerID2: 0, 
     ........ 
     CustomerIDX: 0 
    }, 
    BookID2: {
     CustomerID1: 0, 
     CustomerID2: 1, 
     ... 
     CustomerIDX: 0 
    } 
} 

I hope this isn't too much information.

+0

Have you profiled this to see whether the CSV reading is the bottleneck? – RickyA

+0

Sorry, what is profiling? I have never heard of it. –

+0

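A [profiler](http://docs.python.org/2/library/profile.html) shows how much time each part of the code takes. You can use it to identify the bottlenecks in your code. Optimizing before profiling is almost useless, because you don't know what the bottleneck is. So maybe the file reading isn't the bottleneck here. – RickyA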
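As a rough illustration of the profiling suggested in this comment, here is a minimal sketch using the standard library's cProfile and pstats modules. It is written for Python 3, and slow_sum and profile_call are hypothetical names standing in for the code under investigation; they are not part of the question.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical stand-in for the function you want to investigate."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep only the top five entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


result, report = profile_call(slow_sum, 100000)
print(report)
```

The report lists each function with its call count and the time spent in it, which should show whether the CSV reading or something else dominates the run time.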

Answers

2

Here are some suggestions:

Let's first look at this code pattern:

for i in ID:
    for j in sales:
        if int(i) == int(j[2]):

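Notice that i is only ever compared with j[2]. That is its sole purpose in the loop. For any given row j, int(i) == int(j[2]) can be true for at most one i in ID.

So we can remove the for i in ID loop entirely by rewriting it as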

for j in sales: 
    key = j[2] 
    if key in ID: 

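Based on the function names getCustomerSet and getBookSet, it sounds as if ID is a set (rather than a list or tuple). We want ID to be a set, because testing membership in a set is O(1) on average (as opposed to O(n) for a list or tuple).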
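To make the loop rewrite concrete, the following sketch compares the nested pattern with the single-pass pattern on a tiny in-memory CSV. It is written for Python 3 and uses io.StringIO in place of the real file. The first two rows come from the question, the third row is made up, and nested_matches and single_pass_matches are hypothetical helper names.

```python
import csv
import io

# Two rows copied from the question plus one made-up row (hypothetical data).
SAMPLE = (
    "6042772,2723,3546414,9782072488887,1,9.99,314968\n"
    "6042769,2723,3546414,9782072488887,1,9.99,314968\n"
    "6042770,2723,1111111,9782072488887,1,9.99,222222\n"
)


def nested_matches(ids, text):
    """Original pattern: parse the whole CSV again for every ID."""
    hits = []
    for i in ids:
        for row in csv.reader(io.StringIO(text)):
            if int(i) == int(row[2]):
                hits.append((int(row[2]), int(row[6])))
    return hits


def single_pass_matches(ids, text):
    """Rewritten pattern: parse the CSV once and test set membership."""
    id_set = {int(i) for i in ids}
    hits = []
    for row in csv.reader(io.StringIO(text)):
        key = int(row[2])
        if key in id_set:
            hits.append((key, int(row[6])))
    return hits


ids = ["3546414", "1111111"]
nested = nested_matches(ids, SAMPLE)
single = single_pass_matches(ids, SAMPLE)
print(sorted(nested) == sorted(single))
```

Both functions find the same matches, but the nested version parses the data once per ID, while the single-pass version parses it once in total.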


Next, consider this line:

critics[int(i)] = sales_dict 

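There is a potential pitfall here. This line assigns sales_dict to critics[int(i)] for every i in ID. Every key int(i) is mapped to the very same dict. As we loop over sales and ID, we modify sales_dict, for example: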

sales_dict[int(j[6])] = 1 

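But this causes all the values in critics to be modified at the same time, because all the keys in critics point to the same dict, sales_dict. I doubt that is what you want.

To avoid this pitfall, we need to make copies of sales_dict: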

critics = {i:sales_dict.copy() for i in ID} 
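The difference between sharing one dict and copying it can be seen in a small sketch. The keys and values here are hypothetical stand-ins for the real sales_dict, which is assumed to be a flat dict of integers, so the shallow copy made by dict.copy() is enough.

```python
ids = [1, 2]

# Every key refers to the same dict object, as in the question's code.
template = {"book_a": 0, "book_b": 0}  # hypothetical stand-in for sales_dict
shared = {i: template for i in ids}
shared[1]["book_a"] = 1
print(shared[2]["book_a"])  # the change is visible under key 2 as well

# Every key gets its own shallow copy, so updates stay separate.
template2 = {"book_a": 0, "book_b": 0}
independent = {i: template2.copy() for i in ids}
independent[1]["book_a"] = 1
print(independent[2]["book_a"])  # key 2 is unaffected
```

If the values were themselves mutable containers, copy.deepcopy would be needed instead of dict.copy().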

Putting these changes together:

import csv
import os
import pickle

def computeCritics(mode, cleaned_sales_input="data/cleaned_sales.csv"):
    if mode == 1:
        filename = 'customer.critics.recommendations'
        path = os.path.join("data/customer", filename)
        ID = getCustomerSet(cleaned_sales_input)
        sales_dict = pickle.load(
            open("data/customer/books.dict.recommendations", "rb"))
        key_idx, other_idx = 2, 6
    else:
        filename = 'books.critics.recommendations'
        path = os.path.join("data/books", filename)
        ID = getBookSet(cleaned_sales_input)
        sales_dict = pickle.load(
            open("data/books/users.dict.recommendations", "rb"))
        key_idx, other_idx = 6, 2

    print "creating {}".format(filename)
    ID = {int(item) for item in ID}
    # give every ID its own copy of the template dict
    critics = {i: sales_dict.copy() for i in ID}
    with open(path, "wb") as critics_output:
        # read the sales file once, then pickle the critics dict
        with open(cleaned_sales_input, 'rb') as sales_input:
            sales = csv.reader(sales_input)
            for j in sales:
                key = int(j[key_idx])
                if key in ID:
                    other_key = int(j[other_idx])
                    critics[key][other_key] = 1
        pickle.dump(critics, critics_output)
    print "done"
+0

Sorry, I didn't add this, but I want the dictionary to look like c = {ID1 {book: 1, book: 0 ........ book: 0}, ID2 .....}, so do I have to do it this way, or am I just stuck? –

+2

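There is no obvious relationship between your code and the dictionary. You need to fill in more details, such as what 'ID' is and what a sample of 'cleaned_sales_input' looks like, before your question can be answered. – unutbu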

+0

I added more information. I hope it isn't too much ^^ –

0

@unutbu's answer is better, but if you insist on this structure you can read the whole file into memory:

import csv

with open(cleaned_sales_input, 'rb') as sales_input:
    # read every row into a list once, so it can be iterated repeatedly
    sales = list(csv.reader(sales_input))

for i in ID:
    for j in sales:
        pass  # do stuff
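The reason a list helps is that a csv.reader is a one-shot iterator over the file: once the file has been consumed, a second loop yields nothing unless the file is rewound. The sketch below shows this in Python 3 with an in-memory io.StringIO standing in for the real file; the two rows are made-up sample data.

```python
import csv
import io

data = io.StringIO("1,a\n2,b\n")  # in-memory stand-in for the sales file

reader = csv.reader(data)
first_pass = list(reader)   # consumes the whole file
second_pass = list(reader)  # nothing left to read, so this is empty

data.seek(0)                             # rewind the underlying file
third_pass = list(csv.reader(data))      # a fresh pass sees all rows again

print(first_pass)
print(second_pass)
print(third_pass)
```

Keeping first_pass around as a list gives the same effect as rewinding, without touching the file again, at the cost of holding every row in memory.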