Python "Killed: 9" when running code that uses dictionaries created from 2 CSV files

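I am running code that has always worked for me. This time I ran it on two .csv files: "data" (24 MB) and "data1" (475 MB). "data" has 3 columns of about 680,000 elements each, while "data1" has 3 columns of 33,000,000 elements each. When I run the code, I get "Killed: 9" after about 5 minutes of processing. If this is a memory problem, how can I solve it? Any suggestions are welcome!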

Here is the code:

import csv
import numpy as np

from collections import OrderedDict  # to save keys order

from numpy import genfromtxt
my_data = genfromtxt('data.csv', dtype='S',
                     delimiter=',', skip_header=1)
my_data1 = genfromtxt('data1.csv', dtype='S',
                      delimiter=',', skip_header=1)

d = OrderedDict((rows[2], rows[1]) for rows in my_data)
d1 = dict((rows[0], rows[1]) for rows in my_data1)

dset = set(d)  # returns keys
d1set = set(d1)

d_match = dset.intersection(d1)  # returns matched keys

import sys
sys.stdout = open("rs_pos_ref_alt.csv", "w")

for row in my_data:
    if row[2] in d_match:
        print [row[1], row[2]]

The head of "data" is:

   dbSNP RS ID  Physical Position
0  rs4147951    66943738
1  rs2022235    14326088
2  rs6425720    31709555
3  rs12997193   106584554
4  rs9933410    82323721
5  rs7142489    35532970

The head of "data1" is:

V2     V4  V5
10468  TC  T
10491  CC  C
10518  TG  T
10532  AG  A
10582  TG  T

Source: 2015-12-14, Lucas

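Are you running it on your own computer or on some server? If it runs on a server, there may be a script that monitors "rogue" processes and calls 'kill -9' on them after a while.

Hi @tobias_k, I am running it on my own laptop – Lucas

How do you get the "Killed: 9": on standard output, or in an exception message?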

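Most likely the kernel killed it. You need to take a different approach and minimize the size of the data held in memory.

You may also find this question useful: Very large matrices using Python and NumPy

In the snippet below I try to avoid loading the huge data1.csv into memory by processing it line by line. Give it a try.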

import csv

from collections import OrderedDict  # to save keys order

with open('data.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip header
    d = OrderedDict((rows[2], {"val": rows[1], "flag": False}) for rows in reader)

with open('data1.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip header
    for rows in reader:
        if rows[0] in d:
            d[rows[0]]["flag"] = True

import sys
sys.stdout = open("rs_pos_ref_alt.csv", "w")

for k, v in d.iteritems():
    if v["flag"]:
        print [v["val"], k]

Source: 2015-12-14 15:31:27, frizz

First, create a Python script and run the following code to find all Python processes.

import subprocess

# wmic is a Windows-only tool; this queries the running python.exe / pythonw.exe processes
wmic_cmd = """wmic process where "name='python.exe' or name='pythonw.exe'" get commandline,processid"""
wmic_prc = subprocess.Popen(wmic_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
wmic_out, wmic_err = wmic_prc.communicate()
# Split each output line into [command line, pid], skipping the header row
pythons = [item.rsplit(None, 1) for item in wmic_out.splitlines() if item][1:]
pythons = [[cmdline, int(pid)] for [cmdline, pid] in pythons]
for line in pythons:
    cv = str(line).split('\\')
    cb = str(cv).strip('"')
    fin = cv[-1]
    if fin[0:11] != 'pythonw.exe':
        print 'pythonw.exe', fin
    if fin[0:11] != 'python.exe':
        print "'python.exe'", fin

After running it, paste the output here in the question section; I will see the notification.

*Edit

Because your script is consuming too much memory, list all running processes with the following commands and post the output:

import psutil 
for process in psutil.process_iter(): 
    print process

Source: 2015-12-14 14:35:14, ajsp

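Thanks @ajsp, I ran your code but I don't see any output, at least not in the terminal. The script seems to run fine though – Lucas

Some process somewhere is issuing kill -9; if you find it, you have found your culprit. Have you been working on any other code recently? Did you write something in a script somewhere that kills a PID? – ajsp

See the edit and get back to me – ajsp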

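How much memory does your computer have?

You can add a few optimizations to save some memory, and if that is not enough, you can trade some CPU and I/O for better memory efficiency.

If you only compare the keys and do nothing with the values, you can extract just the keys: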

d1 = set([rows[0] for rows in my_data1])
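
The one-liner above still needs my_data1 to be loaded first. A minimal sketch of building the same key set directly from the file, assuming Python 3 and that the keys are in column 0 of data1.csv as in the question; the helper name load_keys is hypothetical:

```python
import csv

def load_keys(path, column=0):
    """Return the set of values in one column of a CSV file, skipping the header row."""
    with open(path, newline="") as csvfile:
        reader = csv.reader(csvfile, delimiter=",")
        next(reader, None)  # skip header
        # Rows are consumed one at a time; only the key strings are kept.
        return {row[column] for row in reader if len(row) > column}
```

Usage would look like d1 = load_keys('data1.csv'). The set of 33 million keys may still be large, but the other two columns and the intermediate array are never held in memory.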

Then, instead of OrderedDict, you can try an ordered set, either from this answer - Does python has ordered set - or by using the ordered-set module from PyPI.
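
As a possible stand-in that needs no extra package, and assuming Python 3.7 or later (where plain dicts preserve insertion order), dict keys can act as an ordered set. This is a swapped-in technique, not the recipe from the linked answer or the ordered-set package, and both function names are hypothetical:

```python
def ordered_unique(items):
    """Return the distinct items in first-seen order, using dict keys as an ordered set."""
    return list(dict.fromkeys(items))

def ordered_intersection(ordered_items, other):
    """Keep the items of ordered_items that also occur in other, preserving their order."""
    other = set(other)
    return [item for item in dict.fromkeys(ordered_items) if item in other]
```

For example, ordered_intersection(keys_from_data, d1) would give the matched keys in the order they appear in data.csv.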

Once you have all the intersecting keys, you can write another program that looks up all the matching values in the source csv.
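
A rough sketch of that second pass, assuming Python 3 and the column layout from the question (key in column 2 and value in column 1 of data.csv). The function name write_matches and its parameters are hypothetical:

```python
import csv

def write_matches(src_path, out_path, matched_keys, key_col=2, value_col=1):
    """Stream src_path and write [value, key] for every row whose key is in matched_keys.

    Returns the number of rows written.
    """
    written = 0
    with open(src_path, newline="") as src, open(out_path, "w", newline="") as out:
        reader = csv.reader(src)
        writer = csv.writer(out)
        next(reader, None)  # skip header
        for row in reader:
            if len(row) > max(key_col, value_col) and row[key_col] in matched_keys:
                writer.writerow([row[value_col], row[key_col]])
                written += 1
    return written
```

Writing through csv.writer instead of redirecting sys.stdout yields plain CSV rows rather than printed Python lists.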

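If these optimizations are not enough, you can extract all the keys from the larger set, save them to a file, and then use generators to load the keys from that file one at a time, so that the program holds only one set of keys plus a single key in memory instead of two sets of keys.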
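
The idea can be sketched as follows, assuming Python 3 and a plain text key file with one key per line. The names dump_keys, iter_keys and matched_keys are hypothetical:

```python
import csv

def dump_keys(csv_path, keys_path, column=0):
    """Stream a CSV file and write one key per line, never holding all keys in memory."""
    with open(csv_path, newline="") as src, open(keys_path, "w") as out:
        reader = csv.reader(src)
        next(reader, None)  # skip header
        for row in reader:
            if len(row) > column:
                out.write(row[column] + "\n")

def iter_keys(keys_path):
    """Generator: yield keys one at a time from the key file."""
    with open(keys_path) as keyfile:
        for line in keyfile:
            key = line.strip()
            if key:
                yield key

def matched_keys(small_keys, keys_path):
    """Return the keys of the smaller set that also appear in the key file."""
    small_keys = set(small_keys)
    return {key for key in iter_keys(keys_path) if key in small_keys}
```

Only the smaller key set and the key currently being read are in memory at any time; the result can never be larger than the smaller set.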

I would also suggest using the Python pickle module to store intermediate results.
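
A minimal sketch of saving and reloading an intermediate result such as the matched key set with pickle, assuming Python 3. The helper names and the file name in the usage note are hypothetical:

```python
import pickle

def save_result(obj, path):
    """Serialize an intermediate result (e.g. a set of matched keys) to disk."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)

def load_result(path):
    """Load a previously saved result. Only unpickle files you created yourself."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

For example, save_result(d_match, 'd_match.pkl') after the intersection step lets a later run start from load_result('d_match.pkl') instead of recomputing it.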

Source: 2015-12-14 15:45:36
