2016-05-02

Problem description: Python: reading and processing multiple gzip files on a remote server

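I have multiple (1000+) *.gz files on a remote server. I have to read these files and check them for certain strings. If a string matches, I have to return the file name. I have tried the code below. The program works, but it does not seem efficient because of the heavy I/O involved. Could you suggest a more efficient way to do this?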

My code:

import gzip 
import os 
import paramiko 
import multiprocessing 
import time # Needed for time.sleep() in assign_to_proc(). 
from bisect import insort 
synchObj=multiprocessing.Manager() 
hostname = '192.168.1.2' 
port = 22 
username='may' 
password='Apa$sW0rd' 

def miniAnalyze(): 
    ifile_list=synchObj.list([]) # A synchronized list to Store the File names containing the matched String. 

    def analyze_the_file(file_single): 
     strings = ("error 72","error 81",) # Hard Coded the Strings that needs to be searched. 
     try: 
      ssh=paramiko.SSHClient() 
      #Code to FTP the file to local system from the remote machine. 
      ..... 
      ........ 
      path_f='/home/user/may/'+filename 

      #Read the Gzip file in local system after FTP is done 

      with gzip.open(path_f, 'rb') as f: 
       contents = f.read() 
      if any(s in contents for s in strings): 
       print "File " + str(path_f) + " is a hit." 
       insort(ifile_list, filename) # Push the file into the list if there is a match. 
      os.remove(path_f) # The local copy is no longer needed, hit or not. 
     except Exception, ae: 
      print "Error while Analyzing file "+ str(ae) 

     finally: 
      if ifile_list: 
       # Convert the shared list to text before concatenating it with a string. 
       print "The Error is at " + str(list(ifile_list)) 
      ftp.close() 
      ssh.close() 


    def assign_to_proc(): 
     # Code to glob files matching a pattern and pass to another function via multiprocess . 
     apath = '/home/remotemachine/log/' 
     apattern = '"*.gz"' 
     first_command = 'find {path} -name {pattern}' 
     command = first_command.format(path=apath, pattern=apattern) 

     try: 
      ssh=paramiko.SSHClient() 
      ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy()) 
      ssh.connect(hostname,username=username,password=password) 
      stdin, stdout, stderr = ssh.exec_command(command) 
      while not stdout.channel.exit_status_ready(): 
       time.sleep(2) 
      filelist = stdout.read().splitlines() 

      jobs = [] 

      for ifle in filelist: 
       p = multiprocessing.Process(target=analyze_the_file,args=(ifle,)) 
       jobs.append(p) 
       p.start() 

      for job in jobs: 
       job.join() 


     except Exception, fe: 
      print "Error while getting file names "+ str(fe) 

     finally: 
      ssh.close() 

    assign_to_proc() # Start the search; otherwise miniAnalyze() only defines the two helpers. 


if __name__ == '__main__': 
    miniAnalyze() 

The code above is slow. Copying the .gz files to the local system involves a lot of I/O. Please help me find a better way to do this.

Answer

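Run an OS command on the remote machine (such as zgrep) and process only the command's output locally. That way you do not have to transfer the entire contents of every file to the local machine.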
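A minimal sketch of that idea, written for Python 3 and reusing the paramiko calls from the question. The helper names (`build_zgrep_command`, `parse_matches`, `find_matching_files`) are made up for this illustration. It assumes the remote host has `find` and a `zgrep` that forwards `-l`, `-F` and `-e` to grep, as the one shipped with GNU gzip does; treat it as a starting point rather than a tested solution for your server.

```python
import shlex


def build_zgrep_command(directory, name_pattern, needles):
    """Build a shell command that prints the names of matching .gz files."""
    # -l prints only file names; -F treats each needle as a fixed string.
    pattern_args = " ".join("-e " + shlex.quote(n) for n in needles)
    return "find {d} -name {p} -type f -exec zgrep -l -F {e} {{}} +".format(
        d=shlex.quote(directory), p=shlex.quote(name_pattern), e=pattern_args)


def parse_matches(raw_output):
    """Turn the remote stdout (bytes) into a sorted list of file paths."""
    lines = raw_output.decode("utf-8", "replace").splitlines()
    return sorted(line.strip() for line in lines if line.strip())


def find_matching_files(hostname, username, password, directory, needles,
                        port=22):
    """Run zgrep on the remote host and return the files that matched."""
    # paramiko is third-party; it is imported here so that the two helpers
    # above can be used (and tested) without an SSH library installed.
    import paramiko

    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        ssh.connect(hostname, port=port, username=username, password=password)
        command = build_zgrep_command(directory, "*.gz", needles)
        _stdin, stdout, _stderr = ssh.exec_command(command)
        # read() blocks until the remote command has finished writing.
        return parse_matches(stdout.read())
    finally:
        ssh.close()
```

Called with the directory and search strings from the question, for example `find_matching_files(hostname, username, password, '/home/remotemachine/log/', ("error 72", "error 81"))`, only the list of matching file names crosses the network. Decompression and searching happen on the server, so no local copies, per-file processes, or cleanup are needed.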
