Luigi batch module used in straight batch tasks

I have 500 links to download and want to process them in batches of, for example, 10 items.

What would the pseudocode for this look like?

import luigi
import requests

class BatchJobTask(luigi.Task):
    items = luigi.IntParameter()  # batch size, e.g. 10

    def run(self):
        list_urls = []
        with open('urls_chunk', 'r') as urls:
            for line in urls:
                # strip the trailing newline before building the URL
                list_urls.append('http://ggg' + line.strip() + '.org')
        batch_urls = list_urls[0:self.items]  # 10 items here
        for url in batch_urls:
            req = requests.get(url)
            req.content  # downloaded body, not stored anywhere yet

    def output(self):
        return luigi.LocalTarget("downloaded_filelist.txt")

class BatchWorker(luigi.Task):
    def run(self):
        # Here I should run BatchJobTask from 0 to 10, next 11 - 21 new etc...
        pass

How should this be done?


2017-10-28 GarfieldCat

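Where is your list of URLs? – MattMcKnight

I have updated the first post. – GarfieldCat

I mean, where is this list of URLs stored? In a queue, a database, a file? What you need to do is figure out how many items there are, and then build your chunks from there. I'll give an example below, but since you haven't specified that part of the problem, it may well not match your situation. – MattMcKnight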
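The chunking step described in the comment above can be sketched in plain Python. This is only a minimal illustration, assuming the 500 links are already loaded into a list; the `chunked` helper and the placeholder link names are hypothetical and are not part of Luigi.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical stand-in for the 500 links from the question
links = ["link{}".format(i) for i in range(500)]

batches = list(chunked(links, 10))
print(len(batches))    # number of batches
print(batches[0][:3])  # first few links of the first batch
```

Each batch could then be handed to one task, so the number of tasks follows from the number of items and the batch size.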

Here is one way to do something like what you want, with the list of strings stored as separate lines in a file.

import luigi
import requests

BATCH_SIZE = 10


class BatchProcessor(luigi.Task):
    items = luigi.ListParameter()
    max = luigi.IntParameter()  # index of the last line in this batch

    def output(self):
        return luigi.LocalTarget('processed' + str(self.max) + '.txt')

    def run(self):
        for item in self.items:
            req = requests.get('http://www.' + item + '.org')
            # do something useful here
            req.content
        # Create the empty marker file that tells Luigi this batch is done
        with self.output().open('w'):
            pass


class BatchCreator(luigi.Task):
    file_with_urls = luigi.Parameter()

    def requires(self):
        required_tasks = []
        lines = []
        total_index = 0
        with open(self.file_with_urls) as f:
            for line in f:
                total_index += 1
                lines.append(line.strip())
                if len(lines) == BATCH_SIZE:
                    required_tasks.append(
                        BatchProcessor(items=lines, max=total_index))
                    lines = []
        # Schedule the final, possibly shorter, batch
        if lines:
            required_tasks.append(BatchProcessor(items=lines, max=total_index))
        return required_tasks

    def output(self):
        return luigi.LocalTarget(str(self.file_with_urls) + 'processed')

    def run(self):
        # Empty marker file: every batch has been processed
        with self.output().open('w'):
            pass


2017-10-30 14:03:06 MattMcKnight

I got it working.

import luigi
import requests


class GetListTask(luigi.Task):
    def run(self):
        ...

    def output(self):
        return luigi.LocalTarget(self.outputfile)


class GetJustOneFile(luigi.Task):
    fid = luigi.IntParameter()

    def run(self):
        url = 'http://my-server.com/test' + str(self.fid) + '.txt'
        response = requests.get(url)
        # write the decoded text rather than the repr of a bytes object
        with self.output().open('w') as downloaded_file:
            downloaded_file.write(response.text)

    def output(self):
        return luigi.LocalTarget("test{}.txt".format(self.fid))


class GetAllFiles(luigi.WrapperTask):
    def requires(self):
        # one download task per file id; range(899) covers ids 0..898
        return [GetJustOneFile(fid=fileid) for fileid in range(899)]

Is this code terrible?


2017-10-30 19:34:35 GarfieldCat

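Well, it won't do batching, but it should work. – MattMcKnight

How can GetAllFiles take its input from the file produced by GetListTask instead of a predefined list? – GarfieldCat

That is what I showed in the `requires` method of the BatchCreator task, assuming you have a file in which each line is the varying component of the URL. – MattMcKnight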
