
I want to pass arguments into the spider to use as the url, i.e. run Scrapy with arguments. For example:

scrapy crawl test -a url="https://example.com" 

After that, I want to take the start_urls and automatically convert them into domain_allowed. For example:

domain_allowed = ['example.com'] 

After that, I want to pass just that word to the MySQL pipeline, where a table is created from domain_allowed using only that single word.
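For the URL-to-domain step, a minimal sketch using the standard library (urlparse is the Python 2 module name, matching the Python 2 code below; in Python 3 it lives in urllib.parse):

    from urlparse import urlparse  # Python 2; use urllib.parse in Python 3

    url = "https://example.com"
    domain_allowed = [urlparse(url).netloc]  # ['example.com']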

This is what I have right now:

class SeekerSpider(BaseSpider): 
    name = 'seeker' 

    def __init__(self, *args, **kwargs): 
        urls = kwargs.pop('urls', []) 
        if urls: 
            self.start_urls = urls.split(',') 
        self.logger.info(self.start_urls) 

        # take the arg "urls" and convert it to allowed_domains 
        url = "".join(urls) 
        self.allowed_domains = [url.split('/')[-1]] 

        super(SeekerSpider, self).__init__(*args, **kwargs) 

    # I have to use "domain" here, not inside parse_page or __init__ 
    domain = domain_allowed.replace(".", "_")  # <-- this is what fails 
    # create a folder with the domain name 

    def parse_page(self, response): 
        ... 

Basically I need to use self.allowed_domains outside of the functions... that's my problem... the variable isn't picked up there.
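One way around this (a sketch, not the original poster's code): class-level statements run when the class is defined, before __init__ has created self.allowed_domains, which is why the assignment above fails. Computing everything in __init__ and storing it on self avoids that, assuming the urls argument is always supplied:

    import os
    import scrapy

    class SeekerSpider(scrapy.Spider):  # scrapy.Spider is the modern BaseSpider
        name = 'seeker'

        def __init__(self, *args, **kwargs):
            urls = kwargs.pop('urls', '')
            super(SeekerSpider, self).__init__(*args, **kwargs)
            self.start_urls = urls.split(',') if urls else []
            # derive allowed_domains from the urls argument
            self.allowed_domains = [u.split('/')[-1] for u in self.start_urls]
            # everything derived from allowed_domains is computed here, once
            self.domain = self.allowed_domains[0].replace('.', '_')
            if not os.path.isdir(self.domain):
                os.mkdir(self.domain)  # create the folder named after the domain

        def parse_page(self, response):
            # self.domain is visible from any method
            self.logger.info(self.domain)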

Here is part of my pipelines.py:

import datetime 
import pymysql 

class MySQLPipeline(object): 
    def __init__(self, *args, **kwargs): 
        self.connect = pymysql.connect(...) 
        self.cursor = self.connect.cursor() 
        # print "Input the name of the table: "  <-- commented out 
        # tablename = raw_input(" ")             <-- commented out 
        date = datetime.datetime.now().strftime("%y_%m_%d_%H_%M") 
        self.tablename = kwargs.pop('tbl', '') 
        self.newname = self.tablename + "_" + date 
        print self.newname 
        # create a different way to build the table name: 
        # import the "allowed_domain", strip it, 
        # and use it as the table name 

In the pipeline I've done it this way... but it's not good... I want to take the allowed_domain from the spider, pass it here, and break it apart to get only the bare domain name, without the .com/.whatever part.
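What that splitting could look like, assuming a single-entry allowed_domains like ['example.com'] (a naive split on dots; multi-part suffixes such as .co.uk would need a library like tldextract):

    import datetime

    domain = 'example.com'                # taken from spider.allowed_domains[0]
    basename = domain.split('.')[0]       # 'example' -- the .com part is dropped
    date = datetime.datetime.now().strftime("%y_%m_%d_%H_%M")
    tablename = basename + "_" + date     # e.g. 'example_17_09_28_12_00'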

Thanks in advance.

Answer


Sorry about the formatting, I'm on my phone...

I would use the spider object in the process_item function:

    def process_item(self, item, spider):
        spider.allowed_domains.replace('.', '_')
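Spelled out (and noting that allowed_domains is a list, so it needs indexing before replace), that suggestion would look roughly like this:

    class MySQLPipeline(object):
        def process_item(self, item, spider):
            # allowed_domains is a list, so take the first entry
            table = spider.allowed_domains[0].replace('.', '_')
            # ... use `table` for the INSERT ...
            return item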


Yes, but it goes through that on every single request... so basically it would create it 100 times.... I need to do it outside... what if I do **def __init__(self, spider)** instead... actually, I already tried that and it doesn't work... I can't do spider.allowed_domain... it doesn't exist – Omega
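One note that may help here: Scrapy pipelines have an open_spider(self, spider) hook that runs exactly once when the spider opens, so the table name can be built there without repeating the work on every item. A minimal sketch (the CREATE TABLE line is only indicative):

    import datetime

    class MySQLPipeline(object):
        def open_spider(self, spider):
            # runs once per spider, not once per item
            date = datetime.datetime.now().strftime("%y_%m_%d_%H_%M")
            basename = spider.allowed_domains[0].split('.')[0]
            self.tablename = basename + "_" + date
            # create the table here, a single time, e.g.:
            # self.cursor.execute("CREATE TABLE IF NOT EXISTS ...")

        def process_item(self, item, spider):
            # insert the item into self.tablename
            return item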