2017-09-26 41 views
1

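Scrapy - unexpected suffix "%0A" in links

I am collecting email addresses from websites. I have a simple Scrapy crawler that reads a .txt file containing domains and then crawls them to find email addresses.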

Unfortunately, Scrapy is appending the suffix "%0A" to the links. You can see it in the log output.

Here is my code:

import scrapy


class EmailsearcherSpider(scrapy.Spider):
    name = 'emailsearcher'
    allowed_domains = []
    start_urls = []
    unique_data = set()

    def __init__(self):
        for line in open('/home/*****/domains', 'r').readlines():
            self.allowed_domains.append(line)
            self.start_urls.append('http://{}'.format(line))

    def parse(self, response):
        # Match email-like strings anywhere in the page body.
        emails = response.xpath('//body').re(r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)')
        for email in emails:
            print(email)
            print('\n')
            if email and (email not in self.unique_data):
                self.unique_data.add(email)
                yield {'emails': email}

domains.txt:

link4.pl/kontakt 
danone.pl/Kontakt 
axadirect.pl/kontakt/dane-axa-direct.html 
andrzejtucholski.pl/kontakt 
premier.gov.pl/kontakt.html 

Here is the console log:

2017-09-26 22:27:02 [scrapy.core.engine] INFO: Spider opened 
2017-09-26 22:27:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-09-26 22:27:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026 
2017-09-26 22:27:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.premier.gov.pl/kontakt.html> from <GET http://premier.gov.pl/kontakt.html> 
2017-09-26 22:27:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://andrzejtucholski.pl/kontakt> from <GET http://andrzejtucholski.pl/kontakt%0A> 
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://axadirect.pl/kontakt/dane-axa-direct.html%0A> from <GET http://axadirect.pl/kontakt/dane-axa-direct.html%0A> 
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.link4.pl/kontakt> from <GET http://link4.pl/kontakt%0A> 
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://danone.pl/Kontakt%0a> from <GET http://danone.pl/Kontakt%0A> 
+0

Don't use `readlines()`; a file is already an iterator that lets you read one line at a time. – chepner

+0

@chepner Thanks! – dzierzak
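The comment above can be illustrated with a minimal sketch. The helper name `read_entries` and the temporary sample file are assumptions made for this example and do not appear in the question; the loop also strips each line, which is what the question's problem ultimately calls for.

```python
import os
import tempfile


def read_entries(path):
    """Return the non-empty lines of a text file without trailing whitespace."""
    entries = []
    with open(path, 'r') as handle:
        # Iterating over the file object yields one line at a time,
        # so readlines() is not needed.
        for line in handle:
            entry = line.rstrip()
            if entry:
                entries.append(entry)
    return entries


# Hypothetical sample file standing in for the domains list.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('link4.pl/kontakt\ndanone.pl/Kontakt\n')
    sample_path = tmp.name

entries = read_entries(sample_path)
os.remove(sample_path)
print(entries)
```

Using `with` also closes the file when the loop finishes, which the bare `open(...)` call in the question does not guarantee.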

Answers

0

I found the right solution. I had to use the rstrip function.

self.start_urls.append('http://{}'.format(line.rstrip())) 
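As a fuller sketch of the same fix, the list-building step could be pulled into a plain function so it can be checked without running a crawl. The name `build_targets` is hypothetical. Reducing each entry to its host name for `allowed_domains` is an addition not in the answer above; it assumes, as Scrapy's documentation describes, that `allowed_domains` is meant to hold domain names rather than paths.

```python
from urllib.parse import urlsplit


def build_targets(lines):
    """Turn raw 'host/path' lines into (start_urls, allowed_domains)."""
    start_urls = []
    allowed_domains = []
    for line in lines:
        # Drop the trailing newline that would otherwise show up as %0A.
        entry = line.rstrip()
        if not entry:
            continue  # skip blank lines
        url = 'http://{}'.format(entry)
        start_urls.append(url)
        # Keep only the host part for the domain list (assumption, see above).
        host = urlsplit(url).hostname
        if host and host not in allowed_domains:
            allowed_domains.append(host)
    return start_urls, allowed_domains


start_urls, allowed_domains = build_targets(
    ['link4.pl/kontakt\n', 'danone.pl/Kontakt\n', '\n']
)
print(start_urls)
print(allowed_domains)
```

Inside the spider, the two returned lists could then be assigned to `self.start_urls` and `self.allowed_domains`.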
0

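%0A is the percent-encoded newline character. Reading the lines from the file keeps their trailing newlines intact. To get rid of them, you can use the string's strip method, like this:

  self.start_urls.append('http://{}'.format(line.strip()))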
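To see where the suffix comes from, here is a small illustration using only the standard library's `urllib.parse`. It is not Scrapy's own URL handling, which may differ in detail, but the percent-encoding of a newline is the same.

```python
from urllib.parse import quote, unquote

# A line as it comes out of the file, newline included.
raw_line = 'link4.pl/kontakt\n'

# Percent-encoding turns the newline (byte 0x0A) into %0A.
encoded = quote(raw_line)
print(encoded)

# Stripping the line first leaves nothing to encode.
clean = quote(raw_line.strip())
print(clean)

# Decoding %0A gives the newline back.
decoded = unquote('%0A')
```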
+0

I found that .rstrip works better here. Thanks for your answer anyway! – dzierzak
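For reference, a short comparison of the two methods on a made-up line (the sample string is an assumption for illustration). Both remove the trailing newline, so either one fixes the "%0A" suffix; they differ only in how they treat leading whitespace.

```python
# Hypothetical line with leading spaces and a Windows-style line ending.
line = '  danone.pl/Kontakt \r\n'

right_only = line.rstrip()        # removes trailing whitespace only
both_sides = line.strip()         # removes leading and trailing whitespace
newline_only = line.rstrip('\n')  # removes only trailing '\n' characters

print(repr(right_only))
print(repr(both_sides))
print(repr(newline_only))
```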