
Scrapy: how can I limit the crawl time per domain?

I have been searching for an answer; several similar questions have been asked on this forum, but none of them answers mine. One suggestion is to stop the spider after a certain amount of time, but that does not suit me, because I usually launch 10 websites per spider. So my challenge is this: each spider crawls 10 websites, and I want to limit the time spent on each domain to 20 seconds, so that the spider does not get stuck in some web shop. How can this be done?
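For reference, one way to express the "20 seconds per domain" idea directly inside a multi-site spider is a custom downloader middleware that notes when it first sees each domain and drops further requests to that domain once the budget is spent. This is only a minimal sketch, not tested against this project: the class name DomainTimeLimitMiddleware and the DOMAIN_TIME_BUDGET setting are made up for illustration, while from_crawler, process_request and IgnoreRequest are standard Scrapy hooks. Downloads already in flight are not interrupted, so the cut-off is approximate.

import time
from urlparse import urlparse
from scrapy.exceptions import IgnoreRequest

class DomainTimeLimitMiddleware(object):
    #Hypothetical downloader middleware: give every domain a fixed time budget.

    def __init__(self, budget):
        self.budget = budget
        #Remember when each domain was first requested.
        self.first_seen = {}

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getfloat('DOMAIN_TIME_BUDGET', 20.0))

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        started = self.first_seen.setdefault(domain, time.time())
        if time.time() - started > self.budget:
            #Drop the request so the spider moves on to the remaining domains.
            raise IgnoreRequest('time budget for %s exhausted' % domain)
        return None

It would be enabled in settings.py with something like DOWNLOADER_MIDDLEWARES = {'vacancies.middlewares.DomainTimeLimitMiddleware': 543} and DOMAIN_TIME_BUDGET = 20 (the module path 'vacancies.middlewares' is an assumption based on the project name used in the imports below).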

More generally: I am crawling 2000 company websites. To get it done in one day, I split the websites into 200 groups of 10 websites each and launch 200 spiders in parallel. This may be amateurish, but it is the best I know. The computer almost freezes, because the spiders take up all the CPU and memory, but the next day I have the results. What I am looking for are the employment pages on the company websites. Does anyone have a better idea of how to crawl 2000 websites? If there is a web shop among the websites, the crawl can take days, which is why I want to limit the time per domain.

Thank you in advance.

Marko

My code:

#!/usr/bin/python 
# -*- coding: utf-8 -*- 
# encoding=UTF-8 
import scrapy, urlparse, time, sys 
from scrapy.http import Request 
from scrapy.utils.response import get_base_url 
from urlparse import urlparse, urljoin 
from vacancies.items import JobItem 

#We need that in order to force Slovenian pages instead of English pages. It happened at "http://www.g-gmi.si/gmiweb/" that only English pages were found and no Slovenian. 
from scrapy.conf import settings 
settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',} 
#Settings.set(name, value, priority='cmdline') 
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl','en':q=0.8,} 




#start_time = time.time() 
# We run the programme in the command line with this command: 

#  scrapy crawl jobs -o urls.csv -t csv --logfile log.txt 


# We get two output files 
# 1) urls.csv 
# 2) log.txt 

# Url whitelist. 
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/url_whitelist.txt", "r+") as kw: 
    url_whitelist = kw.read().replace('\n', '').split(",") 
url_whitelist = map(str.strip, url_whitelist) 

# Tab whitelist. 
# We need to replace characters the same way as in the detector. 
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/tab_whitelist.txt", "r+") as kw: 
    tab_whitelist = kw.read().decode(sys.stdin.encoding).encode('utf-8') 
tab_whitelist = tab_whitelist.replace('Ŕ', 'č') 
tab_whitelist = tab_whitelist.replace('L', 'č') 
tab_whitelist = tab_whitelist.replace('Ő', 'š') 
tab_whitelist = tab_whitelist.replace('Ü', 'š') 
tab_whitelist = tab_whitelist.replace('Ä', 'ž') 
tab_whitelist = tab_whitelist.replace('×', 'ž') 
tab_whitelist = tab_whitelist.replace('\n', '').split(",") 
tab_whitelist = map(str.strip, tab_whitelist) 



# Look for occupations in url. 
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/occupations_url.txt", "r+") as occ_url: 
    occupations_url = occ_url.read().replace('\n', '').split(",") 
occupations_url = map(str.strip, occupations_url) 

# Look for occupations in tab. 
# We need to replace characters the same way as in the detector. 
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/occupations_tab.txt", "r+") as occ_tab: 
    occupations_tab = occ_tab.read().decode(sys.stdin.encoding).encode('utf-8') 
occupations_tab = occupations_tab.replace('Ŕ', 'č') 
occupations_tab = occupations_tab.replace('L', 'č') 
occupations_tab = occupations_tab.replace('Ő', 'š') 
occupations_tab = occupations_tab.replace('Ü', 'š') 
occupations_tab = occupations_tab.replace('Ä', 'ž') 
occupations_tab = occupations_tab.replace('×', 'ž') 
occupations_tab = occupations_tab.replace('\n', '').split(",") 
occupations_tab = map(str.strip, occupations_tab) 

#Join url whitelist and occupations. 
url_whitelist_occupations = url_whitelist + occupations_url 

#Join tab whitelist and occupations. 
tab_whitelist_occupations = tab_whitelist + occupations_tab 


#base = open("G:/myVE/vacancies/bazni.txt", "w") 
#non_base = open("G:/myVE/vacancies/ne_bazni.txt", "w") 


class JobSpider(scrapy.Spider): 

    #Name of spider 
    name = "jobs" 

    #start_urls = open("Q:\Big_Data\Utrip\spletne_strani.txt", "r+").readlines()[0] 
    #print urls 
    #start_urls = map(str.strip, urls) 
    #Start urls 
    start_urls = ["http://www.alius.si"] 
    print "\nSpletna stran   ", start_urls, "\n" 

    #Result of the programme is this list of job vacancies webpages. 
    jobs_urls = [] 


    def parse(self, response): 

     #Theoretically I could save the HTML of the webpage to be able to check later what it looked like 
     # at the time of downloading. That is important for validation, because it is easier to look at a nice HTML webpage instead of naked text, 
     # but I would have to write a pipeline http://doc.scrapy.org/en/0.20/topics/item-pipeline.html 

     response.selector.remove_namespaces() 
     #print "response url" , str(response.url) 

     #Take url of response, because we would like to stay on the same domain. 
     parsed = urlparse(response.url) 

     #Base url.   
     #base_url = get_base_url(response).strip() 
     base_url = parsed.scheme+'://'+parsed.netloc 
     #print "base url" , str(base_url) 
      #If the url grows from the seed urls, it's ok, otherwise not. 
     if base_url in self.start_urls: 
      #print "base url je v start" 
      #base.write(response.url+"\n") 



      #net1 = parsed.netloc 

      #Take all urls; they are marked by "href" or "data-link". These are either webpages on our website or new websites. 
      urls_href = response.xpath('//@href').extract()  
      urls_datalink = response.xpath('//@data-link').extract() 
      urls = urls_href + urls_datalink 
      #print "povezave na tej strani ", urls 




      #Loop through all urls on the webpage. 
      for url in urls: 

       #Test all new urls. (NOT WORKING) 

       #print "url ", str(url) 

       #If the url doesn't start with "http", it is a relative url, and we add the base url to get an absolute url.  
       if not (url.startswith("http")): 

        #Join the partial url with the base url. 
        url = urljoin(base_url,url).strip() 

       #print "new url ", str(url) 

       new_parsed = urlparse(url) 
       new_base_url = new_parsed.scheme+'://'+new_parsed.netloc 
       #print "new base url ", str(new_base_url) 

       if new_base_url in self.start_urls: 
        #print "yes" 

        url = url.replace("\r", "") 
        url = url.replace("\n", "") 
        url = url.replace("\t", "") 
        url = url.strip() 

        #Remove anchors '#', that point to a section on the same webpage, because this is the same webpage. 
        #But we keep question marks '?', which mean, that different content is pulled from database. 
        if '#' in url: 
         index = url.find('#') 
         url = url[:index] 
         if url in self.jobs_urls: 
          continue 




        #Ignore ftp and sftp. 
        if url.startswith("ftp") or url.startswith("sftp"): 

         continue 





        #Compare each url on the webpage with original url, so that spider doesn't wander away on the net. 
        #net2 = urlparse(url).netloc 
        #test.write("lokacija novega url "+ str(net2)+"\n") 

        #if net2 != net1: 
        # continue 
         #test.write("ni ista lokacija, nadaljujemo\n") 

        #If the last character is slash /, I remove it to avoid duplicates. 
        if url[len(url)-1] == '/':   
         url = url[:(len(url)-1)] 


        #If the url includes characters like %, ~ ... it is LIKELY NOT the one I am looking for, so I ignore it. 
        #However, this also excludes some good urls like http://www.mdm.si/company#employment 
        if any(x in url for x in ['%', '~', 

         #images 
         '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff', 
         '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF', 

         #documents 
         '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd', 
         '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD', 

         #music and video 
         '.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf', 
         '.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF', 

         #compression and other 
         '.zip', '.rar', '.css', '.flv', '.xml', 
         '.ZIP', '.RAR', '.CSS', '.FLV', '.XML', 

         #Twitter, Facebook, Youtube 
         '://twitter.com', '://mobile.twitter.com', 'www.twitter.com', 
         'www.facebook.com', 'www.youtube.com', 

         #Feeds, RSS, archive 
         '/feed', '=feed', '&feed', 'rss.xml', 'arhiv' 


           ]): 

         continue 


        #We need to save original url for xpath, in case we change it later (join it with base_url) 
        #url_xpath = url      


        #We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.   
        #if (urlparse(url).netloc == urlparse(base_url).netloc): 



        #The main part. We look for webpages, whose urls include one of the employment words as strings. 
        #We will check the tab of the url as well. This is additional filter, suggested by Dan Wu, to improve accuracy. 
        #tabs = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract() 
        tabs = response.xpath('//a[@href="%s"]/text()' % url).extract() 

        # Sometimes tabs can be just empty spaces like '\t' and '\n' so in this case we replace it with []. 
        # That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/ 
        tabs = [tab.encode('utf-8') for tab in tabs] 
        tabs = [tab.replace('\t', '') for tab in tabs] 
        tabs = [tab.replace('\n', '') for tab in tabs] 
        tab_empty = True 
        for tab in tabs: 
         if tab != '': 
          tab_empty = False 
        if tab_empty == True: 
         tabs = [] 


        # -- Instruction. 
        # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... -- 
        # Starting keyword_url is zero, then we add keywords as we find them in url. 
        keyword_url = '' 
        #for keyword in url_whitelist: 
        for keyword in url_whitelist_occupations: 

         if keyword in url: 
          keyword_url = keyword_url + keyword + ' ' 
        # a) If we find at least one keyword in url, we continue. 
        if keyword_url != '':     

         #1. Tabs are empty. 
         if tabs == []: 



          #We found url that includes one of the magic words and also the text includes a magic word. 
          #We check url, if we have found it before. If it is new, we add it to the list "jobs_urls". 
          if url not in self.jobs_urls : 


           self.jobs_urls.append(url) 
           item = JobItem() 
           item["url"] = url 
           #item["keyword_url"] = keyword_url 
           #item["keyword_url_tab"] = ' ' 
           #item["keyword_tab"] = ' ' 
           print "Zaposlitvena podstran ", url 

           #We return the item. 
           yield item 



         #2. There are texts in tabs, one or more. 
         else: 

          #For the same partial url several texts are possible. 
          for tab in tabs:        

           #We search for keywords in tabs. 
           keyword_url_tab = '' 
           #for key in tab_whitelist: 
           for key in tab_whitelist_occupations: 

            if key in tab: 
             keyword_url_tab = keyword_url_tab + key + ' ' 

           # If we find some keywords in tabs, then we have found keywords in both url and tab and we can save the url. 
           if keyword_url_tab != '': 

            # keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both url and tab. So we add initial keyword_url. 
            keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab 

            #We found url that includes one of the magic words and also the tab includes a magic word. 
            #We check url, if we have found it before. If it is new, we add it to the list "jobs_urls". 
            if url not in self.jobs_urls:        

             self.jobs_urls.append(url) 
             item = JobItem() 
             item["url"] = url 
             #item["keyword_url"] = ' ' 
             #item["keyword_url_tab"] = keyword_url_tab 
             #item["keyword_tab"] = ' ' 
             print "Zaposlitvena podstran ", url 

             #We return the item. 
             yield item 

           #We haven't found any keywords in tabs, but url is still good, because it contains some keywords, so we save it. 
           else: 

            if url not in self.jobs_urls:        

             self.jobs_urls.append(url) 
             item = JobItem() 
             item["url"] = url 
             #item["keyword_url"] = keyword_url 
             #item["keyword_url_tab"] = ' ' 
             #item["keyword_tab"] = ' ' 
             print "Zaposlitvena podstran ", url 

             #We return the item. 
             yield item        

        # b) If keyword_url = empty, there are no keywords in url, but perhaps there are keywords in tabs. So we check tabs. 
        else: 
         for tab in tabs: 


          keyword_tab = '' 
          #for key in tab_whitelist: 
          for key in tab_whitelist_occupations: 


           if key in tab: 
            keyword_tab = keyword_tab + key + ' ' 
          if keyword_tab != '':       

           if url not in self.jobs_urls:        

            self.jobs_urls.append(url) 
            item = JobItem() 
            item["url"] = url 
            #item["keyword_url"] = ' ' 
            #item["keyword_url_tab"] = ' ' 
            #item["keyword_tab"] = keyword_tab 
            print "Zaposlitvena podstran ", url 

            #We return the item. 
            yield item     

        #We don't add an "else" clause, because we want to explore the employment webpage further to find possible new employment webpages. 
        #We keep looking for employment webpages until we reach the DEPTH set in settings.py. 
        yield Request(url, callback = self.parse) 

      #else: 
       #non_base.write(response.url+"\n") 

How would stopping the complete spider after a specific 'time' be different from stopping it per domain? – eLRuLL


@eLRuLL The difference is that the spider crawls 10 websites, and if I stop it after (200) seconds, I cannot be sure that each website got 20 seconds. It could be that one website consumed all the time and the other websites were left behind. I don't know much about processes and how the machine handles requests internally. – Marko

Answer


Just use scrapyd to schedule 2000 individual single-site crawls. Set max_proc = 10 to run 10 spiders in parallel. Set the spider's CLOSESPIDER_TIMEOUT to 20 to have each spider run for 20 seconds. Stop running this natively on Windows, because it is painful. I have observed that Scrapy and scrapyd run faster inside a virtual machine than natively on Windows. I might be wrong - so try it and double-check yourself - but I have a strong feeling that if you use an Ubuntu 14.04 virtualbox image on Windows, it will be faster. Your crawl will take about 2000 * 20 / 10 = 4000 seconds, i.e. roughly 67 minutes.
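In case a concrete sketch helps: the following assumes scrapyd and a Scrapy version that supports custom_settings (1.0 or later). The spider name jobs_single and the start_url argument are made up for illustration, the project name vacancies is taken from the imports in the question, and max_proc (scrapyd.cfg) and CLOSESPIDER_TIMEOUT (Scrapy's CloseSpider extension) are the settings the answer refers to.

import scrapy

class SingleSiteJobSpider(scrapy.Spider):
    #Hypothetical single-site spider: scrapyd schedules one of these per website,
    #and CLOSESPIDER_TIMEOUT closes it after 20 seconds.
    name = "jobs_single"
    custom_settings = {'CLOSESPIDER_TIMEOUT': 20}

    def __init__(self, start_url=None, *args, **kwargs):
        super(SingleSiteJobSpider, self).__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []

    def parse(self, response):
        #... same employment-page logic as in the question's spider ...
        pass

#scrapyd.cfg - run at most 10 spiders at a time:
#  [scrapyd]
#  max_proc = 10

#Schedule one job per website, e.g. in a shell loop over the 2000 urls:
#  curl http://localhost:6800/schedule.json -d project=vacancies -d spider=jobs_single -d start_url=http://www.alius.si

Because each scheduled job covers exactly one website, CLOSESPIDER_TIMEOUT = 20 gives every domain its own 20-second limit, which is what the question asks for.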


Thank you @neverlastn, that looks good - if I manage to handle it, since my skills are not very advanced. I am only just getting into developing applications and I am not very good at it yet. It would be great if it ran faster. This is a project I started and have now taken on. – Marko


@Marko - glad to help! Try it for a while, and if it doesn't work, let me know. I am quite positive this solution works :) – neverlastn


That's good - I work at the national statistics office, and we cooperate with the Faculty of Computer Science, which is helping us with Orange, a machine-learning tool built on Python. Since they are Python programmers, I might ask them for help :) – Marko