2015-11-06

I am using scrapy to get data from http://www.bbb.org/greater-san-francisco/business-reviews/architects/klopf-architecture-in-san-francisco-ca-152805

The scraped data is incomplete, so I created some items to store that information, but I don't get all the data every time I run the script. Usually I get some empty items, so I have to run the script again until I get all of them.

This is the spider:

import scrapy
from tutorial.items import Product


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["bbb.org"]
    start_urls = [
        "http://www.bbb.org/greater-san-francisco/business-reviews/architects/klopf-architecture-in-san-francisco-ca-152805",
        #"http://www.bbb.org/greater-san-francisco/business-reviews/architects/a-d-architects-in-oakland-ca-133229",
        #"http://www.bbb.org/greater-san-francisco/business-reviews/architects/aecom-in-concord-ca-541360",
    ]

    def parse(self, response):
        # Save the raw page so a failed run can be inspected offline
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

        producto = Product(
            Name=response.xpath('//*[@id="business-detail"]/div/h1/text()').extract(),
            Telephone=response.xpath('//*[@id="business-detail"]/div/p/span[1]/text()').extract(),
            Address=response.xpath('//*[@id="business-detail"]/div/p/span[2]/span[1]/text()').extract(),
            Description=response.xpath('//*[@id="business-description"]/p[2]/text()').extract(),
            BBBAccreditation=response.xpath('//*[@id="business-accreditation-content"]/p[1]/text()').extract(),
            Complaints=response.xpath('//*[@id="complaint-sort-container"]/text()').extract(),
            Reviews=response.xpath('//*[@id="complaint-sort-container"]/p/text()').extract(),
            WebPage=response.xpath('//*[@id="business-detail"]/div/p/span[3]/a/text()').extract(),
            Rating=response.xpath('//*[@id="accedited-rating"]/img/text()').extract(),
            ServiceArea=response.xpath('//*[@id="business-additional-info-text"]/span[4]/p/text()').extract(),
            ReasonForRating=response.xpath('//*[@id="reason-rating-content"]/ul/li[1]/text()').extract(),
            NumberofEmployees=response.xpath('//*[@id="business-additional-info-text"]/p[8]/text()').extract(),
            LicenceNumber=response.xpath('//*[@id="business-additional-info-text"]/p[6]/text()').extract(),
            Contact=response.xpath('//*[@id="business-additional-info-text"]/span[3]/span/span[1]/text()').extract(),
            BBBFileOpened=response.xpath('//*[@id="business-additional-info-text"]/span[3]/span/span[1]/text()').extract(),
            BusinessStarted=response.xpath('//*[@id="business-additional-info-text"]/span[3]/span/span[1]/text()').extract(),
        )
        return producto
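One way to make the empty items easier to spot is to fall back to an explicit default whenever an XPath matches nothing, since `.extract()` silently returns an empty list in that case. A minimal sketch; the `first_or_default` helper is hypothetical, not part of Scrapy:

```python
# Hypothetical helper (not from the original spider): normalize the
# lists that .extract() returns so a failed match becomes a visible
# default value instead of a silently empty field.
def first_or_default(values, default="N/A"):
    """Return the first non-blank string from an .extract() result."""
    for value in values:
        stripped = value.strip()
        if stripped:
            return stripped
    return default
```

It would be used like `Name=first_or_default(response.xpath('...').extract())`, so a run with missing fields produces items filled with `"N/A"` rather than items that must be hunted down and re-scraped.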

This page requires setting a user agent, so I have a file of user agents; maybe some of them are wrong?

Maybe some of them are indeed wrong. Try disabling the random user-agent rotation: what happens if you just set `USER_AGENT = "someuseragent"` in your settings (remember to remove the random user-agent middleware)? – eLRuLL

That seems to work. –

Answer

Yes, some of your user agents may be wrong (perhaps some are old and outdated). If the site is fine with receiving just one user agent, you can add this to settings.py:

USER_AGENT = "someuseragent" 

Remember to remove or disable the random user-agent middleware in settings.py.
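Putting both pieces together, settings.py might look like the sketch below. The middleware path is an assumption for illustration (use whatever rotating user-agent middleware the project actually enables); in Scrapy, a middleware is disabled by mapping its path to `None` in `DOWNLOADER_MIDDLEWARES`:

```python
# settings.py -- a minimal sketch; "someuseragent" is a placeholder
# from the answer, and the middleware path below is hypothetical.
USER_AGENT = "someuseragent"

DOWNLOADER_MIDDLEWARES = {
    # Mapping a middleware to None disables it in Scrapy.
    "tutorial.middlewares.RandomUserAgentMiddleware": None,
}
```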