Scrapy crawler does not work on a website; I get partial results

I am new to Scrapy and Python. I have been trying to extract data from two websites, and both work very well when I use plain Python directly. These are the sites I want to scrape:

homedepot.com.mx/comprar/es/miguel-aleman/home (works perfectly)
vallenproveedora.com.mx/ (does not work)

Can someone tell me how to make the second link work?

I see this message:

DEBUG: Crawled (200) <GET http://www.vallenproveedora.com.mx/> (referer: None) ['partial']

but I cannot figure out how to fix it.

I would appreciate any help and support. The code and log are below:

items.py 

from scrapy.item import Item, Field 

class CraigslistSampleItem(Item): 
    title = Field() 
    link = Field()

Test.py (spiders folder)

import scrapy
from craigslist_sample.items import CraigslistSampleItem

class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["vallenproveedora.com.mx"]
    #start_urls = ["http://www.homedepot.com.mx/webapp/wcs/stores/servlet/SearchDisplay?searchTermScope=&filterTerm=&orderBy=&maxPrice=&showResultsPage=true&langId=-5&beginIndex=0&sType=SimpleSearch&pageSize=&manufacturer=&resultCatEntryType=2&catalogId=10052&pageView=table&minPrice=&urlLangId=-5&storeId=13344&searchTerm=guante"]
    start_urls = ["http://www.vallenproveedora.com.mx/"]

    def parse(self, response):
        # Select every list item, then read the text and href of its link.
        for item in response.xpath('//ul/li'):
            title = item.xpath("a/text()").extract()
            link = item.xpath("a/@href").extract()
            print(title, link)

Source

2016-08-21 Roberto Lozano

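You see ['partial'] in your log because the vallenproveedora.com.mx server does not set the Content-Length header in its response; run curl -I to see for yourself. For more details on what causes the partial flag, see my answer here.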
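As a rough illustration of the header check described above, the sketch below scans a raw header block (the kind of text curl -I prints) for a Content-Length header. The function name and both sample header blocks are made up for this example; they are not captured from the real server.

```python
def has_content_length(raw_headers: str) -> bool:
    """Return True if a raw HTTP header block contains a Content-Length header."""
    # Skip the status line, then compare header names case-insensitively.
    for line in raw_headers.splitlines()[1:]:
        name, sep, _value = line.partition(":")
        if sep and name.strip().lower() == "content-length":
            return True
    return False


# Hypothetical header blocks, invented for illustration only.
WITH_LENGTH = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 5120\r\n"
    "\r\n"
)
WITHOUT_LENGTH = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Connection: close\r\n"
    "\r\n"
)

print(has_content_length(WITH_LENGTH))     # True
print(has_content_length(WITHOUT_LENGTH))  # False
```

Inside a spider callback, the same condition shows up as 'partial' in response.flags, which is what the log line reports.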

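However, you do not actually need to worry about that. The response body is there and Scrapy will parse it. The real problem is that the XPath //ul/li/a does not select any elements. You should look at the page source and adjust your selectors accordingly. I would suggest writing a dedicated spider for each website, because sites usually need different selectors.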
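One way to see why a selector comes back empty is to run it against a small copy of the markup. The sketch below uses the standard library's ElementTree as a stand-in for Scrapy's selectors (it supports only a subset of XPath). The HTML fragment, the "menu" class and the extract_links helper are all invented for illustration and are not taken from vallenproveedora.com.mx.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment: the links sit in <div class="menu">,
# not inside <ul><li>, so a list-based selector cannot match them.
SAMPLE = """
<html><body>
  <div class="menu">
    <a href="/guantes">Guantes</a>
    <a href="/cascos">Cascos</a>
  </div>
</body></html>
""".strip()


def extract_links(root, path):
    """Return (text, href) pairs for every <a> element matched by path."""
    return [(a.text, a.get("href")) for a in root.findall(path)]


root = ET.fromstring(SAMPLE)

# The selector from the question matches nothing in this layout.
print(extract_links(root, ".//ul/li/a"))

# A selector written for this particular layout finds both links.
print(extract_links(root, ".//div[@class='menu']/a"))
```

Against a live site, the scrapy shell command gives the same kind of quick feedback: it fetches a URL and lets you try response.xpath(...) expressions interactively before committing them to a spider.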

Source

2016-08-24 03:14:14

Thank you very much! It works perfectly.
