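Parsing a document with Scrapy

I have a problem: I want to parse a website and scrape the link to every article on it, but Scrapy does not follow all of the links and seems to pick only some of them at random.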

import scrapy

from tutorial.items import GouvItem


class GouvSpider(scrapy.Spider):
    name = "gouv"
    allowed_domains = ["legifrance.gouv.fr"]
    start_urls = [
        "http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577&dateTexte=20160128"
    ]

    def parse(self, response):
        # Follow every article link found on the table-of-contents page.
        for href in response.xpath('//span/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        # Emit one item per article body found on the page.
        for art in response.xpath("//div[@class='corpsArt']"):
            item = GouvItem()
            item['article'] = art.xpath('p/text()').extract()
            yield item




# And this is the GouvItem:

import scrapy 

class GouvItem(scrapy.Item): 
    title1 = scrapy.Field() 
    title2 = scrapy.Field() 
    title3 = scrapy.Field() 
    title4 = scrapy.Field() 
    title5 = scrapy.Field() 
    title6 = scrapy.Field() 
    link = scrapy.Field() 
    article = scrapy.Field()

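These are some lines of the JSON file. We can see that some articles are missing, while others are present but appear several times.

The problem is that each article of the law should appear once and only once. On the website, each article appears only once.

Thanks a lot!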

Source

2016-02-02 Aurelien.Farcy

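Please edit your post and paste your code here so that people can copy and paste it into their editor.

Including the definition of 'GouvItem' would be great.

...I just realized that if I run the same script twice, the two results are different... I don't understand...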

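The links to the site's subpages contain a session ID. It looks as though the way the site's responses depend on that session ID does not work well with Scrapy sending multiple concurrent requests.

One way to solve this is to set CONCURRENT_REQUESTS to 1 in settings.py. With this setting the scrape will take longer.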
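For reference, a minimal sketch of that setting, assuming a standard Scrapy project layout where `settings.py` sits next to the spiders package:

```python
# settings.py
# Send one request at a time so that session-dependent responses cannot interleave.
# Scrapy's default for this setting is 16.
CONCURRENT_REQUESTS = 1
```

The same key can also be set for a single spider through its `custom_settings` class attribute, which leaves any other spiders in the project unaffected.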

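Another way is to control the requests manually with a list. See this answer on SO.

To avoid empty results, use a relative XPath (with a leading dot) and extract all of the text: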

item['article'] = art.xpath('.//text()').extract()
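The difference between the two expressions can be seen without running a crawl. The sketch below uses only the standard library's ElementTree as a rough stand-in for Scrapy's selectors, and the markup is made up for illustration rather than taken from the site. ElementTree has no `text()` step, so the two list expressions are approximations of `p/text()` and `.//text()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on a 'corpsArt' block (not copied from the site).
SAMPLE = """
<div class="corpsArt">
  Intro text directly inside the div.
  <p>First paragraph.</p>
  <ul><li>Nested item<ul><li>Deeper item</li></ul></li></ul>
</div>
"""

art = ET.fromstring(SAMPLE)

# Rough analogue of 'p/text()': only the text of <p> elements that are direct children.
direct_p_text = [p.text for p in art.findall("p")]

# Rough analogue of './/text()': every piece of text below the div, in document order.
all_text = [t.strip() for t in art.itertext() if t.strip()]

print(direct_p_text)
print(all_text)
```

In this sample the first list holds only the paragraph, while the second also picks up the text sitting directly in the div and inside the nested lists, in the order it appears in the markup.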

Hope this helps.

Source

2016-02-18 16:37:49

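Thanks a lot! It seems to work better, but the articles of the law are not in the right order. Does this mean the crawler takes all of the ul/li text, then all of the ul/li/ul/li text, and so on? I will test on the whole page to find out.

It works! Thanks a lot! I got everything! My only remaining problem is that the articles of the law are still not in the right order... do you have any idea?

Save the article's section text as an additional field of the item. You can then sort the resulting JSON file by that field. I don't know how to do this directly with Scrapy, sorry!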
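A minimal sketch of that post-processing step, assuming the items were exported as a JSON array (for example with `scrapy crawl gouv -o items.json`) and that each item carries a `section` field. That field name and the sample values are hypothetical. An inline string stands in for the exported file so the snippet runs on its own.

```python
import json
import re

# Stand-in for the exported file contents; the 'section' field is an assumed extra field.
raw = """[
  {"section": "Article 10", "article": ["text of article 10"]},
  {"section": "Article 2",  "article": ["text of article 2"]},
  {"section": "Article 1",  "article": ["text of article 1"]}
]"""


def natural_key(item):
    """Split the label into text and integer chunks so 'Article 2' sorts before 'Article 10'."""
    parts = re.split(r"(\d+)", item["section"])
    return [int(p) if p.isdigit() else p.lower() for p in parts]


items = json.loads(raw)
ordered = sorted(items, key=natural_key)

for item in ordered:
    print(item["section"])
```

A plain string sort would place "Article 10" before "Article 2", which is why the key function compares the numeric parts as integers.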
