2016-08-10

I know I have read some answers about this same problem, but I haven't been able to solve mine. I am new to Python, and I am trying to extract data about apps and stores from Aptoide. I want the output as a .json (or .csv) file, but the file I get is empty and I don't know why. Scrapy outputs an empty JSON file.

Here is my code:

import scrapy

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ApptoideItem(scrapy.Item):
    app_name = scrapy.Field()
    rating = scrapy.Field()
    security_status = scrapy.Field()
    good_flag = scrapy.Field()
    licence_flag = scrapy.Field()
    fake_flag = scrapy.Field()
    freeze_flag = scrapy.Field()
    virus_flag = scrapy.Field()
    five_stars = scrapy.Field()
    four_stars = scrapy.Field()
    three_stars = scrapy.Field()
    two_stars = scrapy.Field()
    one_stars = scrapy.Field()
    info = scrapy.Field()
    download = scrapy.Field()
    version = scrapy.Field()
    size = scrapy.Field()
    link = scrapy.Field()
    store = scrapy.Field()

class AppSpider(CrawlSpider):
    name = "second"
    allowed_domains = ["aptoide.com"]
    start_urls = ["http://www.aptoide.com/page/morestores/type:top"]

    rules = (
        Rule(LinkExtractor(allow=(r'\w+\.store\.aptoide\.com$'))),
        Rule(LinkExtractor(allow=(r'\w+\.store\.aptoide\.com/app/market')), callback='parse_item'),
    )


def parse_item(self, response): 

    item = ApptoideItem() 
    item['app_name']= str(response.css(".app_name::text").extract()[0]) 
    item['rating']= str(response.css(".app_rating_number::text").extract()[0]) 
    item['security_status']= str(response.css("#show_app_malware_data::text").extract()[0]) 
    item['good_flag']= int(response.css(".good > div:nth-child(3)::text").extract()[0]) 
    item['licence_flag']= int(response.css(".license > div:nth-child(3)::text").extract()[0]) 
    item['fake_flag']= int(response.css(".fake > div:nth-child(3)::text").extract()[0]) 
    item['freeze_flag']= int(response.css(".freeze > div:nth-child(3)::text").extract()[0]) 
    item['virus_flag']= int(response.css(".virus > div:nth-child(3)::text").extract()[0]) 
    item['five_stars']= int(response.css("div.app_ratting_bar_holder:nth-child(1) > div:nth-child(3)::text").extract()[0]) 
    item['four_stars']= int(response.css("div.app_ratting_bar_holder:nth-child(2) > div:nth-child(3)::text").extract()[0]) 
    item['three_stars']= int(response.css("div.app_ratting_bar_holder:nth-child(3) > div:nth-child(3)::text").extract()[0]) 
    item['two_stars']= int(response.css("div.app_ratting_bar_holder:nth-child(4) > div:nth-child(3)::text").extract()[0]) 
    item['link']= response.url 
    item['one_stars']= int(response.css("div.app_ratting_bar_holder:nth-child(5) > div:nth-child(3)::text").extract()[0]) 
    item['download']= int(response.css("p.app_meta::text").re('(\d[\w\.]*)')[0].replace('.', '')) 
    item['version']= str(response.css("p.app_meta::text").re('(\d[\w\.]*)')[1]) 
    item['size']= str(response.css("p.app_meta::text").re('(\d[\w\.]*)')[2]) 
    item['store_name']= str(response.css(".sec_header_txt::text").extract()[0]) 
    item['info_store']= str(response.css(".ter_header2::text").extract()[0]) 
    yield item 

What I am sure of is that the method parse_item is never called, and I don't know why. The first rule is meant to follow the store links, and the second the app pages inside each store. I think the syntax of the regular expressions is correct.
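As a quick sanity check outside Scrapy, the two `allow` patterns can be tried with plain `re.search`, which is essentially how `LinkExtractor` matches them against each absolute URL; the sample URLs below are made up for illustration:

```python
import re

# The two patterns from the rules above
store_pat = re.compile(r'\w+\.store\.aptoide\.com$')
app_pat = re.compile(r'\w+\.store\.aptoide\.com/app/market')

# The trailing $ means the first pattern only matches URLs that end
# exactly at the host name -- even a trailing slash breaks the match:
print(bool(store_pat.search("http://apps.store.aptoide.com")))   # True
print(bool(store_pat.search("http://apps.store.aptoide.com/")))  # False

# The app pattern matches anywhere in the URL, so a longer path is fine:
print(bool(app_pat.search("http://apps.store.aptoide.com/app/market/com.example")))  # True
```

So if the store pages the spider actually encounters carry a trailing slash or a path, the first rule would never follow them, and parse_item would never be reached.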

The settings are:

CLOSESPIDER_PAGECOUNT = 1000 
CLOSESPIDER_ITEMCOUNT = 500 
CONCURRENT_REQUESTS = 1 
CONCURRENT_ITEMS = 1 

BOT_NAME = 'nuovo' 


SPIDER_MODULES = ['nuovo.spiders'] 
NEWSPIDER_MODULE = 'nuovo.spiders' 
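For the empty-file symptom specifically, it may also be worth checking how the JSON feed is produced. In Scrapy of that era (1.x) the feed can be configured in settings.py; the file name apps.json here is just an example:

```python
# settings.py -- export scraped items as a JSON feed (Scrapy 1.x setting names)
FEED_FORMAT = 'json'
FEED_URI = 'apps.json'
```

Or equivalently on the command line with `scrapy crawl second -o apps.json`. Either way, the file stays empty if no items are ever yielded.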

Can anyone spot the problem and suggest a solution?

Have you checked whether your xpath expressions work via `scrapy shell`? Could the output be generated by JavaScript? Also, `scrapy` provides a method called `extract_first()`, so you don't need to fiddle with indexes. – Jan

Answer


Your code is full of errors. When you run the spider you can save the log and grep through it:

    scrapy crawl spidername 2>&1 | tee crawl.log

I found a few errors:

  • ApptoideItem is missing several fields, such as store_name, that parse_item tries to fill in.

  • All of your int() conversions are unsafe, meaning that if response.css finds nothing, the conversion raises an error.

To fix the second issue, I suggest looking into Scrapy ItemLoaders, which let you specify default behaviour for some fields, e.g. converting the _flag item fields to booleans.
Also, as @Jan mentioned in the comments, you should use the extract_first() method instead of extract()[0]; extract_first() lets you specify a default for when nothing is found, i.e. .extract_first(default=0)
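To make the idea of a safe default concrete without pulling in ItemLoaders, here is a minimal plain-Python sketch; the safe_int helper is hypothetical (not part of Scrapy) and mimics what .extract_first(default=...) plus a guarded int() conversion would give you:

```python
def safe_int(values, default=0):
    """Take the first extracted value, falling back to a default.

    Mimics response.css(...).extract_first(default=...) followed by a
    guarded int() conversion, so a missing selector no longer crashes
    the item with an IndexError or ValueError.
    """
    first = values[0] if values else None
    try:
        return int(first)
    except (TypeError, ValueError):
        return default

print(safe_int(["42"]))       # 42
print(safe_int([]))           # 0  (nothing extracted)
print(safe_int(["n/a"], -1))  # -1 (not a number)
```

With a helper like this, every `int(response.css(...).extract()[0])` line in parse_item can become `safe_int(response.css(...).extract())` and a missing element yields a default instead of killing the item.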