yield scrapy.Request does not return the title

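I am new to Scrapy and am trying to use it to practice crawling a website. However, even though I followed the code provided by the tutorial, it does not return any results. It looks like yield scrapy.Request is not working. My code is as follows: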

import scrapy
from bs4 import BeautifulSoup
from apple.items import AppleItem

class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def parse(self, response):
        # Follow the link inside every news entry on the listing page.
        domain = "http://www.appledaily.com.tw"
        res = BeautifulSoup(response.body)
        for news in res.select('.rtddt'):
            yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)

    def parse_detail(self, response):
        # Extract title, content and time from the article page.
        res = BeautifulSoup(response.body)
        appleitem = AppleItem()
        appleitem['title'] = res.select('h1')[0].text
        appleitem['content'] = res.select('.trans')[0].text
        appleitem['time'] = res.select('.gggs time')[0].text
        return appleitem

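The output shows that the spider is opened and closed, but it returns nothing. The Python version is 3.6. Can anyone help? Thanks.

Edit I

The crawl log is available here.

Edit II

Perhaps the problem becomes clearer if I change the code to the following: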

import scrapy
from bs4 import BeautifulSoup


class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def parse(self, response):
        domain = "http://www.appledaily.com.tw"
        res = BeautifulSoup(response.body)
        for news in res.select('.rtddt'):
            yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)

    def parse_detail(self, response):
        res = BeautifulSoup(response.body)
        print(res.select('#h1')[0].text)

This code should print the title of each URL, but it does not return anything.

Asked 2017-07-10 by tzu

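Can you post the crawl log? You can produce it with 'scrapy crawl spider --logfile output.log' or 'scrapy crawl spider 2>&1 | tee output.log' (the latter sends the output to both the screen and the file). – Granitosaurus

@Granitosaurus, I just added the link to the log file. Thanks. – tzu

Your log says: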

2017-07-10 19:12:47 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.appledaily.com.tw': <GET http://www.appledaily.com.tw/realtimenews/article/life/20170710/1158177/oBike%E7%A6%81%E5%81%9C%E6%A9%9F%E8%BB%8A%E6%A0%BC%E3%80%80%E6%96%B0%E5%8C%97%E7%81%AB%E9%80%9F%E5%86%8D%E5%85%AC%E5%91%8A6%E5%8D%80%E7%A6%81%E5%81%9C>

Your spider is set to:

allowed_domains = ['appledaily.com']

So this should probably be:

allowed_domains = ['appledaily.com.tw']
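
To see why the original setting filters every article request, here is a minimal sketch of the kind of host check an offsite filter performs. It is an approximation written for illustration, not Scrapy's actual implementation; the helper name is_allowed is hypothetical. It assumes a URL is allowed when its host equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain or is a subdomain of it.

    Hypothetical helper: a simplified stand-in for an offsite check.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


url = "http://www.appledaily.com.tw/realtimenews/section/new/"
print(is_allowed(url, ["appledaily.com"]))     # False: the host ends in .com.tw, not .com
print(is_allowed(url, ["appledaily.com.tw"]))  # True
```

Under that assumption, 'www.appledaily.com.tw' is not a subdomain of 'appledaily.com', so every request yielded from parse is dropped before parse_detail can run.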

Answered 2017-07-10 11:43:48 by Granitosaurus

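Thank you very much. I would never have guessed that this was the cause. – tzu

It looks like the content your parse method is interested in (i.e. the list items with class rtddt) is generated dynamically: it can be inspected in Chrome, but it is not present in the HTML source (the response).

You will have to use something to render the page for Scrapy first. I would recommend Splash with the scrapy-splash package.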
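
One way to test the claim that the rtddt items are missing from the raw HTML is to count elements carrying that class in the response text before any JavaScript runs. The following is a rough sketch using only the standard library; ClassFinder and count_class are hypothetical names, and the two HTML strings are made-up samples rather than the real page.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.count


# Made-up samples: one page with the items in the markup, one filled in by a script.
static_html = '<ul><li class="rtddt even"><a href="/a">A</a></li></ul>'
dynamic_html = '<ul id="list"></ul><script>render()</script>'
print(count_class(static_html, "rtddt"))   # 1
print(count_class(dynamic_html, "rtddt"))  # 0
```

Inside the spider, the same function could be called on response.text in parse. A count of zero would suggest the list is rendered client-side, while a non-zero count would point back at request filtering instead.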

Answered 2017-07-10 11:24:41
