PyQt4 implementation in Scrapy

Using Scrapy, I have run into a problem with JavaScript-rendered pages. On a franchise forum site, for example the link http://www.idee-franchise.com/forum/viewtopic.php?f=3&t=69, when I try to scrape the source HTML I cannot retrieve any posts, because they appear to be appended after the page is rendered (probably by JavaScript).

So I searched the web for a solution to this problem, and I came across https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/

I am completely new to PyQt, but I hoped to take a shortcut and copy-paste some code.

This worked perfectly when I tried to scrape a single page. But when I implemented it in Scrapy, I got the following errors:

QObject::connect: Cannot connect (null)::configurationAdded(QNetworkConfiguration) to QNetworkConfigurationManager::configurationAdded(QNetworkConfiguration) 
QObject::connect: Cannot connect (null)::configurationRemoved(QNetworkConfiguration) to QNetworkConfigurationManager::configurationRemoved(QNetworkConfiguration) 
QObject::connect: Cannot connect (null)::configurationChanged(QNetworkConfiguration) to QNetworkConfigurationManager::configurationChanged(QNetworkConfiguration) 
QObject::connect: Cannot connect (null)::onlineStateChanged(bool) to QNetworkConfigurationManager::onlineStateChanged(bool) 
QObject::connect: Cannot connect (null)::configurationUpdateComplete() to QNetworkConfigurationManager::updateCompleted() 

If I scrape just one page, the errors do not occur; but when I set the crawler to recursive mode, then on the second link I get a "python.exe has stopped working" crash along with the errors above.

I searched for what this might be, and I read that the QApplication object should only be instantiated once.

Could someone tell me what the correct implementation should be?

The spider:

# -*- coding: utf-8 -*-
import scrapy
import sys, traceback
from bs4 import BeautifulSoup as bs
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from crawler.items import ThreadItem, PostItem
from crawler.utils import utils


class IdeefranchiseSpider(CrawlSpider):
    name = "ideefranchise"
    allowed_domains = ["idee-franchise.com"]
    start_urls = (
        'http://www.idee-franchise.com/forum/',
        # 'http://www.idee-franchise.com/forum/viewtopic.php?f=3&t=69',
    )

    rules = [
        Rule(LinkExtractor(allow='/forum/'), callback='parse_thread', follow=True)
    ]

    def parse_thread(self, response):
        print "Parsing Thread", response.url
        thread = ThreadItem()
        thread['url'] = response.url
        thread['domain'] = self.allowed_domains[0]
        thread['title'] = self.get_thread_title(response)
        thread['forumname'] = self.get_thread_forum_name(response)
        thread['posts'] = self.get_thread_posts(response)
        yield thread

        # paginate if possible
        next_page = response.css('fieldset.display-options > a::attr("href")')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_thread)

    def get_thread_posts(self, response):
        # using PYQTPageRenderor to reload the page. I think this is where
        # the problem occurs, when I instantiate the PYQTPageRenderor object.
        soup = bs(unicode(utils.PYQTPageRenderor(response.url).get_html()))

        # sleep so that PYQT can render the page
        # time.sleep(5)

        # comments
        posts = []
        for item in soup.select("div.post.bg2") + soup.select("div.post.bg1"):
            try:
                post = PostItem()
                post['profile'] = item.select("p.author > strong > a")[0].get_text()
                details = item.select('dl.postprofile > dd')
                post['date'] = details[2].get_text()
                post['content'] = item.select('div.content')[0].get_text()

                # appending the comment
                posts.append(post)
            except:
                e = sys.exc_info()[0]
                self.logger.critical("ERROR GET_THREAD_POSTS %s", e)
                traceback.print_exc(file=sys.stdout)
        return posts

The PyQt implementation:

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage


class Render(QWebPage):
    def __init__(self, url):
        # a new QApplication is created for every page that gets rendered
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until _loadFinished calls quit()

    def _loadFinished(self, result):
        # keep a reference to the rendered frame, then stop the event loop
        self.frame = self.mainFrame()
        self.app.quit()


class PYQTPageRenderor(object):
    def __init__(self, url):
        self.url = url

    def get_html(self):
        # render the page with QtWebKit and return the resulting HTML
        r = Render(self.url)
        return unicode(r.frame.toHtml())

Answer

The correct implementation, if you want to do it yourself, would be to create a downloader middleware that handles requests using PyQt. It would be instantiated by Scrapy only once.

It should not be that complicated (see the sketch after these steps). Just:

  1. Create a middleware.py file in your project.

  2. The constructor of a QTDownloader class should create the QApplication object.

  3. The process_request method should do the URL loading and the HTML fetching. Note that you return a Response object with the HTML string.

  4. You would do the appropriate cleanup in a _cleanup method of your class.

  5. Finally, activate your middleware by adding it to the DOWNLOADER_MIDDLEWARES variable in the settings.py file of your project.
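
A minimal sketch of what such a middleware could look like, reusing the QtWebKit logic from the question. This is untested, and QTDownloader, the module path, and the priority value are illustrative assumptions, not code from an existing project:

import sys

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage
from scrapy.http import HtmlResponse


class QTDownloader(object):
    def __init__(self):
        # Scrapy instantiates the middleware once, so the QApplication
        # is created exactly once for the whole crawl.
        self.app = QApplication(sys.argv)

    def process_request(self, request, spider):
        # Render the page with QtWebKit; returning a Response here
        # short-circuits Scrapy's regular downloader.
        page = QWebPage()
        page.loadFinished.connect(self.app.quit)
        page.mainFrame().load(QUrl(request.url))
        self.app.exec_()  # blocks until loadFinished fires
        html = unicode(page.mainFrame().toHtml())
        return HtmlResponse(request.url, body=html.encode('utf-8'),
                            encoding='utf-8', request=request)

And the activation, assuming the project is called crawler as in the question:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'crawler.middleware.QTDownloader': 543,
}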

If you don't want to write your own solution, you could use an existing middleware that uses Selenium to do the downloading, such as scrapy-webdriver. And if you don't want a visible browser, you can instruct it to use PhantomJS.

EDIT1: So the proper way to do this, as Rejected pointed out, is to use a download handler. The idea is similar, but the downloading should happen in a download_request method, and the handler should be enabled by adding it to DOWNLOAD_HANDLERS. Take the WebdriverDownloadHandler as an example.
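
A rough sketch of that variant, under the same caveats (untested; the class and module names are illustrative, and it assumes a Scrapy version of that era, where a handler is constructed with the project settings and its download_request result is wrapped in a Deferred by Scrapy):

# handlers.py -- reuses the rendering logic from the middleware sketch above
from crawler.middleware import QTDownloader


class QTDownloadHandler(object):
    def __init__(self, settings):
        # instantiated once by Scrapy, with the project settings
        self.renderer = QTDownloader()

    def download_request(self, request, spider):
        # must produce a Response; Scrapy wraps the result in a Deferred
        return self.renderer.process_request(request, spider)

# settings.py -- route plain HTTP(S) requests through the custom handler
DOWNLOAD_HANDLERS = {
    'http': 'crawler.handlers.QTDownloadHandler',
    'https': 'crawler.handlers.QTDownloadHandler',
}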

+0

Thanks for the guidelines. I will try it right away. –

+1

I've always considered this a bad implementation. Middleware is not where download handling should be done. You are short-circuiting the downloader, which will cause a lot of unexpected results (like throttling/delays). The *proper* way to implement it directly is to override the http(s) download handler. – Rejected

+0

@Rejected I can't seem to find documentation on download handlers. Could you tell me what I should read about downloading? Thanks –