PyQt4 implementation in Scrapy

Using Scrapy, I ran into a problem with JavaScript-rendered pages. On the franchise forum site, for example the link http://www.idee-franchise.com/forum/viewtopic.php?f=3&t=69, when I try to scrape the source HTML I cannot retrieve any of the posts, because they seem to be "appended" after the page is rendered (probably via JavaScript).
So I searched the web for a solution to this problem and came across https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/.
I am completely new to PyQt, but was hoping to take a shortcut and copy-paste some code.
This worked perfectly when I tried to scrape a single page. But when I implemented it inside Scrapy, I got the following errors:
QObject::connect: Cannot connect (null)::configurationAdded(QNetworkConfiguration) to QNetworkConfigurationManager::configurationAdded(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationRemoved(QNetworkConfiguration) to QNetworkConfigurationManager::configurationRemoved(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationChanged(QNetworkConfiguration) to QNetworkConfigurationManager::configurationChanged(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::onlineStateChanged(bool) to QNetworkConfigurationManager::onlineStateChanged(bool)
QObject::connect: Cannot connect (null)::configurationUpdateComplete() to QNetworkConfigurationManager::updateCompleted()
If I scrape a single page the errors do not occur, but when I set the crawler to recursive mode, then on the second link I get a "python.exe has stopped working" error together with the messages above.
I was about to search for what this might be when I read that the QApplication object should only be started once.
Can someone tell me what the correct implementation should be?
The spider
# -*- coding: utf-8 -*-
import scrapy
import sys, traceback

from bs4 import BeautifulSoup as bs
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from crawler.items import ThreadItem, PostItem
from crawler.utils import utils


class IdeefranchiseSpider(CrawlSpider):
    name = "ideefranchise"
    allowed_domains = ["idee-franchise.com"]
    start_urls = (
        'http://www.idee-franchise.com/forum/',
        # 'http://www.idee-franchise.com/forum/viewtopic.php?f=3&t=69',
    )
    rules = [
        Rule(LinkExtractor(allow='/forum/'), callback='parse_thread', follow=True)
    ]

    def parse_thread(self, response):
        print "Parsing Thread", response.url
        thread = ThreadItem()
        thread['url'] = response.url
        thread['domain'] = self.allowed_domains[0]
        thread['title'] = self.get_thread_title(response)
        thread['forumname'] = self.get_thread_forum_name(response)
        thread['posts'] = self.get_thread_posts(response)
        yield thread
        # paginate if possible
        next_page = response.css('fieldset.display-options > a::attr("href")')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_thread)

    def get_thread_posts(self, response):
        # using PYQTPageRenderor to reload the page. I think this is where
        # the problem occurs, when I instantiate the PYQTPageRenderor object.
        soup = bs(unicode(utils.PYQTPageRenderor(response.url).get_html()))
        # sleep so that PyQt can render the page
        # time.sleep(5)
        # comments
        posts = []
        for item in soup.select("div.post.bg2") + soup.select("div.post.bg1"):
            try:
                post = PostItem()
                post['profile'] = item.select("p.author > strong > a")[0].get_text()
                details = item.select('dl.postprofile > dd')
                post['date'] = details[2].get_text()
                post['content'] = item.select('div.content')[0].get_text()
                # appending the comment
                posts.append(post)
            except:
                e = sys.exc_info()[0]
                self.logger.critical("ERROR GET_THREAD_POSTS %s", e)
                traceback.print_exc(file=sys.stdout)
        return posts
The PyQt implementation
import sys

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage


class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()


class PYQTPageRenderor(object):
    def __init__(self, url):
        self.url = url

    def get_html(self):
        r = Render(self.url)
        return unicode(r.frame.toHtml())
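Since QApplication is only supposed to be created once per process, one thing to try is to reuse a single application object instead of constructing a new one for every page. A minimal sketch of that idea (the _get_app helper and the use of QApplication.instance() are my own assumptions, not part of the original code):

import sys

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage


def _get_app():
    # Hypothetical helper: reuse the process-wide QApplication if it already
    # exists, otherwise create it exactly once.
    app = QApplication.instance()
    if app is None:
        app = QApplication(sys.argv)
    return app


class Render(QWebPage):
    def __init__(self, url):
        self.app = _get_app()
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

Whether repeatedly entering and quitting the event loop from inside a Scrapy callback is robust is a separate question; this only removes the duplicate QApplication construction.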
Thanks for the guidance. I will try it right away. –
I have always considered this a poor implementation. Middleware is not where downloading should be handled. You are short-circuiting the downloader, which leads to a lot of unexpected results (such as throttling/delays). The *proper* way to implement it directly is to override the http(s) download handlers. – Rejected
@Rejected I cannot seem to find any documentation on download handlers. Could you tell me what I should read about them? Thanks –
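For reference, a rough sketch of what Rejected's suggestion could look like: a download handler registered through the DOWNLOAD_HANDLERS setting that fetches pages with the PyQt renderer instead of doing it in middleware or in the spider. The module path crawler.handlers, the class name JSDownloadHandler, and the synchronous rendering call are assumptions for illustration only, not code from this thread:

# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'crawler.handlers.JSDownloadHandler',
    'https': 'crawler.handlers.JSDownloadHandler',
}

# crawler/handlers.py (hypothetical module)
from twisted.internet import defer
from scrapy.http import HtmlResponse

from crawler.utils import utils


class JSDownloadHandler(object):
    """Fetches pages through the PyQt renderer instead of Scrapy's default
    HTTP machinery. The rendering call blocks the reactor, so requests are
    effectively handled one at a time."""

    def __init__(self, settings):
        self.settings = settings

    def download_request(self, request, spider):
        html = utils.PYQTPageRenderor(request.url).get_html()
        response = HtmlResponse(request.url, body=html.encode('utf-8'),
                                encoding='utf-8', request=request)
        return defer.succeed(response)

The handler interface itself is only sparsely documented; the DOWNLOAD_HANDLERS entry in the Scrapy settings documentation is probably the place to start.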