Scrapy spider cannot find URLs that are loaded on click

I am trying to extract data from this page: http://catalog.umassd.edu/content.php?catoid=45&navoid=3554

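I want to expand each section using its "Display courses for this department" link, and then get the course information (text) for every course on that page.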

I have written the following script:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

from courses.items import Course


class EduSpider(CrawlSpider):
    name = 'umassd.edu'
    allowed_domains = ['umassd.edu']
    start_urls = ['http://catalog.umassd.edu/content.php']

    rules = (
        Rule(
            LxmlLinkExtractor(
                allow=('.*/http://catalog.umassd.edu/preview_course.php?catoid=[0-9][0-9]&coid=[0-9][0-9][0-9][0-9][0-9][0-9]',),
            ),
            callback='parse_item',
        ),
    )

    def parse_item(self, response):
        item = Course()
        print(response)

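Now, no matter what start URL I give it, the spider never seems to reach the preview_course.php links. I have tried several variations. The script exits without crawling any of the /content.php pages at all.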

This is for educational purposes only.

Source

2017-03-24 boltthrower

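The URL you are looking for is retrieved via an AJAX request. If you open your browser's developer tools and go to the "Network" tab, you can see a request being made when you click the button, something like this:

http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php?show_hide=show&cat_oid=45&nav_oid=3554&ent_oid=2027&type=c&link_text=this%20department

This URL is generated by JavaScript; its contents are then downloaded and injected into your page.
Since Scrapy does not execute any JavaScript, you need to recreate this URL yourself. Fortunately, reverse engineering it is quite simple in your case.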

If you inspect the HTML source, you can see that the "Display courses for this department" link node has some interesting things on it:

<a href="#"
target="_blank"
onclick="showHideFilterData(this, 'show', '45', '3554', '2027', 'c', 'this department'); return false;">
Display courses for this department.</a>

We can see that a JavaScript function is called when the link is clicked, and if we compare its arguments with the URL above, the similarities are clear.
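To make the correspondence concrete, here is a minimal standalone sketch (plain Python, no Scrapy needed) that pulls the quoted arguments out of the onclick string and maps them onto the query parameters seen in the AJAX URL above. The names `AJAX_ENDPOINT`, `PARAM_NAMES`, and `build_ajax_url` are made up for illustration, and the argument-to-parameter order is an assumption inferred from comparing the two snippets.

```python
import re
from urllib.parse import quote, urlencode

# Endpoint copied from the request seen in the browser's Network tab.
AJAX_ENDPOINT = 'http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php'
# Query parameter names, assumed to follow the order of the onclick arguments.
PARAM_NAMES = ['show_hide', 'cat_oid', 'nav_oid', 'ent_oid', 'type', 'link_text']


def build_ajax_url(onclick):
    """Turn a showHideFilterData(...) onclick string into the AJAX URL."""
    # Every single-quoted value in the call is one argument.
    args = re.findall(r"'(.+?)'", onclick)
    if len(args) != len(PARAM_NAMES):
        raise ValueError('unexpected onclick format: %r' % onclick)
    # quote_via=quote encodes the space as %20, as in the observed URL.
    query = urlencode(list(zip(PARAM_NAMES, args)), quote_via=quote)
    return AJAX_ENDPOINT + '?' + query


onclick = ("showHideFilterData(this, 'show', '45', '3554', '2027', 'c', "
           "'this department'); return false;")
print(build_ajax_url(onclick))
```

For the link shown above, this should reproduce the URL from the Network tab, which is a quick way to check that the mapping is right before wiring it into a spider.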

Now we can use this data to build the URL:

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://catalog.umassd.edu/content.php?catoid=45&navoid=3554']

    def parse(self, response):
        # get the "onclick" JavaScript call of every "show more" link
        # and extract the arguments passed to it with a regular expression
        links = response.xpath("//a/@onclick[contains(.,'showHide')]")
        for link in links:
            # .re() returns a list of the quoted arguments
            args = link.re("'(.+?)'")
            # build the URL by putting the arguments from the page source
            # into a URL template
            url = 'http://catalog.umassd.edu/ajax/preview_filter_show_hide_data.php?show_hide={}&cat_oid={}&nav_oid={}&ent_oid={}&type={}&link_text={}'.format(*args)
            yield scrapy.Request(url, self.parse_more)

    def parse_more(self, response):
        # here you'll get the page source with all of the links
        pass
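What happens inside parse_more is left open above. As a rough sketch of one way to continue, the snippet below collects links to preview_course.php from an HTML fragment using only the standard library. It assumes the AJAX response contains ordinary anchors whose href points at preview_course.php (as the pattern in the question suggests); the sample fragment and the names `CourseLinkParser` and `extract_course_links` are hypothetical, and the real response may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = 'http://catalog.umassd.edu/'


class CourseLinkParser(HTMLParser):
    """Collect absolute URLs of anchors that point at preview_course.php."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        if 'preview_course.php' in href:
            # Resolve relative hrefs against the catalog's base URL.
            self.links.append(urljoin(BASE_URL, href))


def extract_course_links(html):
    parser = CourseLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical fragment, made up for illustration only.
sample = (
    '<ul>'
    '<li><a href="preview_course.php?catoid=45&coid=123456">Example Course A</a></li>'
    '<li><a href="content.php?catoid=45&navoid=3554">Back to catalog</a></li>'
    '</ul>'
)
print(extract_course_links(sample))
```

Inside the spider itself, the same idea would more naturally be written with response.xpath and response.urljoin, yielding a new scrapy.Request for each course URL.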

Source

2017-03-24 08:44:40 Granitosaurus

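This is quite sophisticated; I had only gotten as far as finding the AJAX link and its parameters, but did not know how to use them. Thank you very much! I should mention that args was a unicode type, and converting args to a list made the format(*args) line work. – boltthrower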

@boltthrower Thanks, I fixed the args part. – Granitosaurus
