Asked 2014-08-30 · 58 views

Loading more articles with a POST request in Scrapy (Python)

I want to scrape more articles from a website using Scrapy. My spider is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class mySpider(CrawlSpider):
    name = "mytest"
    allowed_domains = ['www.example.com']  # a list, not a set
    start_urls = ['http://www.example.com']

    rules = [
        Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']),
             callback='parse_post', follow=True),
    ]

    def parse_post(self, response):
        item = PostItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()
        item['authors'] = response.xpath('//span[@class="author"]/text()').extract()
        return item

Everything works fine, but it only scrapes the links already present on the page. The site loads more articles via a POST request, i.e. a "Load More Articles" button. Is there any way I can simulate clicking that button to load the additional articles and keep the crawl going?


It depends on how the "more articles" link actually works. Could you share the actual link to the site? – alecxe 2014-08-30 14:30:06


@alecxe It's ijreview.com – Anish 2014-08-30 14:30:33

Answer


The "Load More Articles" button is driven by JavaScript; clicking it fires an AJAX POST request.

In other words, this is something Scrapy cannot handle easily out of the box.

However, if Scrapy is not a hard requirement, here is a solution using requests and BeautifulSoup:

from bs4 import BeautifulSoup
import requests


url = "http://www.ijreview.com/wp-admin/admin-ajax.php"
session = requests.Session()
page_size = 24

params = {
    'action': 'load_more',
    'numPosts': page_size,
    'category': '',
    'orderby': 'date',
    'time': ''
}

offset = 0
limit = 100
while offset < limit:
    params['offset'] = offset
    response = session.post(url, data=params)
    links = [a['href'] for a in BeautifulSoup(response.content).select('li > a')]
    for link in links:
        response = session.get(link)
        page = BeautifulSoup(response.content)
        title = page.find('title').text.strip()
        author = page.find('span', class_='author').text.strip()
        print {'link': link, 'title': title, 'author': author}  # Python 2 print statement

    offset += page_size

Prints:

{'author': u'Kevin Boyd', 'link': 'http://www.ijreview.com/2014/08/172770-president-obama-realizes-world-messy-place-thanks-social-media/', 'title': u'President Obama Calls The World A Messy Place & Blames Social Media for Making People Take Notice'} 
{'author': u'Reid Mene', 'link': 'http://www.ijreview.com/2014/08/172405-17-politicians-weird-jobs-time-office/', 'title': u'12 Most Unusual Professions of Politicians Before They Were Elected to Higher Office'} 
{'author': u'Michael Hausam', 'link': 'http://www.ijreview.com/2014/08/172653-video-duty-mp-fakes-surrender-shoots-hostage-taker/', 'title': u'Video: Off-Duty MP Fake Surrenders at Gas Station Before Revealing Deadly Surprise for Hostage Taker'} 
... 

You may need to tweak the code so that it supports different categories, sort orders, etc. You can also speed up HTML parsing by letting BeautifulSoup use the lxml parser under the hood: instead of BeautifulSoup(response.content), use BeautifulSoup(response.content, "lxml"), but you would need to install lxml.
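Another tweak worth considering: the hard-coded limit of 100 can be replaced by stopping as soon as the endpoint returns no more links. A minimal sketch of that loop shape, where fetch_links is a hypothetical stand-in (with toy data) for the POST-and-parse step above:

```python
def fetch_links(offset):
    # Hypothetical stand-in for session.post + BeautifulSoup parsing;
    # returns an empty list once the site runs out of articles.
    data = {0: ['a', 'b'], 24: ['c']}  # toy data for illustration
    return data.get(offset, [])

page_size = 24
offset = 0
all_links = []
while True:
    links = fetch_links(offset)
    if not links:  # endpoint exhausted: stop paginating
        break
    all_links.extend(links)
    offset += page_size

print(all_links)  # ['a', 'b', 'c']
```

This way the crawl ends exactly when the site does, instead of at an arbitrary offset.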


Here is how you would adjust the solution for Scrapy:

import urllib
from scrapy import Item, Field, Request, Spider


class PostItem(Item):
    url = Field()
    title = Field()
    authors = Field()


class mySpider(Spider):
    name = "mytest"
    allowed_domains = ['www.ijreview.com']  # a list, not a set

    def start_requests(self):
        page_size = 25
        headers = {'User-Agent': 'Scrapy spider',
                   'X-Requested-With': 'XMLHttpRequest',
                   'Host': 'www.ijreview.com',
                   'Origin': 'http://www.ijreview.com',
                   'Accept': '*/*',
                   'Referer': 'http://www.ijreview.com/',
                   'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}
        # xrange, not a tuple: iterate offsets 0, 25, 50, ... up to 200
        for offset in xrange(0, 200, page_size):
            yield Request('http://www.ijreview.com/wp-admin/admin-ajax.php',
                          method='POST',
                          headers=headers,
                          body=urllib.urlencode(
                              {'action': 'load_more',
                               'numPosts': page_size,
                               'offset': offset,
                               'category': '',
                               'orderby': 'date',
                               'time': ''}))

    def parse(self, response):
        for link in response.xpath('//ul/li/a/@href').extract():
            yield Request(link, callback=self.parse_post)

    def parse_post(self, response):
        item = PostItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()[0].strip()
        item['authors'] = response.xpath('//span[@class="author"]/text()').extract()[0].strip()
        return item

Output:

{'authors': u'Kyle Becker', 
'title': u'17 Reactions to the \u2018We Don\u2019t Have a Strategy\u2019 Gaffe That May Haunt the Rest of Obama\u2019s Presidency', 
'url': 'http://www.ijreview.com/2014/08/172569-25-reactions-obamas-dont-strategy-gaffe-may-haunt-rest-presidency/'} 

... 

Looks good, but could you give me some idea how to tie this in with Scrapy? – Anish 2014-08-30 14:50:31


@Ngeunpo Sure, I've added a sample Scrapy adaptation. Hope you can use it as a base and improve on it. – alecxe 2014-08-30 15:10:29


@alecxe A subclass of Request is [FormRequest](http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.start_requests), which might also help in this case. – 2014-09-03 04:50:35
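As that comment suggests, FormRequest url-encodes a formdata dict into the POST body for you, which would make the manual urllib.urlencode call in start_requests unnecessary. The encoding it performs can be sketched with the standard library (Python 3 urllib.parse shown; the field values are the ones from the answer above):

```python
from urllib.parse import urlencode

# The form fields the admin-ajax.php endpoint expects (from the answer above).
formdata = {
    'action': 'load_more',
    'numPosts': 24,
    'offset': 0,
    'category': '',
    'orderby': 'date',
    'time': '',
}

# FormRequest(url, formdata=...) would send this string as the POST body,
# setting Content-Type: application/x-www-form-urlencoded itself.
body = urlencode(formdata)
print(body)  # action=load_more&numPosts=24&offset=0&category=&orderby=date&time=
```

With FormRequest the headers dict in the spider can drop the Content-Type entry; note that formdata values are best passed as strings.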