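The "Load More Articles" button is managed by JavaScript; clicking it fires an AJAX POST request.
In other words, this is something Scrapy cannot easily handle.
But if Scrapy is not a requirement, here is a solution using requests and BeautifulSoup: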

    from bs4 import BeautifulSoup
    import requests

    url = "http://www.ijreview.com/wp-admin/admin-ajax.php"
    session = requests.Session()
    page_size = 24

    # Form fields sent with each "load more" AJAX request
    params = {
        'action': 'load_more',
        'numPosts': page_size,
        'category': '',
        'orderby': 'date',
        'time': ''
    }

    offset = 0
    limit = 100
    while offset < limit:
        params['offset'] = offset
        response = session.post(url, data=params)
        links = [a['href'] for a in BeautifulSoup(response.content).select('li > a')]
        for link in links:
            # Follow each article link and extract its title and author
            response = session.get(link)
            page = BeautifulSoup(response.content)
            title = page.find('title').text.strip()
            author = page.find('span', class_='author').text.strip()
            print({'link': link, 'title': title, 'author': author})
        offset += page_size

Prints:
{'author': u'Kevin Boyd', 'link': 'http://www.ijreview.com/2014/08/172770-president-obama-realizes-world-messy-place-thanks-social-media/', 'title': u'President Obama Calls The World A Messy Place & Blames Social Media for Making People Take Notice'}
{'author': u'Reid Mene', 'link': 'http://www.ijreview.com/2014/08/172405-17-politicians-weird-jobs-time-office/', 'title': u'12 Most Unusual Professions of Politicians Before They Were Elected to Higher Office'}
{'author': u'Michael Hausam', 'link': 'http://www.ijreview.com/2014/08/172653-video-duty-mp-fakes-surrender-shoots-hostage-taker/', 'title': u'Video: Off-Duty MP Fake Surrenders at Gas Station Before Revealing Deadly Surprise for Hostage Taker'}
...
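
The loop above stops at a fixed `limit`. As a rough sketch of a different stopping rule, the helper below keeps paging until a page comes back empty. This assumes, without confirmation from the site, that the endpoint returns no links once the offset runs past the last article. `iter_links` and `fetch_links` are hypothetical names; `fetch_links(offset)` stands in for the `session.post(...)` call plus the link extraction, and is faked here so the snippet runs without network access.

```python
def iter_links(fetch_links, page_size, limit=None):
    """Yield links page by page until a page is empty or `limit` is reached."""
    offset = 0
    while limit is None or offset < limit:
        links = fetch_links(offset)
        if not links:
            # Assumed end-of-results signal: an empty page
            break
        for link in links:
            yield link
        offset += page_size


# Stand-in for the real HTTP call: three fake "articles" served two per page
fake_articles = ['/a', '/b', '/c']


def fake_fetch(offset):
    return fake_articles[offset:offset + 2]


all_links = list(iter_links(fake_fetch, page_size=2))
```

Passing `limit` keeps the original behaviour of capping the number of requested offsets.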
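You may need to tweak the code so that it supports different categories, sorting, and so on. You can also let BeautifulSoup use the lxml parser under the hood, which is generally faster: instead of BeautifulSoup(response.content), use BeautifulSoup(response.content, "lxml"). You would need to install lxml for that.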
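
One possible way to make the category, sorting and parser choices adjustable is sketched below. `build_params` and `pick_parser` are hypothetical helpers, not part of requests or BeautifulSoup, and the `'politics'` category value is made up for illustration; the values the site actually accepts are not known here. The snippet targets Python 3 and uses only the standard library.

```python
import importlib.util


def build_params(offset, page_size=24, category='', orderby='date', time=''):
    """Return the form fields for one 'load more' request."""
    return {
        'action': 'load_more',
        'numPosts': page_size,
        'offset': offset,
        'category': category,
        'orderby': orderby,
        'time': time,
    }


def pick_parser():
    """Return 'lxml' when it is installed, else the built-in 'html.parser'."""
    if importlib.util.find_spec('lxml') is not None:
        return 'lxml'
    return 'html.parser'


# 'politics' is a made-up category value used only for illustration
params = build_params(offset=24, category='politics')
parser = pick_parser()
```

These would slot into the loop as `session.post(url, data=build_params(offset))` and `BeautifulSoup(response.content, pick_parser())`.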
This is how you can adapt the solution to Scrapy:

    try:
        from urllib.parse import urlencode  # Python 3
    except ImportError:
        from urllib import urlencode  # Python 2

    from scrapy import Item, Field, Request, Spider


    class PostItem(Item):
        # Fields populated in parse_post()
        url = Field()
        title = Field()
        authors = Field()


    class mySpider(Spider):
        name = "mytest"
        allowed_domains = ['www.ijreview.com']

        def start_requests(self):
            page_size = 25
            headers = {'User-Agent': 'Scrapy spider',
                       'X-Requested-With': 'XMLHttpRequest',
                       'Host': 'www.ijreview.com',
                       'Origin': 'http://www.ijreview.com',
                       'Accept': '*/*',
                       'Referer': 'http://www.ijreview.com/',
                       'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}
            # One POST request per page of results
            for offset in range(0, 200, page_size):
                yield Request('http://www.ijreview.com/wp-admin/admin-ajax.php',
                              method='POST',
                              headers=headers,
                              body=urlencode(
                                  {'action': 'load_more',
                                   'numPosts': page_size,
                                   'offset': offset,
                                   'category': '',
                                   'orderby': 'date',
                                   'time': ''}))

        def parse(self, response):
            # The AJAX response is a list of article links; follow each one
            for link in response.xpath('//ul/li/a/@href').extract():
                yield Request(link, callback=self.parse_post)

        def parse_post(self, response):
            item = PostItem()
            item['url'] = response.url
            item['title'] = response.xpath('//title/text()').extract()[0].strip()
            item['authors'] = response.xpath('//span[@class="author"]/text()').extract()[0].strip()
            return item

Output:
{'authors': u'Kyle Becker',
'title': u'17 Reactions to the \u2018We Don\u2019t Have a Strategy\u2019 Gaffe That May Haunt the Rest of Obama\u2019s Presidency',
'url': 'http://www.ijreview.com/2014/08/172569-25-reactions-obamas-dont-strategy-gaffe-may-haunt-rest-presidency/'}
...
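
The spider sends one POST per page offset. As a rough sketch of what those request bodies look like, the snippet below builds them with the standard library only (Python 3 `urllib.parse`), without running Scrapy. `build_bodies` is a hypothetical helper, the `total=200` cutoff is taken from the spider's loop, and stepping through the offsets with `range` is assumed to be the intent.

```python
from urllib.parse import urlencode, parse_qs


def build_bodies(page_size=25, total=200):
    """Return one URL-encoded POST body per page offset."""
    bodies = []
    for offset in range(0, total, page_size):
        bodies.append(urlencode({
            'action': 'load_more',
            'numPosts': page_size,
            'offset': offset,
            'category': '',
            'orderby': 'date',
            'time': '',
        }))
    return bodies


bodies = build_bodies()
# keep_blank_values=True keeps the empty 'category' and 'time' fields when decoding
offsets = [int(parse_qs(b, keep_blank_values=True)['offset'][0]) for b in bodies]
print(offsets)
```

Decoding the bodies again is a cheap way to check which offsets will be requested before pointing the spider at the live site.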
It depends on how the "more articles" link actually works. Can you share the actual link to the website? – alecxe 2014-08-30 14:30:06
@alecxe It's ijreview.com – Anish 2014-08-30 14:30:33