我試圖刮scrape craigslist使用scrapy,並已成功獲取網址,但現在我想要從網頁中的網頁提取數據。以下是代碼:scrapy使用scrapy
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist.items import CraigslistItem
class craigslist_spider(BaseSpider):
name = "craigslist_unique"
allowed_domains = ["craiglist.org"]
start_urls = [
"http://sfbay.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time",
"http://newyork.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addThree=internship",
"http://seattle.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time"
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select("//span[@class='pl']")
items = []
for site in sites:
item = CraigslistItem()
item['title'] = site.select('a/text()').extract()
item['link'] = site.select('a/@href').extract()
#item['desc'] = site.select('text()').extract()
items.append(item)
hxs = HtmlXPathSelector(response)
#print title, link
return items
我是新來scrapy和無法弄清楚至於如何真正打URL(HREF)和URL的網頁中獲取數據,並在這樣做了所有的URL。的start_urls
由於您正在抓取,請使用'CrawlSpider'。閱讀文檔中的幾個示例。 – Blender