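Scrapy Yahoo Groups spider
I'm trying to scrape Y! Groups and I can get data from one page, but that's it. I have some basic rules, but they're clearly not correct. Has anyone already solved this?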
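from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field


# Item definition assumed from the fields used in parse_item below.
class YgroupItem(Item):
    title = Field()
    pubDate = Field()
    desc = Field()


class YgroupSpider(CrawlSpider):
    name = "yahoo.com"
    allowed_domains = ["launch.groups.yahoo.com"]
    start_urls = [
        "http://launch.groups.yahoo.com/group/random_public_ygroup/post"
    ]
    # Rule 1 follows message links but has no callback, so those pages are
    # never parsed; rule 2 sends every remaining link to parse_item.
    rules = (
        Rule(SgmlLinkExtractor(allow=('message', 'messages'), deny=('mygroups',))),
        Rule(SgmlLinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        # The XPaths are absolute, so they are evaluated against the whole page.
        hxs = HtmlXPathSelector(response)
        item = YgroupItem()
        item['title'] = hxs.select('//title').extract()
        item['pubDate'] = hxs.select('//abbr[@class="updated"]/text()').extract()
        item['desc'] = hxs.select("//div[contains(concat(' ',normalize-space(@class),' '),' entry-content ')]/text()").extract()
        return item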
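Nice, thanks. I probably should have expanded on this: I want groupname/message/1, groupname/message/2, and so on (these are aliases for the /post?id=averylongidstringthat URLs, which can't be used to address message 1 or 2). – linkingarts 2011-03-27 04:07:17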
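One likely reason the crawl stops short is rule ordering: a CrawlSpider hands each link to the first rule that matches it, so message links land on the rule that has no callback. The following is a minimal stdlib-only sketch of that first-match behaviour, not Scrapy itself. The tuples are hypothetical stand-ins for Rule and SgmlLinkExtractor, the `/message/\d+$` pattern is an assumption based on the URL shapes mentioned in the comment above, and `callback_for` is an illustrative helper.

```python
import re

# Hypothetical stand-ins for Rule(SgmlLinkExtractor(...)): (allow, deny, callback).
ORIGINAL_RULES = [
    ((r"message", r"messages"), (r"mygroups",), None),
    ((), (), "parse_item"),
]
# Assumed revision: a single rule limited to numbered message pages.
REVISED_RULES = [
    ((r"/message/\d+$",), (r"mygroups",), "parse_item"),
]


def callback_for(url, rules):
    """Return the callback of the first rule that accepts url.

    None means either that no rule matched or that the matching rule
    has no callback; in both cases the page is not parsed.
    """
    for allow, deny, callback in rules:
        # An empty allow tuple accepts every URL, as in SgmlLinkExtractor().
        allowed = not allow or any(re.search(p, url) for p in allow)
        denied = any(re.search(p, url) for p in deny)
        if allowed and not denied:
            return callback
    return None


BASE = "http://launch.groups.yahoo.com/group/random_public_ygroup"
message_url = BASE + "/message/1"
post_url = BASE + "/post?id=averylongidstringthat"

print(callback_for(message_url, ORIGINAL_RULES))  # None
print(callback_for(message_url, REVISED_RULES))   # parse_item
print(callback_for(post_url, REVISED_RULES))      # None
```

In the spider itself, the equivalent change would probably be to put callback='parse_item' and follow=True on the message rule, so that message pages are both parsed and used to discover further message links.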