lxml.html parses it just fine. Just use it instead of the bundled HtmlXPathSelector.
    import lxml.html as lxml

    bad_html = """<a href="example.com/page1.html">Site1</a><br/>
    <a href="example.com/page2.html">Site2</a><br/>
    <a href="example.com/page3.html">Site3</a><br/>"""

    tree = lxml.fromstring(bad_html)

    for link in tree.iterfind('a'):
        print(link.attrib['href'])
Result:
example.com/page1.html
example.com/page2.html
example.com/page3.html
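lxml.html also ships a generic link walker, `iterlinks()`, which finds link-carrying attributes at any nesting depth — handy when the anchors are not direct children of the parsed root. A minimal sketch on the same broken fragment:

```python
import lxml.html

bad_html = """<a href="example.com/page1.html">Site1</a><br/>
<a href="example.com/page2.html">Site2</a><br/>
<a href="example.com/page3.html">Site3</a><br/>"""

# lxml.html repairs the fragment into a well-formed tree first.
tree = lxml.html.fromstring(bad_html)

# iterlinks() walks the whole tree and yields (element, attribute, url, pos)
# for every link-carrying attribute (href, src, ...), in document order.
for element, attribute, url, pos in tree.iterlinks():
    print(attribute, url)
```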
So, if you want to use this approach with a CrawlSpider, you just need to write a simple (or complex) link extractor.
For example:
    import lxml.html as lxml

    class SimpleLinkExtractor:
        def extract_links(self, response):
            tree = lxml.fromstring(response.body)
            links = tree.xpath('a/@href')
            return links
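Nothing in that class depends on the Scrapy machinery, so you can exercise it on its own. The `FakeResponse` stand-in below is an assumption for illustration only — a real Scrapy `Response` also exposes `.body`:

```python
import lxml.html as lxml

class SimpleLinkExtractor:
    def extract_links(self, response):
        tree = lxml.fromstring(response.body)
        # 'a/@href' matches anchors that are direct children of the parsed root
        links = tree.xpath('a/@href')
        return links

# Hypothetical minimal stand-in for a Scrapy Response; only .body is needed here.
class FakeResponse:
    def __init__(self, body):
        self.body = body

body = """<a href="example.com/page1.html">Site1</a><br/>
<a href="example.com/page2.html">Site2</a><br/>"""

extractor = SimpleLinkExtractor()
print(extractor.extract_links(FakeResponse(body)))
```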
And then use it in your spider:
    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']

        rules = (
            Rule(SimpleLinkExtractor(), callback='parse_item'),
        )

        # etc ...
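One caveat: CrawlSpider reads `link.url` from whatever `extract_links()` returns when it builds follow-up Requests, so a drop-in extractor should return Link-like objects rather than raw strings. A sketch using a namedtuple as a stand-in for `scrapy.link.Link` (an assumption for illustration — use the real Link class in an actual spider):

```python
from collections import namedtuple

import lxml.html as lxml

# Stand-in for scrapy.link.Link; CrawlSpider reads link.url from each
# extracted link when turning it into a Request.
Link = namedtuple('Link', ['url'])

class SimpleLinkExtractor:
    def extract_links(self, response):
        tree = lxml.fromstring(response.body)
        # '//a/@href' finds anchors at any depth, not just direct children
        return [Link(url=href) for href in tree.xpath('//a/@href')]

# Hypothetical minimal stand-in for a Scrapy Response, for illustration.
class FakeResponse:
    def __init__(self, body):
        self.body = body

links = SimpleLinkExtractor().extract_links(FakeResponse(
    '<html><body><p><a href="http://example.com/page1.html">1</a></p></body></html>'))
print([link.url for link in links])
```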
+1 for the detailed example. Yes, you're right, but this should also be fixed in the Scrapy codebase. – Medorator