How can Scrapy crawl more URLs?

As we can see in this code:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ul/li')
    items = []

    for site in sites:
        item = Website()
        item['name'] = site.select('a/text()').extract()
        item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()
        item['description'] = site.select('text()').extract()
        items.append(item)

    return items

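Scrapy only fetches a single page response and finds the URLs in that response. I think that is just a surface crawl!

But I want more URLs, followed down to a defined depth.

What can I do to achieve this?

Thanks!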

Source

2012-06-25 Harold

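I don't fully understand your question, but I noticed several problems in your code, some of which may be related to your question (see the comments in the code):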

sites = hxs.select('//ul/li')
items = []

for site in sites:
    item = Website()
    # extract() returns a list, so I guess .extract()[0] is what you expect here
    item['name'] = site.select('a/text()').extract()
    # With '//a[...]' you may expect to get only the links inside `site`, but it
    # actually selects the links from the entire page; use './/a[...]' instead.
    # And, again, this returns a list, not a single URL.
    item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()

Source

2012-06-25 08:10:20 warvariuc