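Scrapy - Issues scraping a simple website

I want to use Scrapy to set up a simple spider that periodically checks a webpage and pulls simple data on published articles (title and abstract URL).

I set up the spider as follows: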

class JournalSpider(Spider):
    name = "journal"
    allowed_domains = ["ametsoc.org"]
    start_urls = [
        "http://journals.ametsoc.org/toc/wefo/current/"
    ]

    def parse(self, response):
        journalTitle = Selector(response).xpath('//*[@id="journalBlurbPanel"]/div[2]/h3/text()').extract()[0]
        journalIssue = Selector(response).xpath('//*[@id="articleToolsHeading"]/text()').extract()[0].strip()  # remove whitespace at start and end

        # find all articles for the issue and parse each one individually
        articles = Selector(response).xpath('//div[@id="rightColumn"]//table[@class="articleEntry"]')

        for article in articles:
            item = ArticleItem()
            item['journalTitle'] = journalTitle
            item['journalIssue'] = journalIssue
            item['title'] = article.xpath('//div[@class="art_title"]/text()').extract()[0]
            item['url'] = article.xpath('//a/@href').extract()[0]
            yield item

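This successfully pulls journalTitle and journalIssue, and it even iterates 25 times, which is the number of articles on the page, but every article gets the same title (the title of the first article). Also, I don't know where the url is being pulled from, because it doesn't correspond to anything I can see on the page: /action/ssostart?idp=https%3A%2F%2Fshib.ametsoc.org%2Fshibboleth%2Fidp

I suspect I have messed up my XPath strings (I'm new to working with XPath, so I wouldn't be surprised if that's the case!), or perhaps I am being served a different version of the site when I access it through Scrapy?

Any ideas?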

2016-01-06 oldo.nicho

The XPath expressions inside the loop must be context-specific and start with a dot:

item['title'] = article.xpath('.//div[@class="art_title"]/text()').extract()[0] 
item['url'] = article.xpath('.//a/@href').extract()[0]

You can also use the extract_first() method instead of extract()[0], and the response.xpath() shortcut instead of Selector(response).xpath().

2016-01-06 14:59:44 alecxe

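Such a simple solution, as they so often are. Thanks for the bonus syntax tips.

Hey @alecxe, what does '.' mean in XPath?

@NikhilParmar It basically means: start traversing and searching from the current node, rather than from the document root. – alecxe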
