2013-07-08 31 views

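Scrapy start_urls too long / "Unsupported URL scheme"

My Scrapy spider fails with an "Unsupported URL scheme" error. I want to scrape a single page of search results, but the spider keeps failing on this long dynamic URL.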

class RadioSpider(CrawlSpider): 
    name = 'radio' 
    allowed_domains = ['dashitradio.de'] 
    start_urls = ["[http://www.dashitradio.de/nc/search-in-playlist.html?tx_wfqbe_pi1%5BSTART%5D=2013-06-17%2006:00&tx_wfqbe_pi1%5BEND%5D=2013-06-21%2018:00&tx_wfqbe_pi1%5Bsubmit%5D=Suchen&tx_wfqbe_pi1%5Bshowpage%5D%5B3%5D=1][1]"] 
    rules = (
        Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = RadioItem()

        i['title'] = hxs.select("//*[@id='playlist-results']/table//tr[1]/td[1]/text()").extract()
        i['interpret'] = hxs.select("//*[@id='playlist-results']/table[1]//tr/td[2]/text()").extract()
        i['date'] = hxs.select("//*[@id='playlist-results']/table//tr[1]/td[3]/text()").extract()

        return i

If I open the URL in the Scrapy shell, it works fine as long as I wrap the URL in quotes, like "URL".

How can I get Scrapy to accept this string as a single URL in my spider?

Answers


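Your start_urls is set incorrectly: the "[" at the beginning and the "][1]" at the end make it an invalid URL. Those characters are leftover Markdown link syntax. With them in place the string no longer starts with "http://", so Scrapy cannot find a supported scheme.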
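
To see why the brackets matter, here is a minimal sketch that uses only Python's standard library (urllib.parse), not Scrapy itself. It compares the scheme parsed from the question's string with the one parsed from the cleaned URL. The names BROKEN, CLEAN and scheme_of are made up for this illustration, and Scrapy's own scheme check may differ in detail.

```python
from urllib.parse import urlsplit

# The value from the question, including the stray "[" and "][1]".
BROKEN = "[http://www.dashitradio.de/nc/search-in-playlist.html?tx_wfqbe_pi1%5BSTART%5D=2013-06-17%2006:00&tx_wfqbe_pi1%5BEND%5D=2013-06-21%2018:00&tx_wfqbe_pi1%5Bsubmit%5D=Suchen&tx_wfqbe_pi1%5Bshowpage%5D%5B3%5D=1][1]"

# The same URL with the Markdown link residue removed.
CLEAN = "http://www.dashitradio.de/nc/search-in-playlist.html?tx_wfqbe_pi1%5BSTART%5D=2013-06-17%2006:00&tx_wfqbe_pi1%5BEND%5D=2013-06-21%2018:00&tx_wfqbe_pi1%5Bsubmit%5D=Suchen&tx_wfqbe_pi1%5Bshowpage%5D%5B3%5D=1"


def scheme_of(url):
    """Return the URL scheme as parsed by the standard library."""
    return urlsplit(url).scheme


print(repr(scheme_of(BROKEN)))  # '' (a scheme cannot start with "[")
print(repr(scheme_of(CLEAN)))   # 'http'
```

The length of the URL and the number of query parameters play no role here; only the leading "[" stops the scheme from being recognized.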

I have updated the spider code based on your comments:

from scrapy.item import Item, Field 
from scrapy.selector import HtmlXPathSelector 
from scrapy.spider import BaseSpider 


class RadioItem(Item): 
    title = Field() 
    interpret = Field() 
    date = Field() 


class RadioSpider(BaseSpider): 
    name = 'radio' 
    allowed_domains = ['dashitradio.de'] 
    start_urls = ["http://www.dashitradio.de/nc/search-in-playlist.html?tx_wfqbe_pi1%5BSTART%5D=2013-06-17%2006:00&tx_wfqbe_pi1%5BEND%5D=2013-06-21%2018:00&tx_wfqbe_pi1%5Bsubmit%5D=Suchen&tx_wfqbe_pi1%5Bshowpage%5D%5B3%5D=1"] 

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        rows = hxs.select("//div[@id='playlist-results']/table/tbody/tr")
        for row in rows:
            item = RadioItem()

            item['title'] = row.select(".//td[1]/text()").extract()[0]
            item['interpret'] = row.select(".//td[2]/text()").extract()[0]
            item['date'] = row.select(".//td[3]/text()").extract()[0]

            yield item

Save it as my_spider.py and run it with runspider:

scrapy runspider my_spider.py -o output.json 

You will see the following in output.json:

{"date": "2013-06-21 17:48:00", "interpret": "MUMFORD & SONS", "title": "I WILL WAIT"} 
{"date": "2013-06-21 17:44:00", "interpret": "TASMIN ARCHER", "title": "SLEEPING SATELLITE"} 
{"date": "2013-06-21 17:40:03", "interpret": "ROBIN THICKE", "title": "BLURRED LINES (feat. T.I. & PHARRELL)"} 
{"date": "2013-06-21 17:35:02", "interpret": "TINA TURNER", "title": "TWO PEOPLE"} 
{"date": "2013-06-21 17:31:02", "interpret": "BON JOVI", "title": "WHAT ABOUT NOW"} 
{"date": "2013-06-21 17:28:03", "interpret": "ROXETTE", "title": "SHE'S GOT NOTHING ON (BUT THE RADIO)"} 
{"date": "2013-06-21 17:18:01", "interpret": "GNARLS BARKLEY", "title": "CRAZY"} 
{"date": "2013-06-21 17:08:01", "interpret": "FLO RIDA", "title": "WHISTLE"} 
{"date": "2013-06-21 17:05:03", "interpret": "WHAM", "title": "WAKE ME UP BEFORE YOU GO GO"} 
{"date": "2013-06-21 17:00:03", "interpret": "P!NK FEAT. NATE RUESS", "title": "JUST GIVE ME A REASON"} 
{"date": "2013-06-21 16:48:01", "interpret": "SHAKIRA", "title": "WHENEVER, WHEREVER"} 
{"date": "2013-06-21 16:44:00", "interpret": "ALPHAVILLE", "title": "BIG IN JAPAN"} 
{"date": "2013-06-21 16:40:01", "interpret": "XAVIER NAIDOO", "title": "BEI MEINER SEELE"} 
{"date": "2013-06-21 16:36:02", "interpret": "SANTANA", "title": "SMOOTH"} 
{"date": "2013-06-21 16:32:01", "interpret": "OLLY MURS", "title": "ARMY OF TWO"} 
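
The spider's parse method walks the table one row at a time and reads the cells relative to each row, which is why every item holds a single title, artist and date. This can be tried without Scrapy. Below is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors. The SAMPLE markup is an assumed, simplified stand-in for the real page, which was not posted, and extract_rows is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup; the real page structure may differ.
SAMPLE = """
<div id="playlist-results">
  <table>
    <tbody>
      <tr><td>SLEEPING SATELLITE</td><td>TASMIN ARCHER</td><td>2013-06-21 17:44:00</td></tr>
      <tr><td>CRAZY</td><td>GNARLS BARKLEY</td><td>2013-06-21 17:18:01</td></tr>
    </tbody>
  </table>
</div>
"""


def extract_rows(markup):
    """Build one dict per table row, reading cells relative to that row."""
    root = ET.fromstring(markup.strip())  # root is the <div> element
    items = []
    for row in root.findall("./table/tbody/tr"):
        cells = [td.text for td in row.findall("td")]
        items.append({
            "title": cells[0],
            "interpret": cells[1],
            "date": cells[2],
        })
    return items


for item in extract_rows(SAMPLE):
    print(item)
```

An absolute expression such as the question's "//*[@id='playlist-results']/table[1]//tr/td[2]/text()" would instead return the second cell of every row at once, so the fields of one item would not line up.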

Hope that helps.


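If I remove the brackets on both sides of the URL, I get an empty .json file with no data in it. I tried that before posting the question. I think Scrapy does not accept my long link because it has so many parameters in it, so it splits it into several parts. – user2559824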


It should work. Can you post the entire code of your spider? – alecxe


Please edit your question and paste the code, thanks. – alecxe