2016-10-12 95 views

This is my code. It looks correct to me, but it doesn't work properly. Can anyone help me find what is wrong here?

HEADER_XPATH = ['//h1[@class="story-body__h1"]//text()']  
AUTHOR_XPATH = ['//span[@class="byline__name"]//text()'] 
PUBDATE_XPATH = ['//div/@data-datetime'] 
TAGS_XPATH = ['']
CATEGORY_XPATH = ['//span[@rev="news|source"]//text()']
TEXT = ['//div[@property="articleBody"]//p//text()'] 
INTERLINKS = ['//div[@class="story-body__link"]//p//a/@href'] 
DATE_FORMAT_STRING = '%Y-%m-%d' 

class BBCSpider(Spider):
    name = "bbc"
    allowed_domains = ["bbc.com"]
    sitemap_urls = [
        'http://www.bbc.com/news/sitemap/',
        'http://www.bbc.com/news/technology/',
        'http://www.bbc.com/news/science_and_environment/']

    def parse_page(self, response):
        items = []
        item = ContentItems()
        item['title'] = process_singular_item(self, response, HEADER_XPATH, single=True)
        item['resource'] = urlparse(response.url).hostname
        item['author'] = process_array_item(self, response, AUTHOR_XPATH, single=False)
        item['pubdate'] = process_date_item(self, response, PUBDATE_XPATH, DATE_FORMAT_STRING, single=True)
        item['tags'] = process_array_item(self, response, TAGS_XPATH, single=False)
        item['category'] = process_array_item(self, response, CATEGORY_XPATH, single=False)
        item['article_text'] = process_article_text(self, response, TEXT)
        item['external_links'] = process_external_links(self, response, INTERLINKS, single=False)
        item['link'] = response.url
        items.append(item)
        return items

What is the problem? Maybe explain what the problem is? The input? The desired output? What are you trying to do? – MooingRawr


The problem is that when I run my code, nothing happens. It doesn't go through the pages! I think my mistake is in the variables. @MooingRawr – nik

Answer


Your spider is simply badly structured, and because of that it does nothing.
A scrapy.Spider spider requires a start_urls class attribute, which should contain the list of urls the spider will use to start crawling. All of these urls call back to the class method parse, which means that method is required as well.

Your spider has a sitemap_urls class attribute, but it is not used anywhere. Your spider also has a parse_page class method, which is likewise never used anywhere.
So in short, your spider should look something like this:

class BBCSpider(Spider):
    name = "bbc"
    allowed_domains = ["bbc.com"]
    start_urls = [
        'http://www.bbc.com/news/sitemap/',
        'http://www.bbc.com/news/technology/',
        'http://www.bbc.com/news/science_and_environment/']

    def parse(self, response):
        # This is a page with all of the articles
        article_urls = []  # find article urls in the page here
        for url in article_urls:
            yield Request(url, self.parse_page)

    def parse_page(self, response):
        # This is an article page
        item = ContentItems()
        # populate item
        return item
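The `# find article urls` placeholder is the part the answer leaves open. As a minimal, hedged sketch of one way to fill it: collect the page's hrefs, make them absolute, and keep only those that look like article pages. The URL pattern `/news/<section>-<digits>` is an assumption for illustration, not something stated in the answer:

```python
import re
from urllib.parse import urljoin, urlparse

# Assumed article-path pattern, e.g. /news/technology-12345678
# (an illustrative guess, not confirmed by the answer).
ARTICLE_RE = re.compile(r"^/news/[a-z_-]+-\d+$")

def find_article_urls(base_url, hrefs):
    """Return absolute URLs for hrefs that look like article pages."""
    urls = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if ARTICLE_RE.match(urlparse(absolute).path):
            urls.append(absolute)
    return urls
```

Inside `parse`, the result of such a helper would be what the `for url in article_urls` loop iterates over before yielding each `Request`.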

I really appreciate it – nik


@nik Great! If it solved your problem, please click the "accept answer" button on the left. – Granitosaurus


Of course, I will. Could you give me an example of what I have to put in article_urls? (Because I am dealing with a large number of urls.) – nik
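As a hypothetical illustration for the comment above: in a real Scrapy callback you would typically collect hrefs with `response.css('a::attr(href)').getall()` and then filter the list. The stdlib-only sketch below shows the same link-collection idea without Scrapy; the HTML snippet and the `/news/` filter are assumptions chosen purely for illustration:

```python
from html.parser import HTMLParser

# In Scrapy you would instead write:
#   article_urls = response.css('a::attr(href)').getall()
# and then filter that list. This sketch mimics it with the stdlib.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Record the href attribute of every <a> tag we encounter
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

html = '<div><a href="/news/technology-100">x</a><a href="/about">y</a></div>'
collector = LinkCollector()
collector.feed(html)
# Keep only links that look like news pages (illustrative filter)
article_urls = [h for h in collector.hrefs if h.startswith("/news/")]
```

Each URL kept this way would then be yielded as a `Request(url, self.parse_page)` from `parse`, as shown in the answer.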