2014-07-25

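Recursive webscraper using Scrapy not printing text from pages to screen

I am using the Python.org build of Python 2.7 (64-bit) on Windows Vista 64-bit. I am building a recursive web scraper with Scrapy. It works when extracting text from a single page, but not when crawling multiple pages. The code is below: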

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import Selector 
from scrapy.item import Item 
from scrapy.spider import BaseSpider 
from scrapy import log 
from scrapy.cmdline import execute 
from scrapy.utils.markup import remove_tags 
import time 


class ExampleSpider(CrawlSpider): 
    name = "goal3" 
    allowed_domains = ["whoscored.com"] 
    start_urls = ["http://www.whoscored.com"] 
    download_delay = 1 
    rules = [Rule(SgmlLinkExtractor(allow=()), 
        follow=True), 
      Rule(SgmlLinkExtractor(allow=()), callback='parse_item') 
    ] 

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for scripts in scripts:
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')


execute(['scrapy','crawl','goal3']) 

A sample of the output I get from this is as follows:

2014-07-25 19:31:32+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Players/133260/Show/Michael-Ngoo> (referer: http://www.whoscored.com/Players/14170/Show/Ishmael-Miller) 
2014-07-25 19:31:33+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/160/Show/England-Charlton> (referer: http://www.whoscored.com/Players/10794/Show/Rafik-Djebbour) 
2014-07-25 19:31:33+0100 [goal3] DEBUG: Filtered offsite request to 'www.cafc.co.uk': <GET http://www.cafc.co.uk/page/Home> 
2014-07-25 19:31:34+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Matches/721465/Live/England-Championship-2013-2014-Nottingham-Forest-Charlton> (referer: http://www.whoscored.com/Players/10794/Show/Rafik-Djebbour) 
2014-07-25 19:31:36+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/126/News> (referer: http://www.whoscored.com/Teams/1426/News) 
2014-07-25 19:31:36+0100 [goal3] DEBUG: Filtered offsite request to 'www.fcsochaux.fr': <GET http://www.fcsochaux.fr/fr/index.php?lng=fr> 
2014-07-25 19:31:37+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/976/News> (referer: http://www.whoscored.com/Teams/1426/News) 
2014-07-25 19:31:37+0100 [goal3] DEBUG: Filtered offsite request to 'www.grenoblefoot38.fr': <GET http://www.grenoblefoot38.fr/> 
2014-07-25 19:31:37+0100 [goal3] DEBUG: Filtered offsite request to 'www.as.com': <GET http://www.as.com/futbol/articulo/leones-ponen-manos-obra-grenoble/20120713dasdaiftb_52/Tes> 
2014-07-25 19:31:38+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/56/News> (referer: http://www.whoscored.com/Teams/53/News) 
2014-07-25 19:31:38+0100 [goal3] DEBUG: Filtered offsite request to 'www.realracingclub.es': <GET http://www.realracingclub.es/default.aspx> 
2014-07-25 19:31:39+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/125/News> (referer: http://www.whoscored.com/Teams/146/News) 
2014-07-25 19:31:39+0100 [goal3] DEBUG: Filtered offsite request to 'www.asnl.net': <GET http://www.asnl.net/pages/club/entraineurs.html> 
2014-07-25 19:31:40+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/425/News> (referer: http://www.whoscored.com/Teams/24/News) 
2014-07-25 19:31:40+0100 [goal3] DEBUG: Filtered offsite request to 'www.dbu.dk': <GET http://www.dbu.dk/> 
2014-07-25 19:31:42+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/282/News> (referer: http://www.whoscored.com/Teams/50/News) 
2014-07-25 19:31:42+0100 [goal3] DEBUG: Filtered offsite request to 'www.fc-koeln.de': <GET http://www.fc-koeln.de/index.php?id=10> 
2014-07-25 19:31:43+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/58/News> (referer: http://www.whoscored.com/Teams/131/News) 
2014-07-25 19:31:43+0100 [goal3] DEBUG: Filtered offsite request to 'www.realvalladolid.es': <GET http://www.realvalladolid.es/> 
2014-07-25 19:31:44+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/973/News> (referer: http://www.whoscored.com/Teams/145/News) 
2014-07-25 19:31:44+0100 [goal3] DEBUG: Filtered offsite request to 'www.fifci.org': <GET http://www.fifci.org/> 

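I can understand the external links being filtered out, since they are outside the scope of the crawler. What I don't understand is why the only output is a 'DEBUG:' message and the link to each page, with none of the page text printed, especially as all of these requests returned a successful HTTP status code of 200.

Can anyone see what the problem is here?

Thanks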

Answers


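You only need a single rule that has both follow=True and the callback. When several rules match the same link, CrawlSpider uses only the first one, so in your spider the second rule, the one carrying parse_item, is never applied: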

rules = [Rule(SgmlLinkExtractor(), follow=True, callback='parse_item')] 
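
To illustrate why merging the rules makes a difference, here is a minimal sketch in plain Python 3 (no Scrapy required) of first-match rule selection, which is how CrawlSpider handles a link matched by more than one rule. The `Rule` tuple, the `first_matching_rule` helper, and the empty regex standing in for `allow=()` are hypothetical names made up for this illustration; they are not Scrapy's API.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a crawl rule: a URL regex, an optional
# callback name, and the follow flag. Not Scrapy's real Rule class.
Rule = namedtuple("Rule", ["pattern", "callback", "follow"])


def first_matching_rule(url, rules):
    """Return the first rule whose pattern matches url, or None."""
    for rule in rules:
        if re.search(rule.pattern, url):
            return rule
    return None


url = "http://www.whoscored.com/Teams/126/News"

# Layout from the question: both rules match every link (the empty
# pattern models allow=()), so the second rule is never chosen.
two_rules = [
    Rule(pattern="", callback=None, follow=True),
    Rule(pattern="", callback="parse_item", follow=False),
]

# Layout from the answer: one rule that both follows and calls back.
one_rule = [Rule(pattern="", callback="parse_item", follow=True)]

chosen_before = first_matching_rule(url, two_rules)
chosen_after = first_matching_rule(url, one_rule)

print(chosen_before.callback)  # None
print(chosen_after.callback)   # parse_item
```

With the two-rule layout the selected rule has no callback, which is consistent with the log showing only "Crawled (200)" lines and no printed text.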

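Hi, I tried this. It seems to work, but it only returns the footer links from the page. I will need to look at the HTML and find out how the body text is marked up, as '//p' doesn't work in this case. Thanks – gdogg371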