2015-05-13 129 views
0

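Scrapy returns an empty output when extracting an element from a table using XPath

I have been trying to scrape this website, which has details of oil wells in Colorado: https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=12307555&type=WELL

Scrapy scrapes the website and returns the URL when I scrape it, but when I try to extract an element inside a table using its XPath (the county of the oil well), all I get is an empty output, i.e. [].

This happens for any element of the page that I try to access.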

Here is my spider:

import scrapy
import json


class coloradoSpider(scrapy.Spider):
    name = "colorado"
    allowed_domains = ["cogcc.state.co.us"]
    start_urls = ["https://cogcc.state.co.us/cogis/ProductionWellMonthly.asp?APICounty=123&APISeq=07555&APIWB=00&Year=All"]

    def parse(self, response):
        url = response.url
        response.selector.remove_namespaces()
        # This is the XPath that comes back empty ([]).
        variable = response.xpath("/html/body/blockquote/font/font/table/tbody/tr[3]/th[3]").extract()
        print url, variable

This is the output:

2015-05-13 20:14:54+0530 [scrapy] INFO: Scrapy 0.24.6 started (bot: tutorial) 
2015-05-13 20:14:54+0530 [scrapy] INFO: Optional features available: ssl, http11 
2015-05-13 20:14:54+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2015-05-13 20:14:54+0530 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-13 20:14:55+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-13 20:14:55+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-13 20:14:56+0530 [scrapy] INFO: Enabled item pipelines:
2015-05-13 20:14:56+0530 [colorado] INFO: Spider opened
2015-05-13 20:14:56+0530 [colorado] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-13 20:14:56+0530 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-13 20:14:56+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-13 20:15:02+0530 [colorado] DEBUG: Crawled (200) <GET https://cogcc.state.co.us/cogis/ProductionWellMonthly.asp?APICounty=123&APISeq=07555&APIWB=00&Year=All> (referer: None)
https://cogcc.state.co.us/cogis/ProductionWellMonthly.asp?APICounty=123&APISeq=07555&APIWB=00&Year=All []
2015-05-13 20:15:02+0530 [colorado] INFO: Closing spider (finished) 
2015-05-13 20:15:02+0530 [colorado] INFO: Dumping Scrapy stats: 
     {'downloader/request_bytes': 292, 
     'downloader/request_count': 1, 
     'downloader/request_method_count/GET': 1, 
     'downloader/response_bytes': 366770, 
     'downloader/response_count': 1, 
     'downloader/response_status_count/200': 1, 
     'finish_reason': 'finished', 
     'finish_time': datetime.datetime(2015, 5, 13, 14, 45, 2, 349000), 
     'log_count/DEBUG': 3, 
     'log_count/INFO': 7, 
     'response_received_count': 1, 
     'scheduler/dequeued': 1, 
     'scheduler/dequeued/memory': 1, 
     'scheduler/enqueued': 1, 
     'scheduler/enqueued/memory': 1, 
     'start_time': datetime.datetime(2015, 5, 13, 14, 44, 56, 77000)} 
2015-05-13 20:15:02+0530 [colorado] INFO: Spider closed (finished) 

If I go back a few nodes in the XPath, I do get output: Scrapy returns the HTML of the table.

Thanks!

+0

What exactly do you want from the site, 'JSAND' for example? – eLRuLL

Answers

1

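This looks like an XPath problem. The site's developers probably omitted tbody from the markup, but a browser inserts it automatically when it renders the page, so an XPath taken from the browser does not match the raw HTML that Scrapy receives. You can find more information about this here.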
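The mismatch can be reproduced offline. The sketch below is only an illustration, not the asker's code: it uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors, and both the RAW_HTML snippet and the find_cells helper are hypothetical. The markup has no tbody, as the raw page presumably does not, so the path that includes a tbody step matches nothing.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the markup a server might send:
# the rows sit directly under <table>, with no <tbody> in between.
RAW_HTML = """
<html>
  <body>
    <table>
      <tr><td>County:</td><td>WELD #123</td></tr>
    </table>
  </body>
</html>
"""


def find_cells(markup, path):
    """Return the text of every element matched by an ElementTree path."""
    root = ET.fromstring(markup.strip())  # root is the <html> element
    return [cell.text for cell in root.findall(path)]


# Path as a browser inspector would show it, including the inserted <tbody>.
with_tbody = find_cells(RAW_HTML, "./body/table/tbody/tr/td")

# Same path without the <tbody> step, matching the markup as served.
without_tbody = find_cells(RAW_HTML, "./body/table/tr/td")

print(with_tbody)
print(without_tbody)
```

The first list is empty because no tbody element exists in the parsed tree; the second contains the two cell texts. In Scrapy the same comparison could be made by running both expressions through response.xpath in the shell.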

So, since you need the county value (WELD #123) from the given page, a possible XPath would be:

In [20]: response.xpath('/html/body/font/table/tr[6]/td[2]//text()').extract() 
Out[20]: [u'WELD        #123'] 
+0

Thanks, removing 'tbody' made it work! – user3266563

0

It looks like an XPath problem. Maybe try this:

//blockquote/font/font/table//tr/td[3]//text()

I don't think you need the tbody tag.
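When an XPath is copied out of a browser, the tbody steps can also be removed mechanically. The helper below is a hypothetical sketch (strip_tbody is not part of Scrapy); it relies on naive string splitting and assumes that no predicate in the expression contains a slash.

```python
def strip_tbody(xpath):
    """Drop every 'tbody' step (with any [n] predicate) from an XPath string."""
    steps = xpath.split("/")
    # Compare only the element name, so 'tbody' and 'tbody[1]' are both removed.
    kept = [step for step in steps if step.split("[")[0] != "tbody"]
    return "/".join(kept)


# The expression from the question, with the tbody step removed.
browser_xpath = "/html/body/blockquote/font/font/table/tbody/tr[3]/th[3]"
cleaned_xpath = strip_tbody(browser_xpath)
print(cleaned_xpath)
```

This is only safe when the raw HTML really lacks tbody. If the server does send tbody elements, stripping them would break an XPath that was working.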