2013-07-01 75 views

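Unable to retrieve XPath results using Scrapy

Hello, I am trying to use XPath to get the title and link of the cells with class listCell. I believe my code is correct because I get no errors, but when I export the results to a CSV file, the output file is empty. I have also tested my Scrapy spider on other sites, such as Amazon, where it works fine, but it does not work on this site. Please help!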

# Requires: from scrapy.selector import HtmlXPathSelector (Scrapy 0.16)
def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    # Each table row inside the form is treated as one record.
    sites = hxs.select('//form[@id="listForm"]/table/tbody/tr')
    items = []
    for site in sites:
        item = CarrierItem()
        # Attribute predicates need '=': [@class="listCell"]
        item['title'] = site.select('.//td[@class="listCell"]/a/text()').extract()
        item['link'] = site.select('.//td[@class="listCell"]/a/@href').extract()
        items.append(item)
    return items

This is my HTML. Could the extraction be failing because the page contains JavaScript?

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> 
<title> Carrier IQ DIS 2.4 :: All Devices</title> 
<script type="text/javascript" src="/dis/js/main.js"> 
<script type="text/javascript" src="/dis/js/validate.js"> 
<link rel="stylesheet" type="text/css" href="/dis/css/portal.css"> 
<link rel="stylesheet" type="text/css" href="/dis/css/style.css"> 
<script type="text/javascript"> 

    .... 

<form id="listForm" name="listForm" method="POST" action=""> 
<table> 
<thead> 
<tbody> 
<tr> 
<td class="crt">1</td> 
<td class="listCell" align="center"> 
<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&maxlength=100">6505550000</a> 
</td> 
<td class="listCell" align="center"> 
<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&subscrbid=6505550000&mdn=6505550000&maxlength=100">probe0</a> 
</td> 
<td class="listCell" align="center"> 
<td class="listCell" align="center"> 
<td class="cell" align="center">2013-07-01 13:39:38.820</td> 
<td class="cell" align="left">1 - SMS_PullRequest_CS</td> 
<td class="listCell" align="right"> 
<td class="listCell" align="center"> 
<td class="listCell" align="center"> 
</tr> 
</tbody> 
</table> 
</form> 

Output

C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>scrapy crawl dis -o iqDis.csv -t csv
2013-07-01 10:50:18-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: carrier)
2013-07-01 10:50:18-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-01 10:50:19-0500 [dis] INFO: Spider opened
2013-07-01 10:50:19-0500 [dis] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 10:50:19-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp> (referer: None)
2013-07-01 10:50:19-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/dis/login>
2013-07-01 10:50:20-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp)
2013-07-01 10:50:20-0500 [dis] DEBUG: 


    Successfully logged in. Let's start crawling! 



2013-07-01 10:50:21-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-07-01 10:50:21-0500 [dis] DEBUG: 


    We got data! 



2013-07-01 10:50:21-0500 [dis] INFO: Closing spider (finished) 
2013-07-01 10:50:21-0500 [dis] INFO: Dumping Scrapy stats: 
    {'downloader/request_bytes': 1382, 
    'downloader/request_count': 4, 
    'downloader/request_method_count/GET': 3, 
    'downloader/request_method_count/POST': 1, 
    'downloader/response_bytes': 147888, 
    'downloader/response_count': 4, 
    'downloader/response_status_count/200': 3, 
    'downloader/response_status_count/302': 1, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2013, 7, 1, 15, 50, 21, 221000), 
    'log_count/DEBUG': 12, 
    'log_count/INFO': 4, 
    'request_depth_max': 2, 
    'response_received_count': 3, 
    'scheduler/dequeued': 4, 
    'scheduler/dequeued/memory': 4, 
    'scheduler/enqueued': 4, 
    'scheduler/enqueued/memory': 4, 
    'start_time': datetime.datetime(2013, 7, 1, 15, 50, 19, 42000)} 
2013-07-01 10:50:21-0500 [dis] INFO: Spider closed (finished) 

Answer

Try a simplified XPath:

sites = hxs.select('//form[@id="listForm"]//tr') 

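The tbody element is (in some cases) not present in the HTML that the server sends; the browser generates it when it builds the DOM. An XPath that requires tbody therefore matches nothing in the raw response.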
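To illustrate the point about tbody, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selector (an assumption made only so the example runs without Scrapy; ElementTree supports just a subset of XPath). The markup is a hypothetical, trimmed version of the table as a server might send it, with no tbody element and shortened href values.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw markup: what the server might send, i.e. without the
# <tbody> that a browser inserts when it builds the DOM. The href values
# are shortened placeholders, not the real URLs.
RAW_HTML = """
<html><body>
<form id="listForm" name="listForm" method="POST" action="">
<table>
<tr>
<td class="crt">1</td>
<td class="listCell" align="center"><a href="/dis/packages.jsp?mdn=6505550000">6505550000</a></td>
<td class="listCell" align="center"><a href="/dis/packages.jsp?hwdid=probe0">probe0</a></td>
<td class="cell" align="center">2013-07-01 13:39:38.820</td>
</tr>
</table>
</form>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path that insists on a tbody step: finds nothing in the raw markup.
with_tbody = root.findall(".//form[@id='listForm']/table/tbody/tr")

# Path that allows any depth between the form and the rows: finds the row.
without_tbody = root.findall(".//form[@id='listForm']//tr")

# Collect link text and href from each listCell anchor, one record per row.
rows = []
for tr in without_tbody:
    links = tr.findall("./td[@class='listCell']/a")
    rows.append({
        "title": [a.text for a in links],
        "link": [a.get("href") for a in links],
    })

print(len(with_tbody), len(without_tbody))
print(rows)
```

If the shorter path still returns nothing in the real spider, the rows are probably not in the downloaded response at all, which is worth checking by looking at response.body rather than at the browser's element inspector.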

I tried what you suggested, Sjaak, but it did not work. Same as before: nothing is extracted and there are no errors. – Gio