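Unable to retrieve XPath with Scrapy
Hello, I am trying to get the XPath for the title and link text of the cells with class listCell. I believe I am doing it right, because I get no errors, but when I export the results to a CSV file, the output file is empty. I have also tested my Scrapy spider on other sites, such as Amazon, and it works fine there, but not on this site. Please help!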
def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    # One <tr> per device row inside the listForm table
    sites = hxs.select("//form[@id='listForm']/table/tbody/tr")
    items = []
    for site in sites:
        item = CarrierItem()
        # Link text and href of every listCell anchor in this row
        item['title'] = site.select(".//td[@class='listCell']/a/text()").extract()
        item['link'] = site.select(".//td[@class='listCell']/a/@href").extract()
        items.append(item)
    return items
Here is my HTML. Could this be impossible because the page has JavaScript in its HTML?
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title> Carrier IQ DIS 2.4 :: All Devices</title>
<script type="text/javascript" src="/dis/js/main.js">
<script type="text/javascript" src="/dis/js/validate.js">
<link rel="stylesheet" type="text/css" href="/dis/css/portal.css">
<link rel="stylesheet" type="text/css" href="/dis/css/style.css">
<script type="text/javascript">
....
<form id="listForm" name="listForm" method="POST" action="">
<table>
<thead>
<tbody>
<tr>
<td class="crt">1</td>
<td class="listCell" align="center">
<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&maxlength=100">6505550000</a>
</td>
<td class="listCell" align="center">
<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&subscrbid=6505550000&mdn=6505550000&maxlength=100">probe0</a>
</td>
<td class="listCell" align="center">
<td class="listCell" align="center">
<td class="cell" align="center">2013-07-01 13:39:38.820</td>
<td class="cell" align="left">1 - SMS_PullRequest_CS</td>
<td class="listCell" align="right">
<td class="listCell" align="center">
<td class="listCell" align="center">
</tr>
</tbody>
</table>
</form>
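One thing worth ruling out: browser inspectors commonly show a `<tbody>` that the browser inserted itself, so it may be absent from the raw HTML that Scrapy downloads. The sketch below assumes that is the case here (not confirmed for this site). It uses the standard library's `xml.etree.ElementTree` instead of Scrapy, and `RAW` is a hypothetical, simplified stand-in for the server response with shortened hrefs; `extract_rows` is an illustrative helper, not part of the spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the raw server response (assumption):
# no <tbody>, shortened hrefs. ElementTree needs well-formed XML, so "&" is
# written as "&amp;" here.
RAW = """
<html><body>
<form id="listForm" name="listForm" method="POST" action="">
<table>
<tr>
<td class="crt">1</td>
<td class="listCell" align="center"><a href="/dis/packages.jsp?view=list&amp;hwdid=probe0">6505550000</a></td>
<td class="listCell" align="center"><a href="/dis/packages.jsp?view=list&amp;mdn=6505550000">probe0</a></td>
<td class="cell" align="center">2013-07-01 13:39:38.820</td>
</tr>
</table>
</form>
</body></html>
""".strip()


def extract_rows(root, row_path):
    """Return one dict per matched row, holding the text and href of its listCell links."""
    rows = []
    for tr in root.findall(row_path):
        links = tr.findall("td[@class='listCell']/a")
        rows.append({
            "title": [a.text for a in links],
            "link": [a.get("href") for a in links],
        })
    return rows


root = ET.fromstring(RAW)

# Path that requires <tbody>: matches nothing when the raw markup has none.
with_tbody = extract_rows(root, ".//form[@id='listForm']/table/tbody/tr")

# Path that accepts <tr> at any depth under the table: works either way.
without_tbody = extract_rows(root, ".//form[@id='listForm']/table//tr")

print(len(with_tbody))
print(without_tbody)
```

If the raw page source (View Source, not the inspector) really has no `<tbody>`, the corresponding change in the spider would be to select rows with `//form[@id='listForm']/table//tr`.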
Output
C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>scrapy crawl dis -o iqDis.csv -t csv
2013-07-01 10:50:18-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: carrier)
2013-07-01 10:50:18-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-01 10:50:19-0500 [dis] INFO: Spider opened
2013-07-01 10:50:19-0500 [dis] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 10:50:19-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp> (referer: None)
2013-07-01 10:50:19-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/dis/login>
2013-07-01 10:50:20-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp)
2013-07-01 10:50:20-0500 [dis] DEBUG:
Successfully logged in. Let's start crawling!
2013-07-01 10:50:21-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-07-01 10:50:21-0500 [dis] DEBUG:
We got data!
2013-07-01 10:50:21-0500 [dis] INFO: Closing spider (finished)
2013-07-01 10:50:21-0500 [dis] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1382,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 147888,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 1, 15, 50, 21, 221000),
'log_count/DEBUG': 12,
'log_count/INFO': 4,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2013, 7, 1, 15, 50, 19, 42000)}
2013-07-01 10:50:21-0500 [dis] INFO: Spider closed (finished)
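I tried what you suggested, Sjaak, and it did not work. The result is the same as before: nothing gets extracted and there are no errors. – Gio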