2014-10-20 46 views
0

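Getting the text of a comment from lxml

I am trying to get the contents of a _Comment. I have researched how to do it, but I don't know how to access the function from the td element to grab the text. If it helps, I am using XPaths through the Python Scrapy module.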

td = None [_Element] 
    <built-in function Comment> = None [_Comment] 
    a = None [_Element] 
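
For reference, here is a minimal sketch of how a _Comment node can be read with lxml. The `<built-in function Comment>` in the dump above is the node's `tag`: lxml uses the `etree.Comment` factory function as the tag of comment nodes, and the comment's content is in its `.text` attribute. The `snippet` string is a shortened stand-in for the real page, used here as an assumption for illustration only.

```python
import lxml.html
from lxml import etree

# Shortened stand-in for the real markup (an assumption for illustration).
snippet = ('<table><tr><td><!-- BOUNDARY -->'
           '<a name="R2L4AFEICL8GG6"></a></td></tr></table>')

root = lxml.html.fromstring(snippet)
td = root.find('.//td')

# Option 1: walk the children; comment nodes have etree.Comment as their tag.
comments_by_tag = [child.text for child in td if child.tag is etree.Comment]

# Option 2: select comment nodes directly with the XPath comment() node test.
comments_by_xpath = [c.text for c in td.xpath('./comment()')]

print(comments_by_tag)
print(comments_by_xpath)
```

Both options yield the text between `<!--` and `-->`, including its surrounding spaces.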

The HTML for the td element is:

<table class="crIFrameReviewList"> 

    <tr> 
     <td> 

<!-- BOUNDARY --> 
<a name="R2L4AFEICL8GG6"></a><br /> 


<div style="margin-left:0.5em;"> 

     <div style="margin-bottom:0.5em;"> 
     304 of 309 people found the following review helpful 
     </div> 
     <div style="margin-bottom:0.5em;"> 
     <span style='margin-left: -5px;'><img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-5-0._V192240867_.gif" width="64" alt="5.0 out of 5 stars" title="5.0 out of 5 stars" height="12" border="0" /> </span> 
     <b>Great Travel Zoom</b>, <nobr>April 9, 2014</nobr> 
     </div> 
     <div style="margin-bottom:0.5em;"> 

     <div class="tiny" style="margin-bottom:0.5em;"> 
     <span class="crVerifiedStripe"><b class="h3color tiny" style="margin-right: 0.5em;">Verified Purchase</b><span class="tiny verifyWhatsThis">(<a href="http://www.amazon.com/gp/community-help/amazon-verified-purchase" target="AmazonHelp" onclick="amz_js_PopWin('http://www.amazon.com/gp/community-help/amazon-verified-purchase', 'AmazonHelp', 'width=400,height=500,resizable=1,scrollbars=1,toolbar=0,status=1');return false; ">What's this?</a>)</span></span> 
     </div> 
     <div class="tiny" style="margin-bottom:0.5em;"> 
     <b><span class="h3color tiny">This review is from: </span>Canon PowerShot SX700 HS Digital Camera (Black) (Electronics)</b> 
     </div> 

For the recent few years Canon has made great efforts to improve their travel-zoom compact cameras, and the new SX700 is their next remarkable achievement on that way. It's a little bit bigger than its predecessor (SX280) but it is very well built and has an attractive look and feel (I like the black one). It also got a new front grip which makes one-hand shooting more convenient, even when shooting video, since the Video button was moved from the back to the top and you can now use your thumb solely for holding the camera.<br /><br />Here is a brief list of the new camera pros & cons:<br /><br />PROS:<br />* A very good design and build quality with the attractive finish.<br />* A new powerful 30x optical zoom lens in just a pocket-size body.<br />* Incredible range from 25mm wide to 750mm telephoto for stills and video.<br />* Zoom Framing Assist - very useful new feature to compose your pictures at long telephoto.<br />* Very effective optical Intelligent Image Stabilization for... 


<a href="http://rads.stackoverflow.com/amzn/click/B00I58M26Y" target="_top">Read more</a> 
     <div style="padding-top: 10px; clear: both; width: 100%;"> 

Can you show the HTML you are working with? The desired output would also help. – alecxe 2014-10-20 19:16:37

@alecxe I updated the question with an edit. I have already converted the HTML into an lxml.etree.HTML object. – robert 2014-10-20 19:31:20

Are there multiple 'div' elements with 'class="reviewText"'? If not, you could use 'div[@class="reviewText"]/text()' as your 'xpath'. – 2014-10-20 19:36:21

Answers

1

Find the div with class="reviewText" using the .//div[@class="reviewText"] XPath expression, then dump the element to a string using tostring() with method="text":

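import lxml.html

# Paste the HTML of the page (or of the td element) here.
data = """
your html here
"""

td = lxml.html.fromstring(data)
# find() returns the first matching div, or None if nothing matches.
review = td.find('.//div[@class="reviewText"]')
# method="text" drops the tags; encoding="unicode" returns str instead of bytes.
print(lxml.html.tostring(review, method="text", encoding="unicode"))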

Prints:

54,000 RPM - It has a spinning disk drive that is way beyond our time...I bought 10 of these just for the hard drive, they blow SSD's out of the water. 
Seriously though... how does a well known computer company mistype an important spec? 
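
Note that the HTML posted in the question has no div with class="reviewText". There, the review body sits directly inside the outer div, after the nested "This review is from" block, so lxml stores it as tail text of the preceding elements. A hedged sketch of one way to collect it follows; the `snippet` string is a shortened stand-in for the question's markup and is an assumption for illustration, not the real page.

```python
import lxml.html

# Shortened stand-in for the question's markup (an assumption for illustration).
snippet = """
<div style="margin-bottom:0.5em;">
  <div class="tiny"><b><span class="h3color tiny">This review is from: </span>Canon PowerShot SX700 HS Digital Camera (Black) (Electronics)</b></div>
For the recent few years Canon has made great efforts...<br /><br />PROS:<br />* A very good design.
<a href="#">Read more</a>
</div>
"""

outer = lxml.html.fromstring(snippet.strip())

# ./text() selects only the text nodes that are direct children of the outer
# div, which skips the nested "This review is from" block and the link text.
parts = [t.strip() for t in outer.xpath('./text()')]
review = '\n'.join(p for p in parts if p)
print(review)
```

Selecting the outer div on the real page would still need an XPath of its own; this sketch only shows how the direct text nodes can be gathered once that element is in hand.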

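I edited the question to show what Amazon actually gives me. They use anti-scraping techniques on their site, and although I consider myself well experienced with XPaths, this one has me stumped. The closest I have come to an answer is '<built-in function Comment>'. If only I could access it. – robert 2014-10-20 20:10:32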

@robert OK, could you add the code you are using now? Thanks. – alecxe 2014-10-20 20:26:05