2017-06-20 20 views

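Python Selenium: scraping inconsistent fields

I am scraping some data from a website. Sometimes the vehicle description shows mileage, and other times it shows MPG. Here is what I am using for the XPath, along with the HTML.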

Here is the relevant part:

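from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait


def init_driver():
    # Run Chrome headless with a fixed window size
    options = webdriver.ChromeOptions()
    options.binary_location = '/usr/bin/google-chrome-stable'
    options.add_argument('headless')
    options.add_argument('window-size=1200x600')
    driver = webdriver.Chrome(chrome_options=options)
    # Attach a 5-second explicit wait to the driver for later use
    driver.wait = WebDriverWait(driver, 5)
    return driver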


def scrape(driver): 

    # ymm = year, make, model. All three attributes are in the header; parse and separate them before inserting into SQL
    ymm_element = driver.find_elements_by_xpath('//*[@id="compareForm"]/div/div/ul/li/div/div/h3') 
    engine_element = driver.find_elements_by_xpath('//*[@id="compareForm"]/div/div/ul/li/div/div/div[3]/dl[1]/dd[1]') 
    trans_element = driver.find_elements_by_xpath('//*[@id="compareForm"]/div/div/ul/li/div/div/div[3]/dl[1]/dd[2]') 
    milage_element = driver.find_elements_by_xpath('//*[@id="compareForm"]/div/div/ul/li/div/div/div[3]/dl[1]/dd[3]') 

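Because the order of the elements is not the same for every vehicle, I need to write it so that it looks up the title and retrieves the text that follows it.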

Here is the HTML, copied from Inspect Element in Chrome:

<div class="description"> 
    <dl>
        <dt>Engine:</dt> <dd>2.5L I-5 cyl<span class="separator">,</span></dd>
        <dt>Transmission:</dt> <dd>Manual<span class="separator">,</span></dd>
        <dt>Mileage:</dt> <dd>37,171 miles<span class="separator">,</span></dd>
        <dt>MPG Range:</dt> <dd>22/31<span class="separator">,</span></dd>
    </dl>
    <dl class="last">
        <dt>Exterior Color:</dt> <dd>Reflex Silver Metallic<span class="separator">,</span></dd>
        <dt>Interior Color:</dt> <dd>Titan Black<span class="separator">,</span></dd>
        <dt>Stock #:</dt> <dd>P3229</dd>
    </dl>
    <span class="ddc-more">More<span class="hellip">…</span></span>
<div class="calloutDetails"> 
<ul class="list-unstyled"> 
<li class="certified" style="margin-bottom: 10px;"><div class="badge "><img class="align-center" src="https://static.dealer.com/v8/global/images/franchise/white/en_US/logo-certified-volkswagen.gif?r=1356028132000" alt="Certified"></div></li><li class="carfax" style="margin-bottom: 10px;"><a href="http://www.carfax.com/cfm/ccc_displayhistoryrpt.cfm?partner=DLR_3&amp;vin=3VWHX7AT1EM600723" class="badge carfax-one-owner pointer" target="_blank"><img class="align-center" src="https://static.dealer.com/v8/global/images/franchise/white/logo-certified-carfax-one-owner-lrg.png?r=1405027620000" alt="Carfax One Owner"></a></li> 
</ul> 
</div> 
<div class="hproductDynamicArea"></div> 
</div> 

Basically, I need to write the XPath so that it searches for the text that comes after a given title.

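Also, my year, make, and model are all in the same element. Can you point me in the right direction or suggest a library for parsing the header?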
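For the year/make/model header, plain string splitting may be enough. A minimal sketch, assuming the header text starts with a four-digit year followed by a one-word make. The helper name `split_ymm` and the sample string are hypothetical, since the thread does not show the contents of the `h3` element:

```python
def split_ymm(header):
    """Split a 'YEAR MAKE MODEL...' header into (year, make, model).

    Assumes the make is a single word; multi-word makes such as
    'Land Rover' would need a lookup of known makes instead.
    """
    # maxsplit=2 keeps everything after the make together as the model
    parts = header.split(maxsplit=2)
    if len(parts) < 3 or not parts[0].isdigit():
        raise ValueError(f"unexpected header format: {header!r}")
    year, make, model = parts
    return int(year), make, model


# Hypothetical header text, for illustration only
print(split_ymm("2014 Volkswagen Jetta SportWagen 2.5L"))
```

The three values can then be inserted into separate SQL columns.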

Answers

First, with XPath you can use contains(), like this:

driver.find_elements_by_xpath("//dt[contains(text(),'Engine')]")

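It looks cleaner, is easier to use, and is more robust.

Second, read up on the XPath axes following-sibling, preceding-sibling, parent, and ancestor. They will help you build neat XPath locators: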

# [1] keeps only the dd immediately after each matching dt
driver.find_elements_by_xpath("//dt[contains(text(),'Engine:')]/following-sibling::dd[1]")
driver.find_elements_by_xpath("//dt[contains(text(),'Transmission:')]/following-sibling::dd[1]")
driver.find_elements_by_xpath("//dt[contains(text(),'Mileage:')]/following-sibling::dd[1]")

The XPaths above will work regardless of the order in which your HTML elements appear.
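Since the locators above differ only in the label, they could be generated by one small helper. A minimal sketch: the function name `dd_xpath` and its `relative` flag are hypothetical, and the `[1]` predicate (keep only the nearest following `dd`) is an assumption about the desired result:

```python
def dd_xpath(label, relative=False):
    """Build an XPath for the <dd> that directly follows the <dt> containing `label`."""
    if '"' in label:
        # XPath 1.0 string literals have no escape syntax, so this sketch
        # simply rejects labels that contain the delimiter
        raise ValueError("label must not contain a double quote")
    # ".//" searches below the current element, "//" searches the whole page
    prefix = ".//" if relative else "//"
    # [1] selects only the nearest following <dd> sibling
    return f'{prefix}dt[contains(text(), "{label}")]/following-sibling::dd[1]'


for label in ("Engine:", "Transmission:", "Mileage:", "MPG Range:"):
    print(dd_xpath(label))
```

Each resulting string can be passed to the same find call used in the answer.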

Thank you, I will do that. I had to switch to double quotes, but it works like a charm. I will also loop through each car one by one to avoid mismatches.

Sorry to bother you again, but how do I loop through the web elements:

def scrape(driver):
    cars = driver.find_elements_by_xpath('//div[@class="description"]')
    for car in cars:
        engine = car.find_element_by_xpath("//dt[contains(text(),'Engine')]/following-sibling::dd")
        mileage = car.find_element_by_xpath("//dt[contains(text(),'Mileage')]/following-sibling::dd")
        print(mileage.text, engine.text)
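One thing to watch in a loop like the one above: an XPath that begins with `//` is evaluated from the document root even when it is called on an element, so every iteration would return the first match on the page rather than the current car's value. Prefixing the expression with `.` makes it relative to the element. A minimal sketch under these assumptions: `scrape_cars` and `FIELDS` are names invented here, the generic `find_elements(by, value)` call is used with the strategy string `"xpath"` (the value of Selenium's `By.XPATH`), and a field that is absent for a car is reported as `None`:

```python
# Labels as they appear in the <dt> elements of the sample HTML
FIELDS = ("Engine:", "Mileage:", "MPG Range:")


def scrape_cars(driver):
    """Return one dict per <div class="description">, with None for absent fields."""
    results = []
    for car in driver.find_elements("xpath", '//div[@class="description"]'):
        row = {}
        for label in FIELDS:
            # The leading "." keeps the search inside this car's block
            xpath = f'.//dt[contains(text(), "{label}")]/following-sibling::dd[1]'
            # find_elements returns an empty list when nothing matches,
            # so a missing Mileage or MPG field does not raise
            matches = car.find_elements("xpath", xpath)
            if matches:
                # Drop the trailing "," from the separator <span>, if it is rendered
                row[label.rstrip(":")] = matches[0].text.strip().rstrip(",")
            else:
                row[label.rstrip(":")] = None
        results.append(row)
    return results
```

Called as `scrape_cars(init_driver())` after loading the page, this would yield a list such as one dict per vehicle, which also covers the case where a listing shows MPG but no mileage.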