2016-05-15

XPath to extract a link or href

I am trying to extract the links to similar apps from the Google Play Store page below, using XPath:

https://play.google.com/store/apps/details?id=com.mojang.minecraftpe 

Below is a screenshot of the links I want to extract (marked in green).

HTML sample:

<div class="details"> 
    <a href="/store/apps/details?id=com.imangi.templerun" class="card-click-target"></a> 
    <a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run 
    <span class="paragraph-end"/> 
    </a> 
    <div>....</div> 
    <div>....</div> 
</div> 

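I used the XPath below in the Chrome console to find a link, but it doesn't return the tag's href attribute. The same path does work for other attributes (for example, title).

The following XPath does not work (extracting href):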

//*[@id="body-content"]/div/div/div[2]/div[1]//*/a[2]/@href 

The following XPath works (extracting title):

//*[@id="body-content"]/div/div/div[2]/div[1]//*/a[2]/@title 


Python code

Answer


The HTML for each tile on the right-hand side of the linked page has the following format*:

<div class="details"> 
    <a href="/store/apps/details?id=com.imangi.templerun" class="card-click-target"></a> 
    <a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run 
    <span class="paragraph-end"/> 
    </a> 
    <div>....</div> 
    <div>....</div> 
</div> 

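It turns out that the <a> element with class="title" uniquely identifies your target <a> elements on that page, so the XPath can be as simple as:

//a[@class="title"]/@href 

In any case, the problem you noticed seems to be specific to Chrome's XPath evaluator**. Since you mentioned Python, a few lines of Python show that the XPath itself works fine: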

>>> from urllib.request import urlopen  # on Python 2: from urllib2 import urlopen
>>> from lxml import html 
>>> req = urlopen('https://play.google.com/store/apps/details?id=com.mojang.minecraftpe') 
>>> raw = req.read() 
>>> root = html.fromstring(raw) 
>>> [h for h in root.xpath("//a[@class='title']/@href")] 
['/store/apps/details?id=com.imangi.templerun', '/store/apps/details?id=com.lego.superheroes.dccomicsteamup', '/store/apps/details?id=com.turner.freefurall', '/store/apps/details?id=com.mtvn.Nickelodeon.GameOn', '/store/apps/details?id=com.disney.disneycrossyroad_goo', '/store/apps/details?id=com.rovio.angrybirdsstarwars.ads.iap', '/store/apps/details?id=com.rovio.angrybirdstransformers', '/store/apps/details?id=com.disney.dinostampede_goo', '/store/apps/details?id=com.turner.atskisafari', '/store/apps/details?id=com.moose.shopville', '/store/apps/details?id=com.DisneyDigitalBooks.SevenDMineTrain', '/store/apps/details?id=com.turner.copatoon', '/store/apps/details?id=com.turner.wbb2016', '/store/apps/details?id=com.tov.google.ben10Xenodrome', '/store/apps/details?id=com.turner.ggl.gumballrainbowruckus', '/store/apps/details?id=com.lego.starwars.theyodachronicles', '/store/apps/details?id=com.mojang.scrolls'] 
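The same XPath can also be checked offline against the trimmed tile markup, without fetching the live page. The following is only a sketch under a few assumptions: SAMPLE is a shortened copy of the HTML quoted above, BASE_URL is an assumed origin for resolving the relative links, and all variable names are illustrative. It also shows an alternative that selects the <a> elements and reads the attribute from each one, which avoids asking the evaluator for attribute nodes at all.

```python
from urllib.parse import urljoin

from lxml import html

# Shortened copy of the tile markup quoted above (illustrative only).
SAMPLE = """
<div class="details">
    <a href="/store/apps/details?id=com.imangi.templerun" class="card-click-target"></a>
    <a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run
    <span class="paragraph-end"/>
    </a>
</div>
"""

# Assumption: relative hrefs are resolved against the Play Store origin.
BASE_URL = "https://play.google.com"

root = html.fromstring(SAMPLE)

# Selecting attribute nodes: lxml returns them as plain strings.
hrefs = root.xpath('//a[@class="title"]/@href')
titles = root.xpath('//a[@class="title"]/@title')

# Alternative: select the elements, then read the attribute from each one.
hrefs_via_elements = [a.get("href") for a in root.xpath('//a[@class="title"]')]

# Turn the relative links into absolute URLs.
absolute_links = [urljoin(BASE_URL, h) for h in hrefs]
print(absolute_links)
```

Both approaches should give the same href values here. The element-based form may be the more convenient one in tools that display attribute nodes poorly.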

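*) Stripped-down version. You can take it as an example of how to provide a minimal HTML sample.

**) I could reproduce this problem: the @href values are printed as empty strings in my Chrome console. The same problem has happened to others as well: Chrome element inspector Xpath with @href won't show link text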