Selecting specific anchor tags with Python

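Here are some links I copied from a website I am scraping. The problem is that in the site's sitemap some main categories appear more than once, for example "Fashion", "Audio Visual" and "Computers Servers". I need each of these links only once. How can I achieve this? I tried using a variable `counter` to detect the second occurrence, but that did not help.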

<a href="http://www.example.com/networking-storage">Networking Storage</a> 
<a href="http://www.example.com/mobiles-tablets">Mobiles Tablets</a> 
<a href="http://www.example.com/fashion">Fashion</a> 
<a href="http://www.example.com/fashion">Fashion</a> 
<a href="http://www.example.com/printers-scanners">Printers Scanners</a> 
<a href="http://www.example.com/audio-visual">Audio Visual</a> 
<a href="http://www.example.com/audio-visual">Audio Visual</a> 
<a href="http://www.example.com/cameras">Cameras</a> 
<a href="http://www.example.com/computers-servers">Computers Servers</a> 
<a href="http://www.example.com/computers-servers">Computers Servers</a>

Here is my Python code to get these links:

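import requests
from lxml import html

# Fetch the sitemap page and parse it into an element tree.
mainPage = requests.get("http://www.example.com/catalog/seo_sitemap/category/?p=1")
mainTree = html.fromstring(mainPage.text)

# Print the href of every anchor on the page (duplicates included).
for mainCat in mainTree.cssselect('a'):
    print(mainCat.get('href'))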

It prints:

http://www.example.com/mobiles-tablets 
http://www.example.com/fashion 
http://www.example.com/fashion 
http://www.example.com/printers-scanners 
http://www.example.com/audio-visual 
http://www.example.com/audio-visual 
http://www.example.com/cameras 
http://www.example.com/computers-servers 
http://www.example.com/computers-servers

But I need this:

http://www.example.com/mobiles-tablets 
http://www.example.com/fashion 
http://www.example.com/printers-scanners 
http://www.example.com/audio-visual 
http://www.example.com/cameras 
http://www.example.com/computers-servers
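
The desired output keeps the page order and drops only the repeats. A minimal sketch of one way to get that is to remember which hrefs have already been seen. The names `unique_in_order` and `printed_hrefs` are hypothetical, and the list simply copies the output printed above. In the real script the values would come from `mainCat.get('href')`.

```python
def unique_in_order(hrefs):
    """Return hrefs with repeats dropped, keeping first-seen order."""
    seen = set()      # hrefs already emitted
    result = []
    for href in hrefs:
        if href not in seen:
            seen.add(href)
            result.append(href)
    return result


# Hypothetical input: the hrefs exactly as printed by the loop above.
printed_hrefs = [
    "http://www.example.com/mobiles-tablets",
    "http://www.example.com/fashion",
    "http://www.example.com/fashion",
    "http://www.example.com/printers-scanners",
    "http://www.example.com/audio-visual",
    "http://www.example.com/audio-visual",
    "http://www.example.com/cameras",
    "http://www.example.com/computers-servers",
    "http://www.example.com/computers-servers",
]

for href in unique_in_order(printed_hrefs):
    print(href)
```

Unlike a counter that only checks for a second occurrence, the `seen` set also handles a link that appears three or more times, or whose repeats are not adjacent.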

2015-10-17 Mansoor Akram

The following code works for me:

import requests
from lxml import html

# Sample markup copied from the question; some links appear twice.
s = '''<a href="http://www.example.com/mobiles-tablets">Mobiles Tablets</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/printers-scanners">Printers Scanners</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/cameras">Cameras</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>'''

# To scrape the live page instead, fetch it and parse mainPage.text:
# mainPage = requests.get("http://www.example.com/catalog/seo_sitemap/category/?p=1")
mainTree = html.fromstring(s)

# A set keeps each href only once.
lnks = set(i.get('href') for i in mainTree.cssselect('a'))
for i in lnks:
    print(i)

It prints (a set is unordered, so the order may differ from the page):

http://www.example.com/mobiles-tablets 
http://www.example.com/printers-scanners 
http://www.example.com/fashion 
http://www.example.com/audio-visual 
http://www.example.com/computers-servers 
http://www.example.com/cameras
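
The set removes the duplicates but loses the page order that the question asks for. As a hedged alternative, here is a minimal sketch that uses `dict.fromkeys`, which keeps insertion order in Python 3.7 and later. The names `sample`, `tree`, `hrefs` and `unique_hrefs` are illustrative, and the anchors are collected with `tree.iter('a')` rather than `cssselect`. That is a swapped-in choice, not the answer's original call.

```python
from lxml import html

# Same sample markup as in the answer above.
sample = '''<a href="http://www.example.com/mobiles-tablets">Mobiles Tablets</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/printers-scanners">Printers Scanners</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/cameras">Cameras</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>'''

tree = html.fromstring(sample)

# Collect every href in document order.
hrefs = [a.get('href') for a in tree.iter('a')]

# dict keys are unique and remember insertion order,
# so this drops repeats while keeping the first occurrence of each.
unique_hrefs = list(dict.fromkeys(hrefs))

for href in unique_hrefs:
    print(href)
```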

2015-10-17 09:55:39 SIslam

Yes, I know, but it prints some links twice because that is how they appear on the web page, and I need each of them only once.

Are the links duplicated in the page source itself? – SIslam

Yes, the links are duplicated like that in the page source. But I need to handle this programmatically somehow.
