2015-10-17 27 views

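Selecting specific anchor tags using Python

Here are some links I copied from a website I am scraping. The problem is that some main categories appear more than once in the sitemap, for example "Fashion", "Audio Visual" and "Computers Servers", but I need each of these links only once. How can I achieve this? I tried using a variable named "counter" to detect the second occurrence, but that did not help.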

<a href="http://www.example.com/networking-storage">Networking Storage</a> 
<a href="http://www.example.com/mobiles-tablets">Mobiles Tablets</a> 
<a href="http://www.example.com/fashion">Fashion</a> 
<a href="http://www.example.com/fashion">Fashion</a> 
<a href="http://www.example.com/printers-scanners">Printers Scanners</a> 
<a href="http://www.example.com/audio-visual">Audio Visual</a> 
<a href="http://www.example.com/audio-visual">Audio Visual</a> 
<a href="http://www.example.com/cameras">Cameras</a> 
<a href="http://www.example.com/computers-servers">Computers Servers</a> 
<a href="http://www.example.com/computers-servers">Computers Servers</a> 

Here is my Python code to fetch these links:

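import requests
from lxml import html

mainPage = requests.get("http://www.example.com/catalog/seo_sitemap/category/?p=1")
mainTree = html.fromstring(mainPage.text)

# Print the href of every anchor on the page (duplicates included).
for mainCat in mainTree.cssselect('a'):
    print(mainCat.get('href'))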

It prints:

http://www.example.com/mobiles-tablets 
http://www.example.com/fashion 
http://www.example.com/fashion 
http://www.example.com/printers-scanners 
http://www.example.com/audio-visual 
http://www.example.com/audio-visual 
http://www.example.com/cameras 
http://www.example.com/computers-servers 
http://www.example.com/computers-servers 

But I need this:

http://www.example.com/mobiles-tablets 
http://www.example.com/fashion 
http://www.example.com/printers-scanners 
http://www.example.com/audio-visual 
http://www.example.com/cameras 
http://www.example.com/computers-servers 

Answers

1

The following code works for me:

import requests 
from lxml.cssselect import CSSSelector 
from lxml import html 


s='''<a href="http://www.example.com/mobiles-tablets">Mobiles Tablets</a> 
<a href="http://www.example.com/fashion">Fashion</a> 
<a href="http://www.example.com/fashion">Fashion</a> 
<a href="http://www.example.com/printers-scanners">Printers Scanners</a> 
<a href="http://www.example.com/audio-visual">Audio Visual</a> 
<a href="http://www.example.com/audio-visual">Audio Visual</a> 
<a href="http://www.example.com/cameras">Cameras</a> 
<a href="http://www.example.com/computers-servers">Computers Servers</a> 
<a href="http://www.example.com/computers-servers">Computers Servers</a>''' 


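# To scrape the live page instead, fetch it and parse mainPage.text:
# mainPage = requests.get("http://www.example.com/catalog/seo_sitemap/category/?p=1")
mainTree = html.fromstring(s)

# A set keeps each href only once; note that it does not preserve page order.
lnks = set(i.get('href') for i in mainTree.cssselect('a'))
for i in lnks:
    print(i)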

It prints:

http://www.example.com/mobiles-tablets 
http://www.example.com/printers-scanners 
http://www.example.com/fashion 
http://www.example.com/audio-visual 
http://www.example.com/computers-servers 
http://www.example.com/cameras 
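
The output above comes back in arbitrary order because a set is unordered, while the question asks for the links in page order. As a rough sketch (not part of the original answer), `dict.fromkeys` drops repeats while keeping first-seen order on Python 3.7+. The `hrefs` list here is a hard-coded stand-in for `[a.get('href') for a in mainTree.cssselect('a')]`, and `unique_links` is a name made up for illustration.

```python
# Stand-in for: hrefs = [a.get('href') for a in mainTree.cssselect('a')]
hrefs = [
    "http://www.example.com/mobiles-tablets",
    "http://www.example.com/fashion",
    "http://www.example.com/fashion",
    "http://www.example.com/printers-scanners",
    "http://www.example.com/audio-visual",
    "http://www.example.com/audio-visual",
    "http://www.example.com/cameras",
    "http://www.example.com/computers-servers",
    "http://www.example.com/computers-servers",
]

# Dict keys are unique and (since Python 3.7) keep insertion order,
# so this removes later repeats without reordering the links.
unique_links = list(dict.fromkeys(hrefs))

for link in unique_links:
    print(link)
```

With the sample list this should print the six links in the same order as the "But I need this" listing in the question.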

Yes, I know, but it prints some links twice because that is how they appear on the page, and I need each of them only once.

Are the links duplicated in the page source itself? – SIslam

Yes, the links are duplicated like that in the page source. But I need to handle this programmatically somehow.
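
Since the duplicates are in the page source, one way to handle them programmatically is to remember which hrefs have already been seen while looping, instead of counting occurrences. This is only a sketch under assumptions: `iter_unique` is a hypothetical helper, not something from lxml or from the posts above, and `sample` stands in for `(a.get('href') for a in mainTree.cssselect('a'))`. The `None` entry models an `<a>` tag without an `href`, for which `.get('href')` returns `None`.

```python
def iter_unique(hrefs):
    """Yield each href the first time it appears, skipping missing ones."""
    seen = set()
    for href in hrefs:
        # Skip anchors without an href and anything already yielded.
        if href is None or href in seen:
            continue
        seen.add(href)
        yield href


# Stand-in for: (a.get('href') for a in mainTree.cssselect('a'))
sample = [
    "http://www.example.com/fashion",
    "http://www.example.com/fashion",
    None,
    "http://www.example.com/audio-visual",
    "http://www.example.com/cameras",
    "http://www.example.com/audio-visual",
]

result = list(iter_unique(sample))
for link in result:
    print(link)
```

Because it is a generator, the same helper could be dropped straight into the original `for` loop over `mainTree.cssselect('a')` without building an intermediate list first.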