2017-05-04 61 views
0

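Removing duplicate links while scraping

When I run the script I wrote in Python, I get a lot of duplicate results. Is there any way to get rid of these duplicates? Here is my script: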

import requests
from lxml import html

def Startpoint():
    default = "http://tennishub.co.uk"
    link = "http://tennishub.co.uk/"
    response = requests.get(link)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[@class="countylist"]')
    for title in titles:
        links = title.xpath('.//a/@href')
        for link in links:
            page = default + link
            Midpoint(page)

def Midpoint(address):
    default = "http://tennishub.co.uk"
    response = requests.get(address)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[@class="pagination"]')
    for title in titles:
        links = title.xpath('.//a/@href')
        for link in links:
            mlink = default + link
            print(mlink)

Startpoint()

Here is a screenshot of what I get:

[Screenshot of the output]

+1

When you scrape a link, add that URL to a `set`. Before scraping a link, check whether it is already in the set. – Barmar

+0

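Thanks, sir Barmar, for your reply. I've heard a lot about using set to remove duplicates, but the thing is I can't make use of it; I mean I don't know where and how to place it. – SIM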

+0

Updated the answer, hope it helps –

Answers

2

If order does not matter, wrap your links object in a set; this gets rid of duplicates, since str instances are hashable:

links = title.xpath('.//a/@href') 
links = set(links) 
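
If the order of the links does matter, one possible alternative (not part of the answer above, just a sketch) is to dedupe through dict keys, which are unique and keep insertion order on Python 3.7+. The helper name unique_in_order and the sample hrefs are made up for illustration:

```python
def unique_in_order(hrefs):
    """Drop duplicates while keeping the first occurrence of each href."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(hrefs))


# hypothetical hrefs, as a pagination block listed twice might yield them
hrefs = ["/a/2", "/a/3", "/a/2", "/a/4", "/a/3"]
print(unique_in_order(hrefs))  # ['/a/2', '/a/3', '/a/4']
```

In the script this would stand in for set(links), assuming the xpath result is a plain list of strings.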

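If you want to do this across all of the pages, then you need to filter out the links you have already processed, like this: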

import requests
from lxml import html


def Startpoint():
    default = "http://tennishub.co.uk"
    link = "http://tennishub.co.uk/"
    response = requests.get(link)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[@class="countylist"]')
    # hrefs already handled on this page
    processed_links = set()
    for title in titles:
        # keep only the hrefs that were not seen in an earlier block
        unprocessed_links = set(title.xpath('.//a/@href')) - processed_links
        for link in unprocessed_links:
            page = default + link
            Midpoint(page)
        processed_links |= unprocessed_links


def Midpoint(address):
    default = "http://tennishub.co.uk"
    response = requests.get(address)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[@class="pagination"]')
    # hrefs already printed for this address
    processed_links = set()
    for title in titles:
        # skip hrefs repeated in another pagination block of the same page
        unprocessed_links = set(title.xpath('.//a/@href')) - processed_links
        for link in unprocessed_links:
            mlink = default + link
            print(mlink)
        processed_links |= unprocessed_links


Startpoint()

This prints the unprocessed links for each title (the order may differ from yours, because sets are unordered), and the links are unique:

http://tennishub.co.uk/tennis-clubs-by-county/Middlesex/3 
http://tennishub.co.uk/tennis-clubs-by-county/Middlesex/10 
http://tennishub.co.uk/tennis-clubs-by-county/Middlesex/2 
http://tennishub.co.uk/tennis-clubs-by-county/Middlesex/4 
http://tennishub.co.uk/tennis-clubs-by-county/Hampshire/4 
http://tennishub.co.uk/tennis-clubs-by-county/Hampshire/7 
http://tennishub.co.uk/tennis-clubs-by-county/Hampshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Hampshire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Oxfordshire/4 
http://tennishub.co.uk/tennis-clubs-by-county/Oxfordshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Oxfordshire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Buckinghamshire/4 
http://tennishub.co.uk/tennis-clubs-by-county/Buckinghamshire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Buckinghamshire/5 
http://tennishub.co.uk/tennis-clubs-by-county/Buckinghamshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Berkshire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Berkshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Berkshire/4 
http://tennishub.co.uk/tennis-clubs-by-county/West Sussex/4 
http://tennishub.co.uk/tennis-clubs-by-county/West Sussex/3 
http://tennishub.co.uk/tennis-clubs-by-county/West Sussex/2 
http://tennishub.co.uk/tennis-clubs-by-county/East Sussex/3 
http://tennishub.co.uk/tennis-clubs-by-county/East Sussex/2 
http://tennishub.co.uk/tennis-clubs-by-county/Kent/8 
http://tennishub.co.uk/tennis-clubs-by-county/Kent/3 
http://tennishub.co.uk/tennis-clubs-by-county/Kent/4 
http://tennishub.co.uk/tennis-clubs-by-county/Kent/2 
http://tennishub.co.uk/tennis-clubs-by-county/Surrey/3 
http://tennishub.co.uk/tennis-clubs-by-county/Surrey/4 
http://tennishub.co.uk/tennis-clubs-by-county/Surrey/2 
http://tennishub.co.uk/tennis-clubs-by-county/Surrey/14 
http://tennishub.co.uk/tennis-clubs-by-county/Suffolk/2 
http://tennishub.co.uk/tennis-clubs-by-county/Suffolk/3 
http://tennishub.co.uk/tennis-clubs-by-county/Bedfordshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Hertfordshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Hertfordshire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Hertfordshire/7 
http://tennishub.co.uk/tennis-clubs-by-county/Hertfordshire/4 
http://tennishub.co.uk/tennis-clubs-by-county/Cambridgeshire/4 
http://tennishub.co.uk/tennis-clubs-by-county/Cambridgeshire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Cambridgeshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Norfolk/2 
http://tennishub.co.uk/tennis-clubs-by-county/Norfolk/3 
http://tennishub.co.uk/tennis-clubs-by-county/Essex/4 
http://tennishub.co.uk/tennis-clubs-by-county/Essex/2 
http://tennishub.co.uk/tennis-clubs-by-county/Essex/7 
http://tennishub.co.uk/tennis-clubs-by-county/Essex/3 
http://tennishub.co.uk/tennis-clubs-by-county/Cheshire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Cheshire/4 
http://tennishub.co.uk/tennis-clubs-by-county/Cheshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Cheshire/7 
http://tennishub.co.uk/tennis-clubs-by-county/Cumbria/2 
http://tennishub.co.uk/tennis-clubs-by-county/Lancashire/4 
http://tennishub.co.uk/tennis-clubs-by-county/Lancashire/9 
http://tennishub.co.uk/tennis-clubs-by-county/Lancashire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Lancashire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Warwickshire/6 
http://tennishub.co.uk/tennis-clubs-by-county/Warwickshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Warwickshire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Warwickshire/4 
http://tennishub.co.uk/tennis-clubs-by-county/Staffordshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Shropshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Worcestershire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Worcestershire/2 
http://tennishub.co.uk/tennis-clubs-by-county/South Yorkshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/West Yorkshire/3 
http://tennishub.co.uk/tennis-clubs-by-county/West Yorkshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/West Yorkshire/4 
http://tennishub.co.uk/tennis-clubs-by-county/West Yorkshire/5 
http://tennishub.co.uk/tennis-clubs-by-county/Northumberland/2 
http://tennishub.co.uk/tennis-clubs-by-county/East Yorkshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Durham/2 
http://tennishub.co.uk/tennis-clubs-by-county/North Yorkshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/North Yorkshire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Devon/5 
http://tennishub.co.uk/tennis-clubs-by-county/Devon/4 
http://tennishub.co.uk/tennis-clubs-by-county/Devon/2 
http://tennishub.co.uk/tennis-clubs-by-county/Devon/3 
http://tennishub.co.uk/tennis-clubs-by-county/Wiltshire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Wiltshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Dorset/2 
http://tennishub.co.uk/tennis-clubs-by-county/Dorset/3 
http://tennishub.co.uk/tennis-clubs-by-county/Somerset/2 
http://tennishub.co.uk/tennis-clubs-by-county/Somerset/4 
http://tennishub.co.uk/tennis-clubs-by-county/Somerset/3 
http://tennishub.co.uk/tennis-clubs-by-county/Gloucestershire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Gloucestershire/4 
http://tennishub.co.uk/tennis-clubs-by-county/Gloucestershire/5 
http://tennishub.co.uk/tennis-clubs-by-county/Gloucestershire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Cornwall/2 
http://tennishub.co.uk/tennis-clubs-by-county/Nottinghamshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Nottinghamshire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Lincolnshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Derbyshire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Derbyshire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Leicestershire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Leicestershire/2 
http://tennishub.co.uk/tennis-clubs-by-county/Leicestershire/4 
http://tennishub.co.uk/tennis-clubs-by-county/Northamptonshire/3 
http://tennishub.co.uk/tennis-clubs-by-county/Northamptonshire/2 
+0

This gets rid of duplicates within a single page, but not across different pages. – Barmar
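
To cover duplicates across different pages as well, one option might be to keep a single set alive for the whole run and pass it to whatever handles each page. A minimal sketch without any network access; the helper name collect_new_links and the sample hrefs are assumptions made for illustration, not taken from the site:

```python
BASE = "http://tennishub.co.uk"


def collect_new_links(hrefs, seen, base=BASE):
    """Return full URLs built from hrefs that are not in seen yet.

    seen is updated in place, so later calls skip these URLs too.
    """
    fresh = []
    for href in hrefs:
        url = base + href
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


# hypothetical hrefs, as two different pages might list them
page_one = ["/tennis-clubs-by-county/Kent/2", "/tennis-clubs-by-county/Kent/3",
            "/tennis-clubs-by-county/Kent/2"]
page_two = ["/tennis-clubs-by-county/Kent/3", "/tennis-clubs-by-county/Kent/4"]

seen = set()  # shared by every page of the run
for hrefs in (page_one, page_two):
    for url in collect_new_links(hrefs, seen):
        print(url)
```

Because seen is created once outside the loop, Kent/3 is printed only the first time it shows up, even though both sample pages list it.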

+0

Thanks, Azat Ibrakov, for your reply. But it has no effect at all on the duplicate results. – SIM

+0

Thanks, Azat Ibrakov, for your robust solution. It works perfectly. – SIM