2017-01-12 104 views

I have a table from which I want to pick up all the links, follow each link, and scrape the items inside the `td class="horse"` on each page. In short: scrape the table's links, click each link, and scrape the data there.

The table on the home page that holds all the links has the following code:

<table border="0" cellspacing="0" cellpadding="0" class="full-calendar"> 
    <tr> 
     <th width="160">&nbsp;</th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=NSW">NSW</a></th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=VIC">VIC</a></th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=QLD">QLD</a></th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=WA">WA</a></th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=SA">SA</a></th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=TAS">TAS</a></th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=ACT">ACT</a></th> 
     <th width="105"><a href="/FreeFields/Calendar.aspx?State=NT">NT</a></th> 
    </tr> 


    <tr class="rows"> 
     <td> 
      <p><span>FRIDAY 13 JAN</span></p> 
     </td> 

       <td> 
        <p> 

          <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Ballina">Ballina</a><br> 

          <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Gosford">Gosford</a><br> 

        </p> 
       </td> 

       <td> 
        <p> 

          <a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Ararat">Ararat</a><br> 

          <a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Cranbourne">Cranbourne</a><br> 

        </p> 
       </td> 

       <td> 
        <p> 

          <a href="/FreeFields/Form.aspx?Key=2017Jan13,QLD,Doomben">Doomben</a><br> 

        </p> 
       </td> 

The code I currently have finds the table and prints the link for each page:

from selenium import webdriver 
import requests 
from bs4 import BeautifulSoup 

# path to chromedriver
path_to_chromedriver = '/Users/Kirsty/Downloads/chromedriver' 

# ensure the browser is set to Chrome
browser = webdriver.Chrome(executable_path=path_to_chromedriver) 

# set the browser to the Racing Australia home page
url = 'http://www.racingaustralia.horse/' 
r = requests.get(url) 

soup = BeautifulSoup(r.content, "html.parser") 

# find the table & print the link for each page
table = soup.find('table', attrs={"class": "full-calendar"}).find_all('a') 
for link in table: 
    print(link.get('href')) 

I was wondering whether anyone could help me get the code to click all the links in the table and perform the following on each page:

g_data = soup.find_all("td", {"class": "horse"}) 
for item in g_data: 
    print(item.text) 

Thanks in advance.

What do you mean by "click the links"? That is, go to each linked page and then scrape everything there? – Signal

Yes. Each linked page contains a table made up of data like the following, for example:

FRIDAY 13 JAN — Ballina, Gosford; Ararat, Cranbourne – Kirsty

@KirstyDent Please put any relevant data, like the HTML in your comment above, into the question itself so that later readers can find it more easily. – JeffC

Answer

import requests, bs4, re 
from urllib.parse import urljoin 

start_url = 'http://www.racingaustralia.horse/' 

def make_soup(url): 
    r = requests.get(url) 
    soup = bs4.BeautifulSoup(r.text, 'lxml') 
    return soup 

def get_links(url): 
    # find every link into the /FreeFields/ section of the site
    soup = make_soup(url) 
    a_tags = soup.find_all('a', href=re.compile(r"^/FreeFields/")) 
    links = [urljoin(start_url, a['href']) for a in a_tags]  # convert relative urls to absolute urls
    return links 

def get_tds(link): 
    # print the text of every td with class "horse" on the linked page
    soup = make_soup(link) 
    tds = soup.find_all('td', class_="horse") 
    if not tds: 
        print(link, ': no td.horse tag found') 
    else: 
        for td in tds: 
            print(td.text) 

if __name__ == '__main__': 
    links = get_links(start_url) 
    for link in links: 
        get_tds(link) 

Output:

http://www.racingaustralia.horse/FreeFields/GroupAndListedRaces.aspx : no td.horse tag found 
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=NSW : no td.horse tag found 
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=VIC : no td.horse tag found 
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=QLD : no td.horse tag found 
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=WA : no td.horse tag found 
....... 

WEARETHECHAMPIONS 
STORMY HORIZON 
OUR RED JET 
SAPPER TOM 
MY COUSIN BOB 
ALL TOO HOT 
SAGA DEL MAR 
ZIGZOFF 
SASHAY AWAY 
SO SHE IS 
MILADY DUCHESS 

BeautifulSoup + requests can meet your needs.
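One detail worth spelling out: the hrefs in the table are relative (they start with `/FreeFields/`), so they can't be fetched directly. `urljoin` is what turns them into absolute URLs. A minimal standalone illustration, using URLs from the question:

```python
from urllib.parse import urljoin

base = 'http://www.racingaustralia.horse/'

# relative hrefs exactly as they appear in the calendar table
print(urljoin(base, '/FreeFields/Calendar.aspx?State=NSW'))
# http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=NSW

print(urljoin(base, '/FreeFields/Form.aspx?Key=2017Jan13,NSW,Ballina'))
# http://www.racingaustralia.horse/FreeFields/Form.aspx?Key=2017Jan13,NSW,Ballina
```

Because the hrefs start with `/`, they replace the path component of the base URL, which is exactly what you want here.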

Thank you so much! I'll try this now :) – Kirsty