2014-06-21 42 views
0

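Parsing horribly structured HTML?

What I want to do is create a Python function that takes the year, month, and day of the podcast I want to download. It would then parse the HTML and return the links to that day's podcast. For example: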

>>> get_download_links(year, month, day) 
['https://www.tytnetwork.com/?tytpm=44279&type=audio', # Hr 1 (audio) 
'https://www.tytnetwork.com/?tytpm=44277&type=audio'] # Hr 2 (audio) 

The page I'm trying to parse is http://www.tytnetwork.com/annual-archives/2014-main-show-archives/
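As a rough sketch of the first step, a numeric date can be turned into the archive URL for that year and the month heading to look for on the page. The helper name `archive_target` is hypothetical, and the year-based URL pattern is an assumption generalized from the 2014 address above. `calendar.month_name` gives English names only under the default locale.

```python
import calendar

# Assumed pattern: only the 2014 address appears in the question.
ARCHIVE_URL = "http://www.tytnetwork.com/annual-archives/{year}-main-show-archives/"


def archive_target(year, month, day):
    """Return (archive URL, month heading text, day label) for a numeric date."""
    url = ARCHIVE_URL.format(year=year)
    # Month headings on the page look like "June 2014".
    heading = "%s %d" % (calendar.month_name[month], year)
    return url, heading, str(day)


print(archive_target(2014, 6, 6))
```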

Here is an example of the first week of a month (including the weekday labels):

<tr> 
      <th class="tytca-mosname" colspan="5"> 
      <h3> 
      June 2014 
      </h3> 
      </th> 
      </tr> 
      <tr> 
      <th class="tytca-dayname"> 
      <h3> 
      Mon 
      </h3> 
      </th> 
      <th class="tytca-dayname"> 
      <h3> 
      Tue 
      </h3> 
      </th> 
      <th class="tytca-dayname"> 
      <h3> 
      Wed 
      </h3> 
      </th> 
      <th class="tytca-dayname"> 
      <h3> 
      Thu 
      </h3> 
      </th> 
      <th class="tytca-dayname"> 
      <h3> 
      Fri 
      </h3> 
      </th> 
      </tr> 
      <tr> 
      <td class="tytca-td"> 
      <div class="tytca-daynum"> 
      2 
      </div> 
      <p> 
      <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=42848&amp;type=audio" title="Click to download audio file"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=42851&amp;type=audio" title="Click to download audio file"> 
       Hr 2 
      </a> 
      <br/> 
      <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=42848&amp;type=video" title="Click to download video file"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=42851&amp;type=video" title="Click to download video file"> 
       Hr 2 
      </a> 
      <br/> 
      <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/02/tyt-june-2-2014-hour-1/" title="Click to watch the video"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/02/tyt-june-2-2014-hour-2/" title="Click to watch the video"> 
       Hr 2 
      </a> 
      </p> 
      </td> 
      <td class="tytca-td"> 
      <div class="tytca-daynum"> 
      3 
      </div> 
      <p> 
      <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43325&amp;type=audio" title="Click to download audio file"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43324&amp;type=audio" title="Click to download audio file"> 
       Hr 2 
      </a> 
      <br/> 
      <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43325&amp;type=video" title="Click to download video file"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43324&amp;type=video" title="Click to download video file"> 
       Hr 2 
      </a> 
      <br/> 
      <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/03/tyt-june-3-2014-hour-1/" title="Click to watch the video"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/03/tyt-june-3-2014-hour-2/" title="Click to watch the video"> 
       Hr 2 
      </a> 
      </p> 
      </td> 
      <td class="tytca-td"> 
      <div class="tytca-daynum"> 
      4 
      </div> 
      <p> 
      <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43635&amp;type=audio" title="Click to download audio file"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43633&amp;type=audio" title="Click to download audio file"> 
       Hr 2 
      </a> 
      <br/> 
      <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43635&amp;type=video" title="Click to download video file"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43633&amp;type=video" title="Click to download video file"> 
       Hr 2 
      </a> 
      <br/> 
      <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/04/tyt-june-4-2014-hour-1/" title="Click to watch the video"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/04/tyt-june-4-2014-hour-2/" title="Click to watch the video"> 
       Hr 2 
      </a> 
      </p> 
      </td> 
      <td class="tytca-td"> 
      <div class="tytca-daynum"> 
      5 
      </div> 
      <p> 
      <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44046&amp;type=audio" title="Click to download audio file"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44044&amp;type=audio" title="Click to download audio file"> 
       Hr 2 
      </a> 
      <br/> 
      <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44046&amp;type=video" title="Click to download video file"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44044&amp;type=video" title="Click to download video file"> 
       Hr 2 
      </a> 
      <br/> 
      <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/05/tyt-june-5-2014-hour-1/" title="Click to watch the video"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/05/tyt-june-5-2014-hour-2/" title="Click to watch the video"> 
       Hr 2 
      </a> 
      </p> 
      </td> 
      <td class="tytca-td"> 
      <div class="tytca-daynum"> 
      6 
      </div> 
      <p> 
      <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44279&amp;type=audio" title="Click to download audio file"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44277&amp;type=audio" title="Click to download audio file"> 
       Hr 2 
      </a> 
      <br/> 
      <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44279&amp;type=video" title="Click to download video file"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44277&amp;type=video" title="Click to download video file"> 
       Hr 2 
      </a> 
      <br/> 
      <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/06/tyt-june-6-2014-hour-1/" title="Click to watch the video"> 
       Hr 1 
      </a> 
      <br/> 
      <a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/06/tyt-june-6-2014-hour-2/" title="Click to watch the video"> 
       Hr 2 
      </a> 
      </p> 
      </td> 
      </tr> 

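I tried using Beautiful Soup, but the problem is that the page is so poorly structured that there doesn't seem to be a way to do what I want.

At this point, I'm turning this over to the Python experts here to help me.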

I assume you have an account and have already handled the authentication part? – merlin2011

@merlin2011 Of course. :D – Soviero

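Before "handing it over to the Python experts": please make an effort to build a minimal example that can run offline and contains only the relevant HTML snippet you are having trouble parsing, along with what you have tried so far.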
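A minimal offline example along those lines might look like the sketch below. It embeds a trimmed copy of the calendar markup from the question and uses the standard library's `html.parser` instead of Beautiful Soup, so that it runs with nothing installed. The class name `DayLinkParser` and the `SAMPLE` string are illustrative, and the sketch assumes every day cell follows the `tytca-daynum` / `tytca-audio` pattern shown in the question.

```python
from html.parser import HTMLParser

# Trimmed copy of the calendar markup from the question (two days only).
SAMPLE = """
<tr>
<td class="tytca-td"><div class="tytca-daynum">2</div><p>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=42848&amp;type=audio">Hr 1</a><br/>
<a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=42848&amp;type=video">Hr 1</a>
</p></td>
<td class="tytca-td"><div class="tytca-daynum">3</div><p>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43325&amp;type=audio">Hr 1</a><br/>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43324&amp;type=audio">Hr 2</a>
</p></td>
</tr>
"""


class DayLinkParser(HTMLParser):
    """Collect audio hrefs per day number from tytca-* calendar cells."""

    def __init__(self):
        super().__init__()
        self.links = {}          # day number -> list of audio hrefs
        self._in_daynum = False  # True while inside <div class="tytca-daynum">
        self._day = None         # day of the cell currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css = attrs.get("class")
        if tag == "div" and css == "tytca-daynum":
            self._in_daynum = True
        elif tag == "a" and css == "tytca-audio" and self._day is not None:
            # HTMLParser has already turned &amp; into & in attribute values.
            if "href" in attrs:
                self.links[self._day].append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_daynum = False

    def handle_data(self, data):
        # The day number is the only text inside the daynum div.
        if self._in_daynum and data.strip().isdigit():
            self._day = int(data)
            self.links.setdefault(self._day, [])


parser = DayLinkParser()
parser.feed(SAMPLE)
print(parser.links)
```

Because each link is filed under the most recent day number seen, the sketch does not depend on the surrounding table layout, only on the class names.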

Answers

1
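import re

import bs4
import requests

url = "http://www.tytnetwork.com/annual-archives/{year}-main-show-archives/"


def getPodCasts(m, d, y):
    """Return every link listed for month name m, day d, year y."""
    my_url = url.format(year=y)
    print(my_url)
    # Send a browser-like User-agent header with the request.
    html = requests.get(my_url, headers={'User-agent': 'Mozilla/5.0'}).content
    soup = bs4.BeautifulSoup(html, "html.parser")
    # Find the "June 2014"-style heading text, then climb h3 -> th -> tr.
    heading = soup.find_all(string=re.compile(r"^\s*%s.*%s" % (m, y)))[0]
    calendar_row_for_month = heading.parent.parent.parent
    # Walk the rows after the month header, stopping at the next month header.
    for sib in calendar_row_for_month.find_next_siblings("tr"):
        if sib.find("th", class_="tytca-mosname"):
            break
        for cell in sib.find_all("td", class_="tytca-td"):
            daynum = cell.find("div", class_="tytca-daynum")
            # Accept both "2" and "02" as the label for day 2.
            if daynum and daynum.get_text(strip=True).lstrip("0") == str(d):
                # Reading href from the tag yields "&" rather than "&amp;".
                return [a["href"] for a in cell.find_all("a", href=True)]
    raise ValueError("Error Date %s/%s/%s Not Found" % (m, d, y))


print(getPodCasts("June", 12, 2014))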
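
The question asks for only the audio links, while the cell for a given day also holds video download and watch links. One way to narrow the result, sketched here with only the standard library, is to keep the URLs whose query string carries `type=audio`. The helper name `filter_by_type` is hypothetical, and the sample list is copied from the June 6 cell in the question. The `unescape` call is there in case the links were pulled out of raw markup and still contain `&amp;`.

```python
from html import unescape
from urllib.parse import parse_qs, urlsplit


def filter_by_type(links, media_type="audio"):
    """Keep only links whose query string has type=<media_type>."""
    kept = []
    for link in links:
        # Undo HTML escaping (&amp; -> &) left over from scraping raw markup.
        clean = unescape(link)
        query = parse_qs(urlsplit(clean).query)
        if query.get("type") == [media_type]:
            kept.append(clean)
    return kept


links = [
    "https://www.tytnetwork.com/?tytpm=44279&amp;type=audio",
    "https://www.tytnetwork.com/?tytpm=44277&amp;type=audio",
    "https://www.tytnetwork.com/?tytpm=44279&amp;type=video",
    "https://www.tytnetwork.com/2014/06/06/tyt-june-6-2014-hour-1/",
]
print(filter_by_type(links))
```

Passing `media_type="video"` selects the video downloads instead; the watch links have no `type` parameter and are dropped either way.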

Wow... I'm going to need to spend a week chewing on all of this before I figure out exactly what it's doing. Thanks! ;) – Soviero