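Parsing horribly structured HTML?

I want to write a Python function that takes the year, month, and day of the podcast I want to download, parses the archive page's HTML, and returns the download links for that day's podcast. For example: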
>>> get_download_links(year, month, day)
['https://www.tytnetwork.com/?tytpm=44279&type=audio', # Hr 1 (audio)
'https://www.tytnetwork.com/?tytpm=44277&type=audio'] # Hr 2 (audio)
The page I'm trying to parse is http://www.tytnetwork.com/annual-archives/2014-main-show-archives/
Here is an example of the first week of a month (including the weekday labels):
<tr>
<th class="tytca-mosname" colspan="5">
<h3>
June 2014
</h3>
</th>
</tr>
<tr>
<th class="tytca-dayname">
<h3>
Mon
</h3>
</th>
<th class="tytca-dayname">
<h3>
Tue
</h3>
</th>
<th class="tytca-dayname">
<h3>
Wed
</h3>
</th>
<th class="tytca-dayname">
<h3>
Thu
</h3>
</th>
<th class="tytca-dayname">
<h3>
Fri
</h3>
</th>
</tr>
<tr>
<td class="tytca-td">
<div class="tytca-daynum">
2
</div>
<p>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=42848&type=audio" title="Click to download audio file">
Hr 1
</a>
<br/>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=42851&type=audio" title="Click to download audio file">
Hr 2
</a>
<br/>
<a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=42848&type=video" title="Click to download video file">
Hr 1
</a>
<br/>
<a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=42851&type=video" title="Click to download video file">
Hr 2
</a>
<br/>
<a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/02/tyt-june-2-2014-hour-1/" title="Click to watch the video">
Hr 1
</a>
<br/>
<a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/02/tyt-june-2-2014-hour-2/" title="Click to watch the video">
Hr 2
</a>
</p>
</td>
<td class="tytca-td">
<div class="tytca-daynum">
3
</div>
<p>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43325&type=audio" title="Click to download audio file">
Hr 1
</a>
<br/>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43324&type=audio" title="Click to download audio file">
Hr 2
</a>
<br/>
<a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43325&type=video" title="Click to download video file">
Hr 1
</a>
<br/>
<a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43324&type=video" title="Click to download video file">
Hr 2
</a>
<br/>
<a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/03/tyt-june-3-2014-hour-1/" title="Click to watch the video">
Hr 1
</a>
<br/>
<a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/03/tyt-june-3-2014-hour-2/" title="Click to watch the video">
Hr 2
</a>
</p>
</td>
<td class="tytca-td">
<div class="tytca-daynum">
4
</div>
<p>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43635&type=audio" title="Click to download audio file">
Hr 1
</a>
<br/>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=43633&type=audio" title="Click to download audio file">
Hr 2
</a>
<br/>
<a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43635&type=video" title="Click to download video file">
Hr 1
</a>
<br/>
<a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=43633&type=video" title="Click to download video file">
Hr 2
</a>
<br/>
<a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/04/tyt-june-4-2014-hour-1/" title="Click to watch the video">
Hr 1
</a>
<br/>
<a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/04/tyt-june-4-2014-hour-2/" title="Click to watch the video">
Hr 2
</a>
</p>
</td>
<td class="tytca-td">
<div class="tytca-daynum">
5
</div>
<p>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44046&type=audio" title="Click to download audio file">
Hr 1
</a>
<br/>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44044&type=audio" title="Click to download audio file">
Hr 2
</a>
<br/>
<a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44046&type=video" title="Click to download video file">
Hr 1
</a>
<br/>
<a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44044&type=video" title="Click to download video file">
Hr 2
</a>
<br/>
<a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/05/tyt-june-5-2014-hour-1/" title="Click to watch the video">
Hr 1
</a>
<br/>
<a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/05/tyt-june-5-2014-hour-2/" title="Click to watch the video">
Hr 2
</a>
</p>
</td>
<td class="tytca-td">
<div class="tytca-daynum">
6
</div>
<p>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44279&type=audio" title="Click to download audio file">
Hr 1
</a>
<br/>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44277&type=audio" title="Click to download audio file">
Hr 2
</a>
<br/>
<a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44279&type=video" title="Click to download video file">
Hr 1
</a>
<br/>
<a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44277&type=video" title="Click to download video file">
Hr 2
</a>
<br/>
<a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/06/tyt-june-6-2014-hour-1/" title="Click to watch the video">
Hr 1
</a>
<br/>
<a class="tytca-video-watch" href="https://www.tytnetwork.com/2014/06/06/tyt-june-6-2014-hour-2/" title="Click to watch the video">
Hr 2
</a>
</p>
</td>
</tr>
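I've tried using Beautiful Soup, but the page is so poorly structured that there doesn't seem to be a way to do what I want.
At this point, I'm turning this over to the Python experts here for help.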
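I assume you have an account and have already handled the authentication part? – merlin2011
@merlin2011 Yes, of course. :D – Soviero
Before "turning this over to the Python experts": please make an effort to build a minimal example that can run offline and contains only the relevant HTML snippet you are having trouble parsing, along with what you have tried so far. –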
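For reference, here is a minimal offline sketch of the kind the last comment asks for. It uses only the standard library's `html.parser` rather than Beautiful Soup, so it runs without any dependencies. It is a sketch under several assumptions, not a tested solution for the live site: the HTML is passed in as a string (so the signature differs from the `get_download_links(year, month, day)` call in the question, and fetching and authentication are left out), the month header always reads like "June 2014", and each day cell follows the pattern in the excerpt above. `ArchiveParser` and `SAMPLE` are hypothetical names made up for illustration.

```python
from datetime import datetime
from html.parser import HTMLParser

# Trimmed-down sample that mirrors the structure of the excerpt in the question.
SAMPLE = """
<table>
<tr><th class="tytca-mosname" colspan="5"><h3>June 2014</h3></th></tr>
<tr><th class="tytca-dayname"><h3>Fri</h3></th></tr>
<tr><td class="tytca-td">
<div class="tytca-daynum">6</div>
<p>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44279&type=audio">Hr 1</a><br/>
<a class="tytca-audio" href="https://www.tytnetwork.com/?tytpm=44277&type=audio">Hr 2</a><br/>
<a class="tytca-video" href="https://www.tytnetwork.com/?tytpm=44279&type=video">Hr 1</a>
</p>
</td></tr>
</table>
"""


class ArchiveParser(HTMLParser):
    """Hypothetical helper: collects audio links keyed by (year, month, day)."""

    def __init__(self):
        super().__init__()
        self.links = {}          # (year, month, day) -> list of hrefs
        self._field = None       # "month" or "day" while inside those elements
        self._year_month = None  # (year, month) from the latest month header
        self._day = None         # day number from the latest day cell

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "th" and "tytca-mosname" in classes:
            self._field = "month"
        elif tag == "div" and "tytca-daynum" in classes:
            self._field = "day"
        elif tag == "a" and "tytca-audio" in classes and attrs.get("href"):
            # Only record links once both a month header and a day are known.
            if self._year_month and self._day:
                key = (*self._year_month, self._day)
                self.links.setdefault(key, []).append(attrs["href"])

    def handle_endtag(self, tag):
        if tag in ("th", "div"):
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "month":
            try:
                parsed = datetime.strptime(text, "%B %Y")  # assumes "June 2014"
            except ValueError:
                return
            self._year_month = (parsed.year, parsed.month)
            self._day = None  # a new month starts; forget the previous day
        elif self._field == "day" and text.isdigit():
            self._day = int(text)


def get_download_links(html, year, month, day):
    """Return the audio download links for one date, or [] if none are found."""
    parser = ArchiveParser()
    parser.feed(html)
    return parser.links.get((year, month, day), [])


print(get_download_links(SAMPLE, 2014, 6, 6))
```

The idea is that the page does not nest days inside months, so the parser carries the most recent month header and day number along as state and attaches each `tytca-audio` link to whatever date is current at that point. The same approach should carry over to Beautiful Soup by walking the rows in document order.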