Using BeautifulSoup and Python

How do I get only the MP3 links using BeautifulSoup and Python? This is my code:

from bs4 import BeautifulSoup
import urllib.request
import re

url = urllib.request.urlopen("http://www.djmaza.info/Abhi-Toh-Party-Khubsoorat-Full-Song-MP3-2014-Singles.html")
content = url.read()
soup = BeautifulSoup(content)
for a in soup.findAll('a', href=True):
    if re.findall('http', a['href']):
        print("URL:", a['href'])

This code outputs:

URL: http://twitter.com/mp3khan 
URL: http://www.facebook.com/pages/MP3KhanCom-Music-Updates/233163530138863 
URL: https://plus.google.com/114136514767143493258/posts 
URL: http://www.djhungama.com 
URL: http://www.djhungama.com 
URL: http://songs.djmazadownload.com/music/Singles/Abhi Toh Party (Khoobsurat) -190Kbps [DJMaza.Info].mp3 
URL: http://songs.djmazadownload.com/music/Singles/Abhi Toh Party (Khoobsurat) -190Kbps [DJMaza.Info].mp3 
URL: http://songs.djmazadownload.com/music/Singles/Abhi Toh Party (Khoobsurat) -320Kbps [DJMaza.Info].mp3 
URL: http://songs.djmazadownload.com/music/Singles/Abhi Toh Party (Khoobsurat) -320Kbps [DJMaza.Info].mp3 
URL: http://www.htmlcommentbox.com 
URL: http://www.djmaza.com 
URL: http://www.djhungama.com

I only need the MP3 links.

So how should I rewrite the code?

Thanks.


2014-08-29 Muneeb K

Change the findAll call to match against a regular expression, for example:

for a in soup.findAll('a', href=re.compile(r'http.*\.mp3')):
    print("URL:", a['href'])
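
One caveat worth noting: when given a compiled pattern, BeautifulSoup filters with the pattern's search() method, so 'http.*\.mp3' also accepts an href that merely contains ".mp3" somewhere in the middle, and it rejects an upper-case ".MP3". Below is a minimal sketch of a stricter, anchored pattern using only the re module. The name MP3_LINK and the example.com hrefs are made up for illustration; the other two hrefs come from the output in the question.

```python
import re

# The pattern from the answer above: matches anywhere in the string.
loose = re.compile(r'http.*\.mp3')

# Hypothetical stricter pattern: must start with http:// or https://
# and end with .mp3, ignoring case.
MP3_LINK = re.compile(r'^https?://.*\.mp3$', re.IGNORECASE)

samples = [
    "http://songs.djmazadownload.com/music/Singles/Abhi Toh Party (Khoobsurat) -190Kbps [DJMaza.Info].mp3",
    "http://example.com/track.mp3.html",  # made-up: ".mp3" in the middle
    "http://example.com/TRACK.MP3",       # made-up: upper-case extension
    "http://www.djhungama.com",
]

# Print whether each pattern accepts each href.
for href in samples:
    print(bool(loose.search(href)), bool(MP3_LINK.search(href)), href)
```

If the stricter behaviour is what you want, the compiled pattern can be passed as href=MP3_LINK in place of the inline re.compile(...) call.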

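Update, in response to this comment:

I need to store those links in an array for downloading. How can I do that?

You can use a list comprehension instead to build a list: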

links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.mp3'))]
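
The output in the question shows each MP3 link twice, and the links contain spaces, parentheses and brackets, which usually need percent-encoding before a download request. Here is a rough sketch of tidying such a list with the standard library only. The function name prepare_links is hypothetical, and the sketch assumes a literal "%" never appears in a file name, so existing %XX escapes are left alone.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def prepare_links(links):
    """Drop duplicate links (keeping first-seen order) and percent-encode their paths."""
    # dict keys keep insertion order (Python 3.7+), so this de-duplicates in order.
    unique = list(dict.fromkeys(links))
    prepared = []
    for link in unique:
        parts = urlsplit(link)
        # Escape spaces, brackets, etc. in the path; "/" and "%" are left as-is
        # (assumption: any "%" already starts a valid %XX escape).
        encoded_path = quote(parts.path, safe="/%")
        prepared.append(urlunsplit(parts._replace(path=encoded_path)))
    return prepared

# Example with links taken from the output in the question.
links = [
    "http://songs.djmazadownload.com/music/Singles/Abhi Toh Party (Khoobsurat) -190Kbps [DJMaza.Info].mp3",
    "http://songs.djmazadownload.com/music/Singles/Abhi Toh Party (Khoobsurat) -190Kbps [DJMaza.Info].mp3",
    "http://songs.djmazadownload.com/music/Singles/Abhi Toh Party (Khoobsurat) -320Kbps [DJMaza.Info].mp3",
]
for url in prepare_links(links):
    print(url)
```

Each prepared URL could then be handed to something like urllib.request.urlretrieve(url, filename) to save the file.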


2014-08-29 08:56:28

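Thank you very much... :D – 2014-08-29 09:22:05

@MuneebK You're welcome. On another note, since you're using 'bs4', you may want to use '.find_all' rather than 'findAll': the latter is BS3 style and is kept for backward compatibility, but it may be removed at some point. It's best to get into the habit of using the 'something_something' functions rather than the 'somethingSomething' ones. – 2014-08-29 09:24:59

I need to store those links in an array for downloading. How can I do that? – 2014-08-29 09:36:46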

You can use .endswith(). For example:

if re.findall('http',a['href']) and a['href'].endswith(".mp3"):


2014-08-29 08:55:28 fredtantini

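Thank you for the great answer! It helped me a lot. I would like to translate this post to share it with my Korean friends. It will be posted [here at ctrlaltdel](http://ctrlaltdel.co.kr/). Please let me know if you mind, and I will delete it. – 2014-08-29 09:04:10

I don't mind at all. Please translate and share. – fredtantini 2014-08-29 12:08:43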

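If only the extension interests you, be aware that endswith() returns a Boolean rather than the file's extension. It is better to build your own function for this purpose, and call it like this: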

if re.findall('http', a['href']) and isMP3file(a['href']):

Now you can define the function like this:

import os 
def isMP3file(link): 
    name, ext = os.path.splitext(link) 
    return ext.lower() == '.mp3'
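
One limitation to keep in mind: os.path.splitext treats the whole link as a path, so a query string or fragment after the file name hides the extension. A possible variant that parses the URL first is sketched below. The function name is_mp3_url and the example.com links are made up for illustration.

```python
import os
from urllib.parse import urlsplit

def is_mp3_url(link):
    """Return True if the path component of link ends with .mp3 (any case)."""
    path = urlsplit(link).path       # drops any ?query and #fragment
    _, ext = os.path.splitext(path)  # ext includes the leading dot
    return ext.lower() == '.mp3'

# Made-up links showing the cases that differ from a plain splitext on the link.
print(is_mp3_url("http://example.com/a/song.mp3?download=1"))
print(is_mp3_url("http://example.com/a/song.MP3#t=10"))
print(is_mp3_url("http://example.com/song.mp3.html"))
```

With a plain splitext on the full link, the first example would yield the extension '.mp3?download=1' and be rejected.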


2014-08-29 09:08:12

Thanks, Beguradj – 2014-08-29 09:22:20
