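How do I write a Python script to download files?

I want to download some files from this website: http://www.emuparadise.me/soundtracks/highquality/index.php

But I only want to get certain ones.

Is there a way to write a Python script to do this? I have intermediate knowledge of Python.

I am just looking for some guidance; please point me toward a wiki or a library that can accomplish this.

Thanks, Shrub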

Here's a link to my code

Source

2012-09-25 Rishub Nagpal

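I would recommend using BeautifulSoup to parse the page and extract the links you need. From there, just write a few functions that break the problem down into simple steps. – Blender

For the download part, the built-in 'urllib2' (http://docs.python.org/library/urllib2) is the simplest approach; the sample code in the docs is easy to follow. For parsing, BeautifulSoup is the best way to scrape arbitrary HTML; if you know you have valid XHTML or HTML 5, there are other options; and if you can get the information as machine-readable XML or JSON instead of human-readable HTML in the first place, I would do that. – abarnert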
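
A minimal sketch of the "parse the page and extract the links" step described in the comments above. It is written for Python 3 and deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without third-party packages; BeautifulSoup would be more forgiving on messy HTML. The names `LinkCollector` and `extract_links` and the sample markup are made up for illustration and do not reflect the real structure of the index page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin turns relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all absolute link targets found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical markup standing in for the real index page.
SAMPLE_HTML = """
<ul>
  <li><a href="/soundtracks/highquality/album-one.php">Album One</a></li>
  <li><a href="album-two.php">Album Two</a></li>
  <li><a name="no-href">Not a link</a></li>
</ul>
"""

BASE = "http://www.emuparadise.me/soundtracks/highquality/index.php"
links = extract_links(SAMPLE_HTML, BASE)
```

From here, "only certain ones" becomes an ordinary list filter over `links`, for example by substring or by file extension.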

@abarnert What about using 'wget' with the 'subprocess' module to download the files? –

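I took a look at the page. The links appear to redirect to other pages where the files are hosted; clicking a link on that page downloads the file.

I would use mechanize to follow the required links to the right page, and then use BeautifulSoup or lxml to parse the resulting page to get the filename.

Then it is a simple matter of opening the file with urlopen and writing its contents to a local file, like so: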

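from urllib2 import urlopen  # Python 2; in Python 3 this is urllib.request.urlopen

with open(localFilePath, 'wb') as f:  # 'wb': the files are binary, not text
    f.write(urlopen(remoteFilePath).read())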
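
For large audio files, reading the whole response into memory at once may be wasteful. The following is a sketch for Python 3 of the same idea that copies the response in chunks instead. The function name `download` and the `chunk_size` default are my own choices, and the demo uses a local `file://` URL so that it runs without network access; a real `http://` URL goes through the same `urlopen` call.

```python
import shutil
import tempfile
from pathlib import Path
from urllib.request import urlopen


def download(remote_url, local_path, chunk_size=64 * 1024):
    """Stream remote_url to local_path in binary mode; return bytes written."""
    with urlopen(remote_url) as response, open(local_path, "wb") as out:
        # copyfileobj reads and writes chunk_size bytes at a time.
        shutil.copyfileobj(response, out, chunk_size)
    return Path(local_path).stat().st_size


# Stand-in data; not a real audio file.
payload = b"\x00\x01fake audio bytes\xff" * 1000

# Demo without touching the network: fetch a local file via a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "track.flac"
    source.write_bytes(payload)
    target = Path(tmp) / "copy.flac"
    written = download(source.as_uri(), target)
    same = target.read_bytes() == payload
```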

Hope this helps

Source

2012-09-25 22:34:34 inspectorG4dget

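Thanks a lot! Do I replace localFilePath and remoteFilePath with the directories I want? –

'localFilePath' is the full pathname you want: directory plus filename. 'remoteFilePath' is the URL. – abarnert

'localFilePath' will include the directory where you want to save the music. For example, 'localFilePath' could be '/home/username/Downloads/OnlineMusic/file1.flac' – inspectorG4dget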
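
As a sketch of what "directory plus filename" can look like in practice, the helper below (Python 3) derives the local path from the last segment of the URL. The helper name `local_path_for` and the example URL are assumptions made for illustration; the directory is the one from the comment above.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit


def local_path_for(remote_url, directory):
    """Return directory joined with the file name taken from the URL path."""
    url_path = urlsplit(remote_url).path  # drops ?query and #fragment
    # Decode %xx escapes first, then keep only the final path segment.
    file_name = posixpath.basename(unquote(url_path))
    if not file_name:
        raise ValueError("URL has no file name: %r" % remote_url)
    return os.path.join(directory, file_name)


# Hypothetical URL, used only to show the mapping.
example = local_path_for(
    "http://example.com/music/file%201.flac?token=abc",
    "/home/username/Downloads/OnlineMusic",
)
```

The result can then be passed as 'localFilePath', with the original URL as 'remoteFilePath'.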


I would use a combination of wget for the downloading (http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/#more-1885) and BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/bs4/doc/) for parsing the downloaded file.

Source

2012-09-25 22:40:09 Manan

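Why use wget? If you just want to download the full contents of a URL, that is one line with 'urllib2.urlopen', whereas 'wget' needs at least one line of 'subprocess' and another line to read the file, plus you end up with a temporary file that you don't need and have to manage, and so on. Also, it means that on Mac, Windows, FreeBSD, or any Linux distribution that doesn't ship wget by default, you need to install it. – abarnert

Make a URL request for the page. Once you have the source, filter it and extract the URLs.

The files you want to download are URLs that contain a specific extension. Because of this, you can run a regular-expression search for all URLs that match your criteria. After filtering, make a URL request for each matching URL and write the returned data to disk.

Sample code: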

#!/usr/bin/python 
import re 
import sys 
import urllib 

#Your sample url 
sampleUrl = "http://stackoverflow.com" 
urlAddInfo = urllib.urlopen(sampleUrl) 
data = urlAddInfo.read() 

#Sample extensions we'll be looking for: pngs and pdfs 
TARGET_EXTENSIONS = "(png|pdf)" 
targetCompile = re.compile(TARGET_EXTENSIONS, re.UNICODE|re.MULTILINE) 

#Let's get all the urls: match criteria{no spaces or " in a url} 
urls = re.findall('(https?://[^\s"]+)', data, re.UNICODE|re.MULTILINE) 

#We want these folks 
extensionMatches = filter(lambda url: url and targetCompile.search(url), urls) 

#The rest of the unmatched urls for which the scraping can also be repeated. 
nonExtMatches = filter(lambda url: url and not targetCompile.search(url), urls) 


def fileDl(targetUrl): 
    #Function to handle downloading of files. 
    #Arg: url => a String 
    #Output: Boolean to signify if file has been written to disk 

    #Validation of the url assumed, for the sake of keeping the illustration short 
    urlAddInfo = urllib.urlopen(targetUrl) 
    data = urlAddInfo.read() 
    fileNameSearch = re.search(r"([^/\s]+)$", targetUrl) #Text right after the last slash '/' 
    if not fileNameSearch: 
        sys.stderr.write("Could not extract a filename from url '%s'\n"%(targetUrl)) 
        return False 
    fileName = fileNameSearch.group(1) 
    with open(fileName, "wb") as f: 
        f.write(data) 
    sys.stderr.write("Wrote %s to disk\n"%(fileName)) 
    return True 

#Let's now download the matched files 
dlResults = map(lambda fUrl: fileDl(fUrl), extensionMatches) 
successfulDls = filter(lambda s: s, dlResults) 
sys.stderr.write("Downloaded %d files from %s\n"%(len(successfulDls), sampleUrl)) 

#You can organize the above code into a function to repeat the process for each of the 
#other urls and in that way you can make a crawler.

The above code is written mainly for Python 2.X. However, I wrote a crawler that works on any version starting from 2.X
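
For readers on Python 3, here is a hedged sketch of just the filtering step, written as a pure function so it can be tried without any network access. It is not the crawler mentioned above. The function name, the URL pattern and the sample page source are my own assumptions. Unlike the '(png|pdf)' pattern in the sample code, which matches those letters anywhere in a URL, this version only accepts URLs whose path ends in one of the extensions.

```python
import re
from urllib.parse import urlsplit

# Absolute http(s) URLs, stopping at whitespace, quotes or angle brackets.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>]+""")


def urls_with_extensions(page_source, extensions):
    """Return URLs in page_source whose path ends in one of the extensions."""
    wanted = tuple("." + ext.lower() for ext in extensions)
    matches = []
    for url in URL_PATTERN.findall(page_source):
        # Compare against the path only, so "?v=2" does not hide the extension.
        if urlsplit(url).path.lower().endswith(wanted):
            matches.append(url)
    return matches


# Hypothetical page source used only to exercise the function.
SAMPLE = '''
<a href="http://example.com/docs/manual.pdf">manual</a>
<img src="http://example.com/img/logo.PNG?v=2">
<a href="http://example.com/pdf/index.html">not a pdf</a>
'''

found = urls_with_extensions(SAMPLE, ["png", "pdf"])
```

Each URL in `found` can then be fetched with `urllib.request.urlopen` and written to a file opened in "wb" mode, mirroring what fileDl does in the Python 2 code.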

Source

2013-07-31 16:27:10
