Setting the output location of a Python script

I want to save all the images from a site. wget is horrible, at least for http://www.leveldesigninspirationmachine.tumblr.com, because in the image folder it just drops HTML files, with no extension.

I found a Python script whose usage looks like this:

[python] ImageDownloader.py URL MaxRecursionDepth DownloadLocationPath MinImageFileSize

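After sorting out some BeautifulSoup problems I finally got the script to run. However, I can't find the files anywhere. I also tried "/" as the output directory, hoping the images would end up in the root of my HD, but no luck. Can someone either help me simplify the script so that it outputs into the directory I have cd'd to in the terminal, or give me a command that should work? I have no experience with Python, and I don't really want to learn Python for a 2-year-old script that may not even work the way I want.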

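Also, how can I pass in an array of sites? A lot of scrapers only give me the first few results of a page. Tumblr loads more on scroll, but that has no effect here, so I would like to add /page1 etc. Thanks.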

# imageDownloader.py 
# Finds and downloads all images from any given URL recursively. 
# FB - 201009094 
import urllib2 
from os.path import basename 
import urlparse 
#from BeautifulSoup import BeautifulSoup # for HTML parsing 
import bs4 
from bs4 import BeautifulSoup 

global urlList 
urlList = [] 

# recursively download images starting from the root URL 
def downloadImages(url, level, minFileSize): # the root URL is level 0 
    # do not go to other websites 
    global website 
    netloc = urlparse.urlsplit(url).netloc.split('.') 
    if netloc[-2] + netloc[-1] != website: 
     return 

    global urlList 
    if url in urlList: # prevent using the same URL again 
     return 

    try: 
     urlContent = urllib2.urlopen(url).read() 
     urlList.append(url) 
     print url 
    except: 
     return 

    soup = BeautifulSoup(''.join(urlContent)) 
    # find and download all images 
    imgTags = soup.findAll('img') 
    for imgTag in imgTags: 
     imgUrl = imgTag['src'] 
     # download only the proper image files 
     if imgUrl.lower().endswith('.jpeg') or \ 
      imgUrl.lower().endswith('.jpg') or \ 
      imgUrl.lower().endswith('.gif') or \ 
      imgUrl.lower().endswith('.png') or \ 
      imgUrl.lower().endswith('.bmp'): 
      try: 
       imgData = urllib2.urlopen(imgUrl).read() 
       if len(imgData) >= minFileSize: 
        print " " + imgUrl 
        fileName = basename(urlsplit(imgUrl)[2]) 
        output = open(fileName,'wb') 
        output.write(imgData) 
        output.close() 
      except: 
       pass 
    print 
    print 

    # if there are links on the webpage then recursively repeat 
    if level > 0: 
     linkTags = soup.findAll('a') 
     if len(linkTags) > 0: 
      for linkTag in linkTags: 
       try: 
        linkUrl = linkTag['href'] 
        downloadImages(linkUrl, level - 1, minFileSize) 
       except: 
        pass 

# main 
rootUrl = 'http://www.leveldesigninspirationmachine.tumblr.com' 
netloc = urlparse.urlsplit(rootUrl).netloc.split('.') 
global website 
website = netloc[-2] + netloc[-1] 
downloadImages(rootUrl, 1, 50000)


2014-12-12 clankill3r

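The program should save the images in the same directory the program is run from. Note that you should not use `except: pass` in your program, because any error that occurs during the download is simply suppressed, with no indication of success or failure. That matters most when you are trying to track down a problem in the program. – Frxstrem 2014-12-12 23:56:17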

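As Frxstrem commented, this program creates files in the current directory (i.e. wherever you run it from). After running the program, run ls -l (or dir) to find the files it created.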
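One detail worth noting: the script as posted never reads its command-line arguments (the URL, depth, and minimum size are hard-coded in the "main" section), and it opens each file by bare name, so the output can only land in the current directory. Below is a minimal sketch of writing to a chosen directory instead. It is written for Python 3 (the posted script is Python 2), and `save_image` and `out_dir` are hypothetical names that do not exist in the original script.

```python
import os
from os.path import basename
from urllib.parse import urlsplit  # Python 3 location of Python 2's urlparse.urlsplit


def save_image(img_data, img_url, out_dir="."):
    """Write img_data into out_dir, named after the last path segment of img_url.

    Hypothetical helper: out_dir defaults to the current directory, which
    mirrors what the posted script does implicitly.
    """
    os.makedirs(out_dir, exist_ok=True)  # create the target directory if missing
    file_name = basename(urlsplit(img_url).path)
    target = os.path.join(out_dir, file_name)
    with open(target, "wb") as output:
        output.write(img_data)
    return target
```

If the directory should come from the command line, as the usage line suggests, `out_dir` could be taken from `sys.argv[3]`; that wiring is an assumption here, since the posted script does not do it.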

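If it looks like it hasn't created any files, then most probably it really hasn't, most likely because your except: pass is hiding an exception. To see what went wrong, replace try: ... except: pass with ... and rerun the program. (If you can't understand and solve that problem, please ask a separate StackOverflow question.)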
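If removing the try block entirely is too drastic because one bad image should not stop the whole run, a middle ground is to catch the exception but report it. The following is only a sketch in Python 3; `download_all` and its `fetch` parameter are hypothetical names used for illustration and stand in for whatever code actually downloads one image.

```python
import sys
import traceback


def download_all(img_urls, fetch):
    """Call fetch(url) for each URL and report failures instead of hiding them."""
    failed = []
    for img_url in img_urls:
        try:
            fetch(img_url)
        except Exception as exc:
            # Unlike `except: pass`, say what went wrong and keep a record.
            print("failed: %s (%r)" % (img_url, exc), file=sys.stderr)
            traceback.print_exc()  # full traceback goes to stderr
            failed.append(img_url)
    return failed
```

The returned list makes it easy to see at the end of a run which URLs produced no file, and the traceback shows the line that raised.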


2014-12-13 00:11:31 pts

It's hard to tell without seeing the errors (+1 for turning off your try/except blocks so you can see the exceptions), but I do see a typo here:

fileName = basename(urlsplit(imgUrl)[2])

You don't have "from urlparse import urlsplit", you have "import urlparse", so you need to refer to it as urlparse.urlsplit(), as you already do elsewhere. It should look like this:

fileName = basename(urlparse.urlsplit(imgUrl)[2])
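This also explains the symptom: the bare `urlsplit` name raises a NameError on every image, and the surrounding `except: pass` swallows it, so no file is written and nothing is printed. For readers on Python 3, where the `urlparse` module became `urllib.parse`, the same fix can be sketched as follows; `file_name_from_url` is a hypothetical helper name, not part of the posted script.

```python
import urllib.parse
from os.path import basename


def file_name_from_url(img_url):
    """Return the last path segment of img_url, ignoring any query string."""
    # With a plain `import urllib.parse`, the function must be called through
    # the module name; index 2 of the split result is the path component.
    return basename(urllib.parse.urlsplit(img_url)[2])


print(file_name_from_url("http://example.com/images/photo.jpg"))  # photo.jpg
```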


2014-12-13 03:34:32 brobas
