2016-06-28

Getting a download link from a webpage with Python 2.7

I wrote this program to make a repetitive task less annoying. It is supposed to take a link, filter for the "Download STV Demo" button, grab the URL from that button, and use it to download. Downloading a file from a URL works fine; I just can't open this particular URL. It downloads fine from stackoverflow, but not from the site I want: there I get a 403 Forbidden error. Does anyone have an idea how to get this working for http://sizzlingstats.com/stats/479453, and how to filter for that Download STV Demo button?

import random, sys, urllib2, httplib2, win32clipboard, requests, urlparse 
from copy import deepcopy 
from bs4 import SoupStrainer 
from bs4 import BeautifulSoup 
from urllib2 import Request 
from urllib2 import urlopen 
#When I wrote this, only God and I knew what I was writing 
#Now only God knows 

page = raw_input("Please copy the .ss link and hit enter... ") 
win32clipboard.OpenClipboard() 
page = win32clipboard.GetClipboardData() 
win32clipboard.CloseClipboard() 
s = page 
try: 
    page = s.replace("http://","http://www.") 
    print page + " Found..." 
except: 
    page = s.replace("www.","http://www.") 
    print page 

req = urllib2.Request(page, '', headers = { 'User-Agent' : 'Mozilla/5.0' }) 
req.headers['User-agent'] = 'Mozilla/5.0' 
req.add_header('User-agent', 'Mozilla/5.0') 
print req 
soup = BeautifulSoup(page, 'html.parser') 
print soup.prettify() 
links = soup.find_all("Download STV Demo") 
for tag in links: 
    link = links.get('href',None) 
    if "Download STV Demo" in link: 
     print link 

file_name = page.split('/')[-1] 
u = urllib2.urlopen(page) 
f = open(file_name, 'wb') 
meta = u.info() 
file_size = int(meta.getheaders("Content-Length")[0]) 
print "Downloading: %s Bytes: %s" % (file_name, file_size) 

file_size_dl = 0 
block_sz = 8192 
while True: 
    buffer = u.read(block_sz) 
    if not buffer: 
     break 
    file_size_dl += len(buffer) 
    f.write(buffer) 
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100./file_size) 
    status = status + chr(8)*(len(status)+1) 
    print status, 
f.close() 

You need to add a user agent. And why are you passing the output of raw_input to bs4? –


I do have a user agent, and the page from raw_input gets overwritten right away. But I'll remove it since it's not needed. –


It doesn't matter anyway, because the page content is created dynamically. Look under the XHR tab in Chrome's dev tools and you can get all the data you need in JSON format. –

Answers


The content of that page is generated dynamically by Javascript from an API.

>>> import requests 
>>> 
>>> requests.get('http://sizzlingstats.com/api/stats/479453').json()['stats']['stvUrl'] 
u'http://sizzlingstv.s3.amazonaws.com/stv/479453.zip' 

You got the 403 because they block the default user agent.
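To show the shape of that lookup without touching the network, here is a minimal sketch that pulls stvUrl out of a hard-coded payload. The field names follow the REPL session above; the payload itself is only an illustration, a real run would feed it the body of the API response instead:

```python
import json

def extract_stv_url(payload):
    """Pull the demo URL out of the API's JSON body (shape as in the session above)."""
    return json.loads(payload)['stats']['stvUrl']

# Offline sample mirroring the REPL session above; a real run would pass
# the text of http://sizzlingstats.com/api/stats/479453 instead.
sample = '{"stats": {"stvUrl": "http://sizzlingstv.s3.amazonaws.com/stv/479453.zip"}}'
print(extract_stv_url(sample))
```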

You have created the req object with a user agent, but you never use it; you call urllib2.urlopen(page) instead.

You are also passing page (a plain URL string) to BeautifulSoup, which is a mistake:

soup = BeautifulSoup(page, 'html.parser') 
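A minimal sketch of the fix described above, using the URL from the question. The header lives on the Request object, so it is only sent when you open that object; the network call itself is left commented out here:

```python
try:                                   # Python 2, as in the question
    from urllib2 import Request, urlopen
except ImportError:                    # same API in Python 3
    from urllib.request import Request, urlopen

req = Request('http://sizzlingstats.com/stats/479453',
              headers={'User-Agent': 'Mozilla/5.0'})

# The header is attached to this request object, so it only takes effect
# if you open *this* object:
#   html = urlopen(req)               # sends User-Agent: Mozilla/5.0
# whereas urlopen(page) builds a fresh request with the default
# Python-urllib user agent, which the site rejects with 403.
print(req.get_header('User-agent'))   # urllib stores header names capitalized
```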

Let's look at your code. First, you import a lot of modules you don't use (maybe this isn't the whole code), and a few others that you use but won't need. In fact, you only need:

from urllib2 import urlopen 

(we will see why later), and maybe win32clipboard for your input. Your input is fine, so I'll leave that part of the code alone:

import win32clipboard 
page = raw_input("Please copy the .ss link and hit enter... ") 
win32clipboard.OpenClipboard() 
page = win32clipboard.GetClipboardData() 
win32clipboard.CloseClipboard() 

But I really don't see the purpose of this kind of input; isn't it easier to just use something like:

page = raw_input("Please enter the .ss link: ") 

Then this part of the code is simply unnecessary:

s = page 
try:            
    page = s.replace("http://","http://www.") 
    print page + " Found..."     
except:            
    page = s.replace("www.","http://www.")  
    print page 

so I'll just delete it. The next part should look like this:

from urllib2 import Request, urlopen 
from bs4 import BeautifulSoup 
req = Request(page, headers = { 'User-Agent' : 'Mozilla/5.0' }) 
#req.headers['User-agent'] = 'Mozilla/5.0'  # you don't need this 
#req.add_header('User-agent', 'Mozilla/5.0') # you don't need this 
print req 
html = urlopen(req)  #you need to open page with urlopen before using BeautifulSoup 
# it is to fix this error: 
##  UserWarning: "b'http://www.sizzlingstats.com/stats/479453'" looks like a URL. 
##  Beautiful Soup is not an HTTP client. You should probably use an HTTP client 
##  to get the document behind the URL, and feed that document to Beautiful Soup. 
soup = BeautifulSoup(html, 'html.parser') # variable page changed to html 
# print soup.prettify()   # I commented this because you don't need to print html 
           # but if you want to see that it's work just uncomment it 

I won't use this code, and I'll explain why; but if you need to scrape some other page with BeautifulSoup, you can still use it.

You don't need it here, because of this part:

links = soup.find_all("Download STV Demo") 

The problem is that there is no "Download STV Demo" in the HTML code, at least not in the soup's HTML, because the page is created by JavaScript. (Note also that find_all("Download STV Demo") searches for tags with that name, not for link text.) So you won't find any links: you can check with print(links) that links == []. Because of that, you don't need this part either:

for tag in links:      
    link = tag.get('href', None)     # like I said, there is no use of this
    if "Download STV Demo" in link:  # because the variable links is an empty list
        print link 
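For a page whose HTML really does contain the link, you would match the link text, not a tag name. A sketch against inline markup (hypothetical, since the real page injects this with JavaScript):

```python
from bs4 import BeautifulSoup

# Hypothetical static markup; the real sizzlingstats page builds this via JS,
# which is why the soup in the question never contains it.
html = '<a href="http://sizzlingstv.s3.amazonaws.com/stv/479453.zip">Download STV Demo</a>'
soup = BeautifulSoup(html, 'html.parser')

# Match the anchor by its text, then read its href attribute:
link = soup.find('a', string='Download STV Demo')
print(link['href'])
```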

So, as I said, this part of the page, the link we need, is created with JavaScript. You could scrape the scripts to find it, but that is much harder to do. However, the URL we are trying to find looks like this:

http://sizzlingstv.s3.amazonaws.com/stv/479453.zip

and the URL you already have looks like this:

http://sizzlingstats.com/stats/479453

To get http://sizzlingstv.s3.amazonaws.com/stv/479453.zip you only need the last part of the link, in this case 479453, and your link (http://sizzlingstats.com/stats/479453) ends with that same number. You even use that number as file_name. Here is code that does exactly that; after it I'll copy some of your code:

file_name = page.split('/')[-1] 
download_link = 'http://sizzlingstv.s3.amazonaws.com/stv/' + file_name + '.zip' 

u = urlopen(download_link) 
meta = u.info()  
file_size = int(meta.getheaders("Content-Length")[0]) 
print "Downloading: %s Bytes: %s" % (file_name, file_size) 
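The same link construction, wrapped in a helper for clarity. The rstrip is my addition: it guards against a trailing slash, which the plain split('/')[-1] would turn into an empty file name.

```python
def demo_url(stats_page):
    """Map a sizzlingstats stats URL to its S3 demo archive (pattern observed above)."""
    match_id = stats_page.rstrip('/').split('/')[-1]   # e.g. '479453'
    return 'http://sizzlingstv.s3.amazonaws.com/stv/' + match_id + '.zip'

print(demo_url('http://sizzlingstats.com/stats/479453'))
# -> http://sizzlingstv.s3.amazonaws.com/stv/479453.zip
```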

After that, this part of your code works as it did before:

f = open(file_name + '.zip', 'wb') # I added '.zip' 
file_size_dl = 0 
block_sz = 8192 
while True: 
    buffer = u.read(block_sz) 
    if not buffer: 
     break 
    file_size_dl += len(buffer) 
    f.write(buffer) 
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100./file_size) 
    status = status + chr(8)*(len(status)+1) 
    print status, 
f.close() 

Maybe you want to see the download progress message, but I think it's easier to use:

f = open(file_name + '.zip', 'wb') 
f.write(u.read()) 
print "Downloaded" 
f.close() 
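Between the manual 8192-byte loop and the one-shot u.read(), the standard library also offers shutil.copyfileobj, which streams in chunks without the bookkeeping (you lose the progress display, though). Sketched here against an in-memory stream standing in for the urlopen() response:

```python
import io
import shutil

# Stand-in for the urlopen() response; any readable file-like object works.
source = io.BytesIO(b'demo-bytes' * 1000)
sink = io.BytesIO()                      # in a real run: open(file_name + '.zip', 'wb')

shutil.copyfileobj(source, sink, 8192)   # same 8192-byte chunks as the loop above
print(sink.tell())                       # total bytes copied
```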

Here is the whole code:

from urllib2 import urlopen 

import win32clipboard 
page = raw_input("Please copy the .ss link and hit enter... ") 
win32clipboard.OpenClipboard() 
page = win32clipboard.GetClipboardData() 
win32clipboard.CloseClipboard() 

# or use: 
# page = raw_input("Please enter the .ss link: ") 

file_name = page.split('/')[-1] 
download_link = 'http://sizzlingstv.s3.amazonaws.com/stv/' + file_name + '.zip' 
u = urlopen(download_link) 
meta = u.info()  
file_size = int(meta.getheaders("Content-Length")[0]) 
print "Downloading: %s Bytes: %s" % (file_name, file_size) 

f = open(file_name + '.zip', 'wb') # I added '.zip' 
file_size_dl = 0 
block_sz = 8192 
while True: 
    buffer = u.read(block_sz) 
    if not buffer: 
     break 
    file_size_dl += len(buffer) 
    f.write(buffer) 
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100./file_size) 
    status = status + chr(8)*(len(status)+1) 
    print status, 
f.close() 

# or use: 
##f = open(file_name + '.zip', 'wb') 
##f.write(u.read()) 
##print "Downloaded" 
##f.close() 

It's a pity to see an answer like this sit here without any comments. One more proof that accepted answers and vote counts don't mean much. Thanks for the effort and the detailed explanation. – xverges