2013-12-21 298 views

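Downloading *.mp4 files with Python

I am trying to download and save lecture videos from a website. Although the files download successfully, they will not play in my media player. Here is the code I am using: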

from bs4 import BeautifulSoup 
import re 
import urllib2 

snippet = open('Python/SNA Page Source Revised.txt', 'r') 
soup = BeautifulSoup(snippet) 

links = [link.get('href') for link in soup.find_all('a')] 

videos = [] 

for link in links: 
    match = re.search('.*mp4.*', link) 
    if match: 
        videos.append(link) 

vidNum = 1 

for video in videos: 
    f = urllib2.urlopen(video) 
    with open('Data Analysis/Social Network Analysis/Video ' + str(vidNum) + '.mp4', 'wb') as code: 
        code.write(f.read()) 
    vidNum += 1 

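Everything seems to work fine, but when I try to play one of the videos I get this error: "Python (v2.7) requires to install plugins to play media files of the following type: text/html decoder". In addition, if I download a video manually from the website the file is about 22.8MB, but when I use my script the file is only 7.8kB.

Is there something wrong with the way I am downloading the files? Any help would be greatly appreciated.

Also: I am running Python v2.7 on Ubuntu 12.04 LTS.

****EDIT****

Here is the code I am now using, based on the responses I received: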

import requests 

r = requests.get('https://class.coursera.org/sna-003/lecture/download.mp4?lecture_id=2', auth=('myUsername', 'myPassword')) 

with open('Data Analysis/TestFile.mp4', 'wb') as fd: 
    fd.write(r.content) 

Here is the output of r.content:

<!DOCTYPE html> 
<html itemtype="http://schema.org" xmlns:fb="http://ogp.me/ns/fb#"><head><meta content="IE=Edge,chrome=IE7" http-equiv="X-UA-Compatible"/><meta content="!" name="fragment"/><meta content="NOODP" name="robots"/><meta charset="utf-8"/><meta content="Coursera" property="og:title"/><meta content="website" property="og:type"/><meta content="http://s3.amazonaws.com/coursera/media/Coursera_Computer_Narrow.png" property="og:image"/><meta content="https://www.coursera.org/" property="og:url"/><meta content="Coursera" property="og:site_name"/><meta content="en_US" property="og:locale"/><meta content="Take free online classes from 80+ top universities and organizations. Coursera is a social entrepreneurship company partnering with Stanford University, Yale University, Princeton University and others around the world to offer courses online for anyone to take, for free. We believe in connecting people to a great education so that anyone around the world can learn without limits." property="og:description"/><meta content="727836538,4807654" property="fb:admins"/><meta content="274998519252278" property="fb:app_id"/><meta content="Take free online classes from 80+ top universities and organizations. Coursera is a social entrepreneurship company partnering with Stanford University, Yale University, Princeton University and others around the world to offer courses online for anyone to take, for free. We believe in connecting people to a great education so that anyone around the world can learn without limits." name="description"/><meta content="http://s3.amazonaws.com/coursera/media/Coursera_Computer_Narrow.png" name="image"/><meta content="app-id=736535961" name="apple-itunes-app"/><script>window.onerror = function(message, url, lineNum) { 

    // First check the URL and line number of the error 
    url = url || window.location.href; 
    // 99% of the time, errors without line numbers arent due to our code, 
    // they are due to third party plugins and browser extensions 
    if (lineNum === undefined || lineNum == null) return; 

    // Now figure out the actual error message 
    // If it's an event, as triggered in several browsers 
    if (message.target &amp;&amp; message.type) { 
    message = message.type; 
    } 
    if (!message.indexOf) { 
    message = 'Non-string, non-event error: ' + (typeof message); 
    } 

    var errorDescrip = { 
    message: message, 
    script: url, 
    line: lineNum, 
    url: document.URL 
    } 

    var err = { 
    key: 'page.error.javascript', 
    value: errorDescrip 
    } 

    window._204 = window._204 || []; 
    window._204.push(err); 

    window._gaq = window._gaq || []; 
    window._gaq.push(err); 
}</script><title>Coursera.org</title><link href="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/css/home.css" rel="stylesheet" type="text/css"/><link href="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/pages/auth/css/auth.css" rel="stylesheet" type="text/css"/><script data-baseurl="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/" id="_mobile">(function(el) { 
    // Override certian behaviour if the page is for our mobile app. 
    // TODO(priya) Remove this conditional behaviour once I want to push this behaviour 
    // for regular authentication pages on mobile/smaller screens as well. 
    // Currently I'm keeping existing behaviour same and only adding mobile specific 
    // layouts ot /mobilesignup page (which is what isMobileApp = true signifies). 
    if ("false" == "true") { 
    var head = document.getElementsByTagName('head')[0]; 
    // Add viewport meta tag 
    var viewport = document.querySelector('meta[name=viewport]'); 
    var viewportContent = 'width=device-width, initial-scale=1.0, user-scalable=no'; 
    if (!viewport) { 
     viewport = document.createElement('meta'); 
     viewport.setAttribute('name', 'viewport'); 
     head.appendChild(viewport); 
    } 
    viewport.setAttribute('content', viewportContent); 

    // Add responsive css 
    var link = document.createElement('link'); 
    link.rel = 'stylesheet'; 
    link.type = 'text/css'; 
    link.href = el.getAttribute("data-baseurl") + "pages/auth/css/auth_responsive.css"; 
    head.appendChild(link); 
    } 
})(document.getElementById("_mobile")); 
</script></head><body><div id="fb-root"></div><div id="origami"><div style="position:absolute;top:0px;left:0px;width:100%;height:100%;background:#f5f5f5;padding-top:5%;"><div id="coursera-loading-nojs" style="text-align:center; margin-bottom:10px;display:none;">Please use a <a href="/browsers">modern browser </a> with JavaScript enabled to use Coursera.</div><div><span id="coursera-loading-js" style="display: none; padding-left:45%">loading   <img src="https://d2wvvaown1ul17.cloudfront.net/site-static/images/icons/loading.gif"/></span></div><noscript><div style="text-align:center; margin-bottom:10px;">Please use a <a href="/browsers">modern browser </a> with JavaScript enabled to use Coursera.</div></noscript></div></div><!--[if gte IE 8]&gt;&lt;script&gt;document.getElementById("coursera-loading-js").style.display = 'block';&lt;/script&gt;&lt;![endif]--> 
<!--[if lte IE 7]&gt;&lt;script&gt;document.getElementById("coursera-loading-nojs").style.display = 'block'; 
window._204 = window._204 || []; 
window._gaq = window._gaq || []; 

window._gaq.push(
    ['_setAccount', 'UA-28377374-1'], 
    ['_setDomainName', window.location.hostname], 
    ['_setAllowLinker', true], 
    ['_trackPageview', window.location.pathname]); 

window._204.push(
    ['client', 'home'], 
    {key:"pageview", value:window.location.pathname}); 
    &lt;/script&gt;&lt;script src="https://eventing.coursera.org/204.min.js"&gt;&lt;/script&gt;&lt;script src="https://ssl.google-analytics.com/ga.js"&gt;&lt;/script&gt;&lt;![endif]--> 
<!--[if !IE]&gt; --><script>document.getElementById("coursera-loading-js").style.display = 'block';</script><!-- &lt;![endif]--><script src="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/js/core/require.js" type="text/javascript"></script><script data-baseurl="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/" data-debug="0" data-locale="" data-timestamp="1386838999742" data-version="e47434615f57601f9b9ccaf255a589e8550d328d" id="_require" type="text/javascript">if(document.getElementById("coursera-loading-js").style.display == 'block') { 
    (function(el) { 
    // prevent throw 
    require.onError = function(err) { 
     window._204 = window._204 || []; 
     window._204.push({key: 'requireErr', value: err}); 
    }; 

    define("pages/auth/authConfig", 
     function() { 
      return {"coursera_url": "https://www.coursera.org/", 
        "environment": "production"}; 
    } 
    ); 

    require.config({ 
     enforceDefine: false, 
     waitSeconds: 14, 
     baseUrl: el.getAttribute("data-baseurl"), 
     urlArgs: el.getAttribute("data-debug") == "1" ? "v=" + el.getAttribute("data-timestamp") : "", 
     shim: { 
      "underscore": { 
      exports: '_' 
      }, 
      "backbone": { 
      deps: ['underscore', 'jquery'], 
      exports: 'Backbone' 
      } 
     }, 
     paths: { 
      "jquery":  "js/core/jquery", 
      "underscore": "js/core/underscore", 
      "backbone":  "js/core/backbone", 
      "i18n":   "js/core/i18n._t" 
     }, 
     callback: function() { 
     require(["pages/auth/routes"]); // bootup coursera 
     }, 
     config: { 
     i18n: { 
      locale: (window.localStorage ? localStorage.getItem("locale") : '') || el.getAttribute("data-locale") 
     } 
     } 
    }); 
    })(document.getElementById("_require")); 
}</script><script type="text/javascript">define("pages/home/models/user.json", [], function(){ 
    return null; 
}); 
</script></body></html> 

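I find this strange: the output looks like the site's page source, yet when I look at r.url I get a real URL that I can load in my browser, which then prompts me to save or view the video. Even when I pass in the new URL I get from that (which I assume contains my cookie information), I still receive the same content. I don't understand where I am going wrong.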

+3

You are probably downloading the HTML of the download page rather than the file itself. Have you tried the URL in a browser?

+0

Yes, I have. Here is a sample URL: https://class.coursera.org/sna-003/lecture/download.mp4?lecture_id=2 When I look at the output of urllib2.urlopen(video).read(), it is XML data with a URL in it, but that URL will not load in a browser. Here is an example: https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d – brittenb

+1

Apparently you need to set some kind of cookie so that the download does not fail. – plaes

Answers

1

You need a valid cookie so that you do not end up downloading the login page.

Here is how to set a cookie with urllib2:

import urllib2 
opener = urllib2.build_opener() 
opener.addheaders.append(('Cookie', 'cookiename=cookievalue')) 
f = opener.open("http://example.com/") 

You can also use cookielib to get more browser-like behavior: go through the login process and obtain the proper cookie for downloading the movie.
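As a rough sketch of the cookielib approach: a cookie jar is attached to an opener, so cookies set during login are sent again on later requests. The helper names, the login URL you would pass in, and the form field names ('email', 'password') are hypothetical placeholders, not the site's real login API. The import fallback is only there so the same code also runs on Python 3, where cookielib is called http.cookiejar.

```python
try:  # Python 2, as used in the question
    import cookielib
    import urllib2 as urlrequest
    from urllib import urlencode
except ImportError:  # Python 3 names for the same modules
    import http.cookiejar as cookielib
    import urllib.request as urlrequest
    from urllib.parse import urlencode


def make_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = cookielib.CookieJar()
    opener = urlrequest.build_opener(urlrequest.HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, login_url, username, password):
    """POST the credentials; any session cookie ends up in the opener's jar.

    The field names 'email' and 'password' are assumptions for illustration.
    """
    form = urlencode({'email': username, 'password': password}).encode('utf-8')
    return opener.open(login_url, form)
```

After a successful login(opener, ...), calling opener.open(video_url) with the same opener would reuse the stored cookies for the download.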

Another option is Requests, which is similar to urllib2 but easier to use and makes it simpler to automate the login process.
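A minimal sketch of that idea, assuming the site accepts a form-based login: a requests.Session keeps cookies between calls, so the login cookie is sent automatically with the download request. The function name fetch_with_login, the login URL, and the form fields are hypothetical, and a real site may also require a CSRF token or different field names.

```python
def fetch_with_login(session, login_url, credentials, video_url, dest_path):
    """Log in, then stream video_url to dest_path using the same session.

    `session` is expected to be a requests.Session (or any object with
    compatible post/get methods). Returns the number of bytes written.
    """
    # Cookies set by the login response are stored on the session.
    session.post(login_url, data=credentials)

    # stream=True avoids loading the whole video into memory at once.
    response = session.get(video_url, stream=True)
    response.raise_for_status()

    written = 0
    with open(dest_path, 'wb') as out:
        for chunk in response.iter_content(chunk_size=64 * 1024):
            if chunk:  # skip empty keep-alive chunks
                out.write(chunk)
                written += len(chunk)
    return written


# Usage (hypothetical login URL and field names):
#   import requests
#   fetch_with_login(requests.Session(),
#                    'https://example.com/login',
#                    {'email': 'myUsername', 'password': 'myPassword'},
#                    video_url, 'TestFile.mp4')
```

If the returned byte count is only a few kilobytes for a lecture video, the response is most likely still a login or error page.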

+0

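So, using requests, my code would be something like this: for video in videos: r = requests.get(video, auth=('user', 'pass')); vidFile = open('Data Analysis/Video' + str(vidNum) + '.mp4', 'wb'); vidFile.write(r.content); vidNum += 1; (semicolons used to show line breaks) I have never used requests and am reading the documentation now. Just wondering whether I am on the right track. – brittenb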

1

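I would first save the file as .html instead of .mp4, so you can be 100% sure whether it is a login page, an error page, or some other miscellaneous junk. Some sites require cookies, a specific user agent (to block bots, scrapers, and automated vulnerability scanners), a Referer header, and similar things.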
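One way to automate that check, sketched here with plain Python only: look at the Content-Type header and the first bytes of the body before saving anything with a .mp4 extension. The helper names looks_like_html and choose_extension are hypothetical, and the byte check is a heuristic rather than a full file-type detector.

```python
def looks_like_html(content_type, first_bytes):
    """Heuristically decide whether a response is an HTML page, not a video.

    content_type: value of the Content-Type header (may be None).
    first_bytes:  the first few hundred bytes of the response body.
    """
    if content_type and 'html' in content_type.lower():
        return True
    head = first_bytes.lstrip().lower()
    return head.startswith(b'<!doctype html') or head.startswith(b'<html')


def choose_extension(content_type, first_bytes):
    """Pick '.html' for pages (login/error) and '.mp4' otherwise."""
    return '.html' if looks_like_html(content_type, first_bytes) else '.mp4'
```

With requests, the two arguments could come from r.headers.get('Content-Type') and r.content[:512]. The text/html type in the media player's error message in the question is consistent with what this check would flag.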

Personally, I use Tamper Data or Live HTTP Headers to make sure my program is behaving correctly while debugging.

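If you are getting a response served from the cloud (such as the cloudfront.net page in the question), then you are probably not handling cookies, user agents, or the Referer header correctly.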

I just checked the link, and there is also a CSRF cookie {csrf_token = toNQOP7stgOREzrDcbPc}, which you will 100% need in order to view anything past the login page.
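To illustrate, the cookie and headers mentioned in this answer could be attached to a request roughly as follows. This is a sketch using Python 3's urllib.request. The User-Agent and Referer values are placeholders you would copy from a real browser session, the helper names are hypothetical, and whether the site also expects the token in a header such as X-CSRFToken is an assumption that would need checking against the actual browser traffic.

```python
import urllib.request


def build_headers(csrf_token, referer, user_agent, session_cookie=None):
    """Return request headers carrying the CSRF cookie and browser-like fields.

    session_cookie is an optional 'name=value' string copied from a logged-in
    browser session; its name is site-specific and not known here.
    """
    cookies = ['csrf_token=' + csrf_token]
    if session_cookie:
        cookies.append(session_cookie)
    return {
        'Cookie': '; '.join(cookies),
        'X-CSRFToken': csrf_token,  # assumed header name, verify before use
        'Referer': referer,
        'User-Agent': user_agent,
    }


def make_request(url, headers):
    """Build (but do not send) a request; pass it to urllib.request.urlopen."""
    return urllib.request.Request(url, headers=headers)
```

The token value has to come from a current session; a token copied from an old capture will most likely have expired.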

3

First, download and install the requests package.

Then use this code:

import requests 

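def downloadfile(name, url): 
    name = name + ".mp4" 
    # Pass the url variable itself, not the literal string 'url'; 
    # stream=True lets iter_content read the body piece by piece 
    r = requests.get(url, stream=True) 
    print("****Connected****") 
    with open(name, 'wb') as f: 
        print("Downloading.....") 
        for chunk in r.iter_content(chunk_size=255): 
            if chunk:  # filter out keep-alive new chunks 
                f.write(chunk) 
    print("Done") 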