2014-04-11

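Collecting articles with Python from a website that requires cookies

I am trying to collect articles from the database at infoweb.newsbank.com for research I am doing at my university. This is my code so far: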

from bs4 import BeautifulSoup
import requests
import urllib.request
from requests import session
import http.cookiejar


mainLink = "http://infoweb.newsbank.com.proxy.lib.uiowa.edu/iw-search/we/InfoWeb?p_product=AWNB&p_theme=aggregated5&p_action=doc&p_docid=14D12E120CD13C18&p_docnum=2&p_queryname=4"


def articleCrawler(mainUrl):
    # Fetch the page and print every anchor tag found in it
    response = urllib.request.urlopen(mainUrl)
    soup = BeautifulSoup(response, "html.parser")
    for link in soup.find_all('a'):
        print(link)

articleCrawler(mainLink)

Unfortunately, I get back a response like this:

<html> 
<head> 
<title>Cookie Required</title> 
</head> 
<body> 
This is cookie.htm from the doc subdirectory. 
<p> 
<hr> 
<p> 

Licensing agreements for these databases require that access be extended 
only to authorized users. Once you have been validated by this system, 
a "cookie" is sent to your browser as an ongoing indication of your authorization to 
access these databases. It will only need to be set once during login. 
<p> 
As you access databases, they may also use cookies. Your ability to use those databases 
may depend on whether or not you allow those cookies to be set. 
<p> 
To login again, click <a href="login">here</a>. 
</p></p></p></hr></p></body> 
</html> 

<a href="login">here</a> 

I tried using http.cookiejar, but I am not familiar with the library. I am using Python 3. Does anyone know how to accept the cookie and access the articles? Thanks.

Answer


I am not familiar with Python 3, but in Python 2 the standard way to accept cookies is to include an HTTPCookieProcessor as one of the handlers in your OpenerDirector.

So, something like this:

import cookielib, urllib, urllib2 
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar())) 
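
Since the question uses Python 3, here is a rough sketch of what the same construction might look like there. It assumes the Python 3 module names (cookielib became http.cookiejar, urllib2 became urllib.request); the helper name build_cookie_opener is made up for illustration.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the opener stores any cookies it receives in jar.

    build_cookie_opener is a hypothetical helper name, not part of any library.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


opener, jar = build_cookie_opener()
```

Keeping a reference to the jar is optional, but it makes it possible to look at the stored cookies later.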

The opener is now ready to open a URL (presumably with a username and password) and put any cookies it receives into its attached CookieJar:

params = urllib.urlencode({'username': 'someuser', 'password': 'somepass'}) 
opener.open(LOGIN_URL, params) 
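
In Python 3 the encoding step differs slightly: urlencode lives in urllib.parse, and the POST body has to be bytes rather than str. A minimal sketch, assuming the form fields really are called username and password as in the example above (the actual login form may use other names); encode_login_form is a hypothetical helper.

```python
import urllib.parse


def encode_login_form(username, password):
    """Build a URL-encoded POST body as bytes, as urllib.request expects."""
    # Field names are copied from the example above; the real form may differ.
    fields = {"username": username, "password": password}
    return urllib.parse.urlencode(fields).encode("utf-8")


data = encode_login_form("someuser", "somepass")
# A cookie-aware opener would then send it with: opener.open(LOGIN_URL, data)
```

Passing a non-None data argument to open() is what turns the request into a POST.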

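If the login succeeds, the opener will now hold whatever authentication token the server hands out in the form of a cookie. Then you just access the link you wanted in the first place: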

f = opener.open(mainLink) 
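
If the "Cookie Required" page still comes back, one way to check whether the login stored anything at all is to look inside the jar. This is only a sketch for Python 3; stored_cookie_names is a hypothetical helper, and which cookie names the proxy actually sets is not something I know.

```python
import http.cookiejar


def stored_cookie_names(jar):
    """Return the sorted names of all cookies currently held in the jar."""
    # Iterating over a CookieJar yields http.cookiejar.Cookie objects.
    return sorted(cookie.name for cookie in jar)


jar = http.cookiejar.CookieJar()
print(stored_cookie_names(jar))  # a fresh jar holds no cookies
```

An empty list after the login request would suggest the login itself failed or the server never sent a cookie.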

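Again, I am not sure exactly how this converts to Python 3, but as far as I know cookielib was renamed to http.cookiejar there, so http.cookiejar.CookieJar should be the client-side equivalent. The module for creating HTTP cookie content as a server, rather than receiving it as a client, is http.cookies.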


OK, I'll check it out and reply later. Thanks.