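Collecting articles with Python from a website that requires cookies
I am trying to collect articles from the database at infoweb.newsbank.com for research I am doing at my university. So far, this is my code: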
from bs4 import BeautifulSoup
import requests
import urllib.request  # a bare "import urllib" does not load urllib.request in Python 3
from requests import session
import http.cookiejar

mainLink = "http://infoweb.newsbank.com.proxy.lib.uiowa.edu/iw-search/we/InfoWeb?p_product=AWNB&p_theme=aggregated5&p_action=doc&p_docid=14D12E120CD13C18&p_docnum=2&p_queryname=4"

def articleCrawler(mainUrl):
    # Fetch the page (no cookies are sent) and print every link found in it
    response = urllib.request.urlopen(mainUrl)
    soup = BeautifulSoup(response, "html.parser")
    for link in soup.find_all('a'):
        print(link)

articleCrawler(mainLink)
Unfortunately, I get back a response like this:
<html>
<head>
<title>Cookie Required</title>
</head>
<body>
This is cookie.htm from the doc subdirectory.
<p>
<hr>
<p>
Licensing agreements for these databases require that access be extended
only to authorized users. Once you have been validated by this system,
a "cookie" is sent to your browser as an ongoing indication of your authorization to
access these databases. It will only need to be set once during login.
<p>
As you access databases, they may also use cookies. Your ability to use those databases
may depend on whether or not you allow those cookies to be set.
<p>
To login again, click <a href="login">here</a>.
</p></p></p></hr></p></body>
</html>
<a href="login">here</a>
I tried using http.cookiejar, but I am not familiar with the library. I am using Python 3. Does anyone know how to accept the cookie and access the articles? Thanks.
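One possible direction, as a minimal sketch rather than a confirmed solution for this site: the response above says the cookie is sent once the user has been validated, so the script probably has to log in and then reuse that cookie on later requests. With the standard library this can be done by attaching an http.cookiejar.CookieJar to a urllib.request opener. The helper names (build_cookie_opener, is_cookie_page, fetch_article), the login_url, and the credentials field names are all assumptions for illustration; the real login form of the library proxy is not shown in the question and has to be looked up in the browser.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_cookie_opener():
    """Return (opener, jar): an opener that stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def is_cookie_page(html):
    """True if the text is the 'Cookie Required' page shown in the question."""
    return "<title>Cookie Required</title>" in html


def fetch_article(opener, article_url, login_url=None, credentials=None):
    """Optionally log in first, then fetch article_url with the same opener."""
    if login_url and credentials:
        # Hypothetical login step: POST the form fields so that any
        # Set-Cookie headers in the reply end up in the cookie jar.
        data = urllib.parse.urlencode(credentials).encode("utf-8")
        opener.open(login_url, data).close()
    with opener.open(article_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    if is_cookie_page(html):
        raise RuntimeError("Still not authorized: no valid session cookie")
    return html
```

A call might look like fetch_article(opener, mainLink, login_url=..., credentials={...}), where the login URL and the dictionary keys are placeholders to be replaced with whatever the proxy's login form actually uses. The returned HTML can then be passed to BeautifulSoup as in the original articleCrawler. Because every request goes through the same opener, cookies set during login are sent back automatically when the article is requested.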
OK, I will check that out and get back to you later. Thanks.