Why does Python say this Netscape cookie file is invalid?

I'm writing a Google Scholar parser and, based on this answer, I set cookies before fetching the HTML. These are the contents of my cookies.txt file:
# Netscape HTTP Cookie File
# http://curlm.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
.scholar.google.com TRUE / FALSE 2147483647 GSP ID=353e8f974d766dcd:CF=2
.google.com TRUE / FALSE 1317124758 PREF ID=353e8f974d766dcd:TM=1254052758:LM=1254052758:S=_biVh02e4scrJT1H
.scholar.google.co.uk TRUE / FALSE 2147483647 GSP ID=f3f18b3b5a7c2647:CF=2
.google.co.uk TRUE / FALSE 1317125123 PREF ID=f3f18b3b5a7c2647:TM=1254053123:LM=1254053123:S=UqjRcTObh7_sARkN
And this is the code I use to fetch the HTML:
import http.cookiejar
import urllib.request, urllib.parse, urllib.error

def get_page(url, headers="", params=""):
    filename = "cookies.txt"
    request = urllib.request.Request(url, None, headers, params)
    cookies = http.cookiejar.MozillaCookieJar(filename, None, None)
    cookies.load()
    cookie_handler = urllib.request.HTTPCookieProcessor(cookies)
    redirect_handler = urllib.request.HTTPRedirectHandler()
    opener = urllib.request.build_opener(redirect_handler, cookie_handler)
    response = opener.open(request)
    return response

start = 0
search = "Ricardo Altamirano"
results_per_fetch = 20
host = "http://scholar.google.com"
base_url = "/scholar"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; U; ru; rv:5.0.1.6) Gecko/20110501 Firefox/5.0.1 Firefox/5.0.1'}
params = urllib.parse.urlencode({'start': start,
                                 'q': '"' + search + '"',
                                 'btnG': "",
                                 'hl': 'en',
                                 'num': results_per_fetch,
                                 'as_sdt': '1,14'})

url = base_url + "?" + params
resp = get_page(host + url, headers, params)
The full traceback is:
Traceback (most recent call last):
  File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test.py", line 29, in <module>
    resp = get_page(host + url, headers, params)
  File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test.py", line 8, in get_page
    cookies.load()
  File "C:\Python32\lib\http\cookiejar.py", line 1767, in load
    self._really_load(f, filename, ignore_discard, ignore_expires)
  File "C:\Python32\lib\http\cookiejar.py", line 1997, in _really_load
    filename)
http.cookiejar.LoadError: 'cookies.txt' does not look like a Netscape format cookies file
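I've looked through the documentation on the Netscape cookie file format, but I can't find anything that solves the problem. Are newlines required somewhere? Just in case, I changed the line endings to Unix style, but that didn't fix it. The closest thing to a specification I could find is this, which doesn't point to anything I'm missing. The fields on each of the last four lines are separated by tabs, not spaces, and everything else looks correct to me.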
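A quick way to narrow this down is to inspect the raw bytes of the file rather than what an editor displays. The sketch below is only a diagnostic under stated assumptions: the header pattern is written out by hand to mirror the first-line check the traceback points at, the seven-field rule is the usual layout of this format (domain, flag, path, secure, expiry, name, value), and `diagnose_cookie_file` is a hypothetical helper name, not part of `http.cookiejar`.

```python
import re

# Assumption: hand-written pattern mirroring the loader's first-line header check.
HEADER_PATTERN = re.compile(r"#( Netscape)? HTTP Cookie File")


def diagnose_cookie_file(path):
    """Return a list of human-readable problems found in a Netscape cookie file."""
    with open(path, "rb") as handle:
        raw_lines = handle.read().splitlines()
    if not raw_lines:
        return ["file is empty"]

    problems = []
    first = raw_lines[0]
    if first.startswith(b"\xef\xbb\xbf"):
        problems.append("line 1 starts with a UTF-8 byte order mark")
    if b"\x00" in first:
        problems.append("line 1 contains NUL bytes, the file may be UTF-16")
    if not HEADER_PATTERN.match(first.decode("latin-1")):
        problems.append("line 1 does not start with the expected header: %r" % first)

    for number, raw in enumerate(raw_lines[1:], start=2):
        text = raw.decode("latin-1")
        # Blank lines and comment lines carry no cookie data.
        if not text.strip() or text.lstrip().startswith("#"):
            continue
        field_count = len(text.split("\t"))
        if field_count != 7:
            problems.append(
                "line %d has %d tab-separated fields, expected 7" % (number, field_count)
            )
    return problems
```

Calling `diagnose_cookie_file("cookies.txt")` returns an empty list when none of these checks fire. A report about line 1 would suggest that something invisible (a blank first line, a byte order mark, or a different encoding) precedes or distorts the header.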
The [Netscape cookie spec](http://curl.haxx.se/rfc/cookie_spec.html) used to be hosted on netscape.com, before someone (AOL?) destroyed history. – n611x007 2013-11-03 18:28:51
The updated spec is [rfc2965](http://tools.ietf.org/html/rfc2965.html), *Set-Cookie2*. – n611x007 2013-11-03 18:31:33
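For anyone interested: you can call `cookies.save(cookie_file, ignore_discard=True, ignore_expires=True)` to create a valid cookie file as a reference and compare it with the invalid cookies.txt. Compare them line by line or byte by byte, deleting lines one at a time, and you will eventually find the cause. – 2014-02-21 10:24:59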
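The comparison suggested above could be set up roughly as follows. This is a minimal sketch, not a definitive recipe: `write_reference_cookie_file` and `raw_lines` are hypothetical helper names, and the cookie value is a placeholder. It saves one cookie through `MozillaCookieJar` and exposes the raw bytes of each line, so that tabs, line endings, and any stray leading bytes show up when printed next to the lines of the failing file.

```python
import http.cookiejar


def write_reference_cookie_file(path):
    """Save one sample cookie so the file can serve as a known-good example."""
    jar = http.cookiejar.MozillaCookieJar(path)
    cookie = http.cookiejar.Cookie(
        version=0, name="GSP", value="ID=example:CF=2",  # placeholder value
        port=None, port_specified=False,
        domain=".scholar.google.com", domain_specified=True, domain_initial_dot=True,
        path="/", path_specified=True,
        secure=False, expires=2147483647, discard=False,
        comment=None, comment_url=None, rest={},
    )
    jar.set_cookie(cookie)
    jar.save(ignore_discard=True, ignore_expires=True)
    return path


def raw_lines(path):
    """Return the file's lines as bytes, keeping line endings, for comparison."""
    with open(path, "rb") as handle:
        return handle.read().splitlines(keepends=True)
```

Printing `raw_lines("reference.txt")` and `raw_lines("cookies.txt")` side by side (both file names are examples) makes differences such as `\r\n` endings, spaces in place of `\t`, or bytes before the `#` of the header directly visible.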