
Why can urlopen download a Google search results page but not a Google Scholar results page?

I am using Python 3.2.3's urllib.request module to download Google search results, but I am getting a strange error: urlopen works with links to Google search results, but not with Google Scholar. In this example I am searching for "JOHN SMITH". This code prints the HTML successfully:

from urllib.request import urlopen, Request 
from urllib.error import URLError 

# Google 
try: 
    page_google = '''http://www.google.com/#hl=en&sclient=psy-ab&q=%22JOHN+SMITH%22&oq=%22JOHN+SMITH%22&gs_l=hp.3..0l4.129.2348.0.2492.12.10.0.0.0.0.154.890.6j3.9.0...0.0...1c.gjDBcVcGXaw&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&fp=dffb3b4a4179ca7c&biw=1366&bih=649''' 
    req_google = Request(page_google) 
    req_google.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    html_google = urlopen(req_google).read() 
    print(html_google[0:10]) 
except URLError as e: 
    print(e) 

But this code, which does the same thing for Google Scholar, raises a URLError exception:

from urllib.request import urlopen, Request 
from urllib.error import URLError 

# Google Scholar 
try: 
    page_scholar = '''http://scholar.google.com/scholar?hl=en&q=%22JOHN+SMITH%22&btnG=&as_sdt=1%2C14''' 
    req_scholar = Request(page_scholar) 
    req_scholar.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    html_scholar = urlopen(req_scholar).read() 
    print(html_scholar[0:10]) 
except URLError as e: 
    print(e) 

Traceback:

Traceback (most recent call last):
  File "/home/ak5791/Desktop/code-sandbox/scholar/crawler.py", line 6, in <module>
    html = urlopen(page).read()
  File "/usr/lib/python3.2/urllib/request.py", line 138, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.2/urllib/request.py", line 369, in open
    response = self._open(req, data)
  File "/usr/lib/python3.2/urllib/request.py", line 387, in _open
    '_open', req)
  File "/usr/lib/python3.2/urllib/request.py", line 347, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.2/urllib/request.py", line 1155, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.2/urllib/request.py", line 1138, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -5] No address associated with hostname>

I obtained these links by searching in Chrome and copying them from the address bar. A commenter reported a 403 error, and I sometimes get that too. I assume that is because Google does not support scraping of Scholar. However, changing the user-agent string fixes neither that nor the original problem, because most of the time I still get URLErrors.
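A minimal sketch for telling those two failures apart (reusing the Scholar URL and user-agent string from above): urllib raises HTTPError, a subclass of URLError, when the server answers with a status such as 403, and a plain URLError when no response arrives at all, as with the DNS failure in the traceback above.

from urllib.request import urlopen, Request
from urllib.error import HTTPError, URLError

req = Request('http://scholar.google.com/scholar?hl=en&q=%22JOHN+SMITH%22&btnG=&as_sdt=1%2C14')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')

try:
    html = urlopen(req).read()
    print(html[0:10])
except HTTPError as e:
    # The server responded, but refused the request (e.g. 403 Forbidden).
    print('HTTP error:', e.code)
except URLError as e:
    # No response at all: DNS failure, refused connection, etc.
    print('URL error:', e.reason)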


I get a 403 (Forbidden), which probably means Google does not want you to fetch results from Scholar searches. The terms of service may well prohibit it (I have not checked them). – 2012-07-14 14:02:51


@SvenMarnach I updated the question because I tried changing the user-agent string. Sometimes I get a 'URLError', sometimes a 403 error. Most of the time, though, I get the former. – 2012-07-14 14:27:19

Answer


This PHP script seems to indicate that you need to set some cookies before Google will serve results:

/* 

Need a cookie file (scholar_cookie.txt) like this: 

# Netscape HTTP Cookie File 
# http://curlm.haxx.se/rfc/cookie_spec.html 
# This file was generated by libcurl! Edit at your own risk. 

.scholar.google.com  TRUE /  FALSE 2147483647  GSP  ID=353e8f974d766dcd:CF=2 
.google.com  TRUE /  FALSE 1317124758  PREF ID=353e8f974d766dcd:TM=1254052758:LM=1254052758:S=_biVh02e4scrJT1H 
.scholar.google.co.uk TRUE /  FALSE 2147483647  GSP  ID=f3f18b3b5a7c2647:CF=2 
.google.co.uk TRUE /  FALSE 1317125123  PREF ID=f3f18b3b5a7c2647:TM=1254053123:LM=1254053123:S=UqjRcTObh7_sARkN 

*/ 

This is corroborated by a comment on the Python recipe for Google Scholar, which includes a warning that Google detects scripts and will ban you if you use it too heavily.
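A minimal sketch of that cookie approach in Python, assuming a Netscape-format cookie file named scholar_cookie.txt like the one quoted above (the filename and cookie values come from the PHP script's comment and are not verified here): http.cookiejar.MozillaCookieJar can load such a file, and an opener built with HTTPCookieProcessor then sends those cookies with every request.

from urllib.request import build_opener, HTTPCookieProcessor, Request
from http.cookiejar import MozillaCookieJar

# Load the Netscape-format cookie file (scholar_cookie.txt is assumed
# to exist alongside the script, as in the PHP example above).
jar = MozillaCookieJar('scholar_cookie.txt')
jar.load()

# Build an opener that attaches those cookies to every request it makes.
opener = build_opener(HTTPCookieProcessor(jar))

req = Request('http://scholar.google.com/scholar?hl=en&q=%22JOHN+SMITH%22&btnG=&as_sdt=1%2C14')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
html = opener.open(req).read()
print(html[0:10])

Whether Scholar actually returns results still depends on the cookies being valid and on Google tolerating the requests, as the warning above notes.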


The Python recipe is great, although it is a bit out of date now that the HTML layout of the results page has changed. With a few tweaks, though, it works exactly as needed. – 2012-07-16 13:10:17


I do not like chaining this question to a new one, but do you have any idea why the cookies you describe there are not being recognized as valid? I have been trying to [figure it out](http://stackoverflow.com/questions/11529428/why-does-python-say-this-netscape-cookie-file-isnt-valid), because it is blocking my entire application. – 2012-07-17 20:41:14