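Sending a form request to SciHub no longer works using urllib / urllib2 (Python)
I have a small script, which I was rather pleased with, that reads one or more bibliographic references from the clipboard, looks up the paper's details on Google Scholar, and then feeds the result to SciHub to get the PDF. For some reason it has stopped working, and I have spent ages trying to work out why.
Testing shows that the Google (scholarly.py) part of the program works fine; it is the SciHub part that is the problem.
Any ideas?
Appleyard, S.J., Angeloni, J. and Watkins, R. (2006) Arsenic-rich groundwater in an urban area experiencing drought and increasing population density, Perth, Australia. Applied Geochemistry 21(1), 83-97.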
'''Program to automatically find and download items from a bibliography or references list.
This program uses the 'scihub' website to obtain the full-text paper where
available, if no entry is found the paper is ignored and the failed downloads
are listed at the end'''
import scholarly
import win32clipboard
import urllib
import urllib2
import webbrowser
import re
'''Select and then copy the bibliography entries you want to download the
papers for, python reads the clipboard'''
win32clipboard.OpenClipboard()
c = win32clipboard.GetClipboardData()
win32clipboard.EmptyClipboard()
'''Cleans up the text. removes end lines and double spaces etc.'''
c = c.replace('\n', ' ')
c = c.replace('\r', ' ')
while c.find('  ') != -1:
    c = c.replace('  ', ' ')
win32clipboard.SetClipboardText(c)
win32clipboard.CloseClipboard()
print "Working..."
'''bit of regex to extract the title of the paper,
IMPORTANT: bibliography has to be in
author date format or you will need to revise this,
at the moment it looks for year date in brackets, then copies all the text until it
reaches a full-stop, assuming that this is the paper title. If it is not, it
will either fail or will be using inappropriate search terms.'''
paper_info= re.findall(r"(\d{4}[a-z]*)([). ]+)([ \"])+([\w\s_():,-]*)(.)",c)
print "Analysing titles"
print "The following titles found:"
print "*************************"
list_of_titles= list()
for i in paper_info:
    print '%s...' % (i[3][:50])
    Paper_title=str(i[3])
    list_of_titles.append(Paper_title)
failed=list()
for title in list_of_titles:
    try:
        search_query = scholarly.search_pubs_query(title)
        info= (next(search_query))
        print "Querying Google Scholar"
        print "**********************"
        print "Looking up paper title:"
        print "**********************"
        print title
        print "**********************"
        url=info.bib['url']
        print "Journal URL found "
        print url
        #url=next(search_query)
        print "Sending URL: ", url
        site='http://sci-hub.cc/'
        data = urllib.urlencode({'request': url})
        print data
        results = urllib2.urlopen(site, data) #this is where it fails
        with open("results.html", "w") as f:
            f.write(results.read())
        webbrowser.open_new("results.html")
    except:
        print "**********************"
        print "No valid journal found for:"
        print title
        print "**********************"
        print "Continuing..."
        failed.append(title)
        continue
if len(failed)==0:
    print 'Complete'
else:
    print '*************************************'
    print 'The following titles did not download: '
    print '*************************************'
    print failed
    print "Please check that these are valid entries"
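For anyone who wants to try the title-extraction step on its own, here is a minimal sketch that runs the script's regex against the sample reference from the question, without touching the clipboard or the network. It is written for Python 3, unlike the Python 2 script above. The helper names `normalise_whitespace` and `extract_titles` are made up for this illustration, and `" ".join(text.split())` is a swapped-in shortcut for the replace loop (it also trims leading and trailing whitespace, which the original does not).

```python
import re

# Same pattern as in the script: a four-digit year, closing punctuation,
# then everything up to the next character outside the allowed set.
TITLE_PATTERN = re.compile(r"(\d{4}[a-z]*)([). ]+)([ \"])+([\w\s_():,-]*)(.)")


def normalise_whitespace(text):
    """Collapse line breaks and runs of whitespace into single spaces."""
    return " ".join(text.split())


def extract_titles(text):
    """Return the fourth regex group (the assumed title) for every match."""
    cleaned = normalise_whitespace(text)
    return [match[3] for match in TITLE_PATTERN.findall(cleaned)]


sample = (
    "Appleyard, S.J., Angeloni, J. and Watkins, R. (2006) Arsenic-rich\r\n"
    "groundwater in an urban area experiencing drought and increasing  "
    "population density, Perth, Australia. Applied Geochemistry 21(1), 83-97."
)

titles = extract_titles(sample)
print(titles)
```

The title ends at the first full stop after the year, so a title that itself contains a full stop would be cut short, as the docstring in the script already warns.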
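You have a bare `except:` block in your code that is eating every exception and replacing it with a useless error message. Try removing it and see what the problem actually is. – Blender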
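One way to act on this without losing the "failed titles" list is to catch only the errors you expect and keep the real reason next to each title. The sketch below is only an illustration, in Python 3 syntax (in Python 2 the same exception classes live in `urllib2`). The helper `run_and_record` and the stand-in function `forbidden` are hypothetical names, and the 403 is simulated locally rather than coming from a real request.

```python
import urllib.error


def run_and_record(action, title, failed):
    """Run action(); on an expected failure, append (title, reason) to failed."""
    try:
        return action()
    except urllib.error.HTTPError as err:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        reason = "HTTP Error %d: %s" % (err.code, err.reason)
    except urllib.error.URLError as err:
        reason = "URL error: %s" % (err.reason,)
    except StopIteration:
        # next() on an exhausted search iterator: nothing was found.
        reason = "no search result returned"
    # Any other exception is not caught here and propagates with its traceback.
    failed.append((title, reason))
    return None


def forbidden():
    """Stand-in for the failing request; raises a simulated 403."""
    raise urllib.error.HTTPError("http://example.invalid/", 403, "Forbidden", {}, None)


failed = []
run_and_record(forbidden, "Some title", failed)
print(failed)
```

Because unexpected exceptions are left alone, a typo or a changed attribute name shows up as a normal traceback instead of being reported as "No valid journal found".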
Thanks Blender, I get HTTP Error 403: Forbidden – flashliquid
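I think I need to spoof the headers so the request doesn't look like it comes from a Python script, but I can't get that to work. I'm currently rewriting the offending section using `requests` instead of urllib and urllib2. It's confusing, because it worked fine for weeks. – flashliquid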
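For reference, a minimal sketch of what "spoofing the headers" could look like with the standard library, in Python 3 syntax. Nothing in this thread confirms that a different User-Agent is enough to get past the 403, so treat it as something to try, not a fix. The helper name `build_form_request`, the example paper URL and the User-Agent string are placeholders invented for illustration. In Python 2 the equivalent calls are `urllib.urlencode` and `urllib2.Request(site, data, headers)`; with `requests` it would be `requests.post(site, data={...}, headers={...})`.

```python
import urllib.parse
import urllib.request


def build_form_request(site, paper_url, user_agent):
    """Build (but do not send) a POST request carrying a custom User-Agent."""
    # Form-encode the field the script already uses; POST bodies must be bytes.
    data = urllib.parse.urlencode({"request": paper_url}).encode("ascii")
    headers = {"User-Agent": user_agent}
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(site, data=data, headers=headers)


req = build_form_request(
    "http://sci-hub.cc/",          # site URL taken from the script
    "http://example.org/paper",    # placeholder journal URL
    "Mozilla/5.0 (placeholder)",   # placeholder User-Agent string
)

print(req.get_method())
print(req.get_header("User-agent"))  # urllib stores header names capitalised this way
print(req.data)
```

Sending it would be `urllib.request.urlopen(req)`; that call is left out so the sketch runs offline.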