
I have a little script that I was very happy with: it reads one or more bibliographic references from the clipboard, gets the academic paper's details from Google Scholar, and then feeds them into SciHub to obtain the PDF. For some reason it has stopped working, and I have spent ages trying to figure out why. Sending the form request to SciHub no longer works (urllib / urllib2 / python).

Testing suggests that the Google (scholarly.py) part of the program works fine; it is the SciHub part that is the problem.

Any ideas? Here is a sample reference: Appleyard, S.J., Angeloni, J. and Watkins, R. (2006) Arsenic-rich groundwater in an urban area experiencing drought and increasing population density, Perth, Australia. Applied Geochemistry 21(1), 83-97.

'''Program to automatically find and download items from a bibliography or references list.
This program uses the 'scihub' website to obtain the full-text paper where
available; if no entry is found, the paper is ignored and the failed downloads
are listed at the end'''

import scholarly 
import win32clipboard 
import urllib 
import urllib2 
import webbrowser 
import re 

'''Select and then copy the bibliography entries you want to download the 
papers for, python reads the clipboard''' 
win32clipboard.OpenClipboard() 
c = win32clipboard.GetClipboardData() 
win32clipboard.EmptyClipboard() 

'''Cleans up the text. removes end lines and double spaces etc.''' 
c = c.replace('\n', ' ') 
c = c.replace('\r', ' ') 
while c.find('  ') != -1:
    c = c.replace('  ', ' ')
win32clipboard.SetClipboardText(c) 
win32clipboard.CloseClipboard() 
print "Working..." 

'''bit of regex to extract the title of the paper,
IMPORTANT: the bibliography has to be in
author-date format or you will need to revise this.
At the moment it looks for a year in brackets, then copies all the text until it
reaches a full-stop, assuming that this is the paper title. If it is not, it
will either fail or will be using inappropriate search terms.'''


paper_info= re.findall(r"(\d{4}[a-z]*)([). ]+)([ \"])+([\w\s_():,-]*)(.)",c) 
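# With the sample reference above, re.findall returns one 5-tuple; its
# index-3 element is the text between the bracketed year and the next
# full stop, which is assumed to be the paper title, e.g.
# ('2006', ')', ' ', 'Arsenic-rich groundwater in an urban area ... Perth, Australia', '.')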
print "Analysing titles" 
print "The following titles found:" 
print "*************************" 
list_of_titles= list() 
for i in paper_info: 
    print '%s...' % (i[3][:50]) 
    Paper_title=str(i[3]) 
    list_of_titles.append(Paper_title) 

failed=list() 
for title in list_of_titles: 
    try:
        search_query = scholarly.search_pubs_query(title)

        info= (next(search_query))

        print "Querying Google Scholar"
        print "**********************"
        print "Looking up paper title:"
        print "**********************"
        print title
        print "**********************"

        url=info.bib['url']
        print "Journal URL found"
        print url
        #url=next(search_query)
        print "Sending URL: ", url

        site='http://sci-hub.cc/'
        data = urllib.urlencode({'request': url})

        print data
        results = urllib2.urlopen(site, data) #this is where it fails

        with open("results.html", "w") as f:
            f.write(results.read())

        webbrowser.open_new("results.html")

    except:
        print "**********************"
        print "No valid journal found for:"
        print title
        print "******************"
        print "Continuing..."
        failed.append(title)
    continue

if len(failed)==0: 
    print 'Complete' 

else: 
    print '*************************************' 
    print 'The following titles did not download: ' 
    print '*************************************' 
    print failed 
    print "Please check that these are valid entries" 

You have a bare `except:` block in your code that is eating every exception and replacing it with a useless error message. Try removing it and see what the problem actually is. – Blender
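
(Illustration of Blender's suggestion: a minimal sketch with the bare `except:` removed so the real error surfaces; the test URL is a placeholder for the journal URL that scholarly returns.)

import urllib
import urllib2

url = 'http://example.com/paper'  # placeholder for the journal URL from scholarly
site = 'http://sci-hub.cc/'
data = urllib.urlencode({'request': url})
try:
    results = urllib2.urlopen(site, data)
except urllib2.HTTPError as e:
    # without a bare except swallowing it, the real failure is visible
    print "SciHub refused the request:", e  # prints e.g. HTTP Error 403: Forbidden
    raise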


Thanks Blender, I get HTTP Error 403: Forbidden – flashliquid


I think I need to spoof the headers so it doesn't look like a Python script. I can't get it to work, though. I'm currently rewriting the offending part using 'requests' instead of urllib and urllib2. It's confusing, because it worked fine for weeks. – flashliquid
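
(A sketch of the requests-based rewrite mentioned above, assuming the same 'request' form field; the User-Agent string is illustrative and the test URL is a placeholder.)

import requests

url = 'http://example.com/paper'  # placeholder for the journal URL from scholarly
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64)'}  # any browser-like agent
response = requests.post('http://sci-hub.cc/', data={'request': url}, headers=headers)
with open('results.html', 'w') as f:
    f.write(response.content)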

Answer


It works now: I added a 'User-Agent' header and reworked the urllib calls. It seems obvious in hindsight. It was a process of trial and error, trying lots of different code snippets grabbed from the web. Hopefully my boss won't ask what I achieved today. Someone should set up a forum where people can get answers to coding questions...

'''Program to automatically find and download items from a bibliography or references list. Here are some journal papers in bibliographic format; just copy the text to the clipboard and run the script.

Ghaffour, N., T. M. Missimer and G. L. Amy (2013). "Technical review and evaluation of the economics of water desalination: Current and future challenges for better water supply sustainability." Desalination 309(0): 197-207. 

Gutiérrez Ortiz, F. J., P. G. Aguilera and P. Ollero (2014). "Biogas desulfurization by adsorption on thermally treated sewage-sludge." Separation and Purification Technology 123(0): 200-213. 

This program uses the 'scihub' website to obtain the full-text paper where
available; if no entry is found, the paper is ignored and the failed downloads are listed at the end'''

import scholarly
import win32clipboard
import urllib
import urllib2
import webbrowser
import re


'''Select and then copy the bibliography entries you want to download the
papers for, python reads the clipboard'''
win32clipboard.OpenClipboard()
c = win32clipboard.GetClipboardData()
win32clipboard.EmptyClipboard()

'''Cleans up the text. removes end lines and double spaces etc.'''
c = c.replace('\n', ' ')
c = c.replace('\r', ' ')
while c.find('  ') != -1:
    c = c.replace('  ', ' ')
win32clipboard.SetClipboardText(c)
win32clipboard.CloseClipboard()
print "Working..."

'''bit of regex to extract the title of the paper,
IMPORTANT: the bibliography has to be in
author-date format or you will need to revise this.
At the moment it looks for a year in brackets, then copies all the text until it
reaches a full-stop, assuming that this is the paper title. If it is not, it
will either fail or will be using inappropriate search terms.'''

paper_info= re.findall(r"(\d{4}[a-z]*)([). ]+)([ \"])+([\w\s_():,-]*)(.)",c)
print "Analysing titles"
print "The following titles found:"
print "*************************"
list_of_titles= list()
for i in paper_info:
    print '%s...' % (i[3][:50])
    Paper_title=str(i[3])
    list_of_titles.append(Paper_title)
paper_number=0
failed=list()
for title in list_of_titles:
    try:
        search_query = scholarly.search_pubs_query(title)

        info= (next(search_query))
        paper_number+=1
        print "Querying Google Scholar"
        print "**********************"
        print "Looking up paper title:"
        print title
        print "**********************"

        url=info.bib['url']
        print "Journal URL found"
        print url
        #url=next(search_query)
        print "Sending URL: ", url

        site='http://sci-hub.cc/'

        # Build the POST request by hand so a browser-like User-Agent header
        # can be attached; the default urllib2 agent string appears to be
        # what SciHub rejects with 403 Forbidden.
        r = urllib2.Request(url=site)
        r.add_header('User-Agent','Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')
        r.add_data(urllib.urlencode({'request': url}))
        res= urllib2.urlopen(r)

        with open("results.html", "w") as f:
            f.write(res.read())

        webbrowser.open_new("results.html")
        if paper_number < len(list_of_titles):
            print "Next title"
        else:
            continue

    except Exception as e:
        print repr(e)
        paper_number+=1
        print "**********************"
        print "No valid journal found for:"
        print title
        print "**********************"
        print "Continuing..."
        failed.append(title)
    continue

if len(failed)==0:
    print 'Complete'

else:
    print '*************************************'
    print 'The following titles did not download:'
    print '*************************************'
    print failed
    print "Please check that these are valid entries"
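
For anyone who wants to verify the header fix in isolation, here is a minimal sketch of just the changed part; the target site is the one used above, while the test URL is a placeholder, not a real journal link:

import urllib
import urllib2

url = 'http://example.com/paper'  # placeholder for a journal URL
req = urllib2.Request('http://sci-hub.cc/')
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64)')  # browser-like agent
req.add_data(urllib.urlencode({'request': url}))
print urllib2.urlopen(req).getcode()  # 200 means the 403 is gone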