2017-04-09

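How to download a CAPTCHA image without a direct link

I am trying to access sci-hub.io from a command-line client, not to defeat its CAPTCHA system. When you POST a DOI to its front page, it returns a PDF URL of the form http://moscow.sci-hub.io/abc123blah/foo.pdf. If you then request that link, you randomly get either the PDF or a CAPTCHA. The CAPTCHA page has this source: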

<html> 
    <head> 
     <title>Для просмотра статьи разгадайте капчу</title> 
     <meta charset="UTF-8"> 
     <meta name="viewport" content="width=device-width, initial-scale=1.0"> 
    </head> 
    <body style = "background:white"> 
     <div> 
      <table style = "width:100%;height:100%"><tr><td style = "vertical-align:middle;text-align:center"> 
      <h2 style = "color:gray;font-family:sans-serif;padding:18px">для просмотра статьи разгадайте капчу</h2> 
      <p></p> 
      <form action = "" method = "POST"> 
       <p><img id="captcha" src="/captcha/securimage_show.php" /></p> 
       <input type="text" maxlength="6" name="captcha_code" style = "width:256px;font-size:18px;height:36px;margin-top:18px;text-align:center" autofocus /><br> 
       <a style = "color:gray;text-decoration:none" href="#" onclick="document.getElementById('captcha').src = '/captcha/securimage_show.php?' + Math.random(); return false">[ показать другую картинку ]</a> 
       <p style = "margin-top:22px"><input type = "submit" value= "Продолжить"></p> 
      </form> 
      </td></tr></table> 
     </div> 
    </body> 
</html> 

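All I can think of is to request securimage_show.php, save the image, show it to the user, read back what they type, and then POST that as the response. An example PDF link is http://moscow.sci-hub.io/291193c259b69cc057d74e3eb4965c4f/ong2014.pdf. Something like: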

import requests 
from PIL import Image 
import io 

pdf_url = "http://moscow.sci-hub.io/3dcd1bf3b82ea549c0a72e9ab195ab78/walter2015.pdf" 

r1 = requests.get(pdf_url) 

if r1.headers['Content-Type'] != 'application/pdf': 
    print("Looks like Sci-hub gave us a captcha") 

    image = requests.get("http://moscow.sci-hub.io/captcha/securimage_show.php").content 
    img = io.BytesIO(image) 
    im = Image.open(img) 
    im.show() 
    captcha_text = input("Enter captcha text: ") 

    r2 = requests.post(pdf_url, data = {'captcha_code': captcha_text}) 

    if r2.headers['Content-Type'] != 'application/pdf':
        print("Looks like Sci-hub gave us another captcha")
    else:
        with open("filename.pdf", 'wb') as f:
            f.write(r2.content)
        print("saved!")

else:
    print("Got a PDF")
    with open("filename.pdf", 'wb') as f:
        f.write(r1.content)
    print("saved!")

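I have no way to get the original CAPTCHA image that was generated when I first requested the PDF. When I request another CAPTCHA image from securimage_show.php, it generates a new image, so the answer I POST is wrong. How can I solve this?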

Maybe you should perform both operations in one session? See http://docs.python-requests.org/en/master/user/advanced/

Answer

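Thanks to Andrew for pointing me in the right direction. I needed to set up a session with requests. I assume the session passes a cookie back and forth so that the server can keep track of the latest CAPTCHA it sent me. That is only a guess, because this is still a bit magical to me.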

import requests 
from PIL import Image 
from io import BytesIO 

pdf_url = "http://moscow.sci-hub.io/3dcd1bf3b82ea549c0a72e9ab195ab78/walter2015.pdf" 

s = requests.Session() 
r1 = s.get(pdf_url) 

if r1.headers['Content-Type'] != 'application/pdf': 
    print("Looks like Sci-hub gave us a captcha") 

    image = s.get("http://moscow.sci-hub.io/captcha/securimage_show.php").content 
    img = BytesIO(image) 
    im = Image.open(img) 
    im.show() 
    captcha_text = input("Enter captcha text: ") 

    r2 = s.post(pdf_url, data = {'captcha_code': captcha_text}) 

    if r2.headers['Content-Type'] != 'application/pdf':
        print("Looks like Sci-hub gave us another captcha")
    else:
        with open("filename.pdf", 'wb') as f:
            f.write(r2.content)
        print("saved!")

else:
    print("Got a PDF")
    with open("filename.pdf", 'wb') as f:
        f.write(r1.content)
    print("saved!")
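
The cookie guess above can be checked locally without contacting Sci-Hub. The following is only a sketch under stated assumptions: `CaptchaLikeHandler`, the `sid` cookie, `fetch_twice` and `run_demo` are hypothetical names invented for this illustration, and the toy server merely imitates a site that recognises clients by a session cookie (which is presumably how the CAPTCHA page ties the image to the later POST). It uses only the standard library, with `http.cookiejar.CookieJar` standing in for the cookie jar that `requests.Session` keeps between calls.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CaptchaLikeHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a server that tracks clients by cookie."""

    def do_GET(self):
        has_cookie = "sid=abc123" in self.headers.get("Cookie", "")
        body = b"known client" if has_cookie else b"new client"
        self.send_response(200)
        if not has_cookie:
            # First contact: hand out a (made-up) session id.
            self.send_header("Set-Cookie", "sid=abc123; Path=/")
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_twice(opener, url):
    """Request the same URL twice and return both response bodies."""
    bodies = []
    for _ in range(2):
        with opener.open(url) as resp:
            bodies.append(resp.read().decode())
    return bodies


def run_demo():
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), CaptchaLikeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_port
    no_proxy = urllib.request.ProxyHandler({})  # ignore any system proxy settings
    try:
        # No cookie jar: comparable to two independent requests.get() calls.
        stateless = fetch_twice(urllib.request.build_opener(no_proxy), url)
        # Shared cookie jar: comparable to reusing one requests.Session.
        jar = http.cookiejar.CookieJar()
        stateful_opener = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(jar)
        )
        stateful = fetch_twice(stateful_opener, url)
    finally:
        server.shutdown()
        server.server_close()
    return stateless, stateful


if __name__ == "__main__":
    print(run_demo())
```

Without a cookie jar the toy server treats both requests as a new client; with the shared jar the second request is recognised. That matches the behaviour seen in the question, where each plain `requests.get` looked like a fresh visitor and so received a fresh CAPTCHA.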