2013-09-25

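How do I scrape a website that has a terms-acceptance page?

I am new to programming, and I am trying to write code to scrape a specific website. The problem is that this website first shows a page where you must accept the terms of use and the privacy policy. You can see it at: http://cpdocket.cp.cuyahogacounty.us/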

I need to get past this page somehow, and I don't know how. I am writing my code in Java, and so far my working code returns the source of any website. The code is:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Scraper takes a URL string and returns the source code of that page.
public class Scraper {

    // The website to be scraped. An instance field, so each Scraper keeps its own URL.
    private final String url;

    // constructor
    public Scraper(String url) {
        this.url = url;
    }

    // scrapeWebsite fetches the page at url and returns its source as a string.
    // Ideally this string should be saved so that another method can parse it.
    public String scrapeWebsite() throws IOException {
        URL urlconnect = new URL(url); // creates the URL from the variable
        URLConnection connection = urlconnect.openConnection(); // opens a connection to that URL
        StringBuilder a = new StringBuilder(); // accumulates the page source
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                connection.getInputStream(), StandardCharsets.UTF_8))) {
            String inputLine;
            // append each line, restoring the line break that readLine() strips
            while ((inputLine = in.readLine()) != null) {
                a.append(inputLine).append('\n');
            }
        }
        return a.toString();
    }
}

Any suggestions on how to go about this would be greatly appreciated.

I am rewriting my code based on some Ruby code. That code is:

def initializeSession() 
    ## SETUP # POST headers 
    post_header = Hash.new() 
    post_header['Host'] = 'cpdocket.cp.cuyahogacounty.us' 
    post_header['User-Agent'] = 'Mozilla/5.0 (Windows NT 5.1; rv:20.0) Gecko/20100101 Firefox/20.0' 
    post_header['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' 
    post_header['Accept-Language'] = 'en-US,en;q=0.5' 
    post_header['Accept-Encoding'] = 'gzip, deflate' 
    post_header['X-Requested-With'] = 'XMLHttpRequest' 
    post_header['X-MicrosoftAjax'] = 'Delta=true' 
    post_header['Cache-Control'] = 'no-cache' 
    post_header['Content-Type'] = 'application/x-www-form-urlencoded; charset=utf-8' 
    post_header['Referer'] = 'http://cpdocket.cp.cuyahogacounty.us/Search.aspx' # may have to alter this per request 
    # post_header['Content-Length'] = '12197' 
    post_header['Connection'] = 'keep-alive' 
    post_header['Pragma'] = 'no-cache' 



    # STEP # set up simulated browser and make first request 
    #browser = SimBrowser.new() 
    #logname = 'log.txt' 
    #s = Scribe.new(logname) 
    session_cookie = 'ASP.NET_SessionId' 
    url = 'http://cpdocket.cp.cuyahogacounty.us/' 
    @browser.http_get(url) 
    #puts browser.get_body() # debug 
    puts 'DEBUG: session cookie: ' + @browser.get_cookie_var(session_cookie) 
    @log.slog('DEBUG: home page response code: expected 200, actual ' + @browser.get_response().code) 
    # s.flog('### HOME PAGE RESPONSE') 
    # s.flog(browser.get_body()) # debug 

    # STEP # send our acceptance of the terms of service 
    data = { 
     'ctl00$SheetContentPlaceHolder$btnYes' => 'Yes', 
     '__EVENTARGUMENT'=>'', 
     '__EVENTTARGET'=>'', 
     '__EVENTVALIDATION'=>'/wEWBwKc78CQCQLn3/HqCQLZw/fZCgLipuudAQK42duKDQL33NjnAwKn6+K4CIM3TSmrbrsn2xBRJf2DRwg01Vsbdk+oJV9lhG/in+xD', 
     '__VIEWSTATE'=>'/wEPDwUKLTI4MzA1ODM0OA9kFgJmD2QWAgIDD2QWDgIDD2QWAgIBD2QWCAIBDxYCHgRUZXh0BQ9BbmRyZWEgRi4gUm9jY29kAgMPFgIfAAUfQ3V5YWhvZ2EgQ291bnR5IENsZXJrIG9mIENvdXJ0c2QCBQ8PFgIeB1Zpc2libGVoZGQCBw8PFgIfAWhkZAIHDw9kFgIeB29uY2xpY2sFGmphdmFzY3JpcHQ6d2luZG93LnByaW50KCk7ZAILDw9kFgIfAgUiamF2YXNjcmlwdDpvbkNsaWNrPXdpbmRvdy5jbG9zZSgpO2QCDw8PZBYCHwIFRmRpc3BsYXlQb3B1cCgnaF9EaXNjbGFpbWVyLmFzcHgnLCdteVdpbmRvdycsMzcwLDIyMCwnbm8nKTtyZXR1cm4gZmFsc2VkAhMPZBYCZg8PFgIeC05hdmlnYXRlVXJsBRMvVE9TLmFzcHg/aXNwcmludD1ZZGQCFQ8PZBYCHwIFRWRpc3BsYXlQb3B1cCgnaF9RdWVzdGlvbnMuYXNweCcsJ215V2luZG93JywzNzAsMzcwLCdubycpO3JldHVybiBmYWxzZWQCFw8WAh8ABQYxLjAuNTRkZEnXSWiVLEPsDmlc7dX4lH/53vU1P1SLMCBNASGt4T3B' 
    } 
    #post_header['Referer'] = url 
    @browser.http_post(url, data, post_header) 
    @log.slog('DEBUG: accept terms response code: expected 200, actual ' + @browser.get_response().code) 
    @log.flog('### TOS ACCEPTANCE RESPONSE') 
    # @log.flog(@browser.get_body()) # debug  
end 

Can this also be done in Java?
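
For the form-handling half of the Ruby code, a rough Java sketch might look like the following. ASP.NET Web Forms pages emit `__VIEWSTATE` and `__EVENTVALIDATION` as hidden inputs, so instead of hard-coding them as the Ruby code does, this sketch reads them from the fetched HTML, which is probably more robust if the page changes. The class name `AspNetForm` and its methods are made up for illustration, the regex-based parsing assumes double-quoted attributes and does not decode HTML entities, and none of it has been run against the real site.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of the original code or of any library.
final class AspNetForm {

    // Matches whole <input ...> tags; attribute order inside the tag is not assumed.
    private static final Pattern INPUT_TAG =
            Pattern.compile("<input\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the value of one double-quoted attribute inside a tag, or null if absent.
    public static String attr(String tag, String name) {
        Matcher m = Pattern.compile(
                "(?<![\\w-])" + name + "\\s*=\\s*\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE).matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    // Collects name -> value for every <input type="hidden"> in the page,
    // which includes __VIEWSTATE and __EVENTVALIDATION when they are present.
    public static Map<String, String> hiddenFields(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = INPUT_TAG.matcher(html);
        while (m.find()) {
            String tag = m.group();
            String name = attr(tag, "name");
            if (name != null && "hidden".equalsIgnoreCase(attr(tag, "type"))) {
                String value = attr(tag, "value");
                fields.put(name, value == null ? "" : value);
            }
        }
        return fields;
    }

    // Encodes the fields as an application/x-www-form-urlencoded request body.
    public static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }
}
```

To mirror the Ruby `data` hash, you would call `hiddenFields(html)`, add the entry `ctl00$SheetContentPlaceHolder$btnYes` => `Yes` (the button name taken from the Ruby code), and pass the map to `encode`. A real HTML parser would be a safer choice than regular expressions for anything beyond a quick experiment.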

Answer

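If you don't understand how to do this, the best way to learn is to watch what happens (the requests sent and the responses received) in Firebug (on Firefox) or the equivalent tool for IE, Chrome, or Safari.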

Your code must replicate whatever happens at the protocol level when a user manually accepts the terms & conditions.
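
As a rough illustration of replicating that exchange, here is a sketch that keeps one cookie-aware session, loads the page, and then posts the acceptance form back to the same URL, as the Ruby code in the question does. It uses `java.net.http.HttpClient` (Java 11 or later) instead of the `URLConnection` from the question, because it can be given a `CookieManager` directly. The class `TermsSession`, its methods, and the `formBodyFor` callback (which turns the page HTML into a URL-encoded form body) are all hypothetical names, and the sketch has not been tested against the real site.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.Function;

// Hypothetical helper: not part of the original code.
final class TermsSession {

    // One client for the whole session. The CookieManager stores cookies set by
    // the first response (for example ASP.NET_SessionId) and sends them back on
    // later requests made through the same client.
    public static HttpClient newClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // Builds the POST that submits the acceptance form.
    public static HttpRequest acceptRequest(URI page, String formBody) {
        return HttpRequest.newBuilder(page)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody))
                .build();
    }

    // Step 1: GET the page (this establishes the session cookie).
    // Step 2: derive the form body from that page's HTML and POST it back.
    // Returns the body of the response to the POST.
    public static String acceptTerms(HttpClient client, URI page,
                                     Function<String, String> formBodyFor)
            throws IOException, InterruptedException {
        HttpResponse<String> first = client.send(
                HttpRequest.newBuilder(page).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        HttpResponse<String> accepted = client.send(
                acceptRequest(page, formBodyFor.apply(first.body())),
                HttpResponse.BodyHandlers.ofString());
        return accepted.body();
    }
}
```

Any later request made with the same `HttpClient` carries the session cookie, so pages behind the terms screen can be fetched with that client afterwards. Whether the server also checks headers such as `Referer` or `User-Agent` is something to confirm in the browser's network tool.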

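You must also be aware that the UI presented to the user may not be sent directly as HTML; it may be built dynamically by JavaScript, which normally runs in the browser. If you are not prepared to fully emulate a browser, maintaining a DOM and executing JavaScript, then this may not be possible.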
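
One quick way to tell which case applies is to check whether the form field you need is already in the raw HTML your scraper downloads. The sketch below is only a heuristic with made-up names (`StaticFormCheck`, `hasInput`), and it assumes double-quoted attributes. If the field is found, plain HTTP requests are likely enough; if it is missing, the page is probably assembled by JavaScript.

```java
import java.util.regex.Pattern;

// Hypothetical helper: not part of the original code.
final class StaticFormCheck {

    // True if the raw HTML contains an <input> tag whose name attribute
    // equals fieldName (matched case-insensitively).
    public static boolean hasInput(String html, String fieldName) {
        Pattern p = Pattern.compile(
                "<input\\b[^>]*(?<![\\w-])name\\s*=\\s*\""
                        + Pattern.quote(fieldName) + "\"",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(html).find();
    }
}
```

For the site in the question, the field to look for would be `ctl00$SheetContentPlaceHolder$btnYes`, the button name used in the Ruby code.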