2013-09-26 84 views
0

How do I implement a Next button click in Java code?

I am writing Java code to retrieve and parse the source of a web page. The website I am trying to access is: http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/results.aspx?q=searchType%3dSaleDate%26searchString%3d9%2f30%2f2013%26foreclosureType%3d%27NONT%27%2c+%27PAR%27%2c+%27COMM%27%2c+%27TXLN%27

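The source code I retrieve covers only one page, even though there are 11 pages in total. To see the source of the next page, I have to click the Next button, which reloads the page with the new source. I need to reproduce this in my code so that it retrieves the source of all the different pages.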

I have read that PhantomJS or CasperJS could be used for this, but I don't know how I would go about it.

My code is as follows:

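import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scraper class takes a URL string as input and fetches the source code of that website. It also picks out the needed data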
public class Scraper { 

    private static String url; // the input website to be scraped 

    public static String sourcetext; //The source code that has been scraped 


    //constructor which allows for the input of a URL 
    public Scraper(String url) { 
    this.url = url; 
    } 

    //scrapeWebsite runs the method to scrape the input URL and returns a string to be parsed. 
    public static void scrapeWebsite() throws IOException { 

    URL urlconnect = new URL(url); //creates the url from the variable 
    URLConnection connection = urlconnect.openConnection(); // connects to the created URL 
    BufferedReader in = new BufferedReader(new InputStreamReader( 
                   connection.getInputStream(), "UTF-8")); // wraps the connection's input stream in a buffered reader
    String inputLine; //creates a new variable of string 
    StringBuilder sourcecode = new StringBuilder(); // creates a stringbuilder which contains the sourcecode 

    //loop appends to the string builder as long as there is information 
    while ((inputLine = in.readLine()) != null) 
     sourcecode.append(inputLine);// appends the source code to the string builder
    in.close(); 
    sourcetext = sourcecode.toString(); // Takes the text in stringbuilder and converts it to a string 
    sourcetext = sourcetext.replace('"','*'); //replaces the quotes (") with asterisks so the text can be parsed
    } 

    //This method parses through the data and adds the necessary information to a specified CSV file
    public static void getPlaintiff() throws IOException { 

    PrintWriter docketFile = new PrintWriter("tester.csv", "UTF-8"); // creates the csv file. (name must be changed, override deletes file) 

    int i = 0; 

    //While loop runs through all the data in the source code. There is (14) entries per page. 
    while(i<14) { 
     String plaintiffAtty = "PlaintiffAtty_"+i+"*>"; //creates the search string for the plaintiffatty 
     Pattern plaintiffPattern = Pattern.compile("(?<="+Pattern.quote(plaintiffAtty)+").*?(?=</span>)");//creates the pattern for the atty 
     Matcher plaintiffMatcher = plaintiffPattern.matcher(sourcetext); // looks for a match for the atty 

     while (plaintiffMatcher.find()) { 
     docketFile.write(plaintiffMatcher.group().toString()+", "); //writes the found atty to the file 
     } 

     String appraisedValue = "Appraised_"+i+"*>"; //creates the search string for the appraised value
     Pattern appraisedPattern = Pattern.compile("(?<="+Pattern.quote(appraisedValue)+").*?(?=</span>)");//creates the pattern for the value
     Matcher appraisedMatcher = appraisedPattern.matcher(sourcetext); //looks for a match to the appraised value

     while (appraisedMatcher.find()) { 
     docketFile.write(appraisedMatcher.group().toString()+"\n"); //writes the found value to the file 

     } 
     i++; 
    } 
    docketFile.close(); //closes the file 
    } 
} 
+0

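Have you considered turning the class and method comments into Javadoc comments? It would make your code nicer: just put them in /** ... */ blocks directly before the class or method instead of writing them as line comments. – AJMansfield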

+2

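You need to figure out what the Next button does: the URL it calls and the parameters it passes to retrieve the next page. If you can get that information, you are set. –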

+0

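Adding to ns47731's comment: I would look into a Firefox add-on called [Live HTTP Headers](https://addons.mozilla.org/En-us/firefox/addon/live-http-headers/). It will show you the URL that needs to be called next. If you have to do everything (even calling the right JavaScript functions), consider looking at [Selenium](http://docs.seleniumhq.org/) to see how it invokes JavaScript, etc. Good luck. Finally, look at HttpClient for making HTTP requests; using URLConnection directly is tricky. – hooknc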
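
As a rough illustration of the comments above: because the results page is an `.aspx` page, the Next button may trigger an ASP.NET postback, in which case the "parameters it passes" would be the page's hidden form fields plus a field naming the control that was clicked. This is not confirmed for this site, so check the actual request with a header-inspection tool first. The sketch below only shows how hidden fields could be collected from the page source and turned into a form body. `PostbackForm`, the sample HTML, and the `ctl00$NextButton` target are hypothetical, and the regex assumes the attributes appear in the order `type`, `name`, `value`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original code.
class PostbackForm {

    // Assumes each hidden input lists its attributes as type, name, value.
    private static final Pattern HIDDEN_INPUT = Pattern.compile(
            "<input[^>]*type=\"hidden\"[^>]*name=\"([^\"]*)\"[^>]*value=\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Collects the name/value pairs of all hidden inputs, in document order.
    static Map<String, String> hiddenFields(String html) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        Matcher m = HIDDEN_INPUT.matcher(html);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    // Encodes the fields as an application/x-www-form-urlencoded body.
    static String formBody(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(ex);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Hand-written sample; the real page's fields may differ.
        String html = "<form>"
                + "<input type=\"hidden\" name=\"__VIEWSTATE\" id=\"__VIEWSTATE\" value=\"abc/123=\" />"
                + "<input type=\"hidden\" name=\"__EVENTVALIDATION\" id=\"__EVENTVALIDATION\" value=\"xyz+9\" />"
                + "</form>";
        Map<String, String> fields = hiddenFields(html);
        fields.put("__EVENTTARGET", "ctl00$NextButton"); // hypothetical control name
        System.out.println(formBody(fields));
    }
}
```

If the site does work this way, the resulting string would be sent as the body of a POST request to the same URL with the `Content-Type: application/x-www-form-urlencoded` header, and the response would contain the next page together with fresh hidden fields for the following request.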

Answer

0

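Here is your code, majorly reformatted, redesigned, and refurbished. Now that it is actually understandable, you may be able to solve your own problem. (You may want to revert the try-with-resources parts if you are using Java 1.6 or earlier, though, since they were only added in 1.7.)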

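import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** 
* This class contains methods for picking 
* out needed data from the source of a website. 
*/ 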
public class Scraper { 

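    /** 
    * This method scrapes the input URL. 
    * @param url The address of the webpage to scrape. 
    * @return A string containing the data from the webpage, with every 
    *    double quote replaced by an asterisk. 
    * @throws IOException if there was a problem with accessing the website. 
    */ 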
    public static String scrapeWebsite(String url) throws IOException { 

     String inputLine; 
     StringBuilder sourcetext = new StringBuilder(); 

     URL urlconnect = new URL(url); 
     URLConnection connection = urlconnect.openConnection(); 

     try(BufferedReader in = new BufferedReader(
       new InputStreamReader(connection.getInputStream(), "UTF-8"))){ 

      while ((inputLine = in.readLine()) != null) 
       sourcetext.append(inputLine); 
     } 
     return sourcetext.toString().replace('"','*'); 
    } 

    /** 
    * This method parses through the data and adds the necessary information to 
    * a specified .csv file. 
    * @param source The data source, for example that returned by 
    *    {@link #scrapeWebsite(String)}. 
    * @param targetFile The file path for the destination .csv file. 
    * @throws IOException if there was a problem with accessing the file. 
    */ 
    public static void getPlaintiff(CharSequence source, String targetFile) 
      throws IOException{ 

     try(PrintWriter docketFile = new PrintWriter(targetFile, "UTF-8")){ 

      for(int i = 0; i < 14; i++) { 
       Matcher plaintiffMatcher = Pattern.compile(
         "(?<=PlaintiffAtty_" + i + "\\*>).*?(?=</span>)") 
         .matcher(source); 

       while (plaintiffMatcher.find()) 
        docketFile.println(plaintiffMatcher.group()); 

       Matcher appraisedMatcher = Pattern.compile(
         "(?<=Appraised_" + i + "\\*>).*?(?=</span>)") 
         .matcher(source); 

       while (appraisedMatcher.find()) 
        docketFile.println(appraisedMatcher.group()); 
      } 
     } 
    } 
} 

(Note that new bugs may have been introduced; just fix them, it's no big deal.)

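Edit: I realized that the matchers do have to be created inside the loop, because the index is needed to generate the regex. I also replaced the docketFile.write calls with much simpler docketFile.println statements.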
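
To see why the pattern has to be rebuilt for each index, the lookbehind/lookahead regex can be tried on a small hand-written string. This is a minimal sketch and not the page's real markup: `FieldExtractor` and the sample `span` ids are hypothetical, and the sample already has its double quotes replaced by asterisks, as the scraping step does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for trying out the per-index pattern.
class FieldExtractor {

    // Returns every value that follows "<fieldPrefix><index>*>" up to the next </span>.
    static List<String> extract(CharSequence source, String fieldPrefix, int index) {
        // The index is part of the literal marker, so each index needs its own pattern.
        String marker = fieldPrefix + index + "*>";
        Pattern pattern = Pattern.compile(
                "(?<=" + Pattern.quote(marker) + ").*?(?=</span>)");
        Matcher matcher = pattern.matcher(source);
        List<String> values = new ArrayList<String>();
        while (matcher.find()) {
            values.add(matcher.group());
        }
        return values;
    }

    public static void main(String[] args) {
        // Hand-written sample with quotes already replaced by asterisks.
        String sample = "<span id=*PlaintiffAtty_0*>SMITH</span>"
                + "<span id=*Appraised_0*>$90,000.00</span>"
                + "<span id=*PlaintiffAtty_1*>JONES</span>";
        System.out.println(extract(sample, "PlaintiffAtty_", 0));
        System.out.println(extract(sample, "Appraised_", 0));
        System.out.println(extract(sample, "PlaintiffAtty_", 1));
    }
}
```

Because the marker ends in `*>`, index 1 does not accidentally match index 10 or 11. For anything beyond a quick extraction like this, an HTML parser is usually more robust than regular expressions.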
