2011-01-23 36 views
3

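HttpClient POST with a problematic website

I'm trying to retrieve some data from a website.

I wrote a Java class that seems to work fine for many websites, but it doesn't work for this particular site, which makes extensive use of JavaScript in its input form.

As you can see from the code, I specify the input fields using the names taken from the HTML source, but maybe the site doesn't accept this kind of POST request?

How can I simulate the user interaction in order to retrieve the generated HTML?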

package com.transport.urlRetriver; 

import java.io.BufferedReader; 
import java.io.BufferedWriter; 
import java.io.FileWriter; 
import java.io.IOException; 
import java.io.InputStream; 
import java.io.InputStreamReader; 
import java.util.ArrayList; 

import org.apache.http.HttpEntity; 
import org.apache.http.HttpResponse; 
import org.apache.http.NameValuePair; 
import org.apache.http.client.entity.UrlEncodedFormEntity; 
import org.apache.http.client.methods.HttpPost; 
import org.apache.http.impl.client.DefaultHttpClient; 
import org.apache.http.message.BasicNameValuePair; 

public class UrlRetriver { 


    String stationPoller (String url, ArrayList<NameValuePair> params) { 

     HttpPost postRequest; 
     HttpResponse response; 
     HttpEntity entity; 
     String result = null; 

     DefaultHttpClient httpClient = new DefaultHttpClient(); 


     try { 

      postRequest = new HttpPost(url); 

      // UrlEncodedFormEntity already implements HttpEntity, so no cast is needed
      postRequest.setEntity(new UrlEncodedFormEntity(params)); 
      response = httpClient.execute(postRequest); 

      entity = response.getEntity(); 

      if(entity != null){ 
       InputStream inputStream = entity.getContent(); 
       result = convertStreamToString(inputStream); 
      } 



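     } catch (Exception e) { 

      // Print the cause instead of silently discarding it, so failures can be diagnosed
      e.printStackTrace(); 
      result = "We had a problem"; 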

     } finally { 

      httpClient.getConnectionManager().shutdown(); 

     } 



     return result; 

    } 





    void ATMtravelPoller() { 




     ArrayList<NameValuePair> params = new ArrayList<NameValuePair>(2); 

     String url = "http://www.atm-mi.it/it/Pagine/default.aspx"; 

     params.add(new BasicNameValuePair("ctl00$SPWebPartManager1$g_afa5adbb_5b60_4e50_8da2_212a1d36e49c$txt_address_s", "Viale romagna 1")); 

     params.add(new BasicNameValuePair("ctl00$SPWebPartManager1$g_afa5adbb_5b60_4e50_8da2_212a1d36e49c$txt_address_e", "Viale Toscana 20")); 

     params.add(new BasicNameValuePair("sf_method", "POST")); 

     String result = stationPoller(url, params); 

     saveToFile(result, "/home/rachele/Documents/atm/out4.html"); 

    } 

    static void saveToFile(String toFile, String pos) {
        // try-with-resources closes the writer even if write() throws
        try (BufferedWriter out = new BufferedWriter(new FileWriter(pos))) {
            out.write(toFile);
        } catch (Exception e) { // Report any failure (I/O error, null content)
            System.err.println("Error: " + e.getMessage());
        }
    }

    private static String convertStreamToString(InputStream is) {
        BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        StringBuilder stringBuilder = new StringBuilder();

        String line = null;
        try {
            // Read the response line by line, normalizing line endings to "\n"
            while ((line = reader.readLine()) != null) {
                stringBuilder.append(line).append("\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                is.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return stringBuilder.toString();
    }

} 
+1

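This is not an answer, just a description of what is happening. You need to submit about 30 parameters, and some of the parameter names/values are generated dynamically to prevent the content from being fetched by scripts or programs. You are hard-coding the parameter names, but they will not be the same each time you fetch the content. – gigadot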

+2

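Not an answer to your JavaScript problem (hence a comment), but... note that for a lot of websites you need to fake your "User-Agent" from Java, otherwise you won't get the real site. Been there, done that: you **must** fake the user agent ;) – SyntaxT3rr0r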

+1

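For this site, it makes no difference whether you send a user agent or not. I tested it by filtering the User-Agent header out of my Firefox requests, and the result was no different. – gigadot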
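For reference, the User-Agent override discussed in these comments can be sketched as follows. This is only an illustration: it uses the JDK's `HttpURLConnection` rather than the Apache HttpClient from the question, the class name, method name and the browser string are made up for the example, and as noted above the header may make no difference for this particular site.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

class UserAgentExample {

    // Hypothetical browser-like value; any string a real browser sends would do.
    static final String BROWSER_UA =
            "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0";

    /**
     * Opens (but does not yet connect) an HTTP connection whose
     * User-Agent request header is set to the given value.
     */
    static HttpURLConnection openWithUserAgent(String url, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            // Must be set before the connection is actually established
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the Apache HttpClient used in the question, the equivalent step would be calling `setHeader("User-Agent", ...)` on the `HttpPost` before executing it.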

Answers

1

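It seems to me that there may be JavaScript-generated fields with dynamic values, intended to prevent automated code from scraping the site. Please post the specific site you want to download.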

+0

I have already added it to the original description: http://www.atm-mi.it/en/Pages/default.aspx – Mascarpone

+1

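As gigadot wrote above, you have to make a GET request first to obtain the hidden fields (from what I can see, __REQUESTDIGEST is the one that will cause problems), and then issue the POST request. In general, behave the way a user in a browser would. –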
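The GET-then-POST approach described above could be sketched roughly like this. It is a minimal illustration under several assumptions, not a tested solution for this site: it uses only JDK classes instead of Apache HttpClient, the class and method names are invented for the example, hidden inputs are found with a simple regex that expects double-quoted attributes (a real HTML parser such as jsoup would be more robust), HTML entities in values are not decoded, the page is assumed to be UTF-8, and any field that the page fills in through JavaScript would still be missing.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HiddenFieldPoster {

    private static final Pattern INPUT_TAG =
            Pattern.compile("<input\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the double-quoted value of one attribute of a tag, or null if absent. */
    static String attr(String tag, String name) {
        Matcher m = Pattern.compile("(?<![\\w-])" + name + "\\s*=\\s*\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE).matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    /** Collects name/value pairs of every <input type="hidden">, in document order. */
    static Map<String, String> extractHiddenFields(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = INPUT_TAG.matcher(html);
        while (m.find()) {
            String tag = m.group();
            String name = attr(tag, "name");
            if ("hidden".equalsIgnoreCase(attr(tag, "type")) && name != null) {
                String value = attr(tag, "value");
                fields.put(name, value == null ? "" : value);
            }
        }
        return fields;
    }

    /** Encodes the fields as an application/x-www-form-urlencoded body. */
    static String buildFormBody(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : fields.entrySet()) {
                if (sb.length() > 0) sb.append('&');
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
        return sb.toString();
    }

    /** Reads a stream fully and decodes it as UTF-8 (an assumption about the page). */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) buf.write(chunk, 0, n);
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** GETs the page, copies its hidden fields, adds the user's fields, then POSTs. */
    static String getThenPost(String url, Map<String, String> userFields) throws IOException {
        CookieHandler.setDefault(new CookieManager()); // keep session cookies between requests

        HttpURLConnection get = (HttpURLConnection) new URL(url).openConnection();
        String page;
        try (InputStream in = get.getInputStream()) {
            page = readAll(in);
        }

        Map<String, String> form = extractHiddenFields(page);
        form.putAll(userFields); // user-entered values are added after the hidden ones

        HttpURLConnection post = (HttpURLConnection) new URL(url).openConnection();
        post.setRequestMethod("POST");
        post.setDoOutput(true);
        post.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = post.getOutputStream()) {
            out.write(buildFormBody(form).getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = post.getInputStream()) {
            return readAll(in);
        }
    }
}
```

A call would look like `getThenPost("http://www.atm-mi.it/it/Pagine/default.aspx", fields)`, where `fields` maps the address input names from the question to the values to submit. Since the comments point out that some parameter names are generated dynamically, those names may also have to be read from the fetched page instead of being hard-coded.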