Asked 2016-11-22 · 70 views

JSoup HttpStatusException

I am trying to parse the HTML of the following URL using JSoup:

http://brickseek.com/walmart-inventory-checker/ 

When I run the program below, I get the following exception. I am using jsoup-1.10.1.jar.

Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://brickseek.com/walmart-inventory-checker/ 
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:598) 
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:548) 
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:235) 
    at Third.main(Third.java:22) 

Here is the program:

import java.io.IOException; 

import org.jsoup.Connection.Method; 
import org.jsoup.Connection.Response; 
import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 
import org.jsoup.select.Elements; 

public class Third { 

    public static void main(String[] args) throws IOException { 

     String uniqueSku ="44656182"; 
     String zipCode ="75160"; 

     Response response = Jsoup.connect("http://brickseek.com/walmart-inventory-checker/") 
       .data("store_type","3", "sku", uniqueSku , "zip" , String.valueOf(zipCode) , "sort" , "distance") 
       .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2") 
       .method(Method.POST) 
       .timeout(0) 
       .execute(); 

       String rawHTML = response.body(); 
       Document parsedDocument = Jsoup.parse(rawHTML); 
       Element bodyElement = parsedDocument.body(); 
       Elements inStockTableElement = bodyElement.getElementsByTag("table"); 



    } 
} 

Any help would be appreciated.


This works for me. :O –


It works? But I am still running into the same problem. Could you tell me which editor you are using (e.g. Eclipse)? And which Java and Jsoup versions? I do not know how to fix this :( Were you able to sysout the inStockTableElement object? – Radi


I am using Eclipse Luna, JDK 1.7.0_67, and Jsoup 1.10.1. Yes, I can print 'inStockTableElement' with 'System.out.println()', and it prints out a '…'. –

Answers


The server probably has some way of detecting that a bot is scraping the page. Try changing your HTTP headers like this:

import org.jsoup.Connection;

public class Util { 
    // Send the same headers a real browser would, so the request is not flagged as a bot. 
    public static Connection mask(Connection c) { 
     return c.header("Host", "brickseek.com") 
       .header("Connection", "keep-alive") 
//    .header("Content-Length", ""+c.request().requestBody().length()) 
       .header("Cache-Control", "max-age=0") 
       .header("Origin", "https://brickseek.com/") 
       .header("Upgrade-Insecure-Requests", "1") 
       .header("User-Agent", "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.48 Safari/537.36") 
       .header("Content-Type", "application/x-www-form-urlencoded") 
       .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8") 
       .referrer("http://brickseek.com/walmart-inventory-checker/") 
       .header("Accept-Encoding", "gzip, deflate, br") 
       .header("Accept-Language", "en-US,en;q=0.8"); 
    } 
} 

These headers are copied verbatim from Google Chrome. Bots are often detected by sending their headers in a different order, or with different capitalization; by mimicking Google Chrome exactly, you should be able to slip past unnoticed.

Some bot-detection algorithms count the number of requests per IP and start blocking above a certain threshold, which is why the original code still works for some people.
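For completeness, a sketch of how the masking helper might be wired into the question's code. `MaskDemo` is an illustrative class name, and `mask` here is an abbreviated stand-in for the `Util.mask` above so the sketch compiles on its own:

```java
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;

public class MaskDemo {

    // Abbreviated stand-in for Util.mask above: set browser-like headers.
    static Connection mask(Connection c) {
        return c.header("Host", "brickseek.com")
                .header("User-Agent", "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.48 Safari/537.36")
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
                .referrer("http://brickseek.com/walmart-inventory-checker/");
    }

    public static void main(String[] args) throws Exception {
        // Build the same POST request as in the question, then layer the
        // browser-like headers on top before executing.
        Connection connection = Jsoup.connect("http://brickseek.com/walmart-inventory-checker/")
                .data("store_type", "3", "sku", "44656182", "zip", "75160", "sort", "distance")
                .method(Method.POST)
                .timeout(0);
        connection = mask(connection);

        // The headers can be inspected before the request is sent:
        System.out.println(connection.request().header("User-Agent"));

        // Connection.Response response = connection.execute(); // performs the actual request
    }
}
```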


Just add ignoreHttpErrors(true) to your code:

Response response = Jsoup.connect("http://brickseek.com/walmart-inventory-checker/") 
       .data("store_type","3", "sku", uniqueSku , "zip" , String.valueOf(zipCode) , "sort" , "distance") 
       .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2") 
       .method(Method.POST) 
       .timeout(0).ignoreHttpErrors(true) 
       .execute(); 
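Worth noting: ignoreHttpErrors(true) only stops execute() from throwing; on a 403 the response body is still the server's error page rather than the inventory table. A minimal sketch (StatusCheck is an illustrative class name) of guarding the parse on the status code:

```java
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;

public class StatusCheck {
    public static void main(String[] args) throws Exception {
        Connection.Response response = Jsoup.connect("http://brickseek.com/walmart-inventory-checker/")
                .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2")
                .method(Method.POST)
                .ignoreHttpErrors(true) // execute() now returns instead of throwing on 4xx/5xx
                .execute();

        // The 403 no longer raises HttpStatusException, but the body is the
        // error page, so check the status before parsing it.
        if (response.statusCode() == 200) {
            System.out.println(Jsoup.parse(response.body()).title());
        } else {
            System.out.println("Server answered " + response.statusCode());
        }
    }
}
```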

Thanks.