2014-10-29 90 views

Java website parser

I am trying to parse the following line from a website:

<div class="search-result__price">£2,995</div>

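I only want the 2995 part of it, but I am having trouble doing so. Here is my code; it can currently parse all the lines that contain a pound sign and display all the prices on the website. Please help!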

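import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Scanner;

public class parser { 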

    private static String string1 = "&pound"; 
    private String testURL = "http://www.autotrader.co.uk/search/used/cars/bmw/1_series/postcode/tn126bg/radius/1500/onesearchad/used%2Cnearlynew%2Cnew/quicksearch/true/page/2"; 
    private ArrayList<String> list = new ArrayList<String>(); 
    private ArrayList<Integer> prices = new ArrayList<Integer>(); 
    private int averagePrice; 
    private int start; 
    private int finish; 

    public parser() throws IOException { 

     URL url = new URL(testURL); 
     Scanner scan = new Scanner(url.openStream()); 
     boolean alreadyHit = false; 

     while (scan.hasNext()) { 

      String line = scan.nextLine(); 

      if (line.contains(string1)) { 

       list.add(line); 

       start = line.indexOf("&pound;"); 
       line = line.substring(start); 
       for (int i = 0; i < line.length(); i++) { 

        if (((line.charAt((i)) == ' ') || ((line.charAt((i)) == '<'))) && (alreadyHit == false)) { 
         finish = i; 
         alreadyHit = true; 
        } 
       } 
       alreadyHit = false; 

       line = line.substring(0, finish); 
       line = line.trim(); 
       line = line.replace("&pound;", ""); 
       line = line.replace(",", ""); 

       try { 

        int price = Integer.parseInt(line); 
        prices.add(price); 
       } catch (Exception e) { 

       } 
      } 
     } 
    } 

    public static void main(String args[]) throws IOException { 

     parser p = new parser(); 

     for (Integer x : p.prices) { 

      System.out.println(x); 
     } 
    } 
} 

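If it is currently able to parse all the lines and display the prices on the website, what is the problem? Or did you mean "unable"? If so, what is it doing instead? – RealSkeptic 2014-10-29 21:17:20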


***[Don't use regex to parse XML/HTML.](http://stackoverflow.com/a/1732454/510036)*** – Qix 2014-10-29 21:22:35


+1 to what @Qix just said. Using regex to parse a non-regular language leads to madness. – 2014-10-29 21:23:45

Answers


Rather than using a Scanner to go line by line, or a regular expression to pick apart the HTML, you should use something like jsoup:

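import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Fetch and parse the page, then select the price elements by CSS class.
Document doc = Jsoup
    .connect(testURL)
    .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
    .timeout(60000).get();
Elements elems = doc.select("div.search-result__price");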
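
Once the price elements are selected, the text of each one (for example via jsoup's `Element.text()`) still looks like `£2,995`. One way to get the bare number is to drop every non-digit character before parsing. The following is only a minimal sketch using the standard library: the class `PriceText` and the method `toWholePounds` are hypothetical names, not part of the original answer, and it assumes prices are whole pounds with no decimal part.

```java
import java.util.OptionalInt;

// Hypothetical helper for illustration; not part of jsoup or the original answer.
class PriceText {

    // Converts text such as "£2,995" or "&pound;2,995" into 2995.
    // Assumes whole-pound prices (no pence). Returns empty when the
    // text contains no digits or the number does not fit in an int.
    static OptionalInt toWholePounds(String text) {
        String digits = text.replaceAll("[^0-9]", "");
        if (digits.isEmpty()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(digits));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        // \u00A3 is the pound sign, written as an escape to avoid encoding issues.
        System.out.println(toWholePounds("\u00A32,995").getAsInt()); // prints 2995
    }
}
```

Returning an `OptionalInt` rather than swallowing the exception makes it explicit to the caller when a result has no usable price.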