2016-11-28 40 views
2

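jsoup: getting specific tags and the values associated with them

I'm new to jsoup and want to get more familiar with extracting information from websites. I'm trying to do something simple: get some values from eBay.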

I want to get the item name, the HTML link, the price and the quantity sold from the "Hot this week" section (as on this page: http://www.ebay.co.uk/sch/Action-Figures/246/bn_1632128/i.html).

But I'm not sure how to proceed.

package application; 

import java.io.BufferedReader; 
import java.io.InputStreamReader; 
import java.net.URL; 

import javax.swing.JOptionPane; 

import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 
import org.jsoup.select.Elements; 

public class GetHotSellers { 

    public static void main(String[] args) { 
     Document doc = Jsoup.parse(readURL("http://www.ebay.co.uk/sch/Action-Figures/246/bn_1632128/i.html")); 

     Elements sold_items = doc.getElementsMatchingText("sold$"); 
     for(Element sold : sold_items) { 
       System.out.println(sold.text()); 
     } 
    } 


    public static String readURL(String url) { 

    String fileContents = ""; 
    String currentLine = ""; 

    try { 
     BufferedReader reader = new BufferedReader(new InputStreamReader(new URL(url).openStream())); 
     // Read until readLine() returns null, so no "null" line is appended at the end 
     while ((currentLine = reader.readLine()) != null) { 
      fileContents += currentLine + "\n"; 
     } 
     reader.close(); 
     reader = null; 
    } catch (Exception e) { 
     JOptionPane.showMessageDialog(null, e.getMessage(), "Error Message", JOptionPane.ERROR_MESSAGE); 
     e.printStackTrace(); 

    } 

    return fileContents; 
    } 

} 

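This is the best I could do. Do I need to improve my regular expression, or should I use a different function that is better suited to what I want?

My current output looks like this: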

2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold 12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold 
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold 12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold 
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold 
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold 
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold 
381 sold 
381 sold 
Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold 
Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold 
Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold 
187 sold 
187 sold 
Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold 
Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold 
Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold 
174 sold 
174 sold 
Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold 
Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold 
Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold 
129 sold 
129 sold 
Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold 
Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold 
Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold 
101 sold 
101 sold 
Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold 
Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold 
Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold 
89 sold 
89 sold 
12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold 
12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold 
12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold 
88 sold 
88 sold 
Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold 
Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold 
Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold 
87 sold 
87 sold 

The output I want is, for example:

Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay || £7.99 || 87 sold || http://link.com 

Edit:

I just tried something like this, but with no luck.

for(String categoryURL : categoryLinksArray) { 
    Document doc = Jsoup.parse(readURL(categoryURL)); 
    Elements sold_items = doc.getElementsByClass("b-block-info-container"); 
    for(Element sold : sold_items) { 
      System.out.println("NAME: " + sold.attr("b-block-info-container__title b-block-info-container__title__ListingSummary") + "\n" + 
           "PRICE: " + sold.attr("b-block-info-container__price") + "\n" + 
           "SOLD/week: " + sold.attr("item_quantity__hotness") + "\n" + 
           "URL: " + sold.attr("abs:href")); 
      System.out.println("--------------------------------------"); 
    } 
} 

Answers

1

I got it working, but it isn't efficient: it runs very slowly.

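// Needs these imports in addition to the jsoup ones from the question: 
// import java.util.ArrayList; 
// import java.util.HashMap; 
public static void main(String[] args) { 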

    ArrayList<String> categoryLinksArray = new ArrayList<>(); 

    Document links = Jsoup.parse(readURL("http://www.ebay.co.uk/sch/allcategories/all-categories")); 
    Elements item_categories = links.getElementsByClass("ch"); 
    for (Element category : item_categories) { 
     categoryLinksArray.add(category.attr("abs:href")); 
    } 

    for (String categoryURL : categoryLinksArray) { 
     Document doc = Jsoup.parse(readURL(categoryURL)); 
     Elements hot_items = doc 
       .getElementsByClass("b-module b-module-carousel b-module-deals topSold b-display--portrait"); 
     for (Element item : hot_items) { 

      Elements hot_items_names = item.getElementsByClass(
        "b-block-info-container__title b-block-info-container__title__ListingSummary"); 
      Elements hot_items_price = item.getElementsByClass("b-block-info-container__price"); 
      Elements hot_items_sold = item.getElementsByClass("item_quantity__hotness"); 
      Elements hot_items_url = item.getElementsByClass("b-block-tile"); 

      HashMap<String, String> hs_items = new HashMap<>(); 

      for (Element item_name : hot_items_names) { 
       hs_items.put("Name", item_name.text()); 
      } 
      for (Element item_price : hot_items_price) { 
       hs_items.put("Price", item_price.text()); 
      } 
      for (Element item_sold : hot_items_sold) { 
       hs_items.put("Sold", item_sold.text()); 
      } 
      for (Element item_url : hot_items_url) { 
       hs_items.put("URL", item_url.attr("abs:href")); 
      } 

      System.out.println("Name: " + hs_items.get("Name") + "\n" + 
           "Price: " + hs_items.get("Price") + "\n" + 
           "Sold: " + hs_items.get("Sold") + "\n" + 
           "URL: " + hs_items.get("URL") + "\n" + 
           "----------------------------------"); 
     } 
    } 
} 
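
Part of the slowness may come from readURL in the question, which builds each page by String concatenation in a loop, so the text collected so far is copied again for every line. The rest is probably network time, since every category page is fetched one after another. Below is a minimal sketch of the reading step with a StringBuilder. The class name ReadAllSketch and the method readAll are made up for illustration, and a StringReader stands in for the network stream so that it runs offline.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReadAllSketch {

    // Reads everything from the given reader into one String.
    // StringBuilder appends in place instead of re-copying the accumulated text,
    // and the loop ends as soon as readLine() returns null.
    static String readAll(Reader source) {
        StringBuilder contents = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                contents.append(line).append('\n');
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch short
            throw new UncheckedIOException(e);
        }
        return contents.toString();
    }

    public static void main(String[] args) {
        // Stand-in input; a real call would wrap the URL stream instead
        System.out.print(readAll(new StringReader("<html>\n<body>87 sold</body>\n</html>")));
    }
}
```

To use this inside readURL, the reader would be new InputStreamReader(new URL(url).openStream()). Whether this makes a noticeable difference depends on the page size, so it is worth timing before and after.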
0
import java.io.IOException; 
import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 
import org.jsoup.select.Elements; 

public class JsoupTest { 
    public static void main(String argv[]) throws IOException {    
     Document doc = Jsoup.connect("http://www.ebay.co.uk/sch/Action-Figures/246/bn_1632128/i.html").get(); //connect to url and get the document 
     Element hotThisWeek = doc.getElementById("w6-2-x-carousel-items"); // select the div by its ID // better than matching text because id is unique 
     Elements items = hotThisWeek.select("li"); // select all li tags   
     for(Element e : items){ 
      System.out.println( e.select("div.b-block-info-container__title").text() // select the div with title text by class name 
        + " || " + e.select("div.b-block-info-container__price").text() // select the price-div by its class name 
        + " || " + e.select("div.item_quantity__hotness").text() // select hotness-div by class name 
        + " || " + e.select("a").attr("href")); //select a tag and get value of attribute href 
     } 
    } 
} 
+0

I tried doing this for all the categories, but I get a NullPointer at the Jsoup.connect line. Do you think that's because "w6-2-x-carousel-items" is specific to the toys category? – lucianozo

+0

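Yes, the id is unique, so this won't work for the other pages. But if you inspect the page's HTML, you'll see a certain structure. See my second answer and modify it as needed. – Eritrean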
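
The point about ids can be tried offline. The sketch below parses a small hand-written HTML string rather than the live page. The markup is hypothetical: it only reuses the class names mentioned in this thread, and the real eBay pages may be structured differently. It selects items by class instead of by id and checks for null before using the result. The class name ClassSelectorSketch and the method firstItem are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ClassSelectorSketch {

    // Hypothetical markup reusing the class names from this thread
    static final String SAMPLE_HTML =
        "<ul id='w6-2-x-carousel-items'>"
      + "<li><a href='/itm/1'>"
      + "<div class='b-block-info-container__title'>Item one</div>"
      + "<div class='b-block-info-container__price'>7.99</div>"
      + "<div class='item_quantity__hotness'>87 sold</div>"
      + "</a></li></ul>";

    // Returns "title || price || sold || url" for the first matching item,
    // or an empty string when the page contains no such item.
    static String firstItem(String html, String baseUri) {
        // The base URI lets "abs:href" resolve relative links
        Document doc = Jsoup.parse(html, baseUri);
        // A class-based selector does not depend on a page-specific id
        Element item = doc.select("li:has(div.b-block-info-container__title)").first();
        if (item == null) {
            return ""; // nothing matched; first() returns null in that case
        }
        return item.select("div.b-block-info-container__title").text()
            + " || " + item.select("div.b-block-info-container__price").text()
            + " || " + item.select("div.item_quantity__hotness").text()
            + " || " + item.select("a").attr("abs:href");
    }

    public static void main(String[] args) {
        System.out.println(firstItem(SAMPLE_HTML, "http://www.ebay.co.uk/"));
        // getElementById returns null when the id is absent, which is a likely
        // source of the NullPointerException on other category pages
        System.out.println(Jsoup.parse("<p>no carousel</p>").getElementById("w6-2-x-carousel-items"));
    }
}
```

The same null check applies to getElementById: its result should be tested before select is called on it.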

0

The page is organized into sections. Each of these section tags has an id, running from id="w2", id="w3" ... up to id="w10". For example:

import java.io.IOException; 
import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 
import org.jsoup.select.Elements; 

public class JsoupTest { 
    public static void main(String argv[]) throws IOException { 
     Document doc = Jsoup.connect("http://www.ebay.co.uk/sch/Action-Figures/246/bn_1632128/i.html").get(); 
     for(int i = 2; i<11;i++){ 
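      Element category = doc.getElementById("w"+i); // select section with id = w2 , w3, w4 ... 
      if(category == null) continue; // skip ids that are missing on this page; avoids a NullPointerException 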
      if(!category.select("div.b-module-carousel__title").isEmpty()){ 
       System.out.println(category.select("div.b-module-carousel__title").text()); // the title of the section is either here 
      } 
      else{ 
       System.out.println(category.select("div.b-block-list__header").text()); // or here 
      } 
      Elements items = category.select("li");    
      for(Element e : items){ 
       System.out.println( e.select("div.b-block-info-container__title").text() 
         // price or trending price, chosen with the conditional operator: 
         // condition ? valueIfTrue : valueIfFalse 
         + " || " + ((!e.select("div.b-block-info-container__price").isEmpty())?e.select("div.b-block-info-container__price").text():(e.select("div.b-block-info-container__trending-prices-group").text())) 
         + " || " + e.select("div.item_quantity__hotness").text() 
         + " || " + e.select("a").attr("href")); 
      } 
      System.out.println("************************************************************************************"); // just added to separate the categories 
     }    
    } 
}
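
The conditional expression in the loop above is hard to read inline. One option is to move the fallback into a small helper, as in the sketch below. The names FallbackSketch, firstNonEmpty and formatItem are made up for illustration, and the item values in main are placeholders, not data from the page.

```java
public class FallbackSketch {

    // Returns the primary text unless it is null or empty, otherwise the fallback.
    static String firstNonEmpty(String primary, String fallback) {
        return (primary != null && !primary.isEmpty()) ? primary : fallback;
    }

    // Joins the fields of one item in the "a || b || c || d" layout used above,
    // taking the trending price only when no regular price is present.
    static String formatItem(String name, String price, String trendingPrice,
                             String sold, String url) {
        return String.join(" || ", name, firstNonEmpty(price, trendingPrice), sold, url);
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only
        System.out.println(formatItem("Item with a price", "7.99", "", "87 sold", "http://link.com"));
        System.out.println(formatItem("Item with a trending price", "", "5.00 - 9.00", "", "http://link.com"));
    }
}
```

In the jsoup loop, the two price arguments would be the text() results of the price selector and the trending-prices selector, since text() on an empty selection returns an empty string.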