2013-11-05 101 views

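Extracting unstructured data from an HTML table using jsoup

I have been experimenting with jsoup using the code below. The idea is to extract the movie schedule from this page: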

http://www.blitzmegaplex.com/en/schedule_movie.php?id=MOV1970

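So far I can only extract the cinema name on its own, because its cell is tagged with a specific class name ("separator2"). All the remaining cells use the class "separator".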

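I am trying to implement the following steps with a loop. For each table row:

  1. Get the cinema title.
  2. Skip the row directly below it (below the row from step 1).
  3. Get the second cell with the class "separator" (the date).
  4. From every row below that (below the row from step 3), get the second cell (the showtime), until reaching the next row that contains a cell with the class "separator2".
  5. Repeat this process until all rows have been processed.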

Can anyone suggest how to do this, or propose a better approach?

Thanks.

My code so far:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public void getMovieSchedule(String movieUrl) throws IOException
{
    // First attempt, which does NOT work: "table[div=scheduletbl]" matches an
    // attribute named "div"; selecting by id would be "table#scheduletbl".
    //URL url = new URL(movieUrl);
    //Document doc = Jsoup.parse(url, 3000);

    //Element table = doc.select("table[div=scheduletbl]").first();
    //Iterator<Element> ite = table.select("tr").iterator();
    //ite.next(); // Skip the first row.

    // Actual content
    //print(ite.next().text());

    //final String urlSchedule = "http://www.blitzmegaplex.com/en/schedule_movie.php?id=MOV1970";

    Document doc = Jsoup.connect(movieUrl).get();
    Elements panels = doc.select("div.panelbox");

    for (Element panel : panels)
    {
        Elements rows = panel.select("table").select("tr"); // The actual content.

        for (Element row : rows)
        {
            // Only the cinema-name cell carries the class "separator2".
            Elements cinemaName = row.select("td.separator2");
            if (!cinemaName.isEmpty())
                System.out.println(cinemaName.text());
        }
    }
}

The HTML being scraped (some markup omitted):

<table width="95%" border="0" cellpadding="2" cellspacing="0" id="scheduletbl"> 
    <tbody> 

    <tr> 
    <td colspan="3" class="separator2"><strong>BLITZMEGAPLEX - PARIS VAN JAVA, BANDUNG</strong></td> 
    </tr> 

    <tr> 
    <td colspan="3"><img src="../img/ico_rss_schedule_white.gif" width="16" height="16" hspace="5" align="left"><strong><a href="../rss/schedule.php" class="navlink">RSS- Paris van Java</a></strong></td> 
    </tr> 
    <tr> 
    <td class="separator">&nbsp;</td> 
    <td colspan="2" class="separator">TUESDAY, 05 NOVEMBER 2013</td> 
    </tr> 
    <tr> 
    <td class="separator">&nbsp;</td> 
    <td width="20%" class="separator" rel="2D"> 
    10:30&nbsp;&nbsp;&nbsp; 
    </td> 
    <td width="30%" class="separator"> 
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=10:30&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td> 

    </tr> 
    <tr> 
    <td class="separator">&nbsp;</td> 
    <td width="20%" class="separator" rel="2D"> 
    13:15&nbsp;&nbsp;&nbsp; 
    </td> 
    <td width="30%" class="separator"> 
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=13:15&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td> 

    </tr> 
    <tr> 
    <td class="separator">&nbsp;</td> 
    <td width="20%" class="separator" rel="2D"> 
    16:00&nbsp;&nbsp;&nbsp; 
    </td> 
    <td width="30%" class="separator"> 
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=16:00&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td> 

    </tr> 
    <tr> 
    <td class="separator">&nbsp;</td> 
    <td width="20%" class="separator" rel="2D"> 
    18:45&nbsp;&nbsp;&nbsp; 
    </td> 
    <td width="30%" class="separator"> 
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=18:45&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td> 

    </tr> 
    <tr> 
    <td class="separator">&nbsp;</td> 
    <td width="20%" class="separator" rel="2D"> 
    21:30&nbsp;&nbsp;&nbsp; 
    </td> 
    <td width="30%" class="separator"> 
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=21:30&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td> 

    </tr> 
    <tr> 
    <td colspan="3" class="separator2"><strong>BLITZMEGAPLEX - GRAND INDONESIA, JAKARTA</strong></td> 
    </tr> 

    <tr> 
    <td colspan="3"><img src="../img/ico_rss_schedule_white.gif" width="16" height="16" hspace="5" align="left"><strong><a href="../rss/schedule.php" class="navlink">RSS- Grand Indonesia</a></strong></td> 
    </tr> 
    <tr> 
    <td class="separator">&nbsp;</td> 
    <td colspan="2" class="separator">TUESDAY, 05 NOVEMBER 2013</td> 
    </tr> 
    <tr> 
    <td class="separator">&nbsp;</td> 
    <td width="20%" class="separator" rel="2D"> 
    10:45&nbsp;&nbsp;&nbsp; 
    </td> 
    <td width="30%" class="separator"> 
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=10:45&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td> 

    </tr> 
    <tr> 
    <td class="separator">&nbsp;</td> 
    <td width="20%" class="separator" rel="2D"> 
    13:30&nbsp;&nbsp;&nbsp; 
    </td> 
    <td width="30%" class="separator"> 
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=13:30&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td> 

    </tr> 
    <tr> 
    <td class="separator">&nbsp;</td> 
    <td width="20%" class="separator" rel="2D"> 
    16:15&nbsp;&nbsp;&nbsp; 
    </td> 
    <td width="30%" class="separator"> 
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=16:15&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td> 

    </tr> 
    <tr> 
    <td class="separator">&nbsp;</td> 
    <td width="20%" class="separator" rel="2D"> 
    19:00&nbsp;&nbsp;&nbsp; 
    </td> 
    <td width="30%" class="separator"> 
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=19:00&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td> 

    </tr> 
    <tr> 
    <td class="separator">&nbsp;</td> 
    <td width="20%" class="separator" rel="2D"> 
    21:45&nbsp;&nbsp;&nbsp; 
    </td> 
    <td width="30%" class="separator"> 
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=21:45&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td> 
    </tr> 
    ... MORE <tr> here ... 
    </tbody></table> 

Answer

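If I understand your question correctly, you only need to extract a few details from the table (the cinema name, the date, and the showtimes), but you are having trouble because most cells share the same class name.

Based on that, here is my solution: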

// "doc" is the Document fetched with Jsoup.connect(movieUrl).get()
Elements cells = doc.select("table#scheduletbl > tbody > tr > td");
for (Element cell : cells) {
    if (cell.hasClass("separator2")) System.out.println(cell.text()); // cinema name
    else if ("2".equals(cell.attr("colspan"))) System.out.println(cell.text()); // date
    else if (cell.hasAttr("rel")) System.out.println(cell.text()); // times
}

This prints:

BLITZMEGAPLEX - PARIS VAN JAVA, BANDUNG 
TUESDAY, 05 NOVEMBER 2013 
10:30    
13:15    
16:00    
18:45    
21:30    
BLITZMEGAPLEX - GRAND INDONESIA, JAKARTA 
TUESDAY, 05 NOVEMBER 2013 
10:45    
13:30    
16:15    
19:00    
21:45  
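
The times above carry trailing padding that comes from the `&nbsp;` entities in the time cells. Assuming that padding reaches your code as non-breaking space characters (U+00A0), which the output suggests but which may vary between jsoup versions, `String.trim()` will not remove it, because `trim()` only strips characters up to U+0020. A minimal sketch of one way to clean such text follows; the class name `NbspTrim` and the helper `clean` are hypothetical names chosen for this illustration.

```java
class NbspTrim {

    /** Replaces non-breaking spaces (U+00A0) with plain spaces, then trims. */
    static String clean(String s) {
        return s.replace('\u00A0', ' ').trim();
    }

    public static void main(String[] args) {
        // Hypothetical cell text: a time followed by three non-breaking spaces.
        String raw = "10:30\u00A0\u00A0\u00A0";

        System.out.println("[" + raw.trim() + "]");  // padding survives trim()
        System.out.println("[" + clean(raw) + "]");  // padding is removed
    }
}
```

Calling `clean(cell.text())` instead of `cell.text()` in the loop would then give values without the padding.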

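Of course, this solution is tied to that specific table on the site, but it will work as long as the format does not change often and stays consistent across the site. You might consider creating a class to store all of this information.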
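
As a rough illustration of the "class to store this information" idea, the sketch below groups the cells into one object per cinema, with showtimes keyed by date. So that it runs without the jsoup library, it works on a hypothetical `Cell` stand-in that holds only the properties the loop above inspects; `ScheduleGrouper`, `Cell`, `CinemaSchedule`, `group`, and `sampleCells` are all names invented for this example, and the sample data is abridged from the HTML in the question. It also assumes the cell text has already been trimmed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ScheduleGrouper {

    /** Hypothetical stand-in for a <td>: just the properties the loop inspects. */
    static final class Cell {
        final String cssClass; // class attribute, "" if absent
        final String colspan;  // colspan attribute, "" if absent
        final boolean hasRel;  // true if the cell carries a rel attribute
        final String text;     // visible text, assumed already trimmed

        Cell(String cssClass, String colspan, boolean hasRel, String text) {
            this.cssClass = cssClass;
            this.colspan = colspan;
            this.hasRel = hasRel;
            this.text = text;
        }
    }

    /** One cinema with its showtimes keyed by date, in page order. */
    static final class CinemaSchedule {
        final String cinema;
        final Map<String, List<String>> timesByDate = new LinkedHashMap<>();

        CinemaSchedule(String cinema) {
            this.cinema = cinema;
        }
    }

    /** Walks the cells in document order and groups them per cinema. */
    static List<CinemaSchedule> group(List<Cell> cells) {
        List<CinemaSchedule> result = new ArrayList<>();
        CinemaSchedule current = null; // cinema whose block we are inside
        List<String> times = null;     // time list for the current date
        for (Cell cell : cells) {
            if ("separator2".equals(cell.cssClass)) {
                current = new CinemaSchedule(cell.text);
                result.add(current);
                times = null;          // a new cinema has no date yet
            } else if (current != null && "2".equals(cell.colspan)) {
                times = current.timesByDate.computeIfAbsent(cell.text, d -> new ArrayList<>());
            } else if (times != null && cell.hasRel) {
                times.add(cell.text);
            }
        }
        return result;
    }

    /** Sample cells abridged from the HTML shown in the question. */
    static List<Cell> sampleCells() {
        return Arrays.asList(
            new Cell("separator2", "3", false, "BLITZMEGAPLEX - PARIS VAN JAVA, BANDUNG"),
            new Cell("", "3", false, "RSS- Paris van Java"),
            new Cell("separator", "", false, ""),
            new Cell("separator", "2", false, "TUESDAY, 05 NOVEMBER 2013"),
            new Cell("separator", "", false, ""),
            new Cell("separator", "", true, "10:30"),
            new Cell("separator", "", false, "Buy Tickets"),
            new Cell("separator", "", false, ""),
            new Cell("separator", "", true, "13:15"),
            new Cell("separator", "", false, "Buy Tickets"),
            new Cell("separator2", "3", false, "BLITZMEGAPLEX - GRAND INDONESIA, JAKARTA"),
            new Cell("", "3", false, "RSS- Grand Indonesia"),
            new Cell("separator", "", false, ""),
            new Cell("separator", "2", false, "TUESDAY, 05 NOVEMBER 2013"),
            new Cell("separator", "", false, ""),
            new Cell("separator", "", true, "10:45"),
            new Cell("separator", "", false, "Buy Tickets"));
    }

    public static void main(String[] args) {
        for (CinemaSchedule s : group(sampleCells())) {
            System.out.println(s.cinema + " -> " + s.timesByDate);
        }
    }
}
```

With jsoup, each `Cell` could presumably be built from an `Element` via `className()`, `attr("colspan")`, `hasAttr("rel")`, and `text()`. Keying the times by date is a design choice made here so that a cinema listing more than one day would not lose information; the excerpt in the question only shows a single date per cinema.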