2014-09-23

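Parsing elements with jsoup in Java

I am using jsoup in my Java application to parse HTML, and now I need to parse table data. For each <tr>, I want to read the value of the first <td>. If that value contains the word "Outdated", the row should be skipped. If it is not outdated, I want to go to the third <td> and get the values that contain ".rpm". I can't get this to work. I have tried many approaches without success, so I am trying my luck here in case someone has experience with this.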

public class rpms { 

    public static void getTdSibling(String sourceTd) throws FileNotFoundException, UnsupportedEncodingException { 
     String fragment = sourceTd; 
     Document doc = Jsoup.parseBodyFragment(fragment); 
     Elements myElements = doc.getElementsByClass("confluenceTable tablesorter").first().getElementsByTag("tr"); 
     for (Element element : myElements) { 
      if (element.select("td").contains("Outdated")) { 
       String rpms = element.ownText(); 
       System.out.println(rpms); 
      } 
     } 
    } 

    public static void main(String[] args) { 
     URLget rpms = new URLget(); 
     try { 
      getTdSibling(sendGetRequest(URL).toString()); 

     } catch (MalformedURLException e) { 
      e.printStackTrace(); 
     } catch (IOException e) { 
      e.printStackTrace(); 
     } 
    } 
} 

The HTML table that the elements should be parsed from looks like this:

<table class="confluenceTable tablesorter"> 
    <tbody class=""> 
     <tr> 
      <td colspan="1" class="confluenceTd">RHSA-2014:1172</td> 
      <td colspan="1" class="confluenceTd"> 
       <p>The procmail program is used for local mail delivery. In addition to just 
        <br>delivering mail, procmail can be used for automatic filtering, presorting, 
        <br>and other mail handling jobs.</p> 
       <p>A heap-based buffer overflow flaw was found in procmail's formail utility. 
        <br>A remote attacker could send an email with specially crafted headers that, 
        <br>when processed by formail, could cause procmail to crash or, possibly, 
        <br>execute arbitrary code as the user running formail. (CVE-2014-3618) 
       </p> 
      </td> 
      <td colspan="1" class="confluenceTd">procmail-3.22-17.1.2.x86_64.rpm</td> 
      <td colspan="1" class="confluenceTd"> 
       <img class="emoticon emoticon-tick" src="/s/en_GB-1988229788/4733/f235dd088df5682b0560ab6fc66ed22c9124c0be.57/_/images/icons/emoticons/check.png" data-emoticon-name="tick" alt="(tick)"> 
      </td> 
     </tr> 

     <tr> 
      <td colspan="1" class="confluenceTd">Outdated RHSA-2014:1166</td> 
      <td colspan="1" class="confluenceTd"> 
       <p>Jakarta Commons HTTPClient implements the client side of HTTP standards.</p> 
       <p>It was discovered that the HTTPClient incorrectly extracted host name from 
        <br>an X.509 certificate subject's Common Name (CN) field. A man-in-the-middle 
        <br>attacker could use this flaw to spoof an SSL server using a specially 
        <br>crafted X.509 certificate. (CVE-2014-3577)</p> 
      </td> 
      <td colspan="1" class="confluenceTd"> 
       <p>jakarta-commons-httpclient-3.0-7jpp.4.el5_10.x86_64.rpm</p> 
       <p>jakarta-commons-httpclient-demo-3.0-7jpp.4.el5_10.x86_64.rpm</p> 
       <p>jakarta-commons-httpclient-javadoc-3.0-7jpp.4.el5_10.x86_64.rpm</p> 
       <p>jakarta-commons-httpclient-manual-3.0-7jpp.4.el5_10.x86_64.rpm</p> 
      </td> 
     </tr> 

     <tr> 
      <td colspan="1" class="confluenceTd">RHSA-2014:1148-1</td> 
      <td colspan="1" class="confluenceTd"> 
       <p>A flaw was found in the way Squid handled malformed HTTP Range headers. 
        <br>A remote attacker able to send HTTP requests to the Squid proxy could use 
        <br>this flaw to crash Squid. (CVE-2014-3609) 
       </p> 
       <p>A buffer overflow flaw was found in Squid's DNS lookup module. A remote 
        <br>attacker able to send HTTP requests to the Squid proxy could use this flaw 
        <br>to crash Squid. (CVE-2013-4115)</p> 
      </td> 
      <td colspan="1" class="confluenceTd"><span>squid-2.6.STABLE21-7.el5_10.x86_64.rpm</span> 
      </td> 
      <td colspan="1" class="confluenceTd"></td> 
     </tr> 
    </tbody> 
</table> 

I need your help. I have tried many times and read posts here, but I can't get it to work. Thanks.

Answer


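Be careful with how you access your elements (see the jsoup documentation):

getElementsByClass takes a single class name, so pass only one class to it.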

public static void getTdSibling(String sourceTd) throws FileNotFoundException, UnsupportedEncodingException { 
    String fragment = sourceTd; 
    Document doc = Jsoup.parseBodyFragment(fragment); 
    // pass a single class name to getElementsByClass 
    Elements myElements = doc.getElementsByClass("confluenceTable").first().getElementsByTag("tr"); 
    for (Element element : myElements) { 
     // select the TDs of this row 
     Elements tds = element.getElementsByTag("td"); 
     // skip rows without a third cell and rows whose first cell is marked "Outdated" 
     if (tds.size() < 3 || tds.first().text().contains("Outdated")) { 
      continue; 
     } 
     // the third td holds the package names, either as direct text 
     // or wrapped in child elements such as <p> or <span> 
     Element third = tds.get(2); 
     Elements rpms = third.children().isEmpty() ? new Elements(third) : third.children(); 
     for (Element rpm : rpms) { 
      if (rpm.text().contains(".rpm")) { 
       System.out.println(rpm.text()); 
      } 
     } 
    } 
} 

Edit: the loop now goes on to the third TD.
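
For reference, here is a self-contained sketch of the same idea that returns the package names instead of printing them, assuming jsoup is on the classpath. The class name RpmExtractor and the method name extractRpms are made up for this illustration and do not come from the question. It uses a CSS selector so that both table classes can be required at once, and it reads the text of the whole third cell, so it does not matter whether the names sit directly in the <td> or inside <p> or <span> children. It also assumes that package names contain no spaces.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, named only for this example.
public class RpmExtractor {

    // Collects the ".rpm" names from the third cell of every row
    // whose first cell does not contain the word "Outdated".
    public static List<String> extractRpms(String html) {
        List<String> result = new ArrayList<>();
        Document doc = Jsoup.parse(html);
        // CSS selector: rows of a table carrying both classes
        for (Element row : doc.select("table.confluenceTable.tablesorter tr")) {
            Elements tds = row.select("td");
            // Ignore rows that have fewer than three cells
            if (tds.size() < 3) {
                continue;
            }
            // Skip rows flagged as outdated in the first cell
            if (tds.get(0).text().contains("Outdated")) {
                continue;
            }
            // text() joins the text of all descendants with single spaces,
            // so splitting on whitespace yields one token per package name
            for (String token : tds.get(2).text().split("\\s+")) {
                if (token.endsWith(".rpm")) {
                    result.add(token);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the table from the question
        String html = "<table class=\"confluenceTable tablesorter\"><tbody>"
                + "<tr><td>RHSA-2014:1172</td><td>desc</td>"
                + "<td>procmail-3.22-17.1.2.x86_64.rpm</td></tr>"
                + "<tr><td>Outdated RHSA-2014:1166</td><td>desc</td>"
                + "<td><p>jakarta-commons-httpclient-3.0-7jpp.4.el5_10.x86_64.rpm</p></td></tr>"
                + "<tr><td>RHSA-2014:1148-1</td><td>desc</td>"
                + "<td><span>squid-2.6.STABLE21-7.el5_10.x86_64.rpm</span></td></tr>"
                + "</tbody></table>";
        for (String rpm : extractRpms(html)) {
            System.out.println(rpm);
        }
    }
}
```

With the shortened table in main, this should print the procmail and squid packages and leave out the row marked "Outdated". Returning a list rather than printing makes the result easier to reuse or check in a test.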

Can you fix this line: 'tds:element.getElementsByTag("td");'? It is an error. – user3278908 2014-09-24 03:40:37

My typo, sorry. There was also a missing ';'. – yunandtidus 2014-09-24 07:37:19