2015-01-05

jsoup fails to return complete nodes with all their children

I have the following sample HTML:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd"> 
<html lang="en"> 
<head> 
<title>example.com</title> 
</head> 
<body> 

<div> 
    <ul class="mb10"> 
     <li><input class="ript" name="pmtmthd" value="NOLINK" 
      type="radio" id="NOLINK" reqType="ChgPaymentMtd" nodsb="true"> 
      <label for="NOLINK"><img 
       src="https://example.com/example1.gif" 
       height="23" width="147" alt="Credit Card"> 
       <div class="v10777" style="margin-left: 20px">Processed</div> 
      </label> </input> 
     </li> 
     <li><input class="ript" name="pmtmthd" value="SPLLINK" 
      type="radio" id="SPLLINK" reqType="ChgPaymentMtd" nodsb="true" 
      checked="checked"> <label for="SPLLINK"><img 
       src="https://example.com/example2.gif" 
       height="19" width="73" alt="spllink"> 
       </label> </input> 
     </li> 
     </ul> 
    </div> 
</body> 
</html> 

I am trying to extract all the radio input elements:

List<Element> radioElements = doc.getElementsByAttributeValue("type", "radio"); 

The output contains the elements without any of their children:

<input class="ript" name="pmtmthd" value="NOLINK" type="radio" id="NOLINK" reqType="ChgPaymentMtd" nodsb="true" /> 

<input class="ript" name="pmtmthd" value="SPLLINK" type="radio" id="SPLLINK" reqType="ChgPaymentMtd" nodsb="true" checked="checked" /> 

How can I get all the radio elements with all of their children intact?


Did my answer help? – alkis

Answer


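jsoup normalizes the HTML it parses so that it can correct errors (invalid HTML). Placing content inside an input tag is invalid HTML: input is a void element, which allows attributes but no children. The HTML parser therefore does not keep the nested content as children of the input. To prevent this normalization, use a different parser, such as the XML parser: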

Document doc = Jsoup.parse(html, "", Parser.xmlParser()); 
Elements radios = doc.getElementsByAttributeValue("type", "radio"); 
System.out.println(radios); 

Output

<input class="ript" name="pmtmthd" value="NOLINK" type="radio" id="NOLINK" reqtype="ChgPaymentMtd" nodsb="true"><label for="NOLINK"><img src="https://example.com/example1.gif" height="23" width="147" alt="Credit Card"> 
    <div class="v10777" style="margin-left: 20px"> 
    Processed 
    </div></img></label></input> 
<input class="ript" name="pmtmthd" value="SPLLINK" type="radio" id="SPLLINK" reqtype="ChgPaymentMtd" nodsb="true" checked="checked"><label for="SPLLINK"><img src="https://example.com/example2.gif" height="19" width="73" alt="spllink" /></label></input>
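
For comparison, here is a minimal sketch of how the two parsers treat a cut-down version of the markup from the question. The class name `RadioChildrenDemo`, the helpers `radioChildCount` and `labelTextFor`, and the shortened HTML string are illustrative assumptions, not part of the original post. The second helper shows one possible way to reach the label while staying with the default HTML parser, by matching the label's `for` attribute against the radio's id.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class RadioChildrenDemo {

    // Shortened, hypothetical version of the markup from the question.
    static final String HTML =
        "<ul><li><input type=\"radio\" id=\"NOLINK\" value=\"NOLINK\">"
        + "<label for=\"NOLINK\"><div>Processed</div></label></input></li></ul>";

    /** Counts the child elements of the first radio input under the given parser. */
    static int radioChildCount(Parser parser) {
        Document doc = Jsoup.parse(HTML, "", parser);
        Element radio = doc.getElementsByAttributeValue("type", "radio").first();
        return radio.children().size();
    }

    /** With the default HTML parser, finds the label via its "for" attribute. */
    static String labelTextFor(String radioId) {
        Document doc = Jsoup.parse(HTML);
        Element label = doc.select("label[for=" + radioId + "]").first();
        return label == null ? "" : label.text();
    }

    public static void main(String[] args) {
        System.out.println("HTML parser children: " + radioChildCount(Parser.htmlParser()));
        System.out.println("XML parser children:  " + radioChildCount(Parser.xmlParser()));
        System.out.println("Label text: " + labelTextFor("NOLINK"));
    }
}
```

With the HTML parser the input should report no children, because the label is parsed as a following sibling, while the XML parser keeps the label nested inside the input. The `for`-attribute lookup may be preferable when the rest of the page relies on HTML parsing rules that the XML parser does not apply.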