2015-01-06 96 views

Parsing HTML tags as XML

I am trying to parse the XML embedded in the HTML file below. Here is a detail from one of the tags:

      <tr class="iris_table_row"> 
       <td style=" width:37.50%; text-align:left; " class="ta_10"><span class="ta_10">Tangible assets</span></td> 
       <td style=" width:2.50%; text-align:right; " class="ta_10"><span class="ta_10">2</span></td> 
       <td style=" width:30.00%; text-align:right; " class="ta_61"><ix:nonFraction contextRef="cfwd_31_03_2014" name="ns5:TangibleFixedAssets" unitRef="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">7,956</ix:nonFraction></td> 
       <td style=" width:1.25%; " class="ta_61" /> 
       <td style=" width:26.25%; text-align:right; " class="ta_60"><ix:nonFraction contextRef="cfwd_31_03_2013" name="ns5:TangibleFixedAssets" unitRef="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">5,402</ix:nonFraction></td> 
       <td style=" width:1.25%; " class="ta_60" /> 
       <td style=" width:1.25%; " class="ta_10" /> 
      </tr> 

I tried to do this with Java's DOM parser, but it does not recognize the XML tags.

In the code below, the value of db.parse(fXmlFile) is "null".

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;

    File fXmlFile = new File("Prod223_1254_04903825_20140331 copy.xml");

    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setValidating(false);
    dbf.setNamespaceAware(true);
    dbf.setIgnoringComments(false);
    dbf.setIgnoringElementContentWhitespace(false);
    dbf.setExpandEntityReferences(false);
    DocumentBuilder db = dbf.newDocumentBuilder();

    System.out.println(db.parse(fXmlFile));

How can I get all the tags and their information into Java? Ideally, I could load them into a bean.

This is an example of the type of file I am trying to parse.

<?xml version="1.0" encoding="utf-8"?><html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" xmlns:ixt="http://www.xbrl.org/inlineXBRL/transformation/2010-04-20" xmlns:ixt2="http://www.xbrl.org/inlineXBRL/transformation/2011-07-31" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:xl="http://www.xbrl.org/2003/XLink" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:iris="http://www.iris.co.uk/ixbrl" xmlns:ns0="http://www.xbrl.org/uk/gaap/core-full/2009-09-01" xmlns:ns5="http://www.xbrl.org/uk/gaap/core/2009-09-01" xmlns:ns6="http://www.xbrl.org/uk/reports/direp/2009-09-01" xmlns:ns7="http://www.xbrl.org/uk/cd/business/2009-09-01" xmlns:ns8="http://www.xbrl.org/uk/all/types/2009-09-01" xmlns:ns9="http://xbrl.org/2005/xbrldt" xmlns:ns10="http://www.xbrl.org/uk/all/common/2009-09-01" xmlns:ns11="http://www.xbrl.org/2006/ref" xmlns:ns12="http://www.xbrl.org/uk/cd/countries/2009-09-01" xmlns:ns13="http://www.xbrl.org/uk/all/ref/2009-09-01" xmlns:ns14="http://www.xbrl.org/uk/cd/currencies/2009-09-01" xmlns:ns15="http://www.xbrl.org/uk/cd/exchanges/2009-09-01" xmlns:ns16="http://www.xbrl.org/uk/cd/languages/2009-09-01" xmlns:ns17="http://www.xbrl.org/2004/ref" xmlns:ns18="http://www.xbrl.org/uk/all/gaap-ref/2009-09-01" xmlns:ns19="http://www.xbrl.org/uk/reports/aurep/2009-09-01" xmlns:iso4217="http://www.xbrl.org/2003/iso4217" xmlns:ns20="http://www.govtalk.gov.uk/uk/fr/tax/full-gaap-dpl/2013-10-01" xmlns:ns21="http://www.govtalk.gov.uk/uk/fr/tax/dpl-gaap-main/2013-10-01" xmlns:ns22="http://www.govtalk.gov.uk/uk/fr/tax/dpl-gaap/2013-10-01" xmlns:ns23="http://www.govtalk.gov.uk/uk/fr/tax/dpl-core/2013-10-01"> 
<head> 
    <meta name="PostingEntryNumber" content="4" /> 
    <meta name="PeriodRecordNumber" content="2341" /> 
    <meta content="application/xhtml+xml; charset=UTF-8" http-equiv="Content-Type" /> 
    <meta name="description" content="iXBRL report production" /> 
    <meta name="Mode" content="CH" /> 
    <meta http-equiv="X-UA-Compatible" content="IE=8" /> 

    <title>Shortt Orthopaedics Limited - Limited company - abbreviated - 11.6</title> 
    <style type="text/css"> 
     @media print 
     { 
      hr { display:none; } 
      .portraitpage 
      { 
       min-height:273mm; 
       max-width:170mm; 
      } 
      .landscapepage 
      { 
       min-height:170mm; 
       max-width:273mm; 
      } 
     } 
     @media screen 
     { 
      .portraitpage 
      { 
       max-width:170mm; 
       min-height:273mm; 
       margin:12mm 20mm 12mm 20mm; 
      } 
      .landscapepage 
      { 
       max-width:273mm; 
       min-height:170mm; 
       margin:12mm 20mm 12mm 20mm; 
      } 
     } 
     body{ margin:0px; font-size:1.3em; } 
     td{ padding:0px; } 
     div.portraitpage{ page-break-after:always; position:relative; } 
     div.landscapepage{ page-break-after:always; position:relative; } 
      div.header{ position:relative; } 
      div.footer{ left:0px; right:0px; bottom:0px; text-align:center; position:absolute; } 
    div.container{ position:relative; } 
        div.maintext{ width:100.00%; position:relative; } 
        div.tagged_blob{ width:100.00%; position:relative; } 
           table.iris_table{ width:100.00%; border-collapse:collapse; } 
       table.iris_table_header{ width:100.00%; border-collapse:collapse; } 
       table.iris_table_footer{ width:100.00%; border-collapse:collapse; } 
     div.hr.iris_hr{ width:100.00%; } 
      td.total_single{ border-top:thin solid black; } 
      td.total_double{ border-top:double black; } 
     .ta_10{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_11{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_12{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_13{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_20{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_21{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_22{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_23{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_30{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_31{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_32{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_33{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_40{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_41{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_42{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_43{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_50{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_51{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_52{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_53{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_60{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_61{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_62{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_63{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_70{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_71{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_72{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_73{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_80{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_81{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_82{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_83{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_90{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_91{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_92{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_93{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_100{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_101{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_102{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_103{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_110{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_111{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_112{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_113{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_120{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_121{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_122{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_123{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_130{ color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:400; } 
     .ta_131{ color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:700; } 
     .ta_132{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:700; } 
     .ta_133{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:400; } 
     .ta_140{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; } 
     .ta_141{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; } 
     .ta_142{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; } 
     .ta_143{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; } 
    </style> 
</head> 
<body xml:lang="en"> 
    <div style="display:none"> 
     <ix:header> 
      <ix:hidden> 
       <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:NameAuthor" order="1" tupleRef="XBRLDocumentAuthorGrouping_Group45" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL"></ix:nonNumeric> 
       <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:DescriptionOrTitleAuthor" order="2" tupleRef="XBRLDocumentAuthorGrouping_Group45" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL"></ix:nonNumeric> 
       <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:UKCompaniesHouseRegisteredNumber" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">07189486</ix:nonNumeric> 
       <ix:nonNumeric contextRef="CountriesHypercube_FY_31_03_2014_Set1" name="ns7:CountryFormationOrIncorporation" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" /> 
       <ix:nonNumeric contextRef="CurrenciesHypercube_FY_31_03_2014_Set2" name="ns7:PrincipalCurrencyUsedInBusinessReport" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" /> 
       <ix:nonNumeric contextRef="EntityOfficersHypercube_FY_31_03_2014_Set3" name="ns5:NameDirectorSigningAccounts" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" /> 
       <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:StartDateForPeriodCoveredByReport" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">1.4.13</ix:nonNumeric> 
       <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:EndDateForPeriodCoveredByReport" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">31.3.14</ix:nonNumeric> 
       <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:BalanceSheetDate" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">31.3.14</ix:nonNumeric> 
       <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:EntityAccountsType" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">Company accounts</ix:nonNumeric> 
       <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:LegalFormOfEntity" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">Private Limited Company</ix:nonNumeric> 
       <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:DescriptionPeriodCoveredByReport" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">FY</ix:nonNumeric> 
       <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:EntityTrading" format="ixt2:booleantrue" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">true</ix:nonNumeric> 

[Body text truncated by the site's character limit]

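If Stack Overflow limits the body text, please remove the bits that are not relevant to your question. The limit exists for a reason; you don't need to post 4 KB of XML to make your point. (Also, what *is* your point? You did not specify *which* tags you want to load, or in what form.) – Tomalak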

I did not specify the tags because I want to load all of them. In what form? Strings for string tags, and so on. Do you know how to parse the HTML? –

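To ask it differently: what should the result be, and what is the end goal of the whole exercise? An HTML file? And please reduce the size of your post; that will also help you build a meaningful example. – Tomalak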

Answers

0

I think you need a two-step approach.

  • Use an HTML parser to get at the embedded XML
  • ...then use a DOM parser on that content

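HTML does not always conform to the XML rules (unless you use XHTML, which has become less popular). Browsers let many things slide, such as missing closing tags, mixed single and double quotes, and attributes without values, and any of these may be why your file fails to parse.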
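The difference between browser leniency and XML strictness can be checked with the JDK's own parser. The following is a minimal sketch, not a full diagnostic tool; the class name `WellFormedCheck` and the two sample strings are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper: reports whether a piece of markup is well-formed XML.
class WellFormedCheck {

    /** Returns true if the markup parses as XML, false if the parser rejects it. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print to stderr.
            db.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            db.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the markup is not well-formed XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // XHTML style: every tag closed, every attribute has a quoted value.
        String xhtml = "<p><input disabled=\"disabled\" /><br /></p>";
        // Browser-tolerated HTML: attribute without a value, unclosed tags.
        String html = "<p><input disabled><br></p>";

        System.out.println("XHTML sample well-formed: " + isWellFormed(xhtml));
        System.out.println("HTML sample well-formed:  " + isWellFormed(html));
    }
}
```

If a check like this accepts the real file, the first step (an HTML parser) may not be needed at all and the DOM parser can be used directly.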

Many HTML parsers are available.
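For the second step, here is a minimal sketch that assumes the content is already well-formed XHTML, as the excerpt in the question appears to be. It uses a namespace-aware DOM parser to collect every `ix:nonFraction` element into a small bean. The `Fact` class, the `FactExtractor` name, and the shortened sample row are illustrative assumptions, not part of the original post. Note also that printing a `Document` only shows its `toString()`, which on the JDK's bundled parser is typically `[#document: null]` (the node name followed by a null node value), so that output alone does not mean parsing failed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical extractor for inline XBRL numeric facts.
class FactExtractor {
    // Namespace URI taken from the xmlns:ix declaration in the question's file.
    static final String IX_NS = "http://www.xbrl.org/2008/inlineXBRL";

    // Shortened stand-in for the table row shown in the question.
    static final String SAMPLE_ROW =
        "<tr xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:ix=\"" + IX_NS + "\">"
      + "<td><ix:nonFraction contextRef=\"cfwd_31_03_2014\" name=\"ns5:TangibleFixedAssets\""
      + " unitRef=\"GBP\">7,956</ix:nonFraction></td>"
      + "<td><ix:nonFraction contextRef=\"cfwd_31_03_2013\" name=\"ns5:TangibleFixedAssets\""
      + " unitRef=\"GBP\">5,402</ix:nonFraction></td>"
      + "</tr>";

    /** Hypothetical bean holding one tagged value. */
    static final class Fact {
        final String name;
        final String contextRef;
        final String unitRef;
        final String text;

        Fact(String name, String contextRef, String unitRef, String text) {
            this.name = name;
            this.contextRef = contextRef;
            this.unitRef = unitRef;
            this.text = text;
        }
    }

    /** Parses the markup and returns one Fact per ix:nonFraction element. */
    static List<Fact> extract(String xhtml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for getElementsByTagNameNS
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xhtml)));

            NodeList nodes = doc.getElementsByTagNameNS(IX_NS, "nonFraction");
            List<Fact> facts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                facts.add(new Fact(
                    e.getAttribute("name"),
                    e.getAttribute("contextRef"),
                    e.getAttribute("unitRef"),
                    e.getTextContent().trim()));
            }
            return facts;
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        for (Fact f : extract(SAMPLE_ROW)) {
            System.out.println(f.name + " [" + f.contextRef + ", " + f.unitRef + "] = " + f.text);
        }
    }
}
```

For a file on disk, `db.parse(new File(...))` can replace the `InputSource`; the traversal stays the same. The same loop with `"nonNumeric"` as the local name would collect the hidden header values.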

0

According to the documentation, DTD validation always takes place, even if you tell it not to!

What you want to do is create a new DTD that adds your namespaces to the standard XHTML DTD. The W3C site discusses how to achieve this, and the example they give is MathML:

First, define a content model module that instantiates the MathML DTD and connects it to the content model:

<!-- File: mathml-model.mod --> 
<!ENTITY % XHTML1-math 
    PUBLIC "-//W3C//DTD MathML 2.0//EN" 
      "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd" > 
%XHTML1-math; 

<!ENTITY % Inlspecial.extra 
    "%a.qname; | %img.qname; | %object.qname; | %map.qname; 
     | %Mathml.Math.qname;" > 

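Next, define a DTD driver that identifies our new content model module as the content model for the DTD and hands processing off to the XHTML 1.1 driver (for example):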

<!-- File: xhtml-mathml.dtd --> 
<!ENTITY % xhtml-model.mod 
     SYSTEM "mathml-model.mod" > 
<!ENTITY % xhtml11.dtd 
    PUBLIC "-//W3C//DTD XHTML 1.1//EN" 
      "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" > 
%xhtml11.dtd;
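On the earlier point about the DTD still being processed when validation is switched off: with the JDK's bundled parser, `setValidating(false)` disables validity checking, but a DTD named in the `DOCTYPE` is still read by default, for example to pick up entity declarations. The sketch below illustrates this with an `EntityResolver` that serves a stand-in DTD from memory; the same hook could serve a local copy of a custom driver such as the one above. The public identifier, the `example.invalid` URL, and the one-line DTD are all made up for the demonstration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo: a non-validating parser still asks for the DTD.
class DtdLoadingDemo {
    // Made-up public identifier for the stand-in DTD.
    static final String PUBLIC_ID = "-//EXAMPLE//DTD Stand-in//EN";

    // Stand-in DTD content: declares a single entity and nothing else.
    static final String STAND_IN_DTD = "<!ENTITY nbsp \"&#160;\">";

    /** Parses a small document and records every public ID the parser asks for. */
    static String parseParagraph(List<String> requestedPublicIds) {
        String xml =
            "<!DOCTYPE html PUBLIC \"" + PUBLIC_ID + "\" \"http://example.invalid/stand-in.dtd\">"
          + "<html><body><p>a&nbsp;b</p></body></html>";
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(false); // no validity checking
            dbf.setNamespaceAware(true);
            DocumentBuilder db = dbf.newDocumentBuilder();

            // Serve the DTD from memory instead of letting the parser fetch the URL.
            db.setEntityResolver((publicId, systemId) -> {
                requestedPublicIds.add(publicId);
                if (PUBLIC_ID.equals(publicId)) {
                    return new InputSource(new StringReader(STAND_IN_DTD));
                }
                return null; // fall back to default resolution for anything else
            });

            Document doc = db.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("p").item(0).getTextContent();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        List<String> requested = new ArrayList<>();
        String text = parseParagraph(requested);
        System.out.println("DTDs requested: " + requested);
        System.out.println("Entity expanded to U+00A0: " + (text.indexOf('\u00A0') == 1));
    }
}
```

The file in the question has no `DOCTYPE`, so whether this applies to it is an open question; the sketch only shows the general behaviour and one way to keep DTD lookups local.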