2011-04-27 31 views

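Java XPath API: extracting selective text

I'm using the Java XPath API to extract content from XHTML files. I walk through the HTML and try to extract one specific piece of content, which contains text and a few nested elements. Strangely, when I use XPath it ignores all the HTML tags and extracts only the text content. Here is an HTML snippet.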

<html> 
<body> 
<div class="content"> 
    <div class="content_wrapper"> 
     <table border="0" cellspacing="0" cellpadding="0" class="test_class"> 
      <tr> 
       <td> 
        <p> 
         Reading and looking at images or movies is one thing. Experiencing it in 3D the other. If you like to figure out more about what Showcase is, I would really encourage you to 
         download Showcase Viewer and have a look at the demo files also available on this site. Interact with the models and see how real it looks. 
        </p> 
        <p style="text-align: center;"> 
         <img src="/testsource/fckdata/208123/image/showcarswatch.jpg" alt="" /> 
         <img src="/testsource/fckdata/208123/image/engineswatch.jpg" alt="" /> 
         <img src="/th.gen/?:760x0:/userdata/fckdata/208123/image/toasterswatch.jpg" alt="" /> 
         <img src="/testsource/fckdata/208123/image/smartphoneswatch.jpg" alt="" /> 
        </p> 
        <p> 
         <br /> 
         Showcase Viewer is actually a full Showcase install, except data processing and creation tools. This means that you can look at any data created with a regular Showcase you 
         just can´t add any information. But you may join a collaboration session hosed by a Showcase Professional user. Here is where you can get it:<br /> 
        </p> 
        <p> 
         <strong>Operating System</strong><br /> 
         • Microsoft® Windows® XP Professional (SP 2 or higher)<br /> 
         • Windows XP Professional x64 Edition (Autodesk® Showcase® software runs as a 32-bit application on 64-bit operating system)<br /> 
         • Microsoft Windows Vista® 32-bit or 64-bit, including Business, Enterprise or Ultimate (SP 1) 
        </p> 
       </td> 
      </tr> 
     </table> 
    </div> 
</div> 
</body> 
</html> 

Now, here is the code I'm using. I need to do some cleanup before using XPath.

Here is the output.


Reading and looking at images or movies is one thing. Experiencing it in 3D the other. If you like to figure out more about what Showcase is, I would really encourage you to 
download Showcase Viewer and have a look at the demo files also available on this site. Interact with the models and see how real it looks. 

Showcase Viewer is actually a full Showcase install, except data processing and creation tools. This means that you can look at any data created with a regular Showcase you 
just can´t add any information. But you may join a collaboration session hosed by a Showcase Professional user. Here is where you can get it 

Operating System 
• Microsoft® Windows® XP Professional (SP 2 or higher)<br /> 
• Windows XP Professional x64 Edition (Autodesk® Showcase® software runs as a 32-bit application on 64-bit operating system)<br /> 
• Microsoft Windows Vista® 32-bit or 64-bit, including Business, Enterprise or Ultimate (SP 1) 

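What I need is the complete content inside the content_wrapper div, markup included.

Any pointers would be appreciated.

Thanks

EDIT in response to yamburg's solution

Sample code.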

XPathFactory factory = XPathFactory.newInstance(); 
XPath xpathCompiled = factory.newXPath(); 
XPathExpression expr = xpathCompiled.compile(contentPath); 
NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET); 


for (int i = 0; i < nodes.getLength(); i++) { 
    Node n = (Node)nodes.item(i); 
    traverseNodes(n); 
} 

public static void traverseNodes(Node n) { 
    NodeList children = n.getChildNodes(); 
    if(children != null) { 
     for(int i = 0; i < children.getLength(); i++) { 
      Node childNode = children.item(i); 
      System.out.println("node name = " + childNode.getNodeName()); 
      System.out.println("node value = " + childNode.getNodeValue()); 
      System.out.println("node type = " + childNode.getNodeType()); 
      traverseNodes(childNode); 
     } 
    } 
} 

This is not about XPath expressions, but about DOM methods applied to XPath results. Retagged. – 2011-04-28 00:00:04

Answers

1

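An XPath expression matches a node set. In your case the matched node contains text along with child element nodes. toString() returns the text representation of that node, and that is exactly what you got: text, without element names or attributes.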
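To make the difference concrete, here is a minimal sketch using only the JDK's JAXP classes. The class name `StringValueDemo`, its helper methods, and the shortened markup are hypothetical stand-ins, not the asker's code. Evaluating the same expression as a string drops the markup, while evaluating it as a node keeps the element structure.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class StringValueDemo {
    // Hypothetical, shortened stand-in for the markup in the question.
    static final String XML =
        "<div class=\"content_wrapper\"><p>Hello <strong>world</strong><br/></p></div>";

    // Parses an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Evaluates the expression as a string: the result is the concatenated
    // text of all descendant text nodes, with no element names or attributes.
    static String asString(Document doc, String path) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (String) xpath.evaluate(path, doc, XPathConstants.STRING);
    }

    // Evaluates the expression as a node: the element structure is preserved.
    static Node asNode(Document doc, String path) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (Node) xpath.evaluate(path, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        String path = "//div[@class='content_wrapper']";
        System.out.println("as string: " + asString(doc, path));
        Node div = asNode(doc, path);
        System.out.println("first child element: " + div.getFirstChild().getNodeName());
    }
}
```

The node returned by `asNode` still has its `p`, `strong` and `br` children, so anything that needs the markup has to start from the node rather than from its string value.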

You should get the nodes themselves:

NodeSequence nodes = (NodeSequence) XPathAPI.eval(doc, xpath); // Xalan: context node, then the XPath string

and then walk through the nodes, dumping whatever you want from them, or convert them into a new DOM document, for example.
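As a rough sketch of the "new DOM document" option, the matched nodes can be deep-copied under a fresh root element with `Document.importNode`. The class `NodeCopyDemo`, its method names, and the root name passed in are made up for illustration; only standard JAXP and DOM calls are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeCopyDemo {
    // Parses an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns all nodes matched by the XPath expression.
    static NodeList select(Document doc, String path) throws Exception {
        return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(path, doc, XPathConstants.NODESET);
    }

    // Builds a new document whose root (name chosen by the caller)
    // holds deep copies of every matched node.
    static Document copyUnderRoot(NodeList nodes, String rootName) throws Exception {
        Document target = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = target.createElement(rootName);
        target.appendChild(root);
        for (int i = 0; i < nodes.getLength(); i++) {
            // importNode(..., true) clones the whole subtree into the target document;
            // the source document is left untouched.
            root.appendChild(target.importNode(nodes.item(i), true));
        }
        return target;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<td><p>first</p><p>second<br/></p></td>");
        Document copy = copyUnderRoot(select(doc, "/td/p"), "wrapper");
        System.out.println(copy.getDocumentElement().getChildNodes().getLength());
    }
}
```

Because the copies sit under a single root, the resulting document can be handed to any API that expects well-formed XML.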

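P.S. Xalan is fine, but modern Java ships with JAXP. For the sake of portable code and portable knowledge, I would suggest using it instead (unless Xalan's extensions are required or useful):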

XPathFactory factory = XPathFactory.newInstance(); 
XPath xpathCompiled = factory.newXPath(); 
XPathExpression expr = xpathCompiled.compile(xpath); 

NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET); 

Then convert it to a string (apparently that is what you want):

StringWriter sw = new StringWriter(); 
Transformer serializer = TransformerFactory.newInstance().newTransformer(); 
serializer.transform(new DOMSource(nodes.item(0)), new StreamResult(sw)); 
String result = sw.toString(); 

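Note that this takes only the first element of the NodeList, because an XML document must have a single root element. If I understand your case correctly, that is fine; otherwise you need to add a top-level element around the node set.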
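If the expression matches several nodes and a wrapper element is not wanted, one alternative (a sketch, not the only way) is to serialize each node separately and concatenate the results. The class `SerializeNodesDemo` and its helpers are hypothetical names. `OutputKeys.OMIT_XML_DECLARATION` is set so that no `<?xml ...?>` header is emitted for each fragment.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SerializeNodesDemo {
    // Parses an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns all nodes matched by the XPath expression.
    static NodeList select(Document doc, String path) throws Exception {
        return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(path, doc, XPathConstants.NODESET);
    }

    // Serializes every node in the list, markup included, into one string.
    static String serialize(NodeList nodes) throws Exception {
        // An identity transformer copies the DOM subtree to the output unchanged.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter sw = new StringWriter();
        for (int i = 0; i < nodes.getLength(); i++) {
            serializer.transform(new DOMSource(nodes.item(i)), new StreamResult(sw));
        }
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<td><p>a</p><p><strong>b</strong><br/></p></td>");
        System.out.println(serialize(select(doc, "/td/p")));
    }
}
```

The result is a markup fragment rather than a well-formed document, so it suits display or storage as text; anything that must be parsed again as XML still needs a single root element.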


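@yamburg .. thanks for the suggestion. Walking the node list gives me the node names and the corresponding values. The node name usually comes back as just td rather than the tag markup, so rebuilding the content in its exact format would get a bit tedious. Maybe I'm missing something here. I've added sample code to the question. – Shamik 2011-04-27 20:48:35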


Updated. Please state what you want more precisely. ;) – 2011-04-28 02:32:03


@yamburg ... thanks, man, I see the issue now. I appreciate your help. – Shamik 2011-04-28 18:04:35