Parsing XML with apostrophes

Take the BBC News RSS feed as an example. One of their news items looks like this:

<item><title>Pupils 'bullied on sports field'</title><description>bla bla..

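I have some Java code that parses this. However, when the title contains an apostrophe (as above), parsing of the title stops there, so I end up with the title "Pupils". It then carries on and parses the description, which is fine. How can I get the complete title? Below is a snippet from inside my for loop, where I parse the information: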

    NodeList title = element.getElementsByTagName("title");
    Element line = (Element) title.item(0);
    tmp.setTitle(getCharacterDataFromElement(line).toString());

Exactly the same code parses the other elements, such as description and pubDate, and those come out fine.

This is the getCharacterDataFromElement method:

public static String getCharacterDataFromElement(Element e) {
    Node child = ((Node) e).getFirstChild();
    if (child instanceof CharacterData) {
        CharacterData cd = (CharacterData) child;
        return cd.getData();
    }
    return "";
}

What am I doing wrong? I am using DocumentBuilder, DocumentBuilderFactory and org.w3c.dom to process the RSS feed.

2012-04-16 Nicklas

As davidfrancis suggested, you should iterate over all of the element's children in getCharacterDataFromElement().
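A minimal sketch of that loop, assuming the title text arrives as several sibling nodes. The class name `TitleText` and the helper `titleOf` are hypothetical and exist only to make the example self-contained. The sample uses a CDATA section to force a second child node under `<title>`; the real feed may split the text for a different reason.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

final class TitleText {

    // Concatenates every direct text child instead of returning only the first one.
    static String getCharacterDataFromElement(Element e) {
        StringBuilder text = new StringBuilder();
        for (Node child = e.getFirstChild(); child != null; child = child.getNextSibling()) {
            // Text also covers CDATASection; comment nodes are skipped on purpose.
            if (child instanceof Text) {
                text.append(((Text) child).getData());
            }
        }
        return text.toString();
    }

    // Hypothetical demo helper: parses a string and reads the first <title> element.
    static String titleOf(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element title = (Element) doc.getElementsByTagName("title").item(0);
            return getCharacterDataFromElement(title);
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse sample XML", ex);
        }
    }

    public static void main(String[] args) {
        // The CDATA section makes <title> hold two child nodes rather than one.
        String xml = "<item><title>Pupils <![CDATA['bullied on sports field']]></title></item>";
        System.out.println(titleOf(xml));
    }
}
```

Checking for `Text` rather than `CharacterData` is a design choice in this sketch: `Comment` is also a `CharacterData`, so the broader check would pull comment text into the title.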

Alternatively, if you can use DOM Level 3, you can use the Node.getTextContent() method instead, which does what you want.

NodeList title = element.getElementsByTagName("title"); 
Element line = (Element)title.item(0); 
tmp.setTitle(line.getTextContent());
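For anyone who wants to try this outside the feed-parsing loop, here is a self-contained sketch of the same idea. The class name `TextContentDemo`, the helper `firstTitle`, and the inline sample XML are assumptions made for illustration; only `getTextContent()` itself comes from the answer above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class TextContentDemo {

    // Hypothetical helper: returns the full text of the first <title> element.
    static String firstTitle(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element line = (Element) doc.getElementsByTagName("title").item(0);
            // getTextContent() joins the text of all descendant nodes,
            // so it does not matter how the parser split the title.
            return line.getTextContent();
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse sample XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<item><title>Pupils 'bullied on sports field'</title>"
                + "<description>bla bla..</description></item>";
        System.out.println(firstTitle(xml));
    }
}
```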

2012-04-16 23:44:48 prunge

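This works great, thanks. – Nicklas 2012-04-17 16:09:51

Your getCharacterDataFromElement only looks at the first child - check whether there are further child nodes too, and join all of their text together.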

HTH - DF

2012-04-16 22:26:50 davidfrancis

-1

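Well, as far as I know, the apostrophe is a reserved character in XML and should therefore be encoded as &apos;.

That would mean the BBC News RSS feed is not serving well-formed XML.

The best course of action is to file a bug report with the BBC News RSS feed provider so that they can fix it.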

2012-04-16 22:34:39 Puce

Why the downvote? – Puce 2012-08-23 08:05:17
