Parsing text that contains HTML tags

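I have to parse an XML file from a server. I tried both a DOM parser and a SAX parser, but I cannot parse the HTML tags: the text stops at the first "<" the parser encounters.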

This is my parser class:

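import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.io.UnsupportedEncodingException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import android.util.Log;

public class XMLParser { 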

    // constructor 
    public XMLParser() { 

    } 


    public String getXmlFromUrl(String url) { 
     String xml = null; 
     BufferedReader in = null; 

     try { 
      // defaultHttpClient 
      DefaultHttpClient httpClient = new DefaultHttpClient(); 
      HttpPost httpPost = new HttpPost(url); 

      HttpResponse httpResponse = httpClient.execute(httpPost); 
      in = new BufferedReader(new InputStreamReader(
        httpResponse.getEntity().getContent(), "UTF-8")); 


      StringBuilder sb = new StringBuilder(); 
      String line; 
      String NL = System.getProperty("line.separator"); 

      // readLine() is called only in the loop condition; a second call
      // inside the body would silently drop every other line.
      while ((line = in.readLine()) != null) 
       { 
        sb.append(line); 
        sb.append(NL); 
       } 
      in.close(); 

      xml = sb.toString(); 

     } catch (UnsupportedEncodingException e) { 
      e.printStackTrace(); 
     } catch (ClientProtocolException e) { 
      e.printStackTrace(); 
     } catch (IOException e) { 
      e.printStackTrace(); 
     } 
     // return XML 
     return xml; 
    } 

    public Document getDomElement(String xml){ 
     Document doc = null; 
     DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance(); 
     try { 

      DocumentBuilder db = dbf.newDocumentBuilder(); 

      InputSource is = new InputSource(); 
       is.setCharacterStream(new StringReader(xml)); 
       doc = db.parse(is); 

      } catch (ParserConfigurationException e) { 
       Log.e("Error: ", e.getMessage()); 
       return null; 
      } catch (SAXException e) { 
       Log.e("Error: ", e.getMessage()); 
       return null; 
      } catch (IOException e) { 
       Log.e("Error: ", e.getMessage()); 
       return null; 
      } 

      return doc; 
    } 


    public final String getElementValue(Node elem) { 
     Node child; 
     if(elem != null){ 
      if (elem.hasChildNodes()){ 
       for(child = elem.getFirstChild(); child != null; child = child.getNextSibling()){ 
        if(child.getNodeType() == Node.TEXT_NODE ){ 
         return child.getNodeValue(); 
        } 
       } 
      } 
     } 
     return ""; 
    } 

    /** 
     * Gets the text value of the first child element with the given tag name. 
     * @param item parent element 
     * @param str tag name of the child element (the key) 
     */ 
    public String getValue(Element item, String str) {  
      NodeList n = item.getElementsByTagName(str);   
     return this.getElementValue(n.item(0)); 
    } 

    }


2012-05-19 mir

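If your HTML is not well-formed (for example, it contains tags that are never closed), none of these parsers will work. You may end up having to parse it by hand (for example, with regular expressions and the like). If the HTML is well-formed, you should post the error you are getting, and possibly a link to the page.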
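
To find out which case applies, one option is to run the downloaded string through the XML parser and look at the first fatal error it reports. The following is only a minimal sketch: the class name WellFormedCheck, the method firstError, and the sample strings are made up for illustration and are not part of the question's code.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original XMLParser class.
final class WellFormedCheck {

    /**
     * Returns null if the input parses as XML, otherwise a short
     * description of the first fatal error the parser reported.
     */
    static String firstError(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and ignores warnings,
            // replacing the built-in handler that prints to stderr.
            db.setErrorHandler(new DefaultHandler());
            db.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return String.valueOf(e.getMessage());
        }
    }

    public static void main(String[] args) {
        // Every tag is closed, so this is well-formed XML.
        System.out.println(firstError("<item><p>closed</p></item>"));
        // <br> and <p> are never closed, so the parser rejects the input.
        System.out.println(firstError("<item><p>not closed<br></item>"));
    }
}
```

The position and message returned for the failing input are the kind of detail worth posting along with the question.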


2012-05-19 16:14:05 Melllvar

[link](http://mirsitelfi.comoj.com/test.php) Do you think this HTML is not well-formed? – mir

I'm no expert, but it looks like an XML document with an HTML header. Your boss should start by fixing the header (see http://www.w3schools.com/xml/) – Melllvar

Do you think it will work if I try to fix it like that? – mir

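You should use an HTML parser, because most HTML content available on the web does not conform to the XML specification. In simple cases regular expressions are enough, but in complex cases you will probably need an HTML parser.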
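
For the "simple case", a regular expression can pull the raw markup out of one element without asking an XML parser to accept it. This is a rough sketch under stated assumptions: the tag name "description" and the class RegexExtract are hypothetical, and the pattern does not handle an element nested inside another element of the same name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the regular-expression approach.
final class RegexExtract {

    /** Returns the raw markup between <tag ...> and </tag>, or null if the tag is absent. */
    static String innerMarkup(String xml, String tag) {
        String t = Pattern.quote(tag);
        // DOTALL lets '.' span line breaks; the reluctant '.*?' stops at
        // the first closing tag instead of the last one.
        Pattern p = Pattern.compile(
                "<" + t + "(?:\\s[^>]*)?>(.*?)</" + t + "\\s*>", Pattern.DOTALL);
        Matcher m = p.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    /** Removes anything that looks like a tag, leaving only the text. */
    static String stripTags(String markup) {
        return markup.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        // The <br> is never closed, so an XML parser would reject this input.
        String xml = "<item><description>Hello <b>world</b><br></description></item>";
        String raw = innerMarkup(xml, "description");
        System.out.println(raw);            // prints "Hello <b>world</b><br>"
        System.out.println(stripTags(raw)); // prints "Hello world"
    }
}
```

Anything beyond this (attributes containing ">", comments, nested elements of the same name) is where a real HTML parser becomes the safer choice.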


2012-05-19 16:25:29

I have no choice, I have to do it with an XML parser!! :( – mir

Then you are out of options, because as I explained, an XML parser simply cannot be used here. By the way, why do you have to use an XML parser? –

It's a project, and the "boss" asked me to do it with an XML parser O.o – mir
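
If the embedded markup does happen to be well-formed, the "stops at the first <" symptom can also come from reading only the first text node of an element, as getElementValue in the question does: a DOM parser turns each inline tag into a child element, so the text is split across several nodes. The sketch below only illustrates that behaviour; the class MixedContentDemo, its methods, and the sample XML are invented for the example, and getTextContent() is the standard DOM Level 3 method on org.w3c.dom.Node.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original XMLParser.
final class MixedContentDemo {

    /** Parses the XML and returns the first element with the given tag name. */
    static Element firstElement(String xml, String tag) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getElementsByTagName(tag).item(0);
        } catch (Exception e) {
            // Wrapped so callers do not need to handle checked exceptions.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Same strategy as getElementValue in the question: first text child only. */
    static String firstTextNode(Element e) {
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                return c.getNodeValue();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String xml = "<item><description>Hello <b>world</b>!</description></item>";
        Element d = firstElement(xml, "description");
        System.out.println(firstTextNode(d));    // prints "Hello " (text before <b> only)
        System.out.println(d.getTextContent());  // prints "Hello world!"
    }
}
```

Note that getTextContent() discards the tags themselves. If the markup has to be preserved, the server would need to wrap it in a CDATA section or escape it, which is a change on the producing side rather than in the parser.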
