Converting HTML to XML

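I have HTML files that need to be delivered as XML. We currently use these HTML files to provide content to applications, but now we have to provide the same content as XML.

The HTML files contain tables, divs, images, p, b or strong tags, and so on.

I searched Google and found a few applications, but I have not managed to get the conversion working yet.

Could you suggest a way to convert the contents of these files to XML?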

2012-05-06 bahadir arslan

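Take a look at [this answer](http://stackoverflow.com/a/85922/938089). Then read [the fourth comment](http://stackoverflow.com/questions/84556/#comment1436887_85922) carefully. Why do you want to convert HTML to XML? –

@RobW I will check it. We have been providing HTML as content to some applications, but now we have to provide it as XML. –

@RobW, I also know the difference between XML and HTML. But I need to parse the HTML content and put it into XML. –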

I succeeded with the tidy command-line utility. On Linux I installed it quickly with apt-get install tidy. Then the command:

tidy -q -asxml --numeric-entities yes source.html >file.xml

gave me an XML file that I could process with an XSLT processor. However, I needed to set up the XHTML 1 DTDs correctly.

Their homepage is html-tidy.org (and the legacy one: HTML Tidy).
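For batch jobs, the same tidy call can be driven from a script. The following is a minimal Python sketch, assuming tidy is installed and on the PATH; the function names `tidy_command` and `html_file_to_xml` are made up for this illustration. It uses tidy's `-o` option instead of shell redirection and relies on tidy's documented exit codes (0 clean, 1 warnings, 2 errors).

```python
import shutil
import subprocess


def tidy_command(source_path, target_path):
    """Build the argument list for the tidy call shown above (hypothetical helper)."""
    return ["tidy", "-q", "-asxml", "--numeric-entities", "yes",
            "-o", target_path, source_path]


def html_file_to_xml(source_path, target_path):
    """Run tidy on one file; raise if tidy is missing or reports errors."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy is not installed or not on PATH")
    result = subprocess.run(tidy_command(source_path, target_path),
                            capture_output=True, text=True)
    # Exit code 1 only means warnings, which are normal for real-world HTML.
    # Exit code 2 means errors tidy could not repair.
    if result.returncode >= 2:
        raise RuntimeError("tidy failed: " + result.stderr)
    return target_path
```

Calling `html_file_to_xml("source.html", "file.xml")` would then correspond to the command line in this answer.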

2013-03-10 09:40:22 Jarekczek

There is also xmllint -html -xmlout –

I sometimes use it too. I think you should post it as a separate answer. – Jarekczek

Does it remove the HTML from the HTML file? – Alaa

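Keep in mind that HTML and XML are two different concepts in the family tree of markup languages. You cannot exactly replace HTML with XML. XML can be viewed as a generalized form of HTML, but even that is imprecise. You mainly use HTML to display data and XML to carry (or store) data.

This link is helpful: How to read HTML as XML?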

More here - difference between HTML and XML
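To make the distinction concrete, here is a small Python sketch (the helper name `is_well_formed_xml` is invented for this example). It feeds the same paragraph to a strict XML parser twice, once written as ordinary HTML with an unclosed `<br>` and once as XHTML with `<br/>`.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


html_fragment = "<p>first line<br>second line</p>"    # valid HTML, not XML
xhtml_fragment = "<p>first line<br/>second line</p>"  # XHTML style, also XML

print(is_well_formed_xml(html_fragment))   # False
print(is_well_formed_xml(xhtml_fragment))  # True
```

Only the second form parses, which is why a repair step (tidy, xmllint, or an HTML-aware DOM loader) is usually needed before HTML can be handled as XML.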

2012-05-06 20:11:28 Coffee

HTML __is__ XML. – bfontaine

@boudou. No, XHTML is XML; HTML is not. – Bruno

So what do you suggest? If I convert the HTML to XHTML first, can I then convert it to XML easily? –

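I did not find a ready-made way to convert (even bad) HTML into well-formed XML, so I started from the DOM loadHTML function. Over time several problems showed up, so I optimized the code and added patches to correct the side effects.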

    // Note: these functions use $this, so they are meant to live inside a class.
    function tryToXml($dom, $content) {
        if (!$content) return false;

        // well-formed XML content can be loaded directly as an XML node tree
        $fragment = $dom->createDocumentFragment();

        // appendXML() fails on an XML declaration, so strip it manually when present
        if (substr($content, 0, 5) == '<?xml') {
            $content = substr($content, strpos($content, '>') + 1);
            if (strpos($content, '<')) {
                $content = substr($content, strpos($content, '<'));
            }
        }

        // appendXML() adds an XML string straight into the node tree;
        // if it fails, fall back to htmlToXml() below to repair nasty HTML
        if (!@$fragment->appendXML($content)) {
            return $this->htmlToXml($dom, $content);
        }

        return $fragment;
    }



    // convert content into xml
    // dom is only needed to prepare the xml which will be returned
    function htmlToXml($dom, $content, $needEncoding = false, $bodyOnly = true) {

        // no xml when html is empty
        if (!$content) return false;

        // real content and possibly it needs encoding
        if ($needEncoding) {
            // no need to convert character encoding as loadHTML will respect the content-type (only)
            $content = '<meta http-equiv="Content-Type" content="text/html;charset=' . $this->encoding . '">' . $content;
        }

        // return a dom from the content
        $domInject = new DOMDocument("1.0", "UTF-8");
        $domInject->preserveWhiteSpace = false;
        $domInject->formatOutput = true;

        // html type
        try {
            @$domInject->loadHTML($content);
        } catch (Exception $e) {
            // do nothing and continue, as warnings are normal on nasty HTML content
        }
        // to check encoding: echo $dom->encoding
        $this->reworkDom($domInject);

        if ($bodyOnly) {
            $fragment = $dom->createDocumentFragment();

            // retrieve nodes within /html/body
            foreach ($domInject->documentElement->childNodes as $elementLevel1) {
                if ($elementLevel1->nodeName == 'body' and $elementLevel1->nodeType == XML_ELEMENT_NODE) {
                    foreach ($elementLevel1->childNodes as $elementInject) {
                        $fragment->insertBefore($dom->importNode($elementInject, true));
                    }
                }
            }
        } else {
            $fragment = $dom->importNode($domInject->documentElement, true);
        }

        return $fragment;
    }



    protected function reworkDom($node, $level = 0) {

        // start with the first child node to iterate
        $nodeChild = $node->firstChild;

        while ($nodeChild) {
            // remember the next sibling now, because the current child may be removed
            $nodeNextChild = $nodeChild->nextSibling;

            switch ($nodeChild->nodeType) {
                case XML_ELEMENT_NODE:
                    // iterate through children element nodes
                    $this->reworkDom($nodeChild, $level + 1);
                    break;
                case XML_TEXT_NODE:
                case XML_CDATA_SECTION_NODE:
                    // do nothing with text, cdata
                    break;
                case XML_COMMENT_NODE:
                    // replace the - sign inside comments so they follow the w3c guideline
                    $nodeChild->nodeValue = str_replace("-", "_", $nodeChild->nodeValue);
                    break;
                case XML_DOCUMENT_TYPE_NODE: // 10: needs to be removed
                case XML_PI_NODE: // 7: remove PI
                    $node->removeChild($nodeChild);
                    $nodeChild = null; // make null to test later
                    break;
                case XML_DOCUMENT_NODE:
                    // should not appear as it's always the root, just to be complete
                    // however generate exception!
                case XML_HTML_DOCUMENT_NODE:
                    // should not appear as it's always the root, just to be complete
                    // however generate exception!
                default:
                    throw new Exception("Engine: reworkDom type not declared [" . $nodeChild->nodeType . "]");
            }
            $nodeChild = $nodeNextChild;
        }
    }

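This also makes it possible to add several HTML fragments into a single XML document, which is what I needed myself. Typical usage looks like this: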

    $c = '<p>test<font>two</p>';
    $dom = new DOMDocument('1.0', 'UTF-8');

    $n = $dom->appendChild($dom->createElement('info')); // make a root element

    // $converter is assumed to be an instance of the class holding the methods above
    if ($valueXml = $converter->tryToXml($dom, $c)) {
        $n->appendChild($valueXml);
    }
    echo '<pre>' . htmlentities($dom->saveXML($n)) . '</pre>';

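In this example '<p>test<font>two</p>' is output as well-formed XML: "<info><p>test<font>two</font></p></info>". The info root tag is added because it also allows converting input such as 'onetwo', which is not XML on its own because it has no single root element. If your HTML is guaranteed to have one root element, the extra <info> root tag can be skipped.

With this I get really good XML out of unstructured and even broken HTML!

I hope this is reasonably clear and helps others who want to use it.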
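For readers outside PHP, the same idea (try a strict XML parse first, fall back to a lenient HTML parse, and wrap everything in one root element) can be sketched with the Python standard library alone. This is not the code above ported line by line: `LenientBuilder`, `to_xml` and `VOID_TAGS` are names invented for this illustration, and Python's `html.parser` repairs markup differently from libxml's loadHTML, so results on badly broken input may differ.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LenientBuilder(HTMLParser):
    """Rebuild an element tree from tag soup under a caller-supplied root."""

    def __init__(self, root_tag):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element(root_tag)
        self.stack = [self.root]  # currently open elements, root first

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) get an empty value.
        attrib = {name: (value or "") for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close everything opened since the matching start tag;
        # stray end tags that match nothing are ignored.
        open_tags = [element.tag for element in self.stack[1:]]
        if tag in open_tags:
            while self.stack.pop().tag != tag:
                pass

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text that follows a child element is its tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def to_xml(markup, root_tag="info"):
    """Return the markup wrapped in root_tag as a well-formed XML string."""
    try:
        # Fast path: the fragment is already well-formed XML.
        root = ET.fromstring(f"<{root_tag}>{markup}</{root_tag}>")
    except ET.ParseError:
        # Fallback: rebuild the tree leniently from the HTML.
        builder = LenientBuilder(root_tag)
        builder.feed(markup)
        builder.close()
        root = builder.root
    return ET.tostring(root, encoding="unicode")


print(to_xml("<p>test<font>two</p>"))
# <info><p>test<font>two</font></p></info>
```

The sketch does not sanitize attribute or tag names that are legal in HTML but illegal in XML, so it is a starting point rather than a complete converter.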

2013-08-17 19:22:24

Is this PHP code? –
