Problem parsing HTML with Nokogiri

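I have some HTML and want to get the content under the <body> element. However, whatever I try, once the HTML is parsed with Nokogiri, everything from the doctype and <head> also becomes part of the <body> element. When I retrieve the <body> element, I see the contents of the doctype as well as the <meta> and <script> tags inside it.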

My original HTML is:

<!DOCTYPE html \"about:legacy-compat\"> 
<html> 
    <head> 
     <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"> 
     <title>Some Title</title> 
     <meta name='viewport' id='helloviewport' content='initial-scale=1.0,maximum-scale=2.5' /> 
     <link rel='stylesheet' id='hello-stylesheet' type='text/css' href='some-4ac294cd125e1a062562aca1c83714ff.css'/> 
     <script id='hello-javascript' type='text/javascript' src='/hello/hello.js'></script> 
    </head> 
    <body marginwidth=\"6\" marginheight=\"6\" leftmargin=\"6\" topmargin=\"6\"> 
     <div class=\"hello-status\">Hello World</div> 
     <div valign=\"top\"></div> 
    </body> 
</html>

The approach I am using is:

parsed_html = Nokogiri::HTML(my_html) 
body_tag_content = parsed_html.at('body') 
puts body_tag_content.inner_html

What I get:

<p>about:legacy-compat\"&gt;</p> 
\n 
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"> 
\n 
<title>Some title</title> 
\n 
<meta name='viewport' id='helloviewport' content='initial-scale=1.0,maximum-scale=2.5' /> 
\n 
<link rel='stylesheet' id='hello-stylesheet' type='text/css' href='some-4ac294cd125e1a062562aca1c83714ff.css'/> 
\n<script id='hello-javascript' type='text/javascript' src='/hello/hello.js'></script> 
<div class=\"hello-status\">Hello World</div> 
\n 
<div valign=\"top\">\n\n</div>

What I expect:

<div class=\"hello-status\">Hello World</div> 
\n 
<div valign=\"top\">\n\n</div>

Any idea what is happening here?

Source

2015-12-15 tech_human

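I got your example working by cleaning up the original HTML first. I removed the "about:legacy-compat" string from the DOCTYPE, which seemed to be confusing Nokogiri: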

# clean up the junk in the doctype 
my_html.sub!("\"about:legacy-compat\"", "") 

# parse and get the body 
parsed_html = Nokogiri::HTML(my_html) 
body_tag_content = parsed_html.at('body') 

puts body_tag_content.inner_html 
# => "\n  <div class=\"hello-status\">Hello World</div>\n  <div valign=\"top\"></div>\n "

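In general, when you parse third-party data that may be malformed, such as HTML, you should clean it up first so that the parser does not choke and do unexpected things. You can run the HTML through a linter or a "tidy" tool to try to clean it automatically. When all else fails, you have to clean it by hand as above.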
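As a slightly more general form of the manual cleanup above, the whole doctype declaration can be replaced with the plain `<!DOCTYPE html>` before parsing, instead of removing one specific string. The sketch below is only an illustration: `normalize_doctype` and `DOCTYPE_PATTERN` are hypothetical names, not part of Nokogiri, and the pattern assumes the doctype is the first thing in the document and contains no `>` inside its quoted strings. It uses only core Ruby string methods.

```ruby
# Hypothetical helper (not part of Nokogiri): replace whatever doctype the
# document declares with the plain HTML5 doctype before parsing.
# Assumes the doctype comes first and has no ">" inside its quoted parts.
DOCTYPE_PATTERN = /\A\s*<!DOCTYPE[^>]*>/i

def normalize_doctype(html)
  # String#sub returns a new string, so the caller's original is not modified.
  # Input without a leading doctype is returned unchanged.
  html.sub(DOCTYPE_PATTERN, "<!DOCTYPE html>")
end

my_html = <<~HTML
  <!DOCTYPE html "about:legacy-compat">
  <html>
    <head><title>Some Title</title></head>
    <body><div class="hello-status">Hello World</div></body>
  </html>
HTML

cleaned = normalize_doctype(my_html)
puts cleaned.lines.first
# prints: <!DOCTYPE html>
```

The cleaned string can then be passed to `Nokogiri::HTML` as in the answer above. For reference, the HTML specification writes the legacy-compat doctype as `<!DOCTYPE html SYSTEM "about:legacy-compat">`; the input here omits the `SYSTEM` keyword, which may be why the parser treats the rest of the declaration as text.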

HTML tidy/cleaning in Ruby 1.9

Source

2015-12-15 17:09:29 DiegoSalazar

Yes, I figured out that "about:legacy-compat" was the problem, but I don't know why the HTML parser doesn't handle it. Removing it before parsing works.
