2010-01-03

Generating a raw-string meta description of an HTML page from an HTML string in Ruby?

I am looking for a way to convert text like this:


" <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n <html xml:lang=\"en\" lang=\"en\" xmlns=\"http://www.w3.org/1999/xhtml\">\n \t<head>\n \t\t<title>My Page Title</title>\n \t\t<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=ISO- 8859-1\">\n <style type=\"text/css\" media=\"screen\"> \n \t\tblockquote\n
{\n \tfont-style: italic;\n }\n cite\n {\n
\ttext-align: right;\n \tfont-style: normal;\n }\n .author\n {\n \ttext-align: right;\n \tmargin-right: 80px;\n
}\n </style>\n \t</head>\n \t<body>\n \t\t<h1>My Page Title</h1>\n<h3>Production Manager</h3>\n<blockquote>\n<p>&#8220;I want my passion for business plan and my pride in my work to show in every step of our company: from the labels and papers, to our relationships with our customers, to the enjoyment of each bottle of My Company business plan. As we expand our production, my dream is to plant a company of my own to specialize in good business, my personal favorite varietal.&#8221;</p>\n</blockquote>\n<p class=\"author\"><cite>- John Smith</cite></p>\n<p>Born and raised on the north coast of California, John Smith always felt a deep connection to this......"

into this:


My Page Title. Production Manager. I want my passion for business plan and my pride in my work to show in every step of our company: from the labels and papers, to our relationships with our customers, to the enjoyment of each bottle of My Company business plan. As we expand our production, my dream is to plant a company of my own to specialize in good business, my personal favorite varietal. 

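That is just extracting all of the text before the first period. But it has to:

  • Strip HTML tags
  • Replace \n with ". " (and runs such as \n\n\n with a single ". ")
  • Replace \t with " "
  • Replace ".\s+" with ". "
  • Unescape entities (for example, turn &#8220; into a quote character)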
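
The steps listed above can be sketched with nothing but the Ruby standard library. This is a minimal illustration, not a robust HTML parser: `plain_text_excerpt` is a hypothetical helper name, dropping `<style>`/`<script>` bodies is an extra step that the list does not mention (assumed here so that CSS does not leak into the excerpt), and `CGI.unescapeHTML` only decodes numeric entities and a handful of named ones (`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`). A real parser such as Nokogiri would cope better with comments and malformed markup.

```ruby
require "cgi" # stdlib; provides CGI.unescapeHTML

# Hypothetical helper: reduce an HTML string to a one-line plain-text excerpt.
def plain_text_excerpt(html)
  # Assumption (not in the list): drop <style>/<script> blocks including their bodies.
  text = html.gsub(%r{<(style|script)\b.*?</\1>}mi, " ")
  text = text.gsub(/<[^>]*>/, "")             # strip remaining tags (regex-based, naive)
  text = CGI.unescapeHTML(text)               # &#8220; -> curly quote, &amp; -> &
  text = text.tr("\t", " ")                   # tabs -> spaces
  text = text.gsub(/\s*\n\s*/, "\n").strip    # collapse blank lines and trim around newlines
  text = text.gsub(/\.?\n/, ". ")             # each line break -> ". " without doubling a period
  text = text.gsub(/\.\s+/, ". ")             # exactly one space after a period
  text.squeeze(" ")                           # collapse runs of spaces
end

puts plain_text_excerpt("<h1>Title</h1>\n\n\n<p>One\ttwo &amp; three.   Four</p>\n")
# prints: Title. One two & three. Four
```

Cutting the result down to a description-sized excerpt (for example, at a sentence boundary) is left out, since the exact cut-off rule depends on your needs.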

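After starting to write something like this, I figured it has probably already been solved more thoroughly somewhere. Does anyone have a nice one-liner way to create a plain-text excerpt like this from an HTML string (in Ruby)?

I use Nokogiri for full-featured HTML parsing, but it seems like it would be just as hard to do this with it.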

Answers

0

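Hmm. That seems like quite a lot of functionality for a one-liner. If you just want to parse an HTML page and display it as plain text, I suggest using w3m: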

string = "..." # your string 

IO.popen("w3m -T text/html", "r+") do |pipe| 
    pipe.write string 
    pipe.close_write 
    puts pipe.read 
end 

This gives me:

 
My Page Title 

Production Manager 

    “I want my passion for business plan and my pride in my work to show in 
    every step of our company: from the labels and papers, to our relationships 
    with our customers, to the enjoyment of each bottle of My Company business 
    plan. As we expand our production, my dream is to plant a company of my own 
    to specialize in good business, my personal favorite varietal.” 

- John Smith 

Born and raised on the north coast of California, John Smith always felt a deep 
connection to this...... 

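For the rest of the substitutions, I suggest applying regular-expression replacements before or after this step, depending on your exact needs.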
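
As one possible form of that post-processing, the sketch below squashes a plain-text dump like the one shown above into a single line. It assumes that blank lines separate blocks and that hard-wrapped lines inside a block belong together; `squash_text_dump` is a hypothetical helper name, and the rule "append a period when a block has no closing punctuation" is an illustrative choice rather than anything w3m provides.

```ruby
# Hypothetical helper: collapse a wrapped plain-text dump into one line.
def squash_text_dump(dump)
  blocks = dump.strip.split(/\n\s*\n/)          # blank lines separate blocks
  sentences = blocks.map do |block|
    line = block.gsub(/\s+/, " ").strip         # re-join hard-wrapped lines
    # Leave blocks alone when they already end a sentence (optionally inside a quote).
    line.match?(/[.!?]["'”’]?\z/) ? line : "#{line}."
  end
  sentences.reject { |s| s == "." }.join(" ")
end

dump = "\nMy Page Title \n\nProduction Manager \n\n" \
       "    I want my pride to show in \n    every step. As we expand, \n" \
       "    my dream grows. \n\n- John Smith \n"
puts squash_text_dump(dump)
# prints: My Page Title. Production Manager. I want my pride to show in every step. As we expand, my dream grows. - John Smith.
```

The string passed in would be whatever `pipe.read` returns in the snippet above, instead of printing it directly.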

0

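Does it have to be in Ruby? I could write it in PHP:

$text = '<html> ...'; 
// Strip tags and decode entities, then: newline runs -> '. ', tab/space runs -> ' ', double quotes -> single quotes.
$result = preg_replace(array('/\\n+/', '/[\\t ]+/', '/"/'), array('. ', ' ', '\''), html_entity_decode(strip_tags($text)));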