Generate a raw-string meta description of an HTML page from an HTML string in Ruby?

I'm looking for a way to convert text like this:
" <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n <html xml:lang=\"en\" lang=\"en\" xmlns=\"http://www.w3.org/1999/xhtml\">\n \t<head>\n \t\t<title>My Page Title</title>\n \t\t<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=ISO- 8859-1\">\n <style type=\"text/css\" media=\"screen\"> \n \t\tblockquote\n
{\n \tfont-style: italic;\n }\n cite\n {\n
\ttext-align: right;\n \tfont-style: normal;\n }\n .author\n {\n \ttext-align: right;\n \tmargin-right: 80px;\n
}\n </style>\n \t</head>\n \t<body>\n \t\t<h1>My Page Title</h1>\n<h3>Production Manager</h3>\n<blockquote>\n<p>“I want my passion for business plan and my pride in my work to show in every step of our company: from the labels and papers, to our relationships with our customers, to the enjoyment of each bottle of My Company business plan. As we expand our production, my dream is to plant a company of my own to specialize in good business, my personal favorite varietal.”</p>\n</blockquote>\n<p class=\"author\"><cite>- John Smith</cite></p>\n<p>Born and raised on the north coast of California, John Smith always felt a deep connection to this......"
into this:
My Page Title. Production Manager. I want my passion for business plan and my pride in my work to show in every step of our company: from the labels and papers, to our relationships with our customers, to the enjoyment of each bottle of My Company business plan. As we expand our production, my dream is to plant a company of my own to specialize in good business, my personal favorite varietal.
This just extracts all of the text before the first period. But it has to:
- strip HTML tags
- replace \n with " " (and runs like \n\n\n with " ")
- replace \t with " "
- replace "." followed by \s+ with ". "
- unescape entities and replace things like “ and ”
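
For reference, the steps listed above can be sketched with the standard library alone. This is a rough, regex-based sketch and not a real HTML parser: the helper name `plain_text_excerpt` is hypothetical, dropping `<style>`/`<script>` contents is an added assumption, and `CGI.unescapeHTML` only decodes the basic named entities plus numeric references, so a named entity such as `&ldquo;` would pass through untouched. It also does not insert a ". " between headings the way the desired output does.

```ruby
require 'cgi'

# Hypothetical helper (the name is an assumption, not from any library).
# Turns an HTML string into a single line of plain text.
def plain_text_excerpt(html)
  # Assumption: CSS and JavaScript bodies are not wanted in the excerpt.
  text = html.gsub(%r{<(style|script)\b[^>]*>.*?</\1>}mi, ' ')
  # Strip the remaining tags (this also removes the doctype declaration).
  text = text.gsub(/<[^>]*>/, ' ')
  # Decode &amp;, &lt;, &gt;, &quot;, &#39; and numeric references like &#8220;.
  text = CGI.unescapeHTML(text)
  # Map typographic double quotes to straight ones.
  text = text.tr('“”', '"')
  # Collapse newlines, tabs and runs of whitespace into one space;
  # this also turns "." followed by \s+ into ". ".
  text.gsub(/\s+/, ' ').strip
end

puts plain_text_excerpt("<p>Tom &amp; Jerry.\n\n\nThe   end.</p>")
# prints: Tom & Jerry. The end.
```

If Nokogiri is already loaded, the two tag-stripping regexes and the entity step could probably be replaced by parsing the string and asking the relevant node for its text, keeping only the whitespace cleanup from this sketch.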
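After starting on something like this myself, I figured it has probably already been solved more thoroughly somewhere. Does anyone have a nice one-liner for creating a plain-text excerpt like this from an HTML string (in Ruby)?
I use Nokogiri for full-featured HTML parsing, but it looks like this would be just as hard with it.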