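Convert accented characters and HTML entities to UTF-8?
I'm working on a project that lets me download stories from Portkey.org to read on my Kindle, and I can't for the life of me figure out how to properly encode/parse the HTML scraped from the site. I'm using simple_html_dom to scrape it, and I parse the innertext of the main element that contains the story.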
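So, here is what I'd like to accomplish:
- Grab the story HTML from Portkey.org
- Convert all HTML entities on the page to regular characters for reading (entities such as &rdquo; to ”, &ldquo; to “, &hellip; to …, etc.)
- Accented characters, or characters from other languages (Korean, Japanese, Chinese, etc.), should be left as-is.
- Use tidy to repair the HTML and save it as an .html file.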
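The entity-conversion step in the list above could be sketched roughly as follows, assuming the input is already valid UTF-8 (which the question does not confirm). The helper name decodeEntities and the sample string are hypothetical; html_entity_decode and the ENT_* flags are standard PHP.

```php
<?php
// Hypothetical helper: decode named and numeric HTML entities into UTF-8
// characters. Characters that are already literal UTF-8 pass through unchanged.
function decodeEntities(string $html): string
{
    // ENT_QUOTES also decodes &quot; and &#039;; ENT_HTML5 selects the HTML5 entity table.
    return html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Made-up sample mixing entities with literal accented and CJK characters.
$sample = '&ldquo;fiancé&rdquo; &hellip; 한국어 日本語 中文';
echo decodeEntities($sample), PHP_EOL;
```

Note that html_entity_decode also turns &lt;, &gt; and &amp; into literal characters, so running it over markup that will be saved as HTML again can change the document structure.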
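Everything I have tried so far results in one of the following:
- Diamonds with question marks inside them where the accented characters should be
- Broken UTF-8 characters where the quotes and ellipses should be, while the accented characters display correctly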
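The two symptoms point in different directions, and inspecting the raw bytes can help tell them apart. The sketch below uses made-up byte strings rather than data from Portkey.org, the helper name describeBytes is hypothetical, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: show a byte string in hex and say whether it is valid UTF-8.
function describeBytes(string $bytes): string
{
    $valid = mb_check_encoding($bytes, 'UTF-8') ? 'valid UTF-8' : 'NOT valid UTF-8';
    return bin2hex($bytes) . ' => ' . $valid;
}

// "é" as a single ISO-8859-1 / Windows-1252 byte. In a page declared as UTF-8,
// an invalid byte like this is typically rendered as the replacement character
// (the diamond with a question mark).
echo describeBytes("\xE9"), PHP_EOL;

// "é" as the two-byte UTF-8 sequence.
echo describeBytes("\xC3\xA9"), PHP_EOL;

// "…" as the three-byte UTF-8 sequence. In a page interpreted as a single-byte
// encoding, these bytes show up as three separate junk characters.
echo describeBytes("\xE2\x80\xA6"), PHP_EOL;
```

If the scraped text fails the UTF-8 check, the page bytes and the decoded entities are probably in different encodings.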
Sample from the story HTML:
<p> Wel [snip] your emotions&hellip;but most impor [snip] ng fiancé </p>
Edit:
html_entity_decode results in the following output:
Wel [snip] your emotions…but most impor [snip] ng fiancé
As you can see, the accented character is correct, but now the … is displayed incorrectly.
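One possible explanation for this pattern, offered as an assumption rather than a diagnosis, is that the decoded entity is UTF-8 while the output is being read as a single-byte encoding. The sketch below simulates that misreading with mb_convert_encoding (mbstring assumed); the variable names are illustrative only.

```php
<?php
// Decode the entity to its UTF-8 byte sequence (hex e280a6).
$ellipsis = html_entity_decode('&hellip;', ENT_QUOTES, 'UTF-8');
echo bin2hex($ellipsis), PHP_EOL;

// Simulate a viewer that reads those three bytes as Windows-1252:
// each byte is treated as its own character and re-expressed in UTF-8.
$misread = mb_convert_encoding($ellipsis, 'UTF-8', 'Windows-1252');
echo $misread, PHP_EOL;
```

Under the same assumption, a raw single-byte "é" in the page would look correct, which matches the output described above.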
Edit 2:
get_html_translation_table(HTML_ENTITIES) results in:
array(252) { ["""]=> string(6) """ ["&"]=> string(5) "&" ["<"]=> string(4) "<" [">"]=> string(4) ">" [" "]=> string(6) " " ["¡"]=> string(7) "¡" ["¢"]=> string(6) "¢" ["£"]=> string(7) "£" ["¤"]=> string(8) "¤" ["Â¥"]=> string(5) "¥" ["¦"]=> string(8) "¦" ["§"]=> string(6) "§" ["¨"]=> string(5) "¨" ["©"]=> string(6) "©" ["ª"]=> string(6) "ª" ["«"]=> string(7) "«" ["¬"]=> string(5) "¬" ["Â"]=> string(5) "" ["®"]=> string(5) "®" ["¯"]=> string(6) "¯" ["°"]=> string(5) "°" ["±"]=> string(8) "±" ["²"]=> string(6) "²" ["³"]=> string(6) "³" ["´"]=> string(7) "´" ["µ"]=> string(7) "µ" ["¶"]=> string(6) "¶" ["·"]=> string(8) "·" ["¸"]=> string(7) "¸" ["¹"]=> string(6) "¹" ["º"]=> string(6) "º" ["»"]=> string(7) "»" ["¼"]=> string(8) "¼" ["½"]=> string(8) "½" ["¾"]=> string(8) "¾" ["¿"]=> string(8) "¿" ["À"]=> string(8) "À" ["Ã"]=> string(8) "Á" ["Â"]=> string(7) "Â" ["Ã"]=> string(8) "Ã" ["Ä"]=> string(6) "Ä" ["Ã…"]=> string(7) "Å" ["Æ"]=> string(7) "Æ" ["Ç"]=> string(8) "Ç" ["È"]=> string(8) "È" ["É"]=> string(8) "É" ["Ê"]=> string(7) "Ê" ["Ë"]=> string(6) "Ë" ["ÃŒ"]=> string(8) "Ì" ["Ã"]=> string(8) "Í" ["ÃŽ"]=> string(7) "Î" ["Ã"]=> string(6) "Ï" ["Ã"]=> string(5) "Ð" ["Ñ"]=> string(8) "Ñ" ["Ã’"]=> string(8) "Ò" ["Ã「"]=> string(8) "Ó" ["Ã」"]=> string(7) "Ô" ["Õ"]=> string(8) "Õ" ["Ö"]=> string(6) "Ö" ["×"]=> string(7) "×" ["Ø"]=> string(8) "Ø" ["Ù"]=> string(8) "Ù" ["Ú"]=> string(8) "Ú" ["Û"]=> string(7) "Û" ["Ãœ"]=> string(6) "Ü" ["Ã"]=> string(8) "Ý" ["Þ"]=> string(7) "Þ" ["ß"]=> string(7) "ß" ["à "]=> string(8) "à" ["á"]=> string(8) "á" ["â"]=> string(7) "â" ["ã"]=> string(8) "ã" ["ä"]=> string(6) "ä" ["Ã¥"]=> string(7) "å" ["æ"]=> string(7) "æ" ["ç"]=> string(8) "ç" ["è"]=> string(8) "è" ["é"]=> string(8) "é" ["ê"]=> string(7) "ê" ["ë"]=> string(6) "ë" ["ì"]=> string(8) "ì" ["Ã"]=> string(8) "í" ["î"]=> string(7) "î" ["ï"]=> string(6) "ï" ["ð"]=> string(5) "ð" ["ñ"]=> string(8) "ñ" ["ò"]=> string(8) "ò" ["ó"]=> string(8) "ó" ["ô"]=> string(7) "ô" ["õ"]=> string(8) 
"õ" ["ö"]=> string(6) "ö" ["÷"]=> string(8) "÷" ["ø"]=> string(8) "ø" ["ù"]=> string(8) "ù" ["ú"]=> string(8) "ú" ["û"]=> string(7) "û" ["ü"]=> string(6) "ü" ["ý"]=> string(8) "ý" ["þ"]=> string(7) "þ" ["ÿ"]=> string(6) "ÿ" ["Å’"]=> string(7) "Œ" ["Å「"]=> string(7) "œ" ["Å "]=> string(8) "Š" ["Å¡"]=> string(8) "š" ["Ÿ"]=> string(6) "Ÿ" ["Æ’"]=> string(6) "ƒ" ["ˆ"]=> string(6) "ˆ" ["Ëœ"]=> string(7) "˜" ["Α"]=> string(7) "Α" ["Î’"]=> string(6) "Β" ["Î「"]=> string(7) "Γ" ["Î」"]=> string(7) "Δ" ["Ε"]=> string(9) "Ε" ["Ζ"]=> string(6) "Ζ" ["Η"]=> string(5) "Η" ["Θ"]=> string(7) "Θ" ["Ι"]=> string(6) "Ι" ["Κ"]=> string(7) "Κ" ["Λ"]=> string(8) "Λ" ["Îœ"]=> string(4) "Μ" ["Î"]=> string(4) "Ν" ["Ξ"]=> string(4) "Ξ" ["Ο"]=> string(9) "Ο" ["Î "]=> string(4) "Π" ["Ρ"]=> string(5) "Ρ" ["Σ"]=> string(7) "Σ" ["Τ"]=> string(5) "Τ" ["Î¥"]=> string(9) "Υ" ["Φ"]=> string(5) "Φ" ["Χ"]=> string(5) "Χ" ["Ψ"]=> string(5) "Ψ" ["Ω"]=> string(7) "Ω" ["α"]=> string(7) "α" ["β"]=> string(6) "β" ["γ"]=> string(7) "γ" ["δ"]=> string(7) "δ" ["ε"]=> string(9) "ε" ["ζ"]=> string(6) "ζ" ["η"]=> string(5) "η" ["θ"]=> string(7) "θ" ["ι"]=> string(6) "ι" ["κ"]=> string(7) "κ" ["λ"]=> string(8) "λ" ["μ"]=> string(4) "μ" ["ν"]=> string(4) "ν" ["ξ"]=> string(4) "ξ" ["ο"]=> string(9) "ο" ["Ï€"]=> string(4) "π" ["Ï"]=> string(5) "ρ" ["Ï‚"]=> string(8) "ς" ["σ"]=> string(7) "σ" ["Ï„"]=> string(5) "τ" ["Ï…"]=> string(9) "υ" ["φ"]=> string(5) "φ" ["χ"]=> string(5) "χ" ["ψ"]=> string(5) "ψ" ["ω"]=> string(7) "ω" ["Ï‘"]=> string(10) "ϑ" ["Ï’"]=> string(7) "ϒ" ["Ï–"]=> string(5) "ϖ" [" "]=> string(6) " " [" "]=> string(6) " " [" "]=> string(8) " " ["‌"]=> string(6) "" ["â€"]=> string(5) "" ["‎"]=> string(5) "" ["â€"]=> string(5) "" ["â€「"]=> string(7) "–" ["â€」"]=> string(7) "—" ["‘"]=> string(7) "‘" ["’"]=> string(7) "’" ["‚"]=> string(7) "‚" ["“"]=> string(7) "「" ["â€"]=> string(7) "」" ["„"]=> string(7) "„" ["†"]=> string(8) "†" ["‡"]=> string(8) "‡" ["•"]=> string(6) "•" ["…"]=> string(8) "…" ["‰"]=> 
string(8) "‰" ["′"]=> string(7) "′" ["″"]=> string(7) "″" ["‹"]=> string(8) "‹" ["›"]=> string(8) "›" ["‾"]=> string(7) "‾" ["â„"]=> string(7) "⁄" ["€"]=> string(6) "€" ["â„‘"]=> string(7) "ℑ" ["℘"]=> string(8) "℘" ["â„œ"]=> string(6) "ℜ" ["â„¢"]=> string(7) "™" ["ℵ"]=> string(9) "ℵ" ["â†"]=> string(6) "←" ["↑"]=> string(6) "↑" ["→"]=> string(6) "→" ["â†「"]=> string(6) "↓" ["â†」"]=> string(6) "↔" ["↵"]=> string(7) "↵" ["â‡"]=> string(6) "⇐" ["⇑"]=> string(6) "⇑" ["⇒"]=> string(6) "⇒" ["â‡「"]=> string(6) "⇓" ["â‡」"]=> string(6) "⇔" ["∀"]=> string(8) "∀" ["∂"]=> string(6) "∂" ["∃"]=> string(7) "∃" ["∅"]=> string(7) "∅" ["∇"]=> string(7) "∇" ["∈"]=> string(6) "∈" ["∉"]=> string(7) "∉" ["∋"]=> string(4) "∋" ["âˆ"]=> string(6) "∏" ["∑"]=> string(5) "∑" ["−"]=> string(7) "−" ["∗"]=> string(8) "∗" ["√"]=> string(7) "√" ["âˆ"]=> string(6) "∝" ["∞"]=> string(7) "∞" ["∠"]=> string(5) "∠" ["∧"]=> string(5) "∧" ["∨"]=> string(4) "∨" ["∩"]=> string(5) "∩" ["∪"]=> string(5) "∪" ["∫"]=> string(5) "∫" ["∴"]=> string(8) "∴" ["∼"]=> string(5) "∼" ["≅"]=> string(6) "≅" ["≈"]=> string(7) "≈" ["≠"]=> string(4) "≠" ["≡"]=> string(7) "≡" ["≤"]=> string(4) "≤" ["≥"]=> string(4) "≥" ["⊂"]=> string(5) "⊂" ["⊃"]=> string(5) "⊃" ["⊄"]=> string(6) "⊄" ["⊆"]=> string(6) "⊆" ["⊇"]=> string(6) "⊇" ["⊕"]=> string(7) "⊕" ["⊗"]=> string(8) "⊗" ["⊥"]=> string(6) "⊥" ["â‹…"]=> string(6) "⋅" ["⌈"]=> string(7) "⌈" ["⌉"]=> string(7) "⌉" ["⌊"]=> string(8) "⌊" ["⌋"]=> string(8) "⌋" ["〈"]=> string(6) "⟨" ["〉"]=> string(6) "⟩" ["â—Š"]=> string(5) "◊" ["â™ "]=> string(8) "♠" ["♣"]=> string(7) "♣" ["♥"]=> string(8) "♥" ["♦"]=> string(7) "♦" }
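The keys in the dump above look like multi-byte sequences shown under a single-byte encoding, although that is an inference from the output and not something the question states. A sketch of requesting the table with an explicit encoding and checking its keys follows; the third parameter of get_html_translation_table is available as of PHP 5.3.4, and mbstring is assumed.

```php
<?php
// Request the HTML 4.01 entity table with keys encoded as UTF-8.
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Collect keys that are not exactly one valid UTF-8 character.
$badKeys = array_filter(
    array_keys($table),
    fn ($key) => !mb_check_encoding((string) $key, 'UTF-8')
        || mb_strlen((string) $key, 'UTF-8') !== 1
);

printf("%d entries, %d malformed keys\n", count($table), count($badKeys));

// Look up two characters by code point: U+2026 (ellipsis) and U+00E9 (e acute).
echo $table["\u{2026}"], ' ', $table["\u{00E9}"], PHP_EOL;
```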
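Edit 3:
In the interest of full disclosure, here is the test file I set up to figure this out. Currently, all entities display correctly, but accented characters display as �.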
<?php
// Declare the response as UTF-8.
header('Content-Type: text/html; charset=UTF-8');
require_once('_RESOURCES/simple_html_dom.php');

$url = 'http://fanfiction.portkey.org/index.php?act=read&storyid=1585&chapterid=&agree=1';

// Repair the markup with tidy and return only the body content.
function tidyHTML($html) {
    $tidy = new tidy;
    $config = array('indent' => true, 'output-xhtml' => false, 'wrap' => 200, 'clean' => false, 'show-body-only' => true);
    $tidy->parseString($html, $config, 'utf8');
    $tidy->cleanRepair();
    return $tidy; // the tidy object is converted to a string when echoed
}

// Collapse whitespace between tags, merge adjacent <b>/<i> runs, drop <br>.
function filter($html) {
    $html = preg_replace('~>\s+<~', '><', $html);
    $html = preg_replace('/<\/b>\s?<b>/', '', $html);
    $html = preg_replace('/<\/i>\s?<i>/', '', $html);
    return str_replace('<br>', '', $html);
}

// Fetch the page and isolate the table cell that holds the story.
$page_html = file_get_html($url);
$chapter_html = $page_html->find('td[class="story"]', 0);

// Blank out the <center> blocks inside the story cell.
foreach ($chapter_html->find('center') as $node) { $node->outertext = ''; }

// Decode entities to UTF-8; bytes that are already literal characters in the page are not converted here.
$entities = html_entity_decode($chapter_html->innertext, ENT_QUOTES, 'UTF-8');
echo tidyHTML(filter($entities));
// var_dump(get_html_translation_table(HTML_ENTITIES));
?>
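One way to get both entities and accented characters right might be to normalise the scraped bytes to UTF-8 before decoding entities. This is only a sketch: it assumes the site serves a single-byte encoding such as Windows-1252, which the question does not confirm, and the helper name toUtf8Text and its default source encoding are hypothetical. It also relies on the mbstring extension.

```php
<?php
// Hypothetical helper: make sure the bytes are UTF-8, then decode entities.
function toUtf8Text(string $raw, string $assumedSource = 'Windows-1252'): string
{
    // Convert only when the input is not already valid UTF-8,
    // so correctly encoded pages are not double-encoded.
    if (!mb_check_encoding($raw, 'UTF-8')) {
        $raw = mb_convert_encoding($raw, 'UTF-8', $assumedSource);
    }
    return html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Made-up input: a Windows-1252 "é" byte next to a named entity.
$scraped = "emotions&hellip;but fianc\xE9";
echo toUtf8Text($scraped), PHP_EOL;
```

In the test file above, such a helper would take $chapter_html->innertext in place of the direct html_entity_decode call. The validity check is a heuristic, so the actual encoding declared by the server is worth confirming first.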
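Why do you want to convert entities to characters? It serves no purpose here, since entities actually work *more safely* in HTML (especially if you don't know how to declare the character encoding). Also, your example contains no entities. It has "é" as such, and no ellipsis "...". –
I'm converting the entities because I will be outputting the story text in various formats (including plain text). Also, the example does contain &hellip; as well as é. I will edit the example to focus on the problem characters. – zuddsy
If it is in HTML format, generating plain text from it is a much broader problem than just converting entity references to characters. That can be dealt with when the plain text is actually generated. –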
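For the plain-text case raised in the comments, a rough sketch might look like the following. It handles only paragraph ends and line breaks, so it is far from a complete HTML-to-text conversion; the helper name htmlToPlainText and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: reduce simple HTML to plain UTF-8 text.
function htmlToPlainText(string $html): string
{
    // Turn line breaks and paragraph ends into newlines before removing tags.
    $html = preg_replace('~<br\s*/?>|</p>~i', "\n", $html);
    // Strip tags first, then decode, so a decoded "&lt;" is never mistaken for markup.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim($text);
}

// Made-up input with an escaped "<", a named entity and a literal accented character.
echo htmlToPlainText('<p>a &lt; b &hellip;</p><p>fiancé</p>'), PHP_EOL;
```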