2014-12-29 62 views

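Convert accented characters and HTML entities to UTF-8?

I'm working on a project that lets me download stories from Portkey.org to read on my Kindle, and I can't for the life of me figure out how to properly encode/parse the HTML scraped from the site. I'm using simple_html_dom to scrape it, and I parse the innertext of the main element that holds the story.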

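So, what I'm trying to accomplish is the following:

  1. Grab the HTML of a story from Portkey.org.
  2. Convert all HTML entities on the page to regular characters for reading (entities such as &rdquo;, &ldquo;, &hellip;, etc.).
  3. Accented characters, and characters from other languages (Korean, Japanese, Chinese, etc.), should be left as they are.
  4. Use tidy to repair the HTML and save it as an .html file.

Everything I've tried so far gives one of the following results:

  • Diamonds with question marks inside them where the accented characters should be
  • Broken UTF-8 characters where the quotes and ellipses should be, while the accented characters display correctly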

Sample of the story HTML:

<p> Wel [snip] your emotions&hellip;but most impor [snip] ng fiancé </p> 

Edit

html_entity_decode produces the following output:

Wel [snip] your emotions…but most impor [snip] ng fiancé 

As you can see, the accented characters are correct, but now &hellip; displays incorrectly.

Edit 2:

Result of get_html_translation_table(HTML_ENTITIES):

array(252) { ["""]=> string(6) """ ["&"]=> string(5) "&" ["<"]=> string(4) "<" [">"]=> string(4) ">" [" "]=> string(6) " " ["¡"]=> string(7) "¡" ["¢"]=> string(6) "¢" ["£"]=> string(7) "£" ["¤"]=> string(8) "¤" ["Â¥"]=> string(5) "¥" ["¦"]=> string(8) "¦" ["§"]=> string(6) "§" ["¨"]=> string(5) "¨" ["©"]=> string(6) "©" ["ª"]=> string(6) "ª" ["«"]=> string(7) "«" ["¬"]=> string(5) "¬" ["­"]=> string(5) "­" ["®"]=> string(5) "®" ["¯"]=> string(6) "¯" ["°"]=> string(5) "°" ["±"]=> string(8) "±" ["²"]=> string(6) "²" ["³"]=> string(6) "³" ["´"]=> string(7) "´" ["µ"]=> string(7) "µ" ["¶"]=> string(6) "¶" ["·"]=> string(8) "·" ["¸"]=> string(7) "¸" ["¹"]=> string(6) "¹" ["º"]=> string(6) "º" ["»"]=> string(7) "»" ["¼"]=> string(8) "¼" ["½"]=> string(8) "½" ["¾"]=> string(8) "¾" ["¿"]=> string(8) "¿" ["À"]=> string(8) "À" ["Ã"]=> string(8) "Á" ["Â"]=> string(7) "Â" ["Ã"]=> string(8) "Ã" ["Ä"]=> string(6) "Ä" ["Ã…"]=> string(7) "Å" ["Æ"]=> string(7) "Æ" ["Ç"]=> string(8) "Ç" ["È"]=> string(8) "È" ["É"]=> string(8) "É" ["Ê"]=> string(7) "Ê" ["Ë"]=> string(6) "Ë" ["ÃŒ"]=> string(8) "Ì" ["Ã"]=> string(8) "Í" ["ÃŽ"]=> string(7) "Î" ["Ã"]=> string(6) "Ï" ["Ã"]=> string(5) "Ð" ["Ñ"]=> string(8) "Ñ" ["Ã’"]=> string(8) "Ò" ["Ã「"]=> string(8) "Ó" ["Ã」"]=> string(7) "Ô" ["Õ"]=> string(8) "Õ" ["Ö"]=> string(6) "Ö" ["×"]=> string(7) "×" ["Ø"]=> string(8) "Ø" ["Ù"]=> string(8) "Ù" ["Ú"]=> string(8) "Ú" ["Û"]=> string(7) "Û" ["Ãœ"]=> string(6) "Ü" ["Ã"]=> string(8) "Ý" ["Þ"]=> string(7) "Þ" ["ß"]=> string(7) "ß" ["à "]=> string(8) "à" ["á"]=> string(8) "á" ["â"]=> string(7) "â" ["ã"]=> string(8) "ã" ["ä"]=> string(6) "ä" ["Ã¥"]=> string(7) "å" ["æ"]=> string(7) "æ" ["ç"]=> string(8) "ç" ["è"]=> string(8) "è" ["é"]=> string(8) "é" ["ê"]=> string(7) "ê" ["ë"]=> string(6) "ë" ["ì"]=> string(8) "ì" ["í"]=> string(8) "í" ["î"]=> string(7) "î" ["ï"]=> string(6) "ï" ["ð"]=> string(5) "ð" ["ñ"]=> string(8) "ñ" ["ò"]=> string(8) "ò" ["ó"]=> string(8) "ó" ["ô"]=> string(7) "ô" ["õ"]=> string(8) 
"õ" ["ö"]=> string(6) "ö" ["÷"]=> string(8) "÷" ["ø"]=> string(8) "ø" ["ù"]=> string(8) "ù" ["ú"]=> string(8) "ú" ["û"]=> string(7) "û" ["ü"]=> string(6) "ü" ["ý"]=> string(8) "ý" ["þ"]=> string(7) "þ" ["ÿ"]=> string(6) "ÿ" ["Å’"]=> string(7) "Œ" ["Å「"]=> string(7) "œ" ["Å "]=> string(8) "Š" ["Å¡"]=> string(8) "š" ["Ÿ"]=> string(6) "Ÿ" ["Æ’"]=> string(6) "ƒ" ["ˆ"]=> string(6) "ˆ" ["Ëœ"]=> string(7) "˜" ["Α"]=> string(7) "Α" ["Î’"]=> string(6) "Β" ["Î「"]=> string(7) "Γ" ["Î」"]=> string(7) "Δ" ["Ε"]=> string(9) "Ε" ["Ζ"]=> string(6) "Ζ" ["Η"]=> string(5) "Η" ["Θ"]=> string(7) "Θ" ["Ι"]=> string(6) "Ι" ["Κ"]=> string(7) "Κ" ["Λ"]=> string(8) "Λ" ["Îœ"]=> string(4) "Μ" ["Î"]=> string(4) "Ν" ["Ξ"]=> string(4) "Ξ" ["Ο"]=> string(9) "Ο" ["Î "]=> string(4) "Π" ["Ρ"]=> string(5) "Ρ" ["Σ"]=> string(7) "Σ" ["Τ"]=> string(5) "Τ" ["Î¥"]=> string(9) "Υ" ["Φ"]=> string(5) "Φ" ["Χ"]=> string(5) "Χ" ["Ψ"]=> string(5) "Ψ" ["Ω"]=> string(7) "Ω" ["α"]=> string(7) "α" ["β"]=> string(6) "β" ["γ"]=> string(7) "γ" ["δ"]=> string(7) "δ" ["ε"]=> string(9) "ε" ["ζ"]=> string(6) "ζ" ["η"]=> string(5) "η" ["θ"]=> string(7) "θ" ["ι"]=> string(6) "ι" ["κ"]=> string(7) "κ" ["λ"]=> string(8) "λ" ["μ"]=> string(4) "μ" ["ν"]=> string(4) "ν" ["ξ"]=> string(4) "ξ" ["ο"]=> string(9) "ο" ["Ï€"]=> string(4) "π" ["Ï"]=> string(5) "ρ" ["Ï‚"]=> string(8) "ς" ["σ"]=> string(7) "σ" ["Ï„"]=> string(5) "τ" ["Ï…"]=> string(9) "υ" ["φ"]=> string(5) "φ" ["χ"]=> string(5) "χ" ["ψ"]=> string(5) "ψ" ["ω"]=> string(7) "ω" ["Ï‘"]=> string(10) "ϑ" ["Ï’"]=> string(7) "ϒ" ["Ï–"]=> string(5) "ϖ" [" "]=> string(6) " " [" "]=> string(6) " " [" "]=> string(8) " " ["‌"]=> string(6) "‌" ["â€"]=> string(5) "‍" ["‎"]=> string(5) "‎" ["â€"]=> string(5) "‏" ["â€「"]=> string(7) "–" ["â€」"]=> string(7) "—" ["‘"]=> string(7) "‘" ["’"]=> string(7) "’" ["‚"]=> string(7) "‚" ["“"]=> string(7) "「" ["â€"]=> string(7) "」" ["„"]=> string(7) "„" ["†"]=> string(8) "†" ["‡"]=> string(8) "‡" ["•"]=> string(6) "•" ["…"]=> string(8) "…" ["‰"]=> 
string(8) "‰" ["′"]=> string(7) "′" ["″"]=> string(7) "″" ["‹"]=> string(8) "‹" ["›"]=> string(8) "›" ["‾"]=> string(7) "‾" ["â„"]=> string(7) "⁄" ["€"]=> string(6) "€" ["â„‘"]=> string(7) "ℑ" ["℘"]=> string(8) "℘" ["â„œ"]=> string(6) "ℜ" ["â„¢"]=> string(7) "™" ["ℵ"]=> string(9) "ℵ" ["â†"]=> string(6) "←" ["↑"]=> string(6) "↑" ["→"]=> string(6) "→" ["â†「"]=> string(6) "↓" ["â†」"]=> string(6) "↔" ["↵"]=> string(7) "↵" ["â‡"]=> string(6) "⇐" ["⇑"]=> string(6) "⇑" ["⇒"]=> string(6) "⇒" ["â‡「"]=> string(6) "⇓" ["â‡」"]=> string(6) "⇔" ["∀"]=> string(8) "∀" ["∂"]=> string(6) "∂" ["∃"]=> string(7) "∃" ["∅"]=> string(7) "∅" ["∇"]=> string(7) "∇" ["∈"]=> string(6) "∈" ["∉"]=> string(7) "∉" ["∋"]=> string(4) "∋" ["âˆ"]=> string(6) "∏" ["∑"]=> string(5) "∑" ["−"]=> string(7) "−" ["∗"]=> string(8) "∗" ["√"]=> string(7) "√" ["âˆ"]=> string(6) "∝" ["∞"]=> string(7) "∞" ["∠"]=> string(5) "∠" ["∧"]=> string(5) "∧" ["∨"]=> string(4) "∨" ["∩"]=> string(5) "∩" ["∪"]=> string(5) "∪" ["∫"]=> string(5) "∫" ["∴"]=> string(8) "∴" ["∼"]=> string(5) "∼" ["≅"]=> string(6) "≅" ["≈"]=> string(7) "≈" ["≠"]=> string(4) "≠" ["≡"]=> string(7) "≡" ["≤"]=> string(4) "≤" ["≥"]=> string(4) "≥" ["⊂"]=> string(5) "⊂" ["⊃"]=> string(5) "⊃" ["⊄"]=> string(6) "⊄" ["⊆"]=> string(6) "⊆" ["⊇"]=> string(6) "⊇" ["⊕"]=> string(7) "⊕" ["⊗"]=> string(8) "⊗" ["⊥"]=> string(6) "⊥" ["â‹…"]=> string(6) "⋅" ["⌈"]=> string(7) "⌈" ["⌉"]=> string(7) "⌉" ["⌊"]=> string(8) "⌊" ["⌋"]=> string(8) "⌋" ["〈"]=> string(6) "⟨" ["〉"]=> string(6) "⟩" ["â—Š"]=> string(5) "◊" ["â™ "]=> string(8) "♠" ["♣"]=> string(7) "♣" ["♥"]=> string(8) "♥" ["♦"]=> string(7) "♦" } 

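Edit 3:

In the interest of full disclosure, here is the test file I set up to figure this out. Currently, all entities display correctly, but accented characters display as the question-mark diamond described above.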

<?php 

header('Content-Type: text/html; charset=UTF-8'); 

require_once('_RESOURCES/simple_html_dom.php'); 

$url = 'http://fanfiction.portkey.org/index.php?act=read&storyid=1585&chapterid=&agree=1'; 

function tidyHTML($html) { 
    ob_start(); 
    $tidy = new tidy; 
    $config = array('indent' => true, 'output-xhtml' => false, 'wrap' => 200, 'clean' => false, 'show-body-only' => true); 
    $tidy->parseString($html, $config, 'utf8'); 
    $tidy->cleanRepair(); 
    $input = $tidy; 
    return $input; 
} 

function filter($html) { 
    $html = preg_replace('~>\s+<~', '><', $html); 
    $html = preg_replace('/<\/b>\s?<b>/', '', $html); 
    $html = preg_replace('/<\/i>\s?<i>/', '', $html); 
    $html = str_replace('<br>', '', $html); 
    $output = $html; 
    return $output; 
} 

$page_html = file_get_html($url); 
$chapter_html = $page_html->find('td[class="story"]', 0); 
foreach ($chapter_html->find('center') as $node) { $node->outertext = ''; } 

$entities = html_entity_decode($chapter_html->innertext, ENT_QUOTES, 'UTF-8'); 

echo tidyHTML(filter($entities)); 

// var_dump(get_html_translation_table(HTML_ENTITIES)); 

?> 

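Why convert entities to characters? It serves no purpose here, since entities actually work *more safely* in HTML (especially if you don't know how to declare the character encoding). Besides, your sample contains no entities. It has "é" as a literal character, and no ellipsis "...". –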

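I'm converting the entities because I will be outputting the story text in various formats, including plain text. Also, the sample does contain '&hellip;' as well as 'é'. I'll edit the sample to focus on the problem characters. – zuddsy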

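If the source is HTML, generating plain text from it is a much broader problem than converting entity references to characters. The entities can be handled at the point where the plain text is actually generated. –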

Answers

1

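You probably want html_entity_decode. From the documentation: "Convert all HTML entities to their applicable characters." Depending on your PHP version and settings, you may need to specify the encoding manually. For example: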

html_entity_decode($raw_text, ENT_QUOTES, 'UTF-8'); 

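Tidy may be re-encoding your entities. I'm not sure how complex your input string is, but if you don't need the formatting to match exactly, consider removing the HTML tags with something like strip_tags.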
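As a rough illustration of the strip_tags suggestion, here is a minimal sketch for the plain-text case. The sample markup is made up for the example and is not taken from the story page.

```php
<?php
// Hypothetical sample markup; not taken from the actual story page.
$html = '<p>Wel&hellip;come, <b>fianc&eacute;</b> &amp; friends</p>';

// Strip tags first, then decode, so that a decoded "&lt;" is never
// mistaken for the start of a tag.
$text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

echo $text, "\n";
```

Decoding after stripping is a deliberate choice here: doing it the other way round would let escaped angle brackets in the story text be treated as markup and removed.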

That's strange. What version of PHP are you using? If it's an older version, you may have to set the encoding manually. I'm editing my answer to include this. – Skunkwaffle

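I'm using v5.5.11. I tried setting the encoding manually, as in your edit, and the result stays exactly the same: broken ellipsis, accented characters displayed correctly. – zuddsy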

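Are you looking at the result of html_entity_decode, or at the final output? Tidy may be re-escaping your string. Try echoing the result of html_entity_decode directly. You should be able to rule a few things out that way. – Skunkwaffle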

0

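I accomplished what I set out to do by changing tidy's encoding from

$tidy->parseString($html, $config, 'utf8');

to

$tidy->parseString($html, $config, 'win1252');

This converts the accented characters to HTML entities. I then use html_entity_decode to convert all of the entities to UTF-8 characters.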

New test file (works!)

<?php 

header('Content-Type: text/html; charset=UTF-8'); 

require_once('_RESOURCES/simple_html_dom.php'); 

$url = 'http://fanfiction.portkey.org/index.php?act=read&storyid=1585&chapterid=&agree=1'; 

function tidyHTML($html) { 
    ob_start(); 
    $tidy = new tidy; 
    $config = array('indent' => true, 'output-xhtml' => false, 'wrap' => 200, 'clean' => false, 'show-body-only' => true); 
    $tidy->parseString($html, $config, 'win1252'); 
    $tidy->cleanRepair(); 
    $input = $tidy; 
    return $input; 
} 

function filter($html) { 
    $html = preg_replace('~>\s+<~', '><', $html); 
    $html = preg_replace('/<\/b>\s?<b>/', '', $html); 
    $html = preg_replace('/<\/i>\s?<i>/', '', $html); 
    $html = str_replace('<br>', '', $html); 
    $output = $html; 
    return $output; 
} 

$page_html = file_get_html($url); 
$chapter_html = $page_html->find('td[class="story"]', 0); 
foreach ($chapter_html->find('center') as $node) { $node->outertext = ''; } 

echo filter(html_entity_decode(tidyHTML($chapter_html->innertext))); 

?> 

Couldn't have done it without you, Skunkwaffle!
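The fix above routes the accented characters through entities. A different approach, not the one used in this answer, would be to transcode the scraped bytes to UTF-8 first and only then decode the entities. The sketch below assumes the page really is Windows-1252 (which the win1252 setting suggests but does not prove) and uses a made-up sample string.

```php
<?php
// Assumption: the scraped bytes are Windows-1252.
// Hypothetical sample, with "é" as the single byte 0xE9.
$raw = "emotions&hellip;but fianc\xE9";

// 1. Transcode the raw bytes to UTF-8 before touching the entities.
$utf8 = iconv('Windows-1252', 'UTF-8', $raw);

// 2. Decode the entities; the whole string is now UTF-8.
$decoded = html_entity_decode($utf8, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n";
```

Whether this fits depends on what tidy does afterwards: if tidy is then told the input is 'utf8', it should leave the characters alone, but that part is untested here.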
