Regex breaks my UTF-8 XML (PHP)

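I have a problem. I have code that downloads some XML files and removes some tags I don't need. Up to that point everything is fine: my XML files are UTF-8 and I have no problems.

But since I added the code that replaces and changes the title values, my XML file is no longer UTF-8, and I get this error message: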

"D:\Anwendung\PHP 7\php-win.exe" C:\Users\Jan\PhpstormProjects\censored\test.php 
PHP Warning: DOMDocument::load(): Input is not proper UTF-8, indicate encoding ! 
Bytes: 0xE3 0xA4 0x63 0x68 in file:/C:/Users/Jan/PhpstormProjects/censored/data/gamesplanet.xml, line: 1423 in C:\Users\Jan\PhpstormProjects\censored\test.php on line 18 
PHP Fatal error: Uncaught Error: Call to a member function getElementsByTagName() on null in C:\Users\Jan\PhpstormProjects\censored\test.php:23 
Stack trace: 
#0 C:\Users\Jan\PhpstormProjects\censored\test.php(86): countAd('data/gamesplane...') 
#1 {main} 
    thrown in C:\Users\Jan\PhpstormProjects\censored\test.php on line 23 

Process finished with exit code 255

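Line 1423 then reads: W㥣hter Von Mittelerde

If I don't run the file through the code below, I get no error message, and line 1423 reads: Wächter von Mittelerde

Does anyone have an idea that could help me?

Code: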

function loadTitles($tagName, $path){ 

    $dom = new DOMDocument('1.0', 'utf-8'); 
    $dom->preserveWhiteSpace = false; 
    $dom->formatOutput = true; 
    $dom->load($path); 

    $marker = $dom->getElementsByTagName($tagName); 

    for ($i = $marker->length - 1; $i >= 0; $i--) { 
     $word = $marker->item($i)->textContent; 
     $escapedWord = escapWord($word); 
     $escapedWord = modifyWord($escapedWord); 
     $marker->item($i)->textContent = $escapedWord; 
    } 

    $dom->saveXML(); 
    $dom->save($path); 
} 
function escapWord($string){ 

    $replaceNothing = [":", ",", ";", "`", "#", "'", "´", "–", "!", "(", ")", ".", "@", "’", "+", "™"]; 
    $replaceSpace = ["-", "–", "_", "/", ":"]; 
    $delete = ["Steam", "Eu", "Key", "CD", "Gift", "Edition", "Pack", "Uplay", "Required", "Collection", "Origin", "HD", "Complete", "Digital", "Download", "EA", "Europa", "RPG", "Activated", "Access", "Code", "Limited", "Direct", "Bundle", "Special", "CDKEY", "GLOBAL", "EARLY", "ACCESS", "Card", "Cartel", "Player", "Trade", "DE", "GOG", "Multilanguage", "Multi", "Full", "Only", "UNCUT", "Cut", "Box", "Ps Vita", "VIP", "Rockstar", "Subscription"]; 

    $string= str_replace($replaceNothing, '', $string); 
    $string= str_replace($replaceSpace, ' ', $string); 
    $string= preg_replace('~\b(?:' . implode('|', $delete) . ')\b~i', '', $string); 
    $string= str_replace("&amp;", ' & ', $string); 
    $string= strtolower($string); 
    $string= ucwords($string); 
    $string= preg_replace('/\bAsia\b/i', 'ASIA', $string); 
    $string= preg_replace('/\buk\b/i', 'UK', $string); 
    $string= preg_replace('/\bAU\b/i', 'AU', $string); 
    $string= preg_replace('/\bXBOX\b/i', 'XBOX ', $string); 
    $string= preg_replace('/\bpc\b/i', 'PC', $string); 
    $string= preg_replace('/\bus\b/i', 'US', $string); 
    $string= preg_replace('/\bru\b/i', 'RUS', $string); 
    $string= preg_replace('/\bRUS\b/i', 'RUS', $string); 
    $string= preg_replace('/\bPS4\b/i', 'PS4', $string); 
    $string= preg_replace('/\bAddon\b/i', 'AddOn', $string); 
    $string= preg_replace('/\bPlay Station 4\b/i', 'PS4', $string); 
    $string= preg_replace('/\bPs4\b/i', 'PS4', $string); 
    $string= preg_replace('/\bPs3\b/i', 'PS3', $string); 
    $string= preg_replace('/\bPlayStation 4\b/i', 'PS4', $string); 
    $string= preg_replace('/\bPlay Station 3\b/i', 'PS3', $string); 
    $string= preg_replace('/\bPlayStation 3\b/i', 'PS3', $string); 
    $string= preg_replace('/\bPlayStation Network\b/i', 'PSN', $string); 
    $string= preg_replace('/\bPSN\b/i', 'PSN', $string); 
    $string= preg_replace('/\bXX\b/i', 'XX', $string); 
    $string= preg_replace('/\bXIX\b/i', 'XIX', $string); 
    $string= preg_replace('/\bXVIII\b/i', 'XVIII', $string); 
    $string= preg_replace('/\bXVII\b/i', 'XVII', $string); 
    $string= preg_replace('/\bXVI\b/i', 'XVI', $string); 
    $string= preg_replace('/\bXV\b/i', 'XV', $string); 
    $string= preg_replace('/\bXIV\b/i', 'XIV', $string); 
    $string= preg_replace('/\bXiii\b/i', 'XIII', $string); 
    $string= preg_replace('/\bXii\b/i', 'XII', $string); 
    $string= preg_replace('/\bXi\b/i', 'XI', $string); 
    $string= preg_replace('/\bIX\b/i', 'IX', $string); 
    $string= preg_replace('/\bVIII\b/i', 'VIII', $string); 
    $string= preg_replace('/\bVII\b/i', 'VII', $string); 
    $string= preg_replace('/\bVI\b/i', 'VI', $string); 
    $string= preg_replace('/\bV\b/i', 'V', $string); 
    $string= preg_replace('/\bIV\b/i', 'IV', $string); 
    $string= preg_replace('/\bIII\b/i', 'III', $string); 
    $string= preg_replace('/\bII\b/i', 'II', $string); 
    $string= preg_replace('/\bdlc\b/i', 'DLC', $string); 
    $string= trim(preg_replace('/\s\s+/', ' ', str_replace("\n", " ", $string))); 

    return $string; 
} 
function modifyWord($string){ 

    if(strpos($string, "Counter Strike Offensive") !== false){ 
     $newstring = explode("Offensive", $string);; 
     $newstring[0] = $newstring[0] . "Global Offensive"; 
     $string = $newstring[0] . $newstring[1]; 
    } 

    return $string; 
}

Regards, and thank you!


2017-01-05 Jan

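The problem is that you are processing a multibyte (UTF-8) string with functions that are not multibyte-aware ('str_replace', 'ucwords', 'strtolower', 'preg_replace' without the u modifier). Use the 'mb_' functions instead, and add the u modifier to 'preg_replace'. –

Note that 'preg_replace' can take arrays as its first and second arguments. –

Can you give me a code snippet showing how to do that? I don't know what the mb_ functions are, or what the "u modifier" means. – Jan

You should enable Unicode mode by adding the u modifier to your patterns. That means you match Unicode characters and code points rather than single bytes. The ä in Wächter consists of two bytes in UTF-8 (0xC3 0xA4), and in single-byte mode one of them is interpreted as the end of a word.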

preg_match('(.)u', 'äöü', $match); 
var_dump($match);

Output:

array(1) { 
    [0]=> 
    string(2) "ä" 
}
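To make the word-boundary point concrete, here is a minimal sketch that runs the same pattern with and without the u modifier. It assumes the script file itself is saved as UTF-8 and that PHP runs with its default "C" locale; the variable names are only illustrative.

```php
<?php
// "Wächter" as UTF-8 bytes (assumes this file is saved as UTF-8).
$title = 'Wächter';

// Byte mode: \w and \b only know single-byte word characters, so the two
// bytes of "ä" count as non-word bytes and the word is split around them.
preg_match_all('/\b\w+\b/', $title, $byteMode);

// Unicode mode: with the u modifier PCRE reads whole code points, and in PHP
// \w and \b become Unicode-aware, so "ä" is part of the word.
preg_match_all('/\b\w+\b/u', $title, $unicodeMode);

var_dump($byteMode[0]);    // more than one fragment
var_dump($unicodeMode[0]); // a single match: "Wächter"
```

Any pattern in escapWord() that relies on \b can hit such a false boundary inside a word with umlauts unless the u modifier is present.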

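As you can see, the example matches the first character, not just the first byte. The next point is that preg_replace() accepts arrays as arguments. This lets you simplify the calls.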

var_dump(preg_replace(['(ä)u', '(ü)u'], '_', 'äöü'));

Output:

string(4) "_ö_"

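But a better option may be to use character classes and the | operator in your patterns. The characters in the $replaceNothing and $replaceSpace arrays can be turned into character classes: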

$replaceWithNothing = '([,;`#\'´!().@’+™]+|(?:\b(?:Steam|Eu|Key)\b))u'; 
$replaceWithSpace = '([-–_/:]+)u'; 

var_dump(
    preg_replace(
    [$replaceWithNothing, $replaceWithSpace], 
    ['', ' '], 
    'remove (™) and :replace:' 
) 
);

And for the word replacements:

$replaceWords = [ 
    '(\bAsia\b)ui' => 'ASIA', 
    '(\buk\b)ui' => 'UK' 
]; 
$output = preg_replace(array_keys($replaceWords), array_values($replaceWords), $input);
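Putting these pieces together, a cleanup function could look roughly like the sketch below. The function name cleanTitle() and the shortened word lists are assumptions for illustration only, not the full lists from the question, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper; name and word lists are illustrative assumptions.
function cleanTitle(string $title): string
{
    // Characters to drop, and characters to turn into a space,
    // written as character classes with the u modifier.
    $title = preg_replace(
        ['([,;`#\'´!().@’+™]+)u', '([-–_/:]+)u'],
        ['', ' '],
        $title
    );

    // Whole words to drop, case-insensitively (shortened list).
    $title = preg_replace('(\b(?:Steam|Key|Edition)\b)ui', '', $title);

    // Fixed spellings for selected words: pattern => replacement.
    $replaceWords = [
        '(\bpc\b)ui'  => 'PC',
        '(\bdlc\b)ui' => 'DLC',
    ];
    $title = preg_replace(
        array_keys($replaceWords),
        array_values($replaceWords),
        $title
    );

    // Collapse runs of whitespace and trim both ends.
    return trim(preg_replace('(\s+)u', ' ', $title));
}

var_dump(cleanTitle('Wächter von Mittelerde: Steam Key (pc)'));
// string(26) "Wächter von Mittelerde PC"
```

Every pattern carries the u modifier, so the ä is never split into bytes, and no strtolower() or ucwords() call touches the string.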

I don't see why you don't use a simple replacement in the modifyWord() function. You are replacing the first occurrence of "Counter Strike Offensive" with "Counter Strike Global Offensive".
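A simple replacement could be sketched like this. Note one assumption: unlike the explode() version, str_replace() replaces every occurrence of the phrase, which should not matter for a single product title.

```php
<?php
// Sketch of modifyWord() as a single replacement (hypothetical rewrite).
// The search phrase is plain ASCII, so str_replace() cannot cut into
// a multibyte character here.
function modifyWord(string $string): string
{
    return str_replace(
        'Counter Strike Offensive',
        'Counter Strike Global Offensive',
        $string
    );
}

var_dump(modifyWord('Counter Strike Offensive PC'));
// string(34) "Counter Strike Global Offensive PC"
```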

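The comments mention using the mb_* functions. I would suggest the more modern, ICU-based grapheme functions (from the intl extension) instead. That is the standard, more modern and more powerful extension for Unicode handling in PHP.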
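For the mb_* route from the comments, a minimal sketch of UTF-8-safe case handling follows. It assumes the mbstring extension is loaded; titleCase() is a hypothetical helper name standing in for the strtolower() plus ucwords() pair in the question.

```php
<?php
// Assumes the mbstring extension is available.
// Hypothetical helper replacing strtolower() followed by ucwords().
function titleCase(string $title): string
{
    // MB_CASE_TITLE lowercases the text and uppercases the first letter
    // of each word, working on characters instead of bytes.
    return mb_convert_case($title, MB_CASE_TITLE, 'UTF-8');
}

var_dump(titleCase('WÄCHTER VON MITTELERDE'));
// string(23) "Wächter Von Mittelerde"

var_dump(mb_strtolower('WÄCHTER', 'UTF-8'));
// string(8) "wächter"
```

The explicit 'UTF-8' argument keeps the result independent of the mbstring.internal_encoding setting.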


2017-01-05 12:49:41 ThW

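But that doesn't solve my problem with $string = mb_strtolower($string, 'UTF-8'); ... – Jan

Partly. 'strtolower()' can destroy UTF-8 because it is an ANSI (single-byte) function, and so is 'ucwords()'. But I don't think you need to change the case of the string at all: the patterns use the "i" (case-insensitive) modifier. Adding 'u' should be enough. – ThW

I only use strtolower() because I want all the strings to be uniform. So I call strtolower() first and then ucwords(), so that they all look nice. – Jan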
