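PHP character encoding hell reading a CSV file with fgets

I have a website that receives a CSV file by FTP once a month. For years it was an ASCII file. Now I receive UTF-8 one month, UTF-16BE the next, and UTF-16LE the month after that. Maybe I'll get UTF-32 next month. fgets returns the byte order mark (BOM) at the beginning of the UTF files. How can I get PHP to recognize the character encoding automatically? I tried mb_detect_encoding, and it returned ASCII regardless of the file type. I then changed my code to read the BOM and pass the character encoding explicitly to mb_convert_encoding. This worked until the latest file, which is UTF-16LE. In this file, the first line is read correctly and all subsequent lines show up as question marks ("?"). What am I doing wrong?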
$fhandle = fopen($file_in, "r");
if ($fhandle === false)
{
    echo "<p class=redbold>Error opening file $file_in.</p>";
    die();
}
$i = 0;
while (($line = fgets($fhandle)) !== false)
{
    $i++;
    // Detect encoding on first line. Actual text always begins with string "Document"
    if ($i == 1)
    {
        $line_start = substr($line, 0, 4);
        $line_start_hex = bin2hex($line_start);
        $utf16_start = 'fffe4400';
        $utf8_start = 'efbbbf44';
        if (strcmp($line_start, 'Docu') == 0)
        {
            $char_encoding = 'ASCII';
        }
        elseif (strcmp($line_start_hex, 'efbbbf44') == 0)
        {
            // UTF-8 BOM (EF BB BF) followed by "D"
            $char_encoding = 'UTF-8';
            $line = substr($line, 3);
        }
        elseif (strcmp($line_start_hex, 'fffe4400') == 0)
        {
            // UTF-16LE BOM (FF FE) followed by "D" as 44 00
            $char_encoding = 'UTF-16LE';
            $line = substr($line, 2);
        }
        elseif (strcmp($line_start_hex, 'feff4400') == 0)
        {
            $char_encoding = 'UTF-16BE';
            $line = substr($line, 2);
        }
        else
        {
            echo "<p class=redbold>Error, unknown character encoding. Line =<br>", $line_start_hex, '</p>';
            require('../footer.php');
            die();
        }
        echo "<p>char_encoding = $char_encoding</p>";
    }
    // Convert UTF
    if ($char_encoding != 'ASCII')
    {
        $line = mb_convert_encoding($line, 'ASCII', $char_encoding);
    }
    echo '<p>'; var_dump($line); echo '</p>';
}
Output:
char_encoding = UTF-16LE
string(101) "DocumentNumber,RecordedTS,Title,PageCount,City,TransTaxAccountCode,TotalTransferTax,Description,Name
"
string(83) "???????????????????????????????????????????????????????????????????????????????????"
string(88) "????????????????????????????????????????????????????????????????????????????????????????"
string(84) "????????????????????????????????????????????????????????????????????????????????????"
string(80) "????????????????????????????????????????????????????????????????????????????????"
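
A likely explanation for the question marks: fgets splits on the single byte 0x0A, but in UTF-16LE a newline is the two bytes 0A 00. The trailing 00 is left at the start of the next line, so every later line is off by one byte and each 16-bit unit decodes to a non-ASCII character, which mb_convert_encoding renders as "?". One way around this, sketched below, is to decode the whole file first and only then split it into lines. The helper name read_lines_as_utf8 and the demo file contents are made up for illustration, and the sketch converts to UTF-8 rather than ASCII, which is a change from the code above.

```php
<?php
// Hypothetical helper (not from the original post): decode the whole file
// before splitting, so multi-byte newlines are never cut in half.
function read_lines_as_utf8(string $path, string $encoding): array
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $path");
    }
    $text = mb_convert_encoding($raw, 'UTF-8', $encoding);
    // A BOM (U+FEFF) that survived conversion is EF BB BF in UTF-8; drop it.
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        $text = substr($text, 3);
    }
    // Split on CRLF, LF or CR. PREG_SPLIT_NO_EMPTY also drops blank lines,
    // including the empty tail left by a final newline.
    return preg_split('/\r\n|\n|\r/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Demo with a temporary UTF-16LE file: BOM plus two made-up lines.
$tmp = tempnam(sys_get_temp_dir(), 'csv');
$body = mb_convert_encoding("Document,Title\n1,Deed\n", 'UTF-16LE', 'UTF-8');
file_put_contents($tmp, "\xFF\xFE" . $body);
$lines = read_lines_as_utf8($tmp, 'UTF-16LE');
unlink($tmp);
print_r($lines);
```

This reads the whole file into memory, which should be fine for a monthly CSV of modest size. Splitting on newlines this way would also break quoted CSV fields that contain line breaks, so treat it as a starting point only.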
Unfortunately, mb_detect_encoding seems to return "ASCII" for some UTF files. – George
Oops, missed that part of the question... back to the drawing board. –
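But ASCII is a subset of Unicode (the first 128 code points), so it should be easy to convert. Just convert to ASCII and don't use multibyte strings. Oh, and have you thought about perhaps yelling at whoever is supplying the FTP data? – Amelia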
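
Since the question mentions reading the BOM by hand and the possibility of UTF-32 arriving next, here is a rough sketch of BOM sniffing that does not depend on the file starting with "Document". The function name detect_bom_encoding is hypothetical. The main point is the ordering: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE), so the longer signatures have to be tested first.

```php
<?php
// Hypothetical helper: map a leading byte order mark to an mbstring
// encoding name, or return null when no BOM is present.
function detect_bom_encoding(string $bytes): ?string
{
    // Longest signatures first, because FF FE 00 00 (UTF-32LE) starts
    // with FF FE (UTF-16LE).
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $encoding) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null; // no BOM: the caller decides, e.g. treat as ASCII
}

// Usage: pass the first few bytes of the file.
var_dump(detect_bom_encoding("\xFF\xFE" . "D\x00"));
var_dump(detect_bom_encoding("Docu"));
```

One caveat: a UTF-16LE file whose first character is U+0000 would be mistaken for UTF-32LE, which seems very unlikely for a CSV that starts with text. Files without a BOM still need a fallback rule chosen by the caller.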