How to translate Japanese to English from a remote website using preg_replace

I'm trying to write a PHP script that translates Japanese cooking recipes into English. It's a pet project and doesn't have to be perfect. My strategy is to:

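Grab the website contents using file_get_contents
Replace some of the Japanese with English (mostly ingredient names)
Write the result to an HTML file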

I'm running this PHP script from the command line:

<?php 

    // Get contents of Japanese cooking website 
    $url = 'http://recipe.igamono.jp/?eid=1077379'; 
    $context = stream_context_create(array('http' => array('header' => 'Accept-Charset: UTF-8, *;q=0'))); 
    $html = file_get_contents($url, false, $context); 

    // Replace stuff 
    $html = preg_replace('/right/s', 'foobar', $html); // works 
    $html = preg_replace('/の/s', 'shazam!!', $html); // doesn't work 

    // Write output to a file 
    file_put_contents('output.html', $html); 

?>

I'm using Sublime Text to edit the file (translate.php), and I've made sure to save it using: File / Save with Encoding / UTF-8

When I run the script, everything works except that the の replacement does not occur.

However, this does work:

<?php 
    $html = "one の two の"; 
    $html = preg_replace('/の/s', 'shazam!!', $html); 

    file_put_contents('output.html', $html); 
?>

The output is:

one shazam!! two shazam!!

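Any suggestions? I know it's a character encoding issue, but I can't seem to get it to work.

UPDATE:

Here is a revised version that tests whether the $html variable is UTF-8 encoded: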

<?php 
    // Get contents of Japanese cooking website 
    $url = 'http://recipe.igamono.jp/?eid=1077379'; 
    $context = stream_context_create(array('http' => array('header' => 'Accept-Charset: UTF-8, *;q=0'))); 
    $html = file_get_contents($url, FILE_TEXT, $context); 

    if(mb_detect_encoding($html, 'UTF-8')) print("Yep, UTF-8 it is.\n"); 
    if(! mb_detect_encoding($html, 'UTF-8', true)) print("Well, on second thought.. maybe not!\n"); 

?>

The output is:

Yep, UTF-8 it is.

Well, on second thought.. maybe not!

MY SOLUTION

Here's the solution I came up with:

<?php 
    // Get contents of Japanese cooking website 
    $url = 'http://recipe.igamono.jp/?eid=1077379'; 
    $html = file_get_contents($url); 

    // Convert HTML to UTF-8 from Japanese 
    $html = mb_convert_encoding($html, "UTF-8", "EUC-JP"); 

    // Replace stuff 
    $html = preg_replace('/right/s', 'foobar', $html); 
    $html = preg_replace('/の/s', 'shazam!!', $html); 

    // Convert HTML back to Japanese character encoding 
    $html = mb_convert_encoding($html, "EUC-JP", "UTF-8"); 

    // Write HTML to a file 
    file_put_contents('output.html', $html); 
?>
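
Since the goal is to replace many ingredient names rather than a single character, the same convert, replace, convert-back flow can be sketched with a lookup table. This is only a sketch: the glossary entries, the sample markup, and the function name translate_page are assumptions made for illustration, and it swaps preg_replace for strtr(), which works on the literal strings once the text is UTF-8.

```php
<?php
// Hypothetical glossary: Japanese ingredient name => English name.
$glossary = array(
    '砂糖' => 'sugar',
    '塩'   => 'salt',
    '水'   => 'water',
);

// Hypothetical helper: takes EUC-JP HTML and returns EUC-JP HTML with glossary terms replaced.
function translate_page(string $eucjpHtml, array $glossary): string
{
    // Work in UTF-8 so the glossary keys (saved as UTF-8 in this file) can match.
    $utf8 = mb_convert_encoding($eucjpHtml, 'UTF-8', 'EUC-JP');

    // strtr() with an array tries the longest keys first and never rescans replaced text.
    $utf8 = strtr($utf8, $glossary);

    // Convert back so the page's own charset declaration stays correct.
    return mb_convert_encoding($utf8, 'EUC-JP', 'UTF-8');
}

// Simulate a downloaded EUC-JP page from a UTF-8 sample string.
$page = mb_convert_encoding('<li>砂糖 大さじ2</li>', 'EUC-JP', 'UTF-8');
$out  = translate_page($page, $glossary);

// Convert to UTF-8 only for display in a UTF-8 terminal.
$shown = mb_convert_encoding($out, 'UTF-8', 'EUC-JP');
echo $shown, "\n";
```

Converting back to EUC-JP keeps the output consistent with whatever charset the original page declares; if the output file should be UTF-8 instead, the charset declaration in the HTML would need to change as well.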

Source

2016-02-14 clone45

That's it! Glad it's solved..

Suggestion:

$html = preg_replace('/[\x{306E}]/u', 'shazam!!', $html);

UPDATE: It turns out the page you are loading is not UTF-8 but EUC-JP. Do this:

<?php 

// Get contents of Japanese cooking website 
$url = 'http://recipe.igamono.jp/?eid=1077379'; 
$context = stream_context_create(array('http' => array('header' => 'Accept-Charset: EUC-JP, *;q=0'))); 
$html = file_get_contents($url, false, $context); 

// Convert from EUC-JP to UTF-8 so the UTF-8 pattern can match 
$html = mb_convert_encoding($html, "UTF-8", "EUC-JP"); 

// Replace stuff 
$html = preg_replace('/right/s', 'foobar', $html); // works 
$html = preg_replace('/の/s', 'shazam!!', $html); // now works too 

// Write output to a file 
file_put_contents('output.html', $html); 
?>

With this I get "shazam!!" in the output.
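
Rather than guessing the source encoding, the charset a page declares can be read from its Content-Type header or its meta tag. The following is a rough sketch only: declared_charset is a hypothetical helper, and the sample strings are assumptions, not the actual headers of the recipe site.

```php
<?php
// Hypothetical helper: return the charset named in a Content-Type header line
// or an HTML <meta> tag, or null when no charset is declared.
function declared_charset(string $text): ?string
{
    if (preg_match('/charset\s*=\s*["\']?([A-Za-z0-9_\-]+)/i', $text, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Assumed sample inputs for illustration.
$fromHeader = declared_charset('Content-Type: text/html; charset=EUC-JP');
$fromMeta   = declared_charset('<meta http-equiv="Content-Type" content="text/html; charset=euc-jp">');
$missing    = declared_charset('Content-Type: text/html');

echo $fromHeader, "\n";
echo $fromMeta, "\n";
var_dump($missing);
```

In a real script the input could be the lines of $http_response_header, which PHP fills in after file_get_contents() on an http:// URL, or the beginning of $html. The result could then be passed as the third argument of mb_convert_encoding().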

Source

2016-02-14 01:10:04

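If I use /su, my output file is blank. If I use /s, I get: Warning: preg_replace(): Compilation failed: character value in \x{} or \o{} is too large. Maybe this is a good idea, but perhaps my $html isn't UTF-8? – clone45

Solved it.. but: I mistakenly edited Mvorisek's answer instead of my own, so it now needs to be "peer reviewed"... The problem is that the Japanese page is not UTF-8 but EUC-JP, and it needs to be converted to UTF-8 first before it can be shazammed..

I solved it too. I'll post my solution as an edit to my question, but if yours works as well, I'll accept your solution. – clone45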

Try $html = preg_replace('/の/su', 'shazam!!', $html); (with the /u UTF-8 modifier).

UPDATE:

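When the /u modifier is used, the input text ($html here) must be valid UTF-8; otherwise preg_replace() returns NULL and nothing is replaced at all. The encoding of the input text can be checked with mb_check_encoding($string, 'UTF-8').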
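
A small sketch of that check, assuming a short EUC-JP byte string stands in for the downloaded page (the sample string and variable names are made up for illustration):

```php
<?php
// "\xA4\xCE" is the EUC-JP byte pair for the character の; it is not valid UTF-8.
$eucjp = "one \xA4\xCE two";

// Strict validity check: false for the raw EUC-JP bytes.
$isUtf8 = mb_check_encoding($eucjp, 'UTF-8');
var_dump($isUtf8);

// With /u and an invalid subject, preg_replace() returns NULL instead of a string.
$result = preg_replace('/\x{306E}/u', 'shazam!!', $eucjp);
$error  = preg_last_error();
var_dump($result === null);
var_dump($error === PREG_BAD_UTF8_ERROR);

// After converting to UTF-8, the same pattern matches.
$utf8  = mb_convert_encoding($eucjp, 'UTF-8', 'EUC-JP');
$fixed = preg_replace('/\x{306E}/u', 'shazam!!', $utf8);
echo $fixed, "\n";
```

Passing that NULL on to file_put_contents() writes an empty file, which would explain the blank output reported in the comments when /u is added.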

Source

2016-02-14 00:53:32 Mvorisek

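Thanks for the reply. mb_detect_encoding($html, 'UTF-8') is true, but the "strict" version, mb_detect_encoding($html, 'UTF-8', true), is false. Also, if I needed the /u modifier, my second example should fail as well, right? When I add the /u modifier, the output is blank. Damn! – clone45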
