Trying to convert a string to the correct format/encoding?

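I have a program that screen-scrapes some French web pages and looks for a specific string. Once it is found, I take that string and save it. The returned string should read User does not have a desktop configured. or, in French, L'utilisateur ne dispose pas d'un bureau configuré., but it actually shows up as: L**\x26#39**;utilisateur ne dispose pas d**\x26#39**;un bureau configur**�**. How can I get it to treat \x26#39 as the apostrophe character '?

Is there something in C# I can use to read the URL and return the correct phrase?

I have looked at many of the available C# functions but could not find one that gives me the correct result.

Sample code I have been playing with: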

// translated the true French text to English to help out with this example. 
// 
Encoding winVar1252 = Encoding.GetEncoding(1252); 
Encoding utf8 = Encoding.UTF8; 
Encoding ascii = Encoding.ASCII; 
Encoding unicode = Encoding.Unicode; 

string url = String.Format("http://www.My-TEST-SITE.com/"); 
WebClient webClient = new WebClient(); 
webClient.Encoding = System.Text.Encoding.UTF8; 
string result = webClient.DownloadString(url); 
int cVar = result.Substring(result.IndexOf("Search_TEXT=")).Length; 
result = result.Substring(result.IndexOf("Search_TEXT="), cVar); 
result = WebUtility.HtmlDecode(result); 
result = WebUtility.UrlDecode(result); 
result = result.Substring(0, result.IndexOf("Found: "));

This returns L**\x26#39**;utilisateur ne dispose pas d**\x26#39**;un bureau configur**�**. when it should return: L'utilisateur ne dispose pas d'un bureau configuré.

I am trying to get rid of the \x26#39 and get the proper French characters to display, such as é ê è ç â, etc.

Source

2014-01-08 user3147056

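Any particular reason you don't want to use a proper tool such as HtmlAgilityPack for web scraping?

You are mixing up a lot of things. Basically, UTF-8 is a way of encoding characters, while Unicode is the character set being represented. I suggest you read this excellent article first, and then you will understand what is going on. http://www.joelonsoftware.com/articles/Unicode.html

I didn't know about HtmlAgilityPack; I am reading the documentation now. As for Joel's site... yes, I have seen it, but it doesn't tell me why I still don't see any UTF-8 characters on my screen. I am trying to find the right code to give me the correct text. – user3147056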

I can't be sure, but:

result = result.Substring(result.IndexOf("Search_TEXT="), cVar); 
result = WebUtility.HtmlDecode(result); 
result = WebUtility.UrlDecode(result);

Double-decoding the text can't be good. The text is either URL-encoded or HTML-encoded, or possibly neither. Not both.
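To illustrate why the two decoders are not interchangeable, here is a minimal sketch. The sample strings are made up for the example and are not taken from the page in the question; the point is only that each decoder understands its own escape syntax, and that applying the wrong one either does nothing or alters plain text.

```csharp
using System;
using System.Net;

public static class DecodeDemo
{
    public static void Main()
    {
        // Hypothetical inputs: the same apostrophe escaped two different ways.
        string htmlEncoded = "L&#39;utilisateur"; // HTML numeric entity
        string urlEncoded = "L%27utilisateur";    // URL percent-escape

        // Each decoder handles its own format.
        Console.WriteLine(WebUtility.HtmlDecode(htmlEncoded)); // L'utilisateur
        Console.WriteLine(WebUtility.UrlDecode(urlEncoded));   // L'utilisateur

        // The wrong decoder leaves the escape untouched.
        Console.WriteLine(WebUtility.UrlDecode(htmlEncoded));  // L&#39;utilisateur
        Console.WriteLine(WebUtility.HtmlDecode(urlEncoded));  // L%27utilisateur

        // UrlDecode also turns '+' into a space, so running it over text
        // that was never URL-encoded can silently change it.
        Console.WriteLine(WebUtility.UrlDecode("1+1=2"));      // 1 1=2
    }
}
```

So it is worth working out which escaping the page actually uses and applying only the matching decoder.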

Source

2014-01-08 03:10:53

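Tried: result = WebUtility.HtmlDecode(result); // result = WebUtility.UrlDecode(result); and then // result = WebUtility.HtmlDecode(result); result = WebUtility.UrlDecode(result); UrlDecode on its own gave me a string-size error. – user3147056

It looks like your first problem is not with character encoding but with someone's custom combination of "\x" escape sequences and obfuscated HTML entities.

That funny-looking **\x26#39**; is really just a simple single quote. Translating the hex escape \x26 gives &, so you end up with **&#39**;. Remove the extraneous stars and you get the HTML entity &#39;. With HtmlDecode this becomes a plain apostrophe, ', which is just ASCII character 39.

Try this snippet. Note that HtmlDecode can only be applied in the last step.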

// Assumes: using System; using System.Text.RegularExpressions;
var input = @"L**\x26#39**;utilisateur ne dispose pas d**\x26#39**;un bureau configur**�**"; 

var result = Regex.Replace(input, @"\*\*([^*]*)\*\*", "$1"); // Take out the extra stars 

// Unescape \x values 
result = Regex.Replace(result, 
         @"\\x([a-fA-F0-9]{2})", 
         match => char.ConvertFromUtf32(Int32.Parse(match.Groups[1].Value, 
                    System.Globalization.NumberStyles.HexNumber))); 

// Decode html entities 
result = System.Net.WebUtility.HtmlDecode(result);

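The output is L'utilisateur ne dispose pas d'un bureau configur�

The second problem is the accented "e". That one really is an encoding problem, and you may have to keep experimenting to get it right. You might also want to try UTF-16 or even UTF-32. But HtmlAgilityPack may well handle this for you automatically.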
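One plausible cause of the � is that the page is served in a single-byte encoding such as ISO-8859-1 or Windows-1252 while the client decodes it as UTF-8. The question does not confirm which encoding the site uses, so treat the following as a sketch under that assumption. It simulates the server's bytes locally rather than downloading anything; EncodingDemo and Decode are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Decode raw bytes using the encoding with the given name.
    public static string Decode(byte[] bytes, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }

    public static void Main()
    {
        // Simulate what an ISO-8859-1 server would send for "configuré":
        // the é (U+00E9) becomes the single byte 0xE9.
        byte[] raw = Encoding.GetEncoding("iso-8859-1").GetBytes("configur\u00E9");

        // Decoded as UTF-8, the lone 0xE9 is an invalid sequence and is
        // replaced with U+FFFD, the replacement character shown as �.
        string wrong = Decode(raw, "utf-8");
        Console.WriteLine(wrong.IndexOf('\uFFFD') >= 0);  // True

        // Decoded with the encoding the bytes were written in, é survives.
        string right = Decode(raw, "iso-8859-1");
        Console.WriteLine(right == "configur\u00E9");     // True
    }
}
```

If this is what is happening, fetching the raw bytes with webClient.DownloadData(url) and decoding them with the encoding named in the response's Content-Type header or the page's meta charset tag, instead of forcing webClient.Encoding to UTF-8, would be the thing to try.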

Source

2014-01-08 03:25:34
