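Convert UTF8 to ANSI?
I'd like to use .NET's WebClient class to download a web page, extract its title (i.e. the text between <title> and </title>), and save the page to a file.
The problem is that the page is encoded in UTF-8, and System.IO.StreamWriter throws an exception when the file name contains such characters.
I've googled and tried several ways of converting UTF8 to ANSI, to no avail. Does anyone have working code for this?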
'Using WebClient asynchronous downloading
Private Sub AlertStringDownloaded(ByVal sender As Object,
                                  ByVal e As DownloadStringCompletedEventArgs)
    If e.Cancelled = False AndAlso e.Error Is Nothing Then
        Dim Response As String = CStr(e.Result)

        'Doesn't work
        Dim resbytes() As Byte = Encoding.UTF8.GetBytes(Response)
        Response = Encoding.Default.GetString(Encoding.Convert(Encoding.UTF8,
                                                               Encoding.Default, resbytes))

        Dim title As Regex = New Regex("<title>(.+?) \(",
                                       RegexOptions.Singleline)
        Dim m As Match
        m = title.Match(Response)
        If m.Success Then
            Dim MyTitle As String = m.Groups(1).Value

            'Illegal characters in path.
            Dim objWriter As New System.IO.StreamWriter("c:\" & MyTitle & ".txt")
            objWriter.Write(Response)
            objWriter.Close()
        End If
    End If
End Sub
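Edit: Thanks everyone for the help. It turns out the error wasn't caused by UTF8 but by an LF character hidden in the page's title section, which is evidently an illegal character in a path.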
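For anyone hitting the same "Illegal characters in path" error, a quick way to check for hidden characters is to list the control characters in the extracted title. This is only a minimal sketch: DescribeControlChars is a hypothetical helper name, and the sample string stands in for a real page title.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module TitleInspector
    ' Hypothetical helper: reports the index and code point of every
    ' control character (LF, CR, TAB, ...) found in the given string.
    Function DescribeControlChars(ByVal value As String) As String
        Dim sb As New StringBuilder()
        For i As Integer = 0 To value.Length - 1
            If Char.IsControl(value(i)) Then
                sb.AppendFormat("index {0}: U+{1:X4}; ", i, Convert.ToInt32(value(i)))
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' Sample title with an embedded LF (an assumption, not the real page).
        Console.WriteLine(DescribeControlChars("Cinema" & vbLf & "Paradiso"))
        ' prints "index 6: U+000A; "
    End Sub
End Module
```

An empty result means the title has no control characters, so the problem lies elsewhere (for example a ":" or "?" in the title).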
Edit: Here's a simple way to strip some illegal characters from a file name/path:
Dim MyTitle As String = m.Groups(1).Value
Dim InvalidChars As String = New String(Path.GetInvalidFileNameChars()) + New String(Path.GetInvalidPathChars())
For Each c As Char In InvalidChars
    MyTitle = MyTitle.Replace(c.ToString(), "")
Next
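Edit: Here's how to tell WebClient to expect UTF-8:
Dim webClient As New WebClient
AddHandler webClient.DownloadStringCompleted, AddressOf AlertStringDownloaded
'Set the encoding before starting the download
webClient.Encoding = Encoding.UTF8
'The Uri needs a scheme; New Uri("www.acme.com") throws UriFormatException
webClient.DownloadStringAsync(New Uri("http://www.acme.com"))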
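As far as I understand it, WebClient.Encoding only controls how the downloaded bytes are decoded into a String. Once decoded, a .NET String is UTF-16 in memory, so round-tripping it through Encoding.Convert changes nothing useful (at worst it replaces characters the ANSI code page can't represent). The encoding only matters again when the text is written out as bytes. If the saved file itself needs to be ANSI, a sketch like the following passes the encoding to StreamWriter. SaveText and the temp-file name are hypothetical, and it assumes .NET Framework, where Encoding.Default is the system ANSI code page (on .NET Core and later it is UTF-8).

```vbnet
Imports System.IO
Imports System.Text

Module SaveWithEncoding
    ' Hypothetical helper: writes content to filePath using the given encoding.
    ' Using ensures the writer is flushed and closed even if Write throws.
    Sub SaveText(ByVal filePath As String, ByVal content As String, ByVal enc As Encoding)
        Using writer As New StreamWriter(filePath, False, enc)
            writer.Write(content)
        End Using
    End Sub

    Sub Main()
        ' Illustrative target file in the temp directory (an assumption).
        Dim target As String = Path.Combine(Path.GetTempPath(), "page.txt")
        ' Encoding.Default: ANSI code page on .NET Framework only.
        SaveText(target, "Cinéma Paradiso", Encoding.Default)
    End Sub
End Module
```

Note that this affects the file's contents only. The file name is passed to Windows as Unicode, so accented characters in the name need no conversion.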
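There are plenty of ASCII characters that can't be used in file names... what exactly is the title? – Esailija
Sorry, the characters are fine (although I'd rather use ANSI characters than UTF8 in file names: "c:\Cinéma Paradiso.txt" isn't user-friendly). I'll find out how to remove the hidden LF character that triggers the error. – Gulbahar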