Idiomatic way to parse and navigate a UTF-8 byte stream

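I am working on a lexer/tokenizer in Rust that needs to parse a UTF-8 input file (given as a &[u8]) into individual chars, but it also has to keep track of byte positions within the file. At a later stage, say when an error in the input has to be reported, I need to locate the offending character by its byte position and step backwards to find its relative position within its line.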

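What is the idiomatic way to parse a byte stream into characters while keeping track of byte positions? Is there a function in the standard library that lets me compute the number of trailing bytes that follow a leading byte, or tell whether a byte is a leading or a trailing byte, or do I have to implement these myself based on the Unicode standard?

For example, something like: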

// First to parse some input stream
// (parse() is the hypothetical function being asked for):
let input: &[u8] = b"something";
for (chr, bytepos) in parse(input) {
    // ...
}

// Later to locate a character based on the byte position and
// use is_leading_byte() to step backwards and count the number
// of characters since the start of the line
// (is_leading_byte() and is_trailing_byte() are hypothetical too):
let chr: u8 = input[some_bytepos];
chr.is_leading_byte();
chr.is_trailing_byte();


2016-11-03 Fabian

[Please ask one question per question](http://meta.stackexchange.com/q/39223/281829). – Shepmaster

It sounds like you want [`char_indices`](https://doc.rust-lang.org/std/primitive.str.html#method.char_indices) and to store the result in a `Vec` (`let everything: Vec<_> = input.char_indices().collect()`). – Shepmaster

If you have a `&[u8]` that is UTF-8, you should turn it into a `&str`. – Shepmaster
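To make the two comments above concrete, here is a minimal sketch using only the standard library. It validates the bytes with `std::str::from_utf8`, collects `(byte offset, char)` pairs with `char_indices`, and uses `str::is_char_boundary` to tell whether an index is the first byte of a character. The names `decode_with_offsets` and `is_leading_byte` are hypothetical helpers made up for this illustration, not standard library functions.

```rust
/// Validates `bytes` as UTF-8 and returns every character together with
/// the byte offset at which it starts. (Hypothetical helper.)
fn decode_with_offsets(bytes: &[u8]) -> Result<Vec<(usize, char)>, std::str::Utf8Error> {
    let text = std::str::from_utf8(bytes)?;
    Ok(text.char_indices().collect())
}

/// Returns true if `index` is the first byte of a character or the end of
/// the text, and false for continuation bytes and out-of-range indices.
/// (Hypothetical helper wrapping `str::is_char_boundary`.)
fn is_leading_byte(text: &str, index: usize) -> bool {
    text.is_char_boundary(index)
}

fn main() {
    // 'a' takes 1 byte, 'é' takes 2 bytes and '€' takes 3 bytes in UTF-8.
    let input: &[u8] = "aé€".as_bytes();
    let decoded = decode_with_offsets(input).expect("input is not valid UTF-8");
    for (offset, chr) in &decoded {
        println!("{offset}: {chr}");
    }
    println!("{}", is_leading_byte("aé€", 2));
}
```

Validating once up front means every later slice or iteration works on a `&str`, so no manual inspection of UTF-8 bit patterns should be needed.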

It seems that char_indices solves both of my problems:

let input: &str = "something"; 
for (offset, chr) in input.char_indices() { 
    // ... 
}

At a later stage, locating the character and counting backwards can be done with split_at:

let input: &str = "something";
let pos: usize = 4;
// Note: split_at panics if `pos` is not on a char boundary.
let (left, _) = input.split_at(pos);
for (offset, chr) in left.char_indices().rev() {
    if chr == '\n' {
        break;
    }
    // ...
}
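The backwards scan can also be wrapped into a small function for error reporting. The following is only a sketch under stated assumptions: `line_and_column` is a hypothetical name, lines are assumed to be separated by `'\n'`, both results are 1-based, and the column counts code points rather than bytes.

```rust
/// Returns the 1-based (line, column) of the character that starts at
/// `byte_pos`, or None if `byte_pos` is out of range or falls in the
/// middle of a multi-byte character. (Hypothetical helper.)
fn line_and_column(text: &str, byte_pos: usize) -> Option<(usize, usize)> {
    // is_char_boundary is false for indices past the end of the text too.
    if !text.is_char_boundary(byte_pos) {
        return None;
    }
    let before = &text[..byte_pos];
    // Byte offset at which the current line starts.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    // Every newline before the position starts a new line.
    let line = before.matches('\n').count() + 1;
    // Count code points between the line start and the position.
    let column = before[line_start..].chars().count() + 1;
    Some((line, column))
}

fn main() {
    let text = "ab\ncé€d";
    // '€' starts at byte offset 6, on the second line.
    println!("{:?}", line_and_column(text, 6));
}
```

Returning `None` instead of panicking keeps a bad offset from crashing the error reporter itself, which is one possible design choice; `split_at` would panic in the same situation.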

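Matthieu M. pointed out a caveat: iterating over and counting Unicode code points does not necessarily correspond to what one would intuitively consider a single letter of a script, because several code points may combine into one grapheme. An example can be found in the documentation of chars().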
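A small illustration of that caveat, in the spirit of the example in the chars() documentation: the text below renders as a single letter but consists of two code points. The function name `code_point_count` is hypothetical. Grouping code points into grapheme clusters is not covered by the standard library, so it would need an external crate (for instance `unicode-segmentation`), which this sketch does not use.

```rust
/// Counts Unicode code points, which is what chars() iterates over.
/// (Hypothetical helper.)
fn code_point_count(text: &str) -> usize {
    text.chars().count()
}

fn main() {
    // 'y' followed by U+0306 COMBINING BREVE: displayed as one letter.
    let text = "y\u{0306}";
    println!("bytes: {}", text.len());
    println!("code points: {}", code_point_count(text));
    for (offset, chr) in text.char_indices() {
        println!("{offset}: U+{:04X}", chr as u32);
    }
}
```

A column number computed by counting code points can therefore be larger than the number of letters a reader sees on the line.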


2016-11-04 07:31:31 Fabian
