Assembly AT&T x86 - How to compare specific bytes in a long?

I am writing a function in assembly that takes an unsigned long. That long is a UTF-8 character.

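I want to check whether it is a 1-, 2-, 3-, or 4-byte UTF-8 character. So far I have this (I changed the code so that it is not affected by endianness, I think...):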

movl 12(%ebp),%eax # Move long u to %eax 
movl %eax,buff  # Move long u to buff 
andl $128,buff  # &-mask 1 MSB (from LSByte) 
cmpl $0,buff   # Compare buff to 0 
je  wu8_1byte  # If 0, 1 byte UTF8 

movl 12(%ebp),%eax # Move long u to %eax 
movl %eax,buff  # Move long u to buff 
andl $0xE000,buff # &-mask 3 MSB (from byte LSByte 2) 
cmpl $0xC000,buff # Compare the 3 MSB to binary 110 
je  wu8_2byte  # If =, 2 byte UTF8 

movl 12(%ebp),%eax # Move long u to %eax 
movl %eax,buff  # Move long u to buff 
andl $0xF00000,buff # &-mask 4 MSB (from byte MSByte 3) 
cmpl $0xE00000,buff # Compare the 4 MSB to binary 1110 
je  wu8_3byte  # If =, 3 byte UTF8 

jmp  wu8_4byte  # If no, 4 byte UTF8

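12(%ebp) is the long I want to work with. buff is a 4-byte variable.

It works for 1 byte, but not for the others.

Any tips on how to figure out which kind of UTF-8 character it is?

UTF-8 encoding: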

      0xxxxxxx # 1 byte 
        110xxxxx 10xxxxxx # 2 byte 
     1110xxxx 10xxxxxx 10xxxxxx # 3 byte 
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx # 4 byte


2016-05-08 Eivind ubersloth Furuberg

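When you say "this is a UTF-8 character", is it encoded in big-endian form? Because that would be unusual for Intel.

I'm pretty sure what endianness it is. I compile it with gcc -m32 and test the assembly function with a C program.

Well, the fact that you shift by 13 bits to test part of the leading byte tells me you believe the UTF-8 byte sequence is stored in big-endian order. Your system's architecture is little-endian. That's why I find it odd.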

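The reason it cannot work in any of these cases is simple.

You take a 32-bit value and shift it right. Then you compare it to a constant, forgetting that there are more bits left than the ones you are comparing.

You have to AND the value so that only the bits you want remain: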

movl 12(%ebp),%eax 
movl %eax,buff 
shrl $13,buff  # UTF-8 2 byte looks like 110xxxxx 10xxxxxx 
andl $7,buff  # Keep only the three lowest bits 
cmpl $6,buff  # After shifting 13 places right, check 
je wu8_2byte  # whether buff = 6 (binary 110 = 6)

I would also process it in a register rather than in a memory location, to make it faster. You can also do this without any shifts.
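To make the "no shifts" remark concrete, here is a minimal C sketch of both forms of the 2-byte check. It assumes the layout implied by the question's masks, which the asker has not confirmed: the lead byte 110xxxxx sits in bits 8..15 of the value (for example 0xC3A9 for U+00E9). The function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout (inferred from the question's masks, not confirmed):
   a 2-byte character is packed with its lead byte 110xxxxx in bits 8..15,
   e.g. 0xC3A9 for U+00E9. */

/* Shift first, then keep the three lowest bits and compare with binary 110. */
static bool is_two_byte_shift(uint32_t u)
{
    return ((u >> 13) & 7u) == 6u;
}

/* Same check with no shift: mask bits 13..15 in place and compare. */
static bool is_two_byte_mask(uint32_t u)
{
    return (u & 0xE000u) == 0xC000u;
}
```

Both functions work on a plain value, so a compiler is free to keep everything in a register; the masked form maps onto a single AND followed by a compare.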


2016-05-08 15:22:48

I thought longs were zero-filled from left to right. Anyway, I tried your solution, and it didn't work either.

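Depending on how much error checking you want to do, you can simply test the bits with the test instruction. I am assuming that the unsigned long was loaded from the UTF-8 encoded byte sequence with the least significant byte first, which should be the same result as aliasing a char* as an unsigned long* on a little-endian machine.

If these assumptions are wrong, you may need to change the code accordingly - and it may get more complicated, because you may not know which byte is the leading byte.

For example: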

movl 12(%ebp),%eax 
testl $128,%eax 
jz wu8_1byte 
testl $32,%eax  # We know that the top bit is set, it's not valid for it to be 
        # 10xxxxxx so we test this bit: 11?xxxxx 
jz wu8_2byte 
testl $16,%eax  # 111?xxxx 
jz wu8_3byte 
# Must be 4 byte 
jmp wu8_4byte
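For readers who find C easier to follow, the same lead-byte test can be sketched as below. This is only an illustration under the assumption stated above (first encoded byte in the least significant position); pack_lsb_first and utf8_len_lsb_first are hypothetical names, and no validation of continuation bytes is attempted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack up to 4 encoded bytes into a 32-bit value with
   the first byte in the least significant position, as a little-endian
   load of the byte sequence would. */
static uint32_t pack_lsb_first(const unsigned char *bytes, size_t n)
{
    uint32_t u = 0;
    for (size_t i = 0; i < n && i < 4; i++)
        u |= (uint32_t)bytes[i] << (8 * i);
    return u;
}

/* Mirrors the three test/jz pairs: only the lead byte (bits 0..7) is
   inspected, and a stray continuation byte 10xxxxxx is not rejected. */
static int utf8_len_lsb_first(uint32_t u)
{
    if ((u & 0x80u) == 0) return 1;  /* 0xxxxxxx */
    if ((u & 0x20u) == 0) return 2;  /* 110xxxxx */
    if ((u & 0x10u) == 0) return 3;  /* 1110xxxx */
    return 4;                        /* 11110xxx */
}
```

With the bytes E2 82 AC (the euro sign), pack_lsb_first yields 0xAC82E2 and the length test looks only at the low byte 0xE2.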

The following snippet makes the same assumptions as your original code.

movl 12(%ebp),%eax 

testl $0x80,%eax 
jz wu8_1byte 
        # We can assume that the last byte is of the form 10xxxxxx 
testl $0x4000,%eax # Testing this bit in byte n - 1: 1?xxxxxx 
jnz wu8_2byte 

testl $0x400000,%eax # Testing this bit in byte n - 2: 1?xxxxxx 
jnz wu8_3byte 
# Must be 4 byte 
jmp wu8_4byte
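A C rendering of this second layout may help when checking the masks by hand. It is a sketch only, assuming the last encoded byte is in bits 0..7 and the lead byte is in the highest occupied byte (0xC3A9, 0xE282AC, 0xF09F9880). It tests just bit 6 of each byte, the "?" in 1?xxxxxx, because a wider mask would also pick up payload bits of a continuation byte. utf8_len_lead_high is a made-up name.

```c
#include <stdint.h>

/* Assumed layout: last encoded byte in bits 0..7, lead byte in the highest
   occupied byte. Continuation bytes are 10xxxxxx and lead bytes are
   11xxxxxx, so bit 6 of a byte tells them apart. */
static int utf8_len_lead_high(uint32_t u)
{
    if ((u & 0x80u) == 0) return 1;  /* low byte is 0xxxxxxx: ASCII       */
    if (u & 0x4000u)      return 2;  /* byte 1 is a lead byte (11xxxxxx)  */
    if (u & 0x400000u)    return 3;  /* byte 2 is a lead byte (11xxxxxx)  */
    return 4;                        /* otherwise the lead is in byte 3   */
}
```

For instance, 0xE2B0AC has the continuation byte 0xB0 (binary 10110000) in bits 8..15; its bit 6 is clear, so the function moves on and reports 3.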


2016-05-08 19:02:38

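Other optimizations that would distract from your answer: 'testl $128,%eax' can be shortened by testing '%al' instead, or even by 'testb %al,%al' / 'jns' to branch on the high bit of the low 8 bits with no immediate constant. But 'test $imm8,%al' has a special opcode, so it is still a 2-byte instruction. Likewise, the 'testl' on the second byte could be written as a 'testb $imm8,%ah' to save a few more bytes of machine code. (Reading partial registers is always fine; it is writing partial registers that can cause slowdowns. Do avoid 'test $imm16,%ax', though: 16-bit operand size with an immediate causes an LCP stall.)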

I solved it by reading up on UTF-8 and finding a simpler solution:

cmpl $0x7F,12(%ebp)  # Compare unsigned long to 1 byte UTF-8 max value 
jbe  wu8_1byte 

cmpl $0x7FF,12(%ebp) # Compare unsigned long to 2 byte UTF-8 max value 
jbe  wu8_2byte 

cmpl $0xFFFF,12(%ebp) # Compare unsigned long to 3 byte UTF-8 max value 
jbe  wu8_3byte 

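cmpl $0x10FFFF,12(%ebp) # Compare unsigned long to 4 byte UTF-8 max value 
jbe  wu8_4byte

Because of the way UTF-8 encodes characters, a 1-byte character has a maximum value of 0x7F, a 2-byte character 0x7FF, a 3-byte character 0xFFFF, and a 4-byte character 0x10FFFF (the highest Unicode code point). So by comparing the unsigned long against these values, I can determine how many bytes the character needs in UTF-8.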
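The same threshold comparison can be sketched in C. This assumes the unsigned long holds a Unicode code point rather than already-encoded bytes, and it caps the 4-byte range at U+10FFFF as RFC 3629 does; utf8_bytes_needed is a hypothetical name, and returning 0 for out-of-range input is just one possible convention.

```c
#include <stdint.h>

/* Number of UTF-8 bytes needed for code point cp, or 0 if cp is above
   U+10FFFF. Surrogate code points (U+D800..U+DFFF) are not rejected
   in this sketch. */
static int utf8_bytes_needed(uint32_t cp)
{
    if (cp <= 0x7Fu)     return 1;
    if (cp <= 0x7FFu)    return 2;
    if (cp <= 0xFFFFu)   return 3;
    if (cp <= 0x10FFFFu) return 4;
    return 0;
}
```

The comparisons are unsigned, matching the jbe branches in the assembly.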


2016-05-08 20:08:11

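I am still not sure, though, where the actual data bytes are located in the unsigned long, for example: xxxxxxxx (byte0) xxxxxxxx (byte1) xxxxxxxx (byte2) xxxxxxxx (byte3), where 12(%ebp) is byte0 and 15(%ebp) is byte3. Where are the UTF-8 data bytes? Do they always start at byte0? As in 4-byte UTF-8: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx and 1-byte UTF-8: 0xxxxxxx 00000000 00000000 00000000?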
