如何在Python中創建GSM-7編碼？

GSM-7字符集被定義爲基本映射表+擴展字符映射表（https://en.wikipedia.org/wiki/GSM_03.38#GSM_7-bit_default_alphabet_and_extension_table_of_3GPP_TS_23.038_.2F_GSM_03.38）。含義u'@'應映射到b'\x00'（長度爲1的字節串），但u'['應映射到b'\x1b<'或b'\x1b\x3c'（長度爲2的字節串）。如何在Python中創建GSM-7編碼？

我已經設法通過擴展encoding_table來使編碼部分工作，但我不知道該怎麼做與decoding_table ..？

下面是完整的編解碼器的樣板：

import codecs 
from encodings import normalize_encoding 

class GSM7Codec(codecs.Codec): 
    def encode(self, input, errors='strict'): 
     return codecs.charmap_encode(input, errors, encoding_table) 

    def decode(self, input, errors='strict'): 
     return codecs.charmap_decode(input, errors, decoding_table) 

class GSM7IncrementalEncoder(codecs.IncrementalEncoder): 
    def encode(self, input, final=False): 
     return codecs.charmap_encode(input, self.errors, encoding_table)[0] 

class GSM7IncrementalDecoder(codecs.IncrementalDecoder): 
    def decode(self, input, final=False): 
     return codecs.charmap_decode(input, self.errors, decoding_table)[0] 

class GSM7StreamWriter(codecs.Codec, codecs.StreamWriter): pass 
class GSM7StreamReader(codecs.Codec, codecs.StreamReader): pass 

_cache = {} 

def search_function(encoding): 
    """Register the gsm-7 encoding with Python's codecs API. This involves 
     adding a search function that takes in an encoding name, and returns 
     a codec for that encoding if it knows one, or None if it doesn't. 
    """ 
    if encoding in _cache: 
     return _cache[encoding] 
    norm_encoding = normalize_encoding(encoding) 
    if norm_encoding in ('gsm_7', 'g7', 'gsm7'): 
     cinfo = codecs.CodecInfo(
      name='gsm-7', 
      encode=GSM7Codec().encode, 
      decode=GSM7Codec().decode, 
      incrementalencoder=GSM7IncrementalEncoder, 
      incrementaldecoder=GSM7IncrementalDecoder, 
      streamreader=GSM7StreamReader, 
      streamwriter=GSM7StreamWriter, 
     ) 
     _cache[norm_encoding] = cinfo 
     return cinfo 
    return None 

codecs.register(search_function)

，這裏是表的定義：

decoding_table = (
    u"@£$¥èéùìòÇ\nØø\rÅå" + 
    u"Δ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ" + 
    u" !\"#¤%&'()*+,-./" + 
    u":;<=>?" + 
    u"¡ABCDEFGHIJKLMNO" + 
    u"PQRSTUVWXYZÄÖÑÜ§" + 
    u"¿abcdefghijklmno" + 
    u"pqrstuvwxyzäöñüà" 
) 

encoding_table = codecs.charmap_build(
    decoding_table + '\0' * (256 - len(decoding_table)) 
) 

# extending the encoding table with extension characters 
encoding_table[ord(u'|')] = '\x1b\x40' 
encoding_table[ord(u'^')] = '\x1b\x14' 
encoding_table[ord(u'€')] = '\x1b\x65' 
encoding_table[ord(u'{')] = '\x1b\x28' 
encoding_table[ord(u'}')] = '\x1b\x29' 
encoding_table[ord(u'[')] = '\x1b\x3C' 
encoding_table[ord(u'~')] = '\x1b\x3D' 
encoding_table[ord(u']')] = '\x1b\x3E' 
encoding_table[ord(u'\\')] = '\x1b\x2F'

編碼部分現在的作品，但解碼並不：

>>> u'['.encode('g7') 
'\x1b<' 
>>> _.decode('g7') 
u'\x1b<' 
>>>

我一直無法找到有關編寫編碼的文檔的好資源。

來源

2016-06-12 thebjorn

工作，你不能用一個表來這裏做解碼。你必須創建你自己的函數來在這裏進行解碼（逐個字符地處理以找到多字節序列）。 –

感謝@MartijnPieters我根據您的評論創建了一個答案。 – thebjorn

基於@Martijn Pieters的評論，我已經改變了代碼：

def decode_gsm7(txt, errors): 
    ext_table = { 
     '\x40': u'|', 
     '\x14': u'^', 
     '\x65': u'€', 
     '\x28': u'{', 
     '\x29': u'}', 
     '\x3C': u'[', 
     '\x3D': u'~', 
     '\x3E': u']', 
     '\x2F': u'\\', 
    } 
    chunks = filter(None, txt.split('\x1b')) # split on ESC 
    res = u'' 
    for chunk in chunks: 
     res += ext_table[chunk[0]] # first character after ESC 
     if len(chunk) > 1: 
      # charmap_decode returns a tuple.. 
      decoded, _ = codecs.charmap_decode(chunk[1:], errors, decoding_table) 
      res += decoded 
    return res, len(txt) 


class GSM7Codec(codecs.Codec): 

    def encode(self, txt, errors='strict'): 
     return codecs.charmap_encode(txt, errors, encoding_table) 

    def decode(self, txt, errors='strict'): 
     return decode_gsm7(txt, errors)

這似乎:-)

來源

2016-06-12 15:06:04 thebjorn

如何在Python中創建GSM-7編碼？

回答

相關問題