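Export the unidecode database of ASCII equivalents for international characters

How can I export the data of the unidecode Python module for use in other languages?

This module converts Unicode characters to Latin (ASCII) characters, roughly preserving the pronunciation, like this: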

kožušček => kozuscek
北亰 => Bei Jing
Москва => Moskva

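This is useful, for example, for creating URLs for international web pages. There are ports to other languages, such as UnidecodeSharp, but their quality is not very good.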
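As a rough illustration of the URL use case, here is a minimal sketch of turning already-transliterated text into a URL slug. The helper name `slugify` is hypothetical and not part of unidecode, and it assumes its input has already been converted to ASCII.

```python
import re


def slugify(ascii_text):
    """Turn already-transliterated ASCII text into a URL slug.

    Hypothetical helper for illustration; not part of unidecode.
    """
    # Collapse every run of characters other than a-z and 0-9 into one hyphen
    slug = re.sub(r'[^a-z0-9]+', '-', ascii_text.lower())
    # Drop hyphens left over at either end
    return slug.strip('-')


if __name__ == '__main__':
    for text in ('kozuscek', 'Bei Jing', 'Moskva'):
        print(slugify(text))
```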

2015-06-05 Tometzky

Here is a Python program, unidecode_sqlite.py, that exports the unidecode data to an SQLite database, which can then be used from every major language:

#!/usr/bin/env python

'''Export unidecode data to SQLite'''

from __future__ import print_function, unicode_literals

import inspect
import os
import re
import sqlite3
import sys
import unicodedata

import unidecode

# Python 2 compatibility: unichr() only exists on Python 2
try:
    unichr_ = unichr
except NameError:
    unichr_ = chr


def unidecode_sqlite(filename):
    '''Export unidecode data to filename'''

    if os.path.exists(filename):
        raise RuntimeError('File exists: %s' % filename)

    conn = sqlite3.connect(filename)
    conn.execute(
        '''create table if not exists unidecode (
            c text primary key,
            category text not null,
            ascii text not null
        )'''
    )

    # Directory holding the per-block data files (xNNN.py) of unidecode
    unidecode_path = os.path.dirname(inspect.getfile(unidecode))

    for data_file in sorted(os.listdir(unidecode_path)):
        if not os.path.isfile(os.path.join(unidecode_path, data_file)):
            continue
        data_file_match = re.match(
            r'^x([0-9a-f]{3})\.py$',
            data_file,
            re.IGNORECASE
        )
        if not data_file_match:
            continue
        # Each data file covers 0x100 consecutive code points
        section_start = int(data_file_match.group(1), 16) * 0x100
        for char_position in range(0x100):
            character = unichr_(section_start + char_position)
            unidecoded_character = unidecode.unidecode(character)
            # Skip characters that have no transliteration
            if unidecoded_character is None or unidecoded_character == '[?]':
                continue
            conn.execute(
                '''insert into unidecode (c, category, ascii)
                    values (?,?,?)''',
                (
                    character,
                    unicodedata.category(character),
                    unidecoded_character
                )
            )
    conn.commit()
    conn.execute('vacuum')
    conn.close()


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print('USAGE: %s FILE' % sys.argv[0])
        sys.exit(0)

    try:
        unidecode_sqlite(sys.argv[1])
    except (OSError, RuntimeError) as error:
        print('ERROR: %s' % error, file=sys.stderr)
        sys.exit(1)
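The mapping from a data file name to the code points it covers can be checked in isolation. The sketch below reuses the regular expression from the program; the function name `block_range` is hypothetical, and it assumes Python 3 and data files named the way that expression expects.

```python
import re


def block_range(data_file):
    """Return (first, last) code point for a data file name like 'x053.py'.

    Returns None for names that do not match the xNNN.py pattern.
    Hypothetical helper for illustration only.
    """
    match = re.match(r'^x([0-9a-f]{3})\.py$', data_file, re.IGNORECASE)
    if not match:
        return None
    # The three hex digits are the code point with its low byte removed
    start = int(match.group(1), 16) * 0x100
    return start, start + 0xFF


if __name__ == '__main__':
    first, last = block_range('x053.py')
    # ord('北') is 0x5317, so the character falls inside this block
    print(hex(first), hex(last), first <= ord('北') <= last)
```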

The program can be used like this on any computer with Python (2 or 3; I don't know about Windows), and it creates a 1.3 MB file:

virtualenv venv 
venv/bin/pip install unidecode 
venv/bin/python unidecode_sqlite.py unidecode.sqlite
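Reading the exported table back is a plain primary-key lookup per character. The following is only a sketch, assuming Python 3 and the table layout created by the program: `transliterate` and `sample_database` are hypothetical names, and the three rows are hand-entered sample data standing in for the real unidecode.sqlite file.

```python
import sqlite3


def transliterate(conn, text):
    """Replace each character of text with its ascii value from the table.

    Characters without a row in the table are kept unchanged.
    """
    result = []
    for character in text:
        row = conn.execute(
            'select ascii from unidecode where c = ?', (character,)
        ).fetchone()
        result.append(row[0] if row else character)
    return ''.join(result)


def sample_database():
    """Build an in-memory stand-in for unidecode.sqlite.

    Assumption: the rows below are hand-entered sample data; the real
    file holds the full export.
    """
    conn = sqlite3.connect(':memory:')
    conn.execute(
        '''create table unidecode (
            c text primary key,
            category text not null,
            ascii text not null
        )'''
    )
    conn.executemany(
        'insert into unidecode (c, category, ascii) values (?,?,?)',
        [('ž', 'Ll', 'z'), ('š', 'Ll', 's'), ('č', 'Ll', 'c')],
    )
    return conn


if __name__ == '__main__':
    # With the real export: conn = sqlite3.connect('unidecode.sqlite')
    conn = sample_database()
    print(transliterate(conn, 'kožušček'))
```

The same `select ascii from unidecode where c = ?` query should carry over to any language with an SQLite binding.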

2015-06-05 09:47:20 Tometzky

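Note that unidecode is under the GPL, which may rule out using the exported data in many applications. The original Perl module is under the Perl Artistic License. If possible, the actual data may be best gathered from the relevant Unicode publications, to avoid any licensing issues. – Joey

@Joey I am not using unidecode's code, only its output. IANAL, but I don't think the GPL covers program output. – Tometzky

You are essentially dumping all of its data, which is less program output and more a converted data file. IANAL, but that is where I would be careful. – Joey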
