Python unicode issue in both Python 2 and Python 3

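I have a set of python scripts (https://github.com/hvdwolf/wikiscripts) that parse wikidumps into gpx/osm/csv/sql/sqlite dumps to be used as POI files in navigation apps. I only parse the articles that have coordinates. For this I use the externallinks dumps, which contain sql insert statements. The sql statements containing a "geohack.php" substring contain the coordinates. I import these into an sqlite database to be used as a reference for the article dumps. They are all utf-8 dumps, and parsing all "western type" files works fine, but languages like Arabic, Farsi, Russian, Japanese, Greek, Chinese and others don't work. Obviously I'm doing something wrong.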

The title strings I get are:

%D9%85%D8%A7%D9%81%D8%B8%D8%A9_%D8%A7%D9%84%D8%A8%D8%AF%D8%A7%D8%A6%D8%B9
%D8%A3%D9%88%D8%B1%D9%8A%D9%88%D9%8A%D9%84%D8%A7
Battle_of_Nicopolis
Qingdao

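So some normal characters are OK. The rest is gibberish (to me). I have already done some tests where I simply read the dump and write it to a utf-8 encoded text file (line in => line out), and that works fine, but somewhere in the string handling functions and the "re" functions my unicode text gets changed.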

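Edit: My python script starts with: # -*- coding: utf-8 -*-
My code (relevant part, including python2 and python3 statements, with some commented-out lines to show what I have already tried):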

with gzip.open(externallinks_file, 'r') as single_externallinksfile:
    #reader = codecs.getreader("utf-8")
    #single_externallinksfile = reader(single_externallinksfile)
    #with codecs.getreader('utf-8')gzip.open(externallinks_file, 'r') as single_externallinksfile:
    linecounter = 0
    totlinecounter = 0
    filelinecounter = 0
    # We need to read line by line as we have massive files, sometimes multiple GBs
    for line in single_externallinksfile:
        if sys.version_info<(3,0,0):
            line = unicode(line, 'utf-8')
        else:
            line = line.decode("utf-8")
        if "INSERT INTO" in line:
            insert_statements = line.split("),(")
            for statement in insert_statements:
                #statement = statement.decode("utf-8")
                filelinecounter += 1
                #if ("geohack.php?" in statement) and (("pagename" in statement) or ("src=" in statement)):
                # src can also be in the line, but is different and we leave it out for now
                if ("geohack.php?" in statement) and ("pagename" in statement) and ("params" in statement):
                    language = ""
                    region = ""
                    poitype = ""
                    content = re.findall(r'.*?pagename=(.*?)\'\,\'',statement,flags=re.IGNORECASE)
                    if len(content) > 0: # We even need this check due to corrupted lines
                        splitcontent = content[0].split("&")
                        title = splitcontent[0]
                        #title = title.decode('utf8')
                        for subcontent in splitcontent:
                            if "language=" in subcontent:
                                language = subcontent.replace("language=","")
                                #print('taal is: ' + language)
                            if "params=" in subcontent:
                                params_string = subcontent.replace("params=","").split("_")
                                latitude,longitude,poitype,region = get_coordinates_type_region(params_string)
                        if (str(latitude) != "" and str(longitude) != "" and (str(latitude) != "0") or (str(longitude) != "0")):
                            if GENERATE_SQL == "YES":
                                sql_file.write('insert into ' + file_prefix + '_externallinks values ("' + title + '","' + str(latitude) + '","' + str(longitude) + '","' + language + '","' + poitype + '","' + region + '");\n')
                            if CREATE_SQLITE == "YES":
                                sqlcommand = 'insert into ' + file_prefix + '_externallinks values ("' + title + '","' + str(latitude) + '","' + str(longitude) + '","' + language + '","' + poitype + '","' + region +'");'
                                #print(sqlcommand)
                                cursor.execute(sqlcommand)
                            linecounter += 1
                            if linecounter == 10000:
                                if CREATE_SQLITE == "YES":
                                    # Do a database commit every 10000 rows
                                    wikidb.commit()
                                totlinecounter += linecounter
                                linecounter = 0
                                print('\nProcessed ' + str(totlinecounter) + ' lines out of ' + str(filelinecounter) + ' sql line statements. Elapsed time: ' + str(datetime.datetime.now().replace(microsecond=0) - start_time))

2015-05-09 Harry van der Wolf

It looks like the titles are percent-encoded.

try: 
    # Python 3 
    from urllib.parse import unquote 
except ImportError: 
    # Python 2 
    from urllib import unquote 

percent_encoded = ''' 
%D9%85%D8%A7%D9%81%D8%B8%D8%A9_%D8%A7%D9%84%D8%A8%D8%AF%D8%A7%D8%A6%D8%B9 
%D8%A3%D9%88%D8%B1%D9%8A%D9%88%D9%8A%D9%84%D8%A7 
Battle_of_Nicopolis 
Qingdao 
''' 
print(unquote(percent_encoded))

yields

مافظة_البدائع 
أوريويلا 
Battle_of_Nicopolis 
Qingdao
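
Applied to the parsing loop in the question, the same idea could be wrapped in a small helper that is called on each extracted title before it is written out. This is only a sketch: decode_title is a hypothetical name that does not exist in the original scripts, and it assumes the pagename values are UTF-8 bytes that were percent-encoded.

```python
import sys

try:
    # Python 3
    from urllib.parse import unquote
except ImportError:
    # Python 2
    from urllib import unquote


def decode_title(raw_title):
    """Turn a percent-encoded pagename into readable text.

    Hypothetical helper for illustration; assumes UTF-8 under the escapes.
    """
    if sys.version_info < (3, 0, 0):
        # Python 2: work on a byte string, then decode the result as UTF-8
        if isinstance(raw_title, unicode):
            raw_title = raw_title.encode('utf-8')
        return unquote(raw_title).decode('utf-8')
    # Python 3: unquote decodes the escaped bytes as UTF-8 by default
    return unquote(raw_title)


title = decode_title('%D8%A3%D9%88%D8%B1%D9%8A%D9%88%D9%8A%D9%84%D8%A7')
print(title)
```

In the question's code this would presumably replace the plain assignment, for example title = decode_title(splitcontent[0]). Titles that contain no escapes, such as Battle_of_Nicopolis, pass through unchanged.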

2015-05-09 11:20:05 unutbu

Thank you very much. That did it! I tried many decode/encode options, but I had never heard of percent-encoding.

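@HarryvanderWolf: [percent-encoding](https://tools.ietf.org/html/rfc3986#section-2.1) is commonly used in URLs. In the past, the very similar (`'%20'` -> `'+'`) `application/x-www-form-urlencoded` content type was often used to submit content via web forms (over http). – jfs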
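
To illustrate the difference mentioned in the comment above, here is a small sketch using the Python 3 standard library (on Python 2 the same functions live in urllib). The sample text is made up for the example.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus  # Python 3

text = 'Battle of Nicopolis'

path_style = quote(text)       # plain percent-encoding: space becomes %20
form_style = quote_plus(text)  # form-style encoding: space becomes +

print(path_style)
print(form_style)

# unquote leaves '+' untouched; unquote_plus turns it back into a space
print(unquote(form_style))
print(unquote_plus(form_style))
```

For the wiki titles in the question, plain unquote looks like the safer choice, since a literal + in a title should stay a +.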

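I knew about %20 and other encodings in URLs. I had just never connected "a few of these characters" with whole sentences made up of nothing but these characters.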
