Python - pyparsing Unicode characters

:) I tried using w = Word(printables), but it does not work. How should I specify this? 'w' is meant to handle Hindi characters (UTF-8).

The code specifies the grammar and parses accordingly.

671.assess :: अहसास ::2 
x=number + "." + src + "::" + w + "::" + number + "." + number

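If there are only English characters it works, so the code is correct for ASCII input, but it does not work for Unicode input.

I mean that the code works when the input has the following form: 671.assess :: ahsaas :: 2

That is, it parses words written in English (ASCII) characters, but I do not know how to parse and then print characters in Unicode form. I need this for English-Hindi word alignment.

The Python code is shown below: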

# -*- coding: utf-8 -*- 
from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables, Group, alphas8bit
# grammar 
src = Word(printables) 
trans = Word(printables) 
number = Word(nums) 
x=number + "." + src + "::" + trans + "::" + number + "." + number 
#parsing for eng-dict 
efiledata = open('b1aop_or_not_word.txt').read() 
eresults = x.scanString(efiledata)  # scanString yields (tokens, start, end) tuples, which the loop below indexes
edict1 = {} 
edict2 = {} 
counter=0 
xx=list() 
for result in eresults: 
    trans=""#translation string 
    ew=""#english word 
    xx=result[0] 
    ew=xx[2] 
    trans=xx[4] 
    edict1 = { ew:trans } 
    edict2.update(edict1) 
print len(edict2) #no of entries in the english dictionary 
print "edict2 has been created" 
print "english dictionary" , edict2 

#parsing for hin-dict 
hfiledata = open('b1aop_or_not_word.txt').read() 
hresults = x.scanString(hfiledata) 
hdict1 = {} 
hdict2 = {} 
counter=0 
for result in hresults: 
    trans=""#translation string 
    hw=""#hin word 
    xx=result[0] 
    hw=xx[2] 
    trans=xx[4] 
    #print trans 
    hdict1 = { trans:hw } 
    hdict2.update(hdict1) 

print len(hdict2) #no of entries in the hindi dictionary 
print"hdict2 has been created" 
print "hindi dictionary" , hdict2 
''' 
####################################################################################################################### 

def translate(d, ow, hinlist): 
    if ow in d.keys():#ow=old word d=dict 
        print ow , "exists in the dictionary keys" 
        transes = d[ow] 
        transes = transes.split() 
        print "possible transes for" , ow , " = ", transes 
        for word in transes: 
            if word in hinlist: 
                print "trans for" , ow , " = ", word 
                return word 
        return None 
    else: 
        print ow , "absent" 
        return None 

f = open('bidir','w') 
#lines = ["'\ 
#5# 10 # and better performance in business in turn benefits consumers . # 0 0 0 0 0 0 0 0 0 0 \ 
#5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI . # 0 0 0 0 0 0 0 0 0 0 0 \ 
#'"] 
data=open('bi_full_2','rb').read() 
lines = data.split('[email protected]#$%') 
loc=0 
for line in lines: 
    eng, hin = [subline.split(' # ') 
       for subline in line.strip('\n').split('\n')] 

    for transdict, source, dest in [(edict2, eng, hin), 
            (hdict2, hin, eng)]: 
     sourcethings = source[2].split() 
     for word in source[1].split(): 
      tl = dest[1].split() 
      otherword = translate(transdict, word, tl) 
      loc = source[1].split().index(word) 
      if otherword is not None: 
       otherword = otherword.strip() 
       print word, ' <-> ', otherword, 'meaning=good' 
       if otherword in dest[1].split(): 
        print word, ' <-> ', otherword, 'trans=good' 
        sourcethings[loc] = str(
         dest[1].split().index(otherword) + 1) 

     source[2] = ' '.join(sourcethings) 

    eng = ' # '.join(eng) 
    hin = ' # '.join(hin) 
    f.write(eng+'\n'+hin+'\n\n\n') 
f.close() 
'''

If an example input sentence pair from the source file is:

1# 5 # modern markets : confident consumers # 0 0 0 0 0 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 0 0 0 0 0 0 
[email protected]#$%

then the output looks like this:

1# 5 # modern markets : confident consumers # 1 2 3 4 5 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 1 2 3 4 5 0 
[email protected]#$%

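Explanation of the output: this implements bidirectional alignment. The first English word, "modern", maps to the first Hindi word, "AddhUnIk", and vice versa. Punctuation characters are treated as words here as well, because they also take part in the bidirectional mapping. So if you look at the Hindi word "." you will see it has a null alignment (0): the English sentence has no full stop, so it maps to nothing there. The third line of the output is simply a delimiter, needed when we process multiple sentence pairs for bidirectional mapping.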
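The alignment format described above can be illustrated with a small standalone sketch. It is written in Python 3 (the question's code is Python 2), the function name `fill_alignment` and the dictionary entries are hypothetical, and it is not the asker's implementation, only one way to produce the output shown.

```python
def fill_alignment(source_line, dest_line, dictionary):
    """Return source_line with its alignment field filled in.

    Each line has three ' # '-separated fields: an id, the sentence, and one
    alignment slot per token (0 means "not aligned").
    """
    source = source_line.strip().split(" # ")
    dest = dest_line.strip().split(" # ")
    source_tokens = source[1].split()
    dest_tokens = dest[1].split()
    slots = source[2].split()
    for position, token in enumerate(source_tokens):
        # A dictionary entry may list several space-separated candidates.
        for candidate in dictionary.get(token, "").split():
            if candidate in dest_tokens:
                # Store the 1-based position of the matching token.
                slots[position] = str(dest_tokens.index(candidate) + 1)
                break
    source[2] = " ".join(slots)
    return " # ".join(source)


# Hypothetical dictionary entries, for illustration only.
eng_to_hin = {"modern": "AddhUnIk", "markets": "baajaar", ":": ":",
              "confident": "AshHvsHt", "consumers": "upbhOkHtaa"}
hin_to_eng = {hindi: english for english, hindi in eng_to_hin.items()}

eng = "1# 5 # modern markets : confident consumers # 0 0 0 0 0"
hin = "1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 0 0 0 0 0 0"

aligned_eng = fill_alignment(eng, hin, eng_to_hin)
aligned_hin = fill_alignment(hin, eng, hin_to_eng)
print(aligned_eng)
print(aligned_hin)
```

The sketch uses `enumerate` for the slot position rather than looking the word up with `index`, so a word that occurs twice in a sentence still fills its own slot.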

What modifications should I make if my Hindi sentences are in Unicode (UTF-8) form?

2010-02-26 boddhisattva

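Please edit this question with proper formatting so that it is readable. –

As a general rule, do not process encoded byte strings: turn them into proper Unicode strings (by calling their .decode method) as soon as possible, always do your processing on Unicode strings, and then, if you must for I/O purposes, .encode them back into whatever byte string encoding you require.

If you are talking about literals, as it seems you are in your code, "as soon as possible" means at once: use u'...' to express your literals. In the more general case, where you are forced to do I/O in encoded form, decode immediately after input (and, if you need to produce output in a specific encoding, encode just before output).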
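As a minimal sketch of that decode-early, encode-late pattern, here is a Python 3 example (the thread itself uses Python 2, where the same `.decode` and `.encode` methods apply). The helper name `extract_translation` is hypothetical, and the input is assumed to be UTF-8.

```python
def extract_translation(raw_bytes, encoding="utf-8"):
    """Decode once at the boundary, then work on text only."""
    text = raw_bytes.decode(encoding)            # bytes -> Unicode text
    fields = [field.strip() for field in text.split("::")]
    return fields[1]                             # the middle field


# Bytes as they would arrive from a file opened in binary mode.
raw = "671.assess :: अहसास ::2".encode("utf-8")

hindi_word = extract_translation(raw)            # processing happens on text
encoded_for_output = hindi_word.encode("utf-8")  # encode only when writing out

print(len(hindi_word), len(encoded_for_output))
```

The two lengths differ because each Devanagari code point takes three bytes in UTF-8, which is why character-based processing on the raw bytes goes wrong.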

2010-02-26 06:08:08

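Hello sir.. :) Thank you for your answer.. What you said in the second paragraph applies exactly to my case.. I tried it in the following line of code: trans = u'Word(printables)', and I could not get the expected output. Please correct me if I modified the wrong line, because after making this change an error appears (in the line that defines the grammar, it expects printables at that position). – boddhisattva

@mgj, do not assign a unicode string literal to 'trans', that makes no sense. Just make sure 'printables' is a unicode object (**not** a utf8-encoded byte string, nor a byte string in any other encoding!), and use 'trans = Word(printables)'. If your _file_ is utf-8 encoded, or encoded in any other encoding, decode it with 'codecs.open' from the 'codecs' module instead of the built-in 'open' you are using now, so that each 'line' is a unicode object rather than a byte string (in whatever encoding). –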
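The file-reading advice in this comment could look roughly like the sketch below. It is written in Python 3 and assumes a UTF-8 file; `read_unicode_lines` and the temporary demo file are illustrative names, not part of the asker's program. In Python 3 the built-in `open(path, encoding="utf-8")` gives the same result.

```python
import codecs
import os
import tempfile


def read_unicode_lines(path, encoding="utf-8"):
    """Read a file through codecs.open so every line is decoded text."""
    with codecs.open(path, "r", encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle]


# Demo: write UTF-8 bytes to a temporary file, then read them back as text.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("671.assess :: अहसास ::2\n".encode("utf-8"))
    demo_path = tmp.name

lines = read_unicode_lines(demo_path)
os.remove(demo_path)
print(lines)
```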

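Pyparsing's printables only covers characters in the ASCII range. You want printables over the full Unicode range, like this:

import sys
unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode) 
             if not unichr(c).isspace())

Now you can define trans using this more complete set of non-space characters:

trans = Word(unicodePrintables)

I was not able to test this against your Hindi test string, but I think it will do the trick.

(If you are using Python 3, there is no separate unichr function and no xrange generator, so just use the version below.)

unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode) 
             if not chr(c).isspace())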
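To see why the wider character set matters, here is a standard-library-only Python 3 sketch. It does not call pyparsing; `leading_run` is a hypothetical stand-in that only imitates how a Word-style matcher consumes consecutive allowed characters, and `ascii_printables` is assumed to mirror pyparsing's `printables` (the non-whitespace characters of `string.printable`).

```python
import string
import sys

# Assumed equivalent of pyparsing's ASCII-only `printables`.
ascii_printables = "".join(c for c in string.printable if not c.isspace())

# The full-range set from the answer, Python 3 form.
unicode_printables = "".join(chr(c) for c in range(sys.maxunicode)
                             if not chr(c).isspace())


def leading_run(text, allowed):
    """Return the longest prefix of text made only of characters in allowed."""
    allowed = set(allowed)  # set membership keeps the per-character test cheap
    end = 0
    while end < len(text) and text[end] in allowed:
        end += 1
    return text[:end]


sample = "अहसास ::2"
ascii_match = leading_run(sample, ascii_printables)
unicode_match = leading_run(sample, unicode_printables)
print(repr(ascii_match), repr(unicode_match))
```

With the ASCII set nothing is consumed at the start of the Hindi word, which matches the failure described in the question; with the full-range set the whole word is consumed up to the space.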

2010-02-26 09:43:50 PaulMcG

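Thank you for your answer, sir.. :) – boddhisattva

This answer has long been out of date: Unicode is no longer 16-bit, and looping over everything is not performant. –

@flyingsheep - good tips, updated to use 'sys.maxunicode' instead of a hard-coded constant, so it tracks Python's 'sys' module. As for looping over everything, this bit runs only once, when the parser is first defined, and when it is used to create a pyparsing 'Word' it is stored as a set(), so parse-time performance is still pretty good. – PaulMcG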
