Unable to process this regular expression

I have the following "greekSymbols.txt":

Α α alpha 
Β β beta 
Γ γ gamma 
Δ δ delta 
Ε ε epsilon 
Ζ ζ zeta 
Η η eta 
Θ θ theta 
Ι ι iota 
Κ κ kappa 
Λ λ lambda 
Μ μ mu 
Ν ν nu 
Ξ ξ xi 
Ο ο omicron 
Π π pi 
Ρ ρ rho 
Σ σ sigma 
Τ τ tau 
Υ υ upsilon 
Φ φ phi 
Χ χ chi 
Ψ ψ psi 
Ω ω omega

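I am trying to convert it into a tab-delimited plain text file for Anki. Each line should become two cards, where the front is the symbol (upper or lower case) and the back is the name. This is what I have so far.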

#!/usr/local/bin/python

import re

pattern = re.compile(r"(.)\s+(.)\s+(.+)", re.UNICODE)

input = open("./greekSymbols.txt", "r")
output = open("./greekSymbolsFormated.txt", "w+")

line = input.readline()
while line:
    string = line.rstrip()
    m = pattern.match(string)
    if m:
        output.write(m.group(1) + "\t" + m.group(3) + "\n")
        output.write(m.group(2) + "\t" + m.group(3) + "\n")
    else:
        print("I was unable to process line '" + string + "' [" + str(m) + "]")
    line = input.readline()

input.close()
output.close()

Unfortunately, I currently get the "I was unable to process..." message for every line, and the value of str(m) is None. What am I doing wrong?

> localhost:Anki stephen$ python ./convertGreekSymbols.py 
I was unable to process line 'Α α alpha' [None] 
I was unable to process line 'Β β beta' [None] 
...

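Source

2013-04-09 Stephen Cagle

I updated the regular expression with the changes suggested in the answers, but I still get no matches. I also stripped the newline, just in case that was causing something. – 2013-04-09 06:13:49

Do you know the encoding of the file? – 2013-04-09 06:22:56

You don't really need a regular expression for this: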

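with open("./greekSymbols.txt") as infile, \
     open("./greekSymbolsFormated.txt", "w+") as outfile:
    for line in infile:
        # Each line holds three whitespace-separated fields.
        up, low, name = line.split()
        # End each card with a newline so every card gets its own line.
        outfile.write("{0}\t{1}\n".format(up, name))
        outfile.write("{0}\t{1}\n".format(low, name))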

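If you want to stick with a regular expression, try the following one instead of yours (yours should work, IMO, but perhaps it is not explicit enough):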

pattern = re.compile(r"(\S+)\s+(\S+)\s+(.+)", re.UNICODE)

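Source

2013-04-09 05:43:41

Thanks. Part of the whole exercise here is learning regular expressions, though I do appreciate the help. – 2013-04-09 06:13:00

@StephenCagle: I added a regular expression suggestion. I would have expected your regular expression to work. Possibly the problem is that some characters are represented by multi-byte sequences in UTF-8 (which I assume you are using), and those are not matched by a single '.', although I would have expected it to work at the character level rather than the byte level. Since I'm not in a Unix environment, I can't test it here. – 2013-04-09 06:26:20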
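The byte-versus-character guess in the comment above can be checked with a short sketch. It assumes Python 3, where text and bytes are separate types (the thread itself uses Python 2), and the names `PATTERN_TEXT`, `PATTERN_BYTES`, `line_text` and `line_bytes` are made up for illustration.

```python
import re

# Same pattern as in the question; on text input, "." matches one character.
PATTERN_TEXT = re.compile(r"(.)\s+(.)\s+(.+)")
# Bytes version of the same pattern; here "." matches exactly one byte.
PATTERN_BYTES = re.compile(rb"(.)\s+(.)\s+(.+)")

line_text = "Α α alpha"                  # decoded text
line_bytes = line_text.encode("utf-8")   # the same line as raw UTF-8 bytes

text_match = PATTERN_TEXT.match(line_text)
bytes_match = PATTERN_BYTES.match(line_bytes)

print(text_match.groups())  # the capital letter, the small letter, the name
print(bytes_match)          # None: the capital alpha occupies two bytes
```

On the undecoded bytes, the first `(.)` consumes only the first byte of the capital letter, so the following `\s+` meets the second byte instead of whitespace and the match fails. That is consistent with a pattern like `(\S+)` being more forgiving, and with decoding the file first being the more robust fix.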

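It seems to me that it is the whitespace parsing that is wrong. Shouldn't it be (.)\s(.)\s(.+) rather than \t? There don't appear to be any tabs in your input.

Source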

2013-04-09 05:42:14 Dolda2000

I believe I do have tabs; it seems that pasting into HTML removed them? – 2013-04-09 06:08:33

You have \t where there is no tab; it should be \s:

>>> matcher = re.compile(r"(.)\s(.)\t(.+)", re.UNICODE) 
>>> phi = "Φ φ phi" 
>>> matcher.match(phi) 
>>> matcher = re.compile(r"(.)\s(.)\s+(.+)", re.UNICODE) 
>>> matcher.match(phi) 
<_sre.SRE_Match object at 0x1018d8290> 
>>>

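Source

2013-04-09 05:43:26

I can't argue with your logic, but I still get the error? – 2013-04-09 06:12:28

You could try it with \s+ (I updated it above). It may be that your tab is really several whitespace characters. If that doesn't work, can you paste a line to pastebin or something similar? – 2013-04-09 06:22:18

Here is the code that finally works. It turns out that the original file was UTF-8, which was what caused the problem. This is the working solution, which lets me create a delimited import file for Anki.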

#!/usr/local/bin/python

import re
import codecs

pattern = re.compile(r"(\S+)\s+(\S+)\s+(.+)", re.UNICODE)

input = codecs.open("./greekSymbols.txt", "r", encoding="utf-8")
output = codecs.open("./greekSymbolsFormated.txt", "w+", encoding="utf-8")

line = input.readline()
while line:
    string = line.rstrip()
    m = pattern.match(string)
    if m:
        output.write(unicode(m.group(1) + "\t" + m.group(3) + "\n"))
        output.write(unicode(m.group(2) + "\t" + m.group(3) + "\n"))
    else:
        print("I was unable to process line '" + string + "' [" + str(m) + "]")
    line = input.readline()

input.close()
output.close()

Source

2013-04-09 08:17:28
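For readers on Python 3, the same idea might look roughly like the sketch below. It is not the poster's code: it assumes Python 3, where the built-in `open` accepts an `encoding` argument, and the `convert` function and the temporary-directory demonstration are hypothetical additions for illustration.

```python
import re
import tempfile
from pathlib import Path

PATTERN = re.compile(r"(\S+)\s+(\S+)\s+(.+)")

def convert(src, dst):
    """Write two tab-separated cards per input line; return unmatched lines."""
    skipped = []
    # Declaring the encoding makes Python decode the bytes into text first,
    # so the regex sees whole characters rather than individual bytes.
    with open(src, encoding="utf-8") as infile, \
         open(dst, "w", encoding="utf-8") as outfile:
        for raw in infile:
            line = raw.strip()
            m = PATTERN.match(line)
            if m:
                upper, lower, name = m.groups()
                outfile.write(upper + "\t" + name + "\n")
                outfile.write(lower + "\t" + name + "\n")
            elif line:
                skipped.append(line)
    return skipped

# Small demonstration in a throwaway directory with two sample lines.
workdir = Path(tempfile.mkdtemp())
source = workdir / "greekSymbols.txt"
target = workdir / "greekSymbolsFormated.txt"
source.write_text("Α α alpha\nΒ β beta\n", encoding="utf-8")
skipped = convert(source, target)
result = target.read_text(encoding="utf-8")
print(result)
```

Returning the unmatched lines instead of printing them is a design choice of this sketch; it keeps the conversion reusable and lets the caller decide how to report problems.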
