2012-08-16 23 views

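This is the first optimized version of a tokenizer I have written, and it works well. A secondary tokenizer can parse the output of this function to create classified tokens with greater specificity. How can this simple tokenizer be rewritten to use regular expressions?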

def tokenize(source): 
    return (token for token in (token.strip() for line 
      in source.replace('\r\n', '\n').replace('\r', '\n').split('\n') 
      for token in line.split('#', 1)[0].split(';')) if token) 

My question is this: how can this be written simply with the re module? Below is my unsuccessful attempt.

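import re

def tokenize2(string):
    # Non-working attempt: group 2 sits inside a repeated group,
    # so only its last repetition is kept.
    search = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)
    for match in search.finditer(string):
        for item in match.groups():
            yield item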

Edit: This is the kind of output I want from the tokenizer. Parsing the text should be easy.

>>> def tokenize(source): 
    return (token for token in (token.strip() for line 
      in source.replace('\r\n', '\n').replace('\r', '\n').split('\n') 
      for token in line.split('#', 1)[0].split(';')) if token) 

>>> for token in tokenize('''\ 
a = 1 + 2; b = a - 3 # create zero in b 
c = b * 4; d = 5/c # trigger div error

e = (6 + 7) * 8 
# try a boolean operation 
f = 0 and 1 or 2 
a; b; c; e; f'''): 
    print(repr(token)) 


'a = 1 + 2' 
'b = a - 3'
'c = b * 4'
'd = 5/c'
'e = (6 + 7) * 8' 
'f = 0 and 1 or 2' 
'a' 
'b' 
'c' 
'e' 
'f' 
>>> 

Would adding the regex match to the 'if' clause at the end of your generator comprehension do it? – tMC 2012-08-16 19:00:24


No. One of the problems is that a statement like 'a; b; c' only returns ('a', 'c'), while 'a#b' returns ('a', None). – 2012-08-16 19:04:22
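
The behaviour described in this comment can be reproduced with a short check. This is only a sketch using the pattern from the question; the variable names are illustrative. A capturing group inside a repeated group keeps just its last repetition, and a group that never takes part in the match is reported as None.

```python
import re

# The pattern from the question's tokenize2 attempt.
pattern = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)

# Group 2 is overwritten on every repetition, so the middle
# statement ' b' is lost and only the last one survives.
groups_statements = pattern.match('a; b; c').groups()
print(groups_statements)  # ('a', ' c')

# The repeated part never matches here, so group 2 is None.
groups_comment = pattern.match('a#b').groups()
print(groups_comment)  # ('a', None)
```

Because the number of statements per line is not fixed, a fixed set of groups in one line-level match cannot return all of them; matching one statement per match avoids the problem.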

Answers

1

I may be way off here:

>>> import re
>>> def tokenize(source):
...     search = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)
...     return (token.strip() for line in source.split('\n') if search.match(line)
...             for token in line.split('#', 1)[0].split(';') if token)
... 
>>> 
>>> 
>>> for token in tokenize('''\ 
... a = 1 + 2; b = a - 3 # create zero in b 
... c = b * 4; d = 5/c # trigger div error
... 
... e = (6 + 7) * 8 
... # try a boolean operation 
... f = 0 and 1 or 2 
... a; b; c; e; f'''): 
...  print(repr(token)) 
... 
'a = 1 + 2' 
'b = a - 3' 
'c = b * 4' 
'd = 5/c' 
'e = (6 + 7) * 8' 
'f = 0 and 1 or 2' 
'a' 
'b' 
'c' 
'e' 
'f' 
>>> 

If applicable, I would keep the re.compile call outside the def scope.
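
A minimal sketch of that suggestion follows, assuming the pattern never changes between calls. The name LINE_PATTERN is made up for illustration and does not come from the answer. Python's re module also caches recently compiled patterns, so the practical gain may be small; the main benefit is arguably readability.

```python
import re

# Compiled once at import time instead of on every call.
# LINE_PATTERN is an illustrative name, not part of the original answer.
LINE_PATTERN = re.compile(r'^(.+?)(?:;(.+?))*?(?:#.+)?$', re.MULTILINE)

def tokenize(source):
    # Same logic as the answer above; only the compile step has moved.
    return (token.strip() for line in source.split('\n') if LINE_PATTERN.match(line)
            for token in line.split('#', 1)[0].split(';') if token)

print(list(tokenize('a = 1; b = 2 # c\n\n# x\nd')))  # ['a = 1', 'b = 2', 'd']
```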


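Thanks! I was hoping to do all of the tokenizing with a single regular expression, but the code works fine. Others are still welcome to write 'lambda source: re.finditer(PATTERN, source, FLAGS)', where they define PATTERN and FLAGS. It would be a good learning experience. – 2012-08-16 19:54:57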
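
One way the single-pattern idea from this comment might look is sketched below. PATTERN and FLAGS follow the comment's naming; tokenize3 is a hypothetical name. The sketch assumes that comments have to be matched and then discarded, so that the scanner never starts a match inside one, which is why a small filter remains after re.finditer. It is not a bare finditer call.

```python
import re

# In VERBOSE mode a literal '#' must be escaped outside a character class.
PATTERN = r'''
    \#[^\r\n]*                    # a comment: consumed, but never captured
  | ( [^\s;#]                     # group 1: a statement starts on a non-blank character
      (?: [^;#\r\n]* [^\s;#] )?   # and, if longer, also ends on one
    )
'''
FLAGS = re.VERBOSE

def tokenize3(source):
    # Comment matches leave group 1 unset (None); keep only statement matches.
    matches = re.finditer(PATTERN, source, FLAGS)
    return (m.group(1) for m in matches if m.group(1))

print(list(tokenize3('a = 1; b = 2 # set b\n# done\nc')))  # ['a = 1', 'b = 2', 'c']
```

Because group 1 must begin and end on a non-whitespace character, the tokens come out already stripped, and separators, blank lines, and line endings are simply never matched.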


Shouldn't you '.strip()' the return value? – Ben 2012-08-16 20:01:17

1

Here is one based on your tokenize2 function:

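import re

def tokenize2(source):
    # Capture each run of characters up to a ';', '#', or newline;
    # a trailing comment is matched but not captured.
    search = re.compile(r'([^;#\n]+)[;\n]?(?:#.+)?', re.MULTILINE)
    for match in search.finditer(source):
        for item in match.groups():
            yield item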

>>> for token in tokenize2('''\ 
... a = 1 + 2; b = a - 3 # create zero in b 
... c = b * 4; d = 5/c # trigger div error
... 
... e = (6 + 7) * 8 
... # try a boolean operation 
... f = 0 and 1 or 2 
... a; b; c; e; f'''): 
...  print(repr(token)) 
... 
'a = 1 + 2' 
' b = a - 3 ' 
'c = b * 4' 
' d = 5/c ' 
'e = (6 + 7) * 8' 
'f = 0 and 1 or 2' 
'a' 
' b' 
' c' 
' e' 
' f' 
>>>
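
The output above still carries leading and trailing whitespace. The sketch below strips each captured item and drops empty ones, using the pattern from this answer; ANSWER_PATTERN and tokenize2_stripped are illustrative names. It also shows one limitation that appears to follow from the pattern: a comment is only skipped when it directly follows a statement or its newline, so a comment line after a blank line leaks its text as a token.

```python
import re

# The pattern from the answer above, compiled once.
ANSWER_PATTERN = re.compile(r'([^;#\n]+)[;\n]?(?:#.+)?', re.MULTILINE)

def tokenize2_stripped(source):
    for match in ANSWER_PATTERN.finditer(source):
        token = match.group(1).strip()
        if token:  # skip whitespace-only captures
            yield token

# Whitespace around statements is removed.
print(list(tokenize2_stripped('a; b # note\nc')))  # ['a', 'b', 'c']

# Limitation: after a blank line the '#' is skipped by the scanner,
# and the comment text that follows it is captured as a token.
print(list(tokenize2_stripped('a\n\n# note\nb')))  # ['a', 'note', 'b']
```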