How to check for a sequence of characters that does not match any regular expression

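I am currently trying to implement a lexical scanner that will later become part of a compiler. The program uses regular expressions to match the input program file. If a sequence of non-whitespace characters matches a regular expression, the matched part of the input is converted into a token, which is sent to the parser along with the other tokens. My code already outputs the correct tokens, but I also need the scanner to raise an exception (by calling the method no_token()) if it finds a sequence of non-whitespace characters that matches none of the given regular expressions. This is my first post here, so if you have any tips on how I can improve my posts, please let me know, and if you need more information about the question or the code, just ask.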

def get_token(self):
    '''Returns the next token and the part of input_string it matched.
    The returned token is None if there is no next token.
    The characters up to the end of the token are consumed.
    Raise an exception by calling no_token() if the input contains
    extra non-white-space characters that do not match any token.'''
    self.skip_white_space()
    # find the longest prefix of input_string that matches a token
    token, longest = None, ''
    for (t, r) in Token.token_regexp:
        match = re.match(r, self.input_string[self.current_char_index:])
        if match is None:
            self.no_token()
        elif match and match.end() > len(longest):
            token, longest = t, match.group()
    self.current_char_index += len(longest)
    return (token, longest)

As you can see, I tried using

if match is None: 
    self.no_token()

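However, this raises the exception at the very start and exits the program without returning any tokens; if I comment that part out, the code works fine. Clearly I need this check to raise an exception when non-whitespace characters match none of the regular expressions, otherwise it will cause problems at later stages of development.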

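The method skip_white_space() consumes all whitespace up to the next non-whitespace character, the regular expressions are stored in token_regexp, and self.input_string[self.current_char_index:] gives the remaining input starting at the current character.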
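For readers without the full class, here is a minimal sketch of what a method like skip_white_space() might look like. The Scanner class and its constructor are assumptions made for illustration; only the attribute names input_string and current_char_index come from the post.

```python
class Scanner:
    """Hypothetical minimal scanner; only the whitespace skipping is shown."""

    def __init__(self, input_string):
        self.input_string = input_string
        self.current_char_index = 0

    def skip_white_space(self):
        # Advance past spaces, tabs and newlines; stop at the first
        # non-whitespace character or at the end of the input.
        while (self.current_char_index < len(self.input_string)
               and self.input_string[self.current_char_index].isspace()):
            self.current_char_index += 1


scanner = Scanner("  \n\tz := 2;")
scanner.skip_white_space()
remaining = scanner.input_string[scanner.current_char_index:]
print(repr(remaining))  # 'z := 2;'
```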

The input program, stored in a .txt file, is:

z := 2; 
if z < 3 then 
    z := 1 
end

Without the call to no_token(), the output is:

ID z 

BEC 

NUM 2 

SEM 

IF 

ID z 

LESS 

NUM 3 

THEN 

ID z 

BEC 

NUM 1 

END

This is correct, but when I try to add the no_token() call I get:

lexical error: no token found at the start of z := 2; 
if z < 3 then 
    z := 1 
end

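This is what the no_token() method prints when there is a series of characters that matches none of the regular expressions I implemented in the scanner, but that is not the case for this input. All the character sequences here are valid.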
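A likely cause, assuming token_regexp is a list of (name, pattern) pairs tried in order: inside the loop, `match is None` only means that the current pattern failed, not that every pattern failed. The first pattern that does not match the leading `z` would therefore trigger no_token() before the matching pattern is ever tried. The token table below is hypothetical, since the real Token.token_regexp is not shown in the post.

```python
import re

# Hypothetical token table; the real Token.token_regexp is not shown.
token_regexp = [
    ("NUM", r"[0-9]+"),
    ("BEC", r":="),
    ("ID", r"[a-zA-Z]+"),
]

text = "z := 2;"

# Record, for each pattern, whether it matches at the start of the input.
results = [(name, re.match(pattern, text) is not None)
           for name, pattern in token_regexp]
print(results)
# [('NUM', False), ('BEC', False), ('ID', True)]
```

With this table, NUM fails on `z`, so a check placed inside the loop would raise on the first iteration even though ID matches later.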

Source

2016-05-10 saleem

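To answer your question, you can use [assertions](https://docs.python.org/3/library/re.html#regular-expression-syntax) if you really need to. A [minimal, complete, and verifiable example](/help/mvce) is much more useful for getting an answer than a really long explanation and a fair amount of irrelevant code. – Kupiakos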

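Thanks for the quick reply. I will read https://docs.python.org/2/library/re.html and see whether it helps. When you say a minimal, complete, and verifiable example, do you mean an example of the input and the expected output? I'm not sure how the code is irrelevant, since this is the only call to no_token() in the whole program and it is the cause of the error. – saleem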

An example of the input and the expected output is the bare minimum, yes. – Kupiakos

Got it all sorted. Cheers.

def get_token(self):
    '''Returns the next token and the part of input_string it matched.
    The returned token is None if there is no next token.
    The characters up to the end of the token are consumed.
    Raise an exception by calling no_token() if the input contains
    extra non-white-space characters that do not match any token.'''
    self.skip_white_space()
    # find the longest prefix of input_string that matches a token
    token, longest = None, ''
    for (t, r) in Token.token_regexp:
        match = re.match(r, self.input_string[self.current_char_index:])
        if match and match.end() > len(longest):
            token, longest = t, match.group()

    self.current_char_index += len(longest)
    # every pattern has been tried: if none matched and input remains,
    # the remaining characters do not form any token
    if token is None and self.current_char_index < len(self.input_string):
        self.no_token()
    return (token, longest)

is the final working code.
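To try the idea outside the original class, here is a self-contained sketch of the same approach: test every pattern first, and only report an error once none has matched and input remains. The token table, the LexicalError exception, the body of no_token() and the scan_all() helper are all assumptions made for illustration; the post shows none of them.

```python
import re


class LexicalError(Exception):
    """Hypothetical exception; the post does not show what no_token() raises."""


class Scanner:
    # Hypothetical token table standing in for Token.token_regexp.
    token_regexp = [
        ("IF", r"if"),
        ("THEN", r"then"),
        ("END", r"end"),
        ("ID", r"[a-zA-Z]+"),
        ("NUM", r"[0-9]+"),
        ("BEC", r":="),
        ("SEM", r";"),
        ("LESS", r"<"),
    ]

    def __init__(self, input_string):
        self.input_string = input_string
        self.current_char_index = 0

    def skip_white_space(self):
        while (self.current_char_index < len(self.input_string)
               and self.input_string[self.current_char_index].isspace()):
            self.current_char_index += 1

    def no_token(self):
        rest = self.input_string[self.current_char_index:]
        raise LexicalError("no token found at the start of " + rest)

    def get_token(self):
        self.skip_white_space()
        token, longest = None, ""
        rest = self.input_string[self.current_char_index:]
        for name, pattern in self.token_regexp:
            match = re.match(pattern, rest)
            # Only a strictly longer match wins, so on a tie the
            # earlier entry (for example a keyword) is kept.
            if match and match.end() > len(longest):
                token, longest = name, match.group()
        self.current_char_index += len(longest)
        # Only now, after all patterns were tried, can we conclude that
        # the remaining non-whitespace input matches nothing.
        if token is None and self.current_char_index < len(self.input_string):
            self.no_token()
        return token, longest


def scan_all(text):
    """Hypothetical helper: collect tokens until get_token() returns None."""
    scanner = Scanner(text)
    tokens = []
    while True:
        token, lexeme = scanner.get_token()
        if token is None:
            return tokens
        tokens.append((token, lexeme))


tokens = scan_all("z := 2;\nif z < 3 then\n    z := 1\nend")
print([name for name, _ in tokens])
```

In this sketch the keywords are listed before ID, so that `if`, `then` and `end` win the tie against the identifier pattern, while a longer word such as `iffy` would still be scanned as an ID.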

Source

2016-05-11 00:32:46 saleem
