Creating a list lexer

I need to create a lexer to handle input data of variable length and structure.

Say I have a list of reserved keywords:

keyWordList = ['command1', 'command2', 'command3']

and a string of user input:

userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
userInputList = userInput.split()

How would I go about writing a function like this:

INPUT: 

tokenize(userInputList, keyWordList) 

OUTPUT: 
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']

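I have written a tokenizer that can identify the keywords, but I have been unable to figure out an efficient way to nest the groups of non-keywords into lists one level deeper.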

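Regex solutions are welcome, but I would really like to see the underlying algorithm, because I will probably extend the application to lists of other objects, not just strings.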

2012-01-15 Joel Cornett

Try this:

keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
inputList = userInput.split()

def tokenize(userInputList, keyWordList):
    keywords = set(keyWordList)  # a set gives fast membership tests
    tokens, acc = [], []
    for e in userInputList:
        if e in keywords:
            # A keyword closes the current group of non-keywords.
            tokens.append(acc)
            tokens.append(e)
            acc = []
        else:
            acc.append(e)
    if acc:
        # Flush any non-keywords left after the last keyword.
        tokens.append(acc)
    return tokens

tokenize(inputList, keyWordList)
> [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
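Since the question mentions extending this to lists of objects other than strings, here is a minimal sketch of the same idea driven by a predicate instead of a keyword list. The name `tokenize_by` and the `is_keyword` parameter are hypothetical, not part of the answer above. It uses `itertools.groupby`, so unlike the loop above it never emits an empty group when two keywords are adjacent.

```python
from itertools import groupby

def tokenize_by(items, is_keyword):
    """Group runs of non-keywords into sublists; keep keywords at top level.

    `is_keyword` is a hypothetical predicate supplied by the caller.
    """
    tokens = []
    # groupby splits the input into consecutive runs sharing the same key.
    for matched, group in groupby(items, key=lambda x: bool(is_keyword(x))):
        if matched:
            tokens.extend(group)        # each keyword stays a top-level item
        else:
            tokens.append(list(group))  # a run of non-keywords becomes a sublist
    return tokens

# Works for any objects, here integers where negatives act as "keywords".
result = tokenize_by([1, 2, -1, 3, -2, -3, 4], lambda n: n < 0)
print(result)  # [[1, 2], -1, [3], -2, -3, [4]]
```

Passing `lambda w: w in keywords` as the predicate would give the string case from the question.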

2012-01-15 00:47:26

I actually came up with something similar, but yours is a bit more elegant. – 2012-01-15 01:27:21

This is easy to do with a regular expression:

>>> import re
>>> reg = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(keyWordList)
>>> userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3' 
>>> re.findall(reg, userInput) 
[('The quick brown', 'command1'), ('fox jumped over', 'command2'), ('the lazy dog', 'command3')]

Now you just need to split the first element of each tuple.
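That splitting step might look like the sketch below, which reuses the pattern and data from this answer. The loop and the `tokens` name are illustrative additions. Note that with this pattern, any words after the last keyword are not captured.

```python
import re

keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'

# Same pattern as in the answer: lazy text, whitespace, keyword, then whitespace or end.
reg = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(keyWordList)

tokens = []
for words, keyword in re.findall(reg, userInput):
    tokens.append(words.split())  # split the non-keyword text into a sublist
    tokens.append(keyword)        # the keyword stays at the top level

print(tokens)
```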

For multiple levels of depth, a regex may not be a good answer.

There are some nice parsers to choose from on this page: http://wiki.python.org/moin/LanguageParsing

I think Lepl is a good one.

2012-01-15 00:19:57 JBernardo

This will run into problems whenever command1 is a substring of one of the other terms (e.g. "length" and "len"). – DSM 2012-01-15 00:24:18

True. Adding '\s' around the keyword list solves that. I edited my answer. – JBernardo 2012-01-15 00:26:30
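To illustrate the substring issue raised in these comments, here is a small sketch. The keywords and text are made up for the example, and the whitespace lookarounds are one possible way to require delimiters, not necessarily what the edited answer uses.

```python
import re

keywords = ['len', 'length']  # hypothetical keywords; one is a prefix of the other
text = 'compute the length of x len y'

# Unbounded alternation: 'len' also matches inside 'length'.
naive = '|'.join(map(re.escape, keywords))
naive_hits = re.findall(naive, text)

# Require that no non-whitespace character touches the keyword on either side.
bounded = r'(?<!\S)(?:%s)(?!\S)' % naive
bounded_hits = re.findall(bounded, text)

print(naive_hits)    # ['len', 'len']
print(bounded_hits)  # ['length', 'len']
```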

+1 for the link – 2012-04-25 22:02:35

Something like this:

def tokenize(lst, keywords):
    cur = []
    for x in lst:
        if x in keywords:
            yield cur
            yield x
            cur = []
        else:
            cur.append(x)
    if cur:
        # Emit non-keywords that follow the last keyword.
        yield cur

This returns a generator, so wrap the call in a call to list.
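A short usage sketch of the generator approach with the question's data follows. The final `if cur` flush is included here so that non-keywords after the last keyword are not dropped; the variable names `first` and `result` are illustrative.

```python
def tokenize(lst, keywords):
    cur = []
    for x in lst:
        if x in keywords:
            yield cur
            yield x
            cur = []
        else:
            cur.append(x)
    if cur:
        yield cur  # flush trailing non-keywords

keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'

# The generator is lazy: nothing runs until a value is requested.
gen = tokenize(userInput.split(), keyWordList)
first = next(gen)
print(first)  # ['The', 'quick', 'brown']

# Wrapping a fresh call in list() collects every token at once.
result = list(tokenize(userInput.split(), keyWordList))
print(result)
```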

2012-01-15 00:23:06

Or have a look at PyParsing. It is quite a nice lexer/parser combination.

2012-01-15 12:45:19 Nickle
