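This would be very difficult to do cleanly. The base parser classes rely on exact matches against a production's RHS to pop content, so you would have to subclass the parser and rewrite large parts of it. I tried this a while ago with the feature grammar class and gave up.
What I did instead is more of a hack: I first extract the regex matches from the text, then add them to the grammar as productions. With a large grammar this will be slow, because the grammar and the parser have to be rebuilt on every call.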
# Written for Python 2 and NLTK 2.x (nltk.parse_cfg, ContextFreeGrammar).
import re
import nltk
from nltk.grammar import Nonterminal, Production, ContextFreeGrammar

grammar = nltk.parse_cfg("""
S -> TEXT
TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
""")
productions = grammar.productions()

def literal_production(key, rhs):
    """ Return a production <key> -> <rhs>

    :param key: symbol for lhs
    :param rhs: string literal
    """
    lhs = Nonterminal(key)
    return Production(lhs, [rhs])

def parse(text):
    """ Parse some text.
    """
    # Extract new words and numbers
    words = set([match.group(0) for match in re.finditer(r"[a-zA-Z]+", text)])
    numbers = set([match.group(0) for match in re.finditer(r"\d+", text)])
    # Make a local copy of productions
    lproductions = list(productions)
    # Add a production for every word and number
    lproductions.extend([literal_production("WORD", word) for word in words])
    lproductions.extend([literal_production("NUMBER", number) for number in numbers])
    # Make a local copy of the grammar with extra productions
    lgrammar = ContextFreeGrammar(grammar.start(), lproductions)
    # Load grammar into a parser
    parser = nltk.RecursiveDescentParser(lgrammar)
    tokens = text.split()
    return parser.parse(tokens)

print parse("foo hello world 123 foo")
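For Python 3 and NLTK 3 or later, the same hack might look roughly like the sketch below. It assumes the renamed API: `CFG.fromstring` in place of `nltk.parse_cfg`, `CFG` in place of `ContextFreeGrammar`, and `parser.parse` returning an iterator of trees rather than a single tree. The constant names `BASE_GRAMMAR` and `BASE_PRODUCTIONS` are only illustrative.

```python
import re

from nltk.grammar import CFG, Nonterminal, Production
from nltk.parse import RecursiveDescentParser

# Base grammar; WORD and NUMBER get their productions at parse time.
BASE_GRAMMAR = CFG.fromstring("""
S -> TEXT
TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
""")
BASE_PRODUCTIONS = BASE_GRAMMAR.productions()


def literal_production(key, rhs):
    """Return a production <key> -> <rhs>, where rhs is a string literal."""
    return Production(Nonterminal(key), [rhs])


def parse(text):
    """Parse text with a grammar extended by the words and numbers found in it."""
    # Extract the distinct words and numbers that occur in the text.
    words = set(re.findall(r"[a-zA-Z]+", text))
    numbers = set(re.findall(r"\d+", text))

    # Copy the base productions and add one literal production per match.
    productions = list(BASE_PRODUCTIONS)
    productions.extend(literal_production("WORD", w) for w in sorted(words))
    productions.extend(literal_production("NUMBER", n) for n in sorted(numbers))

    # Build a per-call grammar and parser (this is the slow part).
    grammar = CFG(BASE_GRAMMAR.start(), productions)
    parser = RecursiveDescentParser(grammar)

    # Same whitespace tokenization as the original code.
    tokens = text.split()
    return list(parser.parse(tokens))


for tree in parse("foo hello world 123 foo"):
    print(tree)
```

As in the original, the grammar and parser are rebuilt on every call, so this stays slow with a large grammar. Tokens still come from `text.split()`, so a token the regexes do not match in full (say `foo,`) gets no production of its own, and NLTK's coverage check should reject the input with a `ValueError`.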
There is more background in this discussion on the nltk-users Google group: https://groups.google.com/d/topic/nltk-users/4nC6J7DJcOc/discussion
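I don't think this is possible short of implementing a CFG sub-grammar that matches alphanumeric characters. What's the context for trying to do this? – dmh 2013-07-03 15:04:36
NLTK's 'parse_cfg' isn't robust enough to let you do '\w*' – alvas 2013-12-15 13:28:35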