Extracting specific information from text

I want to get some data out of a text file. I have decided to use the Natural Language Toolkit (NLTK) for this, but I am open to suggestions if there is a better way to do it.

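Here is an example:

I need a flight from New York NY to San Francisco CA

From this text, I want to get the city and state of both the origin and the destination.

This is what I have so far: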

import nltk
from nltk.text import Text
from nltk.corpus import PlaintextCorpusReader

def readfiles():
    # Raw string keeps the backslashes in the Windows path literal
    corpus_root = r'C:\prototype\emails'
    w = PlaintextCorpusReader(corpus_root, '.*')
    t = Text(w.words())
    # concordance() prints its matches itself and returns None
    print("--- to ----")
    t.concordance("to")

    print("--- from ----")
    t.concordance("from")

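I can read text from some input (my files) and then use the concordance method to find every use of a word. What I want to extract is the city and state information that follows "to" and "from".

The question is: what is the best way to look at the text that follows each instance of "to" and "from"?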

Source

2011-12-28 dev.e.loper

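Picking out places like this from text is called "named entity recognition". NLTK can do this, although you may want to adapt your own version based on a gazetteer (GeoNames.org could be a source for the lookup data). – winwaed 2011-12-29 00:33:06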
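To illustrate the gazetteer idea from the comment above, here is a minimal sketch in plain Python. It does not use NLTK's entity recognizer; it simply looks words up in a list. The `CITIES` and `STATES` sets and the `find_places` function are hypothetical names with made-up sample entries, standing in for data that might be loaded from a source such as GeoNames.

```python
import re

# Hypothetical mini gazetteer; real data could come from a source like GeoNames.
CITIES = {"new york", "san francisco", "los angeles"}
STATES = {"NY", "CA"}

def find_places(text):
    """Return (city, state) pairs found in text, in order of appearance.

    state is None when no known state code directly follows the city.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    places = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest city name first (up to three words in this sketch).
        for n in (3, 2, 1):
            chunk = tokens[i:i + n]
            if len(chunk) == n and " ".join(chunk).lower() in CITIES:
                matched = n
                break
        if not matched:
            i += 1
            continue
        city = " ".join(tokens[i:i + matched])
        i += matched
        state = None
        # A state code counts only if it comes right after the city name.
        if i < len(tokens) and tokens[i] in STATES:
            state = tokens[i]
            i += 1
        places.append((city, state))
    return places
```

For the sample sentence in the question, `find_places` returns `[("New York", "NY"), ("San Francisco", "CA")]`. A lookup like this only finds places that are in the list, so its quality depends entirely on how complete the gazetteer is.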

Maybe you would be better off reading the file line by line?
Then something simple:

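line = "I need a flight from New York, NY to San Francisco, CA"  # sample input with commas
dataAfterTo = line.split(" to ", 1)[1]  # text that follows "to"
cityState = dataAfterTo.split(",")
city = cityState[0].strip()
state = cityState[1].split()[0]  # first token after the comma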

Unless you are dealing with user-generated content.

Source

2011-12-28 16:39:14 Brian

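Yes, it is user-generated, so there may or may not be a ',' separating the city and the state. I was hoping to find a more elegant solution using the Python language or a library. – 2011-12-28 21:08:27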
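One way to tolerate the missing comma is a regular expression that treats it as optional. The following is only a sketch under some assumptions that are not in the original posts: city names are runs of capitalized words, states are two-letter uppercase codes, and the keywords "from" and "to" are lowercase. The names `PLACE_AFTER_KEYWORD` and `extract_route` are hypothetical.

```python
import re

# Keyword, then a run of capitalized words (the city), an optional comma,
# and a two-letter uppercase state code.
PLACE_AFTER_KEYWORD = re.compile(
    r"\b(?P<keyword>from|to)\s+"
    r"(?P<city>[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
    r"\s*,?\s*"
    r"(?P<state>[A-Z]{2})\b"
)

def extract_route(text):
    """Map each keyword found ('from' / 'to') to a (city, state) tuple."""
    route = {}
    for m in PLACE_AFTER_KEYWORD.finditer(text):
        route[m.group("keyword")] = (m.group("city"), m.group("state"))
    return route
```

For the sentence in the question, `extract_route` returns `{"from": ("New York", "NY"), "to": ("San Francisco", "CA")}`, and it gives the same kind of result for "from New York, NY to San Francisco, CA". City names containing punctuation or lowercase words (for example "St. Louis") would not match this pattern, so real user input would probably need a looser pattern or a gazetteer lookup.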
