NLTK set method prints characters rather than words

I'm new to NLTK (and Python...) and I'm having two problems with one of its basic methods. When I call

sorted(set(<one of nltk's preloaded corpora>))

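it prints a list of all the words in the text, but each word is preceded by a 'u', like this: [u'yourselves', u'youth']. I thought I had broken the tokenizer, but I have already tried re-cloning the repo and reinstalling.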

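The second, possibly related, problem is that when I try these methods on a string I define myself, I get characters back rather than words. Do I need to parse the text I pass in before using set()?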

2013-12-20 TuringTested

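The 'u' just says that the string is unicode. Don't worry about it. As for your second question, `set` turns whatever iterable you pass in into a set. If you want a set of words, you need to split the text into words before passing it in. – Blender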

Great, thank you. – TuringTested
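
To illustrate Blender's comment, here is a minimal sketch of the difference between passing a raw string to set() and passing a list of words. The sample sentence is made up for illustration, and str.split() is only a naive whitespace tokenizer; a real tokenizer such as nltk.word_tokenize would handle punctuation better.

```python
text = "the cat saw the other cat"

# Iterating over a string yields characters, so set() collects characters.
chars = sorted(set(text))

# Splitting on whitespace first yields words, so set() collects words.
words = sorted(set(text.split()))

print(chars)
print(words)
```

The first list contains single characters (including the space); the second contains each distinct word once.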

u'foo bar' is just a unicode string. In Python 2, both str and unicode are treated as basestring (see http://docs.python.org/2/howto/unicode.html, http://docs.python.org/2/library/functions.html#basestring)

>>> x = u'foobar' 
>>> isinstance(x, str) 
False 
>>> isinstance(x,unicode) 
True 
>>> isinstance(x,basestring) 
True 
>>> print x 
foobar
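
As a small sketch of why the prefix is harmless: the u belongs to how Python 2 displays a unicode string inside a list, not to the string's contents. The two words below are taken from the question; the variable names are illustrative only.

```python
words = [u'yourselves', u'youth']

# Joining (or printing) the individual strings shows their contents only;
# the u prefix is never part of the text itself.
joined = ", ".join(words)
print(joined)

# The first character of each word is the letter 'y', not 'u'.
first_is_y = all(w[0] == 'y' for w in words)
print(first_is_y)
```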

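When you access a corpus through NLTK's corpus readers, the default data structure is a list of sentences, where each sentence is a list of tokens and each token is a basestring.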

>>> from nltk.corpus import brown 
>>> print brown.sents() 
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

If you want a plain-text version of the corpus, you can simply do:

>>> for i in brown.sents(): 
...  print " ".join(i) 
...  break 
... 
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
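
Tying this back to the question, here is a hedged sketch of building a sorted vocabulary from the list-of-sentences structure. To stay runnable without downloading any corpus data, it uses a short hand-copied sample of the tokens shown above; with the real corpus, brown.sents() could take the place of the sents variable. Lowercasing is an assumption added here so that 'The' and 'the' count as one word.

```python
# A shortened stand-in for brown.sents(): a list of sentences,
# each sentence being a list of string tokens.
sents = [
    ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday'],
    ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments'],
]

# Flatten the nested lists, normalise case, and keep each word once.
vocab = sorted(set(word.lower() for sent in sents for word in sent))
print(vocab)
```

Because every element handed to set() is already a whole token, the result is a set of words rather than characters.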

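There is a lot of internal magic in NLTK that makes the corpora work the way they do when imported from NLTK's modules, but the simplest way to find out what is in one of these "preloaded" corpora (or, more accurately, "pre-coded" corpus readers) is to use: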

2013-12-27 10:03:00 alvas
