Importing my own text to use the NLTK part-of-speech tagger

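I'm a beginner, but I would like to create a folder that holds many texts (say, novels saved as .txt files). I would then like the user to select one of these novels and have the part-of-speech tagger analyse the entire text automatically. Is this possible? So far I have only been trying the tagger on a single hard-coded sentence.

How do I make it analyse a text that the user selects instead of a hard-coded sentence? And how do I import those texts?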

2014-01-16 user3203883

There are several ways to read a directory of text files.

Let's try the native Python way first, from the terminal/console/command prompt:

~$ mkdir ~/testcorpora 
~$ cd ~/testcorpora/ 
~/testcorpora$ ls 
~/testcorpora$ echo 'this is a foo foo bar bar.\n bar foo, dah dah.' > somefoobar.txt 
~/testcorpora$ echo 'what are you talking about?' > talkingabout.txt 
~/testcorpora$ ls 
somefoobar.txt talkingabout.txt 
~/testcorpora$ cd .. 
~$ python 
>>> import os 
>>> from nltk.tokenize import word_tokenize 
>>> from nltk.tag import pos_tag 
>>> corpus_directory = 'testcorpora/' 
>>> for infile in os.listdir(corpus_directory): 
...  with open(corpus_directory+infile, 'r') as fin: 
...    pos_tag(word_tokenize(fin.read())) 
... 
[('what', 'WP'), ('are', 'VBP'), ('you', 'PRP'), ('talking', 'VBG'), ('about', 'IN'), ('?', '.')] 
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('foo', 'NN'), ('bar', 'NN'), ('bar.\\n', 'NN'), ('bar', 'NN'), ('foo', 'NN'), (',', ','), ('dah', 'NN'), ('dah', 'NN'), ('.', '.')]
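
To let the user pick a single text rather than tagging every file, the loop above could be wrapped roughly as follows. This is only a sketch: the helper names `list_texts`, `tag_file` and `choose_and_tag`, the `.txt` filter, the numbered menu and the UTF-8 encoding are my own assumptions, not part of NLTK. The NLTK imports sit inside `tag_file` so that listing the files works even before the tokenizer and tagger data are downloaded.

```python
import os


def list_texts(directory):
    """Return the sorted names of the .txt files in `directory`."""
    return sorted(name for name in os.listdir(directory) if name.endswith(".txt"))


def tag_file(path, tagger=None, tokenizer=None):
    """Read one existing text file and return its (token, tag) pairs."""
    if tagger is None or tokenizer is None:
        # Imported lazily: both calls need NLTK and its downloaded data.
        from nltk.tag import pos_tag
        from nltk.tokenize import word_tokenize
        tagger = tagger or pos_tag
        tokenizer = tokenizer or word_tokenize
    # Assumption: the novels are saved as UTF-8; change this if they are not.
    with open(path, encoding="utf-8") as fin:
        return tagger(tokenizer(fin.read()))


def choose_and_tag(directory):
    """Print a numbered menu, ask the user for a number, tag that file."""
    names = list_texts(directory)
    for number, name in enumerate(names, start=1):
        print(f"{number}. {name}")
    choice = int(input("Which text do you want to tag? "))
    if not 1 <= choice <= len(names):
        raise ValueError(f"Please enter a number between 1 and {len(names)}")
    return tag_file(os.path.join(directory, names[choice - 1]))
```

Calling `choose_and_tag('testcorpora/')` would then print the menu, wait for a number, and return the tagged tokens of that one file. Nothing has to be retyped into Python; the existing .txt file is simply opened and read.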

Another solution is to use PlaintextCorpusReader in NLTK and then run pos_tag on the corpus (the reader tokenizes the words itself; see "Creating a new corpus with NLTK"):

>>> from nltk.corpus.reader.plaintext import PlaintextCorpusReader 
>>> from nltk.tag import pos_tag 
>>> corpusdir = 'testcorpora/' 
>>> newcorpus = PlaintextCorpusReader(corpusdir,'.*') 
>>> dir(newcorpus) 
['CorpusView', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_encoding', '_fileids', '_get_root', '_para_block_reader', '_read_para_block', '_read_sent_block', '_read_word_block', '_root', '_sent_tokenizer', '_tag_mapping_function', '_word_tokenizer', 'abspath', 'abspaths', 'encoding', 'fileids', 'open', 'paras', 'raw', 'readme', 'root', 'sents', 'words'] 
# POS tagging all the words in all text files at the same time. 
>>> newcorpus.words() 
['this', 'is', 'a', 'foo', 'foo', 'bar', 'bar', '.\\', ...] 
>>> pos_tag(newcorpus.words()) 
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('foo', 'NN'), ('bar', 'NN'), ('bar', 'NN'), ('.\\', ':'), ('n', 'NN'), ('bar', 'NN'), ('foo', 'NN'), (',', ','), ('dah', 'NN'), ('dah', 'NN'), ('.', '.'), ('what', 'WP'), ('are', 'VBP'), ('you', 'PRP'), ('talking', 'VBG'), ('about', 'IN'), ('?', '.')]
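
The session above tags the words of all files at once. Since `words()` also accepts a file id, one selected file can be tagged on its own. A minimal sketch, assuming hypothetical helpers `load_corpus` and `tag_one_file` and a `.*\.txt` pattern of my own choosing (the session above uses `'.*'`):

```python
def load_corpus(corpus_dir, pattern=r".*\.txt"):
    """Build a PlaintextCorpusReader over the .txt files in `corpus_dir`."""
    # Imported here so the helper below can be used without NLTK installed.
    from nltk.corpus.reader.plaintext import PlaintextCorpusReader
    return PlaintextCorpusReader(corpus_dir, pattern)


def tag_one_file(corpus, fileid, tagger=None):
    """POS-tag only the words that belong to `fileid` in `corpus`."""
    if fileid not in corpus.fileids():
        raise ValueError(f"{fileid!r} is not in this corpus")
    if tagger is None:
        # Default to NLTK's pos_tag, which needs its tagger model downloaded.
        from nltk.tag import pos_tag
        tagger = pos_tag
    # words(fileid) restricts the token stream to the one selected file.
    return tagger(list(corpus.words(fileid)))
```

With the test directory from above, `tag_one_file(load_corpus('testcorpora/'), 'talkingabout.txt')` should tag only the second file, and `load_corpus('testcorpora/').fileids()` gives the names to offer to the user.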


2014-01-16 19:14:07 alvas

Thank you very much! – user3203883

But do I have to type the entire novel into Python and then save it as a new .txt file? Can't I skip that step and just "call" the .txt file I already have? – user3203883

Question: do you have a directory of text files, or a single text file? – alvas
