2014-01-16 23 views
0

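Importing my own texts to use with the NLTK part-of-speech tagger

I'm a beginner, but I want to create a folder that holds many texts (say, novels saved as .txt files). I then want the user to choose one of these novels and have the part-of-speech tagger analyze the whole text automatically. Is this possible? So far I have only been trying the tagger on a single sentence.

How do I make it analyze the text the user chose instead of that sentence? And how do I import these texts?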

Answer

2

There are several ways to read a directory of text files.

Let's try the native Python way first, from the terminal/console/command prompt:

~$ mkdir ~/testcorpora 
~$ cd ~/testcorpora/ 
~/testcorpora$ ls 
~/testcorpora$ echo 'this is a foo foo bar bar.\n bar foo, dah dah.' > somefoobar.txt 
~/testcorpora$ echo 'what are you talking about?' > talkingabout.txt 
~/testcorpora$ ls 
somefoobar.txt talkingabout.txt 
~/testcorpora$ cd .. 
~$ python 
>>> import os 
>>> from nltk.tokenize import word_tokenize 
>>> from nltk.tag import pos_tag 
>>> corpus_directory = 'testcorpora/' 
>>> for infile in os.listdir(corpus_directory):
...     with open(os.path.join(corpus_directory, infile), 'r') as fin:
...         pos_tag(word_tokenize(fin.read()))
... 
[('what', 'WP'), ('are', 'VBP'), ('you', 'PRP'), ('talking', 'VBG'), ('about', 'IN'), ('?', '.')] 
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('foo', 'NN'), ('bar', 'NN'), ('bar.\\n', 'NN'), ('bar', 'NN'), ('foo', 'NN'), (',', ','), ('dah', 'NN'), ('dah', 'NN'), ('.', '.')] 
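
To have the user pick one text instead of tagging every file, the loop above can be wrapped in a small menu. This is only a sketch under a few assumptions: the helper names (list_text_files, pick, tag_file, main) are made up for illustration, the files are assumed to be UTF-8, and the tokenizer and tagger are passed in as arguments so the helpers do not depend on NLTK being installed.

```python
import os


def list_text_files(directory):
    """Return the .txt file names in `directory`, sorted for a stable menu."""
    return sorted(name for name in os.listdir(directory) if name.endswith(".txt"))


def pick(names, choice):
    """Map a 1-based menu number (as typed by the user) to a file name."""
    index = int(choice) - 1
    if not 0 <= index < len(names):
        raise ValueError("no such menu entry: %s" % choice)
    return names[index]


def tag_file(path, tokenize, tag):
    """Read one whole file and return whatever `tag` produces for its tokens."""
    # Assumption: the novels are stored as UTF-8 text.
    with open(path, "r", encoding="utf-8") as fin:
        return tag(tokenize(fin.read()))


def main(corpus_directory="testcorpora"):
    """Show a numbered menu, ask for a choice, and print the tagged text."""
    # NLTK is imported only here, so the helpers above work without it.
    from nltk.tag import pos_tag
    from nltk.tokenize import word_tokenize

    names = list_text_files(corpus_directory)
    for number, name in enumerate(names, start=1):
        print(number, name)
    chosen = pick(names, input("Which text? "))
    print(tag_file(os.path.join(corpus_directory, chosen), word_tokenize, pos_tag))
```

Calling main() from the directory that contains testcorpora lists the files, asks for a number, and tags only the chosen file. The novels are read straight from the existing .txt files, so nothing has to be retyped or saved again.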

Another solution is to use PlaintextCorpusReader in NLTK and then run word_tokenize or pos_tag on the corpus (see "Creating a new corpus with NLTK"):

>>> from nltk.corpus.reader.plaintext import PlaintextCorpusReader 
>>> from nltk.tag import pos_tag 
>>> corpusdir = 'testcorpora/' 
>>> newcorpus = PlaintextCorpusReader(corpusdir,'.*') 
>>> dir(newcorpus) 
['CorpusView', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_encoding', '_fileids', '_get_root', '_para_block_reader', '_read_para_block', '_read_sent_block', '_read_word_block', '_root', '_sent_tokenizer', '_tag_mapping_function', '_word_tokenizer', 'abspath', 'abspaths', 'encoding', 'fileids', 'open', 'paras', 'raw', 'readme', 'root', 'sents', 'words'] 
# POS tagging all the words in all text files at the same time. 
>>> newcorpus.words() 
['this', 'is', 'a', 'foo', 'foo', 'bar', 'bar', '.\\', ...] 
>>> pos_tag(newcorpus.words()) 
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('foo', 'NN'), ('bar', 'NN'), ('bar', 'NN'), ('.\\', ':'), ('n', 'NN'), ('bar', 'NN'), ('foo', 'NN'), (',', ','), ('dah', 'NN'), ('dah', 'NN'), ('.', '.'), ('what', 'WP'), ('are', 'VBP'), ('you', 'PRP'), ('talking', 'VBG'), ('about', 'IN'), ('?', '.')] 
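
The corpus reader can also be restricted to a single file, which is closer to what the question asks for. The sketch below relies on the fileids() and words() methods that appear in the dir() listing above; the helper names tag_one_file and main are hypothetical, and the pattern r".*\.txt" is an assumption that narrows the corpus to .txt files.

```python
def tag_one_file(corpus, fileid, tag):
    """Tag only the words of `fileid` rather than the whole corpus."""
    if fileid not in corpus.fileids():
        raise ValueError("unknown file: %s" % fileid)
    # Passing a file id to words() limits the reader to that one file.
    return tag(list(corpus.words(fileid)))


def main(corpusdir="testcorpora/"):
    """List the files in the corpus, ask for one by name, and tag it."""
    # NLTK is imported only here, so tag_one_file stays usable without it.
    from nltk.corpus.reader.plaintext import PlaintextCorpusReader
    from nltk.tag import pos_tag

    # Assumption: only files ending in .txt belong to the corpus.
    newcorpus = PlaintextCorpusReader(corpusdir, r".*\.txt")
    for fileid in newcorpus.fileids():
        print(fileid)
    chosen = input("Which text? ")
    print(tag_one_file(newcorpus, chosen, pos_tag))
```

Calling main() prints the available file names and tags whichever one the user types in, for example talkingabout.txt.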

Thanks a lot! – user3203883


But do I have to type the whole novel into Python and then save it as a new .txt file? Can't I skip that step and just "call" the .txt files I already have? – user3203883


Question: do you have a directory of text files, or a single text file? – alvas
