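How do I write NLTK chunks to a file?
I have this Python script in which I use the nltk library to parse, tokenize, tag, and chunk some text, let's say random text from the web.
I need to format the output of chunked1, chunked2, and chunked3 and write it to a file. These are of type class 'nltk.tree.Tree'.
More specifically, I only need to write out the lines that match the regular expressions chunkGram1, chunkGram2, and chunkGram3.
How can I do that?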
    #! /usr/bin/python2.7
    import nltk
    import re
    import codecs

    xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."]

    def processLanguage():
        for item in xstring:
            tokenized = nltk.word_tokenize(item)
            tagged = nltk.pos_tag(tokenized)
            #print tokenized
            #print tagged

            chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
            chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
            chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""

            chunkParser1 = nltk.RegexpParser(chunkGram1)
            chunked1 = chunkParser1.parse(tagged)

            chunkParser2 = nltk.RegexpParser(chunkGram2)
            chunked2 = chunkParser2.parse(tagged)

            chunkParser3 = nltk.RegexpParser(chunkGram3)
            chunked3 = chunkParser3.parse(tagged)

            #print chunked1
            #print chunked2
            #print chunked3

            # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:
            #     for i,line in enumerate(chunked1):
            #         if "JJ" in line:
            #             outfile.write(line)
            #         elif "NNP" in line:
            #             outfile.write(line)

    processLanguage()
When I try to run it, this is the error I get:
    Traceback (most recent call last):
      File "sentdex.py", line 47, in <module>
        processLanguage()
      File "sentdex.py", line 40, in processLanguage
        outfile.write(line)
      File "C:\Python27\lib\codecs.py", line 688, in write
        return self.writer.write(data)
      File "C:\Python27\lib\codecs.py", line 351, in write
        data, consumed = self.encode(object, self.errors)
    TypeError: coercing to Unicode: need string or buffer, tuple found
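EDIT: After @Alvas's answer, I was able to do what I wanted. Now, however, I would like to know how to strip all non-ASCII characters from a text corpus. For example: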
    #store cleaned file into variable
    with open('path\to\file.txt', 'r') as infile:
        xstring = infile.readlines()
        infile.close

    def remove_non_ascii(line):
        return ''.join([i if ord(i) < 128 else ' ' for i in line])

    for i, line in enumerate(xstring):
        line = remove_non_ascii(line)

    #tokenize and tag text
    def processLanguage():
        for item in xstring:
            tokenized = nltk.word_tokenize(item)
            tagged = nltk.pos_tag(tokenized)
            print tokenized
            print tagged

    processLanguage()
The above was taken from another answer on S/O, but it does not seem to work. What might be wrong? The error I get is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
not in range(128)
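One possible cause, offered as a guess rather than a confirmed diagnosis: the `for` loop rebinds the local name `line` but never stores the cleaned text back into `xstring`, so `processLanguage()` still sees the original, uncleaned lines. Byte 0xe2 is also the lead byte of a three-byte UTF-8 sequence (curly quotes, for instance), which suggests the file holds UTF-8 text that was read as raw bytes. The sketch below is written for Python 3 and only shows the "build a new list" step; `clean_lines` and the sample data are hypothetical names introduced for illustration.

```python
def remove_non_ascii(text):
    """Replace every non-ASCII character in `text` with a space."""
    return "".join(ch if ord(ch) < 128 else " " for ch in text)


def clean_lines(lines):
    """Return a new list of cleaned lines; the input list is left unchanged."""
    return [remove_non_ascii(line) for line in lines]


# Hypothetical sample standing in for the lines read from the file.
raw = [u"caf\u00e9 culture", u"plain ascii", u"quote \u2019s"]
cleaned = clean_lines(raw)
print(cleaned)
```

In the script above, this would correspond to `xstring = clean_lines(xstring)` before calling `processLanguage()`. On Python 2.7 the lines would probably also need to be decoded first, for example by reading the file with `codecs.open(..., encoding='utf8')`, assuming the file really is UTF-8.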
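An error trace with line numbers would help identify what in the code is causing the `TypeError`. – 2015-02-06 12:22:30
Your `line` contains a `Tree`, not a `string`. Try iterating over the strings it contains. – Selcuk 2015-02-06 12:26:48
@Selcuk Would you care to elaborate? – kapelnick 2015-02-06 12:39:37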
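To illustrate Selcuk's comment with a minimal sketch (not necessarily what he or the accepted answer had in mind): iterating over a chunked `Tree` yields either `(word, tag)` tuples or nested `Tree` objects, neither of which `outfile.write` accepts. One way around this is to walk only the subtrees labelled `Chunk` and join their leaves into a string. The helper names `chunk_lines` and `write_chunks`, the `word/TAG` output format, and the hand-built example tree are assumptions made for illustration; the code targets Python 3 with NLTK 3.

```python
import io

from nltk.tree import Tree


def chunk_lines(tree, label="Chunk"):
    """Return one 'word/TAG word/TAG' string per subtree carrying `label`."""
    lines = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == label):
        # Leaves of a chunk built from pos_tag output are (word, tag) tuples.
        lines.append(" ".join("%s/%s" % (word, tag) for word, tag in subtree.leaves()))
    return lines


def write_chunks(trees, path):
    """Write the matched chunks of every tree to `path`, one chunk per line."""
    with io.open(path, "w", encoding="utf8") as outfile:
        for tree in trees:
            for line in chunk_lines(tree):
                outfile.write(line + u"\n")


# Hand-built stand-in for chunked1, so the sketch runs without tagger models.
example = Tree("S", [
    ("An", "DT"),
    Tree("Chunk", [("electronic", "JJ"), ("library", "NN")]),
    ("is", "VBZ"),
    ("a", "DT"),
    Tree("Chunk", [("focused", "JJ"), ("collection", "NN")]),
])
print(chunk_lines(example))
```

With the script from the question, something like `write_chunks([chunked1, chunked2, chunked3], 'output.txt')` inside the loop would then write only the text matched by the three grammars.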