2015-02-06 42 views
4

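I have this Python script in which I use the nltk library to parse, tokenize, tag, and chunk some text, let's say random text from the web. How can I output the NLTK chunks to a file?

I need to format the output of chunked1, chunked2, and chunked3 and write it to a file. These have type class 'nltk.tree.Tree'.

More specifically, I only need to write the lines that match the regular expressions chunkGram1, chunkGram2, and chunkGram3.

How can I do that?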

#! /usr/bin/python2.7 

import nltk 
import re 
import codecs 

xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."] 


def processLanguage(): 
    for item in xstring: 
     tokenized = nltk.word_tokenize(item) 
     tagged = nltk.pos_tag(tokenized) 
     #print tokenized 
     #print tagged 

     chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}""" 
     chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}""" 
     chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}""" 

     chunkParser1 = nltk.RegexpParser(chunkGram1) 
     chunked1 = chunkParser1.parse(tagged) 

     chunkParser2 = nltk.RegexpParser(chunkGram2) 
     chunked2 = chunkParser2.parse(tagged) 

     chunkParser3 = nltk.RegexpParser(chunkGram3) 
     chunked3 = chunkParser3.parse(tagged) 

     #print chunked1 
     #print chunked2 
     #print chunked3 

     # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile: 

      # for i,line in enumerate(chunked1): 
       # if "JJ" in line: 
        # outfile.write(line) 
       # elif "NNP" in line: 
        # outfile.write(line) 



processLanguage() 

When I try to run it, this is the error I get:

`Traceback (most recent call last): 
    File "sentdex.py", line 47, in <module> 
    processLanguage() 
    File "sentdex.py", line 40, in processLanguage 
    outfile.write(line) 
    File "C:\Python27\lib\codecs.py", line 688, in write 
    return self.writer.write(data) 
    File "C:\Python27\lib\codecs.py", line 351, in write 
    data, consumed = self.encode(object, self.errors) 
TypeError: coercing to Unicode: need string or buffer, tuple found` 

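Edit: After @Alvas's answer, I was able to do what I wanted. Now, however, I would like to know how to strip all non-ASCII characters from a text corpus. For example: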

#store cleaned file into variable 
with open('path\to\file.txt', 'r') as infile: 
    xstring = infile.readlines() 
infile.close 

def remove_non_ascii(line): 
    return ''.join([i if ord(i) < 128 else ' ' for i in line]) 

for i, line in enumerate(xstring): 
    line = remove_non_ascii(line) 

#tokenize and tag text 
def processLanguage(): 
    for item in xstring: 
     tokenized = nltk.word_tokenize(item) 
     tagged = nltk.pos_tag(tokenized) 
     print tokenized 
     print tagged 
processLanguage() 

The above was taken from another answer on S/O, but it does not seem to work. What could be wrong? The error I get is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 
not in range(128) 
+1

An error traceback with line numbers would help identify what in the code causes the 'TypeError'. – 2015-02-06 12:22:30

+1

Your 'line' contains a 'Tree', not a 'string'. Try iterating over the strings it contains. – Selcuk 2015-02-06 12:26:48

+0

@Selcuk Would you care to elaborate? – kapelnick 2015-02-06 12:39:37

Answers

6

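Your code has several problems, though the main culprit is that your for loop does not modify the contents of xstring.

I will address all the issues in your code here:

You cannot write paths like this with a single \, because \t will be interpreted as a tab character and \f as a form feed character. You have to double the backslashes. I know this is only an example here, but this kind of confusion comes up often: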

with open('path\\to\\file.txt', 'r') as infile: 
    xstring = infile.readlines() 
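As a quick illustration of the escaping point, here is a minimal sketch comparing the spellings. The raw-string and `os.path.join` variants are alternatives not mentioned in the answer, and the path itself is only the placeholder from the question:

```python
import os

# In a normal string literal, \t is a tab and \f is a form feed,
# so this is NOT the path it appears to be.
bad = 'path\to\file.txt'

# Doubling the backslashes keeps them literal.
escaped = 'path\\to\\file.txt'

# A raw string literal (an alternative spelling) gives the same result.
raw = r'path\to\file.txt'

# Letting os.path.join pick the separator avoids backslashes entirely.
portable = os.path.join('path', 'to', 'file.txt')

print(repr(bad))        # shows the embedded \t and \x0c characters
print(escaped == raw)
```

Printing `repr(bad)` makes the hidden control characters visible, which is usually the fastest way to spot this mistake.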

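The infile.close line that follows is wrong. It does not call the close method; it actually does nothing. Besides, your file is already closed by the with clause. If you see this line in any answer anywhere, please downvote that answer right away and leave a comment saying that file.close is wrong and should be file.close()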
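The difference is easy to check. The sketch below uses a throwaway temporary file (an assumption made purely for illustration) and inspects the file object's `closed` attribute:

```python
import os
import tempfile

# Create a scratch file so the example does not touch real data.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

handle = open(path, 'w')
handle.close                      # only looks up the method; nothing is called
still_open = not handle.closed    # the file is still open here

handle.close()                    # the parentheses perform the call
now_closed = handle.closed

# A with block closes the file on exit, so no explicit close is needed.
with open(path) as infile:
    lines = infile.readlines()
closed_by_with = infile.closed

os.remove(path)
print(still_open, now_closed, closed_by_with)
```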

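The following should work, but be aware that replacing every non-ASCII character with ' ' will break words such as naïve and café:

def remove_non_ascii(line): 
    return ''.join([i if ord(i) < 128 else ' ' for i in line]) 

But here is why your code fails with the Unicode exception: you are not modifying the elements of xstring at all. You do compute each line with the non-ASCII characters removed, but that is a new value that is never stored back into the list: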

for i, line in enumerate(xstring): 
    line = remove_non_ascii(line) 

Instead, it should be:

for i, line in enumerate(xstring): 
    xstring[i] = remove_non_ascii(line) 

Or, my preferred and very Pythonic version:

xstring = [ remove_non_ascii(line) for line in xstring ] 
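To see the effect in isolation, here is a small sketch using a made-up two-line list in place of the file contents. It shows that rebinding the loop variable leaves the list untouched, while the list comprehension produces the cleaned lines:

```python
def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])

# Sample data standing in for the lines read from the file.
xstring = [u'na\u00efve caf\u00e9', u'plain text']

# Rebinding the loop variable creates a new string each time,
# but never writes it back into the list.
for i, line in enumerate(xstring):
    line = remove_non_ascii(line)
after_loop = list(xstring)        # still contains the accented characters

# Building a new list stores the cleaned values.
cleaned = [remove_non_ascii(line) for line in xstring]

print(after_loop)
print(cleaned)
```

The cleaned first line also shows the caveat about accented words: each accented letter turns into a space.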

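That said, these Unicode errors occur mainly because you are using Python 2.7 to process Unicode text, something that recent Python 3 versions handle much better. If you are only just starting on this task, I recommend upgrading to Python 3.4+ soon.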
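For comparison, here is a hedged sketch of reading the file as Unicode text. It assumes the file is UTF-8 encoded (the 0xe2 byte in the error is consistent with that, but the real encoding is unknown), and `read_ascii_lines` is a hypothetical helper, not part of the original code. Instead of inserting spaces, it decomposes accented letters so that the base letter survives:

```python
import io
import os
import tempfile
import unicodedata

def read_ascii_lines(path, encoding='utf-8'):
    """Hypothetical helper: decode the file, then reduce each line to ASCII."""
    with io.open(path, 'r', encoding=encoding) as infile:
        return [
            # NFKD splits e.g. e-acute into 'e' plus a combining accent;
            # encoding with 'ignore' then drops whatever is not ASCII.
            unicodedata.normalize('NFKD', line)
            .encode('ascii', 'ignore')
            .decode('ascii')
            for line in infile
        ]

# Write a small UTF-8 sample file so the sketch is self-contained.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(sample_path, 'w', encoding='utf-8') as out:
    out.write(u'It\u2019s a caf\u00e9.\n')

ascii_lines = read_ascii_lines(sample_path)
os.remove(sample_path)
print(ascii_lines)
```

Characters with no ASCII decomposition, such as the curly apostrophe in the sample, are dropped outright, so whether this is preferable to inserting spaces depends on the corpus.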

+0

Thanks for your answer. I will take a closer look at it once I have some time. – kapelnick 2015-02-14 14:31:51

7

First, watch this video: https://www.youtube.com/watch?v=0Ef9GudbxXY


Now for the proper answer:

import re 
import io 

from nltk import pos_tag, word_tokenize, sent_tokenize, RegexpParser 


xstring = u"An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system." 


chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}""" 
chunkParser1 = RegexpParser(chunkGram1) 

chunked = [chunkParser1.parse(pos_tag(word_tokenize(sent))) 
      for sent in sent_tokenize(xstring)] 

with io.open('outfile', 'w', encoding='utf8') as fout: 
    for chunk in chunked: 
     fout.write(str(chunk)+'\n\n') 

[out]:

$ python test2.py 
Traceback (most recent call last): 
    File "test2.py", line 18, in <module> 
    fout.write(str(chunk)+'\n\n') 
TypeError: must be unicode, not str 
$ python3 test2.py 
$ head outfile 
(S 
    An/DT 
    (Chunk electronic/JJ library/NN) 
    (/: 
    also/RB 
    referred/VBD 
    to/TO 
    as/IN 
    (Chunk digital/JJ library/NN) 
    or/CC 

If you want to stick with Python 2.7:

with io.open('outfile', 'w', encoding='utf8') as fout: 
    for chunk in chunked: 
     fout.write(unicode(chunk)+'\n\n') 

[out]:

$ python test2.py 
$ head outfile 
(S 
    An/DT 
    (Chunk electronic/JJ library/NN) 
    (/: 
    also/RB 
    referred/VBD 
    to/TO 
    as/IN 
    (Chunk digital/JJ library/NN) 
    or/CC 
$ python3 test2.py 
Traceback (most recent call last): 
    File "test2.py", line 18, in <module> 
    fout.write(unicode(chunk)+'\n\n') 
NameError: name 'unicode' is not defined 

And this is strongly recommended if you must stick with py2.7:

from six import text_type 
with io.open('outfile', 'w', encoding='utf8') as fout: 
    for chunk in chunked: 
     fout.write(text_type(chunk)+'\n\n') 

[out]:

$ python test2.py 
$ head outfile 
(S 
    An/DT 
    (Chunk electronic/JJ library/NN) 
    (/: 
    also/RB 
    referred/VBD 
    to/TO 
    as/IN 
    (Chunk digital/JJ library/NN) 
    or/CC 
$ python3 test2.py 
$ head outfile 
(S 
    An/DT 
    (Chunk electronic/JJ library/NN) 
    (/: 
    also/RB 
    referred/VBD 
    to/TO 
    as/IN 
    (Chunk digital/JJ library/NN) 
    or/CC 
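The question also asked for only the matched chunks rather than the whole tree. One possible way to do that, sketched under a few assumptions: NLTK 3 is installed (where `Tree.label()` is available), the sentence is hand-tagged so that no tokenizer or tagger models are needed, and `chunk_lines` and `write_chunks` are hypothetical helper names. Iterating a `Tree` directly yields `(word, tag)` tuples and nested subtrees rather than strings, which is why writing those items straight to a file raises the `TypeError` from the question; `subtrees()` with a filter picks out just the `Chunk` nodes:

```python
import io
import os
import tempfile

from nltk import RegexpParser


def chunk_lines(tree, label="Chunk"):
    """Return one 'word/TAG word/TAG' string per subtree labelled `label`."""
    lines = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == label):
        lines.append(u" ".join(u"%s/%s" % (word, tag)
                               for word, tag in subtree.leaves()))
    return lines


def write_chunks(trees, path, label="Chunk"):
    """Write every matched chunk of every tree on its own line."""
    with io.open(path, "w", encoding="utf8") as fout:
        for tree in trees:
            for line in chunk_lines(tree, label):
                fout.write(line + u"\n")


# Hand-tagged sample sentence (illustrative data, not tagger output).
tagged = [(u"An", u"DT"), (u"electronic", u"JJ"), (u"library", u"NN"),
          (u"is", u"VBZ"), (u"a", u"DT"), (u"focused", u"JJ"),
          (u"collection", u"NN")]

parser = RegexpParser(r"Chunk: {<JJ\w?>*<NN>}")
tree = parser.parse(tagged)
matched = chunk_lines(tree)

# Write to a scratch file and read it back to show the result.
fd, out_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
write_chunks([tree], out_path)
with io.open(out_path, encoding="utf8") as fin:
    written = fin.read().splitlines()
os.remove(out_path)

print(matched)
```

Running the same helper once per grammar (chunkGram1, chunkGram2, chunkGram3) would give one list of matches per pattern.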
+0

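I will accept your answer because I value the feedback you provided. Perhaps you can help me with one more small thing. Take a look at the edited part of the question. – kapelnick 2015-02-08 12:02:02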

+2

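I will answer your edit, but I think it is a separate question in its own right. It would be better to ask another question before the SO moderators show up and delete your question for some reason. hahahaaa =) – alvas 2015-02-08 12:56:05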

+0

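Could you upload your file somewhere and then ask another question about data cleaning? I cannot help if I do not know what the file looks like or what it contains. Depending on the file and its contents, there can be 101 ways to clean the data. – alvas 2015-02-08 12:59:20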