How to create subcategories for a corpus in NLTK (Python)

I am trying to create a category nested under a parent category. Is that possible? If so, how is it done, and how do I refer to these subcategories?

2012-03-16 hyades

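CategorizedCorpusReader supports only a single level of categories. But since categories are derived from file names, you are free to set up your own naming/category scheme and filter the corpus files as needed.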
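As a minimal sketch of that idea, the hierarchy can be encoded in the category name itself and filtered by prefix. The file names, the `<parent>_<child>_<n>.txt` layout, and the `subcategories` helper below are all hypothetical, made up for illustration; only `CategorizedPlaintextCorpusReader`, `categories()`, and `fileids(categories=...)` come from NLTK.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical layout: <parent>_<child>_<n>.txt
samples = {
    "news_sports_1.txt": "the home team won .",
    "news_politics_1.txt": "the vote was close .",
    "fiction_scifi_1.txt": "the ship left orbit .",
}

# Write the sample files into a throwaway corpus directory.
root = tempfile.mkdtemp()
for name, text in samples.items():
    with open(os.path.join(root, name), "w", encoding="utf-8") as f:
        f.write(text)

# The single capture group keeps "parent_child" as one flat category name.
reader = CategorizedPlaintextCorpusReader(
    root, r".*\.txt", cat_pattern=r"([a-z]+_[a-z]+)_\d+\.txt"
)


def subcategories(reader, parent, sep="_"):
    """Return the flat categories whose name starts with parent + sep."""
    return [c for c in reader.categories() if c.startswith(parent + sep)]


news_cats = subcategories(reader, "news")
news_files = reader.fileids(categories=news_cats)
print(news_cats)
print(news_files)
```

The reader still sees one flat level of categories; the "parent" level exists only in the naming convention, so any method that accepts a `categories` argument can be given the filtered list.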

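How do you want to use multi-level categories? If you have a follow-up question, please explain what you are trying to achieve and what you have tried so far.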

2012-03-26 19:54:44 alexis

The simplest way to categorize a corpus is to have one file per category. Here are two excerpts from the movie_reviews corpus:

movie_pos.txt

the thin red line is flawed but it provokes .

movie_neg.txt

a big-budget and glossy production can not make up for a lack of spontaneity that permeates their tv show .

With these two files, we have two categories: positive (pos) and negative (neg).

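We will use CategorizedPlaintextCorpusReader, which inherits from both PlaintextCorpusReader and CategorizedCorpusReader. Between them, these two superclasses require three arguments: the root directory, the fileids, and a category specification.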

>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
>>> reader = CategorizedPlaintextCorpusReader('.', r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')
>>> reader.categories()
['neg', 'pos']
>>> reader.fileids(categories=['neg'])
['movie_neg.txt']
>>> reader.fileids(categories=['pos'])
['movie_pos.txt']

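The first two arguments to CategorizedPlaintextCorpusReader are the root directory and the fileids, which are passed on to PlaintextCorpusReader to read in the files. The cat_pattern keyword argument is a regular expression used to extract the category names from the fileids. In our case, the category is the part of the fileid after movie_ and before .txt. The category must be surrounded by grouping parentheses.

cat_pattern is passed to CategorizedCorpusReader, which overrides the common corpus reader functions (such as fileids(), words(), sents(), and paras()) to accept a categories keyword argument. This way, you can get all the pos sentences by calling reader.sents(categories=['pos']). CategorizedCorpusReader also provides the categories() function, which returns a list of all known categories in the corpus.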
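To make the role of the capture group concrete, here is a rough standard-library sketch of the kind of matching a cat_pattern performs. It is not NLTK's implementation; the `categories_from_fileids` helper is hypothetical, and skipping fileids that do not match is a choice made for this sketch only.

```python
import re
from collections import defaultdict


def categories_from_fileids(fileids, cat_pattern):
    """Group fileids by the text captured by the first group of cat_pattern."""
    mapping = defaultdict(list)
    for fileid in fileids:
        match = re.match(cat_pattern, fileid)
        if match:  # this sketch simply skips fileids the pattern does not cover
            mapping[match.group(1)].append(fileid)
    return dict(mapping)


mapping = categories_from_fileids(
    ["movie_pos.txt", "movie_neg.txt"], r"movie_(\w+)\.txt"
)
print(sorted(mapping))  # ['neg', 'pos']
```

Whatever the first parenthesized group captures becomes the category name, which is why a pattern without a group cannot produce categories.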

2016-02-09 23:58:56 Arqam
