How to add a custom corpus to NLTK on a local machine

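I have a custom corpus that I created, and I need to do some classification on it. My dataset is in the same format as the movie_reviews corpus. Following the NLTK documentation, I use the code below to access the movie_reviews corpus. Is there any way to add a custom corpus to the nltk_data/corpora directory and access it the same way as an existing corpus?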

import nltk
from nltk.corpus import movie_reviews

# Pair each review's word list with its category label.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

2017-02-11 Janitha

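While you could hack NLTK to make your corpus look like a built-in NLTK corpus, that is the wrong way to go about it. NLTK provides a rich collection of "corpus readers" that you can use to read your corpora from wherever you keep them, without moving them into the nltk_data directory or modifying the NLTK source. NLTK's own corpora use the same corpus readers behind the scenes, so your reader will have all the methods and behavior of the equivalent built-in corpus.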

Let's look at how the movie_reviews corpus is defined in nltk/corpus/__init__.py:

movie_reviews = LazyCorpusLoader(
    'movie_reviews', CategorizedPlaintextCorpusReader, 
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', 
    encoding='ascii')

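You can ignore the LazyCorpusLoader part; it exists to defer loading of corpora that your program will most likely never use. The rest shows that the movie review corpus is read with a CategorizedPlaintextCorpusReader, that its files all end in .txt, and that the reviews are sorted into categories by being placed in the subdirectories pos and neg. Finally, the corpus encoding is ascii. So define your own corpus like this (changing the values as needed):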

import nltk

# The first argument is the corpus root directory; adjust it to your own path.
mycorpus = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"/home/user/path/to/my_corpus",
    r'(?!\.).*\.txt',
    cat_pattern=r'(neg|pos)/.*',
    encoding="ascii")

That's it; you can now call mycorpus.words(), mycorpus.sents(categories="neg"), and so on, just as if this were a corpus provided by NLTK.
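To make the steps above concrete, here is a minimal, self-contained sketch. It assumes a tiny made-up corpus written to a temporary directory (the file names and review texts are hypothetical, not part of the original answer) and then builds the same `documents` list as in the question. It only calls `words()`, `categories()` and `fileids()`, which should not need any extra NLTK data downloads.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader


def build_demo_corpus(root):
    """Write a tiny two-category corpus under root (hypothetical sample data)."""
    samples = {
        "pos/review1.txt": "A wonderful film .",
        "neg/review2.txt": "A dull film .",
    }
    for relpath, text in samples.items():
        path = os.path.join(root, relpath)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="ascii") as handle:
            handle.write(text)


# A temporary directory stands in for /home/user/path/to/my_corpus.
demo_root = tempfile.mkdtemp()
build_demo_corpus(demo_root)

demo_corpus = CategorizedPlaintextCorpusReader(
    demo_root,
    r'(?!\.).*\.txt',
    cat_pattern=r'(neg|pos)/.*',
    encoding="ascii")

# Same shape as the movie_reviews example in the question.
documents = [(list(demo_corpus.words(fileid)), category)
             for category in demo_corpus.categories()
             for fileid in demo_corpus.fileids(category)]

for words, category in documents:
    print(category, words)
```

Pointing the first argument at your real corpus directory instead of the temporary one is the only change needed for actual data.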

2017-02-11 17:41:30 alexis

Great, this works :) Thanks a lot – Janitha

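First, put the actual data of your new corpus into your nltk_data/corpora/ directory. Then you have to edit the __init__.py file of nltk.corpus. You can find the path of this file by running: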

import nltk 
print(nltk.corpus.__file__)

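Open this file in a text editor and you will see that most of it consists of creating LazyCorpusLoader objects and assigning them to global variables.

So, for example, one section might look like: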

.... 
verbnet = LazyCorpusLoader(
    'verbnet', VerbnetCorpusReader, r'(?!\.).*\.xml') 
webtext = LazyCorpusLoader(
    'webtext', PlaintextCorpusReader, r'(?!README|\.).*\.txt', encoding='ISO-8859-2') 
wordnet = LazyCorpusLoader(
    'wordnet', WordNetCorpusReader, 
    LazyCorpusLoader('omw', CorpusReader, r'.*/wn-data-.*\.tab', encoding='utf8')) 
....

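To add a new corpus, you only need to add a new line to this file in the same format as the examples above. So if you have a corpus named movie_reviews and have saved its data in nltk_data/corpora/movie_reviews, you would add a line like: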

movie_reviews = LazyCorpusLoader('movie_reviews', ....)

The other arguments to LazyCorpusLoader are described in the NLTK documentation.

Then just save the file, and you should be able to do:

from nltk.corpus import movie_reviews
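If editing the installed package is undesirable, the same kind of LazyCorpusLoader line can, as far as I can tell, live in your own script instead. The sketch below is an assumption-laden illustration rather than part of the answer above: the corpus name `my_reviews`, the sample files, and the temporary data directory are all hypothetical, and it relies on NLTK looking up corpora under each directory listed in `nltk.data.path`.

```python
import os
import tempfile

import nltk
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.corpus.util import LazyCorpusLoader

# Hypothetical data directory standing in for nltk_data;
# a temp dir keeps the demo self-contained.
data_dir = tempfile.mkdtemp()
corpus_root = os.path.join(data_dir, "corpora", "my_reviews")
sample_files = {"pos/a.txt": "Great fun .", "neg/b.txt": "Too long ."}
for relpath, text in sample_files.items():
    path = os.path.join(corpus_root, relpath)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="ascii") as handle:
        handle.write(text)

# Let NLTK search the hypothetical data directory as well.
nltk.data.path.append(data_dir)

# Same call shape as the entries in nltk/corpus/__init__.py,
# but defined in user code, so an NLTK upgrade does not remove it.
my_reviews = LazyCorpusLoader(
    'my_reviews', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*',
    encoding='ascii')

loaded_categories = sorted(my_reviews.categories())
loaded_words = list(my_reviews.words(categories="neg"))
print(loaded_categories, loaded_words)
```

With real data already under nltk_data/corpora/, the temporary directory and the `nltk.data.path.append` call would not be needed.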

2017-02-11 16:13:57 bunji

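And then one day you update NLTK, and these changes get wiped out without notice. Really, it is safer to go with alexis's answer. – lenz

@bunji - I tried alexis's way. It is working. Thanks for your guidance – Janitha

@Janitha, glad to help. I guess I misread your request to "access it the same way as the existing corpora" as meaning that it should be importable like the existing corpora. My bad... – bunji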

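OK, so I had a bit of trouble with the solutions provided. The approach I found simple was to first create my folder and subfolders in the "corpora" directory and then edit the __init__.py file.

In my case, the corpus I created is vc, and its subfolders are audio_them, audio_us, video_them, and video_us: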

vc = LazyCorpusLoader(
    'vc', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt',
    cat_pattern=r'(audio_them|audio_us|video_them|video_us)/.*',
    encoding="ascii")
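To see how such a cat_pattern sorts files into categories, here is a rough standard-library sketch. It assumes the pattern ends in `/.*` like the movie_reviews one, and that the category is whatever the first capture group matches at the start of a file id; the file ids and the `category_of` helper are hypothetical, and this is an illustration of the regex, not NLTK's own code.

```python
import re

# Assumed full form of the vc pattern (trailing ".*" as in movie_reviews).
CAT_PATTERN = r'(audio_them|audio_us|video_them|video_us)/.*'


def category_of(fileid):
    """Return the text captured by group 1, or None if the path does not match."""
    match = re.match(CAT_PATTERN, fileid)
    return match.group(1) if match else None


# Hypothetical file ids, given relative to the corpus root.
example_fileids = ["audio_them/a01.txt", "video_us/clip7.txt", "notes/readme.txt"]
example_categories = {fileid: category_of(fileid) for fileid in example_fileids}

for fileid, category in example_categories.items():
    print(fileid, "->", category)
```

A file outside the four listed subfolders gets no category in this sketch, which is a quick way to check that every file in the corpus sits in a folder the pattern names.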

2017-11-18 16:21:21 Fayomi
