Text mining with the tm package - word stemming

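I am doing some text mining in R with the tm package. Everything works very smoothly. However, one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming). Obviously, some words share the same stem, but it is important that they are not "thrown together" (because those words mean different things).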

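For an example, see the 4 texts below. Here you cannot use "lecturer" and "lecture" (or "association" and "associate") interchangeably. However, this is exactly what happens in step 4.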

Is there an elegant way to handle certain cases/words manually (e.g. so that "lecturer" and "lecture" are kept as two different things)?

texts <- c("i am member of the XYZ association", 
"apply for our open associate position", 
"xyz memorial lecture takes place on wednesday", 
"vote for the most popular lecturer") 

library(tm)

# Step 1: Create corpus
corpus <- Corpus(DataframeSource(data.frame(texts))) 

# Step 2: Keep a copy of corpus to use later as a dictionary for stem completion 
corpus.copy <- corpus 

# Step 3: Stem words in the corpus 
corpus.temp <- tm_map(corpus, stemDocument, language = "english") 

inspect(corpus.temp) 

# Step 4: Complete the stems to their original form 
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy) 

inspect(corpus.final)
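
A minimal sketch of the collision, assuming the SnowballC package is installed. The code above only uses tm; calling `wordStem` directly on a hand-picked word vector is an assumption made here for illustration. It shows why step 4 cannot recover the distinction: once two words share a stem, completion has only one stem to work from.

```r
library(SnowballC)

# The four problem words from the example texts
words <- c("lecture", "lecturer", "association", "associate")

# Stem them with the English Snowball stemmer
stems <- wordStem(words, language = "english")

# Show each word next to its stem
print(data.frame(word = words, stem = stems))

# Each pair collapses to a single stem, so the original word is ambiguous
same_lecture <- stems[1] == stems[2]
same_assoc   <- stems[3] == stems[4]
print(c(lecture_pair = same_lecture, association_pair = same_assoc))
```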

Source

2013-04-17 majom

That is the point of stemming. You do it to get the root word. If you want to keep the differences, then don't stem. –

I know. But is there an elegant way to change this in certain cases? – majom

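I'm not 100% sure what you're after, and I don't fully understand how tm_map works. If I understand correctly, the following works. As far as I can tell, you want to supply a list of words that should not be stemmed. I'm using the qdap package mostly because I'm lazy and it has a function I like, mgsub.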

Note that I got frustrated using mgsub with tm_map because it kept throwing an error, so I just used lapply instead.

texts <- c("i am member of the XYZ association", 
    "apply for our open associate position", 
    "xyz memorial lecture takes place on wednesday", 
    "vote for the most popular lecturer") 

library(tm) 
# Step 1: Create corpus 
corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts))) 

library(qdap) 
# Step 2: words to retain and their identifier keys
retain <- c("lecturer", "lecture") 
replace <- paste(seq_len(length(retain)), "SPECIAL_WORD", sep="_") 

# Step 3: sub the words you want to retain with identifier keys 
corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace) 

# Step 4: Stem it 
corpus.temp <- tm_map(corpus, stemDocument, language = "english") 

# Step 5: reverse -> sub the identifier keys with the words you want to retain 
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain) 

inspect(corpus)  #inspect the pieces for the folks playing along at home 
inspect(corpus.copy) 
inspect(corpus.temp) 

# Step 6: complete the stem 
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy) 
inspect(corpus.final)

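Basically it works by:

subbing out the supplied "NO STEM" words for unique identifier keys (mgsub)
then stemming (using stemDocument)
next reversing it and subbing the identifier keys back to the "NO STEM" words (mgsub)
finally completing the stems (stemCompletion)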

Here is the output:

## >  inspect(corpus.final) 
## A corpus with 4 text documents 
## 
## The metadata consists of 2 tag-value pairs and a data frame 
## Available tags are: 
## create_date creator 
## Available variables in the data frame are: 
## MetaID 
## 
## $`1` 
## i am member of the XYZ associate 
## 
## $`2` 
## for our open associate position 
## 
## $`3` 
## xyz memorial lecture takes place on wednesday 
## 
## $`4` 
## vote for the most popular lecturer
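
The same goal can also be sketched without qdap or placeholder keys. This is a variant rather than the method above: it splits each text into tokens and simply skips the stemmer for any token in the retain list. The helper name `stem_except` is hypothetical, and using `SnowballC::wordStem` directly instead of `stemDocument` is an assumption made for illustration.

```r
library(SnowballC)

# Hypothetical helper: stem every space-separated token of `text`
# except tokens that appear (case-insensitively) in `retain`.
stem_except <- function(text, retain, language = "english") {
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  keep <- tolower(tokens) %in% retain
  # Only call the stemmer when there is something to stem
  if (any(!keep)) {
    tokens[!keep] <- wordStem(tokens[!keep], language = language)
  }
  paste(tokens, collapse = " ")
}

texts <- c("i am member of the XYZ association",
           "apply for our open associate position",
           "xyz memorial lecture takes place on wednesday",
           "vote for the most popular lecturer")

retain <- c("lecturer", "lecture")

stemmed <- vapply(texts, stem_except, character(1),
                  retain = retain, USE.NAMES = FALSE)
print(stemmed)
```

Because the match is on whole tokens, "lecture" inside "lecturer" is never touched by accident, which is something a substring-based substitution has to handle by ordering its patterns. The sketch does no stem completion, so unprotected words stay in stemmed form.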

Source

2013-04-18 00:01:27

Thanks for your help. It works great. – majom

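You can also use the following package for stemming words: https://cran.r-project.org/web/packages/SnowballC/SnowballC.pdf.

You just need to use the function wordStem, passing it the vector of words to be stemmed and the language you are dealing with. To find the exact language string to use, refer to the function getStemLanguages, which returns all the possible options.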
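
A short sketch of that workflow, assuming SnowballC is installed. The word vector and the choice of "english" are illustrative assumptions, not part of the answer above.

```r
library(SnowballC)

# List the language strings that wordStem accepts
langs <- getStemLanguages()
print(langs)

# Illustrative word vector and language choice
words <- c("lecture", "lecturer", "mining", "texts")
lang <- "english"

# Check the language string before stemming
stopifnot(lang %in% langs)

stemmed <- wordStem(words, language = lang)
print(stemmed)
```

Note that wordStem works on a vector of individual words, so a full text has to be split into tokens first.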

Kind regards

Source

2017-07-04 02:06:51 brunoazev
