Form bigrams without stop words in R

I have recently run into a problem with text mining in R. The goal is to find meaningful keywords in news text, such as "smart car" and "data mining".

For example, suppose I have a string like this:

"IBM have a great success in the computer industry for the past decades..."

After removing the stop words ("have", "a", "in", "the", "for"), it becomes:

"IBM great success computer industry past decades..."

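As a result, bigrams such as "success computer" or "industry past" are produced.

What I actually need are bigrams whose two words had no stop word between them in the original text. "computer industry" is a clear example of the kind of bigram I want.

Part of my code is below: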

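library(tm)  # attaching tm also attaches NLP, which provides words() and ngrams()

# corpus is a tm corpus built earlier (not shown)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
# Tokenizer that joins every pair of adjacent words into one bigram
NgramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
dtm <- TermDocumentMatrix(corpus, control = list(tokenize = NgramTokenizer))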

Is there any way to avoid results like "success computer" when counting term frequencies (TF)?


2015-12-15 John Chou

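Maybe you could first split each sentence into sub-sentences at the stop words, and then identify the bigrams within each sub-sentence.

@VenYao How would I split the sentences, and with which function? I import the text with readLines. What happens if there is a lot of text? I am worried about efficiency.

Use the 'strsplit' function. It is fast.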
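The suggestion in these comments could be sketched roughly as follows in base R. This is only an illustration under stated assumptions: the helper name bigrams_between_stopwords and the five-word stop list are made up for the example (a real list such as tm's stopwords("english") is much longer), and punctuation is treated as a break in the same way as a stop word.

```r
# Illustrative stop list only (assumption); a real list is much longer.
stop_words <- c("have", "a", "in", "the", "for")

# Hypothetical helper: bigrams that never span a stop word or punctuation.
bigrams_between_stopwords <- function(text, stop_words) {
  text <- tolower(text)
  # Turn runs of punctuation into a break marker so they also split segments.
  text <- gsub("[[:punct:]]+", " | ", text)
  # One alternation matching any whole stop word, or the break marker.
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b|\\|")
  segments <- strsplit(text, pattern, perl = TRUE)[[1]]
  # Split each segment into words and pair up neighbours within it.
  out <- lapply(segments, function(seg) {
    w <- strsplit(trimws(seg), "\\s+")[[1]]
    w <- w[nzchar(w)]
    if (length(w) < 2) return(character(0))
    paste(w[-length(w)], w[-1])
  })
  unlist(out)
}

txt <- "IBM have a great success in the computer industry for the past decades..."
bigrams_between_stopwords(txt, stop_words)
# [1] "great success"     "computer industry" "past decades"
```

For text read with readLines, the helper could be applied line by line, for example with lapply(lines, bigrams_between_stopwords, stop_words). Whether that is fast enough for a large corpus would need to be measured.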

Note: Edited 2017-10-12 to reflect the new quanteda syntax.

You can do this in quanteda, which can remove stop words from ngrams after they have been formed.

txt <- "IBM have a great success in the computer industry for the past decades..." 

library("quanteda") 
myDfm <- tokens(txt) %>% 
    tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>% 
    tokens_remove(stopwords("english"), padding = TRUE) %>% 
    tokens_ngrams(n = 2) %>% 
    dfm() 

featnames(myDfm) 
# [1] "great_success"  "computer_industry" "past_decades"

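What this does:

1. Forms the tokens.
2. Removes punctuation using a regular expression, but leaves an empty pad wherever a token was removed. This ensures that you never form ngrams from tokens that were not adjacent in the text because punctuation separated them.
3. Removes the stop words, also leaving pads in their place.
4. Forms the bigrams.
5. Constructs the document-feature matrix.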
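To make the role of the pads concrete, here is a minimal base R sketch of the same pad-then-pair idea. It is not how quanteda implements it. The helper name padded_bigrams, the short stop list, and the hand-written token vector are all assumptions for illustration.

```r
# Illustrative stop list only (assumption).
stop_words <- c("have", "a", "in", "the", "for")

# Hypothetical helper mirroring the idea behind padding = TRUE.
padded_bigrams <- function(tokens, stop_words) {
  # Replace stop words and pure-punctuation tokens with an empty pad ("").
  is_pad <- tolower(tokens) %in% stop_words | grepl("^[[:punct:]]+$", tokens)
  padded <- ifelse(is_pad, "", tokens)
  if (length(padded) < 2) return(character(0))
  left  <- padded[-length(padded)]
  right <- padded[-1]
  # Keep a pair only when neither side is a pad, i.e. the two words
  # were truly adjacent in the original text.
  keep <- nzchar(left) & nzchar(right)
  paste(left[keep], right[keep], sep = "_")
}

tokens <- c("IBM", "have", "a", "great", "success", "in", "the",
            "computer", "industry", "for", "the", "past", "decades",
            ".", ".", ".")
padded_bigrams(tokens, stop_words)
# [1] "great_success"     "computer_industry" "past_decades"
```

Because the pads keep their positions, "success" and "computer" are never paired: two pads still sit between them.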

To get the counts of these bigrams, you can inspect the dfm directly, or use topfeatures():

myDfm 
# Document-feature matrix of: 1 document, 3 features.
# 1 x 3 sparse Matrix of class "dfmSparse"
#        features
# docs    great_success computer_industry past_decades
#   text1             1                 1            1

topfeatures(myDfm)
#     great_success computer_industry      past_decades
#                 1                 1                 1
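
If the bigrams are already available as plain character vectors, the same kind of counts could also be obtained in base R. The sketch below assumes a hypothetical named list doc_bigrams with one vector of bigrams per document. The second document is invented purely to make the counts non-trivial.

```r
# Hypothetical input: one character vector of bigrams per document.
doc_bigrams <- list(
  doc1 = c("great_success", "computer_industry", "past_decades"),
  doc2 = c("computer_industry", "data_mining", "computer_industry")
)

# Overall term frequency across all documents, most frequent first.
tf <- sort(table(unlist(doc_bigrams)), decreasing = TRUE)

# Term-by-document count matrix (rows = bigrams, columns = documents).
terms <- sort(unique(unlist(doc_bigrams)))
tdm <- vapply(
  doc_bigrams,
  function(b) as.integer(table(factor(b, levels = terms))),
  integer(length(terms))
)
rownames(tdm) <- terms

tf
tdm
```

Here tf gives a single frequency per bigram, while tdm keeps the per-document breakdown.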


2015-12-15 08:26:07

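Thanks! But how do I integrate this with the tm package? Can I pass myDfm to TermDocumentMatrix to get the TF results? My code is at the following link on GitHub: [my code](https://gist.github.com/anonymous/0afbbe82415b0a55cfb3)

dtm <- TermDocumentMatrix(myDfm)

What do you mean by "TF results"? 'dfm()' produces a document-feature matrix. You can query the top terms with 'topfeatures(myDfm)', or operate on the matrix directly.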
