Regular expression error message: 'Out of memory'

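I have been experimenting with sentiment analysis in R and keep running into an error raised while running the gsub function. The positive and negative word lists are taken from here.

After some Google searching, I found one mention of this error on the R-help list, but nothing anywhere else. Has anyone run into this problem? What exactly is going on? Is there a workaround?

I have run similar code on strings before (using gsub and the stringr package), and this is the first time I have hit this kind of error. I also tried to reproduce the error by writing a similar script against a different set of strings, and that worked fine.

Here is the error message: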

> pos_match <- str_c(vpos, collapse = "|") 
> neg_match <- str_c(vneg, collapse = "|") 
> dat$positive <- as.numeric(str_detect(dat$Comment, pos_match)) 
> dat$negative <- as.numeric(str_detect(dat$Comment, neg_match)) 
Error: invalid regular expression, reason 'Out of memory'

Here is the entire procedure:

## SET WORKING DIRECTORY AND IMPORT PACKAGES: 
setwd("~/Desktop/R_Tricks") 
require(tm); require(stringr); require(lubridate); library(RTextTools) 

# IMPORT DATA: 
d1 <- read.csv("Video_Comments.csv", stringsAsFactors=FALSE, sep=",", fileEncoding="ISO_8859-2") 
pos <- read.csv("positive-words.csv", stringsAsFactors=FALSE, header=TRUE, fileEncoding="ISO_8859-2") 
neg <- read.csv("negative-words.csv", stringsAsFactors=FALSE, header=TRUE, fileEncoding="ISO_8859-2") 
vpos = as.vector(pos[,1]); vneg = as.vector(neg[,1]) 
head(vpos); head(vneg) 
colnames(d1); nrow(d1); ncol(d1) 
str(d1); head(d1) 
table(d1$Likes); table(d1$Replies) 
nrow(vpos); nrow(vneg) 
length(vpos); length(vneg) 
is.atomic(vpos); is.atomic(vneg) 

# SELECT DATA: 
dat = data.frame(Comment=c(d1$Comment)) 
head(dat) 
# CLEAN DATA - COMMENTS: 
dat$Comment = gsub('[[:punct:]]', '', dat$Comment) 
dat$Comment = gsub('[[:cntrl:]]', '', dat$Comment) 
dat$Comment = gsub('\\d+', '', dat$Comment) 
dat$Comment = tolower(dat$Comment) 
head(dat) 
# CLEAN DATA - CLASSIFICATIONS: 
vpos = gsub('[[:punct:]]', '', vpos); vneg = gsub('[[:punct:]]', '', vneg) 
vpos = gsub('[[:cntrl:]]', '', vpos); vneg = gsub('[[:cntrl:]]', '', vneg) 
vpos = gsub('\\d+', '', vpos); vneg = gsub('\\d+', '', vneg) 
vpos = tolower(vpos); vneg = tolower(vneg) 
head(vpos); head(vneg) 

# MATCH WORDS WITH FACEBOOK COMMENTS: 
pos_match <- str_c(vpos, collapse = "|") 
neg_match <- str_c(vneg, collapse = "|") 
dat$positive <- as.numeric(str_detect(dat$Comment, pos_match)) 
dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))

Edit:

Another error message I have received is:

> dat$negative <- as.numeric(str_detect(dat$Comment, neg_match)) 
Error: invalid regular expression 'faced|faces|abnormal|abolish|abominable|abominably|abominate|abomination|abort|aborted|

Edit 2:

Data to reproduce the error:

dat = c("Hey guys I am Aliza Lomez...18 y.o. I need your likes please like my page and find love quotes, beauty tips and much more.Please like my page you will never regret thank u all\u0083 <3 <3 <3...", 
     "Alexandra Saturn", "And that's what makes a Subaru a Subaru", "Missouri in a battleground....; meanwhile in southern California....", "What the Frisbee", "very cool !!!!", "Get a life", 
     "Try that with my GT!!!", "Did he make any money?", "Wo! WO! BSMITH THROWING DISCS WITH SUBARUS?!?! THIS IS SO AWESOME! SHOULD OF USED AN STI THO")

Source

2014-10-06 ATMA

Are you creating ~6000 'OR' operator matches with "|"? `pos_match <- str_c(vpos, collapse = "|")` – zx8754 2014-10-06 14:01:40

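I have not answered your question because it is not reproducible, but you may want to look at the [`polarity` function](http://trinker.github.io/qdap_dev/polarity.html) in the `qdap` package. You may be reinventing something that has already been done. – 2014-10-06 14:03:05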

The `tm` package also has a `tm.plugin.sentiment` [plugin/package](https://r-forge.r-project.org/R/?group_id=1048) that should work a bit better than building a giant regular expression. – hrbrmstr 2014-10-06 14:05:27
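Along the lines of the comments above, one way to avoid building a single huge alternation is to skip regular expressions for the lookup altogether: split each comment into words and test membership with `%in%`. The following is only a minimal sketch in base R. The short `vpos`, `vneg`, and `comments` vectors are hypothetical stand-ins for the cleaned data in the question, and `has_word` is an illustrative helper, not part of any package.

```r
# Hypothetical stand-ins for the cleaned word lists and comments.
vpos <- c("awesome", "cool", "love")
vneg <- c("regret", "abnormal")
comments <- c("very cool", "get a life",
              "you will never regret it", "this is so awesome")

# Split each comment into lower-case tokens on runs of whitespace.
tokens <- strsplit(tolower(comments), "[[:space:]]+")

# A comment is flagged when at least one of its tokens is in the word list.
has_word <- function(tok, words) any(tok %in% words)

positive <- as.numeric(vapply(tokens, has_word, logical(1), words = vpos))
negative <- as.numeric(vapply(tokens, has_word, logical(1), words = vneg))
```

Note that this matches whole words only, whereas `str_detect` with an unanchored pattern also matches substrings (for example "love" inside "glove"), so the two approaches can give different counts.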

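I do not know the whole solution, but I can get you started. I have made this a community wiki in the hope that someone can fill in the blanks...

As for the invalid regular expression: to create an OR, you need to enclose everything in parentheses. For example, if you want to match the words "a", "an", or "the", you can use the regex string (a|an|the). If I have a list of words that I want to match with an OR regex, here is what I usually use: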

mywords <- c("a", "an", "the") 
mystring <- paste0("(", paste(mywords, collapse="|"), ")") 

> mystring 
[1] "(a|an|the)"

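This should get rid of your invalid regular expression error, since your string does not begin with an opening parenthesis and it ends with a pipe rather than a closing parenthesis.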
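If a single pattern built from the full word list is still too large for the regex engine, the same parenthesised construction can be applied to smaller chunks of the list and the results combined. The sketch below is an assumption-laden illustration, not a confirmed fix for the 'Out of memory' error: `detect_any` is a hypothetical helper, the default `chunk_size` of 500 is arbitrary, and the words are assumed to contain no regex metacharacters (the question's cleaning step already strips punctuation). It also drops empty strings, which can appear when a list entry consisted only of punctuation or digits and would otherwise leave an empty alternative in the pattern.

```r
# Hypothetical helper: TRUE for each element of `text` that contains
# at least one of `words` as a whole word.
detect_any <- function(text, words, chunk_size = 500) {
  # Drop empty strings and duplicates before building any pattern.
  words <- unique(words[nzchar(words)])
  # Split the word list into groups of at most `chunk_size` words.
  groups <- split(words, ceiling(seq_along(words) / chunk_size))
  hit <- logical(length(text))
  for (g in groups) {
    # Same "(a|an|the)" construction as above, with word boundaries added.
    pattern <- paste0("\\b(", paste(g, collapse = "|"), ")\\b")
    hit <- hit | grepl(pattern, text, perl = TRUE)
  }
  hit
}

# Usage with a deliberately tiny chunk size to exercise the loop.
res <- detect_any(c("very cool", "get a life", "so awesome", "a glove"),
                  c("cool", "awesome", "love", ""), chunk_size = 1)
```

The word boundaries are a design choice of this sketch: without them, "love" would also match inside "glove", which is how the unanchored pattern in the question behaves.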

Source

2014-10-06 14:29:07
