2015-03-13

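Remove multiple patterns from a text vector (R)

I want to remove multiple patterns from several character vectors. At the moment I am doing: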

a.vector <- gsub("@\\w+", "", a.vector) 
a.vector <- gsub("http\\w+", "", a.vector) 
a.vector <- gsub("[[:punct:]]", "", a.vector)

and so on.

This is painful. I looked at this question and answer: "R: gsub, pattern = vector and replacement = vector", but it did not solve my problem.

Neither mapply nor mgsub worked. I created these vectors:

remove <- c("@\\w+", "http\\w+", "[[:punct:]]") 
substitute <- c("") 

Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.

a.vector looks like this:

[4951] "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"                                                                                                                                             
[4952] "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg" 

I would like:

[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"                                                                                                                                             
[4952] "you are phenomenal #mental #Writing"

Answers

1

Try combining your sub-patterns with |. For example:

> s <- "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("@\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation mental"

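However, this can become a problem if you have a large number of patterns, or if the result of applying one pattern creates matches for another.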
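With many patterns, the alternation does not have to be typed by hand. The following is a minimal sketch, assuming the patterns are stored in a character vector like the `remove` vector from the question; the sample string `s` is shortened from the question, and the names `combined` and `cleaned` are only illustrative.

```r
# Sample input, shortened from the question
s <- "@karakamen: Kudos for staing the conversation. #mental"

# Patterns to delete, as in the question's `remove` vector
remove <- c("@\\w+", "http\\w+", "[[:punct:]]")

# Wrap each pattern in a group and join the groups with "|"
combined <- paste0("(", remove, ")", collapse = "|")

# One gsub() call now removes every match of any of the patterns
cleaned <- gsub(combined, "", s)
print(cleaned)
```

Wrapping each pattern in parentheses keeps a pattern that itself contains `|` from blending into its neighbours.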

Consider creating your remove vector as you suggested, then applying it in a loop:

> s1 <- s
> remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation mental"

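This approach will need to be extended to apply it to a whole table or vector, of course. But if you put it into a function that returns the final string, you should be able to pass that function to one of the apply variants.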
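One possible way to package the loop is sketched below. The helper name `remove_patterns` is hypothetical, and the two sample strings are shortened from the question. Because `gsub()` is vectorised over its `x` argument, the helper can take the whole vector at once, so an apply call may not even be needed.

```r
# Hypothetical helper: delete each pattern in turn from every element of x
remove_patterns <- function(x, patterns) {
  for (p in patterns) {
    x <- gsub(p, "", x)  # gsub() works on the whole vector in one call
  }
  x
}

a.vector <- c(
  "@karakamen: Kudos for staing the conversation. #mental",
  "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
)
remove <- c("@\\w+", "http\\w+", "[[:punct:]]")

cleaned <- remove_patterns(a.vector, remove)
print(cleaned)

# Order matters: stripping punctuation first deletes the "@",
# so "@\\w+" no longer matches and the user name is left behind
reordered <- remove_patterns(a.vector, rev(remove))
print(reordered)
```

The second call illustrates why the order of the patterns is worth checking: each pass sees the output of the previous one.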

0

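If the patterns you are looking for are fixed and do not change from case to case, you can consider building one concatenated regular expression that combines all of them into a single "uber" pattern.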

For the example you provided, you can try:

removePat <- "(@\\w+)|(http\\w+)|([[:punct:]])" 

a.vector <- gsub(removePat, "", a.vector)
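Note that `[[:punct:]]` also matches `#`, while the desired output in the question keeps the hashtags and has no leading blank. A sketch of one way to get there, assuming that `#` is the only punctuation character worth keeping: replace the punctuation class with a negated class and trim the result. The name `keepHashPat` is illustrative, and `trimws()` requires R 3.2.0 or later.

```r
a.vector <- c(
  "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental",
  "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
)

# Assumption: "#" should survive, so instead of [[:punct:]] remove
# anything that is not a letter, a digit, whitespace or "#"
keepHashPat <- "@\\w+|http\\w+|[^[:alnum:][:space:]#]"

cleaned <- gsub(keepHashPat, "", a.vector)
cleaned <- trimws(cleaned)  # drop the blanks left at either end
print(cleaned)
```

If other characters should be kept as well, they can be added inside the negated bracket expression next to `#`.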