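Parsing tweets to extract #hashtags in R

I was wondering if anyone has a quick solution for extracting hashtags from tweets in R. For example, given the following string, how can I parse it to extract the word with the hashtag?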

string <- 'Crowdsourcing is awesome. #stackoverflow'

2012-07-18 notrockstar

Unlike HTML, I expect you can probably parse hashtags with a regular expression.

library(stringr)
string <- "#hashtag Crowd#sourcing is awesome. #stackoverflow #question"
# I don't use Twitter, so maybe this regex is not right
# for the set of allowable hashtag characters.
# regex() replaces perl(), which is deprecated since stringr 1.0.0.
hashtag.regex <- regex("(?<=^|\\s)#\\S+")
hashtags <- str_extract_all(string, hashtag.regex)

Which yields:

> print(hashtags) 
[[1]] 
[1] "#hashtag"  "#stackoverflow" "#question"

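Note that this also works without modification if string is actually a vector of many tweets. It returns a list of character vectors, one per tweet.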
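To illustrate the vector case, here is a minimal sketch that applies the same pattern to several tweets at once. It swaps stringr for base R's gregexpr() and regmatches() with perl = TRUE, which is an assumption made here so that the example has no package dependency; the names tweets and hashtag_pattern are made up for the illustration.

```r
# Illustrative input: three tweets of different lengths, one without any tag.
tweets <- c("Crowdsourcing is awesome. #stackoverflow #answer",
            "another #tag in this tweet",
            "no tags here")

# Same pattern as in the answer: a "#" at the start of the string or after
# whitespace, followed by one or more non-whitespace characters.
hashtag_pattern <- "(?<=^|\\s)#\\S+"

# gregexpr() finds every match in each tweet; regmatches() extracts them.
# The result is a list with one character vector per tweet.
hashtags <- regmatches(tweets, gregexpr(hashtag_pattern, tweets, perl = TRUE))

# Flatten to a single character vector if per-tweet grouping is not needed.
all_hashtags <- unlist(hashtags)
```

A tweet without hashtags simply contributes an empty character vector, so the list always has the same length as the input.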

2012-07-18 22:25:17

Thanks! It works like a charm! – notrockstar 2012-07-18 22:41:33

Something like this?

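string <- c('Crowdsourcing is awesome. #stackoverflow #answer',
    "another #tag in this tweet")
# Split each tweet at every "#".
step1 <- strsplit(string, "#")
# Drop the first piece, which is the text before the first "#".
step2 <- lapply(step1, tail, -1)
# Keep only the first word of each remaining piece: the tag without "#".
result <- lapply(step2, function(x){
    sapply(strsplit(x, " "), head, 1)
})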

2012-07-18 22:21:00 Thierry

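Thanks. But what happens if the tweets differ in length or word count? Is there a more general way to get only the hashtags? I have more than 20,000 tweets. – notrockstar 2012-07-18 22:26:59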
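On the question of differing lengths: the strsplit steps work tweet by tweet, so the word count of each tweet should not matter. A rough sketch that wraps the same steps in a function and runs it on tweets of different lengths follows; the function name extract_tags is hypothetical, and vapply() is used in place of sapply() only so that the return type is always a character vector.

```r
# Hypothetical helper wrapping the strsplit steps from the answer.
extract_tags <- function(tweets) {
  # Split each tweet at every "#".
  pieces <- strsplit(tweets, "#", fixed = TRUE)
  # Drop the text before the first "#".
  after_hash <- lapply(pieces, tail, -1)
  # Keep the first word of every remaining piece.
  lapply(after_hash, function(x) {
    vapply(strsplit(x, " ", fixed = TRUE), function(w) w[1], character(1))
  })
}

tweets <- c("Crowdsourcing is awesome. #stackoverflow #answer",
            "another #tag in this tweet",
            "a much longer tweet with no hashtag in it at all")

tags <- extract_tags(tweets)

# One flat vector of tags across all tweets.
all_tags <- unlist(tags)
```

Unlike the regular expression approach, this returns the tags without the leading "#", and it also picks up a "#" in the middle of a word (for example "Crowd#sourcing" gives "sourcing").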