Extracting hashtags from Twitter - string error in R


I have Twitter data. Using library(stringr) I have already extracted all the web links. However, when I try to do the same for hashtags, I get an error. The same code worked a few days ago. Here is the code:

library(stringr) 
hash <- "#[a-zA-Z0-9]{1, }" 
hashtag <- str_extract_all(travel$texts, hash)

This is the error:

Error in stri_extract_all_regex(string, pattern, simplify = simplify, : 
    Error in {min,max} interval. (U_REGEX_BAD_INTERVAL)

I have reinstalled the stringr package, but it did not help.

The code I used for the web links is:

pat1 <- "http://t.co/[a-zA-Z0-9]{1,}" 
twitlink <- str_extract_all(travel$texts, pat1)

A reproducible example follows:

rtt <- structure(data.frame(texts = c("Review Anthem of the Seas Anthems  maiden voyage httptcoLPihj2sNEP #stevenewman", "#Job #Canada #Marlin Travel Agentagente de voyages Full Time in #St Catharines ON httptconMHNlDqv69", "Experience #Fiji amp #NewZealand like never before on a great 10night voyage 4033 pp departing Vancouver httptcolMvChSpaBT"), source = c("Twitter Web Client", "Catch a Job Canada", "Hootsuite"), tweet_time = c("2015-05-07 19:32:58", "2015-05-07 19:37:03", "2015-05-07 20:45:36")))


2015-05-23 Apricot

Could you provide a reproducible example? – akrun

rtt <- structure(data.frame(texts = c("Review Anthem of the Seas Anthems  maiden voyage httptcoLPihj2sNEP #stevenewman", "#Job #Canada #Marlin Travel Agentagente de voyages Full Time in #St Catharines ON httptconMHNlDqv69", "Experience #Fiji amp #NewZealand like never before on a great 10night voyage 4033 pp departing Vancouver httptcolMvChSpaBT"), source = c("Twitter Web Client", "Catch a Job Canada", "Hootsuite"), tweet_time = c("2015-05-07 19:32:58", "2015-05-07 19:37:03", "2015-05-07 20:45:36"))) – Apricot

Please add this information to your post rather than putting it in a comment. – akrun

Your problem comes from the whitespace in hash:

# Not working (note the whitespace after the comma)
str_extract_all(rtt$texts, "#[a-zA-Z0-9]{1, }")
# Working
str_extract_all(rtt$texts, "#[a-zA-Z0-9]{1,}")
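
As a minimal sketch of the point above, the following uses a small made-up vector (`tweets` is an assumption for illustration, not the asker's `travel` data). It wraps the pattern with the stray space in `tryCatch` so the error message can be inspected instead of stopping the script, and it shows that `{1,}` and the shorter `+` are interchangeable here.

```r
library(stringr)

# Hypothetical sample data, not the asker's `travel` data frame
tweets <- c("maiden voyage #stevenewman", "#Job #Canada in #St Catharines")

# The space inside the braces is what triggers the interval error;
# tryCatch captures the message instead of aborting
bad <- tryCatch(
  str_extract_all(tweets, "#[a-zA-Z0-9]{1, }"),
  error = function(e) conditionMessage(e)
)

# {1,} and + both mean "one or more", so the results should match
good_braces <- str_extract_all(tweets, "#[a-zA-Z0-9]{1,}")
good_plus   <- str_extract_all(tweets, "#[a-zA-Z0-9]+")

identical(good_braces, good_plus)
good_plus
```

Writing the quantifier as `+` avoids the braces altogether, which removes the opportunity for this particular typo.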


2015-05-23 11:39:11 nicola

Thanks a lot, nicola. It works, although it seems strange! Thanks a ton, Tyler Rinker. I will try qdapRegex soon and share the results. – Apricot

Why is it strange? You cannot use whitespace inside a quantifier: http://www.regular-expressions.info/repeat.html#lazy This is typical behavior. –

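You may want to consider the qdapRegex package, which I maintain, for this task. It makes extracting URLs and hashtags easy. qdapRegex is a package containing a collection of canned regular expressions, and it uses the amazing stringi package as a backend to perform the regex tasks.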

rtt <- structure(data.frame(texts = c("Review Anthem of the Seas Anthems  maiden voyage httptcoLPihj2sNEP #stevenewman", "#Job #Canada #Marlin Travel Agentagente de voyages Full Time in #St Catharines ON httptconMHNlDqv69", "Experience #Fiji amp #NewZealand like never before on a great 10night voyage 4033 pp departing Vancouver httptcolMvChSpaBT"), source = c("Twitter Web Client", "Catch a Job Canada", "Hootsuite"), tweet_time = c("2015-05-07 19:32:58", "2015-05-07 19:37:03", "2015-05-07 20:45:36"))) 

library(qdapRegex) 
## first combine the built in url + twitter regexes into a function 
rm_twitter_n_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE) 
rm_twitter_n_url(rtt$texts) 

rm_hash(rtt$texts, extract=TRUE)

This gives the following output:

## > rm_twitter_n_url(rtt$texts) 
## [[1]] 
## [1] "httptcoLPihj2sNEP" 
## 
## [[2]] 
## [1] "httptconMHNlDqv69" 
## 
## [[3]] 
## [1] "httptcolMvChSpaBT" 


## > rm_hash(rtt$texts, extract=TRUE) 
## [[1]] 
## [1] "#stevenewman" 
## 
## [[2]] 
## [1] "#Job" "#Canada" "#Marlin" "#St"  
## 
## [[3]] 
## [1] "#Fiji"  "#NewZealand"
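
If installing an extra package is not an option, the same kind of extraction can be sketched in base R with `gregexpr` and `regmatches`. This is only an illustration under assumptions: the `texts` vector below is a shortened, hypothetical version of the example tweets, and the pattern is the simple letters-and-digits one from the question rather than the regex that qdapRegex uses internally.

```r
# Shortened, hypothetical versions of the example tweets
texts <- c(
  "maiden voyage httptcoLPihj2sNEP #stevenewman",
  "#Job #Canada #Marlin Travel Agent in #St Catharines ON",
  "Experience #Fiji amp #NewZealand like never before"
)

# gregexpr finds every match position; regmatches pulls out the matched text.
# The result is a list with one character vector per tweet.
hashtags <- regmatches(texts, gregexpr("#[A-Za-z0-9]+", texts))

hashtags
```

A tweet without any hashtag yields an empty character vector in the corresponding list element, so the list always has the same length as the input.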


2015-05-23 13:13:41
