2015-05-23 36 views

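Extracting hashtags from twitter - error in input string in R

I have twitter data. Using library(stringr) I have extracted all the web links. However, when I try to do the same thing for hashtags, I get an error. The same code worked a few days ago. Here is the code: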

library(stringr) 
hash <- "#[a-zA-Z0-9]{1, }" 
hashtag <- str_extract_all(travel$texts, hash) 

This is the error:

Error in stri_extract_all_regex(string, pattern, simplify = simplify, : 
    Error in {min,max} interval. (U_REGEX_BAD_INTERVAL) 

I have reinstalled the stringr package, but it did not help.

The code I used for the web links is:

pat1 <- "http://t.co/[a-zA-Z0-9]{1,}" 
twitlink <- str_extract_all(travel$texts, pat1) 

A reproducible example follows:

rtt <- structure(data.frame(texts = c("Review Anthem of the Seas Anthems  maiden voyage httptcoLPihj2sNEP #stevenewman", "#Job #Canada #Marlin Travel Agentagente de voyages Full Time in #St Catharines ON httptconMHNlDqv69", "Experience #Fiji amp #NewZealand like never before on a great 10night voyage 4033 pp departing Vancouver httptcolMvChSpaBT"), source = c("Twitter Web Client", "Catch a Job Canada", "Hootsuite"), tweet_time = c("2015-05-07 19:32:58", "2015-05-07 19:37:03", "2015-05-07 20:45:36"))) 

Can you provide a reproducible example? – akrun

rtt <- structure(data.frame(texts = c("Review Anthem of the Seas Anthems  maiden voyage httptcoLPihj2sNEP #stevenewman", "#Job #Canada #Marlin Travel Agentagente de voyages Full Time in #St Catharines ON httptconMHNlDqv69", "Experience #Fiji amp #NewZealand like never before on a great 10night voyage 4033 pp departing Vancouver httptcolMvChSpaBT"), source = c("Twitter Web Client", "Catch a Job Canada", "Hootsuite"), tweet_time = c("2015-05-07 19:32:58", "2015-05-07 19:37:03", "2015-05-07 20:45:36"))) – Apricot

Please add this information to your post rather than putting it in a comment. – akrun

Answers

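Your problem comes from the whitespace in hash: the quantifier is written as {1, } with a space after the comma, which the regex engine rejects as a bad interval.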

# Not working (note the whitespace after the comma)
str_extract_all(rtt$texts,"#[a-zA-Z0-9]{1, }")
# Working
str_extract_all(rtt$texts,"#[a-zA-Z0-9]{1,}")
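
As a cross-check of the corrected pattern that does not depend on stringr, the sketch below uses base R's regmatches() and gregexpr(). The tweets are shortened from the question's example, and extract_all is a hypothetical helper introduced only for this illustration. It also shows that "+" ("one or more") can be used in place of {1,}, which avoids the interval syntax altogether.

```r
# Sample tweets shortened from the question's reproducible example.
texts <- c(
  "Review Anthem of the Seas maiden voyage #stevenewman",
  "#Job #Canada #Marlin Travel Agent in #St Catharines ON",
  "Experience #Fiji amp #NewZealand like never before"
)

# Hypothetical helper (not part of stringr): all matches per element,
# returned as a list of character vectors.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x))
}

# Interval quantifier written without whitespace, as in the answer.
tags_interval <- extract_all(texts, "#[a-zA-Z0-9]{1,}")

# "+" means "one or more", so it matches the same strings as {1,}.
tags_plus <- extract_all(texts, "#[a-zA-Z0-9]+")

print(tags_plus)
```

Both patterns should give the same list of hashtags for these inputs, so either form can be dropped into str_extract_all() in place of the failing one.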

Thanks a lot nicola, it worked, although it seems strange! Thanks a ton Tyler Rinker, I will try qdapRegex soon and share the results. – Apricot

Why is it strange? You cannot use whitespace inside a quantifier: http://www.regular-expressions.info/repeat.html#lazy This is typical behavior. –

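You may want to consider qdapRegex, a package I maintain, for this task. It makes extracting URLs and hashtags easy. qdapRegex is a package containing a bunch of canned regular expressions, and it uses the amazing stringi package as a backend to perform the regex tasks.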

rtt <- structure(data.frame(texts = c("Review Anthem of the Seas Anthems  maiden voyage httptcoLPihj2sNEP #stevenewman", "#Job #Canada #Marlin Travel Agentagente de voyages Full Time in #St Catharines ON httptconMHNlDqv69", "Experience #Fiji amp #NewZealand like never before on a great 10night voyage 4033 pp departing Vancouver httptcolMvChSpaBT"), source = c("Twitter Web Client", "Catch a Job Canada", "Hootsuite"), tweet_time = c("2015-05-07 19:32:58", "2015-05-07 19:37:03", "2015-05-07 20:45:36"))) 

library(qdapRegex) 
## first combine the built in url + twitter regexes into a function 
rm_twitter_n_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE) 
rm_twitter_n_url(rtt$texts) 

rm_hash(rtt$texts, extract=TRUE) 

This gives the following output:

## > rm_twitter_n_url(rtt$texts) 
## [[1]] 
## [1] "httptcoLPihj2sNEP" 
## 
## [[2]] 
## [1] "httptconMHNlDqv69" 
## 
## [[3]] 
## [1] "httptcolMvChSpaBT" 


## > rm_hash(rtt$texts, extract=TRUE) 
## [[1]] 
## [1] "#stevenewman" 
## 
## [[2]] 
## [1] "#Job" "#Canada" "#Marlin" "#St"  
## 
## [[3]] 
## [1] "#Fiji"  "#NewZealand"