Reshaping a data.frame based on items regexed from a single column

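I have a data frame containing a list of tweets, grabbed using a Twitter library, and I want to reshape it based on the items matched by a regex in a single column, so that each matched item gets its own row alongside the other columns.

So, for example, I start with: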

tmp=data.frame(tweets=c("this tweet with #onehashtag","#two hashtags #here","no hashtags"),
               dummy=c('random','other','column'))
> tmp
                       tweets  dummy
1 this tweet with #onehashtag random
2         #two hashtags #here  other
3                 no hashtags column

and would like to produce:

result=data.frame(tweets=c("this tweet with #onehashtag","#two hashtags #here","#two hashtags #here","no hashtags"),
                  dummy=c('random','other','other','column'),
                  tag=c('#onehashtag','#two','#here',NA))
> result
                       tweets  dummy         tag
1 this tweet with #onehashtag random #onehashtag
2         #two hashtags #here  other        #two
3         #two hashtags #here  other       #here
4                 no hashtags column        <NA>

I can use a regular expression:

library(stringr) 
str_extract_all("#two hashtags #here","#[a-zA-Z0-9]+")

to extract the hashtags from a tweet into a list, perhaps using something like:

tmp$tags=sapply(tmp$tweets,function(x) str_extract_all(x,'#[a-zA-Z0-9]+'))
> tmp
                       tweets  dummy        tags
1 this tweet with #onehashtag random #onehashtag
2         #two hashtags #here  other #two, #here
3                 no hashtags column

But I'm missing a trick somewhere and can't see how to use this as the basis for creating the duplicated rows...

Source

2012-02-20 psychemedia

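Rows with and without hashtags behave differently, so your code will be easier to understand if you handle the two cases separately.

Use str_extract_all, as before, to get the hashtags.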

tags <- str_extract_all(tmp$tweets, '#[a-zA-Z0-9]+')

(You can also use the regex character class alnum to match all alphanumeric characters: '#[[:alnum:]]+')

Use rep to work out how many times to repeat each row.
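As a quick check of the alnum shortcut, here is a minimal sketch that compares the two patterns on the sample tweets. It uses base R's regmatches() and gregexpr() as a stand-in for str_extract_all so that it runs without stringr; the names tweets, ascii_tags and alnum_tags are only for illustration. Note that [[:alnum:]] is locale-dependent and may also match non-ASCII letters, so the two patterns are only guaranteed to agree on plain ASCII text like this.

```r
tweets <- c("this tweet with #onehashtag", "#two hashtags #here", "no hashtags")

# Explicit ASCII ranges, as in the question
ascii_tags <- regmatches(tweets, gregexpr("#[a-zA-Z0-9]+", tweets))

# POSIX character class (locale-dependent)
alnum_tags <- regmatches(tweets, gregexpr("#[[:alnum:]]+", tweets))

# On this ASCII-only sample the two patterns extract the same hashtags
identical(ascii_tags, alnum_tags)
```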

index <- rep.int(seq_len(nrow(tmp)), sapply(tags, length))
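To see what this index looks like, here is a small sketch with the hashtag list for the sample data written out by hand (so it does not need stringr). The variable n_tags is just an illustrative name.

```r
# Hashtags per tweet, as str_extract_all would return them for the sample data
tags <- list("#onehashtag", c("#two", "#here"), character(0))

# Number of hashtags found in each tweet: 1, 2, 0
n_tags <- sapply(tags, length)

# Row 1 once, row 2 twice, row 3 zero times
index <- rep.int(seq_len(length(tags)), n_tags)
index
# [1] 1 2 2
```

The third row is dropped here because its count is zero, which is why the rows without hashtags need separate handling.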

Expand tmp using this index, and add a tags column.

tagged <- tmp[index, ] 
tagged$tags <- unlist(tags)

Rows without hashtags should appear once (not zero times), with NA in the tags column.

has_no_tag <- sapply(tags, function(x) length(x) == 0L) 
not_tagged <- tmp[has_no_tag, ] 
not_tagged$tags <- NA

Combine the two.

all_data <- rbind(tagged, not_tagged)
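Note that rbind puts all the rows without hashtags after the tagged ones, so the original row order is not preserved in general. If the order matters, one possible variant (a sketch, not the method above) is to pad the empty matches with a single NA before expanding, so that both cases go through the same index. It assumes R 3.2 or later for lengths() and uses base R's regmatches() and gregexpr() in place of str_extract_all so that it runs without stringr.

```r
tmp <- data.frame(
  tweets = c("this tweet with #onehashtag", "#two hashtags #here", "no hashtags"),
  dummy  = c("random", "other", "column"),
  stringsAsFactors = FALSE
)

# Base R stand-in for str_extract_all(): one character vector per tweet
tags <- regmatches(tmp$tweets, gregexpr("#[a-zA-Z0-9]+", tmp$tweets))

# Pad tweets without hashtags with a single NA so they are kept exactly once
tags[lengths(tags) == 0L] <- list(NA_character_)

# Repeat each row once per (possibly NA) tag, in the original order
index <- rep.int(seq_len(nrow(tmp)), lengths(tags))
all_data <- tmp[index, ]
all_data$tag <- unlist(tags)
rownames(all_data) <- NULL
all_data
```

For the sample data this gives four rows in the order of the desired result, with NA as the tag of the last row.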

Source

2012-02-20 11:34:13

First, let's get the matches:

matches <- gregexpr("#[a-zA-Z0-9]+",tmp$tweets) 
matches 
[[1]] 
[1] 17 
attr(,"match.length") 
[1] 11 

[[2]] 
[1] 1 15 
attr(,"match.length") 
[1] 4 5 

[[3]] 
[1] -1 
attr(,"match.length") 
[1] -1

Now we can use this to get the right number of rows from the original:

rep(seq(matches),times=sapply(matches,length)) 
[1] 1 2 2 3 
tmp2 <- tmp[rep(seq(matches),times=sapply(matches,length)),]
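The reason the tweet without hashtags is kept here is that gregexpr reports "no match" as a single -1 rather than as an empty vector, so its length is 1. A minimal sketch to illustrate this (n_rows and has_match are illustrative names):

```r
tweets <- c("this tweet with #onehashtag", "#two hashtags #here", "no hashtags")
matches <- gregexpr("#[a-zA-Z0-9]+", tweets)

# One entry per match, but also one entry (-1) when there is no match
n_rows <- sapply(matches, length)
n_rows
# [1] 1 2 1

# Telling real matches apart from the -1 sentinel
has_match <- sapply(matches, function(m) m[1] != -1)
has_match
# [1]  TRUE  TRUE FALSE
```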

Now use the matches to get the start and end positions:

starts <- unlist(matches) 
ends <- starts + unlist(sapply(matches,function(x) attr(x,"match.length"))) - 1

And use substr to extract:

tmp2$tag <- substr(tmp2$tweets,starts,ends)
tmp2
                         tweets  dummy         tag
1   this tweet with #onehashtag random #onehashtag
2           #two hashtags #here  other        #two
2.1         #two hashtags #here  other       #here
3                   no hashtags column
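With this approach the tweet without hashtags ends up with an empty string as its tag, because substr is called with a start of -1 and an end of -3. If NA is wanted instead, as in the question's desired output, one option is to overwrite the entries where the start position is the -1 sentinel. A self-contained sketch on a plain character vector (index and tag are illustrative names, and lengths() assumes R 3.2 or later):

```r
tweets <- c("this tweet with #onehashtag", "#two hashtags #here", "no hashtags")
matches <- gregexpr("#[a-zA-Z0-9]+", tweets)

# Repeat each tweet once per match (or once for the -1 "no match" entry)
index <- rep(seq_along(matches), times = lengths(matches))

starts <- unlist(matches)
ends <- starts + unlist(lapply(matches, attr, "match.length")) - 1

tag <- substr(tweets[index], starts, ends)

# A start of -1 means "no match": replace the resulting "" with NA
tag[starts == -1] <- NA
tag
# [1] "#onehashtag" "#two"        "#here"       NA
```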

Source

2012-02-20 11:20:49 James
