2017-10-14

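Regex pattern: get the number before a specific word (gsub)

I have just started learning regular expressions and am stuck on a problem. I was given a dataset containing movie award information.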

**Award** 
    Won 2 Oscars. Another 7 wins & 37 nominations. 
    6 wins& 30 nominations 
    5 wins 
    Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations. 

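I want to pull out the numbers that precede "wins" and "nominations" and add a column for each. For example, for the first row, that would be 6 in the wins column and 37 in the nominations column.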

The pattern I am using is

df2$nomination <- gsub(".*win[s]?|[[:punct:]]? | nomination.*", "", df2$Award) 

It does not work as I hoped. I don't know how to write the pattern for "wins". :( Could anyone please help?

Thanks a lot!


Sorry, for the first row the wins column would be 7.

Answers


We can extract the numbers into a list, pad with NAs the cases where only a single element was found, and then rbind the result.

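# Extract every number followed by " win", " wins" or " nominations";
# the result is a list with one character vector per row
lst <- regmatches(df2$Award, gregexpr("\\d+(?= \\b(wins?|nominations)\\b)", 
       df2$Award, perl = TRUE)) 
# Pad each vector with NA up to the longest length, convert to numeric,
# and bind the rows into a two-column matrix
df2[c('new1', 'new2')] <- do.call(rbind, lapply(lapply(lst, `length<-`, 
          max(lengths(lst))), as.numeric)) 
df2 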
#                                                             Award new1 new2
#1                   Won 2 Oscars. Another 7 wins & 37 nominations.    7   37
#2                                           6 wins& 30 nominations    6   30
#3                                                           5 wins    5   NA
#4 Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.    1    3
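
One caveat with filling the two columns by position: a row that mentions only nominations would end up with that count in the first column, and a singular "1 nomination" would not match at all. A minimal base R sketch that keys each column to its word instead. `extract_before` is a hypothetical helper name, and the fifth string is made up for illustration, not taken from the dataset:

```r
# Hypothetical helper (not part of the answer above): returns the number that
# directly precedes `word` (a regex fragment), or NA when there is no match.
extract_before <- function(x, word) {
  pattern <- paste0("\\d+(?=\\s+(?:", word, ")\\b)")
  m <- regexpr(pattern, x, perl = TRUE)   # -1 where nothing matched
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)        # regmatches() returns matches only
  as.numeric(out)
}

awards <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
            "6 wins& 30 nominations",
            "5 wins",
            "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.",
            "2 nominations")  # extra row, assumed for illustration only

wins <- extract_before(awards, "wins?")
nominations <- extract_before(awards, "nominations?")
wins         # 7 6 5 1 NA
nominations  # 37 30 NA 3 2
```

Because each call looks for one word only, a missing "wins" simply yields NA in that column instead of shifting the nominations count to the left.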

We can use str_extract with a regular expression to get the values.

library(stringr) 
text <- c("Won 2 Oscars. Another 7 wins & 37 nominations.", 
      "6 wins& 30 nominations", 
      "5 wins", 
      "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.") 
df <- data.frame(text = text) 

df$value1 <- str_extract(string = df$text, "\\d+\\b(?=\\swin)") 
df$value2 <- str_extract(string = df$text, "\\d+\\b(?=\\snomination)") 

> df 
                                                              text value1 value2
1                   Won 2 Oscars. Another 7 wins & 37 nominations.      7     37
2                                           6 wins& 30 nominations      6     30
3                                                           5 wins      5   <NA>
4 Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.      1      3
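
For something closer to the `gsub` attempt in the question, a capture group with `sub` is another option. A rough sketch in base R, assuming the same four strings. `sub` returns its input unchanged when nothing matches, so the `grepl` guard is there to give NA instead. `pull_number` is a made-up helper name:

```r
awards <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
            "6 wins& 30 nominations",
            "5 wins",
            "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.")

# Hypothetical helper: replace the whole string with the number captured
# just before `word` (a regex fragment); NA where the word is absent.
pull_number <- function(x, word) {
  pattern <- paste0(".*?(\\d+)\\s+", word, "\\b.*")
  hit <- grepl(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[hit] <- sub(pattern, "\\1", x[hit], perl = TRUE)
  as.integer(out)   # integers rather than the character values above
}

wins <- pull_number(awards, "wins?")
nominations <- pull_number(awards, "nominations?")
wins         # 7 6 5 1
nominations  # 37 30 NA 3
```

The lazy `.*?` lets the capture group take the first number that is actually followed by the word, so the "2" in "2 Oscars" is skipped.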