Matching the highest-ranked word in a text column of an R data frame

I have two data frames. DF1:

df1 <- c("A large bunch of purple grapes", "large green potato sack", "small red tomatoes", "yellow and black bananas") 
df1 <- data.frame(df1)

DF2：

Word <- c("green", "purple", "grapes", "small", "sack", "yellow", "bananas", "large")

Rank <- c(20,18,22,16,15,17,6,12) 

df2 <- data.frame(Word,Rank)

DF1：

ID  Sentence 
1  A large bunch of purple grapes 
2  large green potato sack 
3  small red tomatoes 
4  yellow and black bananas

DF2：

ID  Word  Rank 
1  green  20 
2  purple  18 
3  grapes  22 
4  small  16 
5  Sack  15 
6  yellow  17 
7  bananas 6 
8  large  12

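What I want to do is match the words in df2 against the words contained in the "Sentence" column, and insert a new column into df1 containing the highest-ranked matching word from df2. So, like this: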

DF1：

ID  Sentence                        Word
1   A large bunch of purple grapes  grapes
2   large green potato sack         green
3   small red tomatoes              small
4   yellow and black bananas        yellow

I initially used the code below to match the words, but of course this creates a column containing all of the matching words:

x <- sapply(df2$Word, function(x) grepl(tolower(x), tolower(df1$Sentence))) 

df1$top_match <- apply(x, 1, function(i) paste0(names(i)[i], collapse = " "))
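
One possible way to narrow the all-matches column down to a single word is to keep the same logical match matrix and, for each row, pick the matched word with the largest Rank. The following is only a minimal base R sketch: it assumes the text column is named Sentence, the name `hits` is introduced here for illustration, and sentences with no match get NA.

```r
# Data as described in the question, with the text column named Sentence
df1 <- data.frame(
  Sentence = c("A large bunch of purple grapes", "large green potato sack",
               "small red tomatoes", "yellow and black bananas"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Word = c("green", "purple", "grapes", "small", "sack", "yellow", "bananas", "large"),
  Rank = c(20, 18, 22, 16, 15, 17, 6, 12),
  stringsAsFactors = FALSE
)

# Logical matrix: one row per sentence, one column per word in df2
hits <- sapply(df2$Word, function(w) {
  grepl(tolower(w), tolower(df1$Sentence), fixed = TRUE)
})

# For each sentence, keep only the matched word with the highest Rank
df1$top_match <- apply(hits, 1, function(matched) {
  if (!any(matched)) return(NA_character_)  # no word from df2 found
  df2$Word[matched][which.max(df2$Rank[matched])]
})

print(df1)
```

Note that grepl matches substrings, so a word such as "sack" would also match inside a longer word like "rucksack"; whether that is acceptable depends on the data.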

Source

2017-10-11 Jammin

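If a sentence has no match in `df2`, do you want to simply return `NA` instead of any word? In this case all of the sentences have matches, but I just want to make sure you are not looking for something more general. – useR

Yes, returning N/A is fine, thanks! – Jammin

Also, could you provide your data as `dput(df1)` and `dput(df2)`, or the code you used to generate them? – useR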

Here is a tidyverse + stringr solution:

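library(tidyverse)
library(stringr)

df1$Sentence %>%
    # one column per word, padded with "" for shorter sentences
    str_split_fixed(" ", Inf) %>%
    as.data.frame(stringsAsFactors = FALSE) %>%
    cbind(ID = rownames(df1), .) %>%
    # long format: one row per (ID, word)
    gather(word_count, Word, -ID) %>%
    # keep only words that appear in df2, attaching their Rank
    inner_join(df2, by = "Word") %>%
    group_by(ID) %>%
    filter(Rank == max(Rank)) %>%
    select(ID, Word) %>%
    # bring back every sentence; unmatched ones get NA
    right_join(rownames_to_column(df1, "ID"), by = "ID") %>%
    select(ID, Sentence, Word)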

Result:

# A tibble: 4 x 3
# Groups:   ID [4]
     ID                       Sentence   Word
  <chr>                          <chr>  <chr>
1     1 A large bunch of purple grapes grapes
2     2        large green potato sack  green
3     3             small red tomatoes  small
4     4       yellow and black bananas yellow

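Note:

You can ignore the warning about coercing ID from factor to character. I also modified your dataset to include a proper column name for df1 and to suppress the automatic conversion of characters to factors.

Data: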

df1 <- c("A large bunch of purple grapes", "large green potato sack", "small red tomatoes", "yellow and black bananas") 
df1 <- data.frame(Sentence = df1, stringsAsFactors = FALSE) 

Word <- c("green", "purple", "grapes", "small", "sack", "yellow", "bananas", "large") 
Rank <- c(20,18,22,16,15,17,6,12) 
df2 <- data.frame(Word,Rank, stringsAsFactors = FALSE)
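
For comparison, the same split, look up, take-the-maximum idea can be sketched in base R without any packages. This is only an illustration under stated assumptions: `top_ranked_word` is a hypothetical helper name, words are split on whitespace, and the comparison is made case-insensitive (the pipeline above joins on the exact string). An extra sentence with no matching word is included to show the NA case discussed in the comments.

```r
df2 <- data.frame(
  Word = c("green", "purple", "grapes", "small", "sack", "yellow", "bananas", "large"),
  Rank = c(20, 18, 22, 16, 15, 17, 6, 12),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the highest-ranked word from `words`
# that occurs among the whitespace-separated tokens of `sentence`
top_ranked_word <- function(sentence, words, ranks) {
  tokens <- tolower(strsplit(sentence, "\\s+")[[1]])
  idx <- which(tolower(words) %in% tokens)
  if (length(idx) == 0) return(NA_character_)  # nothing matched
  words[idx][which.max(ranks[idx])]
}

sentences <- c("A large bunch of purple grapes", "large green potato sack",
               "small red tomatoes", "yellow and black bananas",
               "red onions")  # last sentence has no word from df2

result <- vapply(sentences, top_ranked_word, character(1),
                 words = df2$Word, ranks = df2$Rank, USE.NAMES = FALSE)
print(data.frame(Sentence = sentences, Word = result))
```

Because this compares whole tokens rather than substrings, punctuation attached to a word (for example "grapes,") would prevent a match unless it is stripped first.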

Source

2017-10-11 14:44:38 useR

Cheers, works perfectly! Thanks a lot! – Jammin

I have written a small snippet (but with different variable names):

> inp1 
    ID       Word new_word 
1 1  large green potato sack green 
2 2 A large bunch of purple grapes grapes 
3 3  yellow and black bananas yellow 
> 
> inp2 
    ID Word Rank 
1 1 green 20 
2 2 purple 18 
3 3 grapes 22 
4 4 small 16 
5 5 Sack 15 
6 6 yellow 17 
7 7 bananas 6 
8 8 large 12 
> 
> library(stringr)  # provides str_match
> inp1$new_word <- lapply(inp1$Word, function(text){ inp2$Word[inp2$Rank == max(inp2$Rank[inp2$Word %in% unique(as.vector(str_match(text,inp2$Word)))])]}) 
> 
> inp1 
    ID       Word new_word 
1 1  large green potato sack green 
2 2 A large bunch of purple grapes grapes 
3 3  yellow and black bananas yellow 
>

Source

2017-10-11 14:12:19 amrrs
