Vectorize data.table like, grepl, or similar for string comparison on big data

I need to check whether the string in one column contains the corresponding (numeric) value from another column in the same row, for all rows.

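If I were only checking the strings against a single pattern, this would be straightforward with data.table's like or with grepl. However, my pattern value is different for each row.

There is a somewhat related question here, but unlike that question, I need to create a logical flag indicating whether the pattern is present.

Suppose this is my dataset: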
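To make the difficulty concrete, here is a minimal base R sketch. The vectors x and pattern are hypothetical toy data, not the dataset from this question. It shows that grepl is vectorized over the strings being searched but not over the pattern, and uses mapply only to illustrate the row-wise result that is wanted.

```r
# Hypothetical toy vectors, not the dataset from the question.
x       <- c("administration", "trucking")
pattern <- c("admin", "truck")

# One pattern against many strings: grepl is vectorized over x.
single <- grepl("admin", x, fixed = TRUE)

# Many patterns against many strings: only pattern[1] is used (R warns about this).
first_only <- suppressWarnings(grepl(pattern, x, fixed = TRUE))

# What is actually wanted: pattern[i] checked against x[i] for every i.
row_wise <- mapply(grepl, pattern, x,
                   MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

print(single)
print(first_only)
print(row_wise)
```

Here first_only comes out the same as single, because "truck" is never tried, while row_wise checks each string against its own pattern.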

library(data.table)

DT <- structure(list(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing", "trucking",
               "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse", "admin",
               "truck", "nurse", "truck")),
  .Names = c("category", "industry"), class = "data.frame",
  row.names = c(NA, -9L))
setDT(DT)
> DT
             category industry
1:     administration    admin
2: nurse practitioner    truck
3:           trucking    truck
4:     administration    admin
5:        warehousing    nurse
6:        warehousing    admin
7:           trucking    truck
8: nurse practitioner    nurse
9: nurse practitioner    truck

My desired result would be a vector like this:

> DT 
    matches 
1: TRUE 
2: FALSE 
3: TRUE 
4: TRUE 
5: FALSE 
6: FALSE 
7: TRUE 
8: TRUE 
9: FALSE

Of course, 1 and 0 would be just as good as TRUE and FALSE.

Here are some things I tried that did not work:

apply(DT,1,grepl, pattern = DT[,2], x = DT[,1]) 
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 

> apply(DT,1,grepl, pattern = DT[,1], x = DT[,2]) 
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 

> grepl(DT[,2], DT[,1]) 
[1] FALSE 

> DT[Vectorize(grepl)(industry, category, fixed = TRUE)] 
      category industry 
1:  administration admin 
2:   trucking truck 
3:  administration admin 
4:   trucking truck 
5: nurse practitioner nurse 

> DT[stringi::stri_detect_fixed(category, industry)] 
      category industry 
1:  administration admin 
2:   trucking truck 
3:  administration admin 
4:   trucking truck 
5: nurse practitioner nurse 

> for(i in 1:nrow(DT)){print(grepl(DT[i,2], DT[i,1]))} 
[1] FALSE 
[1] FALSE 
[1] FALSE 
[1] FALSE 
[1] FALSE 
[1] FALSE 
[1] FALSE 
[1] FALSE 
[1] FALSE 

> for(i in 1:nrow(DT)){print(grepl(DT[i,2], DT[i,1], fixed = T))} 
[1] FALSE 
[1] FALSE 
[1] FALSE 
[1] FALSE 
[1] FALSE 
[1] FALSE 
[1] FALSE 
[1] FALSE 
[1] FALSE 

> DT[category %like% industry] 
     category industry 
1: administration admin 
2: administration admin 
Warning message: 
In grepl(pattern, vector) : 
    argument 'pattern' has length > 1 and only the first element will be used

Source

2016-02-26 Hack-R

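The OP's code does not use a comma (,). Following data.table semantics, the expression is therefore evaluated in i, and the rows for which it is TRUE are returned as a subset.

However, if we add the comma so that the expression is evaluated in j, we get the logical vector as the result:

DT[, stringi::stri_detect_fixed(category, industry)]
#[1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE

If we wrap it in a list, we get a data.table with that column:

DT[, list(match = stringi::stri_detect_fixed(category, industry))]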

Source

2016-02-26 20:09:06 akrun

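@akrun is right about the solution, and Frank is right about the mistake. Many thanks!

@Frank Thanks, I updated the solution. If anything is missing, please feel free to add it. – akrun

Or use: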
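The difference between i and j described in that answer can be sketched on a small table. The four-row table and the column name matches are assumptions made for illustration, and the := step is an extra that the answer itself does not show. The sketch assumes the data.table and stringi packages are installed.

```r
library(data.table)
library(stringi)

# Hypothetical four-row table, smaller than the dataset in the question.
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking", "warehousing"),
  industry = c("admin", "truck", "truck", "nurse")
)

# Expression in i (no comma): the rows where it is TRUE are returned.
rows <- DT[stri_detect_fixed(category, industry)]

# Expression in j (after the comma): the logical vector itself is returned.
flags <- DT[, stri_detect_fixed(category, industry)]

# Assumed extension: store the flag as a new column by reference.
DT[, matches := stri_detect_fixed(category, industry)]

print(rows)
print(flags)
```

stri_detect_fixed takes the strings first and the pattern second, and recycles both arguments element by element, which is why no explicit loop is needed here.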

apply(DT, 1, function(x) grepl(x["industry"], x["category"], fixed = TRUE))

Source

2016-02-26 20:15:11 count

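That works too. It is what I was trying to do in my first example. I wonder why the way I indexed it broke it. I guess that inside apply() the row is implied.

I usually do it like this: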
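One possible reading of the indexing problem raised in the comment above: grepl needs a single character string as its pattern, and DT[, 2] does not give one. The sketch below is an illustration under that assumption, on a hypothetical three-row table. What DT[, 2] actually returns depends on the data.table version, but in no version known to me is it a plain character vector.

```r
library(data.table)

# Hypothetical three-row table for illustration.
DT <- data.table(category = c("administration", "warehousing", "trucking"),
                 industry = c("admin", "nurse", "truck"))

# DT[, 2] is not a plain character vector, so it is not a usable pattern.
by_position <- is.character(DT[, 2])

# [[ ]] (or $) extracts the underlying character vector.
by_name <- is.character(DT[["industry"]])

# Row-by-row check using properly extracted strings.
flags <- vapply(seq_len(nrow(DT)), function(i) {
  grepl(DT[["industry"]][i], DT[["category"]][i], fixed = TRUE)
}, logical(1))

print(flags)
```

Inside apply(), each row arrives as a named character vector, so x[1] and x[2] are already single strings. That is likely why the apply() version with an anonymous function behaves as expected.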

DT[, flag := grepl(industry, category, fixed = TRUE), by = industry]

Source

2016-02-26 20:35:43 eddi
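A sketch of why the grouped version above can work, run on the dataset from the question: within each by group, the grouping column is a single value in j, so grepl receives exactly one pattern per call and stays vectorized over category. The column name flag follows the answer; the variable expected is added here only to check the result against the desired output in the question.

```r
library(data.table)

DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing", "trucking",
               "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse", "admin",
               "truck", "nurse", "truck")
)

# One grepl call per distinct industry value; inside a group, industry has length 1.
DT[, flag := grepl(industry, category, fixed = TRUE), by = industry]

# Desired result as listed in the question.
expected <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)

print(DT)
```

With this approach the number of grepl calls equals the number of distinct patterns rather than the number of rows, which may help when a large table contains relatively few distinct patterns.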
