Changing factor levels - Unknown levels in `f` - cannot change levels

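I have a factor containing many industry names, and I need to collapse them into broader categories. Because I let respondents answer however they wanted, I ended up with many levels for the same thing (e.g. Financial Services, financial services, Banking, Finance). Since the capitalization and wording don't match, each variant comes out as a separate level, so I want to collapse them with forcats: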

test <- fct_collapse(PrescreenF$Industry, Finance = c("Banking", 
    "Corporate Finance", "Finance", "Financial", "financial services", 
    "financial services", "Financial Services", "Financial services"), 
    NULL = "H")

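I get a warning saying that "Financial Services" is unknown. This is very frustrating, because when I print the vector I can see that it is there. I tried copying and pasting the exact words from the printed output and retyping them, and it looks as if hidden characters are preventing the level from being changed.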

How do I collapse these values correctly?

> test$industry
Banking 
Corporate Finance 
Finance Financial 
financial services 
financial services 
Financial Services 
Financial services

When I try to revalue, say, the last level, "Financial services", it tells me it is an unknown string.

Edit: output of dput(x$industry)

structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 
4L, 3L, 3L, 3L, 5L, 7L, 8L, 9L, 10L, 11L, 12L, 12L, 13L, 14L, 
15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 17L, 18L, 18L, 18L, 
18L, 19L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 25L, 26L, 27L, 28L 
), .Label = c("", "{\"ImportId\":\"QID8_TEXT\"}", "Finance", 
"Financial ", "Financial services ", "Please indicate the industry you work in (e.g. technology, healthcare etc):", 
"Cleantech", "Delivery", "e-commerce/fashion", "Food", "Food & Bev", 
"Retail", "Service", "tech", "technology", "Technology", "IT, technology", 
"Software", "Technology ", "Tehcnology", "Consulting", "Digital advertising", 
"Education", "Higher education", "Technology, management consulting", 
"University professor; teaching, research and service", "Information Technology and Services", 
"mobile technology"), class = "factor")

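Edit: figured it out. Some of the terms have an extra space at the end. For example, when I call Prescreen$Industry it returns names such as "Banking" and "Corporate Finance", but it doesn't tell me that there is a space after the level. Banking is actually "Banking ", with an invisible trailing space that doesn't show up when R prints it. How can I make this visible and keep it from happening again?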

Can I run a length function on the column? If so, how does that work? PrescreenF$Industry("Banking")?
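One way to make the padding visible, as a minimal sketch on a made-up factor (the `industry` vector here is an assumption standing in for the real column): `levels()` returns a character vector, which prints with quotes, and `nchar()` is the base R string-length function.

```r
# Toy factor with the same problem: two levels end in an invisible space.
industry <- factor(c("Finance", "Financial ", "Financial services ", "Finance"))

# Printing a factor shows its levels unquoted, so a trailing space is hidden.
# levels() returns a character vector, which prints with quotes.
levels(industry)

# nchar() counts characters per level, including the trailing space.
nchar(levels(industry))

# Flag every level that starts or ends with whitespace.
padded <- grepl("^\\s|\\s$", levels(industry))
levels(industry)[padded]
```

Checking `levels()` right after import, before any recoding, is a cheap habit that catches this kind of mismatch early.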

Source

2017-10-05 D500

Please share a reproducible example of your data so that we can troubleshoot this. –

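If there are hidden characters, they are probably whitespace. `stringr::str_trim` can help, but you have to convert the factor to character first and then back to a factor. – shea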

Can you post the output of `dput(test$industry)` or `dput(head(test, 20))`? –

If `x` is your data frame:

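library(stringr)

# Convert the factor to a plain character vector
x$industry <- as.character(x$industry)
# Remove leading and trailing whitespace
x$industry <- str_trim(x$industry)
# Convert back to a factor; values that differed only by padding now share a level
x$industry <- as.factor(x$industry)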

Then you can go back to fct_collapse() to simplify your factor.
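As an alternative sketch, the levels can also be trimmed in place with base R, without the round trip through character. The toy `industry` factor below is an assumption that mirrors a few of the levels in the dput() output from the question; the `forcats` step is guarded so the snippet still runs where the package is not installed.

```r
# Toy factor mirroring a few of the levels from the question's dput() output.
industry <- factor(c("Finance", "Financial ", "Financial services ",
                     "Technology", "Technology ", "tech"))

# Trim whitespace on the levels themselves. Base R merges levels that
# become identical, so "Technology " joins "Technology".
levels(industry) <- trimws(levels(industry))
levels(industry)

# With clean levels, fct_collapse() can match the names it is given.
if (requireNamespace("forcats", quietly = TRUE)) {
  industry <- forcats::fct_collapse(
    industry,
    Finance    = c("Finance", "Financial", "Financial services"),
    Technology = c("Technology", "tech")
  )
  print(table(industry))
}
```

Trimming the levels rather than the values touches only the handful of distinct strings, and the names passed to `fct_collapse()` must then be the trimmed spellings.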

Source

2017-10-05 21:14:19 shea
