2017-04-10 43 views

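Unable to convert character columns to the factor data type in R

I am trying to convert two columns of character data to factors so that I can analyze their "levels".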

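The problem is at the end of the code. One of the two columns is handled fine: when I run the `levels` command on it, I see a handful of strings.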

> levels(austinCrime2014_data_selected_zips$highestOffenseDesc) 
[1] "AGG ROBBERY BY ASSAULT" "AGG ROBBERY/DEADLY WEAPON" "BURG NON RESIDENCE SHEDS" "BURGLARY NON RESIDENCE" 
[5] "BURGLARY OF RESIDENCE"  "ROBBERY BY ASSAULT"  "ROBBERY BY THREAT" 

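When I run `levels` on the other column, it looks as though something went wrong when the data was converted from the character type to the factor type.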

> levels(austinCrime2014_data_selected_zips$NIBRS_OffenseDesc) 
[1] "Burglary/\nBreaking & Entering" "Robbery" 

I hope someone can help me understand what is happening here and how to fix it.

Here is the code I am working with:

library(data.table) 
library(readr) 
library(dplyr) 

#### 
#### Import 2014 neighborhood economic data 
#### 
# Import data 
austin2014_data_raw <- read_csv('https://data.austintexas.gov/resource/hcnj-rei3.csv', na = '') 
glimpse(austin2014_data_raw) 
nrow(austin2014_data_raw) 

# Clean it: Remove NAs 
austin2014_data <- na.omit(austin2014_data_raw) 
nrow(austin2014_data) # now there's one less row. 

columnSelection <- c("Zip Code", "Population below poverty level", "Median household income", "Unemployment", "Median rent", "Percentage of rental units in poor condition") 

## Our neighborhood economic data subset 
austin2014_data_selection <- subset(austin2014_data, select=columnSelection) 
names(austin2014_data_selection) 

# Extract the zip codes for mapping & comparison with crime data 
zipCodesOfData <- austin2014_data_selection$`Zip Code` 



#### 
#### Import crime data 
#### 

# Import data 
austinCrime2014_data_raw <- read_csv('https://data.austintexas.gov/resource/7g8v-xxja.csv', na = '') 
glimpse(austinCrime2014_data_raw) 
nrow(austinCrime2014_data_raw) 

# Select and rename required columns 
columnSelection_Crime <- c("GO Location Zip", "GO Highest Offense Desc", "Highest NIBRS/UCR Offense Description") 
austinCrime_dataset <- select(austinCrime2014_data_raw, one_of(columnSelection_Crime)) 
names(austinCrime_dataset) <- c("zipcode", "highestOffenseDesc", "NIBRS_OffenseDesc") 
glimpse(austinCrime_dataset) 
nrow(austinCrime_dataset) 

# Filter crime data by zipcodes available in the neighborhood economic data subset 
austinCrime2014_data_selected_zips <- filter(austinCrime_dataset, zipcode %in% zipCodesOfData) 
glimpse(austinCrime2014_data_selected_zips) 
nrow(austinCrime2014_data_selected_zips) 
typeof(austinCrime2014_data_selected_zips) 

#### 
#### Convert our crime data subset from string/char data into factorized data so we can see levels 
#### 

# let's make the character data columns c("highestOffenseDesc", "NIBRS_OffenseDesc") into factors so we can check its levels 
glimpse(austinCrime2014_data_selected_zips) # characters 
cols <- c("highestOffenseDesc", "NIBRS_OffenseDesc") # columns with character datatype to convert to factor datatype 
austinCrime2014_data_selected_zips[cols] <- lapply(austinCrime2014_data_selected_zips[cols], factor) 
glimpse(austinCrime2014_data_selected_zips) # factors 

View(austinCrime2014_data_selected_zips) 
levels(austinCrime2014_data_selected_zips$highestOffenseDesc) #--> looks good 
levels(austinCrime2014_data_selected_zips$NIBRS_OffenseDesc) # output is weird: "Burglary/\nBreaking & Entering" "Robbery" 

The problem is that you need to do more cleaning of the character data and get rid of the \n. – Elin
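A minimal sketch of the cleaning step suggested in this comment, assuming the only problem is an embedded newline. The vector `raw_desc` is made-up toy data standing in for the `NIBRS_OffenseDesc` column, not the actual Austin dataset.

```r
# Toy character data (hypothetical) with an embedded newline, mimicking the column
raw_desc <- c("Burglary/\nBreaking & Entering", "Robbery",
              "Burglary/\nBreaking & Entering")

# Strip the newline while the data is still character, then convert to factor
clean_desc <- gsub("\n", "", raw_desc, fixed = TRUE)
desc_factor <- factor(clean_desc)

levels(desc_factor)
```

Cleaning before calling `factor()` means the levels never contain the escape in the first place.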

Answer


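There is nothing wrong with the conversion. It is simply showing you what is actually there: a "cell" of the data table contains a newline character: \n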
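To see that the `\n` is a real character in the data rather than a conversion artifact, a small check like the following may help. The string `x` is a hand-typed copy of the level shown in the question, used here only for illustration.

```r
# Hand-typed copy of the odd-looking level (illustrative, not read from the dataset)
x <- "Burglary/\nBreaking & Entering"

print(x)       # print() displays the newline in escaped form
writeLines(x)  # writeLines() renders it as an actual line break

# Confirm that the string contains a newline character
has_newline <- grepl("\n", x, fixed = TRUE)
has_newline
```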

If you want to clean it up, you can use `gsub` to replace the escape character. Or you could simply assign a new name to that level.
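Both options can be sketched on an existing factor as follows. The factor `offense` is toy data, and the replacement label is just one possible choice, not something prescribed by the dataset.

```r
# Toy factor (hypothetical) whose first level contains a newline
offense <- factor(c("Burglary/\nBreaking & Entering", "Robbery",
                    "Burglary/\nBreaking & Entering"))

# Option 1: rewrite all level labels with gsub()
cleaned <- offense
levels(cleaned) <- gsub("\n", "", levels(cleaned), fixed = TRUE)

# Option 2: assign a new name to the one offending level
renamed <- offense
bad_level <- "Burglary/\nBreaking & Entering"
levels(renamed)[levels(renamed) == bad_level] <- "Burglary/Breaking & Entering"

levels(cleaned)
levels(renamed)
```

Either way only the labels change; the underlying observations stay assigned to the same level.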

Take a look here: Remove escapes from a string, or, "how can I get \ out of the way?"


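Thank you. As a related follow-up, I am having a problem with a variable in the same code. The variable is `zipCodesOfData`. When I run `View(zipCodesOfData)` I get this strange output: http://imgur.com/tvbK0wz I was hoping you might know what is causing it... It is very strange b/c it is as if there were a "ghost" cell containing the entire list of zip codes as a string. –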


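@PatrickMeaney, what is `class(zipCodesOfData)`? My guess is that it is a factor, and you are expecting numbers. What you are seeing is not a cell in the data but the header of the column. Either that, or `view` really does do strange things with factors. I personally never use `view`, so I am not the best person to ask. –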
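The check suggested in this comment might look like the sketch below. It uses a small made-up data frame (`toy`) rather than the real `austin2014_data_selection`, so the classes shown apply only to this toy data.

```r
# Toy stand-in (hypothetical) for the economic data with a `Zip Code` column
toy <- data.frame(`Zip Code` = c("78701", "78702"),
                  check.names = FALSE, stringsAsFactors = FALSE)

zips <- toy$`Zip Code`
zip_class <- class(zips)  # inspect the type before assuming it is numeric
zip_class

# If the column turned out to be a factor, convert via character to get numbers
zips_factor <- factor(zips)
zips_numeric <- as.integer(as.character(zips_factor))
zips_numeric
```

Going through `as.character()` matters for factors, because `as.integer()` applied directly to a factor returns the internal level codes rather than the labels.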
