2015-05-01

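I want to build a logistic regression model from a data frame, which requires extracting certain values from a character vector.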

#'data.frame': 6532 obs. of 12 variables:
#$ NewsDesk  : chr "Business" "Culture" "Business" "Business" ... 
#$ SectionName : chr "Crosswords/Games" "Arts" "Business Day" "Business Day" ... 
#$ SubsectionName: chr "" "" "Dealbook" "Dealbook" ... 
#$ Headline  : chr "More School Daze" "New 96-Page Murakami Work Coming in December" "Public Pension Funds Stay Mum on Corporate Expats" "Boot Camp for Bankers" ... 
#$ Snippet  : chr "A puzzle from Ethan Cooper that reminds me that a bill is due." "The Strange Library will arrive just three and a half months after Mr. Murakamis latest novel, Colorless Tsukuru Tazaki and His"| __truncated__ "Public pension funds have major stakes in American companies moving overseas to cut their tax bills. But they are saying little"| __truncated__ "As they struggle to find new business to bolster sluggish earnings, banks consider the nations 25 million veterans and service "| __truncated__ ... 
#$ Abstract  : chr "A puzzle from Ethan Cooper that reminds me that a bill is due." "The Strange Library will arrive just three and a half months after Mr. Murakamis latest novel, Colorless Tsukuru Tazaki and His"| __truncated__ "Public pension funds have major stakes in American companies moving overseas to cut their tax bills. But they are saying little"| __truncated__ "As they struggle to find new business to bolster sluggish earnings, banks consider the nations 25 million veterans and service "| __truncated__ ... 
#$ WordCount  : int 508 285 1211 1405 181 245 258 893 1077 188 ... 
#$ PubDate  : POSIXlt, format: "2014-09-01 22:00:09" "2014-09-01 21:14:07" ... 
#$ Popular  : int 1 0 0 1 1 1 0 1 1 0 ... 

NewsDesk has 11 categories.

#          Business  Culture  Foreign Magazine    Metro National     OpEd  Science   Sports
#     1846     1548      676      375       31      198        4      521      194        2
#   Styles   Travel   TStyle
#      297      116      724

However, I only need OpEd, Business, Science, Culture, and TStyle to build the model, chosen for their importance. I don't know how to extract these levels from NewsDesk. Any ideas?


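@Alex A: Maybe I asked my question too vaguely. I have already built a corpus from Headline and Abstract, and extracted the weekday and hour from PubDate. I want to fit a glm model with all the independent variables to predict the popularity of a blog post. But I think that with so many coefficients there will be overfitting or multicollinearity problems. So I want to keep only some levels of NewsDesk and SectionName.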


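OK, I think I understand now. You need to either subset your data frame to drop those observations, or keep all the observations and recode the unwanted values to something else.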
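The two options in the comment above could look roughly like this. This is only a sketch on made-up data: the data frame `news`, the new column `NewsDeskGrouped`, and the label "Other" are hypothetical names chosen for illustration, not part of the original post.

```r
# Toy data standing in for the real data frame (assumption: same column names
# as in the question; the values are invented)
news <- data.frame(
  NewsDesk = c("OpEd", "Business", "Sports", "", "TStyle", "Travel", "Science", "Culture"),
  Popular  = c(1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L),
  stringsAsFactors = FALSE
)

keep <- c("OpEd", "Business", "Science", "Culture", "TStyle")

# Option 1: subset, dropping every row whose NewsDesk is not in `keep`
news_subset <- news[news$NewsDesk %in% keep, ]

# Option 2: keep all rows, but collapse the unwanted values into a single
# "Other" level (hypothetical label)
news$NewsDeskGrouped <- factor(
  ifelse(news$NewsDesk %in% keep, news$NewsDesk, "Other"),
  levels = c(keep, "Other")
)

nrow(news_subset)
table(news$NewsDeskGrouped)
```

Option 2 keeps every observation for the model while still cutting the number of NewsDesk coefficients, which may matter if dropping rows would discard too much data.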

Answer


I would do it as follows.

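set.seed(1237)
# Simulated NewsDesk values standing in for the real column
NewsDesk <- sample(c("OpEd", "Business", "Science", "Culture", "TStyle", "Foreign",
                     "Magazine", "Metro", "Sports", "Styles", "Travel"), 100, replace = TRUE)
df <- data.frame(Popular = sample(0:1, 100, replace = TRUE), NewsDesk = NewsDesk)
# Values to keep
keep <- c("OpEd", "Business", "Science", "Culture", "TStyle")

# Keep only the rows whose NewsDesk is one of the wanted values
head(df[df$NewsDesk %in% keep, ])

# Output as posted (the sampled rows depend on the R version's RNG):
#    Popular NewsDesk
# 1        0  Culture
# 3        0     OpEd
# 4        0 Business
# 5        1  Science
# 8        1   TStyle
# 11       1 Business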
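
As a possible follow-up, the filtered data can go straight into `glm`. The sketch below uses made-up data: `toy`, `toy_keep`, and `fit` are hypothetical names, and the formula with `WordCount` is only an assumed example of a model, not the asker's actual one. If NewsDesk is a factor, the levels that were filtered out still exist as empty levels after subsetting, so `droplevels()` is used to remove them.

```r
set.seed(42)  # arbitrary seed, only to make the toy data reproducible
desks <- c("OpEd", "Business", "Science", "Culture", "TStyle", "Foreign", "Sports")
toy <- data.frame(
  Popular   = sample(0:1, 300, replace = TRUE),
  NewsDesk  = factor(sample(desks, 300, replace = TRUE)),
  WordCount = sample(100:1500, 300, replace = TRUE)
)

keep <- c("OpEd", "Business", "Science", "Culture", "TStyle")

# Subsetting a factor keeps its unused levels; droplevels() removes them
toy_keep <- droplevels(toy[toy$NewsDesk %in% keep, ])
levels(toy_keep$NewsDesk)

# Logistic regression: family = binomial uses the logit link by default
fit <- glm(Popular ~ NewsDesk + WordCount, data = toy_keep, family = binomial)
summary(fit)
```

Because the toy response is random, the coefficients themselves carry no meaning here; the point is only that the model ends up with one coefficient per remaining NewsDesk level (minus the reference level) rather than one per original level.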