2013-04-30
4

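OK, this one has me absolutely confused and worried. As part of my daily work, I have been classifying individual observations of a variable as TRUE/FALSE depending on whether their value is above or below/equal to the median. However, I keep getting behaviour in R that is largely unexpected for such a simple test.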

So take this set of observations:

data=c(0.6666667, 0.8333, 0.6666667, 0.8333, 0.8333, 0.75, 0.9999, 0.7499667, 0.25, 0.6666667, 0.1667, 0.7499667, 0.5, 0.2500333, 0.3333667, 0.0834, 0.0001, 0.2500333, 0.8333, 0.9999, 0.9999, 0.2500333, 0.2500333, 0.3333667, 0.9166, 0.5, 0.2500333, 0.4166667, 0.0001, 0.1667333, 0.6666333, 0.0834, 0.1667, 0.6666333, 0.9166, 0.1667, 0.7499333, 0.9166, 0.9166, 0.9166, 0.7499667, 0.7499667, 0.4166667, 0.5, 0.2500333, 0.9166, 0.6666667, 0.1667333, 0.25, 0.0001, 0.3333667, 0.0001, 0.25, 0.0834, 0.9999, 0.0834, 0.1667, 0.5, 0.2500333, 0.3333667, 0.9166, 0.9166, 0.8333, 0.9166, 0.75, 0.0834, 0.4166667, 0.5, 0.0001, 0.9999, 0.8333, 0.6666667, 0.9166) 

To classify these values, I do:

data_med=median(data) 
quant_data=data 
quant_data[quant_data>data_med]="High" 
quant_data[quant_data<=data_med]="Low" 

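I know there are a million more efficient ways of doing this, but what worries me is that the output makes no sense. Since there are no NaNs in the set and the tests are all-inclusive (> and <=), I should end up with a vector containing only "High" and "Low" values, but instead I get: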

[1] "High" "High" "High" "High" "High" "High" "High" "High" "Low" "High" "Low" "High" "Low" "Low" "Low" "Low" "1e-04" 
[18] "Low" "High" "High" "High" "Low" "Low" "Low" "High" "Low" "Low" "Low" "1e-04" "Low" "High" "Low" "Low" "High" 
[35] "High" "Low" "High" "High" "High" "High" "High" "High" "Low" "Low" "Low" "High" "High" "Low" "Low" "1e-04" "Low" 
[52] "1e-04" "Low" "Low" "High" "Low" "Low" "Low" "Low" "Low" "High" "High" "High" "High" "High" "Low" "Low" "Low" 
[69] "1e-04" "High" "High" "High" "High" 

See the "1e-04"s? Even stranger, let's pick value 69, one of those that returned the odd value:

data[69] 
>1e-04 

If I test this value on its own, I get what I would expect:

data[69]<=data_med 
TRUE 

Can anyone explain this behaviour? It looks downright dangerous...

+2

Remove this line: 'quant_data = data' and use 'data' instead of 'quant_data' within '['. – Arun 2013-04-30 17:54:07

+2

A relatively better way to accomplish this task would be to use 'ifelse', as in: 'quant_data <- ifelse(data > data_med, "High", "Low")' – Arun 2013-04-30 17:56:05

+0

Why the down-vote? – Arun 2013-04-30 17:59:01

Answer

7

Let's look at what you have done here.

data=c(0.6666667, 0.8333, 0.6666667, 0.8333, 0.8333, 0.75, 0.9999, 0.7499667, 0.25, 0.6666667, 0.1667, 0.7499667, 0.5, 0.2500333, 0.3333667, 0.0834, 0.0001, 0.2500333, 0.8333, 0.9999, 0.9999, 0.2500333, 0.2500333, 0.3333667, 0.9166, 0.5, 0.2500333, 0.4166667, 0.0001, 0.1667333, 0.6666333, 0.0834, 0.1667, 0.6666333, 0.9166, 0.1667, 0.7499333, 0.9166, 0.9166, 0.9166, 0.7499667, 0.7499667, 0.4166667, 0.5, 0.2500333, 0.9166, 0.6666667, 0.1667333, 0.25, 0.0001, 0.3333667, 0.0001, 0.25, 0.0834, 0.9999, 0.0834, 0.1667, 0.5, 0.2500333, 0.3333667, 0.9166, 0.9166, 0.8333, 0.9166, 0.75, 0.0834, 0.4166667, 0.5, 0.0001, 0.9999, 0.8333, 0.6666667, 0.9166) 



data_med=median(data) ## 0.5 
quant_data=data  ## irrelevant 
quant_data[quant_data>data_med]="High" 

However, doing this has already converted quant_data into a character vector:

str(quant_data) 
## chr [1:73] "High" "High" "High" "High" "High" "High" "High" ... 
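
The same coercion can be reproduced on a much smaller vector. This is only a minimal sketch; the vector `x` and its values are hypothetical and not taken from the question:

```r
# Hypothetical toy vector, used only to illustrate the coercion
x <- c(0.2, 0.7, 0.0001)
class_before <- class(x)   # "numeric"

# Assigning a single string forces the whole vector to become character,
# because an atomic vector in R can only hold one type
x[x > 0.5] <- "High"
class_after <- class(x)    # "character"

print(x)                   # the untouched numbers are now strings, e.g. "1e-04"
```

Note that 0.0001 is stored as the string "1e-04", which is the odd value that shows up in the question's output.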

Now a comparison between the character values and data_med is nearly meaningless, because data_med gets coerced to a character value too:

"High" < "0.5" ## FALSE 
"1e-4" < "0.5" ## FALSE -- this is your problem. 
quant_data[quant_data<=data_med]="Low" 
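
To see the coercion in isolation, the sketch below compares a few strings against a numeric median. The name `med` is an assumption standing in for data_med, and the exact ordering of strings can in principle depend on the locale's collation rules:

```r
med <- 0.5                       # stand-in for data_med
med_chr <- as.character(med)     # what R compares against: "0.5"

# String comparison goes character by character, not by numeric value
cmp_plain <- "0.25" <= med       # TRUE: "0.25" sorts before "0.5"
cmp_sci   <- "1e-04" <= med      # FALSE: "1" sorts after "0"

# Converting back to numeric restores the intended comparison
cmp_fixed <- as.numeric("1e-04") <= med   # TRUE

print(c(cmp_plain, cmp_sci, cmp_fixed))
```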

What you probably meant to do (and the reason for assigning quant_data=data) is:

quant_data[data>data_med]="High" 
quant_data[data<=data_med]="Low" 
table(quant_data) 
## High Low 
## 35 38 

As @Arun pointed out in the comments above, quant_data <- ifelse(data>data_med,"High","Low") will work too. So would appropriate use of cut().
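
Both alternatives might look roughly like the sketch below. The vector `vals` is a hypothetical stand-in for the question's data, and the `breaks`/`labels` choice for cut() is one possible setup rather than the only one:

```r
# Hypothetical small sample standing in for the question's data
vals <- c(0.0001, 0.25, 0.5, 0.75, 0.9999)
med  <- median(vals)             # 0.5

# ifelse() builds a fresh character vector and never overwrites vals
labels_ifelse <- ifelse(vals > med, "High", "Low")

# cut() bins into (-Inf, med] and (med, Inf]; intervals are closed on the
# right by default, so values equal to the median end up in "Low"
labels_cut <- cut(vals, breaks = c(-Inf, med, Inf), labels = c("Low", "High"))

print(table(labels_ifelse))
print(table(labels_cut))
```

One difference to keep in mind: ifelse() returns a character vector here, while cut() returns a factor.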

+0

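Actually, 'quant_data <= data_med' mostly works. '"0.1" < 1' gives TRUE, but '"1e-4" < 1' gives FALSE. So in the former case the character is successfully converted to numeric, but that does not work for scientific notation. – Roland 2013-04-30 18:02:48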

+1

It is *not* converting to numeric; it is just that lexicographic sorting (see my comment above) happens to give you the right answer. – 2013-04-30 18:05:23

+0

Ah, I see. Thanks. – Roland 2013-04-30 18:07:18
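
A short sketch of why lexicographic ordering only agrees with numeric ordering by coincidence. The values are arbitrary illustrations, not taken from the question:

```r
# When a string is compared with a number, the number is turned into a string
lex_wrong <- "10" < 9                 # TRUE: compares "10" with "9", and "1" sorts before "9"
num_right <- as.numeric("10") < 9     # FALSE: the numeric comparison

# Sorting strings shows the same effect
sorted_chr <- sort(c("10", "9", "2"))               # "10" "2" "9"
sorted_num <- sort(as.numeric(c("10", "9", "2")))   # 2 9 10

print(sorted_chr)
print(sorted_num)
```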