2016-08-05

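Removing near-duplicates from an R data frame

I have a data frame with two columns. The first column is an identification number and the second is a compound name. However, the compounds in the second column are often repeated (different forms of the same compound). I want to remove every duplicate except the simplest form of each compound.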

Here is the data frame:

>NISTSpecR 

    NIST              NAME 
    366620        Formic acid, TMS derivative 
    366765 2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative 
    342340        Acetic acid, TMS derivative 
    352374       Propanoic acid, TMS derivative 
    333858        Butyric Acid, TMS derivative 
    352377       Pentanoic acid, TMS derivative 
    24239       Hexanoic acid, TMS derivative 
    333733       Heptanoic acid, TMS derivative 
    352455        Oxalic acid, 2TMS derivative 
    414056     Succinic acid, monoethyl ester-, (TMS) 
    332809        Adipic acid, TMS derivative 
    30799       Pimelic acid, 2TMS derivative 
    292699       Suberic acid, 2TMS derivative 
    333874        Citric acid, 4TMS derivative 
    366657        Citric acid, 3TMS derivative 
    333513       (-)-Epinephrine, 3TMS derivative 
    16985     Epinephrine, (.beta.)-, 3TMS derivative 
    24795     Norepinephrine, (R)-, 5TMS derivative 
    332935      DL-Norepinephrine, 4TMS derivative 

Here is its structure:

> str(NISTSpecR) 

'data.frame': 154 obs. of 3 variables: 
$ Spec: Factor w/ 239429 levels "1 0; 13 2; 14 27; 15 239; 16 3; 18 2; 26 3; 27 36; 28 32; 29 113; 30 9; 31 64; 32 9; 33 17; 34 17; 35 20; 36 1; 37 1; 41 8; 42 "| __truncated__,..: 23720 32791 3011 32175 12349 29069 193166 26108 28713 73845 ... 
$ NIST: chr "366620" "366765" "342340" "352374" ... 
$ NAME: Factor w/ 239430 levels "-4'-Dimethylamino-2'-(trimethylsilyl)acetanilide",..: 157152 39442 108436 210392 133148 199151 169386 168243 195800 229235 ... 

I would like the final result to look like this:

>NISTSpecR 

    NIST              NAME 
    366620        Formic acid, TMS derivative 
    342340        Acetic acid, TMS derivative 
    352374       Propanoic acid, TMS derivative 
    333858        Butyric Acid, TMS derivative 
    352377       Pentanoic acid, TMS derivative 
    24239       Hexanoic acid, TMS derivative 
    333733       Heptanoic acid, TMS derivative 
    352455        Oxalic acid, 2TMS derivative 
    414056     Succinic acid, monoethyl ester-, (TMS) 
    332809        Adipic acid, TMS derivative 
    30799       Pimelic acid, 2TMS derivative 
    292699       Suberic acid, 2TMS derivative 
    366657        Citric acid, 3TMS derivative 
    333513       (-)-Epinephrine, 3TMS derivative 
    24795     Norepinephrine, (R)-, 5TMS derivative 

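There should be only one entry for each parent compound (i.e. formic acid, ...). It needs to be the simplest version (the one with the fewest characters).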

> dput(as.character(NISTSpecR$NAME)) 

c("Formic acid, TMS derivative", "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative", 
"Acetic acid, TMS derivative", "Propanoic acid, TMS derivative", 
"Butyric Acid, TMS derivative", "Pentanoic acid, TMS derivative", 
"Hexanoic acid, TMS derivative", "Heptanoic acid, TMS derivative", 
"Oxalic acid, 2TMS derivative", "Succinic acid, monoethyl ester-, (TMS)", 
"Adipic acid, TMS derivative", "Pimelic acid, 2TMS derivative", 
"Suberic acid, 2TMS derivative", "Citric acid, 4TMS derivative", 
"Citric acid, 3TMS derivative", "Citric acid 3TMS", "Citric acid, ethyl ester, tri-TMS", 
"Isocitric acid lactone, 2TMS derivative", "Glyoxylic acid, di-TMS", 
"Pyruvic acid, TMS derivative", "Malic acid, 2TMS derivative", 
"Malic acid 1-ethyl ester, 2TMS", "Malic acid, 4-ethyl ester, 2TMS", 
"Malic acid, 3TMS derivative", "4-Hydroxybutanoic acid, 2TMS derivative", 
"Prostaglandin A1, 2TMS derivative", "Prostaglandin A2, 2TMS derivative", 
"Prostaglandin E2, 3TMS", "D-Arabinose, 4TMS derivative", "D-Xylose, 4TMS derivative", 
"D-Lyxose, 4TMS derivative", "D-Ribose, 4TMS derivative", "D-Glucose, 5TMS derivative", 
"D-Galactose, 5TMS derivative", "D-Mannose, 5TMS derivative", 
"D-Allose, oxime (isomer 1), 6TMS derivative", "D-Allose, oxime (isomer 2), 6TMS derivative", 
"D-Altrose, 5TMS derivative", "Dihydroxyacetone, 2TMS derivative", 
"1,3-Dihydroxyacetone dimer, 4TMS derivative", "D-Fructose, 5TMS  derivative", 

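"D-Psicose, 5TMS derivative", "Sedoheptulose, 6TMS derivative", 
"D-2-Deoxyribose, 3TMS derivative", "2-Deoxyribose, 3TMS derivative", 
"L-Fucose, 4TMS derivative", "L-Rhamnose, (R,R,S,S)-, 4TMS derivative", 
"D-Rhamnose, 4TMS derivative", "N-Acetyl-D-glucosamine, 4TMS derivative", 
"D-Gluconic acid, 6TMS derivative", "Glycerol monostearate, 2TMS derivative", 
"Glycerol 2-laurate, 2TMS derivative", "Glycerol, 3TMS derivative", 
"Xylitol, 5TMS derivative", "D-Sorbitol, …", "D-Mannitol, 6TMS derivative", 
"Sucrose, 8TMS derivative", "D-Lactose (isomer 1), 8TMS derivative", 
"β-D-Lactose, (isomer 1), 8TMS derivative", "D-Lactose, (isomer 2), 8TMS derivative", 
"β-D-Lactose, (isomer 2), 8TMS derivative", "α-D-Lactose, 8TMS derivative", 
"α-D-Lactose, 8TMS derivative", "β-Lactose, 8TMS derivative", 
"Maltose, 8TMS derivative, isomer 1", "Maltose, 8TMS derivative, isomer 2", 
"Maltose, 8TMS derivative", "D-Trehalose, 7TMS derivative", "Melibiose, … derivative", 
"L-Ornithine, 3TMS derivative", "DL-Ornithine, 3TMS derivative", 
"DL-Ornithine, 4TMS derivative", "L-Ornithine, 4TMS derivative", 
"L-Homoserine, 2TMS derivative", "L-Citrulline, 3TMS derivative", 
"3-Iodo-L-tyrosine, 3TMS derivative", "3-Aminoisobutyric acid, TMS derivative", 
"3-Aminoisobutyric acid, 3TMS derivative", "3-Aminoisobutyric acid, 2TMS derivative", 
"D-Isoleucine, N-acetyl-, TMS derivative", "L-Hydroxyproline, (E)-, 2TMS derivative", 
"3-Hydroxyproline, (E)-, 3TMS derivative", "Hydroxyproline, 3TMS derivative", 
"3-Hydroxyproline, 3TMS derivative", "L-Cystine, 4TMS derivative", 
"Ethanolamine, 3TMS derivative", "Ethanolamine, 2TMS derivative", 
"3-Aminopropanol, TMS derivative", "Putrescine, 4TMS derivative", 
"…, 2TMS derivative", "Histamine, 3TMS derivative", "Dopamine, 4TMS derivative", 
"Dopamine, 3TMS derivative", "Serotonin, 4TMS derivative", "Tyramine, 3TMS derivative", 
"…, TMS derivative", "Tyramine, 2TMS derivative", "Phenethylamine, 2TMS derivative", 
"17α-Estriol, 3TMS derivative", "Estriol, 3TMS derivative", 
"16α,17α-Estriol, 3TMS derivative", "16β,17β-Estriol, 3TMS derivative", 
"Estrone, TMS derivative", "16-Estrone, TMS derivative", 
"Estrone, O-methyloxime, TMS derivative", "Equilin, TMS derivative", 
"Equilenin, (14.β … (E)-, TMS derivative", "Dehydroepiandrosterone, (E)-, TMS derivative", 
"Equilenin, TMS derivative", "2-Hydroxyestradiol, 3TMS derivative", 
"…, TMS derivative", "5α-Dihydrotestosterone, TMS derivative", 
"Testosterone O-methyloxime, TMS derivative", "Testosterone, TMS derivative", 
"Pregnenolone, TMS derivative", "Aldosterone, 2TMS derivative", 
"Aldosterone, N-methoxy-, tri-TMS", "Corticosterone, bis(O-methyloxime)", 
"Deoxycholic acid, 2TMS derivative", "Deoxy… acid, 3TMS derivative", 
"Lithocholic acid, 2TMS derivative", "Cholesterol, TMS derivative", 
"Desmosterol, TMS derivative", "Ergosterol, TMS derivative", 
"Campesterol, TMS derivative", "Fucosterol, TMS derivative", 
"Stigmastanol, TMS derivative", "Stigmasterol, TMS derivative", 
"11-Deoxycortisol, bis(O-methyloxime)", "Melatonin, 2TMS derivative", 
"Epinephrine, 4TMS derivative", "Epinephrine, 4TMS derivative", 
"Glycine, 3TMS derivative", "Glycine, TMS derivative", "Glycine, 2TMS derivative", 
"Aspartic acid, 3TMS derivative", "L-Aspartic acid, 3TMS derivative", 
"L-Aspartic acid, 2TMS derivative", "L-Glutamic acid, 3TMS derivative", 
"(-)-Epinephrine, 3TMS derivative", "Epinephrine, (.beta.)-, 3TMS derivative", 
"(-)-Epinephrine, 4TMS derivative", "Norepinephrine, (R)-, 5TMS derivative", 
"DL-Norepinephrine, 4TMS derivative", "Norepinephrine, (R)-, 4TMS derivative", 
"Cycloserine, 3TMS derivative", "Cycloheximide, 2TMS derivative", 
"Chloramphenicol, 2TMS derivative", "Chloramphenicol, 3TMS derivative" 
)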

Thanks.

Answers

1

Following your edit, I did the following. First, extract the words that match one of the suffixes:

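library(stringr); library(magrittr)   # str_split(), str_extract() and %>% 
nist <- as.character(NISTSpecR$NAME)  # assumed: nist holds the names shown in the question's dput() 
parents <- str_split(nist, ",") %>% 
    lapply(str_extract, "[A-Za-z][a-z]+(ine|ol|in|ose|ic|one|ide)") 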

Then, because some of these names contain more than one comma, find the position of the non-NA value in each list element and save those positions in the vector indvec:

extract_indices <- parents %>% 
    lapply(function(x) which(!is.na(x))) 
indvec <- do.call("c",extract_indices) 

Then loop over parents and, for each list element, extract the entry in which the parent compound occurs.

answer <- sapply(seq_along(parents), 
     function(i){ 
     # assumes every name has exactly one matching piece, so indvec[i] belongs to name i 
     parents[[i]][indvec[i]] 
     }) 

    answer 

    [1] "Formic"     "Acetic"     "Acetic"     "Propanoic"    "Butyric"    
    [6] "Pentanoic"    "Hexanoic"    "Heptanoic"    "Oxalic"     "Succinic"    
[11] "Adipic"     "Pimelic"    "Suberic"    "Citric"     "Citric"     
[16] "Citric"     "Citric"     "Isocitric"    "Glyoxylic"    "Pyruvic"    
[21] "Malic"     "Malic"     "Malic"     "Malic"     "Hydroxybutanoic"  
[26] "Prostaglandin"   "Prostaglandin"   "Prostaglandin"   "Arabinose"    "Xylose"     
[31] "Lyxose"     "Ribose"     "Glucose"    "Galactose"    "Mannose"    
[36] "Allose"     "Allose"     "Altrose"    "Dihydroxyacetone"  "Dihydroxyacetone"  
[41] "Fructose"    "Psicose"    "Sedoheptulose"   "Deoxyribose"   "Deoxyribose"   
[46] "Fucose"     "Rhamnose"    "Rhamnose"    "glucosamine"   "Gluconic"    
[51] "Glycerol"    "Glycerol"    "Glycerol"    "Xylitol"    "Sorbitol"    

It continues like this...
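The indexing above relies on every name producing exactly one match. As a rough alternative sketch, the leftmost match can be taken straight from the whole name with base R. The sample vector `nist_demo` and the names `suffix_pattern`, `hits` and `parent_demo` are made up for illustration; the suffix list is the one used above.

```r
# Hypothetical sample: a few names copied from the question
nist_demo <- c("Formic acid, TMS derivative",
               "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative",
               "1,3-Dihydroxyacetone dimer, 4TMS derivative",
               "N-Acetyl-D-glucosamine, 4TMS derivative")

# Same suffix list as in the answer, with an explicit letter class
suffix_pattern <- "[A-Za-z][a-z]+(ine|ol|in|ose|ic|one|ide)"

# regexpr() returns the position of the first match, or -1 when there is none
hits <- regexpr(suffix_pattern, nist_demo, perl = TRUE)

# Start from NA so that names without any match stay NA
parent_demo <- rep(NA_character_, length(nist_demo))
parent_demo[hits > 0] <- regmatches(nist_demo, hits)

print(parent_demo)
```

Because the pattern contains only letters, a match can never span a comma, so splitting on commas first is not strictly needed here. Names with no matching suffix come out as NA instead of shifting the positions of the other names.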

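Now, since you only want the shortest entry for each compound, measured by number of characters, first count the characters of each name in the original data. Then, for each extracted parent name, find all the entries that share it and pick the one whose original name is shortest.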

nchar_parent <- nchar(nist)   # length of each original name 
final <- vector(mode = "character", length(nist)) 
for(i in seq_along(nist)){ 
    temp_matches <- which(answer %in% answer[i])   # all names sharing this parent 
    shortest <- temp_matches[which.min(nchar_parent[temp_matches])] 
    final[i] <- nist[shortest] 
} 
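The same per-parent lookup can probably be written without an explicit loop. The sketch below uses `tapply()` on small stand-in vectors; `nist_demo` and `answer_demo` are hypothetical and play the roles of `nist` and `answer`.

```r
# Hypothetical stand-ins for `nist` (full names) and `answer` (parent names)
nist_demo <- c("Citric acid, 4TMS derivative",
               "Citric acid, 3TMS derivative",
               "Citric acid 3TMS",
               "Formic acid, TMS derivative")
answer_demo <- c("Citric", "Citric", "Citric", "Formic")

# For each parent, pick the name with the fewest characters
shortest_by_parent <- tapply(nist_demo, answer_demo,
                             function(x) x[which.min(nchar(x))])

# Map the per-parent result back onto every row, as the loop does
final_demo <- as.vector(shortest_by_parent[answer_demo])

print(final_demo)
```

One difference to keep in mind: `tapply()` drops NA parents, so rows whose parent is NA come back as NA here, whereas the loop groups all NA parents together.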

Your final answer looks like this:

[1] "Formic acid, TMS derivative"     "Acetic acid, TMS derivative"     
    [3] "Acetic acid, TMS derivative"     "Propanoic acid, TMS derivative"    
    [5] "Butyric Acid, TMS derivative"     "Pentanoic acid, TMS derivative"    
    [7] "Hexanoic acid, TMS derivative"    "Heptanoic acid, TMS derivative"    
    [9] "Oxalic acid, 2TMS derivative"     "Succinic acid, monoethyl ester-, (TMS)"  
[11] "Adipic acid, TMS derivative"     "Pimelic acid, 2TMS derivative"    
[13] "Suberic acid, 2TMS derivative"    "Citric acid 3TMS"        
[15] "Citric acid 3TMS"        "Citric acid 3TMS"        
[17] "Citric acid 3TMS"        "Isocitric acid lactone, 2TMS derivative"  
[19] "Glyoxylic acid, di-TMS"      "Pyruvic acid, TMS derivative"     
[21] "Malic acid, 2TMS derivative"     "Malic acid, 2TMS derivative"     
[23] "Malic acid, 2TMS derivative"     "Malic acid, 2TMS derivative"  
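The vector above still has one entry per original row. To get the shape asked for in the question (one row per parent compound), the rows themselves have to be dropped. A minimal sketch, assuming a parent name is already known for each row: the data frame `demo` and the vector `parent` are hypothetical stand-ins for `NISTSpecR` and `answer`.

```r
# Hypothetical miniature of the question's data frame
demo <- data.frame(
  NIST = c("366620", "366765", "342340", "333513", "16985"),
  NAME = c("Formic acid, TMS derivative",
           "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative",
           "Acetic acid, TMS derivative",
           "(-)-Epinephrine, 3TMS derivative",
           "Epinephrine, (.beta.)-, 3TMS derivative"),
  stringsAsFactors = FALSE
)

# Stand-in for the extracted parent names (`answer` in the text above)
parent <- c("Formic", "Acetic", "Acetic", "Epinephrine", "Epinephrine")

# Visit rows from shortest to longest name and keep the first row seen
# for each parent; sort() then restores the original row order
ord  <- order(nchar(demo$NAME))
keep <- sort(ord[!duplicated(parent[ord])])
result <- demo[keep, ]

print(result)
```

When two names of the same parent have the same length, this keeps whichever comes first in the data, so a tie such as the two citric acid derivatives would need an extra rule.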
+0

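This would work if they were all simple acids. The df has been updated with some other values. Also, I need to keep only the simplest version, not just any version –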

+0

Any other suggestions? –

+0

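What other forms of these compounds are needed? Do you have a list? Matching against a known list would be easy; otherwise it will take more ad hoc string splitting. – shayaa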

0

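If you only need the first part of the second column (the text before the first comma), you can use a split function to separate the second column into its comma-delimited parts (columns) and keep the first one. After that, you can remove the duplicate rows of df based on this computed column. The last instruction (optional) removes the helper column that holds the first part.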

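df$foo <- vapply(strsplit(as.character(df$NAME), ",", fixed = TRUE), function(p) p[1], character(1))  # first comma-separated piece 
df <- df[!duplicated(df$foo), ]  # keep the first row for each value of foo 
df$foo <- NULL                   # optional: drop the helper column by name 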
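To see which rows this approach treats as duplicates, it may help to look at the grouping key on its own. The sketch below is illustrative only: `names_demo`, `key` and `groups` are made-up names, and `sub()` is used here as a shorthand for "text before the first comma".

```r
# Hypothetical sample taken from the names in the question
names_demo <- c("Citric acid, 4TMS derivative",
                "Citric acid 3TMS",
                "Malic acid, 4-ethyl ester, 2TMS",
                "Malic acid, 2TMS derivative",
                "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative",
                "Acetic acid, TMS derivative")

# Grouping key: everything before the first comma (whole name if no comma)
key <- sub(",.*", "", names_demo)

# Names that share a key are the ones duplicated() would collapse
groups <- split(names_demo, key)

print(key)
print(lengths(groups))
```

In this sample only the two malic acid entries share a key. "Citric acid 3TMS" has no comma and the butoxy-substituted acetic acid carries a prefix, so both end up in groups of their own. Also, `!duplicated()` keeps the first row of a group rather than the shortest one, so sorting by `nchar(NAME)` beforehand would be one way to respect the "fewest characters" rule.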
+1

Of course, I'm sorry. I'm new here –

+0

Doesn't this remove all of the duplicates without keeping one of them? –

+0

If you have: pippo, pippo, pluto, it will keep only pippo (the first one) and pluto –