2015-11-27 35 views

Extracting numbers/characters from the same string in different formats

I am new to R, and I have a dataset similar to the one below:

Artist                    Medium.Size 
    1  HIROSHI SUGIMOTO (B. 1948)      gelatin silver print mounted on paper \n 20 x 24 in. (50.8 x 61 cm.) 
    2  HIROSHI SUGIMOTO (B. 1948)      gelatin silver print mounted on paper \n 20 x 24 in. (50.8 x 61 cm.) 
    3  HIROSHI SUGIMOTO (B. 1948)         gelatin silver print \n 20 x 24 inches (50.7 x 63.2 cm.) 
    4  HIROSHI SUGIMOTO (B. 1948)         gelatin silver print \n 20 x 24 inches (50.7 x 63.2 cm.) 
    5  HIROSHI SUGIMOTO (B. 1948)     gelatin silver print mounted on paper \n 20 x 24 in. (50.8 x 60.9 cm.) 
    6  HIROSHI SUGIMOTO (B. 1948)      gelatin silver print mounted on paper \n 20 x 24 in. (50.8 x 61 cm.) 
    7  Richard Phillips (b. 1963)          graphite on paper \n 12 x 8? in. (30.4 x 21.5 cm.) 
    8  Marlene Dumas (b. 1953)      ink, acrylic and graphite on paper \n 26 x 19? in. (66 x 50.1 cm.) 
    9  Lisa Yuskavage (b. 1962)       oil and graphite on panel \n 7 5/8 x 9? in. (19.3 x 24.7 cm.) 
    10  Lisa Yuskavage (b. 1962)     watercolor and graphite on paper \n 7 5/8 x 10? in. (19.3 x 26.6 cm.) 
    11  Barnaby Furnas (b. 1973)      urethane and wax medium on canvas \n 40 x 30 in. (101.6 x 76.2 cm.) 

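I want to extract two pieces of information from the second column: the medium description that comes before the first "\n", and the expression in parentheses.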

I have tried using

split = strsplit(impression$Medium.Size, ", | \n | \\(") 

but it returns list elements of unequal length:

[[3517]] 
[1] "oil on canvas\n 25 ? x 32 in." "65.4 x 81.3 cm.)"    

[[3518]] 
[1] "bronze with green and brown patina\n Height: 15 in." "38 cm.); Length: 25 5/8"        
[3] "65 cm.); Width: 27 5/8 in."       "70 cm.)" 

What I hope to get is a table like this:

medium    size 
graphite on paper  50.8*61cm 

Answer


You can use the splitstackshape package for that as follows:

library(splitstackshape) 
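cSplit(impression, "Medium.Size", sep = "\n", direction = "wide", fixed = TRUE) 

This will give you a data.table in which the Medium.Size column is split into two columns, one for the text before the "\n" and one for the text after it.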
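
If you also want the metric size pulled out of the parentheses, a base R approach might look like the sketch below. It assumes the column is named Medium.Size as in the question, and that the size you want is always the parenthesized part containing "cm". The three-row `impression` data frame is a made-up sample built from rows shown in the question, and `medium`, `size`, and `result` are illustrative names only.

```r
# Hypothetical sample mirroring the question's Medium.Size column
impression <- data.frame(
  Medium.Size = c(
    "gelatin silver print mounted on paper \n 20 x 24 in. (50.8 x 61 cm.)",
    "graphite on paper \n 12 x 8 in. (30.4 x 21.5 cm.)",
    "bronze with green and brown patina\n Height: 15 in. (38 cm.); Length: 25 5/8 in. (65 cm.)"
  ),
  stringsAsFactors = FALSE
)

x <- impression$Medium.Size

# Medium: the piece before the first newline, with surrounding whitespace trimmed
medium <- trimws(vapply(strsplit(x, "\n", fixed = TRUE), `[`, character(1), 1))

# Size: every parenthesized group that mentions "cm"; rows with several
# dimensions (Height, Length, ...) are collapsed into one string
cm_groups <- regmatches(x, gregexpr("\\([^()]*cm[^()]*\\)", x))
size <- vapply(cm_groups, function(p) {
  if (length(p) == 0) return(NA_character_)   # no metric size found
  paste(gsub("[()]", "", p), collapse = "; ")  # drop the parentheses
}, character(1))

result <- data.frame(medium = medium, size = size, stringsAsFactors = FALSE)
print(result)
```

Because every row yields exactly one `medium` and one `size` value (or `NA`), the result always has the same number of rows as the input, which avoids the unequal-length lists that `strsplit` produced in the question.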