2017-11-25
0

Convert text to a data frame by a specific pattern in R

I have text like this:

text <- "Jeon Bo-ram (born March 22, 1986):She, better known mononymously as Boram, is a South Korean singer and actress. She is best known as a member of the South Korean girl group T-ara. Taylor Alison Swift (born December 13, 1989): She is an American singer-songwriter. One of the leading contemporary recording artists, she is known for narrative songs about her personal life, which have received widespread media coverage. Nickolas Gene Carter (born January 28, 1980): He is an American musician and actor. He is best known as a member of the pop group the Backstreet Boys. As of 2015, Carter has released three solo albums, Now or Never, I'm Taking Off and All American during breaks between Backstreet Boys schedules, and a collaboration with Jordan Knight titled Nick & Knight. He has made occasional television appearances and starred in his own reality shows, House of Carters and I (Heart) Nick Carter. He gained fame in the mid 1990s and early 2000s as a teen idol. He is also the older brother of singer Aaron Carter and the late Leslie Carter. Clyde Jackson Browne (born October 9, 1948): He is an American singer-songwriter, and musician who has sold over 18 million albums in the United States. Coming to prominence in the 1970s, Browne has written and recorded songs such as These Days, The Pretender, Running on Empty, Lawyers in Love, Doctor My Eyes, Take It Easy, and For a Rocker. In 2004, he was inducted into the Rock and Roll Hall of Fame in Cleveland, Ohio, and given an honorary doctorate of music by Occidental College in Los Angeles, California." 

Any suggestions for getting a result like this:

name      born       detail 
Jeon Bo-ram    born March 22, 1986   She, better known mononymously as Boram, is a South Korean singer and actress. She is best known as a member of the South Korean girl group T-ara. 
Taylor Alison Swift  born December 13, 1989  She is an American singer-songwriter. One of the leading contemporary recording artists, she is known for narrative songs about her personal life, which have received widespread media coverage. 
Nickolas Gene Carter  born January 28, 1980  He is an American musician and actor. He is best known as a member of the pop group the Backstreet Boys. As of 2015, Carter has released three solo albums, Now or Never, I'm Taking Off and All American during breaks between Backstreet Boys schedules, and a collaboration with Jordan Knight titled Nick & Knight. He has made occasional television appearances and starred in his own reality shows, House of Carters and I (Heart) Nick Carter. He gained fame in the mid 1990s and early 2000s as a teen idol. He is also the older brother of singer Aaron Carter and the late Leslie Carter. 
Clyde Jackson Browne  born October 9, 1948   He is an American singer-songwriter, and musician who has sold over 18 million albums in the United States. Coming to prominence in the 1970s, Browne has written and recorded songs such as These Days, The Pretender, Running on Empty, Lawyers in Love, Doctor My Eyes, Take It Easy, and For a Rocker. In 2004, he was inducted into the Rock and Roll Hall of Fame in Cleveland, Ohio, and given an honorary doctorate of music by Occidental College in Los Angeles, California. 

I tried this, but it did not solve the problem:

cbind(do.call(rbind, strsplit(text, ":")), sub(".*[ ]", "", text)) 
+2

It's helpful to add some steps you have taken that didn't work, since this generally isn't a code-writing service. – hrbrmstr

+0

A regex? '(.*?)\((born.*?)\):(.*)' – rbm

Answers

2

Regex is the way to go here, provided you either make sure the patterns won't change or make the regexes more adaptable if they do.

library(stringi) 
library(purrr) 
library(dplyr)  # for data_frame() and glimpse(), which are used below 

text <- "Jeon Bo-ram (born March 22, 1986):She, better known mononymously as Boram, is a South Korean singer and actress. She is best known as a member of the South Korean girl group T-ara. Taylor Alison Swift (born December 13, 1989): She is an American singer-songwriter. One of the leading contemporary recording artists, she is known for narrative songs about her personal life, which have received widespread media coverage. Nickolas Gene Carter (born January 28, 1980): He is an American musician and actor. He is best known as a member of the pop group the Backstreet Boys. As of 2015, Carter has released three solo albums, Now or Never, I'm Taking Off and All American during breaks between Backstreet Boys schedules, and a collaboration with Jordan Knight titled Nick & Knight. He has made occasional television appearances and starred in his own reality shows, House of Carters and I (Heart) Nick Carter. He gained fame in the mid 1990s and early 2000s as a teen idol. He is also the older brother of singer Aaron Carter and the late Leslie Carter. Clyde Jackson Browne (born October 9, 1948): He is an American singer-songwriter, and musician who has sold over 18 million albums in the United States. Coming to prominence in the 1970s, Browne has written and recorded songs such as These Days, The Pretender, Running on Empty, Lawyers in Love, Doctor My Eyes, Take It Easy, and For a Rocker. In 2004, he was inducted into the Rock and Roll Hall of Fame in Cleveland, Ohio, and given an honorary doctorate of music by Occidental College in Los Angeles, California." 

Let's massage the blob into something more manageable:

stri_replace_all_regex(
    text, 
    "([[:alpha:][:space:]\\-]+ \\(born [[:alpha:]]+ [[:digit:]]+, [[:digit:]]+\\):)", 
    "\n$1\n" 
) %>% 
    stri_split_lines() %>% 
    flatten_chr() %>% 
    discard(`==`, "") %>% 
    stri_trim_both() -> lines 

lines now looks like this:

lines 
## [1] "Jeon Bo-ram (born March 22, 1986):"  
## [2] "She, better known mononymously as Boram, is a South Korean singer and actress. She is best known as a member of the South Korean girl group T-ara." 
## [3] "Taylor Alison Swift (born December 13, 1989):" 
## [4] "She is an American singer-songwriter. One of the leading contemporary recording artists, she is known for narrative songs about her personal life, which have received widespread media coverage." 
## [5] "Nickolas Gene Carter (born January 28, 1980):" 
## [6] "He is an American musician and actor. He is best known as a member of the pop group the Backstreet Boys. As of 2015, Carter has released three solo albums, Now or Never, I'm Taking Off and All American during breaks between Backstreet Boys schedules, and a collaboration with Jordan Knight titled Nick & Knight. He has made occasional television appearances and starred in his own reality shows, House of Carters and I (Heart) Nick Carter. He gained fame in the mid 1990s and early 2000s as a teen idol. He is also the older brother of singer Aaron Carter and the late Leslie Carter." 
## [7] "Clyde Jackson Browne (born October 9, 1948):" 
## [8] "He is an American singer-songwriter, and musician who has sold over 18 million albums in the United States. Coming to prominence in the 1970s, Browne has written and recorded songs such as These Days, The Pretender, Running on Empty, Lawyers in Love, Doctor My Eyes, Take It Easy, and For a Rocker. In 2004, he was inducted into the Rock and Roll Hall of Fame in Cleveland, Ohio, and given an honorary doctorate of music by Occidental College in Los Angeles, California."  

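The regex looks for the name and date pattern and puts a line break on either side of it. This "over-splits", but discard() takes care of the blank lines.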
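The over-splitting can be seen on a small scale with the sketch below. It is a base-R rendition (gsub() and strsplit() swapped in for the stringi calls), and sample_text, raw_pieces and clean_lines are hypothetical names with a shortened two-record input, not part of the answer's code.

```r
# Shortened, hypothetical two-record input in the same format as the question
sample_text <- paste0(
  "Jeon Bo-ram (born March 22, 1986):She is best known as a member of ",
  "the South Korean girl group T-ara. Taylor Alison Swift ",
  "(born December 13, 1989): She is an American singer-songwriter."
)

# Same pattern as in the answer: a name, then "(born Month D, YYYY):"
pattern <- "([[:alpha:][:space:]\\-]+ \\(born [[:alpha:]]+ [[:digit:]]+, [[:digit:]]+\\):)"

# Wrap every match in line breaks, then split on those breaks
marked <- gsub(pattern, "\n\\1\n", sample_text, perl = TRUE)
raw_pieces <- strsplit(marked, "\n", fixed = TRUE)[[1]]

# The text starts with a match, so the first piece is an empty string;
# drop empty pieces and trim stray spaces from the rest
clean_lines <- trimws(raw_pieces[raw_pieces != ""])
print(clean_lines)
```

The empty first element of raw_pieces is the "over-split" piece; filtering it out plays the role of discard() in the pipeline.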

Now we have pairs of lines (name/DOB, then description) and can iterate over them using "by 2" indices:

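starts <- seq(1, length(lines), 2)  # name/DOB lines (odd positions) 
ends <- starts + 1                  # matching description lines (even positions) 

map2_df(starts, ends, ~{ 

    # split "Name (born Month D, YYYY):" at "(" and strip the leftovers 
    stri_split_fixed(lines[.x], "(")[[1]] %>% 
        stri_replace_all_fixed("):", "") %>% 
        stri_replace_all_fixed("born ", "") -> name_dob 

    # one data frame row per person 
    data_frame(
        name = name_dob[1], 
        born = name_dob[2], 
        detail = lines[.y] 
    ) 

}) -> xdf 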

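The first part of the inner block separates and cleans the name/DOB. The latter part makes a data frame row, and purrr's map2_df() turns the whole thing into a data frame: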

glimpse(xdf) 
## Observations: 4 
## Variables: 3 
## $ name <chr> "Jeon Bo-ram ", "Taylor Alison Swift ", "Nickolas Gene Carter ",... 
## $ born <chr> "March 22, 1986", "December 13, 1989", "January 28, 1980", "Octo... 
## $ detail <chr> "She, better known mononymously as Boram, is a South Korean sing... 

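If you are processing a huge blob of text that could result in thousands of rows, using list() instead of data_frame() (the last part of the inner block) will be faster and consume less temporary memory.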
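A minimal sketch of that idea, written in base R so it runs without extra packages: each iteration returns a plain list, and the data frame is assembled once at the end. The names demo_lines, rows and demo_df are hypothetical, and the input is a shortened stand-in for lines. Unlike the block above, this version splits on " (born ", so the names carry no trailing space.

```r
# Hypothetical, shortened stand-in for the `lines` vector built earlier
demo_lines <- c(
  "Jeon Bo-ram (born March 22, 1986):",
  "She is best known as a member of the South Korean girl group T-ara.",
  "Taylor Alison Swift (born December 13, 1989):",
  "She is an American singer-songwriter."
)

starts <- seq(1, length(demo_lines), 2)  # name/DOB lines
ends <- starts + 1                       # description lines

# Each iteration returns a cheap plain list instead of a one-row data frame
rows <- Map(function(s, e) {
  header <- sub("\\):$", "", demo_lines[s])                   # drop trailing "):"
  name_dob <- strsplit(header, " (born ", fixed = TRUE)[[1]]  # name, date
  list(name = name_dob[1], born = name_dob[2], detail = demo_lines[e])
}, starts, ends)

# Build the data frame once, column by column
demo_df <- data.frame(
  name   = vapply(rows, function(r) r$name, character(1)),
  born   = vapply(rows, function(r) r$born, character(1)),
  detail = vapply(rows, function(r) r$detail, character(1)),
  stringsAsFactors = FALSE
)
print(demo_df[, c("name", "born")])
```

With purrr, the equivalent change is simply returning list(...) instead of data_frame(...) from the inner block and leaving map2_df() to do the binding.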

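Now, it is possible to craft a single regex that finds each triplet and extracts its parts, but that tends to produce a nearly unreadable beast, so I'll let others show off their regex skills if they're so inclined.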
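For illustration only, one possible shape of such a single regex is sketched below in base R, using the capture positions that gregexpr() reports when perl = TRUE. The pattern and the names sample_text, triplet_re, parts and triplets are assumptions, not code from this thread. It also assumes that names contain no periods or parentheses and that every record has the form "Name (born ...): detail".

```r
# Shortened, hypothetical two-record input in the same format as the question
sample_text <- paste0(
  "Jeon Bo-ram (born March 22, 1986):She, better known mononymously as Boram, ",
  "is a South Korean singer and actress. She is best known as a member of ",
  "the South Korean girl group T-ara. Taylor Alison Swift ",
  "(born December 13, 1989): She is an American singer-songwriter."
)

# Group 1: name, group 2: "born ...", group 3: detail.
# The lookahead ends the detail just before the next "Name (born " or at the end.
triplet_re <- "\\s*([^.()]+?) \\((born [^)]+)\\):\\s*(.*?)(?=\\s+[^.()]+ \\(born |$)"

m <- gregexpr(triplet_re, sample_text, perl = TRUE)[[1]]
cap_start <- attr(m, "capture.start")   # matrix: one row per match, one column per group
cap_len   <- attr(m, "capture.length")

# substring() returns a plain vector in column-major order; restore the matrix shape
parts <- matrix(
  substring(sample_text, cap_start, cap_start + cap_len - 1),
  ncol = 3
)

triplets <- data.frame(
  name = parts[, 1], born = parts[, 2], detail = parts[, 3],
  stringsAsFactors = FALSE
)
print(triplets[, c("name", "born")])
```

The lookahead is what makes the pattern hard to read: it has to describe where the next record starts without consuming it.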

+0

Thank you very much for your reply. –

1

Using a regular expression:

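# Split on "(" before "born", on "):", and on the space that precedes each new name 
A <- unlist(strsplit(text, "[(](?=b)|[)]:|(?<=\\.) (?=[^.]+?\\(b)", perl=TRUE)) 
B <- trimws(A)  # trimws() is vectorised, so no sapply() is needed 
# Every three consecutive pieces form one row: name, born, detail 
setNames(as.data.frame(matrix(B, ncol=3, byrow=TRUE)), c("name", "born", "detail")) 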