How can I correctly insert spaces between words in R after the spaces have been removed?

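I have some survey data in which the item names are the survey text with the spaces removed. I would like to add the spaces back. Obviously this requires some knowledge of English.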

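Is there an R function that can correctly re-insert spaces into a sentence after they have been removed?
Alternatively, are there text-processing functions that would help with this (for example, by determining whether a sequence of letters is a word or a non-word)?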

Here is some sample data, but any function should work on any reasonable sentence:

x <- c("Shewrotehimalongletter,buthedidn'treadit.", 
     "Theshootersaysgoodbyetohislove.", 
     "WritingalistofrandomsentencesisharderthanIinitiallythoughtitwouldbe.", 
     "Letmehelpyouwithyourbaggage.", 
     "Pleasewaitoutsideofthehouse.", 
     "Iwantmoredetailedinformation.", 
     "Theskyisclear;thestarsaretwinkling.", 
     "Sometimes,allyouneedtodoiscompletelymakeanassofyourselfandlaughitofftorealisethatlifeisn’tsobadafterall.")

Source: http://www.randomwordgenerator.com/sentence.php

Posted 2016-08-09 by Jeromy Anglim

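Your question reminds me of [a recent New Yorker article](http://www.newyorker.com/magazine/2016/03/21/han-dynasty-tables-for-two), which mentions that the website of the restaurant Han Dynasty ran into trouble with porn filters because its URL, www.handynasty.net, can be read in two different ways. – eipi10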

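Have you tried any NLP word tokenizers? They probably won't work directly, but if you want to write your own function, they could tell you whether a guess is a word. – alistaire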

@eipi - the classic Pen Island problem. – thelatemail

Here is an answer, but it is more of a "there may not be a unique answer" answer.

The ScrabbleScore package includes the 2006 Tournament Word List, so I'll use that as my approximation of "English words" to search against.

library(ScrabbleScore)  
data("twl06")

We can check whether a word is "English" by looking it up in that list.

findword <- function(string) { 
    if (string %in% twl06) return(string) else return(1) 
}
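As a side note, the same lookup can be written so that it returns TRUE/FALSE and accepts several candidates at once. This is only a sketch: `is_word` and `toy_words` are hypothetical names, and the tiny word list is a stand-in so the snippet runs without the ScrabbleScore package.

```r
# Stand-in word list (an assumption for illustration; use twl06 for real work).
toy_words <- c("us", "an", "album", "party")

# Hypothetical helper: TRUE for each candidate found in the dictionary.
is_word <- function(strings, dictionary = toy_words) {
    strings %in% dictionary
}

is_word(c("album", "susan", "party"))
```

Returning a logical value avoids comparing a string against the sentinel `1`, which only works because R coerces both sides to character.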

Let's use a nicely ambiguous piece of text, shall we? This one caused a stir because it was used as the hashtag for Susan Boyle's album party.

x <- c("susanalbumparty")

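We can check the string for "English" words and progressively shorten it as we find them. This can be done from the start or from the end, so I'll do both to demonstrate that the answer is not unique.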

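sentence_splitter <- function(x) {

    # Pass 1: work backwards from the end of the string.
    z <- y <- x
    words1 <- list()
    while (nchar(z) > 1) {
        # Drop leading characters until y is a listed word or one character is left.
        while (findword(y) == 1 && nchar(y) > 1) {
            y <- substr(y, 2, nchar(y))
        }
        if (findword(y) != 1) words1 <- append(words1, y)
        # Cut the examined suffix off z and start again on the remainder.
        y <- z <- substr(z, 1, nchar(z) - nchar(y))
    }

    # Pass 2: work forwards from the start of the string.
    z <- y <- x
    words2 <- list()
    while (nchar(z) > 1) {
        # Drop trailing characters until y is a listed word or one character is left.
        while (findword(y) == 1 && nchar(y) > 1) {
            y <- substr(y, 1, nchar(y) - 1)
        }
        if (findword(y) != 1) words2 <- append(words2, y)
        # Cut the examined prefix off z and start again on the remainder.
        y <- z <- substr(z, 1 + nchar(y), nchar(z))
    }

    # words1 was collected back to front, so reverse it before pasting.
    return(list(paste(unlist(rev(words1)), collapse = " "),
                paste(unlist(words2), collapse = " ")))

}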

Result:

sentence_splitter("susanalbumparty") 
#> [[1]] 
#> [1] "us an album party" 
#> 
#> [[2]] 
#> [1] "us anal bump arty"

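Note: this finds the longest matching substring in each direction (because I shorten the string). You could also find the shortest one by extending the string instead. To do this properly, you would need to look at all "English" substrings that leave only valid words behind.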
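The exhaustive search described in the note could be sketched as a small recursive function. This is an illustration under stated assumptions, not a tuned implementation: `all_splits` and `toy_words` are hypothetical names, the dictionary is a toy stand-in for `twl06`, and without memoisation the running time grows quickly with the length of the string.

```r
# Toy dictionary (an assumption for illustration; swap in twl06 for real use).
toy_words <- c("us", "an", "anal", "album", "bump", "party", "arty")

# Return every way of splitting `s` so that all pieces are dictionary words.
# Each element of the result is one space-separated candidate sentence.
all_splits <- function(s, dictionary) {
    if (nchar(s) == 0) return("")          # nothing left: one (empty) split
    results <- character(0)
    for (i in seq_len(nchar(s))) {
        prefix <- substr(s, 1, i)
        if (prefix %in% dictionary) {
            tails <- all_splits(substr(s, i + 1, nchar(s)), dictionary)
            # Only keep this prefix if the remainder can be fully split too.
            if (length(tails) > 0) {
                results <- c(results, trimws(paste(prefix, tails)))
            }
        }
    }
    results
}

all_splits("usanalbumparty", toy_words)
all_splits("susanalbumparty", toy_words)   # no full split: "s", "su", ... are not listed
```

Unlike the greedy passes, this returns every segmentation in which each piece is a listed word, so a string that cannot be fully covered yields an empty result instead of a partial sentence.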

Finally, you'll notice that "susan" is not matched, because under this definition it is not a "valid English word".

Hopefully that is enough to convince you that this is not going to be simple.

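Update: I tried this on some of your examples (it actually doesn't do too badly once you apply tolower and remove punctuation)... the last one is a tough case, but the rest seem to do okay.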

unlist(lapply(gsub("[[:punct:]]", "", tolower(x))[1:7], sentence_splitter)) 
#> "she wrote him along letter the did re adit"          
#> "shew rote him along letter but he did tread it"         
#> "the shooter says goodbye to his love"           
#> "the shooters ays goodbye to his love"           
#> "writing alist of random sentence sis harder ani initially though tit would be" 
#> "writing alist of randoms en ten es is harder than initially thought it would be" 
#> "let me help you with your baggage"            
#> "let me help you withy our baggage"            
#> "please wait outside of the house"            
#> "please wait outside oft heh use"             
#> "want more detailed information"             
#> "want more detailed information"             
#> "the sky is clear the stars are twinkling"          
#> "the sky is clear the stars are twinkling"
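One detail of the clean-up step is worth checking in isolation: `sub()` replaces only the first match, whereas `gsub()` replaces every match. A minimal sketch, where `clean_text` is a hypothetical helper name:

```r
s <- "Shewrotehimalongletter,buthedidn'treadit."

# sub() removes only the first punctuation mark (the comma here).
first_only <- sub("[[:punct:]]", "", tolower(s))

# Hypothetical helper: lower-case the text and strip all punctuation.
clean_text <- function(text) {
    gsub("[[:punct:]]", "", tolower(text))
}

first_only
clean_text(s)
```

With the splitter above, a leftover punctuation mark is simply skipped as a one-character non-word, so either version splits these sentences the same way, but stripping everything up front makes the input easier to reason about.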

Answered 2016-08-09 02:59:02
