2014-01-12 54 views
5

How do I get the first 10 words of a string in R?

I have a string in R:

x <- "The length of the word is going to be of nice use to me" 

I want the first 10 words of the string above.

As another example, I have a CSV file in the following format:

Keyword,City(Column Header) 
The length of the string should not be more than 10,New York 
The Keyword should be of specific length,Los Angeles 
This is an experimental basis program string,Seattle 
Please help me with getting only the first ten words,Boston 

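For each row, I want only the first 10 words from the "Keyword" column, and I want to write the result to a CSV file. Please help me with this.

Answers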

2

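Here is a small function that splits the string on whitespace, unlists the result, keeps the first ten words, and pastes them back together.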

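string_fun <- function(x) {
    # Split on runs of whitespace and keep at most the first ten words;
    # head() avoids the NA padding that [1:10] gives for shorter strings
    ul <- head(unlist(strsplit(x, split = "\\s+")), 10)
    # Join the words back into a single string
    paste(ul, collapse = " ")
}

string_fun(x)
# [1] "The length of the word is going to be of"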

df <- read.table(text = "Keyword,City(Column Header)
The length of the string should not be more than 10 is or are in,New York
The Keyword should be of specific length is or are in,Los Angeles
This is an experimental basis program string is or are in,Seattle
Please help me with getting only the first ten words is or are in,Boston", sep = ",", header = TRUE)

Using apply (the second column has no effect here, because every Keyword already contains at least ten words):

df$Keyword <- apply(df[,1:2], 1, string_fun)

Edit: This is probably a more general way to use the function.

df[,1] <- as.character(df[,1])
df$Keyword <- unlist(lapply(df[,1], string_fun))

print(df)
#                                                  Keyword City.Column.Header.
# 1       The length of the string should not be more than            New York
# 2     The Keyword should be of specific length is or are         Los Angeles
# 3 This is an experimental basis program string is or are             Seattle
# 4   Please help me with getting only the first ten words              Boston
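
The question also asks for the result to be written back to a CSV file, a step the code above does not cover. Below is a minimal sketch of that step, assuming the data are already in a data frame. `first_n_words` is a hypothetical helper introduced here for illustration (it is not part of the answer), and a temporary file stands in for the real output path.

```r
# Hypothetical helper (not from the answer above): keep at most the first n
# whitespace-separated words of each element of a character vector.
first_n_words <- function(x, n = 10) {
  vapply(
    strsplit(trimws(x), "\\s+"),
    function(w) paste(head(w, n), collapse = " "),
    character(1),
    USE.NAMES = FALSE
  )
}

# Small stand-in for the CSV from the question
df <- data.frame(
  Keyword = c(
    "The length of the string should not be more than 10 is or are in",
    "Please help me with getting only the first ten words is or are in"
  ),
  City = c("New York", "Boston"),
  stringsAsFactors = FALSE
)

# Truncate every keyword in one vectorised call
df$Keyword <- first_n_words(df$Keyword)

# Assumption: a temporary file is used instead of a real output path
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)
```

Because the helper uses `head()` on each split string, keywords with fewer than ten words pass through unchanged rather than being padded with `NA`.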
+1

The second line of the function can be simplified to: `paste(ul, collapse = " ")` – thelatemail

+0

Which library is `unlist` in, in R? @Martin Bel – user3188390

+0

`unlist()` is in base, so there is no need to load anything! Read the documentation with `?unlist` – marbel

15

A regular expression (regex) answer using \w (word character) and its negation \W:

gsub("^((\\w+\\W+){9}\\w+).*$","\\1",x) 
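  1. ^ Start of the string (zero-width)
  2. ((\\w+\\W+){9}\\w+) Ten words separated by non-word characters.
    1. (\\w+\\W+){9} A word followed by non-word characters, 9 times
      1. \\w+ One or more word characters (i.e. a word)
      2. \\W+ One or more non-word characters (i.e. a space)
      3. {9} Nine repetitions
    2. \\w+ The tenth word
  3. .* Anything else, including any words that follow
  4. $ End of the string (zero-width)
  5. \\1 When the pattern matches, replace the match with the first capture group (the 10 words)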
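
The breakdown above can be checked with a short sketch in R. It reuses `x` from the question; `short` is a made-up string added here for illustration. A string with fewer than ten words does not match the pattern at all, so `gsub` returns it unchanged.

```r
x <- "The length of the word is going to be of nice use to me"

# Nine "word + separator" pairs, then a tenth word, then the rest of the string
pattern <- "^((\\w+\\W+){9}\\w+).*$"

first_ten <- gsub(pattern, "\\1", x)
print(first_ten)
# [1] "The length of the word is going to be of"

# Illustrative input with fewer than ten words: no match, so no replacement
short <- "only three words"
print(gsub(pattern, "\\1", short))
# [1] "only three words"
```

Note that `\\w` matches only letters, digits, and underscores, so a contraction such as "don't" counts as two words under this pattern.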
+0

Worked perfectly for me. How can I learn to understand regular expressions for future use? – user3188390

+1

Why not just `gsub("^((\\w+\\W+){10}).*", "\\1", x)`? – thelatemail

+0

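@thelatemail That would include the trailing whitespace (if present), although the proposed method would too if the string ends in whitespace but has no more than 10 words in total. – Dason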
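
The difference discussed in these comments can be seen with a quick comparison, sketched here with `x` from the question. The variable names are made up for illustration.

```r
x <- "The length of the word is going to be of nice use to me"

# Pattern from the answer: the tenth word is matched without its separator
nine_plus_one <- gsub("^((\\w+\\W+){9}\\w+).*$", "\\1", x)

# Pattern from the comment: the tenth word is captured together with the space after it
ten_pairs <- gsub("^((\\w+\\W+){10}).*", "\\1", x)

nchar(nine_plus_one)  # 40
nchar(ten_pairs)      # 41: one extra character, the trailing space

# Trimming the second result gives the same ten words
identical(trimws(ten_pairs), nine_plus_one)  # TRUE
```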

2
x <- "The length of the word is going to be of nice use to me" 
head(strsplit(x, split = "\ "), 10) 
+6

Right idea, but not quite. Try `head(unlist(strsplit(x, split = "\\s+")), 10)` – thelatemail
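
A short sketch of why the `unlist()` call matters: `strsplit()` returns a list with one element per input string, so `head()` applied directly to its result takes the first ten list elements rather than the first ten words. Separately, `"\ "` is not a valid escape sequence in an R string literal, which is why the comment switches to `"\\s+"`. The variable names below are made up for illustration.

```r
x <- "The length of the word is going to be of nice use to me"

pieces <- strsplit(x, split = "\\s+")
is.list(pieces)  # TRUE
length(pieces)   # 1: one list element for the single input string

# head() on the list keeps that one element, with all 14 words still inside
length(head(pieces, 10)[[1]])  # 14

# Flatten to a character vector first, then take the first ten words
words <- head(unlist(pieces), 10)
length(words)  # 10
paste(words, collapse = " ")
# [1] "The length of the word is going to be of"
```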

4

How about using the `word` function from Hadley Wickham's stringr package?

word(string = x, start = 1, end = 10, sep = fixed(" "))
