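Count the transpositions needed for a string so that it can be found in another string

2015-04-03

Here is what I am trying to do: when the term I am analysing is "apples", I want to know how many transpositions "apples" needs so that it can be found in a string.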

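"buy apples now" => 0 transpositions needed ("apples" is present).

"cheap aples online" => 1 transposition needed ("apples" to "aples").

"find your ap ple here" => 2 transpositions needed ("apples" to "ap ple").

"aple" => 2 transpositions needed ("apples" to "aple").

"bananas" => 5 transpositions needed ("apples" to "bananas").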

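The stringdist and adist functions don't work for this, because they tell me how many transpositions are needed to turn one whole string into the other. Anyway, here is what I have written so far: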

# build the data frame
a <- c(rep("apples", 5), rep("bananas", 3))
b <- c("buy apples now", "cheap aples online", "find your ap ple here", "aple",
       "bananas", "cherry and bananas", "pumpkin", "banana split")
d <- data.frame(a, b)
colnames(d) <- c("term", "string")

# count transpositions needed (whole term against whole string)
d$transpositions <- mapply(adist, d$term, d$string)
print(d)
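For comparison, here is a minimal sketch of the difference between a whole-string distance and a substring distance, using the example strings from this question. It assumes base R's `adist` and its `partial = TRUE` argument, which according to the `adist` help page matches the term against substrings of the target, the way `agrep` does. What it returns are edit counts (insertions, deletions, substitutions), so they may not agree with the hand counts listed in the question. The names `whole` and `part` are made up for illustration.

```r
term <- "apples"
strings <- c("buy apples now", "cheap aples online",
             "find your ap ple here", "aple", "bananas")

# Whole-string distance: every extra character of the sentence counts as an edit
whole <- drop(adist(term, strings))

# Partial distance: cost of the cheapest match of the term against any substring
part <- drop(adist(term, strings, partial = TRUE))

print(data.frame(string = strings, whole = whole, partial = part))
```

The `whole` column grows with the length of the sentence, while the `partial` column should only reflect how far the closest piece of the sentence is from the term.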

OK, thanks. Should I add it to the title, or are the tags enough? – 2015-04-03 17:34:48


I edited your code (in my answer): "apples" in `a <- c(rep("apples",5),rep("bananas",3))` – infominer 2015-04-03 18:13:08


Ouch, thanks infominer, let me correct it in the question too! – 2015-04-03 21:02:32

Answers


You need to check for "apples" first, and then count the transpositions:

a <- c(rep("apples", 5), rep("bananas", 3))
b <- c("buy apples now", "cheap aples online", "find your ap ple here", "aple",
       "bananas", "cherry and bananas", "pumpkin", "banana split")
d <- data.frame(a, b, stringsAsFactors = FALSE)
colnames(d) <- c("term", "string")

# check for apples first
# (this tests for the literal "apples" only, whatever the term on the row is)
d$apples <- grepl("apples", d$string)

# count transpositions needed: 0 where "apples" was found, adist() otherwise
d$transpositions <- ifelse(d$apples, 0, mapply(adist, d$term, d$string))
print(d)
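If the check should use the term on each row instead of the literal "apples", one possible variation is sketched below. It is an illustration, not part of the answer as posted: the column name `found` is made up here, and `fixed = TRUE` is an assumption that the terms are plain strings and not regular expressions.

```r
term <- c(rep("apples", 5), rep("bananas", 3))
string <- c("buy apples now", "cheap aples online", "find your ap ple here", "aple",
            "bananas", "cherry and bananas", "pumpkin", "banana split")
d <- data.frame(term, string, stringsAsFactors = FALSE)

# Row-wise check: is the row's own term literally present in the row's string?
# `found` is a hypothetical column name used only in this sketch.
d$found <- unname(mapply(grepl, d$term, d$string, MoreArgs = list(fixed = TRUE)))

# Whole-string distance only where the term is absent, 0 otherwise
d$transpositions <- ifelse(d$found, 0, unname(mapply(adist, d$term, d$string)))
print(d)
```

With this variation "cherry and bananas" is scored against its own term "bananas" and gets 0, while the rows without an exact match still receive the whole-string distance.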

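Hmm, I just re-read your question and will have to rethink my answer. I'll post again once I have worked on it. How do you want to handle a sentence, as opposed to a single word, when counting transpositions? – infominer 2015-04-03 18:33:56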


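Thanks @infominer! Much appreciated :) grepl is useful. The first step is indeed to detect whether the correctly spelled term is present in the string. If the correctly spelled term is not found, I then need to isolate the piece of the string that is most similar to my term, and finally calculate how similar that piece is to the term. Regarding sentences as opposed to "one word": I want to avoid "buy aple now" getting a worse score than "aple" because of the extra words "buy" and "now". What matters is how similar the "aple" part of "buy aple now" is to the term "apple". – 2015-04-03 22:27:57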
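One simple way to keep the extra words from affecting the score is to compare the term with each word separately and keep the smallest distance. The sketch below assumes that words are separated by whitespace; `closest_word_distance` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: smallest adist() between the term and any single word
closest_word_distance <- function(term, string) {
  words <- strsplit(string, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]  # drop empty pieces caused by leading spaces
  min(adist(term, words))
}

closest_word_distance("apples", "aple")
closest_word_distance("apples", "buy aple now")
```

Both calls score the same "aple" piece, so the sentence is not penalised for "buy" and "now". A misspelling that contains a space, such as "ap ple", is split into two words by this approach and would need something else, for example a sliding window over characters.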


So, here is the dirty solution I have come up with so far:

library(stringr)  # provides str_trim()

# create a data.frame
a <- c(rep("apples", 5), rep("banana split", 3))
b <- c("buy apples now", "cheap aples online", "find your ap ple here", "aple",
       "bananas", "cherry and bananas", "pumpkin", "banana split")
d <- data.frame(a, b, stringsAsFactors = FALSE)
colnames(d) <- c("term", "string")

# Split each string into sequences of consecutive characters whose length
# equals the length of the term on the same row. Calculate the similarity of
# each sequence to the term and keep the most relevant piece for each row.

mostrelevantpiece <- character(nrow(d))

for (j in seq_len(nrow(d))) {
    termlength <- nchar(d$term[j])
    # at least one window, even when the string is shorter than the term
    nwindows <- max(nchar(d$string[j]) - termlength + 1, 1)
    pieces <- character(nwindows)
    piecesdist <- numeric(nwindows)
    for (i in seq_len(nwindows)) {
        addpiece <- substr(d$string[j], i, i + termlength - 1)
        pieces[i] <- str_trim(addpiece)
        piecesdist[i] <- adist(addpiece, d$term[j])
    }
    # keep the first piece with the smallest distance to the term
    mostrelevantpiece[j] <- pieces[which.min(piecesdist)]
}

# calculate the number of transpositions needed to transform the
# "most relevant piece of string" into the term.

d$transpositionsneeded <- mapply(adist, mostrelevantpiece, d$term)
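The same sliding-window idea can be written without explicit loops. This is only a sketch under a few assumptions: `best_window_distance` is a hypothetical helper name, base R's `trimws` stands in for `str_trim`, and ties are resolved by taking the first window with the smallest distance.

```r
# Hypothetical helper: distance between the term and the trimmed best window
best_window_distance <- function(term, string) {
  n <- nchar(term)
  # at least one window, even when the string is shorter than the term
  starts <- seq_len(max(nchar(string) - n + 1, 1))
  pieces <- substring(string, starts, starts + n - 1)
  # first window with the smallest distance, with surrounding spaces removed
  best <- trimws(pieces[which.min(adist(term, pieces))])
  drop(adist(best, term))
}

terms <- c("apples", "apples", "apples")
strings <- c("buy apples now", "cheap aples online", "aple")
unname(mapply(best_window_distance, terms, strings))
```

Because every window has the same width as the term, the piece that is kept can carry a neighbouring space or letter. Trimming removes spaces only, so the final count can still depend on which window happens to win.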