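First of all, I would advise against using R for this task. R is great for many things, but text manipulation is not one of them. Python could be a good alternative.
That said, if I were to implement this in R, I would probably do something like this (very, very rough):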
# You will probably read these from an external file or a database
goodWords <- c("candesartan cilexetil", "glyburide", "nifedipine", "digoxin", "blabla", "warfarin", "hydrochlorothiazide")
badWords <- c("no significant", "other drugs")
mytext <- "no significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide."
mytext <- tolower(mytext) # Let's make life a little bit easier...
goodPos <- NULL
badPos <- NULL
# First we find the good words
for (w in goodWords)
{
    # fixed = TRUE matches w literally instead of treating it as a regex
    pos <- regexpr(w, mytext, fixed = TRUE)
    if (pos != -1)
    {
        cat(paste(w, "found at position", pos, "\n"))
    }
    else
    {
        pos <- NA
        cat(paste(w, "not found\n"))
    }
    goodPos <- c(goodPos, pos)
}
# And then the bad words
for (w in badWords)
{
    # fixed = TRUE matches w literally instead of treating it as a regex
    pos <- regexpr(w, mytext, fixed = TRUE)
    if (pos != -1)
    {
        cat(paste(w, "found at position", pos, "\n"))
    }
    else
    {
        pos <- NA
        cat(paste(w, "not found\n"))
    }
    badPos <- c(badPos, pos)
}
# Note that we use -badPos so that we can calculate the distance with rowSums
# (pairs involving a word that was not found are NA and are skipped by which.min)
comb <- expand.grid(goodPos, -badPos)
wordcomb <- expand.grid(goodWords, badWords)
dst <- cbind(wordcomb, abs(rowSums(comb)))
mn <- which.min(dst[,3])
cat(paste("The closest good-bad word pair is: ", dst[mn, 1],"-", dst[mn, 2],"\n"))
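If you want something a bit less rough, the two loops and the expand.grid step can probably be collapsed into a couple of vectorised calls. This is only a sketch: `firstPos` is a helper name I made up for illustration, and it keeps the same simplification as the code above (only the first occurrence of each word is considered, and distance is measured between start positions).

```r
# Hypothetical helper (not part of the code above): first position of each
# word in `text`, matched literally; NA when the word does not occur.
firstPos <- function(words, text) {
    pos <- vapply(words,
                  function(w) as.integer(regexpr(w, text, fixed = TRUE)),
                  integer(1))
    pos[pos == -1L] <- NA_integer_
    pos
}

goodWords <- c("candesartan cilexetil", "glyburide", "nifedipine", "digoxin",
               "blabla", "warfarin", "hydrochlorothiazide")
badWords <- c("no significant", "other drugs")
mytext <- tolower("no significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.")

goodPos <- firstPos(goodWords, mytext)
badPos <- firstPos(badWords, mytext)

# Distance matrix: one row per good word, one column per bad word
dst <- abs(outer(goodPos, badPos, "-"))
dimnames(dst) <- list(goodWords, badWords)

# which.min skips NA entries, so words that were not found are ignored
closest <- arrayInd(which.min(dst), dim(dst))
cat("The closest good-bad word pair is:",
    goodWords[closest[1]], "-", badWords[closest[2]], "\n")
```

With the sample sentence this should pick the same pair as the loop version. Having `dst` as a named matrix also makes it easy to inspect all the distances at once instead of only the minimum.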
That is almost exactly what I was looking for. Thank you, Nico! – 2010-06-21 15:26:38