2014-03-13 43 views
3

我寫了一個程序,從用戶處收集問題。然後它將該問題與預先定義的問題列表進行匹配並返回答案。它應該是準確的,並且只與接近(模糊匹配)的問題或用戶輸入的問題相匹配。字符串/句子最近匹配算法不起作用?

我SSSCE:

http://ideone.com/JTcF73

代碼:

#include <iostream> 
#include <cstdint> 
#include <algorithm> 
#include <numeric> 
#include <functional> 

int min_index(const std::vector<int>& list) 
{ 
    int index = 0; 
    int smallest = list[0]; 

    for (size_t i = 0; i < list.size(); ++i) { 
     if (list[i] < smallest) { 
      smallest = list[i]; 
      index = i; 
     } 
    } 
    return index; 
} 

std::uint32_t LevenshteinDistance(const std::string &First, const std::string &Second) 
{ 
    const std::size_t FirstLength = First.size(); 
    const std::size_t SecondLength = Second.size(); 

    std::vector<std::uint32_t> Current(SecondLength + 1); 
    std::vector<std::uint32_t> Previous(SecondLength + 1); 

    std::size_t I = 0; 
    std::generate(Previous.begin(), Previous.end(), [&] {return I++; }); 

    for (I = 0; I < FirstLength; ++I) 
    { 
     Current[0] = I + 1; 

     for (std::size_t J = 0; J < SecondLength; ++J) 
     { 
      auto Cost = First[I] == Second[J] ? 0 : 1; 
      Current[J + 1] = std::min(std::min(Current[J] + 1, Previous[J + 1] + 1), Previous[J] + Cost); 
     } 

     Current.swap(Previous); 
    } 
    return Previous[SecondLength]; 
} 

std::vector<std::string> questions = 
{ 
    "What is the most popular program at GBC?", 
    "How much is the tuition at GBC?", 
    "Do I have to pay my fees before I can register?", 
    "What are my fee payment options?", 
    "How do I know when I'm allowed to register?", 
    "How do I add and/or drop courses from my timetable?", 
    "What do I do if I can't find my PASSWORD?", 
    "How do I withdraw from a program?", 
    "What are the college policies?", 
    "How much math do I need to know?", 
    "What is the program code for computer programming?", 
    "What is stu-view?", 
    "What is the college best known for?", 
    "How easy is it to find work after program completion?", 
    "What if I plan to continue my education after?" 
}; 

std::vector<std::string> answers = 
{ 
    "Fashion", 
    "3000 a semester", 
    "Yes you have to pay the fees before registering", 
    "You may pay online on your student account through the student portal", 
    "You may register two weeks or more before the start of the program", 
    "You may drop courses from online through the student portal", 
    "You can call ... and an agent will assist you", 
    "You may withdraw using the student portal online", 
    "They are located at the following link...", 
    "It depends on the program you are entering", 
    "T127 is the code for computer science", 
    "Stu-View is a student portal to manage student account and view marks.", 
    "The cafeteria food", 
    "Depends on the field of work and timing", 
    "You may do so within three years after program completion" 
}; 

int main() 
{ 
    std::string user_question = "program"; 

    std::vector<int> distances = std::vector<int>(questions.size(), 0); 

    for (size_t I = 0; I < questions.size(); ++I) 
    { 
     int dist = LevenshteinDistance(user_question, questions[I]); 
     distances[I] = dist; 
    } 

    std::cout<<"Distance:  "<<distances[min_index(distances)]<<"\n"; 
    std::cout<<"User-Question: "<<user_question<<"\n"; 
    std::cout<<"Question-Key: "<<questions[min_index(distances)]<<"\n"; 
    std::cout<<"Answer-Value: "<<answers[min_index(distances)]<<"\n"; 

    return 0; 
} 

所以在上面,用戶輸入"program"。它應該找到問題列表最接近的匹配並返回相應的答案..

但是,它打印出:

Distance:  17 
User-Question: program 
Question-Key: What is stu-view? 
Answer-Value: Stu-View is a student portal to manage student account and view marks. 

它應該有更好的結果或準確性,但它似乎並不在乎某個句子是否包含用戶輸入的關鍵字:S適用於小型案例,但對於大型數據庫或上述有5個以上的句子,這是一個艱難的時間..尤其是與關鍵字少; l

我做錯了什麼?任何想法如何我可以修復它,並使其更加準確?我嘗試了HammingDistance,類似的結果..

+0

好了,'編輯距離(a,b)'永遠不會大於'max(a.size(),b.size())'。 17的距離是正確的答案,因爲「什麼是stu-view?」只有17個字符長。你的程序根本沒有任何條件,它不會**找到答案。 –

+1

微小的文體建議:你可以用'std :: iota'替換你的'std :: generate',並將'I'的聲明向下移動到'for'循環中。 (對不起,我正在進行一場討論,讓'iota'知道......) – screwnut

回答

3

與其他字符串相比,"program""What is stu-view?"都相當短。儘管"program"這個詞很常見,但將"program"更改爲"What is stu-view?"比將"program"更改爲"What is the most popular program at GBC?"更容易。

我在做什麼錯?

我不認爲你做錯了什麼。如果你對結果不滿意,這意味着你現在的形式主義(儘量減少Levenshtein距離)並不是你想要的。

您可以選擇更多本地解決方案:符號化的字符串,計算兩兩Levenshtein距離之間的話,然後合併結果(平均,SUP INF ......)

更好的解決方案將需要做一些參考書目(可能是無監督的機器學習主題)