我寫了一個程序,從用戶處收集問題。然後它將該問題與預先定義的問題列表進行匹配並返回答案。它應該是準確的,並且只與接近(模糊匹配)的問題或用戶輸入的問題相匹配。字符串/句子最近匹配算法不起作用?
我SSSCE:
代碼:
#include <iostream>
#include <cstdint>
#include <algorithm>
#include <numeric>
#include <functional>
int min_index(const std::vector<int>& list)
{
int index = 0;
int smallest = list[0];
for (size_t i = 0; i < list.size(); ++i) {
if (list[i] < smallest) {
smallest = list[i];
index = i;
}
}
return index;
}
std::uint32_t LevenshteinDistance(const std::string &First, const std::string &Second)
{
const std::size_t FirstLength = First.size();
const std::size_t SecondLength = Second.size();
std::vector<std::uint32_t> Current(SecondLength + 1);
std::vector<std::uint32_t> Previous(SecondLength + 1);
std::size_t I = 0;
std::generate(Previous.begin(), Previous.end(), [&] {return I++; });
for (I = 0; I < FirstLength; ++I)
{
Current[0] = I + 1;
for (std::size_t J = 0; J < SecondLength; ++J)
{
auto Cost = First[I] == Second[J] ? 0 : 1;
Current[J + 1] = std::min(std::min(Current[J] + 1, Previous[J + 1] + 1), Previous[J] + Cost);
}
Current.swap(Previous);
}
return Previous[SecondLength];
}
std::vector<std::string> questions =
{
"What is the most popular program at GBC?",
"How much is the tuition at GBC?",
"Do I have to pay my fees before I can register?",
"What are my fee payment options?",
"How do I know when I'm allowed to register?",
"How do I add and/or drop courses from my timetable?",
"What do I do if I can't find my PASSWORD?",
"How do I withdraw from a program?",
"What are the college policies?",
"How much math do I need to know?",
"What is the program code for computer programming?",
"What is stu-view?",
"What is the college best known for?",
"How easy is it to find work after program completion?",
"What if I plan to continue my education after?"
};
std::vector<std::string> answers =
{
"Fashion",
"3000 a semester",
"Yes you have to pay the fees before registering",
"You may pay online on your student account through the student portal",
"You may register two weeks or more before the start of the program",
"You may drop courses from online through the student portal",
"You can call ... and an agent will assist you",
"You may withdraw using the student portal online",
"They are located at the following link...",
"It depends on the program you are entering",
"T127 is the code for computer science",
"Stu-View is a student portal to manage student account and view marks.",
"The cafeteria food",
"Depends on the field of work and timing",
"You may do so within three years after program completion"
};
int main()
{
std::string user_question = "program";
std::vector<int> distances = std::vector<int>(questions.size(), 0);
for (size_t I = 0; I < questions.size(); ++I)
{
int dist = LevenshteinDistance(user_question, questions[I]);
distances[I] = dist;
}
std::cout<<"Distance: "<<distances[min_index(distances)]<<"\n";
std::cout<<"User-Question: "<<user_question<<"\n";
std::cout<<"Question-Key: "<<questions[min_index(distances)]<<"\n";
std::cout<<"Answer-Value: "<<answers[min_index(distances)]<<"\n";
return 0;
}
所以在上面,用戶輸入"program"
。它應該找到問題列表最接近的匹配並返回相應的答案..
但是,它打印出:
Distance: 17
User-Question: program
Question-Key: What is stu-view?
Answer-Value: Stu-View is a student portal to manage student account and view marks.
它應該有更好的結果或準確性,但它似乎並不在乎某個句子是否包含用戶輸入的關鍵字:S適用於小型案例,但對於大型數據庫或上述有5個以上的句子,這是一個艱難的時間..尤其是與關鍵字少; l
我做錯了什麼?任何想法如何我可以修復它,並使其更加準確?我嘗試了HammingDistance,類似的結果..
好了,'編輯距離(a,b)'永遠不會大於'max(a.size(),b.size())'。 17的距離是正確的答案,因爲「什麼是stu-view?」只有17個字符長。你的程序根本沒有任何條件,它不會**找到答案。 –
微小的文體建議:你可以用'std :: iota'替換你的'std :: generate',並將'I'的聲明向下移動到'for'循環中。 (對不起,我正在進行一場討論,讓'iota'知道......) – screwnut