2012-03-23

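Knowledge-based Q-A system not giving the most appropriate answers

I am working on a project that is essentially a knowledge-based question answering system. My system takes a query from the user, downloads the relevant documents from Wikipedia, strips all HTML tags, and extracts the plain text. It then tokenizes the documents into sentences and builds a term-document (TD) matrix (the query is also passed in as a sentence). Stemming is applied while the TD matrix is built. The TD matrix is then passed to the pLSA (probabilistic latent semantic analysis) algorithm. Finally, the cosine similarity between each document (sentence) vector and the query vector is computed, and the sentences most similar to the query vector are displayed as answers.

The problem is that results are displayed, but they are not the most relevant ones. Where am I going wrong? Is the strategy I have followed correct, or is there another algorithm that might help?

Below are some questions and the answers my system returned: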

What is photosynthesis? 
ANSWER 1 : The stroma contains stacks (grana) of thylakoids, which are the site of photosynthesis 

ANSWER 2 : Factors leaf is the primary site of photosynthesis in plants 

ANSWER 3 : Samuel Ruben and Martin Kamen used radioactive isotopes to determine that the oxygen liberated in photosynthesis came from the water 

ANSWER 4 : In plants, algae and cyanobacteria, photosynthesis releases oxygen 

Another question:

What is Artificial Intelligence? 
ANSWER 1 : the problem of creating 'artificial intelligence' will substantially be solved" 

ANSWER 2 : 37 The leading-edge definition of artificial intelligence research is changing over time 

ANSWER 3 : Stories of these creatures and their fates discuss many of the same hopes, fears and ethical concerns that are presented by artificial intelligence 

ANSWER 4 : History of artificial intelligence and Timeline of artificial intelligence Thinking machines and artificial beings appear in Greek myths , such as Talos of Crete , the bronze robot of Hephaestus , and Pygmalion's Galatea 13 Human likenesses believed to have intelligence were built in every major civilization 

Another question:

Who is a hacker? 

ANSWER 1 : 19 Hackers (short stories) Helba from the 

ANSWER 2 : 16 Rafael Núñez aka RaFa was a notorious most wanted hacker by the FBI since 2001 

ANSWER 3 : Often, this type of 'white hat' hacker is called an ethical hacker 
ANSWER 4 : Hackers also commonly use port scanners 

Another run:

What is biology? 
ANSWER 1 : Molecular biology is the study of biology at a molecular level 

ANSWER 2 : molecular biology studies the complex interactions of systems of biological molecules 

ANSWER 3 : The similarities and differences between cell types are particularly relevant to molecular biology 

ANSWER 4 : Contents History Foundations of modern biology 2 
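For reference, the ranking step described above can be sketched roughly as follows. This is a simplified illustration, not the actual code of the system: it builds sparse term-count vectors and ranks sentences by cosine similarity, and it leaves out both the stemming and the pLSA steps. The function names (`tokenize`, `term_vector`, `cosine`, `rank_sentences`) are made up for this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letters (no stemming in this sketch)."""
    return re.findall(r"[a-z]+", text.lower())


def term_vector(text):
    """One column of the term-document matrix, stored sparsely as term -> count."""
    return Counter(tokenize(text))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_sentences(query, sentences):
    """Return (score, sentence) pairs, most similar to the query first."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(s)), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Two sentences taken from the sample output above.
sentences = [
    "Hackers also commonly use port scanners",
    "In plants, algae and cyanobacteria, photosynthesis releases oxygen",
]
for score, sentence in rank_sentences("What is photosynthesis?", sentences):
    print(f"{score:.3f}  {sentence}")
```

In this simplified form, a sentence scores well simply by sharing terms with the query, whether or not it actually defines the term.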

Answers

1

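I think it will be hard to improve your system if you stick to a purely statistical approach. From a statistical NLP point of view, you have done the right things. What you can do now is fine-tune some parameters. To do that, you have to build a training corpus that tells the system which answer is correct, and then see which values the parameters must take to give you that answer.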
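A minimal sketch of such a tuning loop is shown below, assuming the training corpus is a list of `(query, candidate_sentences, correct_sentences)` triples and that the ranking function accepts its parameters as keyword arguments. Mean reciprocal rank is used here as one possible score; it is my choice for the illustration, not something your system requires. `toy_rank`, its `length_penalty` parameter, and the two-sentence corpus are all invented for the example.

```python
from itertools import product


def reciprocal_rank(ranked, correct):
    """1/position of the first correct answer in the ranked list, or 0 if absent."""
    for position, sentence in enumerate(ranked, start=1):
        if sentence in correct:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rank_fn, corpus, **params):
    """Average reciprocal rank of rank_fn over a labelled corpus."""
    total = 0.0
    for query, candidates, correct in corpus:
        total += reciprocal_rank(rank_fn(query, candidates, **params), correct)
    return total / len(corpus)


def grid_search(rank_fn, corpus, grid):
    """Try every parameter combination in grid; return (best_score, best_params)."""
    names = sorted(grid)
    best = (-1.0, None)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mean_reciprocal_rank(rank_fn, corpus, **params)
        if score > best[0]:
            best = (score, params)
    return best


def toy_rank(query, candidates, length_penalty):
    """Hypothetical ranker: word overlap with the query, minus a length penalty."""
    q = set(query.lower().split())

    def score(sentence):
        words = sentence.lower().split()
        return len(q & set(words)) - length_penalty * len(words)

    return sorted(candidates, key=score, reverse=True)


# Made-up training corpus: (query, candidate sentences, set of correct answers).
corpus = [(
    "what is biology",
    [
        "what is known is that molecular biology is the study of biology at a molecular level",
        "biology is the study of life",
    ],
    {"biology is the study of life"},
)]

best_score, best_params = grid_search(toy_rank, corpus, {"length_penalty": [0.0, 0.2]})
print(best_score, best_params)
```

With a real system, `toy_rank` would be replaced by your own pipeline, and the grid would cover whichever parameters it exposes, for example the number of pLSA topics.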

That said, I do not think fine-tuning the parameters will improve your accuracy by as much as 20% to 30%.

If you want to go further, you need a more semantic approach that represents knowledge symbolically. See, for example, http://www.jfsowa.com/

2

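This is a well-studied problem called question answering (QA). I have given a summary of QA in another answer. Specifically, according to TREC, all of your examples fall into the "definition questions" category. I suggest you go through some of the papers that come up when you search Google or Google Scholar for "TREC definition questions".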
