Scalable way of searching for (similar) strings in a database

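Let me describe my problem. There is an input string and a table containing thousands of strings. I'm looking for the best way to search for the strings most similar* to the input string. The search should return a list of about 10 suggested strings, ordered by similarity. If possible, the strings in the database also have a numeric weight (popularity) associated with them, so strings with a higher weight should have a better chance of appearing in the results.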

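What would be the best library to achieve this? I'm looking for something similar to Elasticsearch. I don't have much experience with these kinds of libraries, so I need something that is easy to include in my project, preferably open source. I use Python (Flask and SQLAlchemy) and PostgreSQL, but I could also use, for example, Node.js if needed.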

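*I'd also like to clarify what kind of similarity I'm looking for. Ideally it would be semantic similarity, but lexical similarity is fine too. I'd be happy with any tool that works properly, is easy to implement, and is as scalable and performant as possible.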

Example input sentence:

I don't like cangaroos.

Example suggestions from the database:

Cangaroos are not my favorite.
Cangaroos are evil.
I once had a cangaroo. Never again.

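These suggestions should appear first because 'cangaroo' is not a commonly used word in my database, so any string containing the word 'cangaroo' should show up in the results. Detecting the 'don't like' part is probably hard, so that part is entirely optional for me.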

P.S. Could PostgreSQL's full-text search do this?

Thanks.


2016-12-24 Ognjen

PostgreSQL's full-text search cannot do what you're looking for. PostgreSQL trigram similarity, however, can.

You first need to install the 'trigram similarity' (pg_trgm) and 'btree_gist' extensions in your database by executing (once):

CREATE EXTENSION pg_trgm; 
CREATE EXTENSION btree_gist;
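
To get a feel for what trigram similarity compares, pg_trgm's show_trgm function lists the trigrams extracted from a string. The following is only a small illustration using a word from the example sentences; similarity() is, roughly, the number of trigrams two strings share divided by the number of distinct trigrams found in either of them:

```sql
-- Each word is lower-cased and treated as if it were prefixed
-- by two spaces and followed by one space before trigrams are extracted
SELECT show_trgm('cangaroo') ;
SELECT show_trgm('cangaroos') ;

-- The two words share most of their trigrams, so this value is high
SELECT similarity('cangaroo', 'cangaroos') ;
```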

I assume you have a table that looks like this:

CREATE TABLE sentences 
(
    sentence_id integer PRIMARY KEY, 
    sentence text 
) ; 

INSERT INTO sentences (sentence_id, sentence) 
VALUES 
    (1, 'Cangaroos are not my favorite.'), 
    (2, 'A vegetable sentence.'), 
    (3, 'Cangaroos are evil.'), 
    (4, 'Again, some plants in my garden.'), 
    (5, 'I once had a cangaroo. Never again.') ;

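The table needs a 'trigram index' so that PostgreSQL can 'index by similarity'. This is achieved by creating an index on the sentence column that uses a pg_trgm operator class.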
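
A minimal sketch of such an index, assuming a GiST index with the gist_trgm_ops operator class that pg_trgm provides. The index name is hypothetical, and this is not necessarily the exact statement the answer used:

```sql
-- Assumption: a GiST trigram index on the sentence column.
-- sentences_sentence_trgm_idx is a made-up name for illustration.
CREATE INDEX sentences_sentence_trgm_idx
    ON sentences
    USING GIST (sentence gist_trgm_ops) ;
```

With btree_gist installed, a plain integer column such as sentence_id can also be added as a second column of the same GiST index, which is presumably why that extension is listed. pg_trgm also offers gin_trgm_ops if a GIN index is preferred.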

To find the answers you are looking for, you execute:

-- Set the minimum similarity you want to be able to search 
SELECT set_limit(0.2) ; 

-- And now, select the sentences 'similar' to the input one 
SELECT 
    similarity(sentence, 'I don''t like cangaroos') AS similarity, 
    sentence_id, 
    sentence 
FROM 
    sentences 
WHERE 
    /* That's how you choose your sentences: 
     % means 'similar to', in the trigram sense */ 
    sentence % 'I don''t like cangaroos' 
ORDER BY 
    similarity DESC ;

The result you get is:

similarity | sentence_id | sentence
-----------+-------------+-------------------------------------
    0.3125 |           3 | Cangaroos are evil.
    0.2325 |           1 | Cangaroos are not my favorite.
    0.2173 |           5 | I once had a cangaroo. Never again.
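
The question also asks for about 10 suggestions and for a popularity weight to influence the order. A minimal sketch of one way to do that, assuming a hypothetical non-negative numeric column named weight on the sentences table (it is not part of the table defined earlier); the scoring formula is just one possible blend, not something the answer prescribes:

```sql
-- Assumption: sentences has an extra column "weight" (hypothetical, >= 0)
SELECT
    sentence_id,
    sentence,
    similarity(sentence, 'I don''t like cangaroos') AS similarity,
    -- Damp the weight with a logarithm so popularity nudges the ranking
    -- without completely overriding the trigram similarity
    similarity(sentence, 'I don''t like cangaroos') * (1 + ln(1 + weight)) AS score
FROM
    sentences
WHERE
    sentence % 'I don''t like cangaroos'
ORDER BY
    score DESC
LIMIT 10 ;
```

The WHERE clause still filters by trigram similarity alone, so the weight only reorders rows that already pass the similarity threshold.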

Hope this gives you what you want...


2016-12-24 23:56:08 joanolo

Thanks Joanolo, it works perfectly! – Ognjen

If anyone needs to do this in Flask-SQLAlchemy, let me know and I'll post my code. – Ognjen
