What is the correct way to store uni/bi/trigram ngrams in an RDBMS?

I have a list of unigrams (single words), bigrams (two words), and trigrams (three words) that I have extracted from a bunch of documents. My goal is a static analysis report, plus a search I can run against these documents.

John Doe 
Xeon 5668x 
corporate tax rates 
beach 
tax plan 
Porta San Giovanni

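The ngrams are tagged with a date and a document. So, for example, I can find relations between two bigrams, when their phrases first appeared, and relations between documents. I can also search for documents that contain X number of these uni/bi/trigram phrases.

So my question is how to store them to optimize for these searches.

The simplest approach is a plain string column for each phrase; each time I find that word/phrase in a document, I add a relation to the document_ngram table.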

table document 
{ 
    id 
    text 
    date 
} 

table ngram 
{ 
    id 
    ngram varchar(200); 
} 

table document_ngram 
{ 
    id 
    ngram_id 
    document_id 
    date 
}

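However, this means that if I want to search the ngrams for a single word, I have to use a string search. For example, let's say I want all trigrams that contain the word "summer".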
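To make the string-search problem concrete, here is a minimal sketch. It assumes SQLite via Python's `sqlite3` module instead of PostgreSQL/MySQL, and the sample phrases and the helper `ngrams_containing` are made up for illustration. The word is matched with a `LIKE` pattern that starts with a wildcard, which is the kind of predicate an ordinary B-tree index generally cannot help with.

```python
import sqlite3

# In-memory SQLite database standing in for the real RDBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ngram (id INTEGER PRIMARY KEY, ngram VARCHAR(200) UNIQUE)"
)
conn.executemany(
    "INSERT INTO ngram (ngram) VALUES (?)",
    [
        ("summer",),
        ("summer tax plan",),
        ("late summer heat",),
        ("corporate tax rates",),
        ("midsummer night dream",),
    ],
)


def ngrams_containing(word):
    """Return every stored phrase that contains `word` as a whole word."""
    # Pad both sides with spaces so 'summer' does not match 'midsummer'.
    rows = conn.execute(
        "SELECT ngram FROM ngram WHERE ' ' || ngram || ' ' LIKE ? ORDER BY id",
        (f"% {word} %",),
    )
    return [row[0] for row in rows]


print(ngrams_containing("summer"))
```

Every lookup of this kind has to inspect the phrase text itself, which is the cost the question is trying to avoid.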

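So what if I split the phrases up instead, so that the only thing stored in ngram is a single word, and then add three columns so that all 1-, 2-, and 3-word chains fit in document_ngram?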

table document_ngram 
{ 
    id 
    word1_id NOT NULL 
    word2_id DEFAULT NULL 
    word3_id DEFAULT NULL 
    document_id 
    date 
}

Is this the correct way to do it? Are there better ways? I'm currently using PostgreSQL and MySQL, but I believe this is a generic SQL question.
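As a rough sketch of what the three-column layout buys and costs, the following again assumes SQLite through Python's `sqlite3`; the sample words, dates, document ids, and the helper `rows_with_word` are illustrative only. A word lookup becomes an equality match on the word table, but the join has to test all three position columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ngram (id INTEGER PRIMARY KEY, ngram VARCHAR(200) UNIQUE);
CREATE TABLE document_ngram (
    id          INTEGER PRIMARY KEY,
    word1_id    INTEGER NOT NULL REFERENCES ngram(id),
    word2_id    INTEGER DEFAULT NULL REFERENCES ngram(id),
    word3_id    INTEGER DEFAULT NULL REFERENCES ngram(id),
    document_id INTEGER NOT NULL,
    date        TEXT
);
INSERT INTO ngram (id, ngram) VALUES
    (1, 'summer'), (2, 'tax'), (3, 'plan'), (4, 'late'), (5, 'heat'), (6, 'beach');
INSERT INTO document_ngram (id, word1_id, word2_id, word3_id, document_id, date) VALUES
    (1, 1, 2, 3,       10, '2012-06-01'),  -- summer tax plan
    (2, 4, 1, 5,       10, '2012-06-01'),  -- late summer heat
    (3, 6, NULL, NULL, 11, '2012-06-09'),  -- beach
    (4, 2, 3, NULL,    11, '2012-06-09');  -- tax plan
""")


def rows_with_word(word):
    """Return document_ngram ids whose chain contains `word` in any position."""
    rows = conn.execute(
        """
        SELECT dn.id
        FROM document_ngram dn
        JOIN ngram w ON w.id IN (dn.word1_id, dn.word2_id, dn.word3_id)
        WHERE w.ngram = ?
        ORDER BY dn.id
        """,
        (word,),
    )
    return [row[0] for row in rows]


print(rows_with_word("summer"))
```

The string search is gone, but every query must enumerate `word1_id`, `word2_id`, and `word3_id`, and a 4-gram would require a schema change.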

Source

2012-06-09 Xeoncross

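The last version of "document_ngram" contains a repeating group. You'll need an extra table to avoid this. (The second version puts the repeating group inside a string, which is even worse.) – wildplasser

@wildplasser, what do you mean by "repeating group"? – Xeoncross

1NF: word1_id, word2_id, word3_id are essentially an *array*. – wildplasser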

This is how I would model your data (note that 'the' is referenced twice). You could also add weights to the individual words.

DROP SCHEMA IF EXISTS ngram CASCADE; 
CREATE SCHEMA ngram; 

SET search_path='ngram'; 

CREATE table word 
    (word_id INTEGER PRIMARY KEY 
    , the_word varchar 
    , constraint word_the_word UNIQUE (the_word) 
    ); 
CREATE table ngram 
    (ngram_id INTEGER PRIMARY KEY 
    , n INTEGER NOT NULL -- arity 
    , weight REAL -- payload 
    ); 

CREATE TABLE ngram_word 
    (ngram_id INTEGER NOT NULL REFERENCES ngram(ngram_id) 
    , seq INTEGER NOT NULL 
    , word_id INTEGER NOT NULL REFERENCES word(word_id) 
    , PRIMARY KEY (ngram_id,seq) 
    ); 

INSERT INTO word(word_id,the_word) VALUES 
(1, 'the') ,(2, 'man') ,(3, 'who') ,(4, 'sold') ,(5, 'world'); 

INSERT INTO ngram(ngram_id, n, weight) VALUES 
(101, 6, 1.0); 

INSERT INTO ngram_word(ngram_id,seq,word_id) VALUES 
(101, 1, 1) 
, (101, 2, 2) 
, (101, 3, 3) 
, (101, 4, 4) 
, (101, 5, 1) 
, (101, 6, 5) 
    ; 

SELECT w.* 
FROM ngram_word nw 
JOIN word w ON w.word_id = nw.word_id 
WHERE ngram_id = 101 
ORDER BY seq;

Result:

word_id | the_word 
---------+---------- 
     1 | the 
     2 | man 
     3 | who 
     4 | sold 
     1 | the 
     5 | world 
(6 rows)

Now, suppose you want to add a 4-gram to the existing (6-gram) data:

INSERT INTO word(word_id,the_word) VALUES 
(6, 'is') ,(7, 'lost') ; 

INSERT INTO ngram(ngram_id, n, weight) VALUES 
(102, 4, 0.1); 

INSERT INTO ngram_word(ngram_id,seq,word_id) VALUES 
(102, 1, 1) 
, (102, 2, 2) 
, (102, 3, 6) 
, (102, 4, 7) 
    ; 

SELECT w.* 
FROM ngram_word nw 
JOIN word w ON w.word_id = nw.word_id 
WHERE ngram_id = 102 
ORDER BY seq;

Additional result:

INSERT 0 2 
INSERT 0 1 
INSERT 0 4 
word_id | the_word 
---------+---------- 
     1 | the 
     2 | man 
     6 | is 
     7 | lost 
(4 rows)
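The asker's "all ngrams containing a given word" search maps onto this model as a plain join. The sketch below rebuilds the same tables and rows in SQLite through Python's `sqlite3` (an assumption; the answer targets PostgreSQL), and the helpers `ngrams_containing` and `ngram_text` are hypothetical names introduced for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (word_id INTEGER PRIMARY KEY, the_word VARCHAR UNIQUE);
CREATE TABLE ngram (ngram_id INTEGER PRIMARY KEY, n INTEGER NOT NULL, weight REAL);
CREATE TABLE ngram_word (
    ngram_id INTEGER NOT NULL REFERENCES ngram(ngram_id),
    seq      INTEGER NOT NULL,
    word_id  INTEGER NOT NULL REFERENCES word(word_id),
    PRIMARY KEY (ngram_id, seq)
);
INSERT INTO word VALUES
    (1, 'the'), (2, 'man'), (3, 'who'), (4, 'sold'), (5, 'world'), (6, 'is'), (7, 'lost');
INSERT INTO ngram VALUES (101, 6, 1.0), (102, 4, 0.1);
INSERT INTO ngram_word VALUES
    (101, 1, 1), (101, 2, 2), (101, 3, 3), (101, 4, 4), (101, 5, 1), (101, 6, 5),
    (102, 1, 1), (102, 2, 2), (102, 3, 6), (102, 4, 7);
""")


def ngrams_containing(word):
    """Return ids of ngrams that contain `word` at any position."""
    rows = conn.execute(
        """
        SELECT DISTINCT nw.ngram_id
        FROM ngram_word nw
        JOIN word w ON w.word_id = nw.word_id
        WHERE w.the_word = ?
        ORDER BY nw.ngram_id
        """,
        (word,),
    )
    return [row[0] for row in rows]


def ngram_text(ngram_id):
    """Reassemble an ngram's words in seq order."""
    rows = conn.execute(
        """
        SELECT w.the_word
        FROM ngram_word nw
        JOIN word w ON w.word_id = nw.word_id
        WHERE nw.ngram_id = ?
        ORDER BY nw.seq
        """,
        (ngram_id,),
    )
    return " ".join(row[0] for row in rows)


for ngram_id in ngrams_containing("the"):
    print(ngram_id, ngram_text(ngram_id))
```

`DISTINCT` matters here because 'the' occurs twice in ngram 101; without it that ngram would be returned twice.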

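BTW: adding a document-type object to this model will add two more tables to it: one for the document, and one for document*ngram (or, taking a different approach, for document*word). A recursive model would also be possible.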
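One possible shape for those two extra tables is sketched below, assuming SQLite via Python's `sqlite3`. The table and column names (`document`, `document_ngram`), the sample dates, and both helper functions are assumptions made for illustration, not part of the answer. The two queries correspond to the searches described in the question: when an ngram first appeared, and which documents contain at least X of a given set of ngrams.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ngram (ngram_id INTEGER PRIMARY KEY, n INTEGER NOT NULL, weight REAL);
CREATE TABLE document (document_id INTEGER PRIMARY KEY, text TEXT, date TEXT NOT NULL);
-- Link table: one row per (document, ngram) pair.
CREATE TABLE document_ngram (
    document_id INTEGER NOT NULL REFERENCES document(document_id),
    ngram_id    INTEGER NOT NULL REFERENCES ngram(ngram_id),
    PRIMARY KEY (document_id, ngram_id)
);
INSERT INTO ngram VALUES (101, 6, 1.0), (102, 4, 0.1);
INSERT INTO document VALUES
    (1, 'first sample text',  '2012-01-05'),
    (2, 'second sample text', '2012-03-10'),
    (3, 'third sample text',  '2012-06-09');
INSERT INTO document_ngram VALUES (1, 101), (2, 101), (2, 102), (3, 102);
""")


def first_seen(ngram_id):
    """Earliest date of any document containing the ngram (ISO dates sort as text)."""
    row = conn.execute(
        """
        SELECT MIN(d.date)
        FROM document_ngram dn
        JOIN document d ON d.document_id = dn.document_id
        WHERE dn.ngram_id = ?
        """,
        (ngram_id,),
    ).fetchone()
    return row[0]


def docs_with_at_least(ngram_ids, x):
    """Documents that contain at least `x` of the given ngrams."""
    marks = ",".join("?" for _ in ngram_ids)  # one placeholder per id
    sql = (
        "SELECT document_id FROM document_ngram "
        f"WHERE ngram_id IN ({marks}) "
        "GROUP BY document_id HAVING COUNT(DISTINCT ngram_id) >= ? "
        "ORDER BY document_id"
    )
    return [row[0] for row in conn.execute(sql, [*ngram_ids, x])]


print(first_seen(102))
print(docs_with_at_least([101, 102], 2))
```

The date lives on the document here rather than on the link row, which is a design choice of this sketch; the question's own layout repeats it on `document_ngram`.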

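UPDATE: the above model will need an additional constraint, which will need triggers (or a rule plus an additional table) to be implemented. Pseudocode: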

ngram_word.seq > 0 AND ngram_word.seq <= (SELECT ng.n FROM ngram ng WHERE ng.ngram_id = ngram_word.ngram_id)
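The pseudocode above could be enforced roughly as follows. This is a sketch using SQLite's trigger syntax through Python's `sqlite3`, not the PostgreSQL trigger the answer has in mind; the trigger name, the sample rows, and the helper `try_insert` are assumptions. It covers `INSERT` only, so updates to `ngram_word.seq` or `ngram.n` would need further triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (word_id INTEGER PRIMARY KEY, the_word VARCHAR UNIQUE);
CREATE TABLE ngram (ngram_id INTEGER PRIMARY KEY, n INTEGER NOT NULL, weight REAL);
CREATE TABLE ngram_word (
    ngram_id INTEGER NOT NULL REFERENCES ngram(ngram_id),
    seq      INTEGER NOT NULL,
    word_id  INTEGER NOT NULL REFERENCES word(word_id),
    PRIMARY KEY (ngram_id, seq)
);

-- Reject rows whose seq falls outside 1..n for the referenced ngram.
-- If the ngram row is missing, the comparison is NULL and the trigger stays
-- silent; catching that case is left to the foreign key.
CREATE TRIGGER ngram_word_seq_check
BEFORE INSERT ON ngram_word
FOR EACH ROW
WHEN NEW.seq < 1
  OR NEW.seq > (SELECT ng.n FROM ngram ng WHERE ng.ngram_id = NEW.ngram_id)
BEGIN
    SELECT RAISE(ABORT, 'seq out of range for ngram');
END;

INSERT INTO word VALUES (1, 'tax'), (2, 'plan');
INSERT INTO ngram VALUES (201, 2, NULL);  -- a bigram, so seq must be 1 or 2
""")


def try_insert(ngram_id, seq, word_id):
    """Attempt the insert; return False if the database rejects the row."""
    try:
        conn.execute(
            "INSERT INTO ngram_word (ngram_id, seq, word_id) VALUES (?, ?, ?)",
            (ngram_id, seq, word_id),
        )
        return True
    except sqlite3.DatabaseError:
        return False


attempts = [(201, 1, 1), (201, 2, 2), (201, 3, 1), (201, 0, 1)]
results = [try_insert(*attempt) for attempt in attempts]
print(results)
```

The first two attempts fall inside the bigram's range; the last two (seq 3 and seq 0) are the rows the constraint is meant to stop.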

Source

2012-06-09 18:31:10 wildplasser

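Nice. This is the proper way to store the words so that they aren't duplicated and are easy to query. – Gerrat

Words are entities. Ngrams are entities. Documents are entities. The rest is relations between entities. – wildplasser

I'm trying to wrap my head around this. What is the weight column on the ngram table for? Since I need to link ngrams to documents, would I add a document_ngram table containing 'date, ngram_word_id, document_id' and modify the 'ngram_word' table to also have a primary key? Also, would 'word_id' & 'ngram_id' be a sequence in real life? – Xeoncross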

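One idea would be to modify your original table layout. Have the ngram varchar(200) column contain only one word of the ngram, add a word_no column (1, 2, or 3), and add a grouping column, so that, for example, the two records for the two words of a bigram are related (give them the same word_group). [In Oracle, I'd pull the word_group number from a Sequence; I think Postgres has something similar.]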

table document 
{ 
    id 
    text 
    date 
} 

table ngram 
{ 
    id 
    word_group 
    word_no 
    ngram varchar(200); 
} 

table document_ngram 
{ 
    id 
    ngram_id 
    document_id 
    date 
}
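A minimal sketch of how this grouped layout might be queried, assuming SQLite via Python's `sqlite3`; the sample groups and the helpers `groups_containing` and `phrase` are made up for the example. Each word is one row, and `word_group` ties the rows of one ngram together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ngram (
    id         INTEGER PRIMARY KEY,
    word_group INTEGER NOT NULL,   -- shared by all words of one ngram
    word_no    INTEGER NOT NULL,   -- position of the word: 1, 2, or 3
    ngram      VARCHAR(200) NOT NULL,
    UNIQUE (word_group, word_no)
);
INSERT INTO ngram (word_group, word_no, ngram) VALUES
    (1, 1, 'tax'),    (1, 2, 'plan'),                 -- bigram
    (2, 1, 'summer'), (2, 2, 'tax'), (2, 3, 'plan'),  -- trigram
    (3, 1, 'beach');                                  -- unigram
""")


def groups_containing(word):
    """Return the word_group ids of every ngram that contains `word`."""
    rows = conn.execute(
        "SELECT DISTINCT word_group FROM ngram WHERE ngram = ? ORDER BY word_group",
        (word,),
    )
    return [row[0] for row in rows]


def phrase(word_group):
    """Rebuild the full phrase for one group, in word_no order."""
    rows = conn.execute(
        "SELECT ngram FROM ngram WHERE word_group = ? ORDER BY word_no",
        (word_group,),
    )
    return " ".join(row[0] for row in rows)


for group in groups_containing("tax"):
    print(group, phrase(group))
```

Word lookups become equality matches, but note that 'tax' and 'plan' are stored once per group here, so words are repeated across ngrams.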

Source

2012-06-09 18:32:09 Gerrat
