2013-06-28 59 views
10

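SQLite3 query optimization: join vs subquery

I'm trying to figure out the best way (it may not matter in this case) to find rows of one table based on a flag and on whether a row in another table refers to the row's id.

Here is the schema: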

CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    dirty INTEGER NOT NULL);

CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);

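I'm using SQLite3.

The files table will be very large, typically 10K-5M rows. The resume_points table will be small, fewer than 10K rows, with only 1-2 distinct scan_file_id values.

So my first thought was: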

select distinct files.* from resume_points inner join files 
on resume_points.scan_file_id=files.id where files.dirty = 1; 

A colleague suggested turning the join around:

select distinct files.* from files inner join resume_points 
on files.id=resume_points.scan_file_id where files.dirty = 1; 

Then I thought that, since we know the number of distinct scan_file_id values will be small, a subselect might be optimal (in this rare case):

select * from files where id in (select distinct scan_file_id from resume_points); 

The explain outputs for these queries have the following numbers of rows, respectively: 42, 42, and 48.
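
A note on reading those numbers: plain `explain` lists one row per virtual-machine instruction, so 42 versus 48 rows says little about run time. `EXPLAIN QUERY PLAN` reports the scans and lookups instead. The sketch below assumes Python's standard `sqlite3` module and an empty in-memory copy of the schema; the `queries` dict and the `query_plan` helper are illustrative names, and the plan text differs between SQLite versions, so no output is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);
""")

# Two of the formulations from the question (dirty filter added to the subselect).
queries = {
    "join": "select distinct files.* from resume_points inner join files "
            "on resume_points.scan_file_id = files.id where files.dirty = 1",
    "subselect": "select * from files where id in "
                 "(select distinct scan_file_id from resume_points) and dirty = 1",
}

def query_plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail text.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plans = {name: query_plan(sql) for name, sql in queries.items()}
for name, steps in plans.items():
    print(name)
    for step in steps:
        print("   ", step)
```

Look for SCAN (a full pass over a table) versus SEARCH (an index or primary-key lookup) in the detail text.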

+1

It depends on your data and hardware. You'll have to measure it yourself. –

+1

You're missing `and files.dirty = 1` in the last query – eglasius

Answers

11

TL;DR: The best query and index:

create index uniqueFiles on resume_points (scan_file_id); 
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1; 
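
To see what this query returns, here is a small sketch on toy data using Python's standard `sqlite3` module. The row counts and file ids (10, 11, 20) are made up for illustration and are not the benchmark data from this answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);
create index uniqueFiles on resume_points (scan_file_id);
""")

# Toy data (assumption): 1000 files, even ids are dirty.
conn.executemany("insert into files (id, dirty) values (?, ?)",
                 [(i, 1 if i % 2 == 0 else 0) for i in range(1, 1001)])
# 300 resume points spread over three files: 10 (dirty), 11 (clean), 20 (dirty).
conn.executemany("insert into resume_points (scan_file_id) values (?)",
                 [(fid,) for fid in (10, 11, 20) for _ in range(100)])

best_query = """
select * from (select distinct scan_file_id from resume_points) d
join files on d.scan_file_id = files.id and files.dirty = 1
"""
rows = sorted(conn.execute(best_query).fetchall())
print(rows)  # [(10, 10, 1), (20, 20, 1)]
```

Each result row holds `d.scan_file_id` followed by the `files` columns, because the query uses `select *` over the join.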

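Since I typically use SQL Server, at first I thought the query optimizer would surely find the optimal execution plan for such a simple query, no matter which of these equivalent SQL statements you write. So I downloaded SQLite and started playing around. Much to my surprise, there was a huge difference in performance.

Here is the setup code: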

CREATE TABLE files (
id INTEGER PRIMARY KEY autoincrement, 
dirty INTEGER NOT NULL); 

CREATE TABLE resume_points (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL , 
scan_file_id INTEGER NOT NULL); 

insert into files (dirty) values (0); 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files; 

insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000; 

insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000; 

I considered three indexes:

create index dirtyFiles on files (dirty, id); 
create index uniqueFiles on resume_points (scan_file_id); 
create index fileLookup on files (id); 

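Below are the queries I tried and their execution times on my Core i5 laptop. The database file is only about 200MB because it doesn't hold any other data.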

select distinct files.* from resume_points inner join files on resume_points.scan_file_id=files.id where files.dirty = 1; 
4.3-4.5 ms with and without index

select distinct files.* from files inner join resume_points on files.id=resume_points.scan_file_id where files.dirty = 1; 
4.4-4.7 ms with and without index

select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1; 
2.0-2.5 ms with uniqueFiles
2.6-2.9 ms without uniqueFiles

select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1; 
2.1-2.5 ms with uniqueFiles
2.6-3 ms without uniqueFiles

SELECT f.* FROM resume_points rp INNER JOIN files f on rp.scan_file_id = f.id 
WHERE f.dirty = 1 GROUP BY f.id 
4500-6190 ms with uniqueFiles
8.8-9.5 ms without uniqueFiles
14000 ms with uniqueFiles and fileLookup

select * from files where exists (
select * from resume_points where files.id = resume_points.scan_file_id) and dirty = 1; 
8400 ms with uniqueFiles
7400 ms without uniqueFiles

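It looks like SQLite's query optimizer isn't very advanced. The best queries first reduce resume_points to a small number of rows (two in the test case; the OP said it would be 1-2) and then look up each file to see whether it is dirty. The dirtyFiles index didn't make much of a difference for any of the queries. I think that may be because of the way the data is arranged in the test tables; it may make a difference on production tables. However, the difference would not be great, because there will only be a handful of lookups. uniqueFiles does make a difference, since it can reduce the 10000 rows of resume_points to 2 rows without scanning most of them. fileLookup did make some queries slightly faster, but not enough to change the results significantly. Notably, it made the GROUP BY query very slow. In conclusion, reduce the result set early to make the biggest difference.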
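
To repeat this kind of comparison on your own data, a timing harness could look like the sketch below. It assumes Python's standard `sqlite3` and `timeit` modules and a toy data set far smaller than the one above, so the absolute numbers will not match the figures in this answer. The queries select only `id` so that their results can be compared directly, and the names `variants`, `run`, `results` and `timings` are illustrative.

```python
import sqlite3
import timeit

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);
""")
# Toy scale (assumption): 20,000 files, odd ids dirty; 2,000 resume points on 2 files.
conn.executemany("insert into files (id, dirty) values (?, ?)",
                 ((i, i % 2) for i in range(1, 20001)))
conn.executemany("insert into resume_points (scan_file_id) values (?)",
                 ((fid,) for fid in (101, 102) for _ in range(1000)))
conn.execute("create index uniqueFiles on resume_points (scan_file_id)")
conn.commit()

variants = {
    "join + distinct": "select distinct files.id from resume_points inner join files "
                       "on resume_points.scan_file_id = files.id where files.dirty = 1",
    "derived table": "select files.id from (select distinct scan_file_id from resume_points) d "
                     "join files on d.scan_file_id = files.id and files.dirty = 1",
    "in subselect": "select id from files where id in "
                    "(select distinct scan_file_id from resume_points) and dirty = 1",
    "group by": "select f.id from resume_points rp inner join files f "
                "on rp.scan_file_id = f.id where f.dirty = 1 group by f.id",
    "exists": "select id from files where exists (select * from resume_points "
              "where files.id = resume_points.scan_file_id) and dirty = 1",
}

def run(sql):
    # Fetch everything so the timing includes producing all result rows.
    return sorted(row[0] for row in conn.execute(sql))

results = {name: run(sql) for name, sql in variants.items()}
# Best of 3 repeats, 5 executions each, reported as seconds per execution.
timings = {name: min(timeit.repeat(lambda sql=sql: run(sql), number=5, repeat=3)) / 5
           for name, sql in variants.items()}
for name, seconds in timings.items():
    print(f"{name:16s} {seconds * 1000:8.3f} ms  ids={results[name]}")
```

Check that every variant returns the same ids before comparing times, and rerun with and without each index to see which one matters for your data.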

+1

Did you run the analyze command after creating the indexes? – Giorgi
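
For reference, running `ANALYZE` stores index statistics in the `sqlite_stat1` table, which the query planner can then consult. This is a minimal sketch assuming Python's standard `sqlite3` module and made-up toy data; whether it changes the timings reported above was not measured here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);
create index uniqueFiles on resume_points (scan_file_id);
""")
# Toy data (assumption): 300 resume points over three file ids.
conn.executemany("insert into files (id, dirty) values (?, ?)",
                 [(i, i % 2) for i in range(1, 101)])
conn.executemany("insert into resume_points (scan_file_id) values (?)",
                 [(fid,) for fid in (10, 11, 20) for _ in range(100)])

# Gather statistics for the planner, then read back what was recorded.
conn.execute("ANALYZE")
stats = conn.execute("select tbl, idx, stat from sqlite_stat1").fetchall()
for tbl, idx, stat in stats:
    print(tbl, idx, stat)
```

The `stat` column starts with the number of rows in the index, followed by the average number of rows per distinct key value.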

1

Since files.id is the primary key, try GROUPing BY this field instead of checking DISTINCT files.*

SELECT f.* 
FROM resume_points rp 
INNER JOIN files f on rp.scan_file_id = f.id 
WHERE f.dirty = 1 
GROUP BY f.id 

Another option to consider for performance is adding an index on resume_points.scan_file_id

CREATE INDEX index_resume_points_scan_file_id ON resume_points (scan_file_id) 

1

You could try exists, which will not produce any duplicate files:

select * from files 
where exists (
    select * from resume_points 
    where files.id = resume_points.scan_file_id 
) 
and dirty = 1; 

Of course, it might help to have the proper indexes:

files.dirty 
resume_points.scan_file_id 

Whether an index is useful will depend on your data.
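
To illustrate the point about duplicates, here is a small sketch assuming Python's standard `sqlite3` module and made-up toy data. A plain join without `distinct` returns one row per matching resume point, while `exists` returns each file once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);
""")
# Toy data (assumption): files 1-5 are dirty, files 6-10 are clean.
conn.executemany("insert into files (id, dirty) values (?, ?)",
                 [(i, 1 if i <= 5 else 0) for i in range(1, 11)])
# Four resume points each for file 3 (dirty) and file 7 (clean).
conn.executemany("insert into resume_points (scan_file_id) values (?)",
                 [(fid,) for fid in (3, 7) for _ in range(4)])

plain_join = """
select files.* from files inner join resume_points
on files.id = resume_points.scan_file_id where files.dirty = 1
"""
exists_query = """
select * from files
where exists (
    select * from resume_points
    where files.id = resume_points.scan_file_id
)
and dirty = 1
"""
join_rows = conn.execute(plain_join).fetchall()
exists_rows = conn.execute(exists_query).fetchall()
print(len(join_rows), "rows from the plain join")   # 4
print(exists_rows)                                  # [(3, 1)]
```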

0

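If the table "resume_points" will only ever hold one or two distinct file id numbers, it seems it only needs one or two rows, with scan_file_id as the primary key. That table has only two columns, and the id number is meaningless.

And if that's the case, you don't need either of the id numbers.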

pragma foreign_keys = on; 
CREATE TABLE resume_points (
    scan_file_id integer primary key 
); 

CREATE TABLE files (
    scan_file_id integer not null references resume_points (scan_file_id), 
    dirty INTEGER NOT NULL, 
    primary key (scan_file_id, dirty) 
); 

Now you don't need a join at all. Just query the "files" table.

1

I think jtseng gave the solution.

select * from (select distinct scan_file_id from resume_points) d 
join files on d.scan_file_id = files.id and files.dirty = 1 

Basically, it is the same as what you already posted as your last option:

select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1; 

That's because you have to avoid a full table scan/join.

So first you need your 1-2 distinct ids:

select distinct scan_file_id from resume_points 

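After that, only your 1-2 rows have to be joined against the other table instead of all 10K, and that is where the performance gain comes from.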
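
The two steps can also be spelled out by hand, which makes the "reduce first, then look up" idea explicit. This is a sketch assuming Python's standard `sqlite3` module and made-up toy data; in practice the single-statement forms above do the same work inside SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);
""")
# Toy data (assumption): 1000 files, even ids dirty; resume points on files 10 and 11.
conn.executemany("insert into files (id, dirty) values (?, ?)",
                 [(i, 1 if i % 2 == 0 else 0) for i in range(1, 1001)])
conn.executemany("insert into resume_points (scan_file_id) values (?)",
                 [(fid,) for fid in (10, 11) for _ in range(500)])

# Step 1: collapse resume_points to its few distinct file ids.
ids = sorted(row[0] for row in
             conn.execute("select distinct scan_file_id from resume_points"))

# Step 2: look up only those ids by primary key and keep the dirty ones.
placeholders = ", ".join("?" for _ in ids)
dirty_files = conn.execute(
    f"select * from files where dirty = 1 and id in ({placeholders}) order by id",
    ids,
).fetchall()
print(ids, dirty_files)  # [10, 11] [(10, 1)]
```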

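If you need this statement in several places, I would put it into a view. The view won't change the performance, but it looks cleaner and is easier to read.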
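
A minimal sketch of such a view, assuming Python's standard `sqlite3` module and made-up toy data. The view name `dirty_resume_files` is a hypothetical choice and does not come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);

-- Hypothetical view name wrapping the subselect form of the query.
create view dirty_resume_files as
select * from files
where id in (select distinct scan_file_id from resume_points) and dirty = 1;
""")
# Toy data (assumption): files 1-4, ids 2 and 4 dirty; resume points on 2 and 3.
conn.executemany("insert into files (id, dirty) values (?, ?)",
                 [(1, 0), (2, 1), (3, 0), (4, 1)])
conn.executemany("insert into resume_points (scan_file_id) values (?)",
                 [(2,), (2,), (3,)])

view_rows = conn.execute("select * from dirty_resume_files order by id").fetchall()
print(view_rows)  # [(2, 1)]
```

Callers then write `select * from dirty_resume_files` and the underlying query stays in one place.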

Also check the query optimizer documentation: http://www.sqlite.org/optoverview.html