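I'm parsing text files and inserting them into a PostgreSQL database. My code is written in Java and I use JDBC to connect to the database. I ran into a very strange error while adding data: at an unpredictable moment (the number of main-loop iterations differs from run to run), Postgres seems not to see rows that were just added to a table, and fails to perform an update correctly. Is this a bug in the PostgreSQL SQL engine, and if so, how can I avoid it (is there a workaround)?
Maybe I'm doing something wrong, so perhaps there is a way to fix my code? Or is this a serious PostgreSQL bug that I should report on the PostgreSQL home page (as a bug report)?
Below are the details of what I'm doing and what goes wrong. I simplified my code to isolate the error - the simplified version doesn't parse any text; instead I simulate it with generated words. The source files (Java and SQL) are included at the end of my question.
In this simplified example I use single-threaded code, one JDBC connection, 3 tables and a few SQL statements (the complete Java source is under 90 lines).
The main loop works on "documents" - 20 words each, with consecutive doc_id values (integers).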
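- The buffer table spb_word4obj is cleared for the doc_id being inserted.
- The words are inserted into the buffer table (spb_word4obj).
- Then the unique new words are inserted into the table spb_word.
- Finally, the document's words are inserted into spb_obj_word, with each word replaced by its word ID taken from spb_word.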
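The steps above can be modelled without a database. The following is a minimal sketch, assuming plain in-memory collections in place of the three tables; the class name WordPipelineModel and its fields are made up for illustration and are not part of the original code. It also includes the two intermediate word_id updates that the Java code further down performs. In this model a null word_id can never reach the last step, which is the behaviour expected from the SQL version as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory stand-in for the three tables; not the real schema.
class WordPipelineModel {
    // Plays the role of spb_word: word -> generated id, unique on word.
    final Map<String, Long> wordTable = new LinkedHashMap<>();
    // Plays the role of spb_obj_word: rows of {doc_id, idx, word_id}.
    final List<long[]> objWord = new ArrayList<>();
    private long nextWordId = 1;

    void loadDocument(int docId, List<String> words) {
        // Steps 1-2: start from an empty buffer for this doc_id and fill it;
        // word_id is null for every buffered row at first.
        String[] bufWord = words.toArray(new String[0]);
        Long[] bufWordId = new Long[bufWord.length];

        // First UPDATE: pick up the ids of words that are already known.
        for (int i = 0; i < bufWord.length; i++) {
            bufWordId[i] = wordTable.get(bufWord[i]);
        }
        // Step 3: insert each distinct new word exactly once.
        for (int i = 0; i < bufWord.length; i++) {
            if (bufWordId[i] == null && !wordTable.containsKey(bufWord[i])) {
                wordTable.put(bufWord[i], nextWordId++);
            }
        }
        // Second UPDATE: resolve the rows that are still unresolved.
        for (int i = 0; i < bufWord.length; i++) {
            if (bufWordId[i] == null) {
                bufWordId[i] = wordTable.get(bufWord[i]);
            }
        }
        // Step 4: copy to spb_obj_word; a null here would correspond to
        // the NOT NULL violation reported by the database.
        for (int i = 0; i < bufWord.length; i++) {
            if (bufWordId[i] == null) {
                throw new IllegalStateException("null word_id for " + bufWord[i]);
            }
            objWord.add(new long[] {docId, i, bufWordId[i]});
        }
    }

    public static void main(String[] args) {
        WordPipelineModel m = new WordPipelineModel();
        m.loadDocument(0, Arrays.asList("gal", "zxo", "gal"));
        m.loadDocument(1, Arrays.asList("zxo", "ouc"));
        System.out.println(m.wordTable.size() + " words, " + m.objWord.size() + " rows");
    }
}
```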
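After this loop has been iterating for a while (for example 2,000 or 15,000 iterations - it is unpredictable), it fails with an SQL error: a null word_id cannot be inserted into spb_obj_word. It gets even stranger, because repeating the last iteration by hand does not produce the error. It looks as if PostgreSQL has some problem with inserting records and the speed of statement execution - as if it loses some data, or makes it visible to subsequent statements only after some delay.
The sequence of generated words is repeatable - every run produces the same sequence of words - yet the iteration at which the code fails is different each time.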
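The repeatability claim can be checked in isolation from the database. This is a minimal sketch, assuming a stand-alone copy of the getWord() arithmetic from the Java source in this question; the class name WordSequenceCheck and the helper firstWords are hypothetical. It builds the sequence twice with two independent generators and compares the results.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone copy of the getWord() logic, so that two independent
// generators can be compared without touching the database.
class WordSequenceCheck {
    // Same alphabet as in the question, with the Polish letters written as
    // Unicode escapes so the source-file encoding cannot change them.
    static final String LETTERS = "abcdefghijklmnopqrstuvwxyz"
            + "\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c";
    static final int LLEN = LETTERS.length();

    private int wordNum = 0;

    // Same arithmetic as getWord() in the question.
    String nextWord() {
        int rn = 3 * (++wordNum + LLEN * LLEN * LLEN);
        rn = (rn + LLEN) / (rn % LLEN + 1);
        rn = rn % (rn / 2 + 10);
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(LETTERS.charAt(rn % LLEN));
            rn /= LLEN;
        } while (rn != 0);
        return sb.toString();
    }

    // Generates the first `count` words with a fresh generator.
    static List<String> firstWords(int count) {
        WordSequenceCheck gen = new WordSequenceCheck();
        List<String> words = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            words.add(gen.nextWord());
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> a = firstWords(1000);
        List<String> b = firstWords(1000);
        System.out.println("identical sequences: " + a.equals(b));
    }
}
```

Since the generator depends only on its own counter, any run-to-run difference in the failing iteration has to come from somewhere other than the input data.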
Here is my SQL code that creates the tables:
create sequence spb_word_seq;
create table spb_word (
    id bigint not null primary key default nextval('spb_word_seq'),
    word varchar(410) not null unique
);
create sequence spb_obj_word_seq;
create table spb_obj_word (
    id int not null primary key default nextval('spb_obj_word_seq'),
    doc_id int not null,
    idx int not null,
    word_id bigint not null references spb_word (id),
    constraint spb_ak_obj_word unique (doc_id, word_id, idx)
);
create sequence spb_word4obj_seq;
create table spb_word4obj (
    id int not null primary key default nextval('spb_word4obj_seq'),
    doc_id int not null,
    idx int not null,
    word varchar(410) not null,
    word_id bigint null references spb_word (id),
    constraint spb_ak_word4obj unique (doc_id, word_id, idx),
    constraint spb_ak_word4obj2 unique (doc_id, word, idx)
);
And here is the Java code - it can be run as is (it has a static main method).
package WildWezyrIsAstonished;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StrangePostgresBehavior {

    private static final String letters = "abcdefghijklmnopqrstuvwxyząćęłńóśźż";
    private static final int llen = letters.length();
    private Connection conn;
    private Statement st;
    private int wordNum = 0;

    public void runMe() throws Exception {
        Class.forName("org.postgresql.Driver");
        conn = DriverManager.getConnection("jdbc:postgresql://localhost:5433/spb",
                "wwspb", "*****");
        conn.setAutoCommit(true);
        st = conn.createStatement();
        st.executeUpdate("truncate table spb_word4obj, spb_word, spb_obj_word");
        for (int j = 0; j < 50000; j++) {
            try {
                if (j % 100 == 0) {
                    System.out.println("j == " + j);
                }
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < 20; i++) {
                    sb.append("insert into spb_word4obj (word, idx, doc_id) values ('"
                            + getWord() + "'," + i + "," + j + ");\n");
                }
                st.executeUpdate("delete from spb_word4obj where doc_id = " + j);
                st.executeUpdate(sb.toString());
                st.executeUpdate("update spb_word4obj set word_id = w.id "
                        + "from spb_word w "
                        + "where w.word = spb_word4obj.word and doc_id = " + j);
                st.executeUpdate("insert into spb_word (word) "
                        + "select distinct word from spb_word4obj "
                        + "where word_id is null and doc_id = " + j);
                st.executeUpdate("update spb_word4obj set word_id = w.id "
                        + "from spb_word w "
                        + "where w.word = spb_word4obj.word and "
                        + "word_id is null and doc_id = " + j);
                st.executeUpdate("insert into spb_obj_word (word_id, idx, doc_id) "
                        + "select word_id, idx, doc_id from spb_word4obj "
                        + "where doc_id = " + j);
            } catch (Exception ex) {
                System.out.println("error for j == " + j);
                throw ex;
            }
        }
    }

    private String getWord() {
        int rn = 3 * (++wordNum + llen * llen * llen);
        rn = (rn + llen) / (rn % llen + 1);
        rn = rn % (rn / 2 + 10);
        StringBuilder sb = new StringBuilder();
        while (true) {
            char c = letters.charAt(rn % llen);
            sb.append(c);
            rn /= llen;
            if (rn == 0) {
                break;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        new StrangePostgresBehavior().runMe();
    }
}
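So, once again: am I doing something wrong (if anything), or is this a serious flaw in the PostgreSQL SQL engine (and if so, is there a workaround)?
I have tested this on a Windows Vista box with Java 1.6, PostgreSQL 8.3.3 and 8.4.2, and the JDBC PostgreSQL drivers postgresql-8.2-505.jdbc3 and postgresql-8.4-701.jdbc4. All combinations lead to the error described above. To make sure it isn't something specific to my machine, I also tested in a similar environment on another machine.
UPDATE: I turned on Postgres logging, as Depesz suggested. Here are the last SQL statements that were executed: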
2010-01-18 16:18:51 CETLOG: execute <unnamed>: delete from spb_word4obj where doc_id = 1453
2010-01-18 16:18:51 CETLOG: execute <unnamed>: insert into spb_word4obj (word, idx, doc_id) values ('ouc',0,1453)
2010-01-18 16:18:51 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('rbjb',1,1453)
2010-01-18 16:18:51 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('pvr',2,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('gal',3,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('cai',4,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('żjg',5,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('egf',6,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('śne',7,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('ęęd',8,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('lnd',9,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('cbd',10,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('dąc',11,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('łrc',12,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('zmł',13,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('zxo',14,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('oćj',15,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('zlh',16,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('lńf',17,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('cóe',18,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('uge',19,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>: update spb_word4obj set word_id = w.id from spb_word w where w.word = spb_word4obj.word and doc_id = 1453
2010-01-18 16:18:52 CETLOG: execute <unnamed>: insert into spb_word (word) select distinct word from spb_word4obj where word_id is null and doc_id = 1453
2010-01-18 16:18:52 CETLOG: execute <unnamed>: update spb_word4obj set word_id = w.id from spb_word w where w.word = spb_word4obj.word and word_id is null and doc_id = 1453
2010-01-18 16:18:52 CETLOG: execute <unnamed>: insert into spb_obj_word (word_id, idx, doc_id) select word_id, idx, doc_id from spb_word4obj where doc_id = 1453
2010-01-18 16:18:52 CETERROR: null value in column "word_id" violates not-null constraint
2010-01-18 16:18:52 CETSTATEMENT: insert into spb_obj_word (word_id, idx, doc_id) select word_id, idx, doc_id from spb_word4obj where doc_id = 1453
Now, the query that checks what is wrong in table spb_word4obj:
select *
from spb_word4obj w4o left join spb_word w on w4o.word = w.word
where w4o.word_id is null
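It shows that two words cause the problem: 'gal' and 'zxo'. But... they are found in the spb_word table - they were just inserted by the SQL statements from the log (included above).
So - it is not a JDBC driver problem, but rather Postgres itself?
UPDATE2: If I remove the Polish national characters (ąćęłńóśźż) from the generated words, there is no error - the code runs all 50,000 iterations. I have tested this several times. So, with this line: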
private static final String letters = "abcdefghijklmnopqrstuvwxyz";
there is no error and everything seems fine, but with this line (or with the original line in the full source above):
private static final String letters = "ąćęłńóśźżjklmnopqrstuvwxyz";
I get the error described above.
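One detail that may help narrow this down: the two words reported by the check query, 'gal' and 'zxo', do not themselves contain any Polish letters, even though the failure only appears when the alphabet includes them. A minimal sketch of that check follows; the class name AsciiWordCheck and the method isPlainAscii are made up for illustration.

```java
// Checks whether a word consists only of 7-bit ASCII characters.
class AsciiWordCheck {
    static boolean isPlainAscii(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (word.charAt(i) > 127) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The two words returned by the check query in this question.
        String[] reported = {"gal", "zxo"};
        for (String w : reported) {
            System.out.println(w + " plain ASCII: " + isPlainAscii(w));
        }
    }
}
```

This says nothing about the cause on its own; it only suggests that the rows left unresolved are not necessarily the ones that contain the national characters.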
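UPDATE3: I have just posted a similar question that does not use Java - the code is ported entirely to pure PL/pgSQL; see: Why this code fails in PostgreSQL and how to fix it (work-around)? Is it Postgres SQL engine flaw?. Now I know it has nothing to do with Java - it is a problem in Postgres alone.
Good to see you found the answer. Please accept your own answer in this case, to mark the question as answered. – 2010-01-22 21:21:33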