I'm parsing text documents and inserting them into PostgreSQL DB. My code is written in Java and I use JDBC for DB connectivity. I have encountered very strange error when adding data to DB - it seems that at unpredictable moment (different number of iteration of main loop) Postgres does not see rows just added to tables and fails to perform updates properly.
Maybe I am doing something wrong, so maybe there is a way to correct my code? Or is it serious error of PostgreSQL and I should post it at PostgreSQL home page (as bug report)?
Here are the details of what I'm doing and what is going wrong. I've simplified my code to isolate the error - simplified version doesn't parse any text but I simulate it with generated words. Source files are included (java and sql) at the end of my question.
In simplified example of my problem I have one-threaded code, one JDBC Connection, 3 tables and few SQL statements involved (full Java sources are less than 90 lines).
Main loop works for "documents" - 20 words with subsequent doc_id (integer).
- Buffer table
spb_word4obj
is cleared for doc_id to be just inserted. - Words are inserted into buffer table (
spb_word4obj
), - Then unique new words are inserted into table
spb_word
- And finally - document's words are inserted into
spb_obj_word
- with word bodies replaced by word-ids fromspb_word
(references).
While iterating this loop for some time (e.g. 2,000 or 15,000 iterations - it is unpredictable) it fails with SQL error - cannot insert null word_id into spb_word
. It gets more strange as repeating this very last iteration by hand gives no error. It seems like PostgreSQL have some issue with record insertion and statement execution speed - it loses some data or makes it is visible for subsequent statement after small delay.
Sequence of generated words is repeatable - every time code is run it generates the same sequence of words, but iteration number when code fails is different every time.
Here is my sql code to create tables:
create sequence spb_word_seq;
create table spb_word (
id bigint not null primary key default nextval('spb_word_seq'),
word varchar(410) not null unique
);
create sequence spb_obj_word_seq;
create table spb_obj_word (
id int not null primary key default nextval('spb_obj_word_seq'),
doc_id int not null,
idx int not null,
word_id bigint not null references spb_word (id),
constraint spb_ak_obj_word unique (doc_id, word_id, idx)
);
create sequence spb_word4obj_seq;
create table spb_word4obj (
id int not null primary key default nextval('spb_word4obj_seq'),
doc_id int not null,
idx int not null,
word varchar(410) not null,
word_id bigint null references spb_word (id),
constraint spb_ak_word4obj unique (doc_id, word_id, idx),
constraint spb_ak_word4obj2 unique (doc_id, word, idx)
);
And here goes Java code - it may just be executed (it has static main
method).
package WildWezyrIsAstonished;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
public class StrangePostgresBehavior {
private static final String letters = "abcdefghijklmnopqrstuvwxyząćęłńóśźż";
private static final int llen = letters.length();
private Connection conn;
private Statement st;
private int wordNum = 0;
public void runMe() throws Exception {
Class.forName("org.postgresql.Driver");
conn = DriverManager.getConnection("jdbc:postgresql://localhost:5433/spb",
"wwspb", "*****");
conn.setAutoCommit(true);
st = conn.createStatement();
st.executeUpdate("truncate table spb_word4obj, spb_word, spb_obj_word");
for (int j = 0; j < 50000; j++) {
try {
if (j % 100 == 0) {
System.out.println("j == " + j);
}
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 20; i++) {
sb.append("insert into spb_word4obj (word, idx, doc_id) values ('"
+ getWord() + "'," + i + "," + j + ");\n");
}
st.executeUpdate("delete from spb_word4obj where doc_id = " + j);
st.executeUpdate(sb.toString());
st.executeUpdate("update spb_word4obj set word_id = w.id "
+ "from spb_word w "
+ "where w.word = spb_word4obj.word and doc_id = " + j);
st.executeUpdate("insert into spb_word (word) "
+ "select distinct word from spb_word4obj "
+ "where word_id is null and doc_id = " + j);
st.executeUpdate("update spb_word4obj set word_id = w.id "
+ "from spb_word w "
+ "where w.word = spb_word4obj.word and "
+ "word_id is null and doc_id = " + j);
st.executeUpdate("insert into spb_obj_word (word_id, idx, doc_id) "
+ "select word_id, idx, doc_id from spb_word4obj "
+ "where doc_id = " + j);
} catch (Exception ex) {
System.out.println("error for j == " + j);
throw ex;
}
}
}
private String getWord() {
int rn = 3 * (++wordNum + llen * llen * llen);
rn = (rn + llen) / (rn % llen + 1);
rn = rn % (rn / 2 + 10);
StringBuilder sb = new StringBuilder();
while (true) {
char c = letters.charAt(rn % llen);
sb.append(c);
rn /= llen;
if (rn == 0) {
break;
}
}
return sb.toString();
}
public static void main(String[] args) throws Exception {
new StrangePostgresBehavior().runMe();
}
}
So again: is it me doing something wrong (what exactly?) or is it serious flaw in PosgreSQL SQL Engine (than - is there a way for work-around)?
I've tested above on Windows Vista box with: Java 1.6 / PostgreSQL 8.3.3 and 8.4.2 / JDBC PostgreSQL drivers postgresql-8.2-505.jdbc3 and postgresql-8.4-701.jdbc4. All combinations lead to error described above. To be sure that it is not something with my machine I've tested in in similar environment on other machine.
UPDATE: I've turned on Postgres logging - as suggested by Depesz. Here are latest sql statements that were executed:
2010-01-18 16:18:51 CETLOG: execute <unnamed>: delete from spb_word4obj where doc_id = 1453
2010-01-18 16:18:51 CETLOG: execute <unnamed>: insert into spb_word4obj (word, idx, doc_id) values ('ouc',0,1453)
2010-01-18 16:18:51 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('rbjb',1,1453)
2010-01-18 16:18:51 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('pvr',2,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('gal',3,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('cai',4,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('żjg',5,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('egf',6,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('śne',7,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('ęęd',8,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('lnd',9,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('cbd',10,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('dąc',11,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('łrc',12,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('zmł',13,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('zxo',14,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('oćj',15,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('zlh',16,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('lńf',17,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('cóe',18,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>:
insert into spb_word4obj (word, idx, doc_id) values ('uge',19,1453)
2010-01-18 16:18:52 CETLOG: execute <unnamed>: update spb_word4obj set word_id = w.id from spb_word w where w.word = spb_word4obj.word and doc_id = 1453
2010-01-18 16:18:52 CETLOG: execute <unnamed>: insert into spb_word (word) select distinct word from spb_word4obj where word_id is null and doc_id = 1453
2010-01-18 16:18:52 CETLOG: execute <unnamed>: update spb_word4obj set word_id = w.id from spb_word w where w.word = spb_word4obj.word and word_id is null and doc_id = 1453
2010-01-18 16:18:52 CETLOG: execute <unnamed>: insert into spb_obj_word (word_id, idx, doc_id) select word_id, idx, doc_id from spb_word4obj where doc_id = 1453
2010-01-18 16:18:52 CETERROR: null value in column "word_id" violates not-null constraint
2010-01-18 16:18:52 CETSTATEMENT: insert into spb_obj_word (word_id, idx, doc_id) select word_id, idx, doc_id from spb_word4obj where doc_id = 1453
Now - code to check what is wrong in table spb_word4obj
:
select *
from spb_word4obj w4o left join spb_word w on w4o.word = w.word
where w4o.word_id is null
and it shows that two words: 'gal', 'zxo'
caused the problem. But... they are found in spb_word
table - just freshly inserted with sql statements from log (included above).
So - it is not issue with JDBC driver, it is rather Postgres itself?
UPDATE2: If I eliminate polish national chars (ąćęłńóśźż
) from generated words, there is no error - code performs all 50,000 iterations. I've tested it few times. So, for this line:
private static final String letters = "abcdefghijklmnopqrstuvwxyz";
there is no error, everything seems to be fine, but with this line (or with original line in full source above):
private static final String letters = "ąćęłńóśźżjklmnopqrstuvwxyz";
I get error described above.
UPDATE3: I've just posted similar question without Java usage - fully ported to pure plpgsql, look here: Why this code fails in PostgreSQL and how to fix it (work-around)? Is it Postgres SQL engine flaw?. Now I know that it is not related to Java - it is Postgres alone problem.