I'm developing a small ad-hoc replication framework (for a retail company) that replicates only certain records of certain tables (about 200), depending on specific domain-level logic.
In order to know the replication status of each record for each target host, I have a repStatus character(NUMBER_OF_HOSTS) column, where each position always represents the same host.
The value in each position of this column can be 0 (no action), 1 (replicate record), 2 (record replicated), 3 (resend after confirmation), A (first-time error), B (second-time error)... etc.
So for instance: 012A means:
- Don't send anything to host 1
- Send this record to host 2
- Record correctly received by host 3
- Error received from host 4
This looks pretty easy and simple, and has a "straightforward reading": in order to know the status of a record, I simply read the repStatus column.
However, it looks like this approach leads to performance problems when the application has to find the target records to replicate.
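For example, finding everything that must go to one host means matching a single character position, which a plain index on the column can't help with (hypothetical table name tbl):

```sql
-- All records that must be sent (or resent) to host 2:
-- without an index on this expression, this scans the whole table.
SELECT *
FROM   tbl
WHERE  substr(repStatus, 2, 1) IN ('1', '3');
```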
So I'm sure there's a better design for this problem performance-wise. Maybe an additional table referencing table, record, and host could be a solution:
CREATE TABLE repStatus (tableID int, recordID int, targetHostID int, status int);
Where the status value could even be normalized into a separate table. However, 200 tables * ~500,000 records per table might be a considerably large number of rows to handle in a single table anyway.
Any alternative based on experience is welcome.
Well, the first thing I would do is to drop the icky string parsing everywhere and replace it with PostgreSQL native types. To store the replication status on each record similar to your current solution:
CREATE TYPE replication_status AS ENUM (
    'no_action',
    'replicate_record',
    'record_replicated',
    'error_1',
    'error_2',
    'error_3'
);
ALTER TABLE t ADD COLUMN rep_status_array replication_status[];
This costs you a little more storage space -- enum values are 4 bytes instead of 1 and arrays have some overhead. However, by teaching the database your concepts instead of hiding them, you can write things like:
-- find all records that need to be replicated to host 4
SELECT * FROM t WHERE rep_status_array[4] = 'replicate_record';
-- find all records that contain any error status
SELECT * FROM t WHERE rep_status_array &&
ARRAY['error_1', 'error_2', 'error_3']::replication_status[];
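Updating one host's slot is a plain array assignment (a sketch, assuming a hypothetical id primary key on t):

```sql
-- Mark host 2 as successfully replicated for a single record.
UPDATE t
SET    rep_status_array[2] = 'record_replicated'
WHERE  id = 42;
```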
You can put a GIN index right on rep_status_array if that helps your use case, but it's better to look at your queries and create indexes specifically for what you use:
CREATE INDEX t_replication_host_4_key ON t ((rep_status_array[4]));
CREATE INDEX t_replication_error_key ON t (id)
WHERE rep_status_array && ARRAY['error_1', 'error_2', 'error_3']::replication_status[];
That said, given 200 tables, I'd be tempted to split this out into a single replication status table -- either one row with an array of statuses or one row per host, depending on how the rest of the replication logic works. I'd still use that enumeration:
CREATE TABLE adhoc_replication (
    record_id bigint not null,
    table_oid oid not null,
    host_id integer not null,
    status replication_status not null default 'no_action',
    primary key (record_id, table_oid, host_id)
);
PostgreSQL internally assigns each table an OID (try SELECT *, tableoid FROM t LIMIT 1), which is a convenient numeric identifier that is stable within a single database system. However, it changes if the table is dropped and recreated (which can happen if you e.g. dump and restore the database), and for the same reason it's very likely different between development and production. If you'd rather have those situations work, in exchange for breaking when you add or rename a table, use an enumeration instead of an OID.
Using a single table for all replication would allow you to easily re-use triggers and queries and such, decoupling most replication logic from the data it's replicating. It also allows you to query based on status for a given host across all your origin tables by referencing a single index, which could be important.
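As a sketch of that re-use: one trigger function can serve all 200 tables by keying on TG_RELID. This assumes the adhoc_replication table above with a status column, and that each source table has an integer id primary key (both names are assumptions):

```sql
-- One shared trigger function: flag the touched record for
-- replication to every host it already has a status row for.
CREATE OR REPLACE FUNCTION mark_for_replication() RETURNS trigger AS $$
BEGIN
    UPDATE adhoc_replication
    SET    status    = 'replicate_record'
    WHERE  table_oid = TG_RELID      -- OID of the table that fired the trigger
    AND    record_id = NEW.id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Attach the same function to each replicated table:
CREATE TRIGGER t_mark_replication
    AFTER INSERT OR UPDATE ON t
    FOR EACH ROW EXECUTE PROCEDURE mark_for_replication();
```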
As for table size, PostgreSQL can definitely handle 10 million rows in the same table. If you went with a dedicated replication-related table, you could always partition per host. (Partitioning by table makes little sense to me; it seems worse than storing replication status on each upstream row.) Which way to partition or whether or not it's appropriate at all depends entirely on what kind of questions you intend to ask your database, and what sort of activity happens on the base tables. (Partitioning means maintaining many smaller blobs instead of a few large ones, and potentially accessing many smaller blobs to perform a single operation.) It's really a matter of choosing when you want your disk seeks to happen.
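If you're on PostgreSQL 10 or later, per-host partitioning can be declarative; a sketch (table and column names as above, one partition per host):

```sql
-- The parent table is partitioned by host; host_id is part of the
-- primary key, as the partition key must be.
CREATE TABLE adhoc_replication (
    record_id bigint not null,
    table_oid oid not null,
    host_id integer not null,
    status replication_status not null default 'no_action',
    primary key (record_id, table_oid, host_id)
) PARTITION BY LIST (host_id);

CREATE TABLE adhoc_replication_h1 PARTITION OF adhoc_replication
    FOR VALUES IN (1);
CREATE TABLE adhoc_replication_h2 PARTITION OF adhoc_replication
    FOR VALUES IN (2);
```

On older versions the same layout can be built with table inheritance and CHECK constraints.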
So your typical query would be to get all records to be replicated to host x ... having repStatus 1 or 3 for this particular target. (Making assumptions since that's not in the question.)
And the records to be replicated are the rare case because, normally, most records have been replicated already, correct? (More assumptions.)
Partial indexes on an expression may be a very fast solution for you.
If you just keep your design of adding a text string to each row, you could create a partial index for each target along these lines:
CREATE INDEX tbl_rep_part1_idx ON tbl (tbl_id, substr(repstatus,1,1))
WHERE substr(repstatus,1,1) = '1' OR
substr(repstatus,1,1) = '3';
CREATE INDEX tbl_rep_part2_idx ON tbl (tbl_id, substr(repstatus,2,1))
WHERE substr(repstatus,2,1) = '1' OR
      substr(repstatus,2,1) = '3';
...
The sum of all the partial indexes is only bigger than a single full index by the per-index overhead.
On writes to the table, only the affected partial indexes have to be updated.
That would make queries like these very fast:
SELECT * FROM tbl WHERE substr(repstatus,1,1) = '1';
SELECT * FROM tbl WHERE substr(repstatus,1,1) = '1' OR
substr(repstatus,1,1) = '3';
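A quick sanity check with EXPLAIN should confirm the partial index is picked up, since the query's predicate matches the index predicate (names as defined above):

```sql
EXPLAIN
SELECT * FROM tbl
WHERE  substr(repstatus,1,1) = '1' OR
       substr(repstatus,1,1) = '3';
-- The plan should show an index or bitmap scan on tbl_rep_part1_idx
-- rather than a sequential scan.
```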
The added tbl_id in the indexes is optional. I added it because an additional 4-byte integer column can use the space otherwise lost to padding (the index size does not grow). Only include it (or another small column) if you have a use for it.
What to expect compared to text array + GIN index?
The whole idea only applies if my assumptions hold. The reasons I propose this route over a text array + GIN index are threefold:
Much smaller column size. Compare:
SELECT pg_column_size('{A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z}'::text[])   -- 232 bytes
     , pg_column_size('{A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z}'::"char"[]) -- 50 bytes
     , pg_column_size('ABCDEFGHIJKLMNOPQRSTUVWXYZ'::text);                                -- 27 bytes
This matters with 200 x 500k rows. A lot.
Much smaller index per query and faster access.
While the sum of all partial indexes would be somewhat bigger than a single full GIN index if you wanted to cover the whole table, that is not the case when we only need to cover the rare statuses. Either way, the index needed for any one query will be much smaller. And I wouldn't expect a full-size index to stay cached, considering the size of the tables, which puts even more weight on this point.
Cheaper write operations. I expect updates on the small, simple B-tree partial indexes to be substantially faster, since GIN is known for trouble in this area. This would have to be verified, though.