This is a proposal for fast data loading using batch index updates, for 8.4.
It is a part of pg_bulkload (http://pgbulkload.projects.postgresql.org/), and
I'd like to integrate it so that it can cooperate with other parts of postgres.

The basic concept is to spool incoming data, and merge the spool and
the existing indexes into a new index at the end of the data load. This is
5-10 times faster than per-row index insertion, which is the method in 8.3.
One of the problems is locking: index building during a bulkload is closer to
REINDEX than to INSERT, so we need an ACCESS EXCLUSIVE lock while it runs.
Bulkloading is therefore not an upward-compatible method, so I'm thinking
about adding a new "WITH LOCK" option to the COPY command:
COPY tbl FROM 'datafile' WITH LOCK;
If the LOCK option is specified, the behavior of COPY changes as follows:
1. Lock the target table in ACCESS EXCLUSIVE mode instead of ROW EXCLUSIVE.
2. Prepare a spooler (BTSpool) for each index.
3. For each new row, put index entries into the spools (_bt_spool)
   instead of calling index_insert.
4. At the end of COPY, merge each spool and the existing index into a new
   index file. The relfilenode of the index is changed, as in REINDEX.
However, there might be better interfaces for bulk index creation.
For example, to use it with pgloader, we might need a "bulkload mode"
for indexes: pgloader commits every 10000 rows, so the index spooler
would have to stay alive until the end of the session, across
transactions (or until the end of the transaction, across subtransactions).
I'm working toward the simple "COPY WITH LOCK" approach for now,
but if there are better ideas, I'd like to use them.
Advice and suggestions are welcome.
NTT Open Source Software Center