Quick Links

Batch update of indexes on data loading

From:	ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Batch update of indexes on data loading
Date:	2008-02-21 04:26:20
Message-ID:	20080221130554.B2BA.52131E4D@oss.ntt.co.jp
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

This is a proposal of fast data loading using batch update of indexes for 8.4.
It is a part of pg_bulkload (http://pgbulkload.projects.postgresql.org/) and
I'd like to integrate it in order to cooperate with other parts of postgres.

The basic concept is spooling new coming data, and merge the spool and
the existing indexes into a new index at the end of data loading. It is
5-10 times faster than index insertion per-row, that is the way in 8.3.

One of the problem is locking; Index building in bulkload is similar to
REINDEX rather than INSERT, so we need ACCESS EXCLUSIVE LOCK during it.
Bulkloading is not a upper compatible method, so I'm thinking about
adding a new "WITH LOCK" option for COPY command.

COPY tbl FROM 'datafile' WITH LOCK;

If the LOCK option is specified, the behavior of COPY will be changed
as follows:

1. Lock the target table in ACCESS EXCLUSIVE mode instead of ROW EXCLUSIVE.
2. Prepare spooler (BTSpool) for each indexes.
3. For each new row, put index entries into the spools (_bt_spool)
instead of index_insert.
4. At the end of COPY, merge the spool and the existing indexes into a new
index file. The relfilenode of the index is changed like REINDEX.

However, there might be better interfaces for bulk index creation.
For example, if we want to use it with pgloader, we might need
"bulkload mode" for indexes. pgloader commits every 10000 rows,
so the index spooler must keep alive until end of the session
over transactions. (or end of the transaction over sub-transactions)

I'm working toward the simple "COPY WITH LOCK" approach for now,
but if there are other better ideas, I want to use them.
Advices and suggestions welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Responses

Re: Batch update of indexes on data loading at 2008-02-21 14:28:00 from Alvaro Herrera
Re: Batch update of indexes on data loading at 2008-02-24 08:14:24 from Simon Riggs
Re: Batch update of indexes on data loading at 2008-03-05 15:56:17 from Bruce Momjian

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	ITAGAKI Takahiro	2008-02-21 04:35:23	Re: ANALYZE to be ignored by VACUUM
Previous Message	Joshua D. Drake	2008-02-21 04:07:37	Re: Permanent settings