parallel pg_restore design issues

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: parallel pg_restore design issues
Date: 2008-10-06 00:11:48
Message-ID: 48E957C4.8060008@dunslane.net
Lists: pgsql-hackers


There are a couple of open questions for parallel pg_restore.

First, we need a way to decide the boundary between the serially run
"pre-data" section and the remainder of the items in the TOC. Currently
the code uses the first TABLEDATA item as the boundary. That's not
terribly robust (what if there aren't any?). Also, people have wanted to
steer clear of hardcoding much knowledge of archive member types into
pg_restore as a way of future-proofing it somewhat. I'm wondering if we
should have pg_dump explicitly mark items as pre-data, data, or post-data.
For legacy archives we could still check for either a TABLEDATA item or
something known to sort after those (i.e. a BLOB, BLOB COMMENT,
CONSTRAINT, INDEX, RULE, TRIGGER or FK CONSTRAINT item).
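To illustrate the legacy check, something like this (just a sketch; the
helper is made up, and the desc strings mirror the item types above
rather than necessarily matching the archiver's exact spellings):

#include <string.h>

/*
 * Sketch only: true if a legacy TOC entry's description marks it as
 * table data or something known to sort after the data section.
 */
static int
isDataOrLater(const char *desc)
{
    static const char *const descs[] = {
        "TABLE DATA", "BLOB", "BLOB COMMENT", "CONSTRAINT",
        "INDEX", "RULE", "TRIGGER", "FK CONSTRAINT", NULL
    };
    int         i;

    for (i = 0; descs[i] != NULL; i++)
    {
        if (strcmp(desc, descs[i]) == 0)
            return 1;
    }
    return 0;
}

Everything before the first item for which this returns true would then
be treated as the serial pre-data section.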

Another item we have already discussed is how to prevent concurrent
processes from trying to take conflicting locks. Here we really can't
rely on pg_dump to help us out, as lock requirements might change (a
little bird has already whispered in my ear about reducing the strength
of FK CONSTRAINT locks taken). I haven't got a really good answer here.
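One direction (purely a sketch, not a settled design, and the right lock
level to probe with is itself part of the open question) would be for
each worker to probe with a non-blocking lock request before running an
item, and requeue the item if the probe fails:

#include <stdio.h>
#include <libpq-fe.h>

/*
 * Hypothetical helper: probe for a lock conflict before running an item.
 * Assumes the worker has already issued BEGIN (LOCK TABLE only works
 * inside a transaction block) and that qualified_name is safely quoted.
 */
static int
tryLockTable(PGconn *conn, const char *qualified_name)
{
    char        sql[1024];
    PGresult   *res;
    int         ok;

    snprintf(sql, sizeof(sql),
             "LOCK TABLE %s IN ACCESS SHARE MODE NOWAIT", qualified_name);
    res = PQexec(conn, sql);
    ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
    return ok;                  /* false => defer the item, try another */
}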

Last, there is the question of what algorithm to use in choosing the next
item to run. Currently, I am using "next item in the queue whose
dependencies have been met", with no queue reordering.
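In sketch form (illustrative structures, not the actual archiver types),
that is roughly:

typedef struct TocItem
{
    struct TocItem *next;       /* next entry in queue order */
    int             n_deps_left;    /* unsatisfied dependencies remaining */
    int             running;    /* already handed to a worker? */
} TocItem;

static TocItem *
getNextReadyItem(TocItem *queue_head)
{
    TocItem    *te;

    for (te = queue_head; te != NULL; te = te->next)
    {
        if (!te->running && te->n_deps_left == 0)
            return te;          /* first ready item wins; no reordering */
    }
    return NULL;                /* nothing runnable; wait for a worker */
}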

Another possible algorithm would reorder the queue by elevating any item
whose dependencies have been met. This will mean all the indexes for a
table will tend to be grouped together, which might well be a good
thing, and will also help avoid doing all the data loading at once.
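Reusing the illustrative TocItem above, the reordering variant would move
an item to the head of the queue at the moment its dependencies are
satisfied:

static void
promoteToHead(TocItem **queue_head, TocItem *te)
{
    TocItem   **link;

    /* unlink te from its current position */
    for (link = queue_head; *link != NULL; link = &(*link)->next)
    {
        if (*link == te)
        {
            *link = te->next;
            break;
        }
    }

    /* relink at the head so it is picked up next */
    te->next = *queue_head;
    *queue_head = te;
}

When a table's data finishes loading, all of its now-runnable indexes get
promoted together, which is what produces the grouping effect.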

Both of these could be modified by explicitly limiting TABLEDATA items
to a certain proportion (say, one quarter) of the processing slots
available, if other items are available.
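The cap might look like this (hypothetical names again):

static int
canDispatchDataItem(int data_running, int total_slots, int other_ready)
{
    int         data_cap = total_slots / 4;     /* say, one quarter */

    if (data_cap < 1)
        data_cap = 1;           /* always allow at least one data loader */

    if (data_running < data_cap)
        return 1;
    return !other_ready;        /* at the cap: only run data if nothing
                                 * else is ready */
}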

I'm actually somewhat inclined to make provision for all of these
possibilities via a command line option, with the first being the
default. One size doesn't fit all, I suspect, and if it does we'll need
lots of data before deciding what that size is. The extra logic won't
really involve all that much code, and it will all be confined to a
couple of functions.

Thoughts?

cheers

andrew
