Re: pg_restore takes ages

From: "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Vivek Khera <khera(at)kcilink(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_restore takes ages
Date: 2003-10-03 22:10:44
Message-ID: Pine.LNX.4.33.0310031607310.28368-100000@css120.ihs.com
Lists: pgsql-general

On Fri, 3 Oct 2003, Tom Lane wrote:

> "scott.marlowe" <scott(dot)marlowe(at)ihs(dot)com> writes:
> > Yes, you are. Basically, with fsync on, things have to happen in order.
> > I.e.
> > write to WAL what you're gonna do. WAIT for confirmation on write
> > write the tuples out. wait for confirmation
> > checkpoint the WAL. wait for confirmation
>
> Not really. With fsync on, we *only* sync the WAL writes. Data writes
> can happen whenever, so long as we know the corresponding WAL writes
> went down first. We only wait for data writes to complete before
> considering that a checkpoint is complete --- which is something that is
> not in the main line of execution and doesn't block other activity.
>
> This is one good reason for keeping WAL on a separate drive from the
> data files --- you are then freeing the system to schedule data I/O as
> optimally as it can.

Oh, OK. So that's why the WAL is such a choke point.
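
To make the ordering Tom describes concrete, here is a minimal sketch in Python. It is a toy model, not PostgreSQL's actual code: the class name ToyWal, the file names, and the counters are all made up for illustration. The only point is that the WAL write is forced to disk at commit time, while the data write is left to the OS and is only waited on at checkpoint.

```python
import os
import tempfile


class ToyWal:
    """Toy model (hypothetical): WAL is fsynced per commit, data only at checkpoint."""

    def __init__(self, directory):
        self.wal = open(os.path.join(directory, "wal.log"), "ab")
        self.data = open(os.path.join(directory, "data.db"), "ab")
        self.wal_syncs = 0
        self.data_syncs = 0

    def commit(self, record: bytes):
        # Log the change and wait until the WAL write is on disk.
        self.wal.write(record + b"\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        self.wal_syncs += 1
        # The data write is handed off without waiting; it may reach
        # disk whenever the OS gets around to it.
        self.data.write(record + b"\n")

    def checkpoint(self):
        # Only a checkpoint waits for the data writes to complete.
        self.data.flush()
        os.fsync(self.data.fileno())
        self.data_syncs += 1

    def close(self):
        self.wal.close()
        self.data.close()


def demo():
    """Commit three records, checkpoint once, and return the sync counts."""
    with tempfile.TemporaryDirectory() as d:
        db = ToyWal(d)
        for i in range(3):
            db.commit(f"row {i}".encode())
        db.checkpoint()
        counts = (db.wal_syncs, db.data_syncs)
        db.close()
        return counts
```

In this sketch, three commits followed by one checkpoint cost three WAL syncs and a single data sync, which is why putting the WAL on its own drive leaves the data drive free to schedule its I/O however it likes.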

> > Note that if you're running on IDE drives, you already ARE probably
> > running with fsync off if write caching is enabled, so you'll need to turn
> > it off (hdparm -W0 /dev/hdx in linux) to ensure fsync actually works.
>
> It'd be interesting to think about whether a write-caching IDE drive
> could safely be used for data storage, if WAL is elsewhere.

Well, it just so happens that I have a machine with two drives in it. I'll
get back to you on that.

> Right offhand I think the only problem is how to know when it's safe
> to consider a checkpoint complete. Maybe all that would be needed is
> a long enough time delay after issuing sync(2) in the checkpoint code.
> Do these drives guarantee "data will be written within 30 seconds" or
> something like that? Or can the delay be indefinite when load is heavy?

I don't know. My guess is that they don't wait too long; they just wait
until they have enough to write to make it "worth their while," i.e. they
don't write 8k at a time, they write 64k or 512k at a time. Now that many
have 8 MB of cache, it would make sense for them to group writes to one
area together.
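
As an illustration of that guess only (real drive firmware is not documented here and may behave quite differently), grouping cached writes to one area could look something like the sketch below. The function name coalesce_writes and the (offset, length) representation are assumptions made up for this example.

```python
def coalesce_writes(pending):
    """Merge pending (offset, length) writes that touch or overlap.

    Hypothetical model of a write cache grouping small writes to the
    same area into fewer, larger writes.
    """
    runs = []
    for offset, length in sorted(pending):
        if runs and offset <= runs[-1][0] + runs[-1][1]:
            # This write starts inside or right at the end of the
            # previous run, so extend that run instead of starting a new one.
            start, run_len = runs[-1]
            runs[-1] = (start, max(run_len, offset + length - start))
        else:
            runs.append((offset, length))
    return runs
```

For example, two adjacent 8k writes at offsets 0 and 8192 plus a third 8k write at offset 65536 would go out as one 16k write and one 8k write under this model.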

I'll test this out. Do you think "pgbench -c 50 -t 100000000" is enough
thrash, or do I need more connections at once? The machine I'm on has 2
GB of RAM, so it can probably open several hundred connections, but
bringing the database back up by replaying the WAL after hundreds of
concurrent transactions is going to take a while.
