Re: Hint bits vs. OS readahead

From: Decibel! <decibel(at)decibel(dot)org>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hint bits vs. OS readahead
Date: 2008-12-08 15:34:34
Message-ID: 810239FB-A1F6-48D2-99DD-9EDBBD8F6BFC@decibel.org
Lists: pgsql-hackers

On Dec 5, 2008, at 7:50 PM, Andrew Gierth wrote:
> While waiting for a large restore to complete (and investigating why
> parts of it were so slow), I came across this scenario. This isn't
> quite the same as some previous discussion of hint bits, but I thought
> it was something that could probably be taken into account in future.
> This may also be relevant to the tuplestore discussion.
>
> The environment: pg 8.3.5 on FreeBSD/amd64 7.1-prerelease; 32GB RAM;
> 16 cores of 2.93GHz Xeon 7350; 4x300GB 15krpm SAS data drives in
> software RAID10. shared_buffers=1700MB, maintenance_work_mem=1GB
>
> The scenario: pg_restore of a dump containing a large partitioned
> table (a dozen partitions of ~10GB each). The actual loading of the
> data proceeds as expected; the interesting part is the creation of
> indexes afterwards.
>
> Watching the progress of the backend, a large proportion of the time
> is taken up by the heap scan to retrieve the data. The problem is, of
> course, that the backend settles down into an access pattern like
> this:
>
> lseek(0x64,0x3ef7c000,SEEK_SET)
> read(0x64,0x864123340,0x2000) => 8192/0x2000
> lseek(0x64,0x3ef3e000,SEEK_SET)
> write(0x64,0x864125340,0x2000) => 8192/0x2000
>
> where fd 0x64 is the table heap file; the read is obvious, and the
> write is caused by writing a previously hinted page back to disk when
> the backend wants to reuse the buffer. Notice that this write is
> happening in the same backend (and on the same fd).
>
> At least on unpatched FreeBSD this access pattern destroys OS-level
> readahead, though lower-level readahead on the actual disk drives
> themselves hides this fact to a significant extent (each read() call
> forces a SCSI transaction, but this transaction completes quite
> quickly due to read caching on the drive).
>
> In order to test how bad the effect was, I patched FreeBSD to use
> separate sequential-behaviour tracking for reads and writes (this
> patch turns out to be trivial, affecting only a couple of dozen
> lines). The effect was fairly dramatic; the total time taken for
> CREATE INDEX was cut by a factor of slightly better than 2 (typically
> from ~700 seconds per partition to ~320 seconds on my data).
>
> [for the patch see http://www.rhodiumtoad.org.uk/junk/seq.patch.txt ]
>
> The obvious question is whether this is something which should be
> left as the OS's problem, or whether it would be worth having pg do
> some special handling of file opens to distinguish read from write
> accesses, or sequential from random accesses, when both are likely to
> be happening at the same time. I've so far had conflicting answers
> about how well Linux handles this case (and, not being a Linux user,
> I have no easy way to test it myself).

We don't do restores very often, but we have noticed that recovery
mode is painfully slow for us, whether after a crash or when bringing
up a PITR snapshot. We're running on 16-core IA64 machines with 96GB
of RAM hitting iSCSI SANs (some SATA, some SAS). Under ideal
conditions, PG can read or write at 100+MB/s. Typically we push
~5MB/s, but during recovery we'll only do 600-700kB/s. I've never
straced a backend to see exactly what's going on.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828
