Re: Bug: Buffer cache is not scan resistant

From: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>
Cc: "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 01:42:30
Message-ID: C3E62232E3BCF24CBA20D72BFDCB6BF802AF2821@MI8NYCMAIL08.Mi8.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tom,

Good info - it's the same in Solaris, the routine is uiomove (Sherry wrote it).

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
Sent: Monday, March 05, 2007 07:43 PM Eastern Standard Time
To: Mark Kirkwood
Cc: Pavan Deolasee; Gavin Sherry; Luke Lonergan; PGSQL Hackers; Doug Rady; Sherry Moore
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant

Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
> Tom Lane wrote:
>> But what I wanted to see was the curve of
>> elapsed time vs shared_buffers?
> ...
> Looks *very* similar.

Yup, thanks for checking.

I've been poking into this myself. I find that I can reproduce the
behavior to some extent even with a slow disk drive (this machine is a
dual 2.8GHz Xeon EM64T running Fedora Core 5; the dd-to-dev-null test
shows the disk read speed as 43MB/sec or so). Test case is a
several-gig table, no indexes, fully vacuumed so that neither VACUUM nor
COUNT(*) have to do anything but seqscan as fast as they can. Given a
*freshly started* postmaster, I see

regression=# show shared_buffers;
shared_buffers
----------------
128MB
(1 row)

regression=# \timing
Timing is on.
regression=# vacuum lineitem;
VACUUM
Time: 63652.333 ms
regression=# vacuum lineitem;
VACUUM
Time: 63562.303 ms
regression=# select count(*) from lineitem;
count
----------
10240000
(1 row)

Time: 63142.174 ms
regression=# vacuum lineitem;
VACUUM
Time: 61638.421 ms
regression=# vacuum lineitem;
VACUUM
Time: 61785.905 ms

I didn't show it here, but you can repeat the VACUUM all you want before
the SELECT, and its times are stable; and you can repeat all you want
after the SELECT, and the times are stable but a couple seconds lower.
Restart the postmaster and it goes back to the slower behavior. (I'm
keeping autovac off so it doesn't change the results.)

I decided to get down and dirty with oprofile, and soon found that the
user-space CPU consumption is indistinguishable in both states:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
samples| %|
------------------
141065 73.8193 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
26368 13.7984 /home/tgl/testversion/bin/postgres
12765 6.6799 /libata
2238 1.1711 /lib64/libc-2.4.so
1112 0.5819 /dm_mod

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
samples| %|
------------------
113177 70.2169 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
26284 16.3070 /home/tgl/testversion/bin/postgres
12004 7.4475 /libata
2093 1.2985 /lib64/libc-2.4.so
996 0.6179 /dm_mod

Inside the kernel, there's only one routine that's significantly different:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples % symbol name
57779 40.9591 copy_user_generic
18175 12.8841 __delay
3994 2.8313 _raw_spin_lock
2388 1.6928 put_page
2184 1.5482 mwait_idle
2083 1.4766 _raw_write_unlock
1909 1.3533 _raw_write_lock

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples % symbol name
37437 33.0783 copy_user_generic
17891 15.8080 __delay
3372 2.9794 _raw_spin_lock
2218 1.9598 mwait_idle
2067 1.8263 _raw_write_unlock
1837 1.6231 _raw_write_lock
1531 1.3527 put_page

So that's part of the mystery: apparently copy_user_generic is coded in
such a way that it's faster to copy into memory that's already in
processor cache. This strikes me as something that probably
could/should be fixed in the kernel; I don't see any good reason why
overwriting a whole cache line oughtn't be the same speed either way.

The other thing that was bothering me is why does the SELECT change
VACUUM's behavior? A debugging printout added to ReadBuffer gave the
answer: after postmaster start, we see things like

read block 353094 into buffer 11386
read block 353095 into buffer 11387
read block 353096 into buffer 11388
read block 353097 into buffer 11389
read block 353098 into buffer 11390
read block 353099 into buffer 11391
read block 353100 into buffer 11392
read block 353101 into buffer 11393
read block 353102 into buffer 11394
read block 353103 into buffer 11395

and after the SELECT it behaves like

read block 336761 into buffer 9403
read block 336762 into buffer 9402
read block 336763 into buffer 9403
read block 336764 into buffer 9402
read block 336765 into buffer 9403
read block 336766 into buffer 9402
read block 336767 into buffer 9403
read block 336768 into buffer 9402
read block 336769 into buffer 9403
read block 336770 into buffer 9402

What's going on is that VACUUM puts each buffer it's finished with on
the tail of the freelist. In the post-SELECT state, there are just two
buffers cycling through the freelist (not sure why not only one, but it
doesn't matter) and so the cache footprint is limited. But immediately
after postmaster start, (nearly) all the buffers are in the freelist and
so they all cycle through VACUUM's usage. In any real-world situation,
of course, the freelist is going to be nearly empty most of the time and
so I don't think this part is worth changing.

So I now concede Luke's argument that this behavior is related to L2
cache usage. But the next question is whether we ought to change
regular seqscan's behavior to mimic VACUUM. I'm very worried about
pessimizing other cases if we do. ISTM there's a fairly clear case that
this might be fixable at the kernel level. Moreover, the issue only
arises because of the way that the copy-from-kernel-buffer-to-userspace
routine behaves, which means that if we go to a regime where we stop
relying on OS caching and ask for direct DMA into our buffers, the
advantage would disappear anyway. Lastly, I think the case where a win
is possible is fairly narrow --- as soon as you've got anything but the
one seqscan going on, it's not going to help.

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2007-03-06 02:08:54 Re: [HACKERS] [COMMITTERS] pgsql: Add GUC temp_tablespaces to provide a default location for
Previous Message Gregory Stark 2007-03-06 01:31:46 Re: Bug: Buffer cache is not scan resistant