Re: Unexpected page allocation behavior on insert-only tables

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unexpected page allocation behavior on insert-only tables
Date: 2010-05-31 02:42:25
Message-ID: 24230.1275273745@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> Excerpts from Michael Renner's message of Sat May 15 20:24:36 -0400 2010:
>>> I've written a simple tool to generate traffic on a database [1], which
>>> did about 30 TX/inserts per second to a table. Upon inspecting the data
>>> in the table, I noticed the expected grouping of tuples which came from
>>> a single backend to matching pages [2]. The strange part was that the
>>> pages weren't completely filled but the backends seemed to jump
>>> arbitrarily from one page to the next [3]. For the table in question
>>> this resulted in about 10% wasted space.

> I think this may be related to the smgr_targblock stuff; if the relcache
> entry gets invalidated at the wrong time for whatever reason, the
> "current page" could be abandoned in favor of extending the rel. This
> has changed since 8.4, but a quick perusal suggests that it should be
> less likely on 9.0 than 8.4 but maybe there's something weird going on.

I found time to try this example finally. The behavior that I see in
HEAD is even worse than Michael describes: there is room for 136 rows
per block in the bid table, but most blocks have only a few rows. The
distribution after letting the exerciser run for 500 bids or so is
typically like this:

#rows  block#
  136       0
    6       1
    5       2
    4       3
    3       4
    5       5
    3       6
    1       7
    4       8
    4       9
  136      10
    6      11
    7      12
    9      13
    9      14
    7      15
    9      16
    7      17
    8      18
    5      19
  136      20
    2      21
    4      22
    4      23
    3      24
    5      25
    3      26
    4      27
    3      28
    2      29
    1      30
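
(A distribution like this can be regenerated with a ctid hack along these
lines, run against the exerciser's bid table; the output column names are
arbitrary:

select (ctid::text::point)[0]::bigint as block, count(*) as nrows
from bid
group by 1
order by 1;
)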

Examining the insertion timestamps and bidder numbers (client process
IDs), and correlating this with logged autovacuum activity, makes it
pretty clear what is going on. See the logic in
RelationGetBufferForTuple, and note that at no time do we have any FSM
data for the bid table:

1. Initially, all backends will decide to insert into block 0. They do
so until the block is full.

2. At that point, each active backend individually decides it needs to
extend the relation. They each create a new block and start inserting
into that one, each carefully not telling anyone else about the block
so as to avoid block-level insertion contention. In the above diagram,
blocks 1-9 are each created by a different backend, and the rows inserted
into each of them come (mostly?) from just that one backend. Block 10's
first few rows
also come from the one backend that created it, but it doesn't manage to
fill the block entirely before ...

3. After a while, autovacuum notices all the insert activity and kicks
off an autoanalyze on the bid table. When that commits, it forces a
flush of every other backend's relcache entry for "bid".
In particular, the smgr targblock gets reset.

4. Now, all the backends again decide to try to insert into the last
available block. So everybody jams into the partly-filled block 10,
until it gets filled.

5. Lather, rinse, repeat. Since there are exactly 10 active clients
(by default) in this test program, the repeat distance is exactly 10
blocks.
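
Both of the claims above, that the FSM never has anything for bid and
that the target-block resets line up with autoanalyze, are easy enough
to eyeball from psql while the exerciser runs; for instance something
like this (the second query needs contrib/pg_freespacemap installed):

select relname, n_tup_ins, last_autovacuum, last_autoanalyze
from pg_stat_user_tables
where relname = 'bid';

select * from pg_freespace('bid');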

The obvious thing to do about this would be to not reset targblock
on receipt of a relcache flush event, but we can *not* do that in the
general case. The reason that that gets reset is so that it's not
left pointing to a no-longer-existent block after a VACUUM truncation.
Maybe we could develop a way to distinguish truncation events from
others, but right now the sinval signaling mechanism can't do that.
It does look like there might be sufficient grounds to do something
here, though.
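
As a quick cross-check of the diagnosis (not a fix), disabling
autovacuum for just this table, e.g.

alter table bid set (autovacuum_enabled = false);

should make the effect go away for the duration of the run: with no
relcache flushes arriving, each backend keeps filling its own target
block.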

Attached exhibits: contents of relevant columns of the bid table
and postmaster log entries for autovacuum actions during the run.

regards, tom lane

Attachment Content-Type Size
unknown_filename text/plain 27.0 KB
unknown_filename text/plain 1.9 KB
