Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Sokolov Yura <y(dot)sokolov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise
Date: 2018-07-11 23:08:41
Message-ID: CAH2-Wzn1tTQXODisR0x=8ptECXDvEQ_Fo9mnehe6xurSMXMVWg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jul 24, 2017 at 11:12 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> You can get about a 1/3 loss of space by inserting randomly, rather
> than inserting in sorted order, which is what REINDEX will more or
> less do for you. That's because random workloads almost entirely get
> 50:50 page splits, whereas sorted input will always split the
> rightmost page, and so will always get 90:10 splits. The space in the
> random case isn't exactly wasted; it's there for the taking, for key
> values that happen to fit on the page.

I looked into this again. I decided to see for myself what was up.
Actually, I used oltpbench [1], since that was the easiest TPC-C
benchmark available.

The order_line primary key may have been designed to be as unfriendly
as possible to implementations that don't have good suffix truncation.
This is also true of several of other TPC-C indexes.

I initialized a scale 50 TPC-C database:

pg(at)bat:~/code/oltpbench$ ./oltpbenchmark -b tpcc -c
config/my_postgres_tpcc_config.xml --create=true --load=true
15:26:57,214 (DBWorkload.java:259) INFO -
======================================================================

Benchmark: TPCC {com.oltpbenchmark.benchmarks.tpcc.TPCCBenchmark}
Configuration: config/sample_tpcc_config.xml
Type: POSTGRES
Driver: org.postgresql.Driver
URL: jdbc:postgresql:tpcc
Isolation: TRANSACTION_SERIALIZABLE
Scale Factor: 50.0

Note that there is no garbage for VACUUM to clean up just yet. There
hasn't been any deletes or updates yet. And yet, order_line_pkey is
bloated. It's 783 MB, but if I run REINDEX it shrinks right down to
451 MB. This is on the master branch -- not with one of my patches.
You can also arrange to make the index much smaller if you force the
insertions to occur in a totally random order, rather than the order
that the benchmark actually inserts them.

The problem for us is that tuples are inserted in
clustered/monotonically increasing order for the last 3 columns, while
the first column (ol_w_id) is a low cardinality column whose values
appear in random order, more or less. This isn't actually unrealistic
- it makes sense that associated records would be inserted as
per-warehouse groups like this, while the order of the groups remains
unpredictable.

I am working on adding suffix truncation at the moment. Right now, my
patch doesn't help, because it isn't sophisticated enough about the
choice of split point (we need to care about suffix truncation, and
not just balancing space among the two halves of a split). It should
try to find a split point that maximizes the chances of a new pivot
tuple not having any low cardinality final column (ol_number).

We should be able to mostly not have any of the last column in tuples
that make it into the branch/internal nodes:

pg(at)tpcc[1452]=# select count(*) from order_line ;
count
────────────
15,001,784
(1 row)

pg(at)tpcc[1452]=# select count(distinct(ol_w_id, ol_d_id, ol_o_id)) from
order_line ;
count
───────────
1,500,000
(1 row)

As I said, the final ol_number column makes the keyspace unbalanced by
unnecessarily appearing in the internal/branch nodes:

pg(at)tpcc[1452]=# select count(distinct(ol_number)) from order_line ;
count
───────
15
(1 row)

Since there can only be 15 distinct values for any given (ol_w_id,
ol_d_id, ol_o_id, *), and since over 100 index tuples will fit on each
leaf page, we just have to pick a split point that's between (x, y, z,
15) and (x, y, z + 1, 1). This makes it legal to truncate away the
ol_number column, which allows us to balance the use of free space for
items that are inserted after the split, by later transactions.

[1] https://github.com/oltpbenchmark/oltpbench
--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2018-07-11 23:35:33 Re: Tips on committing
Previous Message Mai Peng 2018-07-11 23:02:30 Re: ERROR: cannot start subtransactions during a parallel operation