Re: [WIP] Zipfian distribution in pgbench

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alik Khilazhev <a(dot)khilazhev(at)postgrespro(dot)ru>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Zipfian distribution in pgbench
Date: 2017-07-15 00:06:16
Message-ID: CAH2-Wzmf6intNY1ggiNzOziiO5Eq=DsXfeptODGxO=2j-i1NGQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 14, 2017 at 6:37 AM, Alik Khilazhev
<a(dot)khilazhev(at)postgrespro(dot)ru> wrote:
> I am attaching results of tests for 32 and 128 clients that ran for 10 minutes; throughput remained 305 and 115 ktps respectively.
>
> Tests were executed with the configuration set for YCSB. Autovacuum is configured very aggressively, which may be why no decline appears as the run time increases.
> Autovacuum config:
>
> autovacuum_max_workers = 8
> autovacuum_naptime = 10s
> autovacuum_vacuum_scale_factor = 0.1
> autovacuum_vacuum_cost_delay = 0ms
> autovacuum_vacuum_cost_limit = 10000

I think that what this probably comes down to, more than anything
else, is that you have leftmost hot/bloated leaf pages like this:

          idx          | level | l_item | blkno | btpo_prev | btpo_next | btpo_flags | type | live_items | dead_items | avg_item_size | page_size | free_size |         highkey
-----------------------+-------+--------+-------+-----------+-----------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------
 ...
 pgbench_accounts_pkey |     0 |      1 |     1 |         0 |      2751 |         65 | l    |        100 |         41 |            16 |      8192 |      5328 | 11 00 00 00 00 00 00 00
 pgbench_accounts_pkey |     0 |      2 |  2751 |         1 |      2746 |         65 | l    |         48 |         90 |            16 |      8192 |      5388 | 32 00 00 00 00 00 00 00
 ...
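
For reference, per-page output along these lines can be gathered with
the pageinspect extension. The following is only a rough sketch (not
the exact query behind the output above); it walks a few block numbers
that exist in the index and pulls the high key from item 1 of each leaf
page:

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Per-leaf-page stats plus the high key for the first few blocks of
-- the index (block 0 is the metapage, so start at 1; the block numbers
-- queried must exist in the index).
SELECT b.blkno,
       s.live_items,
       s.dead_items,
       s.free_size,
       hk.data AS highkey
FROM generate_series(1, 10) AS b(blkno),
     LATERAL bt_page_stats('pgbench_accounts_pkey', b.blkno) AS s,
     LATERAL bt_page_items('pgbench_accounts_pkey', b.blkno) AS hk
WHERE s.type = 'l'            -- leaf pages only
  AND hk.itemoffset = 1;      -- item 1 holds the high key on non-rightmost pages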

The high key for the leftmost leaf page shows that only values below
0x11 belong on that page. That is about 16 or 17 possible distinct values,
and yet the page has 100 live items, and 41 dead items; in total,
there is room for 367 items. It's only slightly better with other
nearby pages. This is especially bad because once the keyspace gets
split up this finely, it's *impossible* to reverse it -- it's more or
less a permanent problem, at least until a REINDEX. You cannot merge
pages back together until one is completely empty, which in this case
(and in many others) will in fact never happen. Aggressive vacuuming is
probably helpful in part because it prevents the problem from ever
getting completely out of hand. That doesn't seem like a very
practical solution, though.
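
As a purely illustrative check (not something from the tests above),
pgstatindex() from the pgstattuple extension can quantify how sparse
the leaf level has become, and a REINDEX is what resets it:

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- avg_leaf_density is the percentage of leaf space actually in use;
-- expect it to stay low once the keyspace has been split up this finely.
SELECT index_size, leaf_pages, avg_leaf_density, leaf_fragmentation
FROM pgstatindex('pgbench_accounts_pkey');

-- The only way to undo the fine-grained splits is to rebuild the index:
REINDEX INDEX pgbench_accounts_pkey;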

We should probably be treating unique indexes in a special way, since
inserting a new conflicting tuple necessarily supersedes whatever it
conflicted with. Postgres holds on to the old tuple today, but only
because the transaction might abort, or because an old snapshot might
be interested in old tuple versions. However, the general pattern with
unique indexes is that there must be one tuple visible to new
snapshots, and old versions are almost certainly going to become
garbage very quickly. Unique indexes really are quite special, and
nbtree doesn't take advantage of that at all. If you read the coverage of
B-Trees within "Transaction Processing: Concepts and Techniques", and
many other publications, the general trend seems to be that unique
indexes have truly unique keys, based only on the user-visible key
values. They make a sharp distinction between primary and secondary
indexes, which doesn't really exist in Postgres (at least, not at the
access method level).
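
As a hypothetical illustration of that point (it assumes non-HOT
updates, a little-endian machine, and that aid = 1 still lives on leaf
page 1), repeated updates of a single key leave multiple index tuples
for the same value behind until they are cleaned up:

-- Each non-HOT update inserts another index tuple for aid = 1.
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = 1;
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = 1;

-- Multiple rows with the same key bytes are old versions of aid = 1,
-- almost all of which are already garbage or about to become garbage.
SELECT itemoffset, ctid, data
FROM bt_page_items('pgbench_accounts_pkey', 1)
WHERE data LIKE '01 00 00 00%';   -- key bytes for aid = 1 (little-endian)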

I suspect that the best long term solution is to add GIN-style
duplicate handling within nbtree, with special pruning-style
optimizations for the posting list structure. This would probably only
happen with unique indexes. The useful thing about this approach is
that it separates these two problems:

1. Representing what values are in the index for lookup, and their
latest row version.

2. Finding old row versions, in the less common case where you have an
old snapshot.

With a design like that, nbtree would never "cut up the keyspace too
finely" because of a temporary surge of UPDATE insertions. You still
get bloat, but you add it to a place where garbage collection can be
much better targeted. Under this scheme, it doesn't really matter so
much if garbage collection doesn't happen very frequently, because
there could be LP_DEAD style hints for the auxiliary posting list
structure. That could probably work based on atomic ops, and would
greatly reduce the number of pages that UPDATE workloads like this
one dirty.

It probably wouldn't make sense to add things like GIN's compression
of item pointers, since most data within the auxiliary posting list
structure is removed very quickly. I wouldn't expect btree_gin to be
faster for this workload today, because it doesn't have
kill_prior_tuple/LP_DEAD support, and because it doesn't support
unique indexes, so it cannot exploit the special situation described
above.
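
To make that concrete, a btree_gin index on aid can be built, but a
unique one cannot, so it could not replace the primary key here (this
is a hypothetical comparison, not something that was benchmarked):

CREATE EXTENSION IF NOT EXISTS btree_gin;

-- A GIN index on aid is possible...
CREATE INDEX pgbench_accounts_aid_gin ON pgbench_accounts USING gin (aid);

-- ...but GIN has no unique-index support (amcanunique is false), so
-- this would fail:
-- CREATE UNIQUE INDEX pgbench_accounts_aid_gin_u
--     ON pgbench_accounts USING gin (aid);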

--
Peter Geoghegan
