Re: Reducing the size of BufferTag & remodeling forks

From: Andres Freund <andres(at)anarazel(dot)de>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing the size of BufferTag & remodeling forks
Date: 2015-09-12 20:19:41
Message-ID: 20150912201941.GA8311@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2015-09-12 13:12:26 +0100, Simon Riggs wrote:
> Why do we have to do buffer lookups using the full buffer tag?

We don't necessarily.

> Why not just use (relNode, blockNum) and resolve hash collisions, if any?

I tried that and unfortunately it came out as a negative - the number of
collision gets higher and collisions in a hashtable will often imply a
cache miss. You can even see the comparison function rising in the
profile.

I've actually changed course a bit and I'm trying something different: A
two level structure. One hashtable that maps (RelFileNode, ForkNumber)
to a 'open relation' data structure, and from there a radix tree over
just the block number. To avoid having to look up in the hashtable
frequently there's a pointer in RelationData to a fork's radix tree.

This seems to have several advantages:
* It doesn't require changes to the physical representation
* A radix tree allows ordered traversal, allowing for efficient
prefetching et al.
* The ability to efficiently get all pages that belong to a relation
makes dropping/truncating tables radically more efficient than now.
* A single lookup in the radix tree is on average significantly faster
than in the hash table. I've measured ~350 vs 60 cycles in a
nonconcurrent workload and ~820 vs 72~ cycles in a concurrent
workload, using a recorded buffer access traces.
* Having a 'open relation' representation in shared memory also makes it
much easier to implement more efficient relation extension and
truncation - we can have pointers in there that signal how far the
relation has been extended and how far it has been initialized.

The biggest disadvantage is that it's a good bit more complex than what
we have right now. That we need dynamically many radix trees makes
memory management nontrivial.

Sounds halfway sane?

Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2015-09-12 21:18:32 Re: typo in create policy doc
Previous Message Pavel Stehule 2015-09-12 18:27:52 Re: On-demand running query plans using auto_explain and signals