Re: Initdb-time block size specification

From: David Christensen <david(dot)christensen(at)crunchydata(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: Initdb-time block size specification
Date: 2023-06-30 20:05:54
Message-ID: CAOxo6XL5vkKb0oc+AEuMJ9A3j1W=DwSAKJwzOA78zTHcO_QzEQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jun 30, 2023 at 2:39 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2023-06-30 12:35:09 -0500, David Christensen wrote:
> > As discussed somewhat at PgCon, enclosed is v1 of a patch to provide
> > variable block sizes; basically instead of BLCKSZ being a compile-time
> > constant, a single set of binaries can support all of the block sizes
> > Pg can support, using the value stored in pg_control as the basis.
> > (Possible future plans would be to make this something even more
> > dynamic, such as configured per tablespace, but this is out of scope;
> > this just sets up the infrastructure for this.)
>
> I am extremely doubtful this is a good idea. For one it causes a lot of churn,
> but more importantly it turns currently cheap code paths into more expensive
> ones.
>
> Changes like
>
> > -#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> > +#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * CLUSTER_BLOCK_SIZE))
>
> Note that CLUSTER_BLOCK_SIZE, despite looking like a macro that's constant, is
> actually variable.

Correct; that is mainly a notational device which would be easy enough
to change (and presumably would follow along the lines of the commit
Tomas pointed out above).

> I am fairly certain this is going to be causing substantial performance
> regressions. I think we should reject this even if we don't immediately find
> them, because it's almost guaranteed to cause some.

What would be considered substantial? Some overhead would be expected,
but I think having an actual patch to evaluate lets us see what
potential there is. Seems like this will likely be optimized as an
offset stored in a register, so wouldn't expect huge changes here.
(There may be other approaches I haven't thought of in terms of
getting this.)

> Besides this, I've not really heard any convincing justification for needing
> this in the first place.

Doing this would open up experiments in larger block sizes, so we
would be able to have larger indexable tuples, say, or be able to
store data types that are larger than currently supported for tuple
row limits without dropping to toast (native vector data types come to
mind as a candidate here). We've had 8k blocks for a long time while
hardware has improved over 20+ years, and it would be interesting to
see how tuning things would open up additional avenues for performance
without requiring packagers to make a single choice on this regardless
of use-case. (The fact that we allow compiling this at a different
value suggests there is thought to be some utility having this be
something other than the default value.)

I just think it's one of those things that is hard to evaluate without
actually having something specific, which is why we have this patch
now.

Thanks,

David

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2023-06-30 20:16:31 Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?
Previous Message Nathan Bossart 2023-06-30 20:05:09 Should we remove db_user_namespace?