Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits
Date: 2019-04-08 01:04:27
Message-ID: 20190408010427.4l63qr7h2fjcyp77@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hi,

On 2019-04-04 21:04:49 -0700, Andres Freund wrote:
> On 2019-04-04 23:57:58 -0400, Tom Lane wrote:
> > Andres Freund <andres(at)anarazel(dot)de> writes:
> > > I think the right approach would be to do all of this in heap_insert and
> > > heap_multi_insert. Whenever starting to work on a page, if INSERT_FROZEN
> > > is specified, remember whether it is either currently empty, or is
> > > already marked as all-visible. If previously empty, mark it as all
> > > visible at the end. If already all visible, there's no need to change
> > > that. If not yet all-visible, no need to do anything, since it can't
> > > have been inserted with COPY FREEZE. Do you see any problem doing it
> > > that way?
> >
> > Do we want to add overhead to these hot-spot routines for this purpose?
>
> For heap_multi_insert I can't see it being a problem - it's only used
> from copy.c, and the cost should be "smeared" over many tuples. I'd
> assume that compared with locking a page, WAL logging, etc, it'd not
> even meaningfully show up for heap_insert. Especially because we already
> have codepaths for options & HEAP_INSERT_FROZEN in
> heap_prepare_insert(), and I'd assume those could be combined.
>
> I think we should measure it, but I don't think that one or two
> additional, very well predictd, branches are going to be measurable in
> in those routines.
>
> The patch, as implemented, has modifications in
> RelationGetBufferForTuple(), that seems like they'd be more expensive.

Here's a *prototype* patch for this. It only implements what I
described for heap_multi_insert, not for plain inserts. I wanted to see
what others think before investing additional time.

I don't think it's possible to see the overhead of the changed code in
heap_multi_insert(), and probably - with less confidence - that it's
also going to be ok for heap_insert(). But we gotta measure that.

This avoids an extra WAL record for setting empty pages to all visible,
by adding XLH_INSERT_ALL_VISIBLE_SET & XLH_INSERT_ALL_FROZEN_SET, and
setting those when appropriate in heap_multi_insert. Unfortunately
currently visibilitymap_set() doesn't really properly allow to do this,
as it has embedded WAL logging for heap.

I think we should remove the WAL logging from visibilitymap_set(), and
move it to a separate, heap specific, function. Right now different
tableams e.g. would have to reimplement visibilitymap_set(), so that's a
second need to separate that functionality. Let me try to come up with
a proposal.

The patch currently does a vmbuffer_pin() while holding an exclusive
lwlock on the page. That's something we normally try to avoid - but I
think it's probably OK here, because INSERT_FROZEN can only be used to
insert into a new relfilenode (which thus no other session can see). I
think it's preferrable to have this logic in specific to the
INSERT_FROZEN path, rather than adding nontrivial complications to
RelationGetBufferForTuple().

I noticed that, before this patch, we do a
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
after every filled page - that doesn't strike me as particularly smart -
it's pretty likely that the next heap page to be filled is going to be
on the same vm page as the previous iteration.

I noticed one small oddity that I think is common to all the approaches
presented in this thread so far. After

BEGIN;
TRUNCATE foo;
COPY foo(i) FROM '/tmp/foo' WITH FREEZE;
COPY foo(i) FROM '/tmp/foo' WITH FREEZE;
COPY foo(i) FROM '/tmp/foo' WITH FREEZE;
COMMIT;

we currently end up with pages like:
┌───────┬───────────┬──────────┬───────┬───────┬───────┬─────────┬──────────┬─────────┬───────────┐
│ blkno │ lsn │ checksum │ flags │ lower │ upper │ special │ pagesize │ version │ prune_xid │
├───────┼───────────┼──────────┼───────┼───────┼───────┼─────────┼──────────┼─────────┼───────────┤
│ 0 │ 0/50B5488 │ 0 │ 4 │ 928 │ 960 │ 8192 │ 8192 │ 4 │ 0 │
│ 1 │ 0/50B6360 │ 0 │ 4 │ 928 │ 960 │ 8192 │ 8192 │ 4 │ 0 │
│ 2 │ 0/50B71B8 │ 0 │ 4 │ 928 │ 960 │ 8192 │ 8192 │ 4 │ 0 │
│ 3 │ 0/50B8028 │ 0 │ 4 │ 928 │ 960 │ 8192 │ 8192 │ 4 │ 0 │
│ 4 │ 0/50B8660 │ 0 │ 4 │ 408 │ 5120 │ 8192 │ 8192 │ 4 │ 0 │
│ 5 │ 0/50B94B8 │ 0 │ 4 │ 928 │ 960 │ 8192 │ 8192 │ 4 │ 0 │
│ 6 │ 0/50BA328 │ 0 │ 4 │ 928 │ 960 │ 8192 │ 8192 │ 4 │ 0 │
│ 7 │ 0/50BB180 │ 0 │ 4 │ 928 │ 960 │ 8192 │ 8192 │ 4 │ 0 │
│ 8 │ 0/50BBFD8 │ 0 │ 4 │ 928 │ 960 │ 8192 │ 8192 │ 4 │ 0 │
│ 9 │ 0/50BCF88 │ 0 │ 4 │ 928 │ 960 │ 8192 │ 8192 │ 4 │ 0 │
│ 10 │ 0/50BDDE0 │ 0 │ 4 │ 928 │ 960 │ 8192 │ 8192 │ 4 │ 0 │
│ 11 │ 0/50BEC50 │ 0 │ 4 │ 928 │ 960 │ 8192 │ 8192 │ 4 │ 0 │
│ 12 │ 0/50BFAA8 │ 0 │ 4 │ 928 │ 960 │ 8192 │ 8192 │ 4 │ 0 │
│ 13 │ 0/50C06F8 │ 0 │ 4 │ 792 │ 2048 │ 8192 │ 8192 │ 4 │ 0 │
└───────┴───────────┴──────────┴───────┴───────┴───────┴─────────┴──────────┴─────────┴───────────┘
(14 rows)

Note how block 4 has more space available. That's because the
visibilitymap_pin() called in the first COPY has to vm_extend(), which
in turn does:

/*
* Send a shared-inval message to force other backends to close any smgr
* references they may have for this rel, which we are about to change.
* This is a useful optimization because it means that backends don't have
* to keep checking for creation or extension of the file, which happens
* infrequently.
*/
CacheInvalidateSmgr(rel->rd_smgr->smgr_rnode);

which invalidates ->rd_smgr->smgr_targblock *after* the first COPY,
because that's when the pending smgr invalidations are sent out. That's
far from great, but it doesn't seem to be this patch's fault.

It seems to me we need a separate invalidation that doesn't close the
whole smgr relation, but just invalidates the VM specific fields.

Greetings,

Andres Freund

Attachment Content-Type Size
0002-WIP-copy-freeze-should-actually-freeze-right.patch text/x-diff 9.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2019-04-08 01:08:22 Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
Previous Message Masahiko Sawada 2019-04-08 00:57:40 Re: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation