Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Date: 2023-01-19 23:38:57
Message-ID: 20230119233857.pjdz5ls43qbdlmke@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-01-19 15:10:38 -0800, Peter Geoghegan wrote:
> On Thu, Jan 19, 2023 at 2:54 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Yea. Hence my musing about potentially addressing this by choosing to visit
> > additional blocks during the heap scan using vacuum's block sampling logic.
>
> I'd rather just invent a way for vacuumlazy.c to tell the top-level
> vacuum.c caller "I didn't update reltuples, but you ought to go
> ANALYZE the table now that I'm done, even if you weren't already
> planning to do so".

I'm worried about increasing the number of analyzes that much - on a subset of
workloads it's really quite slow.

Another version of this could be to integrate analyze.c's scan more closely
with vacuum all the time. It's a bit bonkers that we often sequentially read
blocks, evict them from shared buffers if we read them, just to then
afterwards do random IO for blocks we've already read. That's imo what we
eventually should do, but clearly it's not a small project.

> This wouldn't have to happen every time, but it would happen fairly often.

Do you have a mechanism for that in mind? Just something like vacuum_count % 10
== 0? Or remembering scanned_pages in pgstats and re-computing

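To make the first option concrete, here is a minimal sketch of a counter-based trigger. The function name and the convention that vacuum_count already includes the vacuum that just finished are assumptions for illustration; nothing like this exists in vacuum.c or pgstats today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether the vacuum that just finished should
 * also trigger a re-estimate. vacuum_count is assumed to count completed
 * vacuums of the table, including the current one, so it starts at 1.
 */
bool
vacuum_should_reestimate(int64_t vacuum_count, int interval)
{
    assert(vacuum_count >= 1);
    assert(interval > 0);

    /* The "vacuum_count % 10 == 0" idea, with the interval as a parameter. */
    return vacuum_count % interval == 0;
}
```

With interval = 10 this fires on the 10th, 20th, ... vacuum of a table and on none in between, regardless of how many pages each vacuum scanned.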
> > IME most of the time in analyze isn't spent doing IO for the sample blocks
> > themselves, but CPU time and IO for toasted columns. A trimmed down version
> > that just computes relallvisible should be a good bit faster.
>
> I worry about that from a code maintainability point of view. I'm
> concerned that it won't be very cut down at all, in the end.

I think it'd be fine to just use analyze.c and pass in an option to not
compute column and inheritance stats.

> Presumably you'll want to add the same I/O prefetching logic to this
> cut-down version, just for example. Since without that there will be
> no competition between it and ANALYZE proper. Besides which, isn't it
> kinda wasteful to not just do a full ANALYZE? Sure, you can avoid
> detoasting overhead that way. But even still.

It's not just that analyze is expensive; I think it'll also be confusing if
the column stats change after a manual VACUUM without ANALYZE.

It shouldn't be too hard to figure out whether we're going to do an analyze
anyway and not do the rowcount-estimate version when doing VACUUM ANALYZE or
if autovacuum scheduled an analyze as well.
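
Roughly, the decision could look like the sketch below. All type, field and function names are hypothetical and only illustrate the control flow; this is not existing vacuum.c or analyze.c code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of the post-vacuum decision. */
typedef enum PostVacuumAction
{
    POST_VACUUM_NONE,           /* nothing further to do */
    POST_VACUUM_LIGHT_ESTIMATE, /* analyze.c without column/inheritance stats */
    POST_VACUUM_FULL_ANALYZE    /* the analyze that was going to run anyway */
} PostVacuumAction;

/* Hypothetical summary of what was requested and what vacuum achieved. */
typedef struct VacuumOutcome
{
    bool analyze_requested;  /* VACUUM ANALYZE, or autovacuum scheduled one */
    bool reltuples_updated;  /* vacuum refreshed the row-count estimate itself */
} VacuumOutcome;

PostVacuumAction
choose_post_vacuum_action(const VacuumOutcome *outcome)
{
    assert(outcome != NULL);

    /* An analyze is happening anyway, so the cut-down pass adds nothing. */
    if (outcome->analyze_requested)
        return POST_VACUUM_FULL_ANALYZE;

    /* Plain VACUUM that could not update reltuples: run the cut-down pass. */
    if (!outcome->reltuples_updated)
        return POST_VACUUM_LIGHT_ESTIMATE;

    return POST_VACUUM_NONE;
}
```

The point of checking analyze_requested first is that a manual VACUUM without ANALYZE never ends up changing column stats.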

Greetings,

Andres Freund
