Quick Links

Re: tests against running server occasionally fail, postgres_fdw & tenk1

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Peter Geoghegan <pg(at)bowt(dot)ie>
Cc:	pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: tests against running server occasionally fail, postgres_fdw & tenk1
Date:	2023-02-09 00:29:35
Message-ID:	20230209002935.2mq47txxcpvcgyqv@awork3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2023-02-08 14:03:49 -0800, Peter Geoghegan wrote:
> On Tue, Feb 7, 2023 at 6:47 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > One thing I'm not quite sure what to do about is that we atm use a hardcoded
> > DEBUG2 (not controlled by VERBOSE) in a bunch of places:
> >
> > ereport(DEBUG2,
> > (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
> > vacrel->relname, (long long) index, vacuumed_pages)));
> >
> > ivinfo.message_level = DEBUG2;
> >
> > I find DEBUG2 hard to use to run the entire regression tests, it results in a
> > lot of output. Lots of it far less important than these kinds of details
> > here. So I'd like to use a different log level for them - but without further
> > complications that'd mean they'd show up in VACUUUM VERBOSE.
>
> I think that these DEBUG2 traces are of limited value, even for
> experts. I personally never use them -- I just use VACUUM
> VERBOSE/autovacuum logging for everything, since it's far easier to
> read, and isn't missing something that the DEBUG2 traces have. In fact
> I wonder if we should even have them at all.

I find it useful information when debugging problems. Without it, the log
doesn't tell you which index was processed when a problem started to occur. Or
even that we were scanning indexes at all.

IOW, I care a lot more about log entries denoting the start / end of an index
vacuum than attached numbers. Although I also care about those, to some
degree.

> I generally don't care about the details when VACUUM runs out of space
> for dead_items -- these days the important thing is to just avoid it
> completely.

I wonder if it'd possibly make sense to log more verbosely if we do end up
running out of space, but not otherwise? Particularly because it's important
to avoid multi-pass vacuums? Right now there's basically no log output until
the vacuum finished, which in a bad situation could take days, with many index
scan cycles. Logging enough to be able to see such things happening IMO is
important.

> I also generally don't care about how many index tuples
> were deleted by each index's ambulkdelete() call, since VACUUM VERBOSE
> already shows me the number of LP_DEAD stubs encountered/removed in
> the heap.

Isn't it actually quite useful to see how many entries were removed from the
index relative to the number of LP_DEAD removed in the heap? And relative to
other indexes? E.g. to understand how effective killtuple style logic is?

One annoyance with the multiple logs as part of a [auto]vacuum is that we end
up logging the context / statement repeatedly, leading to pointless output
like this:

2023-02-08 15:55:01.117 PST [3904676][client backend][2/55:0][psql] LOG: vacuuming "postgres.public.large"
2023-02-08 15:55:01.117 PST [3904676][client backend][2/55:0][psql] STATEMENT: VACUUM (PARALLEL 0) large ;
2023-02-08 15:55:02.671 PST [3904676][client backend][2/55:0][psql] LOG: scanned index "large_pkey" to remove 499994 row versions
2023-02-08 15:55:02.671 PST [3904676][client backend][2/55:0][psql] CONTEXT: while vacuuming index "large_pkey" of relation "public.large"
2023-02-08 15:55:02.671 PST [3904676][client backend][2/55:0][psql] STATEMENT: VACUUM (PARALLEL 0) large ;
...
2023-02-08 15:55:03.496 PST [3904676][client backend][2/55:0][psql] STATEMENT: VACUUM (PARALLEL 0) large ;
2023-02-08 15:55:03.498 PST [3904676][client backend][2/56:0][psql] LOG: vacuuming "postgres.pg_toast.pg_toast_3370138"
2023-02-08 15:55:03.498 PST [3904676][client backend][2/56:0][psql] STATEMENT: VACUUM (PARALLEL 0) large ;
2023-02-08 15:55:03.498 PST [3904676][client backend][2/56:0][psql] LOG: finished vacuuming "postgres.pg_toast.pg_toast_3370138": index scans: 0

I think we should emit most of the non-error/warning messages in vacuum with
errhidestmt(true), errhidecontext(true) to avoid that. The error context is
quite helpful for error messages due to corruption, cancellations and such,
but not for messages where we already are careful to include relation names.

I generally like the improved log output for [auto]vacuum, but the fact that
you can't see anymore when index scans start imo problematic. Right now you
can't even infer how long the first index scan takes, which really isn't
great.

I'd thus like to:

1) Use errhidestmt(true), errhidecontext(true) for vacuum
ereport(non-error-or-warning)

2) Add a message to lazy_vacuum() or lazy_vacuum_all_indexes(), that includes
- num_index_scans
- how many indexes we'll scan
- how many dead tids we're working on removing

3) Add a log at the end of lazy_vacuum_heap_rel() that's logged only (or more
verbosely) when lazy_vacuum() was run due to running out of space

If we just do the heap scan once, this can be easily inferred from the
other messages.

4) When we run out of space for dead tids, increase the log level for the rest
of vacuum. It's sufficiently bad if that happens that we really ought to
include it in the log by default.

5) Remove the row versions from vac_bulkdel_one_index()'s message, it's
already included in 2).

Instead we could emit the content from vac_cleanup_one_index(), imo that's
a lot more useful when separated for each scan.

6) Possibly remove the log output from vac_cleanup_one_index()?

2) and 3) together allow to figure out how long individual scan / vacuum
phases are taking. 1) should reduce log verbosity sufficiently to make it
easier to actually read the output.

FWIW, I'm not proposing to do all of that in one patch, once I understand a
bit more what we have concensus and what not I can split it into steps.

Greetings,

Andres Freund

In response to

Re: tests against running server occasionally fail, postgres_fdw & tenk1 at 2023-02-08 22:03:49 from Peter Geoghegan

Responses

Re: tests against running server occasionally fail, postgres_fdw & tenk1 at 2023-02-09 02:37:41 from Peter Geoghegan

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Smith	2023-02-09 01:01:10	Re: [PATCH] Add pretty-printed XML output option
Previous Message	Nathan Bossart	2023-02-09 00:24:13	Re: Weird failure with latches in curculio on v15