Re: decoupling table and index vacuum

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: decoupling table and index vacuum
Date: 2021-04-22 20:01:18
Message-ID: 20210422200118.ag3cnhtyfuvztdoo@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2021-04-22 12:15:27 -0400, Robert Haas wrote:
> On Wed, Apr 21, 2021 at 5:38 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I'm not sure that's the only way to deal with this. While some form of
> > generic "conveyor belt" infrastructure would be a useful building block,
> > and it'd be sensible to use it here if it existed, it seems feasible to
> > store dead tids in a different way here. You could e.g. have per-heap-vacuum
> > files with a header containing LSNs that indicate the age of the
> > contents.
>
> That's true, but I have some reservations about being overly reliant on
> the filesystem to provide structure here. There are good reasons to be
> worried about bloating the number of files in the data directory. Hmm,
> but maybe we could mitigate that. First, we could skip this for small
> relations. If you can vacuum the table and all of its indexes using
> the naive algorithm in <10 seconds, you probably shouldn't do anything
> fancy. That would *greatly* reduce the number of additional files
> generated. Second, we could forget about treating them as separate
> relation forks and make them some other kind of thing entirely, in a
> separate directory

I'm not *too* worried about this issue. IMO the big difference compared to
the cost of additional relation forks is that such files would only exist
when the table is modified to a somewhat meaningful degree. IME the
practical issues with the number of files due to forks arise in cases where
huge numbers of tables exist that are practically never modified.

That's not to say that I am sure that some form of "conveyor belt"
storage *wouldn't* be the right thing. How were you thinking of dealing
with the per-relation aspects of this? One conveyor belt per relation?
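
To make that a bit more concrete, here's a minimal sketch of what the
header of such a per-relation "dead tid" file could look like, along the
lines of the LSN-bearing header I mentioned above (field names are made
up, purely illustrative, not a worked-out proposal):

#include <stdint.h>

/* stand-in for PostgreSQL's XLogRecPtr, to keep the sketch self-contained */
typedef uint64_t XLogRecPtr;

typedef struct DeadTidFileHeader
{
	uint32_t	magic;			/* file format identifier */
	uint32_t	version;		/* format version */
	XLogRecPtr	start_lsn;		/* insert LSN when the collection run started */
	XLogRecPtr	end_lsn;		/* insert LSN when the run was completed */
	uint64_t	num_tids;		/* number of dead tids stored after the header */
} DeadTidFileHeader;

The LSNs would let an index vacuum (or anything else reading the file)
judge how old the contained tids are relative to its own last cleanup
pass.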

> especially if we adopted Sawada-san's proposal to skip WAL logging. I
> don't know if that proposal is actually a good idea, because it
> effectively adds a performance penalty when you crash or fail over,
> and that sort of thing can be an unpleasant surprise. But it's
> something to think about.

I'm doubtful about skipping WAL logging entirely - I'd have to think
harder about it, but I think that'd mean we'd restart from scratch after
crashes / immediate restarts as well, because we couldn't rely on the
contents of the "dead tid" files to be accurate - in addition to the
replication issues you mention.

> > One thing that you didn't mention so far is that this'd allow us to add
> > dead TIDs to the "dead tid" file outside of vacuum too. In some
> > workloads most of the dead tuple removal happens as part of on-access
> > HOT pruning. While some indexes are likely to see that via the
> > killtuples logic, others may not. Being able to have more aggressive
> > index vacuum for the one or two bloated indexes, without needing to rescan
> > the heap, seems like it'd be a significant improvement.
>
> Oh, that's a very interesting idea. It does impose some additional
> requirements on any such system, though, because it means you have to
> be able to efficiently add single TIDs. For example, you mention a
> per-heap-VACUUM file above, but you can't get away with creating a new
> file per HOT prune no matter how you arrange things at the FS level.

I agree that it'd be an issue, even though I think it's not too common
that only one tuple gets pruned. It might be possible to have a
per-relation file per backend or such... But yes, we'd definitely have
to think about it.

I've previously pondered adding some cross-page batching and deferring
of HOT pruning in the read case, which I guess might be more
advantageous with this.

The main reason for thinking about batching & deferring of HOT pruning
is that I found during the AIO work that there are speed gains to be had
if we pad xlog pages instead of partially filling them - obviously at the
risk of increased WAL usage. One idea to reduce the cost of that was to
fill the padded space with actually useful things, like FPIs or HOT
pruning records. A related speedup opportunity with AIO is to perform
useful work while waiting for WAL flushes during commit (i.e. after
initiating the IO to flush the commit record, but before that IO has
completed).
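
To illustrate the kind of thing I mean, a standalone sketch of a
per-backend queue of deferred prune requests that gets drained whenever
there's otherwise-wasted space at the end of a WAL page (all names here
are hypothetical, nothing like this exists):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* stand-ins for PostgreSQL's Oid / BlockNumber, to keep this self-contained */
typedef uint32_t Oid;
typedef uint32_t BlockNumber;

#define MAX_DEFERRED_PRUNES 32

typedef struct DeferredPrune
{
	Oid			relid;		/* relation containing the prunable page */
	BlockNumber	blkno;		/* heap block that looked prunable during a read */
} DeferredPrune;

static DeferredPrune prune_queue[MAX_DEFERRED_PRUNES];
static int	prune_queue_len = 0;

/*
 * Remember a page that looked prunable, instead of pruning it immediately.
 * Returns false if the queue is full, i.e. the caller should just prune
 * right away.
 */
static bool
defer_prune(Oid relid, BlockNumber blkno)
{
	if (prune_queue_len >= MAX_DEFERRED_PRUNES)
		return false;
	prune_queue[prune_queue_len].relid = relid;
	prune_queue[prune_queue_len].blkno = blkno;
	prune_queue_len++;
	return true;
}

/*
 * Called when we're about to pad out an xlog page (or wait for a commit
 * record's flush to complete): drain deferred prunes until the space
 * budget is used up.  The fixed per-record cost is obviously a stand-in
 * for re-checking the page and emitting an actual prune record.
 */
static void
drain_deferred_prunes(size_t wal_space_budget)
{
	const size_t assumed_record_size = 128;

	while (prune_queue_len > 0 && wal_space_budget >= assumed_record_size)
	{
		DeferredPrune *dp = &prune_queue[--prune_queue_len];

		(void) dp;				/* real code would prune dp->relid / dp->blkno */
		wal_space_budget -= assumed_record_size;
	}
}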

> Actually, though, I think the big problem here is deduplication. A
> full-blown VACUUM can perhaps read all the already-known-to-be-dead
> TIDs into some kind of data structure and avoid re-adding them, but
> that's impractical for a HOT prune.

What is there to deduplicate during HOT pruning? It seems that HOT
pruning would need to log all items that it marks dead, but nothing
else? And VACUUM can't yet have put those items onto the dead tuple
map, because they weren't dead yet?

This actually brings to the fore a question I vaguely had: How are
you assuming indexes would access the list of dead tids? As far as I can
see the on-disk data would not be fully sorted even without adding
things during HOT pruning - the dead tids from a single heap pass will
be, but there'll be tids from multiple passes, right?

Are you assuming that we'd read the data into memory and then merge-sort
between each of the pre-sorted "runs"? Or that we'd read and cache parts
of the on-disk data during index checks?
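
In case it helps the discussion, a standalone sketch of the
merge-between-runs variant (ItemPointer reduced to a block/offset pair,
"emit" standing in for handing the tid to the index's bulk-delete
callback):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* simplified stand-in for ItemPointerData */
typedef struct Tid
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

/* one pre-sorted run of dead tids, e.g. from a single heap pass */
typedef struct TidRun
{
	const Tid  *tids;
	size_t		ntids;
	size_t		pos;
} TidRun;

static int
tid_cmp(const Tid *a, const Tid *b)
{
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Merge several individually sorted runs into one sorted, de-duplicated
 * stream.  A linear scan over the runs finds the minimum; with many runs
 * you'd use a heap instead.
 */
static void
merge_runs(TidRun *runs, int nruns, void (*emit) (const Tid *))
{
	Tid			last = {0, 0};
	bool		have_last = false;

	for (;;)
	{
		int			best = -1;
		const Tid  *t;

		for (int i = 0; i < nruns; i++)
		{
			if (runs[i].pos >= runs[i].ntids)
				continue;
			if (best < 0 ||
				tid_cmp(&runs[i].tids[runs[i].pos],
						&runs[best].tids[runs[best].pos]) < 0)
				best = i;
		}
		if (best < 0)
			break;				/* all runs exhausted */

		t = &runs[best].tids[runs[best].pos++];
		if (!have_last || tid_cmp(t, &last) != 0)
			emit(t);			/* duplicates across runs are skipped */
		last = *t;
		have_last = true;
	}
}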

> > Have you thought about how we would do the scheduling of vacuums for the
> > different indexes? We don't really have useful stats for the number of
> > dead index entries to be expected in an index. It'd not be hard to track
> > how many entries are removed in an index via killtuples, but
> > e.g. estimating how many dead entries there are in a partial index seems
> > quite hard (at least without introducing significant overhead).
>
> No, I don't have any good ideas about that, really. Partial indexes
> seem like a hard problem, and so do GIN indexes or other kinds of
> things where you may have multiple index entries per heap tuple. We
> might have to accept some known-to-be-wrong approximations in such
> cases.

The GIN case seems a bit easier than the partial index case. Keeping
stats about the number of new entries in a GIN index doesn't seem too
hard, nor does tracking the number of cleaned-up index entries. But
knowing which indexes are affected when a heap tuple becomes dead seems
harder. I guess we could just start doing a stats-only version of
ExecInsertIndexTuples() for deletes, but obviously the cost of that is
not enticing. Perhaps it'd not be too bad if we only did it when there's
an index with a predicate?
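
Roughly, as a standalone sketch (all names invented; the predicate
callback stands in for the expensive part of such a stats-only pass):

#include <stdbool.h>
#include <stdint.h>

/* hypothetical per-index counters, nothing like this exists today */
typedef struct IndexDeadStats
{
	uint64_t	est_dead_entries;	/* entries we believe have become dead */
	uint64_t	killed_entries;		/* entries already removed via killtuples */
	bool		has_predicate;		/* partial index? */
} IndexDeadStats;

/*
 * Stand-in for evaluating a partial index's predicate against a
 * (now-dead) heap tuple, i.e. the stats-only ExecInsertIndexTuples()-like
 * step.
 */
typedef bool (*PredicateTest) (const void *heap_tuple);

static void
account_dead_heap_tuple(IndexDeadStats *stats, PredicateTest *predicates,
						int nindexes, const void *heap_tuple)
{
	for (int i = 0; i < nindexes; i++)
	{
		/*
		 * Non-partial indexes get one estimated dead entry per dead heap
		 * tuple (a GIN-style index could add more than one); partial
		 * indexes only count if the predicate would have matched.
		 */
		if (!stats[i].has_predicate || predicates[i](heap_tuple))
			stats[i].est_dead_entries++;
	}
}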

> > > One rather serious objection to this whole line of attack is that we'd
> > > ideally like VACUUM to reclaim disk space without using any more, in
> > > case that's the motivation for running VACUUM in the first place.
> >
> > I suspect we'd need a global limit of space used for this data. If above
> > that limit we'd switch to immediately performing the work required to
> > remove some of that space.
>
> I think that's entirely the wrong approach. On the one hand, it
> doesn't prevent you from running out of disk space during emergency
> maintenance, because the disk overall can be full even though you're
> below your quota of space for this particular purpose. On the other
> hand, it does subject you to random breakage when your database gets
> big enough that the critical information can't be stored within the
> configured quota.

What random breakage are you thinking of? I'm not thinking of a hard
limit that may not be crossed at any cost, not even by a single byte, but
rather that [auto]VACUUM would start to be more aggressive about performing
index vacuums once the limit is reached.

> I think we'd end up with pathological cases very much like what used
> to happen with the fixed-size free space map. What happened there was
> that your database got big enough that you couldn't track all the free
> space any more and it just started bloating out the wazoo. What would
> happen here is that you'd silently lose the well-optimized version of
> VACUUM when your database gets too big. That does not seem like
> something anybody wants.

I don't think the consequences would really be that comparable. Once the
FSM size was reached in the bad old days, we'd just lose track of
free space. Whereas here we'd start to be more aggressive about cleaning
up once the "dead tids" data reaches a certain size. Of course that
would have efficiency impacts, but I think "global free space wasted" is
a valid input in deciding when to perform index vacuums.

I think max_wal_size has worked out pretty well, even if not perfect.
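
To spell out what "more aggressive once the limit is reached" could mean,
roughly in the spirit of how checkpointing reacts to max_wal_size (the
knob and numbers below are invented, purely for illustration):

#include <stdint.h>

/* hypothetical GUC: soft budget for on-disk "dead tid" data, in bytes */
static uint64_t dead_tid_space_limit = UINT64_C(1024) * 1024 * 1024;

/*
 * Return a multiplier describing how urgently index vacuums should be
 * scheduled: near 0 while little of the budget is used, reaching and
 * exceeding 1.0 as the soft limit is passed.  Not a hard cap - just one
 * input into [auto]vacuum scheduling.
 */
static double
index_vacuum_urgency(uint64_t dead_tid_bytes_used)
{
	if (dead_tid_space_limit == 0)
		return 0.0;
	return (double) dead_tid_bytes_used / (double) dead_tid_space_limit;
}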

Greetings,

Andres Freund
