Re: Berserk Autovacuum (let's save next Mandrill)

From: Darafei "Komяpa" Praliaskouski <me(at)komzpa(dot)net>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Berserk Autovacuum (let's save next Mandrill)
Date: 2019-03-31 10:12:21
Message-ID: CAC8Q8tJimQuKPFtWCkShRORODmYZR8ULgVBwExYrtJef=TwSwg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 28, 2019 at 6:43 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:

> >> 1. heap vacuum
> >>
> >> 2. HOT pruning
> >
> > Is it worth skipping it if we're writing a page anyway for the sake of
> hint bits and new xids? This will all be a no-op anyway on append-only tables
> and happen only when we actually need something?
> >
>
> Yeah, these operations are required only when the table has actual
> garbage. IOW, append-only tables never require them.
>
> >>
> >> 3. freezing tuples
> >> 4. updating visibility map (all-visible and all-frozen)
> >
> > These two are needed, and the current autovacuum launch process does
> not take into account that they are also needed for non-dead tuples.
> >
> >>
> >> 5. index vacuum/cleanup
> >
> > There is a separate patch for that. But since
> https://commitfest.postgresql.org/16/952/ went in almost a year ago,
> Postgres already skips index cleanup on tables without new dead tuples, so
> this case is taken care of already?
>
> I think that's not enough. The "GUC for cleanup index threshold"
> feature allows us to skip only index cleanup, and only when there have
> been fewer insertions than the given fraction of the total number of heap
> tuples since the last index cleanup. Therefore it helps only append-only
> tables (and supports only btree indexes for now). We still have to do
> index vacuuming even if the table has just a few dead tuples. The
> proposed patch[1] helps this situation; vacuum can run while skipping
> index vacuuming and index cleanup.
>

So the patch I posted can technically be applied after
https://commitfest.postgresql.org/22/1817/ gets merged?

The change with my patch is that for a table with 49 insertions and one
delete:
- previously, it would wait for 49 more deletes by default (ignoring the
insertions), and only then clean up both the table and the indexes;
- with the patch, it will freeze/update the VM for the insertions, and scan
the index.
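
For context, the stock trigger counts only dead tuples. Roughly, following
the documented formula (defaults in parentheses), autovacuum wakes up for a
table only when:

    -- n_dead_tup > autovacuum_vacuum_threshold (50)
    --              + autovacuum_vacuum_scale_factor (0.2) * reltuples;
    -- inserted tuples are not counted at all. An approximate way to list
    -- the tables currently over the default threshold:
    SELECT relname, n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    WHERE n_dead_tup > 50 + 0.2 * n_live_tup;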

In my experience only btree indexes require a slow full index scan; that's
why only btree was covered in the "GUC for cleanup index threshold" patch.
Is that wrong, and do more index types do a full index scan on vacuum after
the deletion of a single tuple?
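
For reference, the knob that patch added is already usable today; a quick
sketch (PostgreSQL 11+; the index name here is made up):

    -- skip the full btree scan at the VACUUM cleanup stage unless tuples
    -- inserted since the previous cleanup exceed this fraction of the
    -- heap tuple count
    SET vacuum_cleanup_index_scale_factor = 0.1;  -- the default
    -- or per btree index:
    ALTER INDEX taxi_movement_ts_idx
        SET (vacuum_cleanup_index_scale_factor = 0.2);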

> >> 6. truncation
> >
> > This shouldn't be a heavy operation?
> >
>
> I don't think so. This could take an AccessExclusiveLock on the table and
> take a long time with a large shared_buffers, as reported on that
> thread[2].
>

While this can be a useful optimization, I believe it is out of scope for
this patch. I want to fix vacuum never coming to append-only tables,
without breaking other behaviors. Truncation is likely a case where there
are enough dead tuples to trigger a vacuum via the currently existing
mechanisms.
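
(Since thread [2] came up: as I understand it, the control proposed there
is a per-table reloption; a sketch assuming that interface, with a made-up
table name:)

    -- disable the end-of-vacuum truncation phase for one table, assuming
    -- the vacuum_truncate reloption proposed in thread [2]
    ALTER TABLE taxi_movement SET (vacuum_truncate = off);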

> >>
> >>
> >> With the proposed patch[1] we can control to do 5 or not. In addition
> >> to that, another proposed patch[2] allows us to control 6.
> >>
> >> For append-only tables (and similar tables), what we periodically want
> >> to do would be 3 and 4 (possibly we can do 2 as well). So maybe we
> >> need to have both an option of (auto)vacuum to control whether to do 1
> >> and something like a new autovacuum threshold (or an option) to invoke
> >> the vacuum that disables 1, 5 and 6. The vacuum that does only 2, 3
> and 4 would be much cheaper than today's vacuum, and anti-wraparound
> vacuum would be able to skip almost all pages.
> >
> >
> > Why would we want to get rid of 1? It's a no-op from the write
> perspective, and doing it now saves a scan later if it's not a no-op.
> >
>
> Because that's for tables that have many inserts but have some
> updates/deletes. I think that this strategy would help not only
> append-only tables but also such tables.
>

How much do we save by skipping a heap vacuum on an almost-append-only
table, where the number of updates is below 50, the current threshold?

>
> > Why make it faster in emergency situations when the situation can be
> made non-emergency from the very beginning instead?
> >
>
> I don't understand the meaning of "situation can be made non-emergency
> from the very beginning". Could you please elaborate on that?
>

Let's imagine a simple append-only workload on a default-settings
Postgres. You create a table and start inserting tuples, one per
transaction. Let's say a page fits 50 tuples (my case for taxi movement
data), and you're on Amazon gp2 storage, which caps you at 1000 IOPS in
non-burst mode.
Anti-wraparound autovacuum (we need a drawing for the popular misreading of
this term: a crossed-out car bent in space) will be triggered after
autovacuum_freeze_max_age inserts, 200000000 by default. That converts to
4000000 pages, or around 32 GB. It will be the first vacuum ever on that
table, since no other mechanism triggers one, and if it steals all the
available IOPS, it will finish in 200000000 / 50 / 1000 = 4000 seconds,
killing prod for over an hour.
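
A minimal sketch of that workload, with made-up names, if you want to watch
it happen:

    -- append-only telemetry table (all names are hypothetical)
    CREATE TABLE taxi_movement (
        car_id int,
        ts     timestamptz,
        pos    point
    );
    -- one small insert per transaction, all day long
    INSERT INTO taxi_movement VALUES (1, now(), point '(27.56, 53.90)');
    -- no deletes ever happen, so this stays NULL right up to the
    -- anti-wraparound vacuum:
    SELECT last_autovacuum, n_dead_tup
    FROM pg_stat_user_tables
    WHERE relname = 'taxi_movement';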

Telemetry workloads can easily generate 32 GB of data a day (I've seen
more, but let's stick to that number). Production going down for an hour a
day isn't good and I consider it an emergency.

Now, here are the two fixes that reading the documentation leads you to
while you're sleepy and trying to get prod back up:
- raise autovacuum_freeze_max_age so VACUUM keeps sleeping (sketched below);
- rewrite your code to batch inserts, so more tuples share a transaction.
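
The first one, concretely, is a single command (a sketch; 2000000000 is the
documented maximum, and the setting only takes effect after a restart):

    -- the sleepy 3 a.m. "fix": push the anti-wraparound trigger as far
    -- out as it will go
    ALTER SYSTEM SET autovacuum_freeze_max_age = 2000000000;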

We don't have a better recommendation mechanism for settings, and the
experience needed to tune autovacuum in the right direction tends to cost
people a job or a company :)

Neither way fixes the problem; both just delay the inevitable. The ratio
of "one hour of vacuum per day of operation" holds, you just postpone it.
Let's say we had the same workload with 1000-record batched inserts, and
moved autovacuum_freeze_max_age to the highest possible value. How long
will the downtime last?

2**31 (xid wraparound horizon) * 1000 (tuples per xid) / 50 (tuples per
page) / 1000 (pages per second) / 86400 (seconds in a day) ≈ 497 days.
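
(The same arithmetic, runnable in SQL for the skeptical:)

    SELECT 2147483648::numeric * 1000 / 50 / 1000 / 86400 AS days;  -- ~497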

That is in the same ballpark as the highest estimate in Mandrill's report,
so this might be what happened to them.

None of this would be needed if autovacuum came after 50 inserted tuples.
It would just mark the pages as all-visible and all-frozen and be gone,
while they are still in memory. That would get rid of the emergency
altogether.
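
What I have in mind is a trigger symmetric to the dead-tuple one; a sketch
with a hypothetical option name (nothing like it exists today):

    -- hypothetical knob mirroring autovacuum_vacuum_threshold: wake
    -- autovacuum after this many *inserted* tuples, so pages get frozen
    -- and marked all-visible/all-frozen while still in shared buffers
    ALTER TABLE taxi_movement
        SET (autovacuum_vacuum_insert_threshold = 50);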

Is this an elaborate enough disaster scenario? :)

--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa
