Re: Proposal : For Auto-Prewarm.

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>
Cc: Beena Emerson <memissemerson(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal : For Auto-Prewarm.
Date: 2017-02-26 16:46:50
Message-ID: CA+TgmoYNF_wfdwQ3z3713zKy2j0Z9C32WJdtKjvRWzeY7JOL4g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 22, 2017 at 11:49 PM, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com> wrote:
> Hi all thanks,
> I have tried to fix all of the comments given above with some more
> code cleanups.

While reading this patch tonight, I realized a serious problem with
the entire approach, which is that this patch is supposing that we can
read relation blocks for every database from a single worker that's
not connected to any database. I realize that I suggested that
approach, but now I think it's broken, because the patch isn't taking
any locks on the relations whose pages it is reading, and that is
definitely going to break things. While autoprewarm is busy sucking
blocks into the shared buffer cache, somebody could be, for example,
dropping one of those relations. DropRelFileNodesAllBuffers and
friends expect that nobody is going to be concurrently reading blocks
back into the buffer cache because they hold AccessExclusiveLock, and
they assume that anybody else who is touching it will hold at least
AccessShareLock. But this violates that assumption, and probably some
others.

This is not easy to fix. The lock has to be taken based on the
relation OID, not the relfilenode, but we don't have the relation OID
in the dump file, and recording it there won't help, because the
relfilenode can change under us if the relation is rewritten with
CLUSTER or VACUUM FULL or relevant forms of ALTER TABLE. I don't see
a solution other than launching a separate worker for each database,
which seems like it could be extremely expensive if there are many
databases. Also, I am pretty sure it's no good to take locks before
recovery reaches a consistent state. I'm not sure off-hand whether
crash-recovery will notice conflicting locks, but even if it does,
blocking crash recovery in order to prewarm stuff is bad news.
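
If we do end up with one worker per database, the reload path could look
roughly like this once the worker is connected to the target database
(untested sketch only; "blk" and its fields are a hypothetical struct
holding one dumped block record):

    /* Sketch: resolve the relfilenode to a relation OID, then lock it. */
    Oid         reloid;
    Relation    rel;

    reloid = RelidByRelfilenode(blk->tablespace, blk->relfilenode);
    if (!OidIsValid(reloid))
        return;                 /* relation is gone; skip it quietly */

    rel = try_relation_open(reloid, AccessShareLock);
    if (rel == NULL)
        return;                 /* dropped concurrently; skip it quietly */

    /* A real patch should recheck rel->rd_node against blk here. */
    RelationOpenSmgr(rel);
    if (smgrexists(rel->rd_smgr, blk->forknum) &&
        blk->blocknum < RelationGetNumberOfBlocksInFork(rel, blk->forknum))
    {
        Buffer      buf;

        buf = ReadBufferExtended(rel, blk->forknum, blk->blocknum,
                                 RBM_NORMAL, NULL);
        ReleaseBuffer(buf);
    }

    relation_close(rel, AccessShareLock);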

Here are some other review comments (which may not matter unless we
can think up a solution to the problems above).

- I think auto_pg_prewarm.c is an unnecessarily long name. How about
autoprewarm.c?

- It looks like you ran pgindent over this without adding the new
typedefs to your typedefs.list, so things like the last line of each
typedef are formatted incorrectly.

- ReadBufferForPrewarm isn't a good name for this interface. You need
to give it a generic name (and header comment) that describes what it
does, rather than why you added it. Others might want to also use
this interface. Actually, an even better idea might be to adjust
ReadBufferWithoutRelcache() to serve your need here. That function's
header comment seems to contemplate that somebody might want to add a
relpersistence argument someday; perhaps that day has arrived.
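
For instance, the signature could grow an extra argument, something
like this (just a sketch of the idea, not a worked-out patch):

    extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
                                            ForkNumber forkNum,
                                            BlockNumber blockNum,
                                            ReadBufferMode mode,
                                            BufferAccessStrategy strategy,
                                            char relpersistence);

and callers would pass the relation's actual relpersistence instead of
the function hard-wiring RELPERSISTENCE_PERMANENT internally.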

- have_free_buffer's comment shouldn't mention autoprewarm, but it
should mention that this is a lockless test, so the results might be
slightly stale. See similar comments in various other backend
functions for an example of how to write this.
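
Something along these lines would be fine (wording is only a suggestion):

    /*
     * have_free_buffer -- report whether the shared freelist is non-empty.
     *
     * This is a lockless test, so the answer may already be stale by the
     * time the caller acts on it; callers must be prepared for that.
     */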

- next_task could be static, and with such a generic name, it really
MUST be static to avoid namespace conflicts.

- load_block() has a race condition. The relation could be dropped
after you check smgrexists() and before you access the relation.
Also, the fork number validity check looks useless; it should never
fail.

- I suggest renaming the file that stores the blocks to
autoprewarm.blocks or something like that. Calling it just
"autopgprewarm" doesn't seem very clear.

- I don't see any reason for the dump file to contain a header record
with an expected record count. When rereading the file, you can just
read until EOF; there's no real need to know the record count before
you start.
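
Assuming the file stays a simple text format with one block record per
line (the field names below are made up), the read loop can just be:

    uint32      dbOid, tsOid, relnode, forknum, blocknum;

    /* Read block records until EOF; no count header required. */
    while (fscanf(file, "%u,%u,%u,%u,%u\n",
                  &dbOid, &tsOid, &relnode, &forknum, &blocknum) == 5)
    {
        /* queue this block for prewarming */
    }

    if (!feof(file))
        ereport(LOG,
                (errmsg("ignoring malformed entries at end of block file")));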

- You should test for multiple flags like this: if ((buf_state &
(BM_VALID|BM_TAG_VALID|BM_PERMANENT)) != 0). However, I also think
the test is wrong. Even if the buffer isn't BM_VALID, that's not
really a reason not to include it in the dump file. Same with
BM_PERMANENT. I think the reason for the latter restriction is that
you're always calling load_block() with RELPERSISTENCE_PERMANENT, but
that's not a good idea either. If the relation were made unlogged
after you wrote the dump file, then on reload you'd incorrectly set
BM_PERMANENT on the reloaded blocks.

- elog() should not be used for user-facing messages, but rather only
for internal messages that we don't expect to get generated. Also,
the messages you've picked don't conform to the project's message
style guidelines.
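
For example, a user-facing message would look more like this (the
message text and variables are only illustrative):

    ereport(LOG,
            (errmsg("could not prewarm block %u of relation with relfilenode %u",
                    blocknum, relnode)));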

- The error handling loop around load_block() suggests that you're
expecting some reads to fail, which I guess is because you could be
trying to read blocks from a relation that's been rewritten under a
different relfilenode, or partially or entirely truncated. But I
don't think it's very reasonable to just let ReadBufferWhatever() spew
error messages into the log and hope users don't panic. People will
expect an automatic prewarm solution to handle any such cases quietly,
not bleat loudly in the log. I suspect that this error-trapping code
isn't entirely correct, but there's not much point in fixing it; what
we really need to do is get rid of it (somehow).
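
One way to quiet most of that down is to bounds-check against the
current relation size before issuing the read, something like this
(sketch only; it still doesn't fix the locking problem above, and the
names are hypothetical):

    SMgrRelation smgr = smgropen(blk->rnode, InvalidBackendId);

    /* Skip blocks that no longer exist instead of letting the read ERROR. */
    if (smgrexists(smgr, blk->forknum) &&
        blk->blocknum < smgrnblocks(smgr, blk->forknum))
    {
        /* safe-ish to issue the read here */
    }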

- dump_block_info_periodically() will sleep for too long - perhaps
forever - if WaitLatch() repeatedly returns due to WL_LATCH_SET, which
can probably happen if for any reason the process receives SIGUSR1
repeatedly. Every time the latch gets set, the timeout is reset, so
it may never expire. There are examples of how to write a loop like
this correctly in various places in the server; please check one of
those.
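
The usual pattern is to track an absolute deadline and recompute the
remaining timeout after every wakeup, roughly like this (untested;
dump_interval_ms, got_sigterm, and dump_block_info() are placeholders):

    TimestampTz next_dump = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                                        dump_interval_ms);

    while (!got_sigterm)
    {
        long        secs;
        int         usecs;
        int         rc;

        /* Recompute the remaining time; don't restart the full interval. */
        TimestampDifference(GetCurrentTimestamp(), next_dump, &secs, &usecs);

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                       secs * 1000L + usecs / 1000);
        ResetLatch(MyLatch);

        if (rc & WL_POSTMASTER_DEATH)
            proc_exit(1);

        if (GetCurrentTimestamp() >= next_dump)
        {
            dump_block_info();
            next_dump = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                                    dump_interval_ms);
        }
    }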

- I don't think you should need an error-catching loop in
auto_pgprewarm_main(), either. Just let the worker die if there's an
ERROR, and set the restart interval to something other than
BGW_NEVER_RESTART.

- Setting bgw_main isn't correct for extension code. Please read the
documentation on background workers, which explains how to do this
correctly in an extension.
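
Roughly, the registration in _PG_init() would look like this (the
library and entry-point names below are just placeholders):

    BackgroundWorker worker;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
    worker.bgw_start_time = BgWorkerStart_ConsistentState;
    worker.bgw_restart_time = 60;   /* seconds, not BGW_NEVER_RESTART */
    snprintf(worker.bgw_name, BGW_MAXLEN, "autoprewarm");
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_prewarm");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "autoprewarm_main");
    worker.bgw_main_arg = (Datum) 0;

    RegisterBackgroundWorker(&worker);

Leaving bgw_main zeroed makes the postmaster resolve the entry point by
name in each worker process, which is the only approach that's safe for
code living in a shared library.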

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
