Re: Proposal : For Auto-Prewarm.

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal : For Auto-Prewarm.
Date: 2017-05-31 16:48:06
Message-ID: CA+TgmoZ13ybkqM71bvi-wwpZ=3BoLaDhLOMJL=70Dy+r21TLVA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, May 30, 2017 at 8:15 AM, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com> wrote:
> On Tue, May 30, 2017 at 10:16 AM, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com> wrote:
>> Thanks Robert,
>
> Sorry, there was one more mistake ( a typo) in dump_now() instead of
> using pfree I used free corrected same in the new patch v10.

+ * contrib/autoprewarm.c

Wrong.

+ Oid database; /* database */
+ Oid spcnode; /* tablespace */
+ Oid filenode; /* relation's filenode. */
+ ForkNumber forknum; /* fork number */
+ BlockNumber blocknum; /* block number */

Why spcnode rather than tablespace? Also, the third line has a
period, but not the others. I think you could just drop these
comments; they basically just duplicate the structure names, except
for spcnode which doesn't but probably should.

+ bool can_do_prewarm; /* if set can't do prewarm task. */

The comment and the name of the variable imply opposite meanings.

+/*
+ * detach_blkinfos - on_shm_exit detach the dsm allocated for blockinfos.
+ */
+static void
+detach_blkinfos(int code, Datum arg)
+{
+ if (seg != NULL)
+ dsm_detach(seg);
+}

I assume you don't really need this. Presumably process exit is going
to detach the DSM anyway.

+ if (seg == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("unable to map dynamic shared memory segment")));

Let's use the wording from the message in parallel.c rather than this
wording. Actually, we should probably (as a separate patch) change
test_shm_mq's worker.c to use the parallel.c wording also.

+ SetCurrentStatementStartTimestamp();

Why do we need this?

+ StartTransactionCommand();

Do we need a transaction? If so, how about starting a new transaction
for each relation instead of using a single one for the entire
prewarm?

+ if (status == BGWH_STOPPED)
+ return;
+
+ if (status == BGWH_POSTMASTER_DIED)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+ errmsg("cannot start bgworker autoprewarm without postmaster"),
+ errhint("Kill all remaining database processes and restart"
+ " the database.")));
+ }
+
+ Assert(0);

Instead of writing it this way, remove the first "if" block and change
the second one to Assert(status == BGWH_STOPPED). Also, I'd ditch the
curly braces in this case.

+ file = fopen(AUTOPREWARM_FILE, PG_BINARY_R);

Use AllocateFile() instead. See the comments for that function and at
the top of fd.c. Then you don't need to worry about leaking the file
handle on an error, so you can remove the fclose() before ereport()
stuff -- which is incomplete anyway, because you could fail e.g.
inside dsm_create().

+ errmsg("Error reading num of elements in \"%s\" for"
+ " autoprewarm : %m", AUTOPREWARM_FILE)));

As I said in a previous review, please do NOT split error messages
across lines like this. Also, this error message is nothing close to
PostgreSQL style. Please read
https://www.postgresql.org/docs/devel/static/error-style-guide.html
and learn to follow all those guidelines written therein. I see at
least 3 separate problems with this message.

+ num_elements = i;

I'd do something like if (i != num_elements) elog(ERROR, "autoprewarm
block dump has %d entries but expected %d", i, num_elements); It
seems OK for this to be elog() rather than ereport() because the file
should never be corrupt unless the user has cheated by hand-editing
it.

I think you can get rid of next_db_pos altogether, and this
prewarm_elem thing too. Design sketch:

1. Move all the stuff that's in prewarm_elem into
AutoPrewarmSharedState. Rename start_pos to prewarm_start_idx and
end_of_blockinfos to prewarm_stop_idx.

2. Instead of computing all of the database ranges first and then
launching workers, do it all in one loop. So figure out where the
records for the current database end and set prewarm_start_idx and
prewarm_end_idx appropriately. Launch a worker. When the worker
terminates, set prewarm_start_idx = prewarm_end_idx and advance
prewarm_end_idx to the end of the next database's records.

This saves having to allocate memory for the next_db_pos array, and it
also avoids this crock:

+ memcpy(&pelem, MyBgworkerEntry->bgw_extra, sizeof(prewarm_elem));

The reason that's bad is because it only works so long as bgw_extra is
large enough to hold prewarm_elem. If prewarm_elem grows or bgw_extra
shrinks, this turns into a buffer overrun.

I would use AUTOPREWARM_FILE ".tmp" rather than a name incorporating
the PID for the temporary file. Otherwise, you might leave many
temporary files behind under different names. If you use the same
name every time, you'll never have more than one, and the next
successful dump will end up getting rid of it along the way.

+ pfree(block_info_array);
+ CloseTransientFile(fd);
+ unlink(transient_dump_file_path);
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("error writing to \"%s\" : %m",
+ AUTOPREWARM_FILE)));

Again, this is NOT a standard error message text. It occurs in zero
other places in the source tree. You are not the first person to need
an error message for a failed write to a file; please look at what the
previous authors did. Also, the pfree() before report is not needed;
isn't the whole process going to terminate? Also, you can't really use
errcode_for_file_access() here, because you've meanwhile done
CloseTransientFile() and unlink(), which will have clobbered errno.

+ ereport(LOG, (errmsg("saved metadata info of %d blocks", num_blocks)));

Not project style for ereport(). Line break after the first comma.
Similarly elsewhere.

+ * dump_now - the main routine which goes through each buffer
header of buffer
+ * pool and dumps their meta data. We Sort these data and then dump them.
+ * Sorting is necessary as it facilitates sequential read during load.

This is no longer true, because you moved the sort to the load side.
It's also not capitalized properly.

Discussions of the format of the autoprewarm dump file involve
inexplicably varying number of < and > symbols:

+ * <<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>> in
+ * <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> and we shall call it
+ buflen = sprintf(buf, "%u,%u,%u,%u,%u\n",

+#ifndef __AUTOPREWARM_H__
+#define __AUTOPREWARM_H__

We don't use double-underscored names for header files. Please
consult existing header files for the appropriate style. Also, why
does this file exist at all, instead of just including them in the .c
file? The pointer of a header is for things that need to be included
by multiple .c files, but there's no such need here.

+ * load. If there are no other block infos than the global objects
+ * we silently ignore them. Should I throw error?

Silently ignoring them seems fine. Throwing an error doesn't seem
like it would improve things.

+ /*
+ * Register a sub-worker to load new database's block. Wait until the
+ * sub-worker finish its job before launching next sub-worker.
+ */
+ launch_prewarm_subworker(&pelem);

The function name implies that it launches the worker, but the comment
implies that it also waits for it to terminate. Functions should be
named in a way that matches what they do.

I feel like the get_autoprewarm_task() function is introducing fairly
baroque logic for something that really ought to be more simple. All
autoprewarm_main() really needs to do is:

if (!state->autoprewarm_done)
autoprewarm();
dump_block_info_periodically();

The locking in autoprewarm_dump_now() is a bit tricky. There are two
trouble cases. One is that we try to rename() our new dump file on
top of the existing one while a background worker is still using it to
perform an autoprewarm. The other is that we try to write a new
temporary dump file while some other process is trying to do the same
thing. I think we should fix this by storing a PID in
AutoPrewarmSharedState; a process which wants to perform a dump or an
autoprewarm must change the PID from 0 to their own PID, and change it
back to 0 on successful completion or error exit. If we go to perform
an immediate dump process and finds a non-zero value already just does
ereport(ERROR, ...), including the PID of the other process in the
message (e.g. "unable to perform block dump because dump file is being
used by PID %d"). In a background worker, if we go to dump and find
the file in use, log a message (e.g. "skipping block dump because it
is already being performed by PID %d", "skipping prewarm because block
dump file is being rewritten by PID %d").

I also think we should change is_bgworker_running to a PID, so that if
we try to launch a new worker we can report something like
"autoprewarm worker is already running under PID %d".

So putting that all together, I suppose AutoPrewarmSharedState should
end up looking like this:

LWLock lock; /* mutual exclusion */
pid_t bgworker_pid; /* for main bgworker */
pid_t pid_using_dumpfile; /* for autoprewarm or block dump */

/* following items are for communication with per-database worker */
dsm_handle block_info_handle;
Oid database;
int prewarm_start_idx;
int prewarm_stop_idx;

I suggest going through and changing "subworker" to "per-database
worker" throughout.

BTW, have you tested how long this takes to run with, say, shared_buffers = 8GB?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-05-31 16:53:49 Re: POC: Sharing record typmods between backends
Previous Message Joshua D. Drake 2017-05-31 16:46:43 Re: TAP backpatching policy