Re: Proposal : For Auto-Prewarm.

From: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal : For Auto-Prewarm.
Date: 2017-06-04 07:15:25
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Wed, May 31, 2017 at 10:18 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> + * contrib/autoprewarm.c
> Wrong.

-- Oops Sorry fixed.

> + Oid database; /* database */
> + Oid spcnode; /* tablespace */
> + Oid filenode; /* relation's filenode. */
> + ForkNumber forknum; /* fork number */
> + BlockNumber blocknum; /* block number */
> Why spcnode rather than tablespace? Also, the third line has a
> period, but not the others. I think you could just drop these
> comments; they basically just duplicate the structure names, except
> for spcnode which doesn't but probably should.

-- Dropped comments and changed spcnode to tablespace.

> + bool can_do_prewarm; /* if set can't do prewarm task. */
> The comment and the name of the variable imply opposite meanings.

-- Sorry a typo. Now this variable has been removed as you have
suggested with new variables in AutoPrewarmSharedState.

> +/*
> + * detach_blkinfos - on_shm_exit detach the dsm allocated for blockinfos.
> + */
> I assume you don't really need this. Presumably process exit is going
> to detach the DSM anyway.

-- Yes considering process exit will detach the dsm, I have removed
that function.

> + if (seg == NULL)
> + ereport(ERROR,
> + errmsg("unable to map dynamic shared memory segment")));
> Let's use the wording from the message in parallel.c rather than this
> wording. Actually, we should probably (as a separate patch) change
> test_shm_mq's worker.c to use the parallel.c wording also.

-- I have corrected the message with "could not map dynamic shared
memory segment" as in parallel.c

> + SetCurrentStatementStartTimestamp();
> Why do we need this?

-- Removed Sorry forgot to remove same when I removed the SPI connection code.

> + StartTransactionCommand();
> Do we need a transaction? If so, how about starting a new transaction
> for each relation instead of using a single one for the entire
> prewarm?

-- We do relation_open hence need a transaction. As suggested now we
start a new transaction on every new relation.

> + if (status == BGWH_STOPPED)
> + return;
> +
> + if (status == BGWH_POSTMASTER_DIED)
> + {
> + ereport(ERROR,
> + errmsg("cannot start bgworker autoprewarm without postmaster"),
> + errhint("Kill all remaining database processes and restart"
> + " the database.")));
> + }
> +
> + Assert(0);
> Instead of writing it this way, remove the first "if" block and change
> the second one to Assert(status == BGWH_STOPPED). Also, I'd ditch the
> curly braces in this case.

-- Fixed as suggested.

> + file = fopen(AUTOPREWARM_FILE, PG_BINARY_R);
> Use AllocateFile() instead. See the comments for that function and at
> the top of fd.c. Then you don't need to worry about leaking the file
> handle on an error, so you can remove the fclose() before ereport()
now> stuff -- which is incomplete anyway, because you could fail e.g.
> inside dsm_create().

-- Using AllocateFile now.

> + errmsg("Error reading num of elements in \"%s\" for"
> + " autoprewarm : %m", AUTOPREWARM_FILE)));
> As I said in a previous review, please do NOT split error messages
> across lines like this. Also, this error message is nothing close to
> PostgreSQL style. Please read
> and learn to follow all those guidelines written therein. I see at
> least 3 separate problems with this message.

-- Thanks, I have tried to fix it now.

> + num_elements = i;
> I'd do something like if (i != num_elements) elog(ERROR, "autoprewarm
> block dump has %d entries but expected %d", i, num_elements); It
> seems OK for this to be elog() rather than ereport() because the file
> should never be corrupt unless the user has cheated by hand-editing
> it.

-- Fixed as suggested. Now eloged as an ERROR.

> I think you can get rid of next_db_pos altogether, and this
> prewarm_elem thing too. Design sketch:
> 1. Move all the stuff that's in prewarm_elem into
> AutoPrewarmSharedState. Rename start_pos to prewarm_start_idx and
> end_of_blockinfos to prewarm_stop_idx.
> 2. Instead of computing all of the database ranges first and then
> launching workers, do it all in one loop. So figure out where the
> records for the current database end and set prewarm_start_idx and
> prewarm_end_idx appropriately. Launch a worker. When the worker
> terminates, set prewarm_start_idx = prewarm_end_idx and advance
> prewarm_end_idx to the end of the next database's records.
> This saves having to allocate memory for the next_db_pos array, and it
> also avoids this crock:
> + memcpy(&pelem, MyBgworkerEntry->bgw_extra, sizeof(prewarm_elem));

-- Fixed as suggested.

> The reason that's bad is because it only works so long as bgw_extra is
> large enough to hold prewarm_elem. If prewarm_elem grows or bgw_extra
> shrinks, this turns into a buffer overrun.

-- passing prewarm info through bgw_extra helped us to restrict the
scope and lifetime of prewarm_elem only to prewarm task. Moving them
to shared memory made them global even though they are not needed once
prewarm task is finished. As there are other disadvantages of using
bgw_extra I have now implemented as you have suggested.

> I would use AUTOPREWARM_FILE ".tmp" rather than a name incorporating
> the PID for the temporary file. Otherwise, you might leave many
> temporary files behind under different names. If you use the same
> name every time, you'll never have more than one, and the next
> successful dump will end up getting rid of it along the way.

-- Fixed as sugested. Previosuly PID was used so that concurrent dump
can happen between dump worker and immediate dump as they will write
to two different files. With new way of registering PID before file
access in shared memory I think that problem can be adressed.

> + pfree(block_info_array);
> + CloseTransientFile(fd);
> + unlink(transient_dump_file_path);
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("error writing to \"%s\" : %m",
> Again, this is NOT a standard error message text. It occurs in zero
> other places in the source tree. You are not the first person to need
> an error message for a failed write to a file; please look at what the
> previous authors did. Also, the pfree() before report is not needed;
> isn't the whole process going to terminate? Also, you can't really use
> errcode_for_file_access() here, because you've meanwhile done
> CloseTransientFile() and unlink(), which will have clobbered errno.

-- Removed pfree, saved errno before CloseTransientFile() and unlink()

> + ereport(LOG, (errmsg("saved metadata info of %d blocks", num_blocks)));
> Not project style for ereport(). Line break after the first comma.
> Similarly elsewhere.

-- Tried to fix same

> + * dump_now - the main routine which goes through each buffer
> header of buffer
> + * pool and dumps their meta data. We Sort these data and then dump them.
> + * Sorting is necessary as it facilitates sequential read during load.
> This is no longer true, because you moved the sort to the load side.
> It's also not capitalized properly.

-- Sorry removed now.

> Discussions of the format of the autoprewarm dump file involve
> inexplicably varying number of < and > symbols:
> + * <<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>> in
> + * <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> and we shall call it
> + buflen = sprintf(buf, "%u,%u,%u,%u,%u\n",

-- Sorry fixed now, in all of the places the block info formats will
not have such (</>) delimiter.

> +#ifndef __AUTOPREWARM_H__
> +#define __AUTOPREWARM_H__
> We don't use double-underscored names for header files. Please
> consult existing header files for the appropriate style. Also, why
> does this file exist at all, instead of just including them in the .c
> file? The pointer of a header is for things that need to be included
> by multiple .c files, but there's no such need here.

-- This was done to fix one of the previous review comments. I have
moved them back to .c file.

> + * load. If there are no other block infos than the global objects
> + * we silently ignore them. Should I throw error?
> Silently ignoring them seems fine. Throwing an error doesn't seem
> like it would improve things.

-- Okay thanks.

> + /*
> + * Register a sub-worker to load new database's block. Wait until the
> + * sub-worker finish its job before launching next sub-worker.
> + */
> + launch_prewarm_subworker(&pelem);
> The function name implies that it launches the worker, but the comment
> implies that it also waits for it to terminate. Functions should be
> named in a way that matches what they do.

-- Have renamed it to launch_and_wait_for_per_database_worker

> I feel like the get_autoprewarm_task() function is introducing fairly
> baroque logic for something that really ought to be more simple. All
> autoprewarm_main() really needs to do is:
> if (!state->autoprewarm_done)
> autoprewarm();
> dump_block_info_periodically();

-- Have simplified things as suggested now. Function
get_autoprewarm_task has been removed.

> The locking in autoprewarm_dump_now() is a bit tricky. There are two
> trouble cases. One is that we try to rename() our new dump file on
> top of the existing one while a background worker is still using it to
> perform an autoprewarm. The other is that we try to write a new
> temporary dump file while some other process is trying to do the same
> thing. I think we should fix this by storing a PID in
> AutoPrewarmSharedState; a process which wants to perform a dump or an
> autoprewarm must change the PID from 0 to their own PID, and change it
> back to 0 on successful completion or error exit. If we go to perform
> an immediate dump process and finds a non-zero value already just does
> ereport(ERROR, ...), including the PID of the other process in the
> message (e.g. "unable to perform block dump because dump file is being
> used by PID %d"). In a background worker, if we go to dump and find
> the file in use, log a message (e.g. "skipping block dump because it
> is already being performed by PID %d", "skipping prewarm because block
> dump file is being rewritten by PID %d").

-- Fixed as suggested.

> I also think we should change is_bgworker_running to a PID, so that if
> we try to launch a new worker we can report something like
> "autoprewarm worker is already running under PID %d".

-- Fixed. I could only "LOG" about another autoprewarm worker already
running and then exit. Because on ERROR we try to restart the worker,
so do not want to restart such workers.

> So putting that all together, I suppose AutoPrewarmSharedState should
> end up looking like this:
> LWLock lock; /* mutual exclusion */
> pid_t bgworker_pid; /* for main bgworker */
> pid_t pid_using_dumpfile; /* for autoprewarm or block dump */

-- I think one more member is required which state whether prewarm can
be done when the worker restarts.

> /* following items are for communication with per-database worker */
> dsm_handle block_info_handle;
> Oid database;
> int prewarm_start_idx;
> int prewarm_stop_idx;

-- Fixed as suggested

> I suggest going through and changing "subworker" to "per-database
> worker" throughout.

-- Fixed as suggested.

> BTW, have you tested how long this takes to run with, say, shared_buffers = 8GB?

I have tried same on my local machine with ssd as a storage.

settings: shared_buffers = 8GB, loaded data with pg_bench scale_factor=1000.

Total blocks got dumped

5 different load time based logs

2017-06-04 11:30:26.460 IST [116253] LOG: autoprewarm has started
2017-06-04 11:30:43.443 IST [116253] LOG: autoprewarm load task ended
-- 17 secs

2017-06-04 11:31:13.565 IST [116291] LOG: autoprewarm has started
2017-06-04 11:31:30.317 IST [116291] LOG: autoprewarm load task ended
-- 17 secs

2017-06-04 11:32:12.995 IST [116329] LOG: autoprewarm has started
2017-06-04 11:32:29.982 IST [116329] LOG: autoprewarm load task ended
-- 17 secs

2017-06-04 11:32:58.974 IST [116361] LOG: autoprewarm has started
2017-06-04 11:33:15.017 IST [116361] LOG: autoprewarm load task ended
-- 17secs

2017-06-04 12:15:49.772 IST [117936] LOG: autoprewarm has started
2017-06-04 12:16:11.012 IST [117936] LOG: autoprewarm load task ended
-- 22 secs.

So mostly from 17 to 22 secs.

But I think I need to do tests on a larger set of configuration on
different storage types. I shall do same and upload later. I have also
uploaded latest performance test results (on my local machine ssd
configuration: shared_buffer = 8GB,
test setting: scale_factor=300 (data fits to shared_buffers) pgbench clients =1

PGBENCH_RUN="./pgbench --no-vacuum --protocol=prepared --time=5 -j 1
-c 1 --select-only postgres"
| grep excluding | cut -d ' ' -f 3); TIME=$((SECONDS-START_TIME));
echo $TIME, $TPS; done

Thanks and Regards
Mithun C Y

Attachment Content-Type Size
autoprewarm_11.patch application/octet-stream 35.7 KB
autoprewarm_performance_report.ods application/vnd.oasis.opendocument.spreadsheet 34.4 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Brent Douglas 2017-06-04 08:37:12 Re: proposal psql \gdesc
Previous Message Fabien COELHO 2017-06-04 07:13:28 Re: proposal psql \gdesc