Re: Proposal : For Auto-Prewarm.

From: andres(at)anarazel(dot)de (Andres Freund)
To: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal : For Auto-Prewarm.
Date: 2017-04-05 22:42:22
Message-ID: 20170405224222.l2nwmg5j5sd4q6qr@alap3.anarazel.de
Lists: pgsql-hackers

On 2017-03-13 18:45:00 +0530, Mithun Cy wrote:
> I have implemented a similar logic now. The prewarm bgworker will
> launch a sub-worker per database in the dump file. And, each
> sub-worker will load its database block info. The sub-workers will be
> launched only after previous one is finished. All of this will only
> start if the database has reached a consistent state.

Hm. For replay performance it'd possibly be good to start earlier,
before reaching consistency. Is there an issue starting earlier?

> diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
> new file mode 100644
> index 0000000..f4b34ca
> --- /dev/null
> +++ b/contrib/pg_prewarm/autoprewarm.c
> @@ -0,0 +1,1137 @@
> +/*-------------------------------------------------------------------------
> + *
> + * autoprewarm.c
> + *
> + * -- Automatically prewarm the shared buffer pool when server restarts.

Don't think we usually use -- here.

> + * Copyright (c) 2013-2017, PostgreSQL Global Development Group

Hm, that's a bit of a weird date range.

> + * IDENTIFICATION
> + * contrib/pg_prewarm.c/autoprewarm.c
> + *-------------------------------------------------------------------------
> + */

The pg_prewarm.c in there looks like some search & replace gone awry.

> +#include "postgres.h"
> +#include <unistd.h>
> +
> +/* These are always necessary for a bgworker. */
> +#include "miscadmin.h"
> +#include "postmaster/bgworker.h"
> +#include "storage/ipc.h"
> +#include "storage/latch.h"
> +#include "storage/lwlock.h"
> +#include "storage/proc.h"
> +#include "storage/shmem.h"
> +
> +/* These are necessary for prewarm utilities. */
> +#include "pgstat.h"
> +#include "storage/buf_internals.h"
> +#include "storage/smgr.h"
> +#include "utils/memutils.h"
> +#include "utils/resowner.h"
> +#include "utils/guc.h"
> +#include "catalog/pg_class.h"
> +#include "catalog/pg_type.h"
> +#include "executor/spi.h"
> +#include "access/xact.h"
> +#include "utils/rel.h"
> +#include "port/atomics.h"

I'd rather just sort these alphabetically.

I think this should rather be in the initial header.

> +/*
> + * autoprewarm :
> + *
> + * What is it?
> + * ===========
> + * A bgworker which automatically records information about blocks which were
> + * present in buffer pool before server shutdown and then prewarm the buffer
> + * pool upon server restart with those blocks.
> + *
> + * How does it work?
> + * =================
> + * When the shared library "pg_prewarm" is preloaded, a
> + * bgworker "autoprewarm" is launched immediately after the server has reached
> + * consistent state. The bgworker will start loading blocks recorded in the
> + * format BlockInfoRecord
> + * <<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>> in
> + * $PGDATA/AUTOPREWARM_FILE, until there is a free buffer left in the buffer
> + * pool. This way we do not replace any new blocks which were loaded either by
> + * the recovery process or the querying clients.

s/until there is a/until there is no/?

> +/*
> + * ============================================================================
> + * =========================== SIGNAL HANDLERS ===========================
> + * ============================================================================
> + */

Hm...

> +static void sigtermHandler(SIGNAL_ARGS);
> +static void sighupHandler(SIGNAL_ARGS);

I don't think that's a casing we commonly use. We mostly use CamelCase
or underscore_case.

> +/*
> + * Signal handler for SIGUSR1.
> + */
> +static void
> +sigusr1Handler(SIGNAL_ARGS)
> +{
> + int save_errno = errno;
> +
> + if (MyProc)
> + SetLatch(&MyProc->procLatch);
> +
> + errno = save_errno;
> +}

Hm, what's this one for?

> +/*
> + * Shared state information about the running autoprewarm bgworker.
> + */
> +typedef struct AutoPrewarmSharedState
> +{
> + pg_atomic_uint32 current_task; /* current tasks performed by
> + * autoprewarm workers. */
> +} AutoPrewarmSharedState;

Hm. Why do we need atomics here? I thought there's no concurrency?

> +/*
> + * sort_cmp_func - compare function used for qsort().
> + */
> +static int
> +sort_cmp_func(const void *p, const void *q)
> +{

rename to blockinfo_cmp?
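
E.g. something like this (standalone, untested sketch; field names are taken from the BlockInfoRecord layout quoted elsewhere in the patch, and the exact comparison priority is my assumption):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Mirrors the BlockInfoRecord fields quoted from the patch. */
typedef struct BlockInfoRecord
{
	uint32_t	database;
	uint32_t	spcNode;
	uint32_t	filenode;
	uint32_t	forknum;
	uint32_t	blocknum;
} BlockInfoRecord;

/* Compare two uint32 fields, returning -1/0/1 without overflow. */
static int
cmp_u32(uint32_t a, uint32_t b)
{
	return (a > b) - (a < b);
}

/*
 * blockinfo_cmp - qsort() comparator ordering records by database,
 * tablespace, relfilenode, fork, and block number, in that priority.
 */
static int
blockinfo_cmp(const void *p, const void *q)
{
	const BlockInfoRecord *a = (const BlockInfoRecord *) p;
	const BlockInfoRecord *b = (const BlockInfoRecord *) q;
	int			r;

	if ((r = cmp_u32(a->database, b->database)) != 0)
		return r;
	if ((r = cmp_u32(a->spcNode, b->spcNode)) != 0)
		return r;
	if ((r = cmp_u32(a->filenode, b->filenode)) != 0)
		return r;
	if ((r = cmp_u32(a->forknum, b->forknum)) != 0)
		return r;
	return cmp_u32(a->blocknum, b->blocknum);
}
```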

> +static AutoPrewarmTask
> +get_autoprewarm_task(AutoPrewarmTask todo_task)
> +{
> + bool found;
> +
> + state = NULL;
> +
> + LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
> + state = ShmemInitStruct("autoprewarm",
> + sizeof(AutoPrewarmSharedState),
> + &found);
> + if (!found)
> + pg_atomic_write_u32(&(state->current_task), todo_task);

Superfluous parens (repeated a lot).

> + LWLockRelease(AddinShmemInitLock);
> +
> + /* If found check if we can go ahead. */
> + if (found)
> + {
> + if (pg_atomic_read_u32(&(state->current_task)) ==
> + TASK_PREWARM_BUFFERPOOL)

You repeat the read in every branch - why don't you store it in a
variable instead?

That aside, the use of an atomic doesn't seem to actually gain us
anything here. If we need control over concurrency it seems a lot
better to instead use a lwlock or spinlock. There's no contention here,
using lock-free stuff just increases complexity without a corresponding
benefit.
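
I.e. something like this (untested sketch; transition_task and the lock field are made-up names, and only the first branch of your state machine is shown):

```c
/*
 * Untested sketch: plain integer protected by an LWLock; "lock" and its
 * tranche registration would be new additions, names are illustrative.
 */
typedef struct AutoPrewarmSharedState
{
	LWLock	   *lock;			/* protects current_task */
	uint32		current_task;	/* plain value, no atomics needed */
} AutoPrewarmSharedState;

static AutoPrewarmTask
transition_task(AutoPrewarmSharedState *state, AutoPrewarmTask todo_task)
{
	AutoPrewarmTask result;

	LWLockAcquire(state->lock, LW_EXCLUSIVE);
	/* read-modify-write is safe here because we hold the lock */
	if (state->current_task == TASK_PREWARM_BUFFERPOOL &&
		todo_task == TASK_PREWARM_BUFFERPOOL)
		state->current_task = TASK_DUMP_BUFFERPOOL_INFO;
	result = state->current_task;
	LWLockRelease(state->lock);

	return result;
}
```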

> + {
> + if (todo_task == TASK_PREWARM_BUFFERPOOL)
> + {
> + /*
> + * we were prewarming and we are back to do same, time to
> + * abort prewarming and move to dumping.
> + */

I'm not sure what "back to do same" should mean here - changing to a
different type of task surely is not the same.

> + pg_atomic_write_u32(&(state->current_task),
> + TASK_DUMP_BUFFERPOOL_INFO);
> + return TASK_DUMP_BUFFERPOOL_INFO;
> + }
> + else
> + return TASK_END; /* rest all cannot proceed further. */

What does that comment mean?

> + }
> + else if (pg_atomic_read_u32(&(state->current_task)) ==
> + TASK_DUMP_IMMEDIATE_ONCE)
> + {
> + uint32 current_state = TASK_DUMP_IMMEDIATE_ONCE;
> +
> + /* We cannot do a TASK_PREWARM_BUFFERPOOL but rest can go ahead */
> + if (todo_task == TASK_DUMP_IMMEDIATE_ONCE)
> + return TASK_DUMP_IMMEDIATE_ONCE;
> +
> + if (todo_task == TASK_PREWARM_BUFFERPOOL)
> + todo_task = TASK_DUMP_BUFFERPOOL_INFO; /* skip to do dump only */
> +
> + /*
> + * first guy who can atomically set the current_task get the
> + * opportunity to proceed further
> + */
> + if (pg_atomic_compare_exchange_u32(&(state->current_task),
> + &current_state,
> + TASK_DUMP_BUFFERPOOL_INFO))
> + {
> + /* Wow! We won the race proceed with the task. */
> + return TASK_DUMP_BUFFERPOOL_INFO;
> + }
> + else
> + return TASK_END;

Note that it's not generally guaranteed that any
pg_atomic_compare_exchange_u32 actually wins, it could temporarily fail
for all.

> +/*
> + * getnextblockinfo -- given a BlkType get its next BlockInfoRecord from the
> + * dump file.
> + */
> +static BlkType
> +getnextblockinfo(FILE *file, BlockInfoRecord *currblkinfo, BlkType reqblock,
> + BlockInfoRecord *newblkinfo)
> +{
> + BlkType nextblk;
> +
> + while (true)
> + {
> + /* get next block. */
> + if (5 != fscanf(file, "%u,%u,%u,%u,%u\n", &(newblkinfo->database),
> + &(newblkinfo->spcNode), &(newblkinfo->filenode),
> + (uint32 *) &(newblkinfo->forknum),
> + &(newblkinfo->blocknum)))
> + return BLKTYPE_END; /* No more valid entry hence stop processing. */

Hm. Is it actually helpful to store the file as text? That's commonly
going to increase the size of the file quite considerably, no?
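
For illustration, a binary record is a fixed 5 * sizeof(uint32), while the text form above grows with the digit counts of typical OIDs and block numbers (standalone sketch, not patch code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Same five fields the patch writes per record. */
typedef struct BlockInfoRecord
{
	uint32_t	database, spcNode, filenode, forknum, blocknum;
} BlockInfoRecord;

/* Length of one record in the patch's text format, including newline. */
static size_t
text_record_size(const BlockInfoRecord *r)
{
	char		buf[64];

	return (size_t) snprintf(buf, sizeof(buf), "%u,%u,%u,%u,%u\n",
							 r->database, r->spcNode, r->filenode,
							 r->forknum, r->blocknum);
}
```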

> +/*
> + * GetRelOid -- given a filenode get its relation oid.
> + */
> +static Oid
> +get_reloid(Oid filenode)
> +{

Function and comment don't agree on naming.

But what is this actually used for? I thought Robert, in
http://archives.postgresql.org/message-id/CA%2BTgmoa%3DUqCL2mR%2B9WTq05tB3Up-z4Sv2wkzkDxDwBP7Mj_2_w%40mail.gmail.com
suggested storing the filenode in the dump, and then to use
RelidByRelfilenode to get the corresponding relation?

It seems a lot better to use relfilenodes, because otherwise table
rewrites will lead to reloading wrong things.

> + int ret;
> + Oid relationid;
> + bool isnull;
> + Datum value[1] = {ObjectIdGetDatum(filenode)};
> + StringInfoData buf;
> + Oid ptype[1] = {OIDOID};
> +
> + initStringInfo(&buf);
> + appendStringInfo(&buf,
> + "select oid from pg_class where pg_relation_filenode(oid) = $1");
> +
> + ret = SPI_execute_with_args(buf.data, 1, (Oid *) &ptype, (Datum *) &value,
> + NULL, true, 1);
> +
> + if (ret != SPI_OK_SELECT)
> + ereport(FATAL, (errmsg("SPI_execute failed: error code %d", ret)));
> +
> + if (SPI_processed < 1)
> + return InvalidOid;
> +
> + relationid = DatumGetObjectId(SPI_getbinval(SPI_tuptable->vals[0],
> + SPI_tuptable->tupdesc,
> + 1, &isnull));
> + if (isnull)
> + return InvalidOid;
> +
> + return relationid;
> +}

Doing this via SPI doesn't strike me as a good idea - that's really
quite expensive. Why not call the underlying function directly?
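
With relfilenodes in the dump, that whole function could presumably shrink to something like (untested sketch; assumes the tablespace OID is carried in the dumped record, as it already is in the buffer tag):

```c
#include "utils/relfilenodemap.h"

/*
 * Sketch: map a relfilenode back to a relation OID without going through
 * SPI.  RelidByRelfilenode() is the existing backend helper; it returns
 * InvalidOid when no matching relation exists.
 */
static Oid
get_reloid(Oid spcNode, Oid filenode)
{
	return RelidByRelfilenode(spcNode, filenode);
}
```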

> +/*
> + * load_one_database -- start of prewarm sub-worker, this will try to load
> + * blocks of one database starting from block info position passed by main
> + * prewarm worker.
> + */
> +void
> +load_one_database(Datum main_arg)
> +{

> + /* check if file exists and open file in read mode. */
> + snprintf(dump_file_path, sizeof(dump_file_path), "%s", AUTOPREWARM_FILE);
> + file = fopen(dump_file_path, PG_BINARY_R);
> + if (!file)
> + return; /* No file to load. */

Shouldn't this be an error case? In which case is it ok for the file to
be gone after we launched the worker?

> + /*
> + * It should be a block info belonging to a new database. Or else dump
> + * file is corrupted better to end the loading of bocks now.
> + */
> + if (loadblocktype != BLKTYPE_NEW_DATABASE)
> + goto end_load; /* should we raise a voice here? */

Yes, this should raise an error.

> + case BLKTYPE_NEW_RELATION:
> +
> + /*
> + * release lock on previous relation.
> + */
> + if (rel)
> + {
> + relation_close(rel, AccessShareLock);
> + rel = NULL;
> + }
> +
> + loadblocktype = BLKTYPE_NEW_RELATION;
> +
> + /*
> + * lock new relation.
> + */
> + reloid = get_reloid(toload_block.filenode);
> +
> + if (!OidIsValid(reloid))
> + break;
> +
> + rel = try_relation_open(reloid, AccessShareLock);
> + if (!rel)
> + break;
> + RelationOpenSmgr(rel);

Now I'm confused. Your get_reloid used pg_relation_filenode() to map
from relation oid to filenode - and then you're using it to lock the
relation? Something's wrong.

> + case BLKTYPE_NEW_FORK:
> +
> + /*
> + * check if fork exists and if block is within the range
> + */
> + loadblocktype = BLKTYPE_NEW_FORK;
> + if ( /* toload_block.forknum > InvalidForkNumber &&
> + * toload_block.forknum <= MAX_FORKNUM && */
> + !smgrexists(rel->rd_smgr, toload_block.forknum))
> + break;

Huh? What's with that commented out section of code?

> + case BLKTYPE_NEW_BLOCK:
> +
> + /* check if blocknum is valid and with in fork file size. */
> + if (toload_block.blocknum >= nblocks)
> + {
> + /* move to next forknum. */
> + loadblocktype = BLKTYPE_NEW_FORK;
> + break;
> + }

Hm. Why does the size of the underlying file allow us to skip to the
next fork? Don't we have to read all the pending dump records?

> + buf = ReadBufferExtended(rel, toload_block.forknum,
> + toload_block.blocknum, RBM_NORMAL,
> + NULL);
> + if (BufferIsValid(buf))
> + {
> + ReleaseBuffer(buf);
> + }
> +
> + loadblocktype = BLKTYPE_NEW_BLOCK;
> + break;

Hm. RBM_NORMAL will error out in a bunch of cases, is that ok?

> + if (have_dbconnection)
> + {
> + SPI_finish();
> + PopActiveSnapshot();
> + CommitTransactionCommand();
> + }
> + return;
> +}

Are we really ok keeping open a transaction through all of this? That
could potentially be quite long, no? How about doing that on a per-file
basis, or even moving to session locks altogether?

> +/* This sub-module is for periodically dumping buffer pool's block info into
> + * a dump file AUTOPREWARM_FILE.
> + * Each entry of block info looks like this:
> + * <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> and we shall call it
> + * as BlockInfoRecord.
> + *
> + * Contents of AUTOPREWARM_FILE has been formated such a way that
> + * blockInfoRecord of each database can be given to different prewarm workers.
> + *
> + * format of AUTOPREWAM_FILE
> + * =======================================
> + * [offset position of database map table]
> + * [sorted BlockInfoRecords..............]
> + * [database map table]
> + * =======================================

This doesn't mention storing things as ASCII instead of binary...

> + * The [database map table] is sequence of offset in file which will point to
> + * first BlockInfoRecords of each database in the dump. The prewarm worker
> + * will read this offset one by one in sequence and ask its subworker to seek
> + * to this position and then start loading the BlockInfoRecords one by one
> + * until it see a BlockInfoRecords of a different database than it is actually
> + * connected to.
> + * NOTE : We store off_t inside file so the dump file will not be portable to
> + * be used across systems where sizeof off_t is different from each other.
> + */

Why are we using off_t? Shouldn't this just be BlockNumber?

> +static uint32
> +dump_now(void)
> +{
> + static char dump_file_path[MAXPGPATH],

> +
> + for (num_blocks = 0, i = 0; i < NBuffers; i++)
> + {
> + uint32 buf_state;
> +
> + bufHdr = GetBufferDescriptor(i);
> +
> + /* lock each buffer header before inspecting. */
> + buf_state = LockBufHdr(bufHdr);
> +
> + if (buf_state & BM_TAG_VALID)
> + {
> + block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
> + block_info_array[num_blocks].spcNode = bufHdr->tag.rnode.spcNode;
> + block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
> + block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
> + block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
> + ++num_blocks;
> + }
> +
> + UnlockBufHdr(bufHdr, buf_state);

> + }
> +
> + /* sorting now only to avoid sorting while loading. */

"sorting while loading"? You mean random accesses?

> + pg_qsort(block_info_array, num_blocks, sizeof(BlockInfoRecord),
> + sort_cmp_func);

> + snprintf(transient_dump_file_path, sizeof(dump_file_path),
> + "%s.%d", AUTOPREWARM_FILE, MyProcPid);
> + file = fopen(transient_dump_file_path, "w");
> + if (file == NULL)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("autoprewarm: could not open \"%s\": %m",
> + dump_file_path)));

What if that file already exists? You're not truncating it. Also, what
if we error out in the middle of this? We'll leak an fd. I think this
needs to use OpenTransientFile etc.
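
Roughly (untested sketch; the flags/mode shown are my assumption, and note the transient-file API takes a mode argument):

```c
#include <fcntl.h>
#include "storage/fd.h"

/*
 * Sketch: create-or-truncate the transient dump file through the
 * fd-tracking wrappers, so an ERROR mid-dump can't leak the descriptor.
 */
int			fd = OpenTransientFile(transient_dump_file_path,
								   O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
								   S_IRUSR | S_IWUSR);

if (fd < 0)
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("autoprewarm: could not open \"%s\": %m",
					transient_dump_file_path)));
/* ... write records ... */
CloseTransientFile(fd);
```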

> + snprintf(dump_file_path, sizeof(dump_file_path),
> + "%s", AUTOPREWARM_FILE);
> + ret = fprintf(file, "%020jd\n", (intmax_t) 0);
> + if (ret < 0)
> + {
> + fclose(file);
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("autoprewarm: error writing to \"%s\" : %m",
> + dump_file_path)));
> + }
> +
> + database_map_table[num_db++] = ftello(file);
> +
> + for (i = 0; i < num_blocks; i++)
> + {
> + if (i > 0 && block_info_array[i].database != prev_database)
> + {
> + if (num_db == database_map_table_size)
> + {
> + database_map_table_size *= 2; /* double and repalloc. */
> + database_map_table =
> + (off_t *) repalloc(database_map_table,
> + sizeof(off_t) * database_map_table_size);
> + }
> + fflush(file);
> + database_map_table[num_db++] = ftello(file);
> + }
> +
> + ret = fprintf(file, "%u,%u,%u,%u,%u\n",
> + block_info_array[i].database,
> + block_info_array[i].spcNode,
> + block_info_array[i].filenode,
> + (uint32) block_info_array[i].forknum,
> + block_info_array[i].blocknum);
> + if (ret < 0)
> + {
> + fclose(file);
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("autoprewarm: error writing to \"%s\" : %m",
> + dump_file_path)));
> + }
> +
> + prev_database = block_info_array[i].database;
> + }

I think we should check for interrupts somewhere in that (and the
preceding) loop.

> +/*
> + * dump_block_info_periodically - at regular intervals, which is defined by GUC
> + * dump_interval, dump the info of blocks which are present in buffer pool.
> + */
> +void
> +dump_block_info_periodically()
> +{

Suggest adding void to the parameter list.

> + pg_time_t last_dump_time = (pg_time_t) time(NULL);
> +
> + while (!got_sigterm)
> + {
> + int rc;
> + pg_time_t now;
> + int elapsed_secs = 0,
> + timeout = AT_PWARM_DEFAULT_DUMP_INTERVAL;
> +
> + if (dump_interval > AT_PWARM_DUMP_AT_SHUTDOWN_ONLY)
> + {
> + now = (pg_time_t) time(NULL);
> + elapsed_secs = now - last_dump_time;
> +
> + if (elapsed_secs > dump_interval)
> + {
> + dump_now();
> + if (got_sigterm)
> + return; /* got shutdown signal just after a dump. And,
> + * I think better to return now. */
> + last_dump_time = (pg_time_t) time(NULL);
> + elapsed_secs = 0;
> + }
> +
> + timeout = dump_interval - elapsed_secs;
> + }

I suggest using GetCurrentTimestamp() and TimestampDifferenceExceeds()
instead.
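
I.e. roughly (untested sketch; the surrounding loop structure is as in the quoted patch):

```c
#include "utils/timestamp.h"

/* Sketch: timestamp-based scheduling instead of time(NULL) arithmetic. */
TimestampTz last_dump_time = GetCurrentTimestamp();

/* ... inside the loop ... */
if (TimestampDifferenceExceeds(last_dump_time,
							   GetCurrentTimestamp(),
							   dump_interval * 1000))
{
	dump_now();
	last_dump_time = GetCurrentTimestamp();
}
```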

> + /* Has been set not to dump. Nothing more to do. */
> + if (dump_interval == AT_PWARM_OFF)
> + return;
> +
> + ResetLatch(&MyProc->procLatch);
> + rc = WaitLatch(&MyProc->procLatch,
> + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
> + timeout * 1000, PG_WAIT_EXTENSION);
> +
> + if (rc & WL_POSTMASTER_DEATH)
> + proc_exit(1);
> +
> + /*
> + * In case of a SIGHUP, just reload the configuration.
> + */
> + if (got_sighup)
> + {
> + got_sighup = false;
> + ProcessConfigFile(PGC_SIGHUP);
> + }
> + }
> +
> + /* One last block meta info dump while postmaster shutdown. */
> + if (dump_interval != AT_PWARM_OFF)
> + dump_now();

Uh, afaics we'll also do this if somebody SIGTERMed the process
interactively?

> +/* Extension's entry point. */
> +void
> +_PG_init(void)
> +{
> + BackgroundWorker autoprewarm;
> +
> + /* Define custom GUC variables. */
> + DefineCustomIntVariable("pg_prewarm.dump_interval",
> + "Sets the maximum time between two buffer pool dumps",
> + "If set to Zero, timer based dumping is disabled."
> + " If set to -1, stops the running autoprewarm.",
> + &dump_interval,
> + AT_PWARM_DEFAULT_DUMP_INTERVAL,
> + AT_PWARM_OFF, INT_MAX / 1000,
> + PGC_SIGHUP,
> + GUC_UNIT_S,
> + NULL,
> + NULL,
> + NULL);
> +
> + /* if not run as a preloaded library, nothing more to do here! */
> + if (!process_shared_preload_libraries_in_progress)
> + return;
> +
> + DefineCustomStringVariable("pg_prewarm.default_database",
> + "default database to connect if dump has not recorded same.",
> + NULL,
> + &default_database,
> + "postgres",
> + PGC_POSTMASTER,
> + 0,
> + NULL,
> + NULL,
> + NULL);

I don't think it's a good idea to make guc registration depending on
process_shared_preload_libraries_in_progress.

You should also use EmitWarningsOnPlaceholders() somewhere here.

I also wonder whether we don't need to use prefetch to actually make
this fast enough.

I think it's pretty clear that this needs a bit more work and thus won't
be ready for v10. Moved to the next CF.

- Andres
