Re: Proposal : For Auto-Prewarm.

From: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal : For Auto-Prewarm.
Date: 2017-05-23 13:36:58
Message-ID: CAD__OujPm830FBP7Gnf9tyf-LUHLeeBwKdt4wVeJbxbgfWrqww@mail.gmail.com
Lists: pgsql-hackers

Thanks, Andres,

I have tried to address all of your comments. One important change in
this patch: previously we read one block info structure at a time and
loaded it, but now we read all of them together, load them into a DSA,
and then distribute the block infos to the sub-workers, which load the
corresponding blocks. With this, the portability issue I mentioned
above no longer exists, as we do not store any map tables within the
dump file.
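
To make the new dump format concrete, each dumped buffer is described
by a record like the following (a sketch; the fields match what
dump_now() captures from the buffer headers, only the exact layout
here is illustrative):

typedef struct BlockInfoRecord
{
    Oid         database;    /* database OID, 0 for shared relations */
    Oid         spcNode;     /* tablespace OID */
    Oid         filenode;    /* relfilenode of the relation */
    ForkNumber  forknum;     /* main/fsm/vm/init fork */
    BlockNumber blocknum;    /* block within that fork */
} BlockInfoRecord;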

On Thu, Apr 6, 2017 at 4:12 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2017-03-13 18:45:00 +0530, Mithun Cy wrote:
>> launched only after previous one is finished. All of this will only
>> start if the database has reached a consistent state.
>
> Hm. For replay performance it'd possibly be good to start earlier,
> before reaching consistency. Is there an issue starting earlier?

Earlier patches did start loading blocks before reaching a consistent
state, but Robert, while reviewing, found a basic flaw in that
approach: DropRelFileNodesAllBuffers does not expect others to load
blocks concurrently while it is getting rid of buffered blocks. So
loading has to be delayed until the database reaches a consistent
state, so that we can connect to each database and take a relation
lock before loading any of its blocks.

>> + * -- Automatically prewarm the shared buffer pool when server restarts.
>
> Don't think we usually use -- here.

-- Fixed.

>> + * Copyright (c) 2013-2017, PostgreSQL Global Development Group
>
> Hm, that's a bit of a weird date range.

-- Changed it to 2016-2017; is that right?

> The pg_prewarm.c in there looks like some search & replace gone awry.

-- Sorry, fixed.

>> +#include "utils/rel.h"
>> +#include "port/atomics.h"
>
> I'd rather just sort these alphabetically.
> I think this should rather be in the initial header.

-- Fixed as suggested, and moved everything into a header file,
pg_prewarm.h.

> s/until there is a/until there is no/?

-- Fixed.

>
>> +/*
>> + * ============================================================================
>> + * =========================== SIGNAL HANDLERS ===========================
>> + * ============================================================================
>> + */
>
> Hm...

-- I have reverted those cosmetic changes now.

>
>> +static void sigtermHandler(SIGNAL_ARGS);
>> +static void sighupHandler(SIGNAL_ARGS);
>
> I don't think that's a casing we commonly use. We mostly use CamelCase
> or underscore_case.
>

-- Fixed with CamelCase.

>> + * Signal handler for SIGUSR1.
>> + */
>> +static void
>> +sigusr1Handler(SIGNAL_ARGS)

> Hm, what's this one for?

-- The prewarm sub-workers notify the main worker with SIGUSR1 on
their startup/shutdown. Updated the comments.
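
It follows the usual bgworker handler pattern, roughly (a sketch, not
the exact patch code):

static volatile sig_atomic_t got_sigusr1 = false;

/* SIGUSR1: a prewarm sub-worker has started or exited. */
static void
Sigusr1Handler(SIGNAL_ARGS)
{
    int         save_errno = errno;

    got_sigusr1 = true;
    SetLatch(MyLatch);
    errno = save_errno;
}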

>> +/*
>> + * Shared state information about the running autoprewarm bgworker.
>> + */
>> +typedef struct AutoPrewarmSharedState
>> +{
>> + pg_atomic_uint32 current_task; /* current tasks performed by
>> + * autoprewarm workers. */
>> +} AutoPrewarmSharedState;
>
> Hm. Why do we need atomics here? I thought there's no concurrency?

There are 3 modes of operation in autoprewarm:
A: a preloaded prewarm bgworker, which prewarms and then dumps periodically;
B: a bgworker launched by launch_autoprewarm_dump(), which does only
the periodic dumping;
C: an immediate dump by backends.

We do not want 2 bgworkers started and working concurrently, and we
do not want to dump while a prewarm task is running. As you
suggested, I rewrote this as simple logic using a boolean variable.
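
The simplified shared state looks roughly like this (a sketch; the
field names are illustrative):

typedef struct AutoPrewarmSharedState
{
    slock_t     mutex;              /* protects the flag below */
    bool        task_in_progress;   /* a prewarm or dump task runs */
} AutoPrewarmSharedState;

A worker (or an immediate dump) takes the spinlock, checks the flag,
and bails out if another task is already running.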

>> +sort_cmp_func(const void *p, const void *q)
>> +{
>
> rename to blockinfo_cmp?

-- Fixed.

>
> Superfluous parens (repeated a lot).

-- Fixed in all places.

> Hm. Is it actually helpful to store the file as text? That's commonly
> going to increase the size of the file quite considerably, no?

-- Having it in text helps readability, and makes it easier to
modify/adjust as needed. I shall do a small experiment to check how
much we could save with a binary format and produce a patch on top of
this one for the same.

>
> But what is this actually used for? I thought Robert, in
> http://archives.postgresql.org/message-id/CA%2BTgmoa%3DUqCL2mR%2B9WTq05tB3Up-z4Sv2wkzkDxDwBP7Mj_2_w%40mail.gmail.com
> suggested storing the filenode in the dump, and then to use
> RelidByRelfilenode to get the corresponding relation?

-- Fixed as suggested; directly calling RelidByRelfilenode now.
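
Roughly (a sketch; toload_block is the current block info record):

/* RelidByRelfilenode() is declared in utils/relfilenodemap.h. */
reloid = RelidByRelfilenode(toload_block.spcNode,
                            toload_block.filenode);
if (OidIsValid(reloid))
    rel = try_relation_open(reloid, AccessShareLock);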

>> +load_one_database(Datum main_arg)
>> +{
>
>> + /* check if file exists and open file in read mode. */
>> + snprintf(dump_file_path, sizeof(dump_file_path), "%s", AUTOPREWARM_FILE);
>> + file = fopen(dump_file_path, PG_BINARY_R);
>> + if (!file)
>> + return; /* No file to load. */
>
> Shouldn't this be an error case? In which case is it ok for the file to
> be gone after we launched the worker?

-- Yes, sorry, a mistake. In the new patch I have replaced the file
map with a DSA to distribute block infos to the sub-workers, so this
code will go away.
Now the main worker loads all of the block infos into a dynamic
shared area (DSA), and each sub-worker reads the entries that belong
to the database it is assigned to load.
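
In outline, the main worker publishes the block list along these
lines (a sketch of the dsa.h usage; the variable names are
illustrative):

dsa_area   *area = dsa_create(LWLockNewTrancheId());
dsa_pointer blocks = dsa_allocate(area,
                                  sizeof(BlockInfoRecord) * num_blocks);
BlockInfoRecord *blkinfo =
    (BlockInfoRecord *) dsa_get_address(area, blocks);

/*
 * ... read the dump file into blkinfo[], then hand dsa_get_handle(area)
 * and 'blocks' to each per-database sub-worker, which dsa_attach()es
 * and loads only the entries of its assigned database ...
 */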

>
>> + /*
>> + * It should be a block info belonging to a new database. Or else dump
>> + * file is corrupted better to end the loading of bocks now.
>> + */
>> + if (loadblocktype != BLKTYPE_NEW_DATABASE)
>> + goto end_load; /* should we raise a voice here? */
>
> Yes, this should raise an error.

-- Got rid of this check with the new code.

>> + case BLKTYPE_NEW_RELATION:
>> +
>> + /*
>> + * release lock on previous relation.
>> + */
>> + if (rel)
>> + {
>> + relation_close(rel, AccessShareLock);
>> + rel = NULL;
>> + }
>> +
>> + loadblocktype = BLKTYPE_NEW_RELATION;
>> +
>> + /*
>> + * lock new relation.
>> + */
>> + reloid = get_reloid(toload_block.filenode);
>> +
>> + if (!OidIsValid(reloid))
>> + break;
>> +
>> + rel = try_relation_open(reloid, AccessShareLock);
>> + if (!rel)
>> + break;
>> + RelationOpenSmgr(rel);
>
> Now I'm confused. Your get_reloid used pg_relation_filenode() to map
> from relation oid to filenode - and then you're using it to lock the
> relation? Something's wrong.

We take a shared lock on the relation's OID so that, while we load
the blocks of the relation, DropRelFileNodesAllBuffers cannot be run
concurrently by another process.

>> + case BLKTYPE_NEW_FORK:
>> +
>> + /*
>> + * check if fork exists and if block is within the range
>> + */
>> + loadblocktype = BLKTYPE_NEW_FORK;
>> + if ( /* toload_block.forknum > InvalidForkNumber &&
>> + * toload_block.forknum <= MAX_FORKNUM && */
>> + !smgrexists(rel->rd_smgr, toload_block.forknum))
>> + break;
>
> Huh? What's with that commented out section of code?

-- smgrexists is not safe on its own; it crashes if we pass an
illegal fork number, hence the precheck. I accidentally forgot to
uncomment it, sorry. Fixed now.
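
The uncommented check now reads roughly like this (sketch):

/*
 * Validate the fork number before asking smgr; smgrexists() would
 * crash on an out-of-range fork number from a corrupted dump file.
 */
if (toload_block.forknum <= InvalidForkNumber ||
    toload_block.forknum > MAX_FORKNUM ||
    !smgrexists(rel->rd_smgr, toload_block.forknum))
    break;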

>> + case BLKTYPE_NEW_BLOCK:
>> +
>> + /* check if blocknum is valid and with in fork file size. */
>> + if (toload_block.blocknum >= nblocks)
>> + {
>> + /* move to next forknum. */
>> + loadblocktype = BLKTYPE_NEW_FORK;
>> + break;
>> + }
>
> Hm. Why does the size of the underlying file allow us to skip to the
> next fork? Don't we have to read all the pending dump records?

-- Blocks beyond the current file size do not exist, and since the
dump is sorted by block number within each fork, all remaining
records for that fork are also past the end. So I thought we could
move to the next fork and load its blocks.

>> + buf = ReadBufferExtended(rel, toload_block.forknum,
>> + toload_block.blocknum, RBM_NORMAL,
>> + NULL);
>> + if (BufferIsValid(buf))
>> + {
>> + ReleaseBuffer(buf);
>> + }
>> +
>> + loadblocktype = BLKTYPE_NEW_BLOCK;
>> + break;
>
> Hm. RBM_NORMAL will error out in a bunch of cases, is that ok?

-- For now, on error we restart the bgworker. Previously I used
setjmp to catch the error and continue, but per a review comment I
have moved to the new restart logic.
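
The restart comes from the worker registration itself rather than any
in-process error trapping; a sketch of the relevant field (the value
is illustrative):

BackgroundWorker worker;

/* ... bgw_name, bgw_flags, entry point, etc. ... */
worker.bgw_restart_time = 1;    /* postmaster restarts us 1s after an
                                 * ERROR, instead of us catching it
                                 * with setjmp and continuing */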

>> + if (have_dbconnection)
>> + {
>> + SPI_finish();
>> + PopActiveSnapshot();
>> + CommitTransactionCommand();
>> + }
>> + return;
>> +}
>
> Are we really ok keeping open a transaction through all of this? That
> could potentially be quite long, no? How about doing that on a per-file
> basis, or even moving to session locks altogether?
>

-- We hold the transaction open until we have loaded all the dumped
blocks of the database or the buffer pool becomes full; yes, that can
take long. Should I commit and start a new transaction for every fork
file?

>
> This doesn't mention storing things as ascii, instead of binary...
>

-- Fixed; this is documented now.

>> + * NOTE : We store off_t inside file so the dump file will not be portable to
>> + * be used across systems where sizeof off_t is different from each other.
>> + */
>
> Why are we using off_t? Shouldn't this just be BlockNumber?

-- Previously we maintained a special map of offsets to the block
infos of each database, but now that everything has moved to the DSA,
the off_t code is removed.

>> +static uint32
>> +dump_now(void)
>> +{
>> + static char dump_file_path[MAXPGPATH],
>
>> +
>> + for (num_blocks = 0, i = 0; i < NBuffers; i++)
>> + {
>> + uint32 buf_state;
>> +
>> + bufHdr = GetBufferDescriptor(i);
>> +
>> + /* lock each buffer header before inspecting. */
>> + buf_state = LockBufHdr(bufHdr);
>> +
>> + if (buf_state & BM_TAG_VALID)
>> + {
>> + block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
>> + block_info_array[num_blocks].spcNode = bufHdr->tag.rnode.spcNode;
>> + block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
>> + block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
>> + block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
>> + ++num_blocks;
>> + }
>> +
>> + UnlockBufHdr(bufHdr, buf_state);
>
>> + }
>> +
>> + /* sorting now only to avoid sorting while loading. */
>
> "sorting while loading"? You mean random accesses?

-- Yes, fixed.

>> + pg_qsort(block_info_array, num_blocks, sizeof(BlockInfoRecord),
>> + sort_cmp_func);
>
>
>
>> + snprintf(transient_dump_file_path, sizeof(dump_file_path),
>> + "%s.%d", AUTOPREWARM_FILE, MyProcPid);
>> + file = fopen(transient_dump_file_path, "w");
>> + if (file == NULL)
>> + ereport(ERROR,
>> + (errcode_for_file_access(),
>> + errmsg("autoprewarm: could not open \"%s\": %m",
>> + dump_file_path)));
>
> What if that file already exists? You're not truncating it. Also, what
> if we error out in the middle of this? We'll leak an fd. I think this
> needs to use OpenTransientFile etc.

-- Thanks, using OpenTransientFile now.
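
Roughly like this (a sketch against the three-argument
OpenTransientFile() of this release; O_TRUNC also addresses the
pre-existing-file concern):

int         fd;

fd = OpenTransientFile(transient_dump_file_path,
                       O_WRONLY | O_CREAT | O_TRUNC | PG_BINARY,
                       S_IRUSR | S_IWUSR);
if (fd < 0)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("autoprewarm: could not open \"%s\": %m",
                    transient_dump_file_path)));
/* ... write records, then CloseTransientFile(fd); an ERROR thrown in
 * between is cleaned up automatically at transaction abort ... */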

>> + snprintf(dump_file_path, sizeof(dump_file_path),
>> + "%s", AUTOPREWARM_FILE);
>> + ret = fprintf(file, "%020jd\n", (intmax_t) 0);
>> + if (ret < 0)
>> + {
>> + fclose(file);
>> + ereport(ERROR,
>> + (errcode_for_file_access(),
>> + errmsg("autoprewarm: error writing to \"%s\" : %m",
>> + dump_file_path)));
>> + }
>> +
>> + database_map_table[num_db++] = ftello(file);
>> +
>> + for (i = 0; i < num_blocks; i++)
>> + {
>> + if (i > 0 && block_info_array[i].database != prev_database)
>> + {
>> + if (num_db == database_map_table_size)
>> + {
>> + database_map_table_size *= 2; /* double and repalloc. */
>> + database_map_table =
>> + (off_t *) repalloc(database_map_table,
>> + sizeof(off_t) * database_map_table_size);
>> + }
>> + fflush(file);
>> + database_map_table[num_db++] = ftello(file);
>> + }
>> +
>> + ret = fprintf(file, "%u,%u,%u,%u,%u\n",
>> + block_info_array[i].database,
>> + block_info_array[i].spcNode,
>> + block_info_array[i].filenode,
>> + (uint32) block_info_array[i].forknum,
>> + block_info_array[i].blocknum);
>> + if (ret < 0)
>> + {
>> + fclose(file);
>> + ereport(ERROR,
>> + (errcode_for_file_access(),
>> + errmsg("autoprewarm: error writing to \"%s\" : %m",
>> + dump_file_path)));
>> + }
>> +
>> + prev_database = block_info_array[i].database;
>> + }
>
> I think we should check for interrupts somewhere in that (and the
> preceding) loop.

-- Now checking for SIGHUP inside the loop, and if dumping has been
disabled by setting pg_prewarm.dump_interval to -1, I terminate the
loop.
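
Inside the write loop, that check looks roughly like this (sketch):

/* React to a reload request even while dumping. */
if (got_sighup)
{
    got_sighup = false;
    ProcessConfigFile(PGC_SIGHUP);

    /* pg_prewarm.dump_interval = -1 means dumping is disabled. */
    if (dump_interval == AT_PWARM_OFF)
        break;
}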

>
>> +/*
>> + * dump_block_info_periodically - at regular intervals, which is defined by GUC
>> + * dump_interval, dump the info of blocks which are present in buffer pool.
>> + */
>> +void
>> +dump_block_info_periodically()
>> +{
>
> Suggest adding void to the parameter list.

-- Fixed.

>> + last_dump_time = (pg_time_t) time(NULL);
>> + elapsed_secs = 0;
>> + }
>> +
>> + timeout = dump_interval - elapsed_secs;
>> + }
>
> I suggest using GetCurrentTimestamp() and TimestampDifferenceExceeds()
> instead.

-- Fixed as suggested.
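
The wait logic is now along these lines (a sketch; dump_interval is
in seconds, hence the multiplication by 1000):

TimestampTz last_dump_time = GetCurrentTimestamp();

/* ... in the main loop, after each WaitLatch() wakeup ... */
if (TimestampDifferenceExceeds(last_dump_time,
                               GetCurrentTimestamp(),
                               dump_interval * 1000))
{
    dump_now();
    last_dump_time = GetCurrentTimestamp();
}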

>> + /*
>> + * In case of a SIGHUP, just reload the configuration.
>> + */
>> + if (got_sighup)
>> + {
>> + got_sighup = false;
>> + ProcessConfigFile(PGC_SIGHUP);
>> + }
>> + }
>> +
>> + /* One last block meta info dump while postmaster shutdown. */
>> + if (dump_interval != AT_PWARM_OFF)
>> + dump_now();
>
> Uh, afaics we'll also do this if somebody SIGTERMed the process
> interactively?

-- Yes, we dump and exit on a SIGTERM.

>> + /* if not run as a preloaded library, nothing more to do here! */
>> + if (!process_shared_preload_libraries_in_progress)
>> + return;
>> +
>> + DefineCustomStringVariable("pg_prewarm.default_database",
>> + "default database to connect if dump has not recorded same.",
>> + NULL,
>> + &default_database,
>> + "postgres",
>> + PGC_POSTMASTER,
>> + 0,
>> + NULL,
>> + NULL,
>> + NULL);
>
> I don't think it's a good idea to make guc registration depending on
> process_shared_preload_libraries_in_progress.

-- Fixed now.

> You should also use EmitWarningsOnPlaceholders() somewhere here.

-- Added it.

> I also wonder whether we don't need to use prefetch to actually make
> this fast enough.

-- I have not used prefetch yet, but I will reassess and update the
patch.
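
If it turns out to be needed, the obvious shape would be a
PrefetchBuffer() pass ahead of the actual reads (sketch, untested):

/* Hint the kernel about the upcoming read before ReadBufferExtended(). */
PrefetchBuffer(rel, toload_block.forknum, toload_block.blocknum);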

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
autoprewarm_07.patch application/octet-stream 39.4 KB
