Re: Proposal : For Auto-Prewarm.

From: Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
To: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal : For Auto-Prewarm.
Date: 2017-06-05 11:58:06
Message-ID: CAOGQiiNwtepFRa=R2-5YBrgxyT+w8GcWyZY0pwA_SiH-OZQpKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jun 4, 2017 at 12:45 PM, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com> wrote:
> On Wed, May 31, 2017 at 10:18 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> + * contrib/autoprewarm.c
>>
>> Wrong.
>
> -- Oops Sorry fixed.
>
>> + Oid database; /* database */
>> + Oid spcnode; /* tablespace */
>> + Oid filenode; /* relation's filenode. */
>> + ForkNumber forknum; /* fork number */
>> + BlockNumber blocknum; /* block number */
>>
>> Why spcnode rather than tablespace? Also, the third line has a
>> period, but not the others. I think you could just drop these
>> comments; they basically just duplicate the structure names, except
>> for spcnode which doesn't but probably should.
>
> -- Dropped comments and changed spcnode to tablespace.
>
>> + bool can_do_prewarm; /* if set can't do prewarm task. */
>>
>> The comment and the name of the variable imply opposite meanings.
>
> -- Sorry a typo. Now this variable has been removed as you have
> suggested with new variables in AutoPrewarmSharedState.
>
>> +/*
>> + * detach_blkinfos - on_shm_exit detach the dsm allocated for blockinfos.
>> + */
>> I assume you don't really need this. Presumably process exit is going
>> to detach the DSM anyway.
>
> -- Yes considering process exit will detach the dsm, I have removed
> that function.
>
>> + if (seg == NULL)
>> + ereport(ERROR,
>> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> + errmsg("unable to map dynamic shared memory segment")));
>>
>> Let's use the wording from the message in parallel.c rather than this
>> wording. Actually, we should probably (as a separate patch) change
>> test_shm_mq's worker.c to use the parallel.c wording also.
>
> -- I have corrected the message with "could not map dynamic shared
> memory segment" as in parallel.c
>
>> + SetCurrentStatementStartTimestamp();
>>
>> Why do we need this?
>
> -- Removed Sorry forgot to remove same when I removed the SPI connection code.
>
>> + StartTransactionCommand();
>>
>> Do we need a transaction? If so, how about starting a new transaction
>> for each relation instead of using a single one for the entire
>> prewarm?
>
> -- We do relation_open hence need a transaction. As suggested now we
> start a new transaction on every new relation.
>
>> + if (status == BGWH_STOPPED)
>> + return;
>> +
>> + if (status == BGWH_POSTMASTER_DIED)
>> + {
>> + ereport(ERROR,
>> + (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
>> + errmsg("cannot start bgworker autoprewarm without postmaster"),
>> + errhint("Kill all remaining database processes and restart"
>> + " the database.")));
>> + }
>> +
>> + Assert(0);
>>
>> Instead of writing it this way, remove the first "if" block and change
>> the second one to Assert(status == BGWH_STOPPED). Also, I'd ditch the
>> curly braces in this case.
>
> -- Fixed as suggested.
>
>>
>> + file = fopen(AUTOPREWARM_FILE, PG_BINARY_R);
>>
>> Use AllocateFile() instead. See the comments for that function and at
>> the top of fd.c. Then you don't need to worry about leaking the file
>> handle on an error, so you can remove the fclose() before ereport()
> now> stuff -- which is incomplete anyway, because you could fail e.g.
>> inside dsm_create().
>
> -- Using AllocateFile now.
>
>>
>> + errmsg("Error reading num of elements in \"%s\" for"
>> + " autoprewarm : %m", AUTOPREWARM_FILE)));
>>
>> As I said in a previous review, please do NOT split error messages
>> across lines like this. Also, this error message is nothing close to
>> PostgreSQL style. Please read
>> https://www.postgresql.org/docs/devel/static/error-style-guide.html
>> and learn to follow all those guidelines written therein. I see at
>> least 3 separate problems with this message.
>
> -- Thanks, I have tried to fix it now.
>
>> + num_elements = i;
>>
>> I'd do something like if (i != num_elements) elog(ERROR, "autoprewarm
>> block dump has %d entries but expected %d", i, num_elements); It
>> seems OK for this to be elog() rather than ereport() because the file
>> should never be corrupt unless the user has cheated by hand-editing
>> it.
>
> -- Fixed as suggested. Now eloged as an ERROR.
>
>> I think you can get rid of next_db_pos altogether, and this
>> prewarm_elem thing too. Design sketch:
>>
>> 1. Move all the stuff that's in prewarm_elem into
>> AutoPrewarmSharedState. Rename start_pos to prewarm_start_idx and
>> end_of_blockinfos to prewarm_stop_idx.
>>
>> 2. Instead of computing all of the database ranges first and then
>> launching workers, do it all in one loop. So figure out where the
>> records for the current database end and set prewarm_start_idx and
>> prewarm_end_idx appropriately. Launch a worker. When the worker
>> terminates, set prewarm_start_idx = prewarm_end_idx and advance
>> prewarm_end_idx to the end of the next database's records.
>>
>> This saves having to allocate memory for the next_db_pos array, and it
>> also avoids this crock:
>>
>> + memcpy(&pelem, MyBgworkerEntry->bgw_extra, sizeof(prewarm_elem));
>>
>
> -- Fixed as suggested.
>
>> The reason that's bad is because it only works so long as bgw_extra is
>> large enough to hold prewarm_elem. If prewarm_elem grows or bgw_extra
>> shrinks, this turns into a buffer overrun.
>
> -- passing prewarm info through bgw_extra helped us to restrict the
> scope and lifetime of prewarm_elem only to prewarm task. Moving them
> to shared memory made them global even though they are not needed once
> prewarm task is finished. As there are other disadvantages of using
> bgw_extra I have now implemented as you have suggested.
>
>> I would use AUTOPREWARM_FILE ".tmp" rather than a name incorporating
>> the PID for the temporary file. Otherwise, you might leave many
>> temporary files behind under different names. If you use the same
>> name every time, you'll never have more than one, and the next
>> successful dump will end up getting rid of it along the way.
>
> -- Fixed as sugested. Previosuly PID was used so that concurrent dump
> can happen between dump worker and immediate dump as they will write
> to two different files. With new way of registering PID before file
> access in shared memory I think that problem can be adressed.
>
>>
>> + pfree(block_info_array);
>> + CloseTransientFile(fd);
>> + unlink(transient_dump_file_path);
>> + ereport(ERROR,
>> + (errcode_for_file_access(),
>> + errmsg("error writing to \"%s\" : %m",
> .> + AUTOPREWARM_FILE)));
>>
>> Again, this is NOT a standard error message text. It occurs in zero
>> other places in the source tree. You are not the first person to need
>> an error message for a failed write to a file; please look at what the
>> previous authors did. Also, the pfree() before report is not needed;
>> isn't the whole process going to terminate? Also, you can't really use
>> errcode_for_file_access() here, because you've meanwhile done
>> CloseTransientFile() and unlink(), which will have clobbered errno.
>
> -- Removed pfree, saved errno before CloseTransientFile() and unlink()
>
>> + ereport(LOG, (errmsg("saved metadata info of %d blocks", num_blocks)));
>>
>> Not project style for ereport(). Line break after the first comma.
>> Similarly elsewhere.
>
> -- Tried to fix same
>
>> + * dump_now - the main routine which goes through each buffer
>> header of buffer
>> + * pool and dumps their meta data. We Sort these data and then dump them.
>> + * Sorting is necessary as it facilitates sequential read during load.
>>
>> This is no longer true, because you moved the sort to the load side.
>> It's also not capitalized properly.
>
> -- Sorry removed now.
>
>> Discussions of the format of the autoprewarm dump file involve
>> inexplicably varying number of < and > symbols:
>>
>> + * <<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>> in
>> + * <DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> and we shall call it
>> + buflen = sprintf(buf, "%u,%u,%u,%u,%u\n",
>
> -- Sorry fixed now, in all of the places the block info formats will
> not have such (</>) delimiter.
>
>> +#ifndef __AUTOPREWARM_H__
>> +#define __AUTOPREWARM_H__
>>
>> We don't use double-underscored names for header files. Please
>> consult existing header files for the appropriate style. Also, why
>> does this file exist at all, instead of just including them in the .c
>> file? The pointer of a header is for things that need to be included
>> by multiple .c files, but there's no such need here.
>
> -- This was done to fix one of the previous review comments. I have
> moved them back to .c file.
>
>> + * load. If there are no other block infos than the global objects
>> + * we silently ignore them. Should I throw error?
>>
>> Silently ignoring them seems fine. Throwing an error doesn't seem
>> like it would improve things.
>
> -- Okay thanks.
>
>>
>> + /*
>> + * Register a sub-worker to load new database's block. Wait until the
>> + * sub-worker finish its job before launching next sub-worker.
>> + */
>> + launch_prewarm_subworker(&pelem);
>>
>> The function name implies that it launches the worker, but the comment
>> implies that it also waits for it to terminate. Functions should be
>> named in a way that matches what they do.
>
> -- Have renamed it to launch_and_wait_for_per_database_worker
>
>> I feel like the get_autoprewarm_task() function is introducing fairly
>> baroque logic for something that really ought to be more simple. All
>> autoprewarm_main() really needs to do is:
>>
>> if (!state->autoprewarm_done)
>> autoprewarm();
>> dump_block_info_periodically();
>
> -- Have simplified things as suggested now. Function
> get_autoprewarm_task has been removed.
>
>> The locking in autoprewarm_dump_now() is a bit tricky. There are two
>> trouble cases. One is that we try to rename() our new dump file on
>> top of the existing one while a background worker is still using it to
>> perform an autoprewarm. The other is that we try to write a new
>> temporary dump file while some other process is trying to do the same
>> thing. I think we should fix this by storing a PID in
>> AutoPrewarmSharedState; a process which wants to perform a dump or an
>> autoprewarm must change the PID from 0 to their own PID, and change it
>> back to 0 on successful completion or error exit. If we go to perform
>> an immediate dump process and finds a non-zero value already just does
>> ereport(ERROR, ...), including the PID of the other process in the
>> message (e.g. "unable to perform block dump because dump file is being
>> used by PID %d"). In a background worker, if we go to dump and find
>> the file in use, log a message (e.g. "skipping block dump because it
>> is already being performed by PID %d", "skipping prewarm because block
>> dump file is being rewritten by PID %d").
>
> -- Fixed as suggested.
>
>> I also think we should change is_bgworker_running to a PID, so that if
>> we try to launch a new worker we can report something like
>> "autoprewarm worker is already running under PID %d".
>
> -- Fixed. I could only "LOG" about another autoprewarm worker already
> running and then exit. Because on ERROR we try to restart the worker,
> so do not want to restart such workers.
>
>> So putting that all together, I suppose AutoPrewarmSharedState should
>> end up looking like this:
>>
>> LWLock lock; /* mutual exclusion */
>> pid_t bgworker_pid; /* for main bgworker */
>> pid_t pid_using_dumpfile; /* for autoprewarm or block dump */
>
> -- I think one more member is required which state whether prewarm can
> be done when the worker restarts.
>
>> /* following items are for communication with per-database worker */
>> dsm_handle block_info_handle;
>> Oid database;
>> int prewarm_start_idx;
>> int prewarm_stop_idx;
>
> -- Fixed as suggested
>
>> I suggest going through and changing "subworker" to "per-database
>> worker" throughout.
>
> -- Fixed as suggested.
>
>>
>> BTW, have you tested how long this takes to run with, say, shared_buffers = 8GB?
>
> I have tried same on my local machine with ssd as a storage.
>
> settings: shared_buffers = 8GB, loaded data with pg_bench scale_factor=1000.
>
> Total blocks got dumped
> autoprewarm_dump_now
> ----------------------
> 1048576
>
> 5 different load time based logs
>
> 1.
> 2017-06-04 11:30:26.460 IST [116253] LOG: autoprewarm has started
> 2017-06-04 11:30:43.443 IST [116253] LOG: autoprewarm load task ended
> -- 17 secs
>
> 2
> 2017-06-04 11:31:13.565 IST [116291] LOG: autoprewarm has started
> 2017-06-04 11:31:30.317 IST [116291] LOG: autoprewarm load task ended
> -- 17 secs
>
> 3.
> 2017-06-04 11:32:12.995 IST [116329] LOG: autoprewarm has started
> 2017-06-04 11:32:29.982 IST [116329] LOG: autoprewarm load task ended
> -- 17 secs
>
> 4.
> 2017-06-04 11:32:58.974 IST [116361] LOG: autoprewarm has started
> 2017-06-04 11:33:15.017 IST [116361] LOG: autoprewarm load task ended
> -- 17secs
>
> 5.
> 2017-06-04 12:15:49.772 IST [117936] LOG: autoprewarm has started
> 2017-06-04 12:16:11.012 IST [117936] LOG: autoprewarm load task ended
> -- 22 secs.
>
> So mostly from 17 to 22 secs.
>
> But I think I need to do tests on a larger set of configuration on
> different storage types. I shall do same and upload later. I have also
> uploaded latest performance test results (on my local machine ssd
> drive)
> configuration: shared_buffer = 8GB,
> test setting: scale_factor=300 (data fits to shared_buffers) pgbench clients =1
>
> TEST
> PGBENCH_RUN="./pgbench --no-vacuum --protocol=prepared --time=5 -j 1
> -c 1 --select-only postgres"
> START_TIME=$SECONDS; echo TIME, TPS; while true; do TPS=$($PGBENCH_RUN
> | grep excluding | cut -d ' ' -f 3); TIME=$((SECONDS-START_TIME));
> echo $TIME, $TPS; done
>
>
I had a look at the patch from stylistic/formatting point of view,
please find the attached patch for the suggested modifications.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachment Content-Type Size
cosmetic_autoprewarm.patch application/octet-stream 20.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2017-06-05 12:24:08 Re: Fix a typo in README.dependencies
Previous Message Magnus Hagander 2017-06-05 11:38:28 Re: retry shm attach for windows (WAS: Re: OK, so culicidae is *still* broken)