Re: Proposal : For Auto-Prewarm.

From: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Beena Emerson <memissemerson(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal : For Auto-Prewarm.
Date: 2017-03-13 13:15:00
Message-ID: CAD__Ouin+TjatLVL3aq2g7xXVh--hgq2CQ-fPeRDSN5uLTwwYw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hi all, thanks for the feedback. Based on your recent comments I have
implemented a new patch which is attached below,

On Sun, Feb 26, 2017 at 10:16 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> This is not easy to fix. The lock has to be taken based on the
> relation OID, not the relfilenode, but we don't have the relation OID
> in the dump file, and recording it there won't help, because the
> relfilenode can change under us if the relation is rewritten with
> CLUSTER or VACUUM FULL or relevant forms of ALTER TABLE. I don't see
> a solution other than launching a separate worker for each database,
> which seems like it could be extremely expensive if there are many
> databases. Also, I am pretty sure it's no good to take locks before
> recovery reaches a consistent state. I'm not sure off-hand whether
> crash-recovery will notice conflicting locks, but even if it does,
> blocking crash recovery in order to prewarm stuff is bad news.

On Mon, Feb 27, 2017 at 7:18 PM, Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> You don't have to start all these workers at once. Starting one and not
> starting the next one until the first one is finished should be fine.
> It will have the same serial behavior that the patch is proposing anyway.

I have implemented a similar logic now. The prewarm bgworker will
launch a sub-worker per database in the dump file. And, each
sub-worker will load its database block info. The sub-workers will be
launched only after previous one is finished. All of this will only
start if the database has reached a consistent state.

I have also introduced 2 more utility functions which were requested earlier.
A. launch_autoprewarm_dump() RETURNS int4
This is a SQL callable function to launch the autoprewarm worker to
dump the buffer pool information at regular interval. In a server, we
can only run one autoprewarm worker so if a worker sees another
existing worker it will exit immediately. The return value is pid of
the worker which has been launched.

B. autoprewarm_dump_now() RETURNS int8
This is a SQL callable function to dump buffer pool information
immediately once by a backend. This can work in parallel with an
autoprewarm worker while it is dumping. The return value is the number
of blocks info dumped.

I need some feedback on efficiently dividing the blocks among
sub-workers. Existing dump format will not be much useful.

I have chosen the following format
Each entry of block info looks like this:
<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum> and we shall
call it as BlockInfoRecord.

Contents of AUTOPREWARM_FILE has been formated such a way that
blockInfoRecord of each database can be given to different prewarm
workers.
format of AUTOPREWAM_FILE
=======================================
[offset position of database map table]
[sorted BlockInfoRecords..............]
[database map table]
=======================================

The [database map table] is a sequence of offset in file which will
point to first BlockInfoRecords of each database in the dump. The
prewarm worker will read this offset one by one in sequence and ask
its sub-worker to seek to this position and then start loading the
BlockInfoRecords one by one until it sees a BlockInfoRecords of a
different database than it is actually connected to. NOTE: We store
off_t inside file so the dump file will not be portable to be used
across systems where sizeof off_t is different from each other.

I also thought of having one dump file per database. Problems I
thought which can go against it is there could be too many dump file
(also stale files of databases which are no longer in buffer pool).
Also, there is a possibility of dump file getting lost due to
concurrent dumps by bgworker and autoprewarm_dump_now() SQL utility.
ex: While implementing same, dump file names were chosen as number
sequence 0,1,2,3......number_of_db. (This helps to avoid stale files
being chosen before active database dump files). If 2 concurrent
process dump together there will be a race condition. If one page of
the new database is loaded to buffer pool at that point there could be
a possibility that a recent dump of database blocks will be
overwritten due to the race condition and it will go missing from
dumps even though that database is still active in bufferpool. If any
comments to fix this will be very helpful.

On Sun, Feb 26, 2017 at 10:16 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Here are some other review comments (which may not matter unless we
> can think up a solution to the problems above).

> - I think auto_pg_prewarm.c is an unnecessarily long name. How about
> autoprewarm.c?

Fixed; I am also trying to replace the term "auto pg_prewarm" to
"autoprewarm" in all relevant places.

> - It looks like you ran pgindent over this without adding the new
> typedefs to your typedefs.list, so things like the last line of each
> typedef is formatted incorrectly.

Fixed.

> - ReadBufferForPrewarm isn't a good name for this interface. You need
> to give it a generic name (and header comment) that describes what it
> does, rather than why you added it. Others might want to also use
> this interface. Actually, an even better idea might be to adjust
> ReadBufferWithoutRelcache() to serve your need here. That function's
> header comment seems to contemplate that somebody might want to add a
> relpersistence argument someday; perhaps that day has arrived.

-- I think it is not needed now as we have relation descriptor hence
using ReadBufferExtended.

> - have_free_buffer's comment shouldn't mention autoprewarm, but it
> should mention that this is a lockless test, so the results might be
> slightly stale. See similar comments in various other backend
> functions for an example of how to write this.

-- Fixed

> - next_task could be static, and with such a generic name, it really
> MUST be static to avoid namespace conflicts.

-- Fixed. after the new code that variable is no longer needed.

> - load_block() has a race condition. The relation could be dropped
> after you check smgrexists() and before you access the relation.
> Also, the fork number validity check looks useless; it should never
> fail.

-- Now we hold a relation lock so this should not be an issue. But
extra check for forknum is required before calling smgrexists.
Otherwise, it crashes if valid forknum is not given hence check is
necessary.

crash call stack after manually setting one of the forknum 10

#0 0x00007f333bb27694 in vfprintf () from /lib64/libc.so.6
#1 0x00007f333bb53179 in vsnprintf () from /lib64/libc.so.6
#2 0x00000000009ec683 in pvsnprintf (buf=0x139e8e0 "global/1260_",
'\177' <repeats 115 times>, len=128, fmt=0xbebffe "global/%u_%s",
args=0x7ffeaea65750)
at psprintf.c:121
#3 0x00000000009ec5d1 in psprintf (fmt=0xbebffe "global/%u_%s") at
psprintf.c:64
#4 0x00000000009eca0f in GetRelationPath (dbNode=0, spcNode=1664,
relNode=1260, backendId=-1, forkNumber=10) at relpath.c:150
#5 0x000000000082cd95 in mdopen (reln=0x1359c78, forknum=10,
behavior=2) at md.c:583
#6 0x000000000082c58d in mdexists (reln=0x1359c78, forkNum=10) at md.c:284
#7 0x000000000082f4cf in smgrexists (reln=0x1359c78, forknum=10) at smgr.c:289
#8 0x00007f33353b0294 in load_one_database (main_arg=21) at autoprewarm.c:546
#9 0x0000000000795232 in StartBackgroundWorker () at bgworker.c:757
#10 0x00000000007a707d in do_start_bgworker (rw=0x12fc3d0) at postmaster.c:5570

> - I suggest renaming the file that stores the blocks to
> autoprewarm.blocks or something like that. Calling it just
> "autopgprewarm" doesn't seem very clear.

-- Fixed. The current file name is "autopgprewarm.save" changed it to
autoprewarm.blocks.

> - I don't see any reason for the dump file to contain a header record
> with an expected record count. When rereading the file, you can just
> read until EOF; there's no real need to know the record count before
> you start.

-- Fixed. Now removed.

> - You should test for multiple flags like this: if ((buf_state &
> (BM_VALID|VM_TAG_VALID|BM_PERSISTENT)) != 0). However, I also think
> the test is wrong. Even if the buffer isn't BM_VALID, that's not
> really a reason not to include it in the dump file. Same with
> BM_PERSISTENT. I think the reason for the latter restriction is that
> you're always calling load_block() with RELPERSISTENCE_PERMANENT, but
> that's not a good idea either. If the relation were made unlogged
> after you wrote the dump file, then on reload it you'd incorrectly set
> BM_PERMANENT on the reloaded blocks.

-- Fixed now removed BM_PERMANENT and BM_VALID only using BM_TAG_VALID
so if the hash table has this buffer we dump it.

> - elog() should not be used for user-facing messages, but rather only
> for internal messages that we don't expect to get generated. Also,
> the messages you've picked don't conform to the project's message
> style guidelines.

-- Fixed.

> - The error handling loop around load_block() suggests that you're
> expecting some reads to fail, which I guess is because you could be
> trying to read blocks from a relation that's been rewritten under a
> different relfilenode, or partially or entirely truncated. But I
> don't think it's very reasonable to just let ReadBufferWhatever() spew
> error messages into the log and hope users don't panic. People will
> expect an automatic prewarm solution to handle any such cases quietly,
> not bleat loudly in the log. I suspect that this error-trapping code
> isn't entirely correct, but there's not much point in fixing it; what
> we really need to do is get rid of it (somehow).

[Need Reelook] -- Debug and check if block load fails what will happen.

> - dump_block_info_periodically() will sleep for too long - perhaps
> forever - if WaitLatch() repeatedly returns due to WL_LATCH_SET, which
> can probably happen if for any reason the process receives SIGUSR1
> repeatedly. Every time the latch gets set, the timeout is reset, so
> it may never expire. There are examples of how to write a loop like
> this correctly in various places in the server; please check one of
> those.

-- Fixed same, to count elapsed time from the previous dump and then dump.

> - I don't think you should need an error-catching loop in
> auto_pgprewarm_main(), either. Just let the worker die if there's an
> ERROR, and set the restart interval to something other than
> BGW_NEVER_RESTART.

-- Fixed we restart it now.

> - Setting bgw_main isn't correct for extension code. Please read the
> documentation on background workers, which explains how to do this
> correctly in an extension.

-Fixed.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
pg_autoprewarm_06.patch application/octet-stream 39.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mithun Cy 2017-03-13 13:20:02 Re: Proposal : For Auto-Prewarm.
Previous Message Surafel Temesgen 2017-03-13 13:13:04 Re: New CORRESPONDING clause design