Re: Refactoring the checkpointer's fsync request queue

From: Shawn Debnath <sdn(at)amazon(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Refactoring the checkpointer's fsync request queue
Date: 2019-03-05 01:25:08
Message-ID: 2a1667c64cc0485bb781704876ac5a0a@EX13D05UWC002.ant.amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 01, 2019 at 11:17:27PM +1300, Thomas Munro wrote:
> - smgrpreckpt();
> + PreCheckpoint();
>
> I would call this and the "post" variant something like
> SyncPreCheckpoint(). Otherwise it's too general sounding and not
> clear which module it's in.

Sure - fixed.

> +static const int NSync = lengthof(syncsw);
>
> Unused.

Good catch - interesting how there was no warning thrown for this.

> + fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
>
> Needs to be closed.

*ahem* fixed

>
> + path = syncsw[entry->owner].sync_filepath(entry->ftag);
>
> Probably doesn't make much difference, but wouldn't it be better for
> the path to be written into a caller-supplied buffer of size
> MAXPGPATH? Then we could have that on the stack instead of alloc/free
> for every path.
>
> Hmm, mdfilepath() needs to use GetRelationPath(), and that already
> returns palloc'd memory. Oh well.

Yep - I tried to muck with this as well but the logic is too neatly
encapsulated inside relpath.c and given the usages via
relpath[perm|backend] - I chose to forgo the attempt.

>
> + entry->canceled = true;
>
> Now that we killed the bitmapset, I wonder if we still need this
> canceled flag. What if we just removed the entry from the hash table?
> If you killed the canceled flag you could then replace this:
>
> + if (entry->canceled)
> + break;
>
> .. with another hash table probe to see if the entry went in the
> AbsorbSyncRequests() call (having first copied the key into a local
> variable since of course "entry" may have been freed). Or maybe you
> don't think that's better, I dunno, just an idea :-)

It seems safer and cleaner to have the canceled flag as other approaches
don't really give us any gain. But this made me realize that I should
also cancel the entry for the simple forget case instead of removing it
from the hash table. Fixed and modified the for loop to break if entry
was cancelled. Deletion of the entry now happens in one place - after it
has been processed or skipped inside ProcessSyncRequests.

> +ForwardSyncRequest(FileTag ftag, SyncRequestType type, SyncRequestOwner owner)
>
> Is it a deliberate choice that you pass FileTag objects around by
> value? Rather than, say, pointer to const. Not really a complaint in
> the current coding since it's a small object anyway (not much bigger
> than a pointer), but I guess someone might eventually want to make it
> into a flexible sized object, or something, I dunno.

It was deliberate to match what we are doing today. You have a good
point regarding future modifications. May as well get the API changed
correctly the first time around. Changed.

> -ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
> +ForgetRelationSyncRequests(RelFileNode rnode, ForkNumber forknum,
> + SegmentNumber segno)
> {
> - if (pendingOpsTable)
> + FileTag tag;
> +
> + tag.rnode = rnode;
> + tag.forknum = forknum;
> + tag.segno = segno;
> +
> + if (IsSyncManagedLocally())
> {
> /* standalone backend or startup process: fsync state
> is local */
> - RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
> + RememberSyncRequest(tag, FORGET_REQUEST, SYNC_MD);
> }
> ...
>
> You left this and similar functions in md.c, but I think it needs to
> move out to sync.c, and just take a FileTag directly. Otherwise I
> have to write similar functions in undofile.c, and it seems kinda
> weird that those modules are worrying about whether sync is managed
> locally or the message needs to be sent to the checkpointer, and even
> worse, they have to duplicate the loop that deals with
> ForwardSyncRequest() failing and retrying. Concretely I'm saying that
> sync.c should define a function like this:
>
> +/*
> + * PostSyncRequest
> + *
> + * Remember locally, or post to checkpointer as appropriate.
> + */
> +void
> +PostSyncRequest(FileTag tag, SyncRequestType type, SyncRequestOwner owner)
> +{
> + if (IsSyncManagedLocally())
> + {
> + /* standalone backend or startup process: fsync state
> is local */
> + RememberSyncRequest(tag, type, owner);
> + }
> + else if (IsUnderPostmaster)
> + {
> + while (!ForwardSyncRequest(tag, FORGET_REQUEST, SYNC_MD))
> + pg_usleep(10000L); /* 10 msec seems a
> good number */
> + }
> +}
>
> Hmm, perhaps it would need to take an argument to say whether it
> should keep retrying or return false if it fails; that way
> register_dirty_segment() could perform the FileSync() itself if the
> queue is full, but register_unlink could tell it to keep trying. Does
> this make sense?

Yeah - makes sense. I did have it on my list, but I think I wanted to
minimize changes to md and wanted to wait for a use case. Though it
seems clear it's needed and looks like you already ran into it with
undo. Added a new function RegisterSyncRequest that checks for local
table or forwards to checkpointer. Accepts a parameter retryOnError
which does what it says or exits on failure.

> +typedef enum syncrequesttype
> +{
> + SYNC_REQUEST,
> + FORGET_REQUEST,
> + FORGET_HIERARCHY_REQUEST,
> + UNLINK_REQUEST
> +} syncrequesttype;
>
> These names are too generic for the global C namespace; how about
> SYNC_REQ_CANCEL or similar?

Yeah, again, was trying to match what we had before. I prefixed these
enums with SYNC to have a more readable set:

SYNC_REQUEST
SYNC_FORGET_REQUEST
SYNC_FORGET_HIERARCHY_REQUEST
SYNC_UNLINK_REQUEST

Except for the hierarchy one, they are pretty reasonable in length.

> +typedef enum syncrequestowner
> +{
> + SYNC_MD = 0 /* md smgr */
> +} syncrequestowner;
>
> I have a feeling that Andres wanted to see a single enum combining
> both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD,
> SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you
> have it.

This was tackled in a sub-thread. I am keeping the enums but changing
the the checkpointer queue to persist the type and owner combo in 8
bits.

> +/* md callback forward declarations */
> +extern char* mdfilepath(FileTag ftag);
> +extern bool mdtagmatches(FileTag ftag, FileTag predicate,
> SyncRequestType type);
>
> It's weird that these ^ are declared in sync.h. I think they should
> be declared in a new header md.h and that should be included by
> sync.c. I know that we have this historical weird thing there md.c's
> functions are declared in smgr.h, but we should eventually fix that.

Long story short - I wanted to avoid having md.h include sync.h for the
FileTag definitions. But - it's cleaner this way. Changed.

--
Shawn Debnath
Amazon Web Services (AWS)

Attachment Content-Type Size
0001-Refactor-the-fsync-machinery-to-support-future-SMGR-v11.patch text/plain 78.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2019-03-05 01:38:53 Re: Ordered Partitioned Table Scans
Previous Message Amit Langote 2019-03-05 00:50:42 Re: speeding up planning with partitions