Re: Refactoring the checkpointer's fsync request queue

From: Shawn Debnath <sdn(at)amazon(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Refactoring the checkpointer's fsync request queue
Date: 2019-02-22 18:18:57
Message-ID: 20190222181857.GA77583@f01898859afd.ant.amazon.com
Lists: pgsql-hackers

> It's interesting that you measure performance improvements that IIUC
> can come only from dropping the bitmapset stuff (or I guess bugs). I
> don't understand the mechanism for that improvement yet.

I will be digging into this a bit more to understand what really is the
cause for the improvements. But first, need to get the refactor patch in
better shape :-)

> The idea of just including the segment number (in some representation,
> possibly opaque to this code) in the hash table key instead of
> carrying a segment list seems pretty good from here (and I withdraw
> the sorted vector machinery I mentioned earlier as it's redundant for
> this project)... except for one detail. In your patch, you still have
> FORGET_RELATION_FSYNC and FORGET_DATABASE_FSYNC. That relies on this
> sync manager code being able to understand which stuff to drop from
> the hash table, which means that it still has knowledge of the key
> hierarchy, so that it can match all entries for the relation or
> database. That's one of the things that Andres is arguing against.

You are correct. I actually did mention having a callback to do the
request resolution in a response to Andres earlier in the thread, but
oops, it completely slipped my mind with my last patch.

> How about we take all that sync-related stuff, that Shawn has moved
> from md.c into smgr.c, and my earlier patch had moved out of md.c into
> smgrsync.c, and we put it into a new place
> src/backend/storage/file/sync.c? Or somewhere else, but not under
> smgr. It doesn't know anything about smgr concepts, and it can be
> used to schedule file sync for any kind of file, not just the smgr
> implementations' files. Though they'd be the main customers, I guess.
> md.c and undofile.c etc would call it to register stuff, and
> checkpointer.c would call it to actually perform all the fsync calls.
>
> If we do that, the next question is how to represent filenames. One
> idea is to use tags that can be converted back to a path. I suppose
> there could be a table of function pointers somewhere, and the tag
> could be a discriminated union? Or just a discriminator + a small
> array of dumb uint32_t of a size big enough for all users, a bit like
> lock tags.
>
> One reason to want to use small fixed-sized tags is to avoid atomicity
> problems in the future when it comes to the fd-passing work, as
> mentioned up-thread. Here are some other ideas, to avoid having to
> use tags:
>
> * send the paths through a shm queue, but the fds through the Unix
> domain socket; the messages carrying fds somehow point to the pathname
> in the shm queue (and deal with slight disorder...)
> * send the paths through the socket, but hold an LWLock while doing so
> to make sure it's atomic, no matter what the size
> * somehow prove that it's really already atomic even for long paths,
> on every operating system we support, and that it'll never change, so
> there is no problem here
>
> Another problem with variable sized pathnames even without the future
> fd-passing work is that it's harder to size the shm queue: the current
> code sets max_requests to NBuffers, which makes some kind of sense
> because that's a hard upper bound on the number of dirty segments
> there could possibly be at a given moment in time (one you'll
> practically never hit), after deduplication. It's harder to come up
> with a decent size for a new variable-sized-message queue; MAXPGPATH *
> NBuffers would be insanely large (it'd be 1/8th the size of the buffer
> pool), but if you make it any smaller there is no guarantee that
> compacting it can create space. Perhaps the solution to that is
> simply to block/wait while shm queue is full -- but that might have
> deadlock problems.

I think I have a much better understanding of what Andres is envisioning
and agree with what Thomas has said so far. To summarize, we want a
"sync" component at the level of fds that components higher up the
chain, like md, undo, slru, and checkpointer, will use to track and
process fsync requests (I am refraining from putting an ascii diagram
here!). These checkpoint requests will be opaque to the sync machinery,
which will rely on requesters to provide the necessary details. I agree
with Thomas that passing full or variable-length pathnames is not the
right way to go, for all the reasons he mentioned in his email. However,
if we want to, we can later extend the checkpoint request to include a
type, CHECKPOINT_REQUEST_FD or CHECKPOINT_REQUEST_PATH, and make the
current relfilenode requests type CHECKPOINT_REQUEST_RELFILENODE. Sync
can then act on requests based on their type and, in some cases,
wouldn't need to interact with any other component.

The pieces of information we need to process fsyncs are (1) whether a
request invalidates other requests currently in the queue, and (2) the
full path of the file to issue the fsync on.

I think using callbacks is a better path forward than having md or
other components issue an invalidate request for each and every segment,
which can get quite heavy-handed for large databases. Performance would
be the same as today, since we already scan the entire hash table when
we encounter a forget request; the new approach only adds one function
call, doing a simple comparison, per entry inside that loop.

And Thomas brought up a good point offline: if we followed smgr's
pattern for the callbacks, it would lead to a header-file dependency
nightmare. It would be best for components like md to register their
callback functions with sync, so that sync doesn't have to include
higher-level header files to get access to their prototypes.

At smgrinit() time, mdinit() would call into sync and register its
callbacks under an ID. We can use the same value we are using for
smgr_which to map the callbacks. Each fsync request will then also
carry this ID, which the sync mechanism will use to call the handlers
for resolving forget requests or obtaining file paths.

I think this should satisfy Andres' requirement of not teaching smgr
anything about segmentation while keeping sync-related knowledge well
below the smgr layer. This scheme works well for undo and generic block
storage, though we might have to teach immedsync to accept block
numbers so that undo and other storage managers can determine which
segment they map to.

Thoughts? I am going to get started on revising the patch unless I
hear otherwise. And thank you for the great feedback, btw.

--
Shawn Debnath
Amazon Web Services (AWS)
