Unhappy about API changes in the no-fsm-for-small-rels patch

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Unhappy about API changes in the no-fsm-for-small-rels patch
Date: 2019-04-16 18:04:52
Message-ID: 20190416180452.3pm6uegx54iitbt5@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I'm somewhat unhappy in how much the no-fsm-for-small-rels exposed
complexity that looks like it should be purely in freespacemap.c to
callers.

extern Size GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk);
-extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded);
+extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded,
+ bool check_fsm_only);

So now freespace.c has an argument that says we should only check the
fsm. That's confusing. And it's not explained to callers what that
argument means, and when it should be set.

@@ -176,20 +269,44 @@ RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
* Note that if the new spaceAvail value is higher than the old value stored
* in the FSM, the space might not become visible to searchers until the next
* FreeSpaceMapVacuum call, which updates the upper level pages.
+ *
+ * Callers have no need for a local map.
*/
void
-RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
+RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk,
+ Size spaceAvail, BlockNumber nblocks)

There's no explanation as to what that "nblocks" argument is. One
basically has to search other callers to figure it out. It's not even
clear to which fork it relates to. Nor that one can set it to
InvalidBlockNumber if one doesn't have the relation size conveniently
reachable. But it's not exposed to RecordAndGetPageWithFreeSpace(), for
a basically unexplained reason. There's a comment above
fsm_allow_writes() - but that's file-local function that external
callers basically have need to know about.

I can't figure out what "Callers have no need for a local map." is
supposed to mean.

+/*
+ * Clear the local map. We must call this when we have found a block with
+ * enough free space, when we extend the relation, or on transaction abort.
+ */
+void
+FSMClearLocalMap(void)
+{
+ if (FSM_LOCAL_MAP_EXISTS)
+ {
+ fsm_local_map.nblocks = 0;
+ memset(&fsm_local_map.map, FSM_LOCAL_NOT_AVAIL,
+ sizeof(fsm_local_map.map));
+ }
+}
+

So now there's a new function one needs to call after successfully using
the block returned by [RecordAnd]GetPageWithFreeSpace(). But it's not
referenced from those functions, so basically one has to just know that.

+/* Only create the FSM if the heap has greater than this many blocks */
+#define HEAP_FSM_CREATION_THRESHOLD 4

Hm, this seems to be tying freespace.c closer to heap than I think is
great - think of new AMs like zheap, that also want to use it.

I think this is mostly fallout about the prime issue I'm unhappy
about. There's now some global variable in freespacemap.c that code
using freespace.c has to know about and maintain.

+static void
+fsm_local_set(Relation rel, BlockNumber cur_nblocks)
+{
+ BlockNumber blkno,
+ cached_target_block;
+
+ /* The local map must not be set already. */
+ Assert(!FSM_LOCAL_MAP_EXISTS);
+
+ /*
+ * Starting at the current last block in the relation and working
+ * backwards, mark alternating blocks as available.
+ */
+ blkno = cur_nblocks - 1;

That comment explains very little about why this is done, and why it's a
good idea.

+/* Status codes for the local map. */
+
+/* Either already tried, or beyond the end of the relation */
+#define FSM_LOCAL_NOT_AVAIL 0x00
+
+/* Available to try */
+#define FSM_LOCAL_AVAIL 0x01

+/* Local map of block numbers for small heaps with no FSM. */
+typedef struct
+{
+ BlockNumber nblocks;
+ uint8 map[HEAP_FSM_CREATION_THRESHOLD];
+} FSMLocalMap;
+

Hm, given realistic HEAP_FSM_CREATION_THRESHOLD, and the fact that we
really only need one bit per relation, it seems like map should really
be just a uint32 with one bit per page.

+static bool
+fsm_allow_writes(Relation rel, BlockNumber heapblk,
+ BlockNumber nblocks, BlockNumber *get_nblocks)

+ if (rel->rd_rel->relpages != InvalidBlockNumber &&
+ rel->rd_rel->relpages > HEAP_FSM_CREATION_THRESHOLD)
+ return true;
+ else
+ skip_get_nblocks = false;
+ }

This badly needs a comment explaining that these values can be basically
arbitrarily out of date. Explaining why it's correct to rely on them
anyway (Presumably because creating an fsm unnecessarily is ok, it just
avoid susing this optimization).

+static bool
+fsm_allow_writes(Relation rel, BlockNumber heapblk,
+ BlockNumber nblocks, BlockNumber *get_nblocks)

+ RelationOpenSmgr(rel);
+ if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ return true;

Isn't this like really expensive? mdexists() closes the relations and
reopens it from scratch. Shouldn't we at the very least check
smgr_fsm_nblocks beforehand, so this is only done once?

I'm kinda thinking that this is the wrong architecture.

1) Unless I miss something, this will trigger a
RelationGetNumberOfBlocks(), which in turn ends up doing an lseek(),
once for each page we add to the relation. That strikes me as pretty
suboptimal. I think it's even worse if multiple backends do the
insertion, because the RelationGetTargetBlock(relation) logic will
succeed less often.

2) We'll repeatedly re-encounter the first few pages, because we clear
the local map after each successful RelationGetBufferForTuple().

3) The global variable based approach means that we cannot easily do
better. Even if we had a cache that lives across
RelationGetBufferForTuple() calls.

4) fsm_allow_writes() does a smgrexists() for the FSM in some
cases. That's pretty darn expensive if it's already open.

I think if we want to keep something like this feature, we'd need to
cache the relation size in a variable similar to how we cache the FSM
size (like SMgrRelationData.smgr_fsm_nblocks) *and* stash the bitmap of
pages that we think are used/free as a bitmap somewhere below the
relcache. If we cleared that variable at truncations, I think we should
be able to make that work reasonably well?

Greetings,

Andres Freund

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-04-16 18:11:44 Re: hyrax vs. RelationBuildPartitionDesc
Previous Message Tom Lane 2019-04-16 17:35:25 Re: hyrax vs. RelationBuildPartitionDesc