Re: Unlogged relations and WAL-logging

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged relations and WAL-logging
Date: 2023-08-23 14:40:32
Message-ID: 0fc6970f-b24d-1f3f-0ebc-2b9a0b14a497@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 07/07/2023 18:21, Heikki Linnakangas wrote:
> On 28/01/2022 15:57, Robert Haas wrote:
>>> 4. Also, the smgrwrite() calls are performed before WAL-logging the
>>> pages, so the page that's written to disk has 0/0 as the LSN, not the
>>> LSN of the WAL record. That's harmless too, but seems a bit sloppy.
>>
>> That is also a mistake on my part.
>
> I'm still sitting on these fixes. I think the patch I posted still makes
> sense, but I got carried away with a more invasive approach that
> introduces a whole new set of functions for bulk-creating a relation,
> which would handle WAL-logging, smgrimmedsync() and all that (see
> below). We have some repetitive, error-prone code in all the index build
> functions for that. But that's not backpatchable, so I'll rebase the
> original approach next week.

Committed this fix to master and v16. Didn't seem worth backpatching
further than that, given that there is no live user-visible issue here.

>>> 5. In heapam_relation_set_new_filenode(), we do this:
>>>
>>>>
>>>> /*
>>>> * If required, set up an init fork for an unlogged table so that it can
>>>> * be correctly reinitialized on restart. An immediate sync is required
>>>> * even if the page has been logged, because the write did not go through
>>>> * shared_buffers and therefore a concurrent checkpoint may have moved the
>>>> * redo pointer past our xlog record. Recovery may as well remove it
>>>> * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
>>>> * record. Therefore, logging is necessary even if wal_level=minimal.
>>>> */
>>>> if (persistence == RELPERSISTENCE_UNLOGGED)
>>>> {
>>>> Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
>>>> rel->rd_rel->relkind == RELKIND_MATVIEW ||
>>>> rel->rd_rel->relkind == RELKIND_TOASTVALUE);
>>>> smgrcreate(srel, INIT_FORKNUM, false);
>>>> log_smgrcreate(newrnode, INIT_FORKNUM);
>>>> smgrimmedsync(srel, INIT_FORKNUM);
>>>> }
>>>
>>> The comment doesn't make much sense, we haven't written nor WAL-logged
>>> any page here, with nor without the buffer cache. It made more sense
>>> before commit fa0f466d53.
>>
>> Well, it seems to me (and perhaps I am just confused) that complaining
>> that there's no page written here might be a technicality. The point
>> is that there's no synchronization between the work we're doing here
>> -- which is creating a fork, not writing a page -- and any concurrent
>> checkpoint. So we both need to log it, and also sync it immediately.
>
> I see. I pushed the fix from the other thread that makes smgrcreate()
> call register_dirty_segment (commit 4b4798e13). I believe that makes
> this smgrimmedsync() unnecessary. If a concurrent checkpoint happens
> with a redo pointer greater than this WAL record, it must've received
> the fsync request created by smgrcreate(). That depends on the fact that
> we write the WAL record *after* smgrcreate(). Subtle..
>
> Hmm, we have a similar smgrimmedsync() call after index build, because
> we have written pages directly with smgrextend(skipFsync=true). If no
> checkpoints have occurred during the index build, we could call
> register_dirty_segment() instead of smgrimmedsync(). That would avoid
> the fsync() latency when creating an index on an empty or small index.
>
> This is all very subtle to get right though. That's why I'd like to
> invent a new bulk-creation facility that would handle this stuff, and
> make the callers less error-prone.

Having a more generic and less error-prone bulk-creation mechanism is
still on my long TODO list..

--
Heikki Linnakangas
Neon (https://neon.tech)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2023-08-23 14:43:42 using SysCacheGetAttrNotNull in place of heap_getattr
Previous Message Pavel Stehule 2023-08-23 14:04:20 Re: Schema variables - new implementation for Postgres 15