Re: trying again to get incremental backup

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: trying again to get incremental backup
Date: 2023-10-25 14:33:49
Message-ID: 0387d6be-276b-4f37-b360-fdc2eddc17c5@dunslane.net
Lists: pgsql-hackers


On 2023-10-25 We 09:05, Robert Haas wrote:
> On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>> Robert asked me to work on this quite some time ago, and most of this
>> work was done last year.
>>
>> Here's my WIP for an incremental JSON parser. It works and passes all
>> the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
>> The reason I haven't posted it before is that it's about 50% slower in
>> pure parsing speed than the current recursive descent parser in my
>> testing. I've tried various things to make it faster, but haven't made
>> much impact. One of my colleagues is going to take a fresh look at it,
>> but maybe someone on the list can see where we can save some cycles.
>>
>> If we can't make it faster, I guess we could use the RD parser for
>> non-incremental cases and only use the non-RD parser for incremental,
>> although that would be a bit sad. However, I don't think we can make the
>> RD parser suitable for incremental parsing - there's too much state
>> involved in the call stack.
> Yeah, this is exactly why I didn't want to use JSON for the backup
> manifest in the first place. Parsing such a manifest incrementally is
> complicated. If we'd gone with my original design where the manifest
> consisted of a bunch of lines each of which could be parsed
> separately, we'd already have incremental parsing and wouldn't be
> faced with these difficult trade-offs.
>
> Unfortunately, I'm not in a good position either to figure out how to
> make your prototype faster, or to evaluate how painful it is to keep
> both in the source tree. It's probably worth considering how likely it
> is that we'd be interested in incremental JSON parsing in other cases.
> Maintaining two JSON parsers is probably not a lot of fun regardless,
> but if each of them gets used for a bunch of things, that feels less
> bad than if one of them gets used for a bunch of things and the other
> one only ever gets used for backup manifests. Would we be interested
> in JSON-format database dumps? Incrementally parsing JSON LOBs? Either
> seems tenuous, but those are examples of the kind of thing that could
> make us happy to have incremental JSON parsing as a general facility.
>
> If nobody's very excited by those kinds of use cases, then this just
> boils down to whether we want to (a) accept that users with very large
> numbers of relation files won't be able to use pg_verifybackup or
> incremental backup, (b) accept that we're going to maintain a second
> JSON parser just to enable that use case and with no other benefit, or
> (c) undertake to change the manifest format to something that is
> straightforward to parse incrementally. I think (a) is reasonable
> short term, but at some point I think we should do better. I'm not
> really that enthused about (c) because it means more work for me and
> possibly more arguing, but if (b) is going to cause a lot of hassle
> then we might need to consider it.

I'm not too worried about the maintenance burden. The RD routines were
added in March 2013 (commit a570c98d7fa) and have hardly changed since
then. The new code is not ground-breaking - it's just a different (and
fairly well known) way of doing the same thing. I'd be happier if we
could make it faster, but maybe it's just a fact that keeping an
explicit stack, which is how this works, is slower.
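
To illustrate the point about the explicit stack, here is a toy sketch in
Python. It is not the patch and not a JSON parser: it only checks that braces
and brackets nest correctly, and the names (NestingChecker, feed, complete)
are made up for the example. What it is meant to show is that once the
nesting state lives in an object rather than in the call stack, input can be
handed over in arbitrary chunks, even ones that split a string literal.

```python
class NestingChecker:
    """Toy incremental checker for JSON-style nesting (illustrative only)."""

    PAIRS = {"{": "}", "[": "]"}

    def __init__(self):
        self.stack = []          # expected closing delimiters, innermost last
        self.in_string = False   # currently inside a "..." literal
        self.escaped = False     # previous character was a backslash in a string

    def feed(self, chunk):
        """Consume one chunk; all state survives until the next call."""
        for ch in chunk:
            if self.in_string:
                if self.escaped:
                    self.escaped = False
                elif ch == "\\":
                    self.escaped = True
                elif ch == '"':
                    self.in_string = False
            elif ch == '"':
                self.in_string = True
            elif ch in self.PAIRS:
                self.stack.append(self.PAIRS[ch])
            elif ch in "}]":
                if not self.stack or self.stack.pop() != ch:
                    raise ValueError(f"unexpected {ch!r}")

    def complete(self):
        """True once every container and string opened so far is closed."""
        return not self.stack and not self.in_string


checker = NestingChecker()
# Chunk boundaries fall inside a string and just after an escape character.
for chunk in ['{"a": [1, {"b": "x]', '\\"y"}', '], "c": 2', '}']:
    checker.feed(chunk)
print(checker.complete())
```

A recursive descent version of the same check would hold the equivalent of
"stack" in nested function calls, so it could not return to its caller in the
middle of a chunk and pick up again later.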

I wouldn't at all be surprised if there were other good uses for
incremental JSON parsing, including some you've identified.

That said, I agree that JSON might not be the best format for backup
manifests, but maybe that ship has sailed.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
