Re: Patch: dumping tables data in multiple chunks in pg_dump

From: Hannu Krosing <hannuk(at)google(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>
Subject: Re: Patch: dumping tables data in multiple chunks in pg_dump
Date: 2026-01-27 22:43:44
Message-ID: CAMT0RQQT1dQTPj4etRTc0877mirCPVKzjbF_U5KRAyPwhHMr0Q@mail.gmail.com
Lists: pgsql-hackers

Hi David

Thanks for reviewing.

Please hold off on reviewing this v12 patch until I have verified that
it passes the CfBot and until I have finished my testing with a 17TB table.

I would appreciate pointers or help on turning the data correctness
checks into real tests in the TAP test suite.
For now I have put them as DO $$ ... $$ blocks in the parallel test .pl,
and I check manually that the table data checksums match.
If we add them, we should add them to the other pg_dump tests as well.
Currently we only test that dump and restore do not fail, not that the
restored data is correct.

On Fri, Jan 23, 2026 at 3:15 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>
> On Fri, 23 Jan 2026 at 06:05, Hannu Krosing <hannuk(at)google(dot)com> wrote:
> >
> > Fixing all the warnings
>
> I think overall this needs significantly more care and precision than
> what you've given it so far. For example, you have:
>
> + if(dopt->max_table_segment_pages != InvalidBlockNumber)
> + appendPQExpBufferStr(query,
> "pg_relation_size(c.oid)/current_setting('block_size')::int AS
> relpages, ");
> + else
> + appendPQExpBufferStr(query, "c.relpages, ");
>
> Note that pg_class.relpages is "int". Later the code in master does:
>
> tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));

I have now fixed the underlying issue by changing the data type of
TableInfo.relpages to BlockNumber, and also changed the way we fetch it
by

1. converting it to unsigned int ( c.relpages::oid ) in the query
2. reading it from the result using strtoul()

(Technically it would have been enough to just use strtoul(), as it
already wraps signed ints to unsigned ones, but having the conversion
in the query seems cleaner.)

This allowed removing the casts to (BlockNumber) everywhere
.relpages was used.
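
In code, the two changes look roughly like this (a sketch against
master's getTables(), not the exact patch hunks):

    /* 1. in the query: cast to unsigned so page counts >= 2^31 survive */
    appendPQExpBufferStr(query, "c.relpages::oid AS relpages, ");

    /* 2. when reading the result: BlockNumber is an unsigned 32-bit type */
    tblinfo[i].relpages = (BlockNumber)
        strtoul(PQgetvalue(res, i, i_relpages), NULL, 10);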

Functionally, the value was only ever used for ordering, and even that
only loosely, which explains why patch v10 did not break anything.

I also changed the data type of TocEntry.dataLength from pgoff_t to
uint64. The current code clearly had an overflow when off_t was 32
bits and the sum of relpages from heap and TOAST was larger than it
could hold.
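
A minimal illustration of that wraparound (hypothetical standalone
variables, not the actual pg_dump lines):

    /* heap_relpages and toast_relpages are stand-ins; with a 32-bit
     * off_t, pgoff_t is a signed 32-bit type, so this sum wraps to a
     * negative value once it exceeds INT_MAX */
    pgoff_t dataLength = (pgoff_t) heap_relpages + (pgoff_t) toast_relpages;

    /* with uint64 the estimate gets the full unsigned 64-bit range */
    uint64 dataLength64 = (uint64) heap_relpages + (uint64) toast_relpages;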

> If you look in vacuum.c, you'll see "pgcform->relpages = (int32)
> num_pages;" that the value stored in relpages will be negative when
> the table is >= 16TB (assuming 8k pages). Your pg_relation_size
> expression is not going to produce an INT. It'll produce a BIGINT, per
> "select pg_typeof(pg_relation_size('pg_class') /
> current_setting('block_size')::int);". So the atoi() can receive a
> string of digits representing an integer larger than INT_MAX in this
> case. Looking at [1], I see:

As said above, this should be fixed now by using the correct type in
the struct and strtoul().
To be sure, I have created a 17TB table and am running some tests on
it as well.
I will report back here when done.

> "7.22.1 Numeric conversion functions 1 The functions atof, atoi, atol,
> and atoll need not affect the value of the integer expression errno on
> an error. If the value of the result cannot be represented, *the
> behavior is undefined.*"
>
> And testing locally, I see that my Microsoft compiler will just return
> INT_MAX on overflow, whereas I see gcc does nothing to prevent
> overflows and just continues to multiply by 10 regardless of what
> overflows occur, which I think would just make the code work by
> accident.

As .relpages was only ever used for ordering the parallel copies, it
did work, just not optimally.

The old code has a similar overflow/wraparound when off_t is a 32-bit
int and the sum of relpages from the heap and TOAST tables is above
INT_MAX.

I have removed the part where this was partially worked around for the
case when one of them was > 0x7fffffff (i.e. negative) by pinning
dataLength to INT_MAX.
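
The removed workaround was roughly of this shape (paraphrased, not the
exact lines):

    /* old band-aid: if relpages read back as negative (i.e. the real
     * page count was > 0x7fffffff), pin the estimate instead of fixing
     * the types */
    if (tbinfo->relpages < 0)
        te->dataLength = INT_MAX;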

> Aside from that, nothing in the documentation mentions that this is
> for "heap" tables only. That should be mentioned as it'll just result
> in people posting questions about why it's not working for some other
> table access method. There's also not much care for white space.
> You've introduced a bunch of whitespace changes unrelated to code
> changes you've made, plus there's not much regard for following
> project standard. For example, you commonly do "if(" and don't
> consistently follow the bracing rules, e.g:
>
> + for(chkptr = optarg; *chkptr != '\0'; chkptr++)
> + if(*chkptr == '-')

I assumed it was the classical "single statement -- no braces" rule.

Do we have a writeup of our coding standards somewhere?

This specific case is now rewritten using while(), so it should be OK.
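
The reworked loop is roughly of this shape (a sketch; the pg_fatal
message wording is illustrative, not the patch's exact text):

    const char *chkptr = optarg;

    while (*chkptr != '\0')
    {
        /* a minus sign means a negative value, which we reject up front */
        if (*chkptr == '-')
            pg_fatal("--max-table-segment-pages must not be negative");
        chkptr++;
    }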

> Things like the following help convey the level of care that's gone into this:
>
> +/*
> + * option_parse_int
> + *
> + * Parse integer value for an option. If the parsing is successful, returns
> + * true and stores the result in *result if that's given; if parsing fails,
> + * returns false.
> + */
> +bool
> +option_parse_uint32(const char *optarg, const char *optname,
>
> i.e. zero effort gone into modifying the comments after pasting them from
> option_parse_int().
>
> Another example:
>
> + pg_log_error("%s musst be in range %lu..%lu",
>
> Also, I have no comprehension of why you'd use uint64 for the valid
> range when the function is for processing uint32 types in:

The uint64 there I picked up from the referenced long unsigned usage
in pg_resetwal, after I managed to get pg_log_warning to print out -1
for format %u and did not want to go debugging why that happens.

I have now made all the arguments uint32.
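
With all the arguments uint32, the declaration now looks like this
(comment wording adapted from option_parse_int() in
fe_utils/option_utils.c; a sketch, not the exact hunk):

    /*
     * option_parse_uint32
     *
     * Parse an unsigned 32-bit integer value for an option.  If the
     * parsing is successful, returns true and stores the result in
     * *result if that's given; if parsing fails, returns false.
     */
    bool
    option_parse_uint32(const char *optarg, const char *optname,
                        uint32 min_range, uint32 max_range,
                        uint32 *result);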

> +bool
> +option_parse_uint32(const char *optarg, const char *optname,
> + uint64 min_range, uint64 max_range,
> + uint32 *result)
>
> In its current state, it's quite hard to take this patch seriously.
> Please spend longer self-reviewing it before posting. You could
> temporarily hard-code something for testing which makes at least 1
> table appear to be larger than 16TB and ensure your code works. What
> you have is visually broken and depends on whatever the atoi
> implementation opts to do in the overflow case. These are all things
> diligent committers will be testing and it's sad to see how little
> effort you're putting into this. How do you expect this community to
> scale with this quality level of patch submissions? You've been around
> long enough and should know and do better. Are you just expecting the
> committer to fix these things for you? That work does not get done via
> magic wand. Being on v10 already, I'd have expected the patch to be
> far beyond proof of concept grade. If you're withholding investing
> time on this until you see more community buy-in, then I'd suggest you
> write that and withhold further revisions until you're happy with the
> level of buy-in.

> I'm also still not liking your de-normalised TableInfo representation
> for "is_segment".
> IMO, InvalidBlockNumber should be used to represent
> open bounded ranges, and if there's no chunking, then startPage and
> endPage will both be InvalidBlockNumber.

That's what I ended up doing.

I switched to using startPage = InvalidBlockNumber to indicate that no
chunking is in effect.

This is safe because when chunking is in use the lower bound is always
set, and I set both bounds for every chunk except the last one, whose
endPage is left as InvalidBlockNumber.

> IMO, what you have now
> needlessly allows invalid states where is_segment == true and
> startPage, endPage are not set correctly. If you want to keep the code
> simple, hide the complexity in a macro or an inline function. There's
> just no performance reason to materialise the more complex condition
> into a dedicated boolean flag.
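
Agreed. With the sentinel convention the is_segment flag is gone, and
the condition can live in a small helper (a sketch; the helper name is
mine):

    /* chunking is in effect iff the lower bound is set */
    static inline bool
    table_is_chunk(const TableInfo *tbinfo)
    {
        return tbinfo->startPage != InvalidBlockNumber;
    }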

Attachment Content-Type Size
v12-0001-changed-flag-name-to-max-table-segment-pages.patch text/x-patch 21.9 KB
