v13 has added a proper test comparing the original and restored table data.
On Tue, Jan 27, 2026 at 11:43 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> Hi David
>
> Thanks for reviewing.
>
> Please hold back reviewing this v12 patch until I have verified that
> it passes CfBot and until I have finished my testing with 17TB table.
>
> I would appreciate pointers or help on turning the data correctness
> checks into real tests in the TAP tests.
> For now I have put them into the parallel test .pl as DO $$ ... $$ blocks,
> and I check manually that the table data checksums match.
> If we add them, we should add them to the other pg_dump tests as
> well. Currently we only test that dump and restore do not fail, not
> that the restored data is correct.
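>
> The manual check is essentially running something like the following in
> both the source and the restored database and comparing the output (the
> table name here is just an example):
>
>     SELECT md5(string_agg(t::text, ',' ORDER BY t)) FROM some_table t;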
>
> On Fri, Jan 23, 2026 at 3:15 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> >
> > On Fri, 23 Jan 2026 at 06:05, Hannu Krosing <hannuk(at)google(dot)com> wrote:
> > >
> > > Fixing all the warnings
> >
> > I think overall this needs significantly more care and precision than
> > what you've given it so far. For example, you have:
> >
> > + if(dopt->max_table_segment_pages != InvalidBlockNumber)
> > + appendPQExpBufferStr(query,
> > "pg_relation_size(c.oid)/current_setting('block_size')::int AS
> > relpages, ");
> > + else
> > + appendPQExpBufferStr(query, "c.relpages, ");
> >
> > Note that pg_class.relpages is "int". Later the code in master does:
> >
> > tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
>
> I have now fixed the base issue by changing the data type of
> TableInfo.relpages to BlockNumber, and also changing the way we get it:
>
> 1. converting it to unsigned int (c.relpages::oid) in the query
> 2. reading it from the result using strtoul()
>
> (technically just using strtoul() would have been enough, as it already
> wraps negative input around to the unsigned range, but converting it in
> the query seems cleaner)
>
> This allowed removing casts to (BlockNumber) everywhere where
> .relpages was used.
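>
> For reference, the read side now looks roughly like this (variable names
> as in the existing getTables() code):
>
>     tblinfo[i].relpages =
>         (BlockNumber) strtoul(PQgetvalue(res, i, i_relpages), NULL, 10);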
>
> Functionally, the value was only ever used for ordering, and even that
> only loosely, which explains why patch v10 did not break anything.
>
> I also changed the data type of TocEntry.dataLength from pgoff_t to
> uint64. The current code clearly had an overflow in the case where off_t
> was 32-bit and the sum of relpages from the heap and its TOAST table was
> larger than off_t could hold.
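>
> To illustrate (the page-count variable names here are made up):
>
>     /* with a 32-bit off_t this sum can exceed 2^31-1 and wrap negative */
>     te->dataLength = (pgoff_t) heap_pages + (pgoff_t) toast_pages;
>
>     /* as uint64 the sum of two uint32 page counts always fits */
>     te->dataLength = (uint64) heap_pages + (uint64) toast_pages;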
>
> > If you look in vacuum.c, you'll see "pgcform->relpages = (int32)
> > num_pages;" that the value stored in relpages will be negative when
> > the table is >= 16TB (assuming 8k pages). Your pg_relation_size
> > expression is not going to produce an INT. It'll produce a BIGINT, per
> > "select pg_typeof(pg_relation_size('pg_class') /
> > current_setting('block_size')::int);". So the atoi() can receive a
> > string of digits representing an integer larger than INT_MAX in this
> > case. Looking at [1], I see:
>
> As said above, this should be fixed now by using the correct type in the
> struct and strtoul().
> To be sure, I have now created a 17TB table and am running some tests on
> it as well.
> Will let you know here when done.
>
> > "7.22.1 Numeric conversion functions 1 The functions atof, atoi, atol,
> > and atoll need not affect the value of the integer expression errno on
> > an error. If the value of the result cannot be represented, *the
> > behavior is undefined.*"
> >
> > And testing locally, I see that my Microsoft compiler will just return
> > INT_MAX on overflow, whereas I see gcc does nothing to prevent
> > overflows and just continues to multiply by 10 regardless of what
> > overflows occur, which I think would just make the code work by
> > accident.
>
> As .relpages was only ever used for ordering parallel copies, it does
> work, just not optimally.
>
> The old code had a similar overflow/wraparound for the case where off_t
> is a 32-bit int and the sum of relpages from the heap and TOAST tables is
> above INT_MAX.
>
> I have removed the whole part where this was partially fixed, for the
> case when one of them was > 0x7fffffff (i.e. negative), by pinning
> dataLength to INT_MAX.
>
> > Aside from that, nothing in the documentation mentions that this is
> > for "heap" tables only. That should be mentioned as it'll just result
> > in people posting questions about why it's not working for some other
> > table access method. There's also not much care for white space.
> > You've introduced a bunch of whitespace changes unrelated to code
> > changes you've made, plus there's not much regard for following
> > project standard. For example, you commonly do "if(" and don't
> > consistently follow the bracing rules, e.g.:
> >
> > + for(chkptr = optarg; *chkptr != '\0'; chkptr++)
> > + if(*chkptr == '-')
>
> I assumed that this is the classical "single statement, no braces" case.
>
> Do we have a writeup of our coding standards somewhere?
>
> Now this specific case is rewritten using while(), so it should be OK.
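>
> It now looks roughly like this (a sketch; the error handling shown is
> assumed, not the patch's exact code):
>
>     chkptr = optarg;
>     while (*chkptr != '\0')
>     {
>         if (*chkptr == '-')
>             pg_fatal("invalid value \"%s\" for option %s", optarg, optname);
>         chkptr++;
>     }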
>
> > Things like the following help convey the level of care that's gone into this:
> >
> > +/*
> > + * option_parse_int
> > + *
> > + * Parse integer value for an option. If the parsing is successful, returns
> > + * true and stores the result in *result if that's given; if parsing fails,
> > + * returns false.
> > + */
> > +bool
> > +option_parse_uint32(const char *optarg, const char *optname,
> >
> > i.e zero effort gone in to modify the comments after pasting them from
> > option_parse_int().
> >
> > Another example:
> >
> > + pg_log_error("%s musst be in range %lu..%lu",
> >
> > Also, I have no comprehension of why you'd use uint64 for the valid
> > range when the function is for processing uint32 types in:
>
> The uint64 there I picked up from the referenced unsigned long usage
> in pg_resetwal, after I managed to get pg_log_warning to print out -1
> for the %u format and did not want to go debug why that happens.
>
> I have now made all the range arguments uint32.
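>
> i.e. the signature is now:
>
>     bool
>     option_parse_uint32(const char *optarg, const char *optname,
>                         uint32 min_range, uint32 max_range,
>                         uint32 *result);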
>
> > +bool
> > +option_parse_uint32(const char *optarg, const char *optname,
> > + uint64 min_range, uint64 max_range,
> > + uint32 *result)
> >
> > In its current state, it's quite hard to take this patch seriously.
> > Please spend longer self-reviewing it before posting. You could
> > temporarily hard-code something for testing which makes at least 1
> > table appear to be larger than 16TB and ensure your code works. What
> > you have is visually broken and depends on whatever the atoi
> > implementation opts to do in the overflow case. These are all things
> > diligent committers will be testing and it's sad to see how little
> > effort you're putting into this. How do you expect this community to
> > scale with this quality level of patch submissions? You've been around
> > long enough and should know and do better. Are you just expecting the
> > committer to fix these things for you? That work does not get done via
> > magic wand. Being on v10 already, I'd have expected the patch to be
> > far beyond proof of concept grade. If you're withholding investing
> > time on this until you see more community buy-in, then I'd suggest you
> > write that and withhold further revisions until you're happy with the
> > level of buy-in.
>
> > I'm also still not liking your de-normalised TableInfo representation
> > for "is_segment".
> > IMO, InvalidBlockNumber should be used to represent
> > open bounded ranges, and if there's no chunking, then startPage and
> > endPage will both be InvalidBlockNumber.
>
> That's what I ended up doing.
>
> I switched to using startPage = InvalidBlockNumber to indicate that no
> chunking is in effect.
>
> This is safe because when chunking is in use I always try to set both
> chunk boundary pages, and the lower bound can always be set.
>
> Only for the last chunk is the endPage left as InvalidBlockNumber.
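>
> A sketch of the resulting convention (field names as in the patch; the
> helper macro is just illustrative):
>
>     /* no chunking for this table */
>     tbinfo->startPage = InvalidBlockNumber;
>     tbinfo->endPage = InvalidBlockNumber;
>
>     /* with chunking, every chunk gets a startPage; only the last
>      * chunk's endPage stays InvalidBlockNumber (open-ended) */
>     #define TABLE_IS_CHUNKED(tbinfo) \
>         ((tbinfo)->startPage != InvalidBlockNumber)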
>
> > IMO, what you have now
> > needlessly allows invalid states where is_segment == true and
> > startPage, endPage are not set correctly. If you want to keep the code
> > simple, hide the complexity in a macro or an inline function. There's
> > just no performance reason to materialise the more complex condition
> > into a dedicated boolean flag.