Re: Add LZ4 compression in pg_dump

From: gkokolatos(at)pm(dot)me
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Rachel Heaton <rachelmheaton(at)gmail(dot)com>
Subject: Re: Add LZ4 compression in pg_dump
Date: 2023-03-17 15:43:58
Message-ID: XPMIIorJ5munrlLRZGquartiN5tuapdjiBhaPZ6iEmJccRiKWCIHSowSLPhaGj4gLpqvaS1pQn7PGkA_WNwvuUTVdl-lrU_TzHA81LqujJ4=@pm.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

------- Original Message -------
On Thursday, March 16th, 2023 at 10:20 PM, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:

>
>
>
>
> On 3/16/23 18:04, gkokolatos(at)pm(dot)me wrote:
>
> > ------- Original Message -------
> > On Tuesday, March 14th, 2023 at 4:32 PM, Tomas Vondra tomas(dot)vondra(at)enterprisedb(dot)com wrote:
> >
> > > On 3/14/23 16:18, gkokolatos(at)pm(dot)me wrote:
> > >
> > > > ...> Would you mind me trying to come with a patch to address your points?
> > >
> > > That'd be great, thanks. Please keep it split into smaller patches - two
> > > might work, with one patch for "cosmetic" changes and the other tweaking
> > > the API error-handling stuff.
> >
> > Please find attached a set for it. I will admit that the splitting in the
> > series might not be ideal and what you requested. It is split on what
> > seemed as a logical units. Please advice how a better split can look like.
> >
> > 0001 is unifying types and return values on the API
> > 0002 is addressing the constant definitions
> > 0003 is your previous 0004 adding comments
>
>
> Thanks. I think the split seems reasonable - the goal was to not mix
> different changes, and from that POV it works.
>
> I'm not sure I understand the Gzip_read/Gzip_write changes in 0001. I
> mean, gzread/gzwrite returns int, so how does renaming the size_t
> variable solve the issue of negative values for errors? I mean, this
>
> - size_t ret;
> + size_t gzret;
>
> - ret = gzread(gzfp, ptr, size);
> + gzret = gzread(gzfp, ptr, size);
>
> means we still lost the information gzread() returned a negative value,
> no? We'll still probably trigger an error, but it's a bit weird.

You are obviously correct. My bad, I miss-read the return type of gzread().

Please find an amended version attached.

> Unless I'm missing something, if gzread() ever returns -1 or some other
> negative error value, we'll cast it to size_t, while condition will
> evaluate to "true" and we'll happily chew on some random chunk of data.
>
> So the confusion is (at least partially) a preexisting issue ...
>
> For gzwrite() it seems to be fine, because that only returns 0 on error.
> OTOH it's defined to take 'int size' but then we happily pass size_t
> values to it.
>
> As I wrote earlier, this apparently assumes we never need to deal with
> buffers larger than int, and I don't think we have the ambition to relax
> that (I'm not sure it's even needed / possible).

Agreed.

> I see the read/write functions are now defined as int, but we only ever
> return 0/1 from them, and then interpret that as bool. Why not to define
> it like that? I don't think we need to adhere to the custom that
> everything returns "int". This is an internal API. Or if we want to
> stick to int, I'd define meaningful "nice" constants for 0/1.

The return types are now booleans and the callers have been made aware.

> 0002 seems fine to me. I see you've ditched the idea of having two
> separate buffers, and replaced them with DEFAULT_IO_BUFFER_SIZE. Fine
> with me, although I wonder if this might have negative impact on
> performance or something (but I doubt that).
>

I doubt that too. Thank you.

> 0003 seems fine too.

Thank you.

> > As far as the error handling is concerned, you had said upthread:
> >
> > > I think the right approach is to handle all library errors and not just
> > > let them through. So Gzip_write() needs to check the return value, and
> > > either call pg_fatal() or translate it to an error defined by the API.
> >
> > While working on it, I thought it would be clearer and more consistent
> > for the pg_fatal() to be called by the caller of the individual functions.
> > Each individual function can keep track of the specifics of the error
> > internally. Then the caller upon detecting that there was an error by
> > checking the return value, can call pg_fatal() with a uniform error
> > message and then add the specifics by calling the get_error_func().
>
>
> I agree it's cleaner the way you did it.
>
> I was thinking that with each compression function handling error
> internally, the callers would not need to do that. But I haven't
> realized there's logic to detect ENOSPC and so on, and we'd need to
> duplicate that in every compression func.
>

If you agree, I can prepare a patch to improve on the error handling
aspect of the API as a separate thread, since here we are trying to
focus on correctness.

Cheers,
//Georgios

>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Attachment Content-Type Size
v3-0003-Improve-compress_lz4-documentation.patch text/x-patch 3.0 KB
v3-0002-Clean-up-constants-in-pg_dump-s-compression-API.patch text/x-patch 6.2 KB
v3-0001-Improve-type-handling-in-pg_dump-s-compress-file-.patch text/x-patch 24.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Paul Jungwirth 2023-03-17 16:03:09 Re: Exclusion constraints on partitioned tables
Previous Message Zheng Li 2023-03-17 15:40:56 Re: Support logical replication of DDLs