Re: Portal->commandTag as an enum

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Portal->commandTag as an enum
Date: 2020-02-19 21:16:30
Message-ID: DD6EF499-E5DF-44A0-8A9B-06B775062076@enterprisedb.com
Lists: pgsql-hackers

> On Feb 19, 2020, at 3:31 AM, John Naylor <john(dot)naylor(at)2ndquadrant(dot)com> wrote:
>
> Hi Mark,

Hi John, thanks for reviewing!

> On Wed, Feb 19, 2020 at 10:40 AM Mark Dilger
> <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>> This would only make sense to me if the string held in $_ had already been checked for safety, but Catalog.pm does very little to verify that the string is safe to eval. The assumption here, so far as I can infer, is that we don’t embed anything dangerous in our .dat files, so this should be ok. That may be true for the moment, but I can imagine a day when we start embedding perl functions as quoted text inside a data file, or shell commands as text which look enough like perl for eval() to be able to execute them. Developers who edit these .dat files and mess up the quoting, and then rerun ‘make’ to get the new .c and .h files generated, may not like the side effects. Perhaps I’m being overly paranoid….
>
> The use case for that seems slim. However, at a brief glance your
> module seems more robust in other ways.
>
>> Rather than add more code generation logic based on the design of Catalog.pm, I wrote a perl based data file parser that parses .dat files and returns vivified perl data, as Catalog.pm does, but with much stricter parsing logic to make certain nothing dangerous gets eval()ed. I put the new module in DataFile.pm.
>> [...]
>> The new parser is more flexible about the structure of the data, which seems good to me for making it easier to add or modify data files in the future. The new parser does not yet have a means of hacking up the data to add autogenerated fields and such that Catalog.pm does, but I think a more clean break between parsing and autovivifying fields would be good anyway.
>
> Separation of concerns sounds like a good idea, but I've not fully
> thought it through. For one advantage, I think it might be nicer to
> have indexing.dat and toasting.dat instead of having to dig the info
> out of C macros. This was evident while recently experimenting with
> generating catcache control data.

I guess you mean the DECLARE_UNIQUE_INDEX and DECLARE_TOAST macros. I don't mind converting those to .dat files, though I'm mindful of Tom's concern, expressed early in this thread, about the amount of code churn. Is there sufficient demand for refactoring this stuff? The conversation below surfaces more reasons to refactor the perl modules and code generation scripts.
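To make that concrete, an indexing.dat entry might look something like the following, mirroring the style of the existing catalog .dat files. The field names here are hypothetical, not from any patch:

    # Hypothetical entry replacing a DECLARE_UNIQUE_INDEX macro;
    # field names are illustrative only
    { name => 'pg_class_oid_index', table => 'pg_class',
      oid => '2662', is_unique => 't',
      definition => 'btree (oid oid_ops)' },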

> As for the patch, I have not done a full review, but I have some
> comments based on a light read-through:
>
> utils/Makefile:
>
> +# location of commandtag.dat
> +headerdir = $(top_srcdir)/src/include/utils
>
> This variable name is too generic for what the comment says it is. A
> better abstraction, if we want one, would be the full path to the
> commandtag input file. The other script invocations in this Makefile
> don't do it this way, but that's a separate patch.
>
> +# location to write generated headers
> +sourcedir = $(top_srcdir)/src/backend/utils
>
> Calling the output the source is bound to confuse people. The comment
> implies all generated headers, not just the ones introduced by the
> patch. I would just output to the current directory (i.e. have an
> --output option with a default empty string). Also, if we want to
> output somewhere else, I would imagine it'd be under the top builddir,
> not srcdir.
>
> +$(PERL) -I $(top_srcdir)/src/include/utils $<
> --headerdir=$(headerdir) --sourcedir=$(sourcedir)
> --inputfile=$(headerdir)/commandtag.dat
>
> 1. headerdir is entirely unused by the script
> 2. We can default to working dir for the output as mentioned above
> 3. -I $(top_srcdir)/src/include/utils is supposed to point to the dir
> containing DataFile.pm, but since gencommandtag.pl has "use lib..."
> it's probably not needed here. I don't recall why we keep the "-I"
> elsewhere. (ditto in Solution.pm)
>
> I'm thinking it would look something like this:
>
> +$(PERL) $< --inputfile=$(top_srcdir)/src/include/utils/commandtag.dat\

I have taken all this advice in v5 of the patch. --inputfile and --outputdir (previously named --sourcedir) are now optional with the defaults as you suggested.
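For the archives, the option handling now has roughly this shape (a simplified sketch, not the exact patch code; defaults are as described above):

    use Getopt::Long;

    my $inputfile = 'commandtag.dat';    # default input file
    my $outputdir = '.';                 # default: current directory
    GetOptions(
        'inputfile=s' => \$inputfile,
        'outputdir=s' => \$outputdir) or die "$0: invalid options\n";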

> --
> utils/misc/Makefile
>
> +all: distprep
> +
> # Note: guc-file.c is not deleted by 'make clean',
> # since we want to ship it in distribution tarballs.
> clean:
> @rm -f lex.yy.c
> +
> +maintainer-clean: clean
>
> Seems non-functional.

Yeah, I also had an unnecessary addition to .gitignore in that directory. I had originally placed the commandtag stuff here before moving it one directory up. Thanks for catching that.

> --
> DataFiles.pm
>
> I haven't studied this in detail, but I'll suggest that if this meant
> to have wider application, maybe it should live in src/tools ?

We don't seem to have a standard place for perl modules. src/test/perl has some that are specifically for tap testing, and src/backend/catalog has some for catalog data file processing. I put DataFile.pm in src/backend/catalog because that's where most data file processing is currently located. src/tools has PerfectHash.pm, and a bunch of Windows-specific modules under src/tools/msvc.

> I'm not familiar with using different IO routines depending on the OS
> -- what's the benefit of that?

I think you are talking about the slurp_file routine. That came directly from the TestLib.pm module. I don't have enough Perl-on-Windows experience to say why it does things that way. I was reluctant to have DataFile.pm 'use TestLib', since DataFile has absolutely nothing to do with regression testing. I don't like copying the function either, but I chose that as the lesser evil. Which is more evil is debatable.

src/test/perl/ contains SimpleTee.pm and RecursiveCopy.pm, neither of which contains functionality limited to just testing. I think they could be moved to src/tools. src/test/perl/TestLib.pm contains a mixture of testing-specific functions and more general purpose functions. For instance, TestLib.pm contains functions to read in a file or a directory (slurp_file(filepath) and slurp_dir(dirpath), respectively). I think we should have just one implementation of those, in just one place. Neither TestLib nor DataFile seems appropriate, nor does src/test/perl seem right. I checked whether Perl ships with core module support for this and didn't find anything. There is a cpan module named File::Slurp, but it is not a core module so far as I can tell, and it does more than we want.
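For reference, the portable core of slurp_file is tiny; ignoring the Windows-specific branch, it is roughly:

    sub slurp_file
    {
        my ($filename) = @_;
        local $/;    # slurp mode: read the whole file in one go
        open(my $in, '<', $filename)
          or die "could not read \"$filename\": $!";
        my $contents = <$in>;
        close $in;
        return $contents;
    }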

Should I submit a separate patch refactoring the location of perl modules and functions of general interest into src/tools? src/tools/perl?

I am not changing DataFile.pm's duplicate copy of slurp_file in v5 of the patch, as I don't yet know the best way to approach the problem. I expect there will have to be a v6 once this has been adequately debated.

> --
> gencommandtag.pl
>
> slurp_without_comments() is unused.

Right. An earlier version of gencommandtag.pl didn't use DataFile.pm, and I neglected to remove this function when I transitioned to using DataFile.pm. Thanks for noticing!

> sanity_check_data() seems longer than the main body of the script
> (minus header boilerplate), and I wonder if we can pare it down some.
> For one, I have difficulty imagining anyone would accidentally type an
> unprintable or non-ascii character in a command tag and somehow not
> realize it.

I'm uncertain about that. There is logic in EndCommand in tcop/dest.c that specifically warns that no encoding conversion will be performed, on the assumption that command tags contain only 7-bit ascii. I think that's a perfectly reasonable assumption in the C code, but it needs to be checked by gencommandtag.pl, because the bugs that might ensue from inserting an accented character or whatever could be subtle enough not to be caught right away. Such mistakes only get easier as time goes by: editors that silently turn quotes into "smart quotes" are increasingly common, and as PostgreSQL supports non-ascii in more and more places, it may become less obvious to a contributor some years hence that an accented character cannot go into a command tag. (Compare, for example, that it used to be widely accepted that you shouldn't put spaces and hyphens into file names, but now a fair number of programmers will do that without flinching.)

As for checking for unprintable characters, the case is weaker. I'm not too motivated to remove the check, though.
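For what it's worth, the check boils down to something like this (a sketch; the actual variable names in gencommandtag.pl may differ):

    # Reject anything outside printable 7-bit ascii, since EndCommand
    # sends tags over the wire without encoding conversion.
    die "invalid character in command tag \"$tag\"\n"
      if $tag =~ /[^\x20-\x7E]/;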

> For another, duplicating checks that were done earlier
> seems like a maintenance headache.

Hmmm. As long as gencommandtag.pl is the only user of DataFile.pm, I'm inclined to agree that we're double-checking some things; the code comments I wrote certainly say so. But if DataFile.pm gets wider adoption, it might start to accept more varied input, and then gencommandtag.pl will need to enforce its own validation. There is also the distinction between checking that the input data file meets the syntax requirements of the *parser* and making certain that the vivified perl structures meet the semantic requirements of the *code generator*. Today you may be able to argue that meeting the first guarantees meeting the second, but that can't be expected to hold indefinitely.

It would be easier to decide these matters if we knew whether commandtag logic will ever be removed and whether DataFile will ever gain wider adoption for code generation purposes....

> dataerror() is defined near the top, but other functions are defined
> at the bottom.

Moved.

> +# Generate all output internally before outputting anything, to avoid
> +# partially overwriting generated files under error conditions
>
> My personal preference is, having this as a design requirement
> sacrifices readability for unclear gain, especially since a "chunk"
> also includes things like header boilerplate. That said, the script is
> also short enough that it doesn't make a huge difference either way.

Catalog.pm writes a temporary file and then moves it to the final file name at the end. DataFile buffers the output and only writes it after all the code generation has succeeded. There is no principled basis for these two modules tackling the same problem in two different ways. Perhaps that's another argument for pulling this kind of functionality out of random places and consolidating it in one or more modules in src/tools.
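For comparison, Catalog.pm's approach boils down to roughly the following (variable names here are illustrative):

    # Write to a temp file, then rename into place, so a failed run
    # cannot leave a truncated generated file behind.
    open(my $tmp, '>', "$output_file.tmp")
      or die "could not write \"$output_file.tmp\": $!";
    print $tmp $generated_text;
    close $tmp;
    rename("$output_file.tmp", $output_file)
      or die "could not rename into place \"$output_file\": $!";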

> Speaking of boilerplate, it's better for readability to separate that
> from actual code such as:
>
> typedef enum CommandTag
> {
> #define FIRST_COMMANDTAG COMMANDTAG_$sorted[0]->{taglabel}

Good idea. While I was doing this, I also consolidated the duplicated boilerplate into just one function. I think this function, too, should go in just one perl module somewhere. See boilerplate_header() for details.
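Sketched out, the consolidated helper has roughly this shape (the version in the patch may differ in detail):

    sub boilerplate_header
    {
        my ($filename) = @_;
        return "/*-------------------------------------------------------\n"
          . " * $filename\n"
          . " *    Generated by gencommandtag.pl; do not edit by hand!\n"
          . " *-------------------------------------------------------\n"
          . " */\n";
    }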

> --
> tcop/dest.c
>
> + * We no longer display LastOid, but to preserve the wire protocol,
> + * we write InvalidOid where the LastOid used to be written. For
> + * efficiency in the snprintf(), hard-code InvalidOid as zero.
>
> Hmm, is hard-coding zero going to make any difference here?

Part of the value of refactoring the commandtag logic is to make it easier to remove the whole ugly mess later. Having snprintf write the Oid into the string obfuscates the stupidity of what is really being done here. Putting the zero directly into the format string makes it clearer, to my eyes, that nothing clever is afoot.

I have removed the sentence about efficiency. Thanks for mentioning it.

Attachment: v5-0001-Migrating-commandTag-from-string-to-enum.patch (application/octet-stream, 144.5 KB)
