Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: John Naylor <jcnaylor(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Date: 2018-12-27 18:57:04
Message-ID: 8228.1545937024@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

John Naylor <jcnaylor(at)gmail(dot)com> writes:
> Barring additional bikeshedding on 0001, I'll plan on implementing
> offset-based lookup for the other keyword types and retire the old
> ScanKeyword. Once that's done, we can benchmark and compare with the
> optimizations in 0002.

Sounds like a plan.

Assorted minor bikeshedding on v4-0001 (just from eyeballing it, I didn't
test it):

+/* Like ScanKeywordLookup, but uses offsets into a keyword string. */
+int
+ScanKeywordLookupOffset(const char *string_to_lookup,
+ const char *kw_strings,

Not really "like" it, since the return value is totally different and
so is the representation of the keyword list. I realize that your
plan is probably to get rid of ScanKeywordLookup and then adapt the
latter's comment for this code, but don't forget that you need to
adjust said comment.

+/* Payload data for keywords */
+typedef struct ScanKeywordAux
+{
+ int16 value; /* grammar's token code */
+ char category; /* see codes above */
+} ScanKeywordAux;

There isn't really any point in changing category to "char", because
alignment considerations will mandate that sizeof(ScanKeywordAux) be
a multiple of 2 anyway. With some compilers we could get around that
with a pragma to force non-aligned storage, but doing so would be a
net loss on most non-Intel architectures.

If you really are hot about saving that other 440 bytes, the way to
do it would be to drop the struct entirely and use two parallel
arrays, an int16[] for value and a char[] (or better uint8[]) for
category. Those would be filled by reading kwlist.h twice with
different definitions for PG_KEYWORD. Not sure it's worth the
trouble though --- in particular, not clear that it's a win from
the standpoint of number of cache lines touched.

diff --git a/src/pl/plpgsql/src/.gitignore b/src/pl/plpgsql/src/.gitignore
@@ -1,3 +1,4 @@
+/*kwlist_d.h

Not a fan of using wildcards in .gitignore files, at least not when
there's just one or two files you intend to match.

# Force these dependencies to be known even without dependency info built:
-pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: plpgsql.h pl_gram.h plerrcodes.h
+pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: plpgsql.h pl_gram.h plerrcodes.h pl_unreserved_kwlist_d.h

Hm, do we really need any more than pl_scanner.o to depend on that header?

+/* FIXME: Have to redefine this symbol for the WIP. */
+#undef PG_KEYWORD
+#define PG_KEYWORD(kwname, value, category) {value, category},
+
+static const ScanKeywordAux unreserved_keywords[] = {
+#include "pl_unreserved_kwlist.h"
};

The category isn't useful for this keyword list, so couldn't you
just make this an array of uint16 values?

diff --git a/src/pl/plpgsql/src/pl_unreserved_kwlist.h b/src/pl/plpgsql/src/pl_unreserved_kwlist.h
+/* name, value, category */
+PG_KEYWORD("absolute", K_ABSOLUTE, UNRESERVED_KEYWORD)

Likewise, I'd just have these be two-argument macros. There's no reason
for the various kwlist.h headers to agree on the number of payload
arguments for PG_KEYWORD.

diff --git a/src/tools/gen_keywords.pl b/src/tools/gen_keywords.pl
+ elsif ($arg =~ /^-o/)
+ {
+ $output_path = length($arg) > 2 ? substr($arg, 2) : shift @ARGV;
+ }

My perl-fu is not great, but it looks like this will accept arguments
like "-ofilename", which is a style I don't like at all. I'd rather
either insist on the filename being separate or write the switch like
"-o=filename". Also, project style when taking both forms is usually
more like
-o filename
--offset=filename

+$kw_input_file =~ /((\w*)kwlist)\.h/;
+my $base_filename = $1;
+$prefix = $2 if !defined $prefix;

Hmm, what happens if the input filename does not end with "kwlist.h"?

+# Parse keyword header for names.
+my @keywords;
+while (<$kif>)
+{
+ if (/^PG_KEYWORD\("(\w+)",\s*\w+,\s*\w+\)/)

This is assuming more than it should about the number of arguments for
PG_KEYWORD, as well as what's in them. I think it'd be sufficient to
match like this:

if (/^PG_KEYWORD\("(\w+)",/)

+Options:
+ -o output path
+ -p optional prefix for generated data structures

This usage message is pretty vague about how you write the options
(cf gripe above).

I looked very briefly at v4-0002, and I'm not very convinced about
the "middle" aspect of that optimization. It seems unmaintainable,
plus you've not exhibited how the preferred keywords would get selected
in the first place (wiring them into the Perl script is surely not
acceptable). If you want to pursue that, please separate it into
an 0002 that just adds the letter-range aspect and then an 0003
that adds the "middle" business on top. Then we can do testing to
see whether either of those ideas are worthwhile.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2018-12-27 19:05:11 Re: row filtering for logical replication
Previous Message Andrew Dunstan 2018-12-27 18:54:05 Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)