Minor regexp hacking: code coverage, moveins() and friends

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Minor regexp hacking: code coverage, moveins() and friends
Date: 2021-08-15 21:47:43
Message-ID: 810272.1629064063@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

While trying to improve the code-coverage report for backend/regex/,
I found that there are portions of copyins() and copyouts() that are
just plain unreachable, because those functions are over-engineered.
In point of fact, the only uses of copyins() and copyouts() are in
places where the target state is brand new and cannot have any
pre-existing in-arcs (resp. out-arcs). This means that all the
trouble we're going to to de-duplicate the copied arcs is entirely
wasted; we could just copy the source arcs without extra checking.

A fairly significant fraction, though by no means all, of the calls
of moveins() and moveouts() are likewise working with new target
states, and so don't really need to do any de-duplication.

Hence I propose 0001 attached, which creates simplified functions
copyinstonew() and so on, for use when the target state is known not
to have any existing arcs. I'd thought that this might show a useful
improvement in regexp compilation speed, but it's pretty hard to
measure any noticeable change on typical regexps such as Jacobson's
web corpus. (I do see maybe a 1% improvement on that, but that's
below the noise threshold so I don't take it too seriously.) It is
possible to demonstrate noticeable improvement on handpicked regexes,
for example on HEAD:

regression=# SELECT regexp_matches('foo', 'abcdefghijklmnopq((\y|.?)+)+','');
regexp_matches
----------------
(0 rows)

Time: 6.297 ms

versus with patch:

regression=# SELECT regexp_matches('foo', 'abcdefghijklmnopq((\y|.?)+)+','');
regexp_matches
----------------
(0 rows)

Time: 5.506 ms

So this isn't entirely a waste of time, but it is marginal. Improving
the code-coverage numbers is probably a better argument. (0001 also
adds some test cases that exercise nearly everything that's reachable
without OOM conditions or cancels in regc_nfa.c.)

0002 below is some additional test cases to improve code coverage in
regc_locale.c and regc_pg_locale.c. (regc_pg_locale.c is still not
great, but that's mostly because much of the code is only reachable
for particular choices of database encoding, so any one coverage
run hits just some of it.)

Barring objections, I plan to push this in a day or two; I don't
think it needs much review.

regards, tom lane

Attachment Content-Type Size
0001-cleanup-moveins-and-friends.patch text/x-diff 12.1 KB
0002-extra-test-coverage.patch text/x-diff 8.8 KB

Browse pgsql-hackers by date

  From Date Subject
Next Message David Fetter 2021-08-16 00:51:56 Re: Default to TIMESTAMP WITH TIME ZONE?
Previous Message Masahiko Sawada 2021-08-15 21:04:39 Re: Added schema level support for publication.