Re: Ltree syntax improvement

From: Dmitry Belyavsky <beldmit(at)gmail(dot)com>
To: Nikolay Shaplov <dhyan(at)nataraj(dot)su>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Ltree syntax improvement
Date: 2019-02-24 11:31:55
Message-ID: CADqLbzK9ke9haVCBN8wAreMGpdvO3qN3jF1SJ7d=g0VC-TxrSw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Nikolay,

On Wed, Feb 20, 2019 at 12:28 PM Nikolay Shaplov <dhyan(at)nataraj(dot)su> wrote:

> В письме от вторник, 29 января 2019 г. 20:43:07 MSK пользователь Dmitry
> Belyavsky написал:
> > Dear all,
> >
> > Please find attached the patch extending the sets of symbols allowed in
> > ltree labels. The patch introduces 2 variants of escaping symbols, via
> > backslashing separate symbols and via quoting the labels in whole for
> > ltree, lquery and ltxtquery datatypes.
>
> Sorry I still did not do full revision, I've stumbled up on other things
> in
> the code, which, as I think should be fixed before final checking through.
> (I
> guess that code do what is expected, since it passes tests and so on, it
> needs
> recheck, but I'd like to do this full recheck only once)
>
> So my doubts about this code is that a lot parts of code is just a
> copy-paste
> of the same code that was written right above. And it can be rewritten in
> a
> way that the same lines (or parts of lines) is repeated only once...
> This should make code more readable, and make code more similar to the
> code of
> postgres core, I've seen.
>
> Here I will speak about lquery_in function from ltree_io.c.
>
> 1. May be it is better here to switch from else if (state ==
> LQPRS_[something]) to switch (stat) { case LQPRS_[something]: }
> because because here the main case is to determine what is the current
> state
> of your state machine, do something with it, and then go to the next step.
>
> Now after you've done what should be done in this step, you should make
> sure
> that no other code is executed writing more ifs... If you use switch/case
> you
> will be able to say break wherever you like, it will save you some ifs.
>

I thought about it writing the code. But it significantly increased the
size of the patch
so I decided to follow the original coding style here.

>
> Theoretical example
>
> (cl is for character length fc is for fist char)
>
> if (cl==1 && fc =='1') {do_case1();}
> else if (cl==1 && fc =='2') {do_case2();}
> else if (cl==1 && fc =='3') {do_case3();}
> else {do_otherwise();}
>
> If it is wrapped in switch/case it will be like
>
> if (cl==1)
> {
> if (fc =='1') {do_case1(); break;}
> if (fc =='2') {do_case2(); break;}
> if (fc =='3') {do_case3(); break;}
> }
> /*if nothing is found */
> do_otherwise();
>
> This for example will allow you to get rid of multiply charlen == 1 checks
> in
>
> else if ((lptr->flag & LVAR_QUOTEDPART) == 0)
>
>
> {
>
>
> if (charlen == 1 && t_iseq(ptr, '@'))
>
>
> {
>
>
> if (lptr->start == ptr)
>
>
> UNCHAR;
>
>
> lptr->flag |= LVAR_INCASE;
>
>
> curqlevel->flag |= LVAR_INCASE;
>
>
> }
>
>
> else if (charlen == 1 && t_iseq(ptr, '*'))
>
>
> {
>
>
> if (lptr->start == ptr)
>
>
> UNCHAR;
>
>
> lptr->flag |= LVAR_ANYEND;
>
>
> curqlevel->flag |= LVAR_ANYEND;
>
>
> }
>
>
> else if (charlen == 1 && t_iseq(ptr, '%'))
>
>
> {
>
>
> if (lptr->start == ptr)
>
>
> UNCHAR;
>
>
> lptr->flag |= LVAR_SUBLEXEME;
>
>
> curqlevel->flag |= LVAR_SUBLEXEME;
>
>
> }
>
>
> else if (charlen == 1 && t_iseq(ptr, '|'))
>
>
> {
>
>
> real_nodeitem_len(lptr, ptr, escaped_count,
> tail_space_bytes, tail_space_symbols);
>
>
> if (state == LQPRS_WAITDELIMSTRICT)
>
>
> adjust_quoted_nodeitem(lptr);
>
>
>
> check_level_length(lptr, pos);
>
>
> state = LQPRS_WAITVAR;
>
>
> }
>
>
> else if (charlen == 1 && t_iseq(ptr, '.'))
>
>
> {
>
>
> real_nodeitem_len(lptr, ptr, escaped_count,
> tail_space_bytes, tail_space_symbols);
>
>
> if (state == LQPRS_WAITDELIMSTRICT)
>
>
> adjust_quoted_nodeitem(lptr);
>
>
> check_level_length(lptr, pos);
>
>
>
> state = LQPRS_WAITLEVEL;
>
>
> curqlevel = NEXTLEV(curqlevel);
>
>
> }
>
>
> else if (charlen == 1 && t_iseq(ptr, '\\'))
>
>
> {
>
>
> if (state == LQPRS_WAITDELIMSTRICT)
>
>
> UNCHAR;
>
>
> state = LQPRS_WAITESCAPED;
>
>
> }
>
>
> else
>
>
> {
>
>
> if (charlen == 1 && strchr("!{}", *ptr))
>
>
> UNCHAR;
>
>
> if (state == LQPRS_WAITDELIMSTRICT)
>
>
> {
>
>
> if (t_isspace(ptr))
>
>
> {
>
>
> ptr += charlen;
>
>
> pos++;
>
>
> tail_space_bytes += charlen;
>
>
> tail_space_symbols = 1;
>
>
> continue;
>
>
> }
>
>
>
> UNCHAR;
>
>
> }
>
>
> if (lptr->flag & ~LVAR_QUOTEDPART)
>
>
> UNCHAR;
>
>
> }
>
>
> }
>

Yes, this piece of code could be converted to the form you suggest, but
only partially, because the final else includes a strchr() call that,
being rewritten to the 'case' syntax will look ugly enough.

>
> There are a lot of them, I would suggest to reduce there number to one
> check
> in the right place.
>
> Similar thing is about "GETVAR(curqlevel) = lptr = (nodeitem *)
> palloc0(sizeof(nodeitem) * numOR);" in this part of code.
>
> if (charlen == 1)
>
>
> {
>
>
> if (t_iseq(ptr, '!'))
>
>
> {
>
>
> GETVAR(curqlevel) = lptr = (nodeitem *)
> palloc0(sizeof(nodeitem) * numOR);
>
> lptr->start = ptr + 1;
>
>
> state = LQPRS_WAITDELIM;
>
>
> curqlevel->numvar = 1;
>
>
> curqlevel->flag |= LQL_NOT;
>
>
> hasnot = true;
>
>
> }
>
>
> else if (t_iseq(ptr, '*'))
>
>
> state = LQPRS_WAITOPEN;
>
>
> else if (t_iseq(ptr, '\\'))
>
>
> {
>
>
> GETVAR(curqlevel) = lptr = (nodeitem *)
> palloc0(sizeof(nodeitem) * numOR);
>
> lptr->start = ptr;
>
>
> curqlevel->numvar = 1;
>
>
> state = LQPRS_WAITESCAPED;
>
>
> }
>
>
> else if (strchr(".|@%{}", *ptr))
>
>
> {
>
>
> UNCHAR;
>
>
> }
>
>
> else
>
>
> {
>
>
> GETVAR(curqlevel) = lptr = (nodeitem *)
> palloc0(sizeof(nodeitem) * numOR);
>
> lptr->start = ptr;
>
>
> state = LQPRS_WAITDELIM;
>
>
> curqlevel->numvar = 1;
>
>
> if (t_iseq(ptr, '"'))
>
>
> {
>
>
> lptr->flag |= LVAR_QUOTEDPART;
>
>
> }
>
>
> }
>
>
> }
> else
>
>
> {
>
>
> GETVAR(curqlevel) = lptr = (nodeitem *)
> palloc0(sizeof(nodeitem) * numOR);
>
> lptr->start = ptr;
>
>
> state = LQPRS_WAITDELIM;
>
>
> curqlevel->numvar = 1;
>
>
> }
>
> It is repeated almost is every branch of this if-subtree... Can it be
> rewritten the way it is repeated only once?
>

Well, I see the only one so long sequence of 'if'.

>
> And so on.
>
> So my request is to reduce multiply repetition of the same code...
>
> PS. One other thing that I've been thinking over but did not came to final
> conclusion...
>
> It is often quite often case
>
> if (t_iseq(ptr, 'a'){same_code(); code_for_a();}
> else if (t_iseq(ptr, 'b'){same_code(); code_for_b();}
> else if (t_iseq(ptr, 'c'){same_code(); code_for_c);}
> else something_else();
>
> I've been thinking if it is possible to move this to
>
> const unsigned char c=TOUCHAR(ptr);
> switch(c)
> {
> case 'a':
> case 'b':
> case 'c':
> same_code();
> if (c=='a') {code_for_a();}
> if (c=='b') {code_for_b();}
> if (c=='c') {code_for_c();}
> break;
> default:
> something_else();
> }
>
> I did not still get, how valid is replacement of t_iseq() with TOUCHAR() +
> "==". It is quite equivalent now, but I do not know how the core code is
> going
> to evolve, so can't get final understanding. I even tried to discuss it
> with
> Teodor (you've seen it) but still did come to any conclusion.
>

I agree with Teodor here. We do not know about various charsets so
the current syntax seems to be more safe.

>
> So for now status of this idea is "my considerations about ways to
> deduplicate
> the code".
>
> So this is all essential things I can say about this patch as it is now.
> May
> be somebody can add something more...
>
>

--
SY, Dmitry Belyavsky

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2019-02-24 11:44:46 Re: jsonpath
Previous Message David Rowley 2019-02-24 09:21:10 Re: pg_dump multi VALUES INSERT