Re: patch: function xmltable

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: patch: function xmltable
Date: 2016-09-07 05:49:35
Message-ID: CAFj8pRB1DEu1d59eyGsc-ajCLLF+SOoUbpVRvCF=OrokEperRw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

2016-09-07 5:03 GMT+02:00 Craig Ringer <craig(at)2ndquadrant(dot)com>:

> On 7 September 2016 at 04:13, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
> wrote:
>
> >> Overall, I think this needs to be revised with appropriate comments.
> >> Whitespace/formatting needs fixing since it's all over the place.
> >> Documentation is insufficient (per notes below).
> >
> >
> > I am not able to write documentation in English language :( - This
> function
> > is pretty complex - so I hope so anybody with better language skills can
> > help with this. It respects standard and it respects little bit different
> > Oracle's behave too (different order of DEFAULT and PATH parts).
>
> OK, no problem. It can't be committed without more comprehensive docs
> though, especially for new and nontrivial functionality.
>
> Is there some reference material you can point to so someone else can
> help with docs? And can you describe what differences there are
> between your implementation and the reference?
>
> Alternately, if you document it in Czech, do you know of anyone who
> could assist in translating to English for the main documentation?
>
> >> Re identifier naming, some of this code uses XmlTable naming patterns,
> >> some uses TableExpr prefixes. Is that intended to indicate a bounary
> >> between things re-usable for other structured data ingesting
> >> functions? Do you expect a "JSONEXPR" or similar in future? That's
> >> alluded to by
> >
> >
> > This structure should be reused by JSON_TABLE function. Now, it is little
> > bit strange, because there is only XMLTABLE implementation - and I have
> to
> > choose between a) using two different names now, b) renaming some part in
> > future.
>
> OK. Are you planning on writing this JSON_TABLE or are you leaving
> room for future growth? Either way is fine, just curious.
>
> > And although XMLTABLE and JSON_TABLE functions are pretty similar - share
> > 90% of data (input value, path, columns definitions), these functions has
> > different syntax - so only middle level code should be shared.
>
> That makes sense.
>
> I think it would be best if you separated out the TableExpr
> infrastructure from the XMLTABLE implementation though, so we can
> review the first level infrastrcture separately and make this a
> 2-patch series. Most importantly, doing it that way will help you find
> places where TableExpr code calls directly into XMLTABLE code. If
> TableExpr is supposed to be reusable for json etc, it probably
> shouldn't be calling XmlTable stuff directly.
>
> That also means somewhat smaller simpler patches, which probably isn't bad.
>
> I don't necessarily think this needs to be fully pluggable with
> callbacks etc. It doesn't sound like you expect this to be used by
> extensions or to have a lot of users, right? So it probably just needs
> clearer separation of the infrastructure layer from the xmltable
> layer. I think splitting the patch will make that easier to see and
> make it easier to find problems.
>
> My biggest complaint at the moment is that execEvalTableExpr calls
> initXmlTableContext(...) directly, is aware of XML namespaces
> directly, calls XmlTableSetRowPath() directly, calls
> XmlTableFetchRow() directly, etc. It is in no way generic/reusable for
> some later JSONTABLE feature. That needs to be fixed by:
>
> * Renaming it so it's clearly only for XMLTABLE; or
> * Abstracting the init context, set row path, fetch row etc operations
> so json ones can be plugged in later
>
>
>
> > Currently the common part is not too big - just the Node related part -
> I am
> > not sure about necessity of two patches.
>
> The problem is that the common part is all mixed in with the
> XMLTABLE-specific part, so it's not at all clear it can be common with
> something else.
>
> > I am agree, there is missing some
> > TableExpBuilder, where can be better isolated the XML part.
>
> Yeah, that's sort of what I'm getting at.
>
>
> >> execEvalTableExpr seems to be defined twice, with a difference in
> >> case. This is probably not going to fly:
> >>
> >>
> >> +static Datum
> >> +execEvalTableExpr(TableExprState *tstate,
> >> + ExprContext *econtext,
> >> + bool *isNull, ExprDoneCond *isDone)
> >> +{
> >>
> >> +static Datum
> >> +ExecEvalTableExpr(TableExprState *tstate,
> >> + ExprContext *econtext,
> >> + bool *isNull, ExprDoneCond *isDone)
> >> +{
> >>
> >>
> >> It looks like you've split the function into a "guts" and "wrapper"
> >> part, with the error handling PG_TRY / PG_CATCH block in the wrapper.
> >> That seems reasonable for readability, but the naming isn't.
> >
> >
> > I invite any idea how these functions should be named.
>
> Definitely not how they are ;) . They really can't differ in a single
> character's case.
>
> I'm not sure if PostgreSQL has any formal convention for this. Some
> places use _impl e.g. pg_read_barrier_impl() but that's in the
> context of an interface-vs-implementation separation, which isn't the
> case here.
>
> Some places use _internal, like AlterObjectRename_internal(...), but
> that's where there's an associated public/external part, which isn't
> the case here.
>
> Some places use _guts e.g. pg_logical_slot_get_changes_guts(...),
> largely where there's common use by several callers.
>
> This is a fairly arbitrary function split for readability/length. Is
> it actually useful to split this function up at all?
>
> Anyone else have an opinion?
>
> >> A comment is needed to explain what ExecEvalTableExpr is / does. If
> >> it's XMLTABLE specific (which it looks like based on the code), its
> >> name should reflect that. This pattern is repeated elsewhere; e.g.
> >> TableExprState is really the state for an XMLTABLE expression. But
> >> PostgreSQL actually has TABLE statements, and in future we might want
> >> to support table-expressions, so I don't think this naming is
> >> appropriate. This is made worse by the lack of comments on things like
> >> the definition of TableExprState. Please use something that makes it
> >> clear it's for XMLTABLE and add appropriate comments.
> >
> >
> > I understand, so using TableExpr can be strange (for XMLTABLE function).
> But
> > when we will have JSON_TABLE function, then it will have a sense.
>
> It's pretty hard to review that as shared infrastructure when it's
> still tangled up in xmltable specifics, though.
>
> > "TableExprState" is consistent with "TableExpr".
> >
> > Any idea how it should be changed?
>
> I think if you want it to be shareable infrasructure, you need to
> write it so it can be used as shared infrastructure. Not just name it
> that way but then make it XMLTABLE specific in actual functionality.
>
>
> >> /* when typmod is not valid, refresh it */
> >> if (te->typmod == -1)
> >>
> >>
> >> Is this a cache? How is it valid or not valid and when? The comment
> >> (thanks!) on TableExprGetTupleDesc says:
> >>
> >> /*
> >> * When we skip transform stage (in view), then TableExpr's
> >> * TupleDesc should not be valid. Refresh is necessary.
> >> */
> >>
> >> but I'm not really grasping what you're trying to explain here. What
> >> transform stage? What view? This could well be my ignorance of this
> >> part of the code; if it should be understandable by a reader who is
> >> appropriately familiar with the executor that's fine, but if it's
> >> specific to how XMLTABLE works some more explanation would be good.
> >
> >
> > This is most difficult part of this patch, and I am not sure it it is
> fully
> > correctly implemented. I use TupleDesc cache. The TupleDesc is created in
> > parser/transform stage. When the XMLTABLE is used in some view, then the
> > transformed parser tree is materialized - and when the view is used in
> > query, then this tree is loaded and the parser/transform stage is
> "skipped".
> > I'll check this code against implementation of ROW constructor and I'll
> try
> > to do more comments there.
>
> Thanks. It would be good to highlight when this does and does not
> happen and why. Why is it necessary at all?
>
> What happens if XMLTABLE is used in a query directly, not part of a
> view? if XMLTABLE is used in a view? If XMLTABLE is used in a prepared
> statement / plpgsql statement / etc? What about a CTE term?
>
> Not necessarily list all these cases one by one, just explain what
> happens, when and why. Especially if it's complex, so other readers
> can understand it and don't have to study it in detail to understand
> what is going on. It does not need to be good public-facing
> documentation, and details of wording, grammar etc can be fixed up
> later, it's the ideas that matter.
>
> Is this similar to other logic elsewhere? If so, reference that other
> logic so readers know where to look. That way if they're
> changing/bugfixing/etc one place they know there's another place that
> might need changing.
>
> I don't know this area of the code well enough to give a solid review
> of the actual functionality, and I don't yet understand what it's
> trying to do so it's hard to review it by studying what it actually
> does vs what it claims to do. Maybe Peter E can help, he said he was
> thinking of looking at this patch too. But more information on what
> it's trying to do would be a big help.
>
> >> PostgreSQL already has XPATH support in the form of xmlexists(...)
> >> etc. Why is getXPathToken() etc needed? What re-use is possible here?
> >> There's no explanation in the patch header or comments. Should the new
> >> xpath parser be re-used by the existing xpath stuff? Why can't we use
> >> libxml's facilities? etc. This at least needs explaining in the
> >> submission, and some kind of hint as to why we have two different ways
> >> to do it is needed in the code. If we do need a new XML parser, should
> >> it be bundled in adt/xml.c along with a lot of user-facing
> >> functionality, or a separate file?
> >
> > libxml2 and our XPATH function doesn't support default namespace (
> > http://plasmasturm.org/log/259/ ). This is pretty useful feature - so I
> > implemented.
>
> OK, that makes sense.
>
> For the purpose of getting this patch in, is it a _necessary_ feature?
> Can XMLTABLE be usefully implemented without it, and if so, can it be
> added in a subsequent patch? It would be nice to simplify this by
> using existing libxml2 functionality in the first version rather than
> adding a whole new xpath as well!
>

This is not a xpath implementation - it is preprocessing of xpath
expression. without it, a users have to set explicitly PATH clause with
explicit prefix "./" and explicit suffix "/text()". The usability will be
significantly lower, and what is worst - the examples from internet should
not work. Although is is lot of lines, this code is necessary.

>
> > This is the mayor issue of libxml2 library. Another difference
> > between XPATH function and XMLTABLE function is using two phase searching
> > and implicit prefix "./" and suffix ("/text()") in XMLTABLE. XMLTABLE
> using
> > two XPATH expressions - for row data cutting and next for column data
> > cutting (from row data). The our XPATH functions is pretty simple mapped
> to
> > libxml2 XPATH API. But it is not possible with XMLTABLE function - due
> > design of this function in standard (it is more user friendly and doesn't
> > require exactly correct xpath expressions).
>
> So you can't use existing libxml2 xpath support to implement XMLTABLE,
> even without default namespaces?
>
> > I didn't find any API in libxml2 for a work with parsed xpath
> expressions -
> > I need some info about the first and last token of xpath expression - it
> is
> > base for decision about using prefix or suffix.
> >
> > This functionality (xpath expression parser) cannot be used for our XPATH
> > function now - maybe default namespace in future.
>
> So we'll have two different XPATH implementations for different
> places, with different limitations, different possible bugs, etc?
>

It is just preprocessing. The evaluation of xpath expression is part of
libxml2 and it is shared.

Our XPATH function is not short, but the reason is reading namespaces data
from 2D array. The evaluation of xpath expression is on few lines.

>
> What would be needed to make the new XPATH work for our built-in xpath
> functions too?
>

> >> How does XmlTableGetValue(...) and XmlTableGetRowValue(...) relate to
> >> this? It doesn't look like they're intended to be called directly by
> >> the user, and they're not documented (or commented).
> >
> >
> > Probably I used wrong names. XMLTABLE function is running in two
> different
> > modes - with explicitly defined columns (XmlTableGetValue is used), and
> > without explicitly defined columns - so result is one XML column and only
> > one one step searching is used (there are not column related xpath
> > expressions) ( XmlTableGetRowValue is used). The function
> XmlTableGetValue
> > is used for getting one column value, the function XmlTableGetRowValue is
> > used for getting one value too, but in special case, when there are not
> any
> > other value.
>
> So both are internal implementation of the parser-level XMLTABLE(...)
> construct and are not intended to be called directly by users - right?
>

No - it is called from executor - and it should not be called differently.
I have to do better separation from executor, and these functions will be
private.

>
> Comments please! A short comment on the function saying this would be
> a big help.
>
> Regarding naming, do we already have a convention for functions that
> are internal implementation of something the user "spells"
> differently? Where it's transformed by the parser? I couldn't find
> one. I don't much care about the names so long as there are comments
> explaining what calls the functions and what the user-facing interface
> that matches the function is.
>
> Is it safe for users to call these directly? What happens if they do
> so incorrectly?
>
> Why are they not in pg_proc.h? Do they need to be?
>
> >> +/*
> >> + * There are different requests from XMLTABLE, JSON_TABLE functions
> >> + * on passed data than has CREATE TABLE command. It is reason for
> >> + * introduction special structure instead using ColumnDef.
> >> + */
> >> +typedef struct TableExprRawCol
> >> +{
> >> + NodeTag type;
> >> + char *colname;
> >> + TypeName *typeName;
> >> + bool for_ordinality;
> >> + bool is_not_null;
> >> + Node *path_expr;
> >> + Node *default_expr;
> >> + int location;
> >> +} TableExprRawCol;
> >
> > I am sorry. It is my fault. Now we have very similar node ColumnDef. This
> > node is designed for usage in utility commands - and it is not designed
> for
> > usage inside a query.
>
> Makes sense.
>
> > I had to decide between enhancing ColumnDef node or
> > introduction new special node. Because there are more special attributes
> and
> > it is hard to serialize current ColumnDef, I decided to use new node.
>
> Seems reasonable. The summary is "this is the parse node for a column
> of an XMLTABLE expression".
>
> Suggested comment:
>
> /*
> * This is the parsenode for a column definition in a table-expression
> like XMLTABLE.
> *
> * We can't re-use ColumnDef here; the utility command column
> definition has all the
> * wrong attributes for use in table-expressions and just doesn't make
> sense here.
> */
> typedef struct TableExprColumn
> {
> ...
> };
>
> ?
>
> Why "RawCol" ? What does it become when it's not "raw" anymore? Is
> that a reference to ColumnDef's raw_default and cooked_default for
> untransformed vs transformed parse-trees?
>

TableExprColumn is better

Regards

Pavel

>
>
>
>
> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2016-09-07 05:51:05 Re: Parallel tuplesort (for parallel B-Tree index creation)
Previous Message Peter Geoghegan 2016-09-07 05:40:20 Re: Parallel tuplesort (for parallel B-Tree index creation)