Re: Specifying attribute slot for storing/reading statistics

From: Esteban Zimanyi <ezimanyi(at)ulb(dot)ac(dot)be>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Mahmoud Sakr <m_attia_sakr(at)yahoo(dot)com>, mohamed sayed <mohamed_bakli(at)aun(dot)edu(dot)eg>
Subject: Re: Specifying attribute slot for storing/reading statistics
Date: 2019-09-06 10:50:33
Message-ID: CAPqRbE5OnjZdOUKMdbqv3x3W7nrmjKRTKQO9oKbbNE8U_pigZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Tom

Many thanks for your quick reply. Indeed both solutions you proposed can be
combined together in order to solve all the problems. However changes in
the code are needed. Let me now elaborate on the solution concerning the
combination of stakind/staop first and I will elaborate on adding a new
kind identifier after.

In order to understand the setting, let me explain a little more about the
different kinds of temporal types. As explained in my previous email these
are types whose values are composed of elements v(at)t where v is a
PostgreSQL/PostGIS type (float or geometry) and t is a TimestampTz. There
are four kinds of temporal types, depending on the their duration
* Instant: Values of the form v(at)t(dot) These are used for example to represent
car accidents as in Point(0 0)@2000-01-01 08:30
* InstantSet: A set of values {v1(at)t1, ...., vn(at)tn} where the values between
the points are unknown. These are used for example to represent checkins in
FourSquare or RFID readings
* Sequence: A sequence of values [v1(at)t1, ...., vn(at)tn] where the values
between two successive instants vi(at)ti vj(at)tj are (linearly) interpolated.
These are used to represent for example GPS tracks.
* SequenceSet: A set of sequences {s1, ... , sn} where there is a temporal
gap between them. These are used to represent for example GPS tracks where
the signal was lost during a time period.

To compute the selectivity of temporal types we assume that time and space
dimensions are independent and thus we can reuse all existing analyze and
selectivity infrastructure in PostgreSQL/PostGIS. For the various durations
this amounts to
* Instant: Use the functions in analyze.c and selfuncs.c independently for
the value and time dimensions
* InstantSet: Use the functions in array_typanalyze.c, array_selfuncs.c
independently for the value and time dimensions
* Sequence and SequenceSet: To simplify, we do not take into account the
gaps, and thus use the functions in rangetypes_typanalyze.c,
rangetypes_selfuncs.c independently for the value and time dimensions

However, this requires that the analyze and selectivity functions in all
the above files satisfy the following
* Set the staop when computing statistics. For example in
rangetypes_typanalyze.c the staop is set for
STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM but not for
STATISTIC_KIND_BOUNDS_HISTOGRAM
* Always call get_attstatsslot with the operator Oid not with InvalidOid.
For example, from the 17 times this function is called in selfuncs.c only
two are passed with an operator. This also requires to pass the operator as
an additional parameter to several functions. For example, the operator
should be passed to the function ineq_histogram_selectivity in selfuncs.c
* Export several top-level functions which are currently static. For
example, var_eq_const, ineq_histogram_selectivity, eqjoinsel_inner and
several others in the file selfuncs.c should be exported.

That would solve all the problems excepted for
STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM, since in this case the staop will
always be Float8LessOperator, independently of whether we are computing
lengths of value ranges or of tstzranges. This could be solved by using a
different stakind for the value and time dimensions.

If you want I can prepare a PR in order to understand the implications of
these changes. Please let me know.

Esteban

On Thu, Sep 5, 2019 at 5:11 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Esteban Zimanyi <ezimanyi(at)ulb(dot)ac(dot)be> writes:
> > We are developing the analyze/selectivity functions for those types. Our
> > approach is to use the standard PostgreSQL/PostGIS functions for the
> value
> > and the time dimensions where the slots starting from 0 will be used for
> > the value dimension, and the slots starting from 2 will be used for the
> > time dimension. For example, for tfloat we use range_typanalyze and
> related
> > functions for
> > * collecting in slots 0 and 1, STATISTIC_KIND_BOUNDS_HISTOGRAM
> > and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the float ranges of the
> value
> > dimension
> > * collecting in slots 2 and 3, STATISTIC_KIND_BOUNDS_HISTOGRAM
> > and STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM for the periods (similar to
> > tstzranges) of the time dimension
>
> IMO this is fundamentally wrong, or at least contrary to the design
> of pg_statistic. It is not supposed to matter which "slot" a given
> statistic type is actually stored in; rather, readers are supposed to
> search for the desired statistic type using the stakindN, staopN and
> (if relevant) stacollN fields.
>
> In this case it seems like it'd be reasonable to rely on the staop
> fields to distinguish between the value and time dimensions, since
> (IIUC) they're of different types.
>
> Another idea is to invent your own slot kind identifiers instead of
> using built-in ones. I'm not sure that there's any point in using
> the built-in kind values, since (a) none of the core selectivity code
> is likely to get called on your data and (b) even if it were, it'd
> likely do the wrong thing. See the comments in pg_statistic.h,
> starting about line 150, about assignment of non-built-in slot kinds.
>
> > Is there any chance that the API for accessing the typanalyze and
> > selectivity functions will be enhanced in a future release ?
>
> Well, maybe you could convince us that the stakind/staop scheme for
> identifying statistics is inadequate so we need another identification
> field (corresponding to a component of the column being described,
> perhaps). I'd be strongly against assigning any semantic meaning
> to the slot numbers, though. That's likely to break code that's
> written according to existing conventions.
>
> regards, tom lane
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Surafel Temesgen 2019-09-06 11:16:22 Re: FETCH FIRST clause PERCENT option
Previous Message Tsunakawa, Takayuki 2019-09-06 08:16:49 RE: [bug fix] Produce a crash dump before main() on Windows