Re: proposal: window function - change_number

From: Mart Kelder <mart(at)kelder31(dot)nl>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal: window function - change_number
Date: 2014-09-21 15:00:27
Message-ID: 42794981.tvsIpqtiso@mobimaker
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Pavel (and others),

Op zondag 21 september 2014 15:35:46 schreef u:
> 2014-09-21 14:30 GMT+02:00 Mart Kelder <mart(at)kelder31(dot)nl>:
> > Hi Pavel (and others),
> >
> > Pavel Stehule wrote:
> > > Hi
> > > I tried to solve following task:
> > >
> > > I have a table
> > >
> > > start, reason, km
> > > =============
> > >
> > > 2014-01-01 08:00:00, private, 10
> > > 2014-01-01 09:00:00, commerc, 20
> > > 2014-01-01 10:00:00, commerc, 20
> > > 2014-01-01 11:00:00, private, 8
> > >
> > > and I would reduce these rows to
> > >
> > > 2014-01-01 08:00:00, private, 10
> > > 2014-01-01 09:00:00, commerc, 20 + 20 = 40
> > > 2014-01-01 11:00:00, private, 8
> > >
> > > It is relative hard to it now with SQL only. But we can simplify this
> > > task
> > >
> > > with window function that returns number of change in some column. Then
> > > this task can be solved by
> > >
> > > select min(start), min(reason), sum(km)
> > > from (select start, reason, km, change_number(reason) over (order by
> > > start))
> > >
> > > group by change_number;
> >
> > What about
> >
> > select srk.reason, min(srk.start), sum(srk.km)
> > from start_reason_km srk
> > group by srk.reason, (select max(start) from start_reason_km other WHERE
> > other.start < srk.start and other.reason != srk.reason);
>
> This query is Cartesian product, so for some large data it is significantly
> slower then window function (required only sorts without joins)

I think we have the same queryplan in mind (with only one scan). As far as I
know, SQL is a language where you define the result you want to get, and let
the server find a way how to find the data. I think windowed function also say
how the server needs to get the information.

The real challenge is of course of finding heuristics to remove the additional
join. In this particular case, I can tell how to remove the inner join from
the subquery:
* the where-clause of the self-join contains other.start < srk.start. From
that we can conclude that if the table is (or can be) sorted on "start", we
have seen the data before the subquery is executed
* because we only need an aggregate, we need to store the intermediate "max"
for each "reason". And then add the result to the stream.

Recently, I had a simular problem with a table containing a timestamp, a state
and a object where the state belongs to. A object remains in a state until
there is a more recent tuple in the table. I needed basically to query all the
previous state for each tuple, but preverably without the additional join.

> My motivation was a) to implement described task without Cartesian product.

Good reason (if you consider the queryplan and not the query).

> b) introduce some tool for this kind of problems. I seen more times a
> request .. reduce a time series, and a window function "change_number" (or
> maybe "consistent_series_number") can be good candidate.

I also need to note that there is a lot of difference in complexity between the
possible solutions of this problem. Where a new window function can probably
be very easy implemented, the optimizer changes descripted above are complex
and not easy to implement.

I also want to note that I am not really against new window functions, I only
want to point out that a more generic solution also might be possible.

Regards,

Mart

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2014-09-21 15:20:28 Re: proposal: window function - change_number
Previous Message Stephen Frost 2014-09-21 14:06:48 Re: pgsql: Row-Level Security Policies (RLS)