Re: Google Summer of Code: question about GiST API advancement project

From: GUO Rui <ruig2(at)uci(dot)edu>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru
Subject: Re: Google Summer of Code: question about GiST API advancement project
Date: 2019-04-04 05:55:03
Message-ID: CAEuz5PRU7APLk0SixBLU-3SacS9GSqr3WgNmcjVs0Mv1ATNuSg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Andrey Borodin,

I discussed the above topic with the professors at my school, and got the
following points:

1. The case that the volume of an MBB is 0 should be very rare, and if the
data is skewed (e.g. only a few nodes have non-NULL value on a dimension)
then the data can be pre-proceeded and normalized before it goes to the
database, thus the storage and query can be much faster;
2. The performance of the R-tree family may depend on the specific data set
and even the order of the data insertions, so one algorithm may be better
on one dataset and slower on another, thus the benchmark should include
different datasets;

I totally agree that by adopting the RR*-tree algorithm we can improve the
performance of PostgreSQL. For my proposal, I'll:
1. Document the benchmarks I found available online (e.g.
https://github.com/ambling/rtree-benchmark), and then state how we'd like
to generate data ourselves (e.g. data with a Gaussian distribution, or the
same dataset but different insertion order...) to test with for a wilder
coverage;
2. Create tools to generate a report on current PostgreSQL performance with
the benchmark;
3. Plan to improve the R-tree and GiST part of PostgreSQL. For the
discussion in the email thread
https://www.postgresql.org/message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail(dot)gmail(dot)com#CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg(at)mail(dot)gmail(dot)com
<https://www.postgresql.org/message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail(dot)gmail(dot)com#CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg(at)mail(dot)gmail(dot)com>
, I prefer to do a *scale-based* trick rather than using bits in a float or
creating a new struct;
4. Generate a performance report on PostgreSQL with the above R-tree patch;
The following would be marked as *optional*:
5. Optimize GiST with New APIs (e.g. non-penalty-based choose subtree
function, also discussed in the above email thread);
6. For skewed data, try to warn the user, and then suggest methods to cook
the data (e.g. the normalization algorithms in ML); pre-proceeding the data
should not be the duty of the database;
7. Other advanced features of RR*-tree and GiST bulk loading;

Any comments or feedback on the above ideas? I'll work on a draft proposal
ASAP.

Many thanks,
Rui Guo

On Sun, Mar 31, 2019 at 10:53 AM Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
wrote:

> Hi!
>
> > 31 марта 2019 г., в 14:58, GUO Rui <ruig2(at)uci(dot)edu> написал(а):
> >
> > I'm Rui Guo, a PhD student focusing on database at the University of
> California, Irvine. I'm interested in the "GiST API advancement" project
> for the Google Summer of Code 2019 which is listed at
> https://wiki.postgresql.org/wiki/GSoC_2019#GiST_API_advancement_.282019.29
> .
> >
> > I'm still reading about RR*-tree, GiST and the PostgreSQL source code to
> have a better idea on my proposal. Meanwhile, I have a very basic and
> simple question:
> >
> > Since the chooseSubtree() algorithm in both R*-tree and RR*-tree are
> heuristic and somehow greedy (e.g. pick the MBB that needs to enlarge the
> least), is it possible to apply machine learning algorithm to improve it?
> The only related reference I got is to use deep learning in database join
> operation (https://arxiv.org/abs/1808.03196). Is it not suitable to use
> machine learning here or someone already did?
>
> If you are interested in ML and DBs you should definitely look into [0].
> You do not have to base your proposal on mentor ideas, you can use your
> own. Implementing learned indexes - seems reasonable.
>
> RR*-tree algorithms are heuristic in some specific parts, but in general
> they are designed to optimize very clear metrics. Generally, ML algorithms
> tend to compose much bigger pile of heuristics and solve less
> mathematically clear tasks than splitting subtrees or choosing subtree for
> insertion.
> R*-tree algorithms are heuristic only to be faster.
>
> Best regards, Andrey Borodin.
>
> [0] https://arxiv.org/pdf/1712.01208.pdf

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2019-04-04 05:56:58 Re: Extending USE_MODULE_DB to more test suite types
Previous Message Justin Pryzby 2019-04-04 05:51:38 Re: Pluggable Storage - Andres's take