Re: Google Summer of Code: question about GiST API advancement project

From: GUO Rui <ruig2(at)uci(dot)edu>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru
Subject: Re: Google Summer of Code: question about GiST API advancement project
Date: 2019-04-05 09:07:10
Message-ID: CAEuz5PRfkd9S+MNsfzkbLtH-rtN6O4PseJcm7Zit-U4fVMZR7g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I drafted my proposal about the above topic at
https://docs.google.com/document/d/1X7Lw-c0rLYuSjwLNfw6qXpN5Cf1_0u2gXtgEgLkNezA/edit?usp=sharing
. Looking forward to your feedback.

On Wed, Apr 3, 2019 at 10:55 PM GUO Rui <ruig2(at)uci(dot)edu> wrote:

> Dear Andrey Borodin,
>
> I discussed the above topic with the professors at my school, and got the
> following points:
>
> 1. The case that the volume of an MBB is 0 should be very rare, and if the
> data is skewed (e.g. only a few nodes have non-NULL value on a dimension)
> then the data can be pre-proceeded and normalized before it goes to the
> database, thus the storage and query can be much faster;
> 2. The performance of the R-tree family may depend on the specific data
> set and even the order of the data insertions, so one algorithm may be
> better on one dataset and slower on another, thus the benchmark should
> include different datasets;
>
> I totally agree that by adopting the RR*-tree algorithm we can improve the
> performance of PostgreSQL. For my proposal, I'll:
> 1. Document the benchmarks I found available online (e.g.
> https://github.com/ambling/rtree-benchmark), and then state how we'd like
> to generate data ourselves (e.g. data with a Gaussian distribution, or the
> same dataset but different insertion order...) to test with for a wilder
> coverage;
> 2. Create tools to generate a report on current PostgreSQL performance
> with the benchmark;
> 3. Plan to improve the R-tree and GiST part of PostgreSQL. For the
> discussion in the email thread
> https://www.postgresql.org/message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail(dot)gmail(dot)com#CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg(at)mail(dot)gmail(dot)com
> <https://www.postgresql.org/message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail(dot)gmail(dot)com#CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg(at)mail(dot)gmail(dot)com>
> , I prefer to do a *scale-based* trick rather than using bits in a float
> or creating a new struct;
> 4. Generate a performance report on PostgreSQL with the above R-tree patch;
> The following would be marked as *optional*:
> 5. Optimize GiST with New APIs (e.g. non-penalty-based choose subtree
> function, also discussed in the above email thread);
> 6. For skewed data, try to warn the user, and then suggest methods to cook
> the data (e.g. the normalization algorithms in ML); pre-proceeding the data
> should not be the duty of the database;
> 7. Other advanced features of RR*-tree and GiST bulk loading;
>
> Any comments or feedback on the above ideas? I'll work on a draft proposal
> ASAP.
>
> Many thanks,
> Rui Guo
>
>
>
>
> On Sun, Mar 31, 2019 at 10:53 AM Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
> wrote:
>
>> Hi!
>>
>> > 31 марта 2019 г., в 14:58, GUO Rui <ruig2(at)uci(dot)edu> написал(а):
>> >
>> > I'm Rui Guo, a PhD student focusing on database at the University of
>> California, Irvine. I'm interested in the "GiST API advancement" project
>> for the Google Summer of Code 2019 which is listed at
>> https://wiki.postgresql.org/wiki/GSoC_2019#GiST_API_advancement_.282019.29
>> .
>> >
>> > I'm still reading about RR*-tree, GiST and the PostgreSQL source code
>> to have a better idea on my proposal. Meanwhile, I have a very basic and
>> simple question:
>> >
>> > Since the chooseSubtree() algorithm in both R*-tree and RR*-tree are
>> heuristic and somehow greedy (e.g. pick the MBB that needs to enlarge the
>> least), is it possible to apply machine learning algorithm to improve it?
>> The only related reference I got is to use deep learning in database join
>> operation (https://arxiv.org/abs/1808.03196). Is it not suitable to use
>> machine learning here or someone already did?
>>
>> If you are interested in ML and DBs you should definitely look into [0].
>> You do not have to base your proposal on mentor ideas, you can use your
>> own. Implementing learned indexes - seems reasonable.
>>
>> RR*-tree algorithms are heuristic in some specific parts, but in general
>> they are designed to optimize very clear metrics. Generally, ML algorithms
>> tend to compose much bigger pile of heuristics and solve less
>> mathematically clear tasks than splitting subtrees or choosing subtree for
>> insertion.
>> R*-tree algorithms are heuristic only to be faster.
>>
>> Best regards, Andrey Borodin.
>>
>> [0] https://arxiv.org/pdf/1712.01208.pdf
>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Floris Van Nee 2019-04-05 09:13:29 Re: speeding up planning with partitions
Previous Message Thomas Munro 2019-04-05 09:05:02 Re: Using condition variables to wait for checkpoints