Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From: Oleg Bartunov <obartunov(at)postgrespro(dot)ru>
To: Stefan Keller <sfkeller(at)gmail(dot)com>
Cc: Oleg Ivanov <o(dot)ivanov(at)postgrespro(dot)ru>, Nikolay Samokhvalov <samokhvalov(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
Date: 2021-04-21 09:16:30
Message-ID: CAF4Au4zBNDoxQEOps2L4b=XH=nR221tM73hFKpaFmiGwseopZw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 20, 2021 at 8:56 PM Stefan Keller <sfkeller(at)gmail(dot)com> wrote:

> Dear Olegs, dear Nikolay, dear all
>
> Allow me to revive this thread:
>
> Are there any advances in a learned index for PostgreSQL?
>
> Background: I'm trying to benchmark those experimental indices. For
> this I did some bibliography work (see below). Fun fact: Not only
> Postgres people love high-proof drinks, sorry, high-proof index names:
> see "From wisckey to bourbon" :-).
>

Have you seen the recent paper "Benchmarking Learned Indexes"?
https://vldb.org/pvldb/vol14/p1-marcus.pdf
I haven't looked into the code yet: https://github.com/learnedsystems/sosd

>
> My main discovery is a discussion report - which included Stonebraker
> - entitled "ML-In-Databases: Assessment and Prognosis" [1]. The first
> two takeaways are "#1: The initial comparisons of learned indices with
> optimized traditional indices should be further expanded to include
> concurrency control and multi-user settings. (...) #2: A key benefit
> of learned indices may come from the learned structures requiring
> lower space utilization, rather than a reduction in search time."
>
> Yours, Stefan
>
>
> [1] Kraska, T., Minhas, U. F., Neumann, T., Papaemmanouil, O., Patel,
> J. M., Ré, C., & Stonebraker, M. (2021). ML-In-Databases: Assessment
> and Prognosis. Data Engineering, 3. Web access:
> https://www.cs.brandeis.edu/~olga/publications/ieee2021.pdf
>
>
> Bibliography (without any claim for completeness):
>
> Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N. (2018,
> May). The case for learned index structures. In Proceedings of the
> 2018 International Conference on Management of Data (pp. 489-504). Web
> access https://dl.acm.org/doi/pdf/10.1145/3183713.3196909
>
> "Recursive model index, a learned index structure" (based on Kraska et
> al. 2017? 2018). Source code: https://github.com/BenJoyenConseil/rmi
> (Go language), https://github.com/learnedsystems/RMI (Rust)
>
> Kaur, T. (2018). Learned Index Structures: Practical Implementations
> and Future Directions. Master Thesis. Web access:
>
> https://wwwiti.cs.uni-magdeburg.de/iti_db/publikationen/ps/auto/kaur2018learned.pdf
> (claims to be open-source C++, but no code was found)
>
> Kipf, A., Marcus, R., van Renen, A., Stoian, M., Kemper, A., Kraska,
> T., & Neumann, T. (2020, June). RadixSpline: a single-pass learned
> index. In Proceedings of the Third International Workshop on
> Exploiting Artificial Intelligence Techniques for Data Management (pp.
> 1-5). Web access: https://dl.acm.org/doi/10.1145/3401071.3401659,
> Source code: https://github.com/learnedsystems/RadixSpline (C++).
>
> Dai, Y., Xu, Y., Ganesan, A., Alagappan, R., Kroth, B.,
> Arpaci-Dusseau, A., & Arpaci-Dusseau, R. (2020). From wisckey to
> bourbon: A learned index for log-structured merge trees. In 14th
> USENIX Symposium on Operating Systems Design and Implementation
> (OSDI 20) (pp. 155-171). Web access:
> https://www.usenix.org/system/files/osdi20-dai_0.pdf
>
> Ding, J., Minhas, U. F., Yu, J., Wang, C., Do, J., Li, Y., ... &
> Kraska, T. (2020, June). ALEX: an updatable adaptive learned index. In
> Proceedings of the 2020 ACM SIGMOD International Conference on
> Management of Data (pp. 969-984). Web access:
> https://doi.org/10.1145/3318464.3389711 /
> https://dblp.org/rec/conf/sigmod/DingMYWDLZCGKLK20 . Source code:
> https://github.com/microsoft/ALEX (C++, MIT license)
>
> On Tue, Dec 12, 2017 at 18:02, Oleg Ivanov
> <o(dot)ivanov(at)postgrespro(dot)ru> wrote:
> >
> > On 12/12/2017 04:33 PM, Oleg Bartunov wrote:
> > > On Mon, Dec 11, 2017 at 11:11 PM, Nikolay Samokhvalov
> > > <samokhvalov(at)gmail(dot)com> wrote:
> > >> Very interesting read: https://arxiv.org/abs/1712.01208
> > >>
> > >> HN discussion: https://news.ycombinator.com/item?id=15894896
> > >>
> > >> Some of the comments (from Twitter
> > >> https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean
> > >> and co at GOOG just released a paper showing how machine-learned
> > >> indexes can replace B-Trees, Hash Indexes, and Bloom Filters. Execute
> > >> 3x faster than B-Trees, 10-100x less space. Executes on GPU, which are
> > >> getting faster unlike CPU. Amazing."
> > >>
> > >> Can those ideas be applied to Postgres in its current state? Or is it
> > >> not really down-to-earth?
> > > Oleg did some analysis of the paper.
> > If the keys are comparable, the data distribution is simple enough
> > (which seems to happen rather often), and the data does not change, we
> > can learn the distribution, perform the search faster, and also save
> > memory. That is what the authors wrote, and in my opinion it should
> > work.
> >
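To make that concrete, here is a minimal sketch of the trick in Python
(mine, not the paper's code; all names are illustrative): fit a linear
model to the CDF of a sorted, static key array, then correct the
prediction with a binary search bounded by the model's worst-case
training error.

import bisect

class LinearLearnedIndex:
    def __init__(self, keys):
        self.keys = keys                  # sorted, static key array
        n = len(keys)
        # Least-squares fit of position ~ slope * key + intercept.
        mean_k = sum(keys) / n
        mean_i = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in keys)
        cov = sum((k - mean_k) * (i - mean_i) for i, k in enumerate(keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_i - self.slope * mean_k
        # Worst-case prediction error bounds the local search window.
        self.max_err = max(abs(self._predict(k) - i)
                           for i, k in enumerate(keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        pos = self._predict(key)
        lo = max(0, pos - self.max_err)
        hi = min(len(self.keys), pos + self.max_err + 1)
        j = bisect.bisect_left(self.keys, key, lo, hi)
        return j if j < len(self.keys) and self.keys[j] == key else None

When the distribution is close to the model family, max_err is small
and each lookup touches a window far narrower than a full binary
search over the array.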
> > The limitations of the approach, which the authors mention in their work:
> > + The data does not change. The proposed method can be extended more or
> > less straightforwardly to nearly append-only workloads and to workloads
> > in which the data distribution does not change or changes very slowly
> > (the data itself, or its amount, may still change). Otherwise it is
> > challenging, because the key positions can no longer be treated as
> > independent values, but it still looks possible. The other option
> > proposed by the authors is to store new objects in a buffer and retrain
> > the model in the background, but that does not look like a nice
> > solution (a sketch of this idea follows after the list).
> > + GPUs are fast, but the latency of GPU operations almost certainly
> > cancels out the speed benefits for such small models. The only case in
> > which it is reasonable to use a GPU is training the models (i.e.
> > building such indexes).
> >
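The buffer-and-retrain idea from the first point could look roughly
like this (my sketch, not the paper's design; it reuses the
LinearLearnedIndex class from the previous snippet): inserts go into a
small sorted buffer, lookups consult the buffer first, and once the
buffer grows past a threshold the model is rebuilt, which a real
system would do in the background.

import bisect

class BufferedLearnedIndex:
    def __init__(self, keys, retrain_threshold=1024):
        self.index = LinearLearnedIndex(keys)  # static learned part
        self.buffer = []                       # sorted overflow buffer
        self.retrain_threshold = retrain_threshold

    def insert(self, key):
        bisect.insort(self.buffer, key)
        if len(self.buffer) >= self.retrain_threshold:
            self._retrain()                    # would run in background

    def lookup(self, key):
        i = bisect.bisect_left(self.buffer, key)
        if i < len(self.buffer) and self.buffer[i] == key:
            return ('buffer', i)
        j = self.index.lookup(key)
        return ('index', j) if j is not None else None

    def _retrain(self):
        # Merge the buffer into the key array and rebuild the model.
        merged = sorted(self.index.keys + self.buffer)
        self.index = LinearLearnedIndex(merged)
        self.buffer = []

The awkward part Oleg alludes to is visible even here: every lookup
pays for the extra buffer probe, and retraining stalls the index
unless it really is moved off the critical path.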
> > The following remained unclear to me from the paper:
> > + How effectively the approach can be implemented, taking into account
> > the limitations above.
> > + For hash indexes the authors propose to replace the hash function
> > with the learned CDF, which can decrease the number of collisions and
> > thereby the memory consumption (sketched after this list). That is
> > reasonable, but such a hash function implicitly uses information about
> > the ordering of the keys. I suppose that, using the ordering of the
> > data and with access to the data histogram, one could achieve a similar
> > decrease in memory consumption.
> > + The proposed hierarchical learning looks natural for the problem, but
> > it is not clear that it is the best option. For example, one might use
> > an end-to-end learning procedure with weight-sparsity regularizers as
> > well. I wonder whether the latter approach can obtain results
> > competitive with the former, and if not, then why. Anyway, I have a
> > feeling that the proposed method has room for improvement (a toy
> > two-stage sketch appears after the quoted message).
> >
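A sketch of the CDF-as-hash-function idea from the second point
(illustrative, not the paper's exact construction): build an empirical
CDF from a key sample and use a key's approximate quantile as its
bucket, so buckets fill evenly whenever the sample reflects the key
distribution.

import bisect

def make_cdf_hash(sample_keys, num_buckets):
    sample = sorted(sample_keys)

    def cdf_hash(key):
        # Empirical CDF: fraction of sampled keys <= key.
        rank = bisect.bisect_right(sample, key)
        quantile = rank / len(sample)
        return min(int(quantile * num_buckets), num_buckets - 1)

    return cdf_hash

Note that this only works because the keys are ordered, which is
exactly the implicit assumption pointed out above.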
> > In general, the approach still needs some additional research (mostly
> > around changing data), and there are details to think through about how
> > to implement it properly (paging, for example), but it looks promising
> > enough nevertheless.
> >
>
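As promised above, here is a toy two-stage version of the hierarchical
("recursive model index") structure discussed in the quoted message
(my sketch; the RMI repositories cited in the bibliography are the
real implementations): a root model routes each key to one of several
linear leaf models, and the chosen leaf predicts a global position
that a bounded binary search then corrects.

import bisect

class TwoStageRMI:
    def __init__(self, keys, fanout=16):
        self.keys = keys
        lo, hi = keys[0], keys[-1]
        span = (hi - lo) or 1
        # Root "model": a fixed linear map from key range to leaf slots.
        self.route = lambda k: min(max(int((k - lo) / span * fanout), 0),
                                   fanout - 1)
        # One linear leaf model per slot, trained on the keys routed to it.
        self.leaves = [self._fit([i for i, k in enumerate(keys)
                                  if self.route(k) == slot])
                       for slot in range(fanout)]

    def _fit(self, idx):
        if not idx:
            return (0.0, 0.0, 0)  # empty slot: degenerate model
        ks = [self.keys[i] for i in idx]
        mk, mi = sum(ks) / len(ks), sum(idx) / len(idx)
        var = sum((k - mk) ** 2 for k in ks)
        slope = (sum((k - mk) * (i - mi) for k, i in zip(ks, idx)) / var
                 if var else 0.0)
        icpt = mi - slope * mk
        # Per-leaf worst-case error bounds the final search window.
        err = max(abs(int(slope * k + icpt) - i) for k, i in zip(ks, idx))
        return (slope, icpt, err)

    def lookup(self, key):
        slope, icpt, err = self.leaves[self.route(key)]
        pos = int(slope * key + icpt)
        lo = max(0, pos - err)
        hi = min(len(self.keys), pos + err + 1)
        j = bisect.bisect_left(self.keys, key, lo, hi)
        return j if j < len(self.keys) and self.keys[j] == key else None

Each leaf only has to model its own slice of the distribution, which
is why the hierarchy can beat a single model of the same total size.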

--
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
