| From: | Hayk Manukyan <manukyantt(at)gmail(dot)com> | 
|---|---|
| To: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> | 
| Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> | 
| Subject: | Re: Feature request for adoptive indexes | 
| Date: | 2021-11-02 12:04:29 | 
| Message-ID: | CAF+kZOEmFL2FQXODbshbUntYZ9tejjEcN_NwVtWTEg16uihmzQ@mail.gmail.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
Tomas Vondra
> Are you suggesting those are not the actual best/worst cases and we
> should use some other indexes? If yes, which ones?
I would say yes.
In my case I am not querying only sequence column.
I have the following cases which I want to optimize.
1. Select * from Some_table where job = <somthing> and nlp = <something>
and year = <something> and  *scan_id = <something> *
2. Select * from Some_table where job = <somthing> and nlp = <something>
and year = <something> and  *Issue_flag = <something> *
3. Select * from Some_table where job = <somthing> and nlp = <something>
and year = <something> and  *sequence = <something> *
Those are queries that my app send to db that is why I said that from *read
perspective* our *best case* is 3 separate indexes for
 *(job, nlp, year, sequence)* AND *(job, nlp, year, Scan_ID)* and *(job,
nlp, year,  issue_flag)*  and any other solution like
 (job, nlp, year, sequence, Scan_ID, issue_flag) OR  (job, nlp, year )
INCLUDE(sequence, Scan_ID, issue_flag)  OR just (job, nlp, year) can be
considered as* worst case *
I will remind that in real world scenario I have ~50m rows and about *~5k
rows for each (job, nlp, year )*
From *write perspective* as far as we want to have only one index our* best
case* can be considered any of
*(job, nlp, year, sequence, Scan_ID, issue_flag)* OR * (job, nlp, year )
INCLUDE(sequence, Scan_ID, issue_flag) *
and the* worst case* will be having 3 separate queries like in read
perspective
(job, nlp, year, sequence) AND (job, nlp, year, Scan_ID) and (job, nlp,
year,  issue_flag)
So I think the comparison that we did is not right because we are comparing
different/wrong things.
For right results we need to compare this two cases when we are doing
random queries with 1,2,3  and random writes.
вт, 2 нояб. 2021 г. в 01:16, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>:
>
>
> On 11/1/21 21:06, Robert Haas wrote:
> > On Tue, Oct 26, 2021 at 11:11 AM Tomas Vondra
> > <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> >> If I get what you propose, you want to have a "top" tree for (job, nlp,
> >> year), which "splits" the data set into subsets of ~5000-7000 rows. And
> >> then for each subset you want a separate "small" trees on each of the
> >> other columns, so in this case three trees.
> >>
> >> Well, the problem with this is pretty obvious - each of the small trees
> >> requires separate copies of the leaf pages. And remember, in a btree the
> >> internal pages are usually less than 1% of the index, so this pretty
> >> much triples the size of the index. And if you insert a row into the
> >> index, it has to insert the item pointer into each of the small trees,
> >> likely requiring a separate I/O for each.
> >>
> >> So I'd bet this is not any different from just having three separate
> >> indexes - it doesn't save space, doesn't save I/O, nothing.
> >
> > I agree. In a lot of cases, it's actually useful to define the index
> > on fewer columns, like (job, nlp, year) or even just (job, nlp) or
> > even just (job) because it makes the index so much smaller and that's
> > pretty important. If you have enough duplicate entries in a (job, nlp,
> > year) index to justify create a (job, nlp, year, sequence) index, the
> > number of distinct (job, nlp, year) tuples has to be small compared to
> > the number of (job, nlp, year, sequence) tuples - and that means that
> > you wouldn't actually save much by combining your (job, nlp, year,
> > sequence) index with a (job, nlp, year, other-stuff) index. As you
> > say, the internal pages aren't the problem.
> >
> > I don't intend to say that there's no possible use for this kind of
> > technology. Peter G. says that people are writing research papers
> > about that and they probably wouldn't be doing that unless they'd
> > found some case where it's a big win. But this example seems extremely
> > unconvincing.
> >
>
> I actually looked at the use case mentioned by Peter G, i.e. merged
> indexes with master-detail clustering (see e.g. [1]), but that seems
> like a rather different thing. The master-detail refers to storing rows
> from multiple tables, interleaved in a way that allows faster joins. So
> it's essentially a denormalization tool.
>
> Perhaps there's something we could learn about efficient storage of the
> small trees, but I haven't found any papers describing that (I haven't
> spent much time on the search, though).
>
> [1] Algorithms for merged indexes, Goetz Graefe
>      https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.7709
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
| From | Date | Subject | |
|---|---|---|---|
| Next Message | osumi.takamichi@fujitsu.com | 2021-11-02 12:07:01 | RE: Failed transaction statistics to measure the logical replication progress | 
| Previous Message | Jeevan Ladhe | 2021-11-02 11:52:29 | Re: refactoring basebackup.c |