Skip site navigation (1)
Skip section navigation (2)
## Better estimates of index correlation

### Responses

### pgsql-hackers by date

Currently, we don't measure any statistics about the ordering correlation of multi-column indexes, which means that btcostestimate has to pick a number out of the air if there's more than one column. We've been around on that at least once already: it used to use first column's correlation divided by number of columns, and now uses first column's correlation times 0.75, and neither of those approaches have any theoretical basis whatsoever. I got started thinking about this again in connection with John Surcombe's recent complaint on pgsql-performance, which seems likely to be related to having a good correlation estimate for one index but not another. It strikes me that it'd be possible to have btvacuumcleanup directly measure order correlation when it's processing a btree index, yielding a reliable answer for any btree index regardless of number of columns. We could do that by comparing the heap block numbers of adjacent index entries' TIDs and counting the number of up-transitions (block number larger than previous index entry) versus down-transitions (block number smaller than previous). Since btvacuumcleanup scans the index in physical order not logical order, we could not meaningfully make comparisons between the last entry on an index page and the first entry on the next, but I think it's quite reasonable to assume that the statistics of those comparisons would be the same as the statistics of the intra-page comparisons. So at the end of the cleanup scan we would have total numbers of up-transitions and down-transitions. It's clear that a perfectly correlated index would have many up-transitions and no down-transitions; a perfectly reverse-correlated index would have the opposite; and an uncorrelated index would have about equal numbers of them. So we could approximate the index correlation with something like (up_transition_count - down_transition_count) / (up_transition_count + down_transition_count) My statistics are a bit rusty so I'm not sure if we need a square or square root or something in there, but it seems like this would work, and would add only a negligible amount of overhead. We could then have VACUUM save the number into pg_statistic or pg_index, and away we go. I'm not planning to do anything about this idea right now, since I'm still hip-deep in collations, but I thought I'd throw it out to get it on the record. Comments? regards, tom lane

- Re: Better estimates of index correlation at 2011-03-14 02:20:01 from Joshua D. Drake
- Re: Better estimates of index correlation at 2011-03-16 00:32:01 from Jeff Davis

Next: From:Joshua D. DrakeDate:2011-03-14 02:20:01Subject: Re: Better estimates of index correlationPrevious: From: Nikhil SontakkeDate: 2011-03-13 23:34:05Subject: Re: Fwd: index corruption in PG 8.3.13