Re: estimating # of distinct values

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: estimating # of distinct values
Date: 2011-01-18 00:36:59
Message-ID: 4D34E0AB.9080901@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dne 9.1.2011 13:58, Jim Nasby napsal(a):
>> A resource fork? Not sure what you mean, could you describe it in more
>> detail?
>
> Ooops, resource forks are a filesystem thing; we call them relation forks. >From src/backend/storage/smgr/README:

OK, I think using relation forks seems like a good solution. I've done
some basic research and I think these are the basic steps when adding a
new fork:

1) define a new item in the ForkNum enum (relfilenode.h) - this should
be somethink like DISTINCT_FORK I guess

2) modify the ForkNames (catalog.c) and the relpathbackend so that the
proper filename is assigned to the fork

And then it will be accessed through smgr (smgrcreate etc.). Am I right
or is there something else I need to do?

There are a few open questions though:

1) Forks are 'per relation' but the distinct estimators are 'per
column' (or 'per group of columns') so I'm not sure whether the file
should contain all the estimators for the table, or if there should
be one fork for each estimator. The former is a bit difficult to
manage, the latter somehow breaks the current fork naming convention.

2) Where to keep the info that there is an estimator for a column? I
guess we could put this into pg_attribute (it's one boolean). But
what about the estimators for groups of columns? Because that's why
I'm building this - to get distinct estimates for groups of columns.

I guess we'll need a new system catalog to track this? (The same
will be true for multi-dimensional histograms anyway).

3) I still am not sure how to manage the updates, i.e. how to track the
new values.

One possibility might be to do that synchronously - whenever a new
item is inserted into the table, check if there's an estimator and
update it. Updating the estimators is quite efficient (e.g. the
bitmap manages to do 10.000.000 inserts in 9 seconds on my ancient
workstation) although there might be issues with locking etc.

The other possibility is to update the estimator asynchronously, i.e.
store the new values somewhere (or just ctid of the row), and then
process it periodically. I'm not sure how to intercept the new rows
and where to store them. In another fork? Somewhere else?

regards
Tomas

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2011-01-18 00:52:09 Re: SSI patch version 12
Previous Message Kevin Grittner 2011-01-18 00:33:08 Re: SQL/MED - file_fdw