## Improving N-Distinct estimation by ANALYZE


Improving N-Distinct estimation
===============================
v1.1

OBJECTIVES

Answer these points...

- Are there any performance issues that can be directly attributed to mis-estimation of N-Distinct ("D") by the ANALYZE command?
- If so, can we do better than we currently achieve? How?
- Would altering the estimate of D cause problems in other places?

Comments are sought on the problem and the possible solutions. Please get involved if you can help with detailed analyses on this topic.

SUMMARY

The estimation of D is difficult and imprecise. The current method works well in many cases, yet breaks down *badly* in one particular very common use case of database design: a large dependent table with a multi-column Primary Key. In some cases the estimate of D *decreases* as the size of the table increases, though the estimate of D is an underestimate in almost all cases, whatever the table design.

PostgreSQL cvstip currently seriously underestimates D for very large dependent tables with rows that are clustered around one of the columns of a multi-column Primary Key. The mis-estimation identified can lead to poor system performance should it lead to the use of HashAggregate or to general mis-selection of optimal join plans.

An example of this is the orders and lineitem tables from DBT-3/TPC-H, which have a 1:M relationship. There are M lineitems associated with each order and all are inserted in one transaction into lineitem. If M is relatively large, then problems may ensue. A problem SQL statement may be something like:

    SELECT l_orderkey, sum(l_extendedprice)
    FROM lineitem
    GROUP BY l_orderkey;

This issue could have an impact on any large table in a 1:M relationship where the actual number of distinct values, D, is much larger than the sample size, n (D >> n). This can also affect associative tables where more than one 1:M relationship exists to separate tables, such as Fact tables in a star schema.
The issue is alleviated by setting the column statistics target higher, though this merely increases the size range of the table over which problems may occur. There are a number of ways we can improve on the current estimates, using techniques suggested in later statistical research. Some proposals are made below, and comments are sought on the problem and the possible solutions.

It is possible that we need more than one estimate of D for various purposes: a low estimate of D for use in join planning, and a higher estimate to reduce the risk of hash table operations. This approach might be taken initially to allow us to implement improved estimators without throwing out many previously good plans.

WHAT WE CURRENTLY DO WITH ANALYZE

Notation:
    D = estimate of the number of distinct values (aka n_distinct)
    N = number of rows in table
    n = number of rows in sample

Sampling method:
* Fixed sample size, no matter how big the table, following Chaudhuri et al's 1998 paper on sample size sufficiency for statistics histograms.
* Sample blocks = sample rows = 300 * col stats target

Results:
* Count rows/value for all values observed in the sample:
    f1 = number of values seen exactly once in the sample
    d  = number of distinct values in the sample
* If f1 == n => assume unique: D = N, and scale with N;
  else if f1 == 0 => assume D = d;
  else => apply the Haas-Stokes [1998] estimator.
* If D > 10% of N => scale with N.

[There are a variety of techniques selected from Haas-Stokes [1998], Chaudhuri et al [1998], Vitter and Knuth. Sometimes these authors have discussed the same subject and come up with different answers, so you need to be careful to say which reference you mean when discussing these.]

ISSUES

1. Estimation of D; mentioned above and covered in more detail below (see ESTIMATES OF D FOR DEPENDENT TABLES).

2. The sample size calculation correctly follows Chaudhuri et al [1998] when the number of rows in the table is 1 million. However, smaller tables are oversampled and larger tables undersampled: the sample size should be multiplied by 2.3 (i.e. ln(10)) for every x10 increase in table size. e.g. a 100 million row table requires a sample 4.6 times larger to give the same accuracy of histogram selection.

OBSERVATIONS

1. *All* methods of statistical analysis are improved by larger sample fractions. The D estimator method currently in use shows an optimum of accuracy and sample fraction at around 5% of a table, as shown in the author's original paper [Haas Stokes (1998)]. The current implementation's error rates climb higher as table size increases.

2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a row sampling technique rather than a block sampling technique. This would translate directly into a performance drop at large sample ratios, but since we currently use a fixed sample size this problem is not yet visible for larger tables. With a 2 GB table, we would typically sample 1% of the blocks, yet only around 0.025 - 0.05% of the rows.

3. Large values of statistics target (up to 1000) could cause a number of problems with statistics catalog table growth (mentioned on -perform). Setting these values appropriately can take significant effort. Automatic scaling of such parameters is desirable.

4. ANALYZE doesn't use more memory if maintenance_work_mem is set high, nor does it use less if it is set low. Actual memory usage isn't measured at all. With very long rows this is at most 24 MB with default stats target settings and BLCKSZ, or 2.4 GB with the highest stats target (1000). This is probably of lower importance, since stats targets are usually only set higher on larger databases, which typically have larger memory configurations anyway, and very long rows are uncommon because of TOAST. Typical memory usage by ANALYZE would be < 1 MB with default settings, i.e. maybe as low as a 0.01% sample for a very large table.
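As a cross-check, the decision sequence described under WHAT WE CURRENTLY DO WITH ANALYZE can be sketched as follows. This is an illustration, not the actual C code from analyze.c: the function name is mine, the middle branch is the Haas-Stokes "Duj1" smoothed jackknife as I read it from the source, and a negative return value follows the n_distinct convention of storing -(D/N) when D is believed to scale with N.

```python
def estimate_ndistinct(N, n, d, f1):
    """Sketch of ANALYZE's n_distinct decision sequence (illustrative).

    N  = rows in table        n  = rows in sample
    d  = distinct values seen f1 = values seen exactly once
    """
    if f1 == n:
        return -1.0                 # every sampled value unique: D = N, scales with N
    if f1 == 0:
        return float(d)             # every value repeated: assume we saw them all
    # Haas-Stokes "Duj1" smoothed jackknife estimator
    D = n * d / (n - f1 + f1 * n / N)
    D = min(D, N)
    if D > 0.1 * N:
        return -(D / N)             # negative value: a fraction of N, scales with N
    return D
```

For example, a sample of 300 rows from a 1,000,000-row table with 200 distinct values, 100 of them seen once, yields an estimate just under 300 distinct values, well below the 10%-of-N scaling threshold.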
ESTIMATES OF D FOR DEPENDENT TABLES

Let's look at this example from DBT-3/TPC-H. Two tables, orders and lineitem. Each row in orders has M rows in lineitem, so they have a 1:M relationship.

    create table orders (
        o_orderkey      integer,
        o_custkey       integer,
        o_orderstatus   char(1),
        o_totalprice    numeric(12,2),
        o_orderdate     date,
        o_orderpriority char(15),
        o_clerk         char(15),
        o_shippriority  integer,
        o_comment       varchar(79),
        primary key ( o_orderkey )
    );

    create table lineitem (
        l_orderkey      integer,
        l_partkey       integer,
        l_suppkey       integer,
        l_linenumber    integer,
        l_quantity      numeric(12,2),
        l_extendedprice numeric(12,2),
        l_discount      numeric(12,2),
        l_tax           numeric(12,2),
        l_returnflag    char(1),
        l_linestatus    char(1),
        l_shipdate      date,
        l_commitdate    date,
        l_receiptdate   date,
        l_shipinstruct  char(25),
        l_shipmode      char(10),
        l_comment       varchar(44),
        primary key ( l_orderkey, l_linenumber )
    );

where lineitem.l_orderkey references o_orderkey.

The rows in lineitem are all inserted in the same transaction that the order is inserted, so they are very likely to be in the same data block, or adjacent data blocks (*).

ANALYZE randomly samples rows, so the average gap between randomly selected rows increases as the table size increases, because of the fixed sample size. Since the clustered rows are typically close together, the apparent number of multiple instances of the same data value decreases as the sample fraction decreases. Since the sample size is currently fixed, this means the D estimate decreases as the table size increases. (This is demonstrated in a test case below.)

(*) The only alleviation from this effect occurs when we have the FSM full of randomly placed blocks. In that case, we will sometimes get consecutively INSERTed orderline rows in blocks that are wide apart within the table. However, in our example the fulfillment of orders is not random, so blocks with free space tend to appear first at the beginning of the table and cycle through the table over time. So, even in the unlikely event that we have rows with the same l_orderkey value widely separated in the table, we are still unlikely to actually observe that when the sample fraction is low. Data Warehouse applications seldom delete rows, hence the FSM is unused, so this slight cause for hope is unlikely to exist.

There is a further effect of concern here. We currently apply a "lift" to the D estimate when we think the number of distinct values is high enough to imply that it should be scaled according to N. In the above example, when the sample ratio is too small, D will hit the point where it is too low to be scaled, and we suddenly "bomb out" to a much lower value.

Test results on a table constructed as follows:

    Number of orderkey vals   Rows per orderkey
    200,000                   2
    200,000                   4
    200,000                   6
    200,000                   8
    200,000                   10

    Total orderkey values = 1,000,000 (D-exact)
    Total rows            = 6,000,000

    orderkey stats target                 D estimates
    10  (10% blocks,  3k rows read)       106k, 113k, 114k
    20  (20% blocks,  6k rows read)       201k, 185k, 211k
    30  (30% blocks,  9k rows read)       301k, 303k, 324k
    40  (40% blocks, 12k rows read)       431k, 378k
    60  (60% blocks, 18k rows read)       530k
    80  (80% blocks, 24k rows read)       646k (-0.107), 732k (-0.122)
    100 (all blocks, 30k rows read)       823k (-0.137), 782k (-0.13), 785k (-0.13)
    200 (all blocks, 60k rows read)       794k (-0.132), 810k (-0.135)

The numbers in brackets denote that we have inferred that scaling by N should occur; the bracketed value is the negative n_distinct actually stored, i.e. -(D/N).

The test uses increasing stats target to increase sample size. I assume that increasing the table size while maintaining the data distribution and a constant sample size would have the same effect as the results shown here, since it would give similar sample fractions.

The results show that the estimate of D rises roughly in proportion to the sample fraction (equivalently, at a fixed sample size it falls away as the table grows), but only over the middle range of sample sizes. At each end of the sample size scale, we handle matters correctly: if we sample enough, we recognise the actual distribution and decide to scale the result; if we sample a very small fraction and this reveals an apparently unique distribution, we decide that the distribution itself is unique and we suddenly decide D = N.

The test results don't seem too bad if you view the estimate of D as at most a factor of 10 wrong. However, since the error scales up with the size of the table, we can imagine very large estimation errors. The results show that even when we scan 60% of the example table we consistently get an answer around half the actual value of D. This is misleading because we are using row sampling: we have visited 18,000 blocks out of 29,412, yet only examined 18,000 rows out of 6 million. So we've paid the high I/O cost to scan the table, but not got the benefit from it. That could lead to the conclusion that increasing sample size has little or no benefit, but that is only true if we use row rather than block sampling techniques.

We do currently already calculate the correlation for each attribute, so this could be used to influence the results. However, the correlation value itself may be poor in a steady-state table such as lineitem, since the filling of data blocks is likely to be cyclic.

REVIEW OF RESEARCH

Andrew Dunstan points out Brutlag & Richardson [2000], "A Block Sampling Approach to Distinct Value Estimation", in which a number of different techniques are reviewed. [Some previous block-based estimators have been proposed on-list, though these had less statistical basis and their range of application was based on a priori knowledge of D.]

From it we note that block sampling is an effective technique when the attributes are genuinely randomly distributed within the table. Block sampling is also effective at identifying clustered (as opposed to merely correlated) attributes with relatively low sample fractions.
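The clustering effect in the test results above can be reproduced with a toy simulation. This is a sketch only: it lays out the same 6 million row distribution contiguously, models ANALYZE's sampling crudely as "choose some blocks, then sample rows within them", and applies the Haas-Stokes Duj1 formula; it is not the real ANALYZE code path, and the block/row parameters are my approximations of the stats-target-10 and stats-target-100 runs.

```python
import random
from collections import Counter

def duj1(N, n, d, f1):
    """Haas-Stokes 'Duj1' smoothed jackknife estimator."""
    return n * d / (n - f1 + f1 * n / N)

# Lay out the test-case table contiguously: 200k orderkeys at each of
# 2, 4, 6, 8, 10 rows per key => N = 6,000,000 rows, D_true = 1,000,000.
table = []
key = 0
for m in (2, 4, 6, 8, 10):
    for _ in range(200_000):
        table.extend([key] * m)
        key += 1
N, D_true = len(table), key

ROWS_PER_BLOCK = 204                       # ~29,412 blocks, as in the test
nblocks = (N + ROWS_PER_BLOCK - 1) // ROWS_PER_BLOCK

def analyze_estimate(nsample_blocks, nsample_rows, rng):
    """Crude model of ANALYZE: pick blocks, then sample rows within them."""
    chosen = rng.sample(range(nblocks), nsample_blocks)
    rows = [table[b * ROWS_PER_BLOCK + i]
            for b in chosen
            for i in range(min(ROWS_PER_BLOCK, N - b * ROWS_PER_BLOCK))]
    counts = Counter(rng.sample(rows, nsample_rows))
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    return duj1(N, nsample_rows, d, f1)

rng = random.Random(1)
low = analyze_estimate(nblocks // 10, 3_000, rng)    # ~stats target 10
high = analyze_estimate(nblocks, 30_000, rng)        # ~stats target 100
```

With these parameters the low-coverage run lands around a tenth of the true D, and the full-coverage run around 80% of it, broadly matching the 106k and ~800k figures in the table above: the clustered duplicates inside the few visited blocks inflate the repeat counts and drag the estimate down.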
Chaudhuri's estimator is based on a least-risk approach, rather than a greatest-accuracy approach, which does sound appealing should we not be able to apply an improved estimator.

Haas & Stokes [1998] note that "Initial experiments indicated that the reduction in estimation error due to using the high-order jackknife is outweighed by the increase in error due to uncertainty in the moment estimates." Experiments both here and by Brutlag & Richardson show that not doing so leads to poor results with clustered and correlated data.

Brutlag & Richardson conclude that "a hybrid estimator" may yield a better approach than the ones they have come up with. Chaudhuri et al [1998] mention an "adaptive page sampling" approach.

PROPOSALS

The current sampling approach visits many blocks yet retrieves few rows. We know that:

i) block sampling is an effective means of detecting clustered data
ii) block sampling could lead to bias in the conclusions reached
iii) the current ANALYZE approach has problems with clustered data

I propose:

1. Decide that D should scale with N if |correlation| > an arbitrary limit [0.99 suggested, to ensure only highly certain correlations are taken] (and only if we have not already decided that D > d). This alone would "fix" the estimate of D for all of the cases highlighted in my test case above. However, my test was both correlated and clustered, and many real cases would be clustered only (say, if people are reading values from a sequence before insertion, the correlation would not be so near perfect).

2. When

   i) we have a large table (> 1,000,000 rows if known, or > 50,000 * max stats target blocks), and
   ii) we have ample additional memory to collect a larger sample,

   we make these relatively minor changes to the current approach:

   a) We collect a sample that consists of both block- and row-sampled data. Any estimator applied to this data will then naturally be a hybrid estimator. This would be done in such a way that no new I/O is incurred, i.e. we would block sample *some* of the blocks which we will read anyway for row sampling purposes. Doing that increases our knowledge of clustered data, as well as lifting the sample size as required to maintain the accuracy of histogram calculation. This means we still use a fixed-size sample, but the sample size grows with maintenance_work_mem. The sample will still fit in memory, so the required sorts will still be fast in-memory sorts. Also, the sample size is larger without increasing the statistics targets for particular columns, and this happens automatically when maintenance_work_mem is set higher. e.g. if we estimate memory usage as 1000 * n, we can estimate how much memory is available for additional block-related sampling; if we block sample 10% of all blocks selected for row samples, this will use at most BLCKSZ * n / 10 additional memory.

   I'm not suggesting (yet) that we follow the suggestion of Chaudhuri et al for adaptive page/block sampling, since we do not actually compare values retrieved until we sort them later; that would require more involved changes to the current ANALYZE mechanisms. We can look at the row lengths to ensure that the block-sampled rows do not completely swamp the row-sampled ones.

   b) Look at individual block clustering using tupnoLink and the tuple TIDs. If the tuple TID == tupnoLink[tuple]->TID, then they are from the same block. This will allow us to spot block-clustered data even when the data is not highly correlated, because of the effects of deletion and insertion on steady-state tables. If clustering appears to exist, apply the best block estimator from Brutlag & Richardson [2000]; if not, stick with Haas/Stokes.

3. We should also apply multi-column heuristics to the estimation of D, once we have estimated all columns.
For column groups (pairs, triples, etc.) that form part of a PK, we know it must be true that D1 * D2 * ... * Dk >= N. In many cases we will be very confident of our estimate of D for a column when we decide D = d, i.e. with two columns we can use this to infer that D1 >= N/d when D2 = d. So we can do this in any case where we have confident estimates of all but one column; the required information is available at that time. e.g. if lineitem has primary key ( l_orderkey, l_linenumber ) and we know that there are at most 10 l_linenumber values in the table, then there should be at least N/10 values for l_orderkey, so set the estimate to that if it is lower (only).

4. There are a number of other heuristics we could use, but those become more specialised and complex as we proceed down that path.

5. We should implement hash-table overflow logic for HashAggregates, just as has been done for HashJoins. The overflow logic merely copes with the problem, which could mean that some queries run for very extended durations. We should emit a log message advising when the overflow logic subdivides the hash table more than twice (i.e. we have underestimated D by more than x4), so that admins can take action and/or -hackers gets to find out when badness occurs somewhere.

6. A set of functions to override n_distinct when we wish to do so, allowing database designers to set some of these values by definition rather than relying on ANALYZE to discover them. This would require another column on pg_statistic, so that the override value is never itself overwritten by subsequent ANALYZEs.

=================================================================
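As a footnote to proposal 3: once per-column estimates exist, the multi-column PK heuristic amounts to a simple clamp. A sketch follows; the function name and data structure are mine, not proposed code.

```python
def apply_pk_heuristic(N, d_estimates, pk_columns):
    """Clamp PK column distinct-estimates using D1 * D2 * ... * Dk >= N.

    For each PK column, the product of the other columns' estimates implies
    a lower bound of N / product for this column; raise the estimate to that
    bound if it is lower (never lower it).
    """
    for col in pk_columns:
        product_of_others = 1
        for other in pk_columns:
            if other != col:
                product_of_others *= d_estimates[other]
        lower_bound = N / product_of_others
        if d_estimates[col] < lower_bound:
            d_estimates[col] = lower_bound
    return d_estimates

# lineitem example: N = 6M, pk(l_orderkey, l_linenumber), at most 10
# l_linenumber values, l_orderkey underestimated at 106k by ANALYZE
est = apply_pk_heuristic(6_000_000,
                         {"l_orderkey": 106_000, "l_linenumber": 10},
                         ["l_orderkey", "l_linenumber"])
```

Here the l_orderkey estimate is raised from 106k to the implied bound of N/10 = 600k, while the confident l_linenumber estimate is left untouched.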

### Responses

- Re: Improving N-Distinct estimation by ANALYZE at 2006-01-04 19:49:16 from Tom Lane
- Re: Improving N-Distinct estimation by ANALYZE at 2006-01-04 23:57:49 from Jim C. Nasby
- Re: Improving N-Distinct estimation by ANALYZE at 2006-01-05 00:07:37 from Josh Berkus
