Re: TB-sized databases

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Pablo Alcaraz <pabloa(at)laotraesquina(dot)com(dot)ar>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: TB-sized databases
Date: 2007-11-28 15:12:05
Message-ID: 20071128151205.GL5118@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Pablo Alcaraz escribió:

> In my opinion there are queries that I think they ll need to be tuned for
> "huge databases" (huge databases = a database which relevant
> tables(indexes) are (will be) far bigger that all the ram available):
>
> -- example table
> CREATE TABLE homes (
> id bigserial,
> name text,
> location text,
> bigint money_win,
> int zipcode;
> );
> CREATE INDEX money_win_idx ON homes(money_win);
> CREATE INDEX zipcode_idx ON homes(zipcode);

Your example does not work, so I created my own for your first item.

alvherre=# create table test (a int primary key);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for table "test"
CREATE TABLE
alvherre=# insert into test select * from generate_series(1, 100000);
INSERT 0 100000
alvherre=# analyze test;
ANALYZE

> SELECT max( id) from homes;
> I think the information to get the max row quickly could be found using the
> pk index. Idem min( id).

alvherre=# explain analyze select max(a) from test;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.03..0.04 rows=1 width=0) (actual time=0.054..0.057 rows=1 loops=1)
InitPlan
-> Limit (cost=0.00..0.03 rows=1 width=4) (actual time=0.041..0.043 rows=1 loops=1)
-> Index Scan Backward using test_pkey on test (cost=0.00..3148.26 rows=100000 width=4) (actual time=0.034..0.034 rows=1 loops=1)
Filter: (a IS NOT NULL)
Total runtime: 0.143 ms
(6 rows)

> SELECT max( id) from homes WHERE id > 8000000000;
> Same, but useful to find out the same thing in partitioned tables (using id
> like partition criteria). It would be nice if Postgres would not need the
> WHERE clause to realize it does not need to scan every single partition,
> but only the last. Idem min(id).

Yeah, this could be improved.

> SELECT * from homes WHERE money_win = 1300000000;
> Postgres thinks too easily to solve these kind of queries that it must to
> do a sequential scan where the table (or the index) does not fix in memory
> if the number of rows is not near 1 (example: if the query returns 5000
> rows). Same case with filters like 'WHERE money_win >= xx', 'WHERE
> money_win BETWEEN xx AND yy'. But I do not know if this behavior is because
> I did a wrong posgresql's configuration or I missed something.

There are thresholds to switch from index scan to seqscans. It depends
on the selectivity of the clauses.

> SELECT count( *) from homes;
> it would be *cute* that Postgres stores this value and only recalculate if
> it thinks the stored value is wrong (example: after an anormal shutdown).

This is not as easy as you put it for reasons that have been discussed
at length. I'll only say that there are workarounds to make counting
quick.

> SELECT zipcode, count( zipcode) as n from homes GROUP BY zipcode;
> it would be *very cute* that Postgres could store this value (or is this
> there?) on the index or wherever and it only recalculates if it thinks the
> stored value is wrong (example: after an anormal shutdown).

Same as above.

> Last but not least, it would be *excelent* that this kind of optimization
> would be posible without weird non standard sql sentences.

Right. If you can afford to sponsor development, it could make them a
reality sooner.

--
Alvaro Herrera Valdivia, Chile ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
"You're _really_ hosed if the person doing the hiring doesn't understand
relational systems: you end up with a whole raft of programmers, none of
whom has had a Date with the clue stick." (Andrew Sullivan)

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Brendan McMahon 2007-11-28 15:30:22 Optimizer regression 8.2.1 -> 8.2.3 on TSEARCH2 queries with ORDER BY and LIMIT
Previous Message david 2007-11-28 15:03:26 Re: TB-sized databases