Space wasted by denormalized data

From: Decibel! <decibel(at)decibel(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: Space wasted by denormalized data
Date: 2008-02-28 21:57:17
Message-ID: 20080228215717.GN1212@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I wrote this query to identify how much space is being wasted in a
database by denormalized data. For each field in each table, it
calculates how much space the field is taking in it's table as well as
indexes, then it calculates how much space would be needed if that field
was normalized into a separate table. It places some (somewhat
arbitrary) minimums on how much space would have to be saved to include
that field in the output. If you want to get rid of the limit you should
still keep savings > 0 in the query, otherwise you'll start seeing
normalization suggestions that make no sense (like normalizing an int).

As it stands, the query reports a total for the entire database, but you
can use just sub-sections of the query to see savings per-table, or
per-field.

-- Total
SELECT pg_size_pretty(sum(sum)::bigint)
-- Summarize by table
FROM (SELECT schemaname, tablename, sum(savings), pg_size_pretty(sum(savings)::bigint)
-- Get pretty size. Start here if you want per-table info
FROM (SELECT *, pg_size_pretty(savings::bigint)
FROM (
-- Here's where the real work starts
SELECT *, table_space_delta + coalesce(index_space_delta, 0) - side_table_space AS savings
FROM (
-- Figure out how much space we'd save in indexes by converting to an int
SELECT *, index_tuple_count*(avg_width-4) AS index_space_delta
FROM (SELECT s.*, index_count, index_tuple_count
FROM (SELECT schemaname, tablename, attname, null_frac, avg_width, n_distinct
-- How much space would we gain by changing this field to an int?
, reltuples*(1-null_frac)*(avg_width-4) AS table_space_delta
-- Estimate how big our "side table" will be
, CASE WHEN n_distinct >= 0 THEN n_distinct ELSE -n_distinct * reltuples END
* (24+4+avg_width+6+4+6+avg_width) AS side_table_space
FROM pg_stats s
JOIN pg_class c ON c.relname=tablename AND c.relkind='r'
JOIN pg_namespace n ON n.oid = c.relnamespace AND n.nspname=s.schemaname
WHERE schemaname NOT IN ('pg_catalog','information_schema')
) s
NATURAL LEFT JOIN (
SELECT n.nspname AS schemaname, c.relname AS tablename, attname
, count(*) AS index_count, sum(i.reltuples) AS index_tuple_count
FROM pg_index x
JOIN pg_class c ON c.oid = x.indrelid
JOIN pg_class i ON i.oid = x.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_attribute a ON a.attrelid = i.oid
GROUP BY n.nspname, c.relname, attname
) i
) a
) a
) a
-- Minimum savings to consider per-field
WHERE savings > 1e6) a
GROUP BY schemaname, tablename
-- Minimum savings to consider per-table
HAVING sum(savings) > 1e7
ORDER BY sum(savings) DESC
) a
;
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Colin Wetherbee 2008-02-28 22:12:17 Re: Space wasted by denormalized data
Previous Message Scara Maccai 2008-02-28 21:46:54 Re: partitioning using dblink