Quick Links

Re: database 1.2G, pg_dump 73M?!

From:	"Joris Dobbelsteen" <Joris(at)familiedobbelsteen(dot)nl>
To:	"Ross Boylan" <RossBoylan(at)stanfordalumni(dot)org>
Cc:	<pgsql-general(at)postgresql(dot)org>
Subject:	Re: database 1.2G, pg_dump 73M?!
Date:	2008-03-30 22:44:19
Message-ID:	E4953B65D9E5054AA6C227B410C56AA91B5F37@exchange1.joris2k.local
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

>-----Original Message-----
>From: Ross Boylan [mailto:RossBoylan(at)stanfordalumni(dot)org]
>Sent: Monday, 31 March 2008 0:23
>To: Joris Dobbelsteen
>Cc: RossBoylan(at)stanfordalumni(dot)org; pgsql-general(at)postgresql(dot)org
>Subject: RE: [GENERAL] database 1.2G, pg_dump 73M?!
>
>
>On Sun, 2008-03-30 at 22:59 +0100, Joris Dobbelsteen wrote:
>> >-----Original Message-----
>> >From: Ross Boylan [mailto:RossBoylan(at)stanfordalumni(dot)org]
>> >Sent: Sunday, 30 March 2008 23:43
>> >To: Joris Dobbelsteen
>> >Cc: RossBoylan(at)stanfordalumni(dot)org; pgsql-general(at)postgresql(dot)org
>> >Subject: Re: [GENERAL] database 1.2G, pg_dump 73M?!
>> >
>> >On Sun, 2008-03-30 at 21:22 +0200, Joris Dobbelsteen wrote:
>> >> From the top contenders, about half are indexes, so you
>are stuck
>> >> with ~200 MB of data in the tables.
>> >> Postgresql has some wasted space due to placement of the
>tuples in
>> >> a block and overhead for each block and row. I don't know
>> >those values,
>> >> but they are in the range of 24 bytes per tuple, I believe.
>> >Secondly a
>> >> block is 8 KB by default and tuples cannot be stored into
>multiple
>> >> blocks (thats what toast to work around).
>> >>
>> >> All in all: Lookup tuple sizes, if they are small than
>the overhead
>> >> from postgresql can be a big factor. If you are huge you loose on
>> >> portions of unoccupied space in blocks. I believe pg_statistics
>> >> will provide this information.
>> >There is a pg_statistic (no "s") table, but I don't know
>how to get
>> >tuple size from it--the documentation refers to the source code to
>> >figure out the codes. Backing up a step, I don't know what a tuple
>> >is in Postgres, and don't see an entry for it in the index.
>>
>> It was pg_stats.
>> You get avg_width. It gives this per column.
>>
>> So probably you want
>> SELECT tablename, SUM(avg_width)
>> FROM pg_stats
>> WHERE schemaname = 'public'
>[Thanks; I didn't know about using schemaname to limit it to
>interesting tables]
>> GROUP BY tablename [RB added] ORDER BY tablename;
> tablename | sum
>-----------+-----
> client | 62
> counters | 25
> file | 109
> filename | 18
> fileset | 53
> job | 149
> jobmedia | 52
> media | 226
> mediatype | 16
> path | 82
> pool | 179
> status | 29
> storage | 23
> version | 4
>
>So, for example, if each tuple has 24 bytes of overhead, the
>overhead more than doubles the size of the file table (18
>bytes), which has a big record count. Am I following correctly?

Yes.
(Note not to pin down on the 24 bytes, it varies between versions. I
think this is close enough however).

>Between the space taken up by indices and the other overhead,
>the size difference between the sql dump and the db disk
>useage is starting to seem more reasonable.
>
>The database uses SQL-ASCII encoding, so I'd expect the
>textual parts (filenames and paths) to take up the same space
>(just for the basic storage, not counting overhead/indices) in
>Postgres as on the dump, 1 byte/character.

That's the idea.

>I'm not sure what "If you are huge you loose on portions of
>unoccupied space in blocks" means.

A tuple has to be in exactly 1 page (or block). It cannot span multiple
pages. Hence if your tuples happens to be a half page size or slightly
larger, than only a single one will fit in a single page. So you waste
half the page. You don't have to worry about this, as your tuples are
significantly smaller than that.

(As a reference only: if a tuple is larger than a blocksize some
attributes will be evicted to the toast table or compressed. So that is
to overcome the limitations of 8000 bytes per tuple.)

[snip]
>>
>> ># select distinct relname, reltuples, relpages from pg_class where
>> >relkind='r' and substring(relname, 1, 3) != 'pg_';
>> >
>> > relname | reltuples | relpages
>> >-------------------------+-----------+----------
>> > basefiles | 0 | 0
>> > cdimages | 0 | 0
>> > client | 2 | 1
>> > counters | 1 | 1
>> > device | 0 | 0
>> > file | 650659 | 11558
>> > filename | 623012 | 3958
>> > fileset | 22 | 1
>> > job | 384 | 10
>> > jobmedia | 596 | 7
>> > location | 0 | 0
>> > locationlog | 0 | 0
>> > log | 0 | 0
>> > media | 245 | 9
>> > mediatype | 2 | 1
>> > path | 67908 | 966
>> > pool | 5 | 1
>> > sql_features | 439 | 6
>> > sql_implementation_info | 12 | 1
>> > sql_languages | 4 | 1
>> > sql_packages | 10 | 1
>> > sql_parts | 9 | 1
>> > sql_sizing | 23 | 1
>> > sql_sizing_profiles | 0 | 0
>> > status | 19 | 1
>> > storage | 2 | 1
>> > unsavedfiles | 0 | 0
>> > version | 1 | 1
>>
>> Do relpages * 8096 / reltuples.
>> Default installs have 8 KB pages/blocks by default.
>>
>> For file it should be ~144 bytes/tuple.
>> For filename it should be ~51 bytes/tuple.
>
>Those tuple size estimates seem consistent with the tuple
>sizes reported earlier when the overhead is added in. I'm impressed!
>
[snip]

- Joris

In response to

Re: database 1.2G, pg_dump 73M?! at 2008-03-30 22:23:22 from Ross Boylan

Browse pgsql-general by date

	From	Date	Subject
Next Message	brian	2008-03-30 23:14:04	Re: shortcut for select * where id in (arrayvar)
Previous Message	Ross Boylan	2008-03-30 22:23:22	Re: database 1.2G, pg_dump 73M?!