Re: database 1.2G, pg_dump 73M?!

From: Ross Boylan <RossBoylan(at)stanfordalumni(dot)org>
To: Joris Dobbelsteen <Joris(at)familiedobbelsteen(dot)nl>
Cc: RossBoylan(at)stanfordalumni(dot)org, pgsql-general(at)postgresql(dot)org
Subject: Re: database 1.2G, pg_dump 73M?!
Date: 2008-03-30 22:23:22
Message-ID: 1206915802.8239.244.camel@corn.betterworld.us
Lists: pgsql-general


On Sun, 2008-03-30 at 22:59 +0100, Joris Dobbelsteen wrote:
> >-----Original Message-----
> >From: Ross Boylan [mailto:RossBoylan(at)stanfordalumni(dot)org]
> >Sent: Sunday, 30 March 2008 23:43
> >To: Joris Dobbelsteen
> >Cc: RossBoylan(at)stanfordalumni(dot)org; pgsql-general(at)postgresql(dot)org
> >Subject: Re: [GENERAL] database 1.2G, pg_dump 73M?!
> >
> >On Sun, 2008-03-30 at 21:22 +0200, Joris Dobbelsteen wrote:
> >> From the top contenders, about half are indexes, so you are stuck
> >> with ~200 MB of data in the tables.
> >> PostgreSQL has some wasted space due to placement of the tuples in
> >> a block and overhead for each block and row. I don't know those
> >> values, but they are in the range of 24 bytes per tuple, I believe.
> >> Secondly, a block is 8 KB by default and tuples cannot be stored
> >> across multiple blocks (that's what TOAST works around).
> >>
> >> All in all: look up tuple sizes; if they are small, then the
> >> overhead from PostgreSQL can be a big factor. If they are huge, you
> >> lose portions of unoccupied space in blocks. I believe pg_statistics
> >> will provide this information.
> >There is a pg_statistic (no "s") table, but I don't know how
> >to get tuple size from it--the documentation refers to the
> >source code to figure out the codes. Backing up a step, I
> >don't know what a tuple is in Postgres, and don't see an entry
> >for it in the index.
>
> It was pg_stats.
> You get avg_width. It gives this per column.
>
> So probably you want
> SELECT tablename, SUM(avg_width)
> FROM pg_stats
> WHERE schemaname = 'public'
[Thanks; I didn't know about using schemaname to limit it to interesting
tables]
> GROUP BY tablename [RB added] ORDER BY tablename;
tablename | sum
-----------+-----
client | 62
counters | 25
file | 109
filename | 18
fileset | 53
job | 149
jobmedia | 52
media | 226
mediatype | 16
path | 82
pool | 179
status | 29
storage | 23
version | 4

So, for example, if each tuple has 24 bytes of overhead, the overhead
more than doubles the size of the filename table (18 bytes of data per
row), which has a big record count. Am I following correctly?
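
For my own notes, the same pg_stats query can carry that assumed
overhead along (the 24 bytes/row is only the rough figure mentioned
above, not an exact number):

SELECT tablename,
       SUM(avg_width)      AS data_bytes_per_row,
       SUM(avg_width) + 24 AS est_bytes_per_row  -- assumed ~24-byte row overhead
FROM pg_stats
WHERE schemaname = 'public'
GROUP BY tablename
ORDER BY tablename;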

Between the space taken up by indices and the other overhead, the size
difference between the SQL dump and the database's disk usage is
starting to seem more reasonable.

The database uses SQL-ASCII encoding, so I'd expect the textual parts
(filenames and paths) to take up the same space (just for the basic
storage, not counting overhead/indices) in Postgres as on the dump, 1
byte/character.
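
As a sanity check on that (assuming the filename table's text column is
called "name", as I believe it is in the stock bacula schema), the byte
and character counts should come out equal:

SHOW server_encoding;  -- SQL_ASCII here

SELECT avg(octet_length(name)) AS avg_bytes,
       avg(char_length(name))  AS avg_chars  -- should match avg_bytes at 1 byte/char
FROM filename;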

I'm not sure what "If they are huge, you lose portions of unoccupied
space in blocks" means.
>
> >Is a tuple just a row? That's what the docs say for the following
> >report:
>
> Yes.
> Also where I typed "block" it will refer to "page".
>
> ># select distinct relname, reltuples, relpages from pg_class
> >where relkind='r' and substring(relname, 1, 3) != 'pg_';
> >
> > relname | reltuples | relpages
> >-------------------------+-----------+----------
> > basefiles | 0 | 0
> > cdimages | 0 | 0
> > client | 2 | 1
> > counters | 1 | 1
> > device | 0 | 0
> > file | 650659 | 11558
> > filename | 623012 | 3958
> > fileset | 22 | 1
> > job | 384 | 10
> > jobmedia | 596 | 7
> > location | 0 | 0
> > locationlog | 0 | 0
> > log | 0 | 0
> > media | 245 | 9
> > mediatype | 2 | 1
> > path | 67908 | 966
> > pool | 5 | 1
> > sql_features | 439 | 6
> > sql_implementation_info | 12 | 1
> > sql_languages | 4 | 1
> > sql_packages | 10 | 1
> > sql_parts | 9 | 1
> > sql_sizing | 23 | 1
> > sql_sizing_profiles | 0 | 0
> > status | 19 | 1
> > storage | 2 | 1
> > unsavedfiles | 0 | 0
> > version | 1 | 1
>
> Do relpages * 8192 / reltuples.
> Default installs use 8 KB (8192-byte) pages/blocks.
>
> For file that gives ~145 bytes/tuple.
> For filename it gives ~52 bytes/tuple.

Those tuple size estimates seem consistent with the tuple sizes reported
earlier when the overhead is added in. I'm impressed!
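
For the record, the same per-tuple figure can be pulled straight from
pg_class in one query (assuming the default 8 KB block size, which
SHOW block_size confirms):

SELECT relname, reltuples, relpages,
       round(relpages * 8192.0 / reltuples) AS bytes_per_tuple
FROM pg_class
WHERE relkind = 'r'
  AND reltuples > 0
  AND substring(relname, 1, 3) != 'pg_'
ORDER BY relname;
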
>
> Probably you will get some significant differences here.
>
> >> Another factor is that the representation in the SQL dump might be
> >> more efficient than in the database, but this depends highly on your
> >> data set. For example, an int8 takes 8 bytes in a table, while it
> >> takes between 1 and ~20 bytes in a SQL dump.
> >>
> >> How the plain SQL dump becomes this small I cannot explain without
> >> much more detail.
> >
> >On Tom's point, bacula regularly inserts entries into the
> >tables and then, days to months later, deletes them. As far
> >as I know, the VACUUM FULLs I just did were the first ever; I
> >did do several of them because I kept getting messages about
> >needing more fsm_pages.
> >
> >I am still trying to figure out if the database was getting
> >any automatic vacuuming at all. The Postgres documentation
> >(the database is 8.2, though I'm moving to 8.3 soon) sounds as
> >if it's on automatically, but the Debian-specific
> >documentation suggests I may need to do some additional things
> >to enable it.
> >
> >Probably the fsm_pages being low also hurt disk usage, since
> >the message accompanying the vacuum said that's what happens
> >if fsm_pages is low. It was 20k; vacuum said I needed 56k,
> >and I upped it to 80k.
> >I'm not sure if my recent cleaning has brought the needed
> >fsm_pages down.
> >
> >I've only been doing partial backups for the last few months,
> >so there's probably less info in the tables than under normal
> >conditions. I suppose it's possible the space I gained was
> >just a temporary win.
>
> Looks like some configuration changes are needed to tune your
> installation to better suit your needs. fsm_pages can be modified,
[done]
> but I don't know if any other parameters might need to change for
> that.
I didn't seem to bump into any system limits; I had already upped the
overall shared memory limit a bit.
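
For reference, the knobs I touched live in postgresql.conf; a sketch
with the values mentioned above (as I understand it, max_fsm_pages sits
in shared memory, so changing it needs a server restart):

# postgresql.conf
max_fsm_pages = 80000   # was 20000; VACUUM reported roughly 56000 needed
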
> Documentation can help, and probably a lot of people here could do and
> explain it blindfolded.

I'm not a DB admin; I only play one on my computer. I clearly need to
figure out how to get regular vacuum, analyze, and reindexing going (if
they aren't going already).
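
A minimal starting point, as far as I can tell (the table name is just
the biggest one from the listing above):

SHOW autovacuum;     -- is autovacuum enabled at all?
VACUUM ANALYZE;      -- vacuum and analyze every table in the current database
REINDEX TABLE file;  -- rebuild the indexes on the largest table
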
>
Thanks for all your help.
Ross
