Re: PostgreSQL GIT mirror status

From: "Daniel Farina" <drfarina(at)acm(dot)org>
To: "Peter Eisentraut" <peter_e(at)gmx(dot)net>
Cc: pgsql-www(at)postgresql(dot)org, "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: PostgreSQL GIT mirror status
Date: 2009-01-09 10:53:24
Message-ID: 7b97c5a40901090253w25ddd4e5q2e104e58a998610f@mail.gmail.com
Lists: pgsql-www

Okay, final report:

I suggest running 'git gc' from time to time instead of running
'git repack' directly. It seems smart enough on modern git versions to
have some sensible limits and generally do the right thing to keep a repository
in shape, in spite of its name suggesting it's really 'just' for
garbage collection. It'll also detect an excessive number of packs and
consolidate them. Tweaking the gc options may be preferable to messing
around with repack options directly, but I found there was no need to
tweak to see large improvement.
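
A minimal sketch of that routine maintenance, assuming a reasonably modern
git. The function name 'gc_report' and the repository path in the usage line
are made up for illustration; 'count-objects -v' is only there to show the
loose-object and pack counts before and after.

```shell
# Print object/pack statistics, run a plain 'git gc', then print them again.
# $1 is the path to the git directory (a bare repo, or the .git of a clone).
gc_report() {
    repo="$1"
    git --git-dir="$repo" count-objects -v
    # Let git decide what needs repacking and pruning.
    git --git-dir="$repo" gc --quiet
    # 'packs' should now be small and 'size-pack' is the packed size in KiB.
    git --git-dir="$repo" count-objects -v
}
```

$ gc_report /path/to/postgresql.git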

Secondly, 'git gc' has the '--aggressive' option. This used to do
something really misleading, but I'm pretty sure it's fixed 'now',
although I couldn't point you to an exact version. This makes life
easy: just run 'git gc --aggressive' once in a long while. Given the
current data it seems that the pack should be about 100M
afterwards.
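
As a rough sketch of the occasional aggressive run, with a before/after
size check: the helper names 'pack_size_kib' and 'aggressive_gc' and the
path in the usage line are hypothetical, and the sizes printed will depend
on the repository.

```shell
# Report the packed size in KiB ('size-pack') of the git directory at $1.
pack_size_kib() {
    git --git-dir="$1" count-objects -v | awk '/^size-pack:/ { print $2 }'
}

# Run an aggressive gc and print the packed size before and after.
aggressive_gc() {
    repo="$1"
    before=$(pack_size_kib "$repo")
    git --git-dir="$repo" gc --aggressive --quiet
    after=$(pack_size_kib "$repo")
    echo "size-pack: ${before} KiB -> ${after} KiB"
}
```

$ aggressive_gc /path/to/postgresql.git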

Thirdly, I found a lot of garbage. There was no garbage when I used
wget to fetch a copy of the repo (over 600000 objects), but when I
pushed to a git clone, git chose to send only something in the 300000
object range. I suspect the difference is in the reflog or something,
but I still can't explain why there was so much garbage that's not
connected to branches or tags. Regardless, all the branches seem
present and 'git fsck' says everything is okay. I'm trying to figure
out where those extra objects are reachable from, but that's mostly
for completeness -- everything seems to be working convincingly.
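
One way to check the reflog theory, sketched under the assumption that
'git fsck' is available as in current git: count the objects fsck reports
as unreachable, with reflog entries ignored. The function name
'count_unreachable' and the path in the usage line are made up.

```shell
# Count objects that no branch, tag, or other ref can reach.
# --no-reflogs tells fsck not to treat reflog entries as reachability
# roots, so objects kept alive only by the reflog are counted too.
count_unreachable() {
    git --git-dir="$1" fsck --unreachable --no-reflogs 2>/dev/null |
        grep -c '^unreachable ' || true
}
```

$ count_unreachable /path/to/postgresql.git

Running the same fsck without '--no-reflogs' and comparing the two counts
would show how much of the garbage is held only by the reflog.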

I only have access to a machine where I've set up a 'dumb' git repo
that only serves via http. It's at
http://fdr.lolrus.org/postgresql.git

If you are interested in grabbing a verbatim copy of my objects and
repo, you can run the following to get an exact, untouched mirror:

$ wget -np -erobots=off -r http://fdr.lolrus.org/postgresql.git

You will probably have to delete any spurious 'index.html' files that
wget grabs before the repository will work as-is.
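
A small sketch of that cleanup, assuming nothing legitimate in the mirrored
repository has a name starting with 'index.html'. The function name
'clean_wget_mirror' is made up, and the directory in the usage line assumes
wget's usual host-named output directory.

```shell
# Delete the directory-listing pages wget saves alongside the real files.
# The 'index.html*' pattern also catches names such as 'index.html?C=N;O=D'.
# Each removed path is printed so the cleanup can be reviewed.
clean_wget_mirror() {
    find "$1" -type f -name 'index.html*' -print -exec rm -f {} +
}
```

$ clean_wget_mirror fdr.lolrus.org/postgresql.git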

Conclusion: 361M (plus pathological performance issues) to 246M (just
repacking) to 110M (aggressive repacking).

fdr

Addendum:

I tried repack with much deeper delta chains (that's what took so long
to compute, as alluded to in my previous email) and it did cut down
size by another 20 megs or so, but many operations are much more
costly because of the long chains. The 20 meg increase in size buys a
lot of performance, so I think default 'git gc --aggressive' uses a
more reasonable trade-off.
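
For anyone who wants to repeat the experiment, a sketch of a manual repack
with explicit delta settings. The function name 'deep_repack', the path in
the usage line, and the depth and window values are illustrative
assumptions, not the settings used for the numbers above.

```shell
# Repack everything into a single pack, recomputing deltas from scratch.
#   -a  pack all reachable objects into one pack
#   -d  remove the packs made redundant by the new one
#   -f  do not reuse existing deltas
# $2 (delta chain depth) and $3 (delta search window) are illustrative.
deep_repack() {
    repo="$1"
    depth="${2:-250}"
    window="${3:-250}"
    git --git-dir="$repo" repack -a -d -f -q \
        --depth="$depth" --window="$window"
}
```

$ deep_repack /path/to/postgresql.git 500 250

Deeper chains shrink the pack but make object access slower, which is the
trade-off described above.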
