Thoughts on the mirroring system etc

From: "Magnus Hagander" <mha(at)sollentuna(dot)net>
To: <pgsql-www(at)postgresql(dot)org>
Subject: Thoughts on the mirroring system etc
Date: 2005-01-20 12:11:37
Message-ID: 6BCB9D8A16AC4241919521715F4D8BCE476685@algol.sollentuna.se
Lists: pgsql-www

Hello!

In light of yesterday's release and what was probably the largest hit so
far on the current website's "way of things", I had a couple of thoughts.
The site more or less went down, which is not good. What's in there now
is a temporary fix, and a permanent one is needed. And one that does not
need manual intervention to fix (as this one did). So here are some
thoughts on what I think needs to be done.

I know some of these things have been discussed before. Some exactly the
same way, some slightly different. I know steps are in motion to do some
of them. I'm just lining up everything here. And yes, actually offering
to help out if wanted. Just say the words.

And if I'm stepping on someone's toes here, let me apologize in advance.
Just point me in the right direction. It's not my intention to be
someone who just complains about what is now, I'd rather be someone who
helps with ideas on how to move forward.

Number of mirrors
-----------------
* There are currently almost 60 mirrors for the static web content.

* During the very largest load during slashdotting etc, the three
servers serving up the static content totalled no more than a little
over 6Mbit of traffic, at a little under 500 requests / second.

* During this time, wwwmaster pushed around 1.5Mbit.

* As long as www.postgresql.org is fast, people will *not* pick their
local mirror for the web (ftp is a different thing, as it's more
bandwidth intensive).

This leads me to the conclusion that we do *not* in fact need the large
mirror network to handle the bandwidth load. In fact, most of those
sites probably use up more bandwidth syncing than they save. The mirror
network *is much needed* for redundancy, however, and we need better
automation for that. (A lot of man-hours were thrown in to fix this
problem. Next time, it's better if that work is done beforehand.)

My suggestion for this is to limit the number of mirrors to around 5,
give or take a few. But instead, put higher demands on these mirrors
than we do now. Demand they sync every 30 minutes (or 60, but you get
the point). Demand that they have a fast machine and a fast network
connection. There have been enough offers of servers and networks that
this should not be a major problem. Demand that they respond to
www.postgresql.org - if it can have a dedicated IP, even better.
Distributed across the world of course.

The other mirrors can stay if they want. To keep the load down, don't
let them sync to the master, just to another mirror (as it is now, with
only svr4, borg and eastside syncing to wwwmaster, and all others
syncing to svr4).

For wwwmaster, have two machines at different locations. Use Slony to
replicate the database. Some coding probably needed to manually handle
some updates (like the logs), since Slony isn't multimaster yet.
wwwmaster held up fine now, but if something happens to the box or the
network it's on we're dead in the water.

Then do some "DNS magic" to do the load balancing:
* Create a new zone, let's call it "mirrors.postgresql.org". With a TTL
of no more than 10-15 minutes. Distribute this zone to more DNS servers
than the current zone, since the load on the nameservers will be much
higher. But require that all these machines respond to update
notifications so they pick up changes *right away*. By creating a new
zone we can both separate the handling of it (so a bug only affects this
and not say the mailinglists etc), and we can keep the TTL on the main
zone fairly large.

* Add a CNAME for www.postgresql.org to
www-static.mirrors.postgresql.org

* Have a script running at a dedicated machine somewhere *very* well
connected that is *not* one of the webservers. This script will poll the
website every 5 minutes. If the site does not respond, it's dropped from
the zone right away. If it is not up to date, the site is dropped from
the zone if it's more than <n> minutes old (depending on how often sync
is demanded)

* This also provides a way to gracefully take one machine out of the
cluster without needing any manual hacking of DNS zones, etc. Simply
stop syncing and then wait an hour or so and all requests should be
elsewhere. Then once the machine is upgraded/reinstalled/moved/whatever,
just start syncing again and things should be picked up again.

A similar solution for wwwmaster, of course.
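
The polling logic described above could be sketched roughly as follows.
This is only an illustration of the decision step, not an existing
script: the `MirrorStatus` type, the 60 minute value standing in for
<n>, the 600 second TTL and the example IP addresses are all
assumptions, and the actual polling and zone update are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed values: the mail only says "10-15 minutes" and "<n> minutes".
ZONE_TTL_SECONDS = 600
MAX_SYNC_AGE = timedelta(minutes=60)


@dataclass
class MirrorStatus:
    """Result of one poll of one mirror (hypothetical structure)."""
    ip: str
    responded: bool                 # did the mirror answer the poll?
    last_sync: Optional[datetime]   # sync timestamp reported by the mirror


def healthy(status: MirrorStatus, now: datetime) -> bool:
    """A mirror stays in the zone only if it answered and is fresh enough."""
    if not status.responded or status.last_sync is None:
        return False
    return now - status.last_sync <= MAX_SYNC_AGE


def zone_records(statuses: List[MirrorStatus], now: datetime,
                 name: str = "www-static") -> List[str]:
    """Render one A record line per mirror that passed the check."""
    return [
        f"{name} {ZONE_TTL_SECONDS} IN A {s.ip}"
        for s in statuses
        if healthy(s, now)
    ]


if __name__ == "__main__":
    now = datetime(2005, 1, 20, 12, 0)
    polled = [
        MirrorStatus("192.0.2.1", True, now - timedelta(minutes=10)),
        MirrorStatus("192.0.2.2", False, None),                        # down
        MirrorStatus("192.0.2.3", True, now - timedelta(minutes=90)),  # stale
    ]
    for line in zone_records(polled, now):
        print(line)
```

A real script would probably also want to refuse to publish an empty
record set, so that a bug in the checker cannot take every mirror out
of the zone at once.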

I am willing to invest some time in doing these scripts if wanted. I
don't think it's a huge amount of work. And parts of it have already been
done by Dave in the current mirror checking script.

A similar solution can be made for the ftp servers, but I think it's of
less need there. If we want to do it, let's start with www and take it
from there if necessary.

Sync speed
----------
After setting up eastside to help handle the load of www.postgresql.org
I noticed the sync was horribly slow when nothing had changed. This was
because it synced the attributes on all files every time - the update
date, I believe. Dave has committed a couple of patches I made for this
now, and sync time has dropped from >5 minutes down to <5 seconds.

A mirror pull when *nothing* has changed is right now around 400Kb. With
60 servers syncing up that's a full 24Mb every time when nothing has
changed. With just 5 servers, well, do the math ;-)
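
For anyone who does want the math spelled out, here is a tiny sketch
using the figures above (about 400Kb per no-change pull, units kept as
quoted). The function name is just for illustration.

```python
PULL_SIZE_KB = 400  # size of one mirror pull when nothing has changed


def idle_sync_total_kb(mirrors: int, pull_size_kb: int = PULL_SIZE_KB) -> int:
    """Total data moved in one sync round when no content has changed."""
    return mirrors * pull_size_kb


if __name__ == "__main__":
    for mirrors in (60, 5):
        print(mirrors, "mirrors:", idle_sync_total_kb(mirrors), "Kb per round")
```

That gives 24000Kb (the 24Mb above) for 60 mirrors and 2000Kb, so
around 2Mb, for 5.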

Bittorrent/Ftp
--------------
As Dave has already referred to, I think it'd be good to link bittorrent
links from every file in the ftp browser. Slashdot linked directly to
the bittorrent downloads, and that showed. But once it fell down on the
slashdot page, the amount of people using bittorrent fell off very fast.
During peak my two seeders sent about 4Mbit/sec on bittorrent. Also, the
load hit bt.postgresql.org instead of www.postgresql.org, so it was not
distributed.

Since this means more bittorrent seeders, it should perhaps be on a
separate box from the web stuff. There could be several that just
rsynced the .torrents between each other so the project always has a
couple of seeders in. This would be a very easy point for people to just
"plug in more bandwidth" when required as well, since bittorrent
automatically makes sure that nobody can serve a non-up-to-date file,
etc. With some tweak to the scripts it ought to be possible to make this
run with just one process serving a whole lot of torrents - they just
need to be in the same directory.
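
The reason a seeder cannot serve an out-of-date file is that the
.torrent stores a SHA-1 hash for every piece, and downloaders reject
any piece that does not match. A minimal sketch of that check, not of a
real client; the piece length and helper names are illustrative
assumptions:

```python
import hashlib
from typing import List

PIECE_LENGTH = 256 * 1024  # illustrative; the real value is set per torrent


def piece_hashes(data: bytes, piece_length: int = PIECE_LENGTH) -> List[bytes]:
    """SHA-1 digest of each fixed-size piece, as recorded in a .torrent."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def piece_is_valid(piece: bytes, index: int, expected: List[bytes]) -> bool:
    """A downloader keeps a piece only if its hash matches the .torrent."""
    return hashlib.sha1(piece).digest() == expected[index]


if __name__ == "__main__":
    release = b"a" * 10 + b"b" * 10
    hashes = piece_hashes(release, piece_length=10)
    print(piece_is_valid(b"a" * 10, 0, hashes))     # piece from the right file
    print(piece_is_valid(b"old stuff!", 0, hashes))  # piece from another file
```

So a new seeder with a stale or damaged copy cannot do any harm; its
pieces simply fail the check.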

As for ftp mirrors, the bandwidth demand there is no doubt much higher
than it is on the web servers, so keeping more mirrors here makes a lot
of sense. Also, some of the ftp sites that mirror us now have *huge*
amounts of bandwidth (in the size of many gigabits/sec).

wwwmaster
---------
If you hit the ftp browser (or a download link), and then click anything
in the menu, you get the whole site served from wwwmaster. If the above
is fixed, so mirrors are all referred to as www.postgresql.org, it
should be as simple as sticking a <base href> in there or something. But
until then, perhaps some creative coding in the framework can fix it so
links that are hit on wwwmaster point back to www whereas the static
site uses relative links only?
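
One way the "creative coding" could look, as a rough sketch only: when
a page is rendered on wwwmaster, relative links are resolved against
www.postgresql.org, while the static site keeps them untouched. The
`link_for` helper and its parameters are hypothetical and not part of
the current framework.

```python
from urllib.parse import urljoin, urlsplit

STATIC_BASE = "http://www.postgresql.org/"


def link_for(href: str, page_path: str, served_from_master: bool) -> str:
    """Return the href to emit for a link found on the page at page_path."""
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        return href  # already absolute (http:, mailto:, //host/...), leave it
    if not served_from_master:
        return href  # static mirrors keep relative links
    # On wwwmaster: resolve against the same page's location on www.
    return urljoin(urljoin(STATIC_BASE, page_path), href)


if __name__ == "__main__":
    print(link_for("../../docs/", "/ftp/source/", served_from_master=True))
    print(link_for("../../docs/", "/ftp/source/", served_from_master=False))
```

Emitting a <base href> in the page head would achieve the same thing
without touching each link.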

Wow. That was a lot longer than initially intended. Hope someone has the
patience to read it all ;-)

//Magnus
