Re: robots.txt on git.postgresql.org

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: robots.txt on git.postgresql.org
Date: 2013-07-10 08:36:06
Message-ID: CABUevEyUM-CEmmBcHmX6VrnkHj8O7xYk6ZvfdSfk-T8O4jd-Vw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 10, 2013 at 10:25 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> On 07/09/2013 11:30 PM, Andres Freund wrote:
>> On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
>>> I note that git.postgresql.org's robots.txt refuses permission to crawl
>>> the git repository:
>>>
>>> http://git.postgresql.org/robots.txt
>>>
>>> User-agent: *
>>> Disallow: /
>>>
>>>
>>> I'm curious what motivates this. It's certainly useful to be able to
>>> search for commits.
>>
>> Gitweb is horribly slow. I don't think anybody with a bigger git repo
>> using gitweb can afford to let all the crawlers go through it.
>
> Wouldn't whacking a reverse proxy in front be a pretty reasonable
> option? There's a disk space cost, but using Apache's mod_proxy or
> similar would do quite nicely.

We already run this; that's what we did to make it survive at all. The
problem is that there are so many thousands of different URLs you can
reach on that site, and Google indexes them all by default.

It was before we had this that the site regularly died.
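
For reference, a minimal sketch of the kind of caching reverse proxy
being discussed, using Apache's mod_proxy and mod_cache with a disk
backend (the hostnames, backend port, and cache path below are
placeholders for illustration, not the actual git.postgresql.org
configuration):

    # Hypothetical vhost: cache rendered gitweb pages on disk so that
    # repeated requests never reach the (slow) gitweb CGI at all.
    <VirtualHost *:80>
        ServerName git.example.org

        # Forward everything to the backend gitweb instance.
        ProxyPass        / http://gitweb-backend.internal:8080/
        ProxyPassReverse / http://gitweb-backend.internal:8080/

        # Cache rendered pages on disk (requires mod_cache and its
        # disk storage module).
        CacheEnable disk /
        CacheRoot   /var/cache/apache2/gitweb
        CacheDefaultExpire 3600
        CacheIgnoreNoLastMod On
    </VirtualHost>

Even with such a cache in place, a crawler that walks every commit,
diff, and blame URL mostly hits pages that are not yet cached, which
is the problem described above.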

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
