Re: Commitfest 2023-03 starting tomorrow!

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Greg Stark <stark(at)mit(dot)edu>, "Gregory Stark (as CFM)" <stark(dot)cfm(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Commitfest 2023-03 starting tomorrow!
Date: 2023-03-22 05:45:40
Message-ID: CA+hUKGK=7mTwheXRfxz=bD47+m7WUa2xWmce0EfoycsfRN98wg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 21, 2023 at 10:59 PM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> I gave a talk on Friday at a private EDB mini-conference about the
> PostgreSQL open source process; and while preparing for that one, I
> ran some 'git log' commands to obtain the number of code contributors
> for each release, going back to 9.4 (when we started using the
> 'Authors:' tag more prominently). What I saw is a decline in the number
> of unique contributors, from its maximum at version 12, down to the
> numbers we had in 9.5. We went back 4 years. That scared me a lot.

Can you share the subtotals?

One immediate thought about commit log-based data is that we're not
using git Author, and the Author footer convention is only used by
some committers. So I guess it must have been pretty laborious to
read the prose-form data? We do have machine-readable Discussion
footers though. By scanning those threads for SMTP From headers on
messages that had patches attached, we can find the set of (distinct)
addresses that contributed to each commit. (I understand that some
people are co-authors and may not send an email, but if you counted
those and I didn't then you counted more, not fewer, contributors I
guess? On the other hand if someone posted a patch that wasn't used
in the commit, or posted from two home/work/whatever accounts that's a
false positive for my technique.)
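
The scanning step described above could be sketched roughly as follows. This is only an illustration, not the script actually used: the function names are hypothetical, the footer is assumed to look like "Discussion: https://postgr.es/m/<message-id>", and a message is assumed to carry a patch when an attachment's filename ends in .patch or .diff.

```python
import re
from email import message_from_string

# Assumed footer shape: "Discussion: <URL ending in the message-id>".
DISCUSSION_RE = re.compile(
    r"^Discussion:\s*https?://\S+/([^/\s]+)\s*$", re.MULTILINE
)

# Assumption: these filename suffixes identify an attached patch.
PATCH_SUFFIXES = (".patch", ".diff")


def discussion_ids(commit_message):
    """Return the message-ids named in a commit's Discussion footers."""
    return DISCUSSION_RE.findall(commit_message)


def patch_senders(raw_messages):
    """Return the distinct From headers of messages that attach a patch.

    raw_messages is an iterable of RFC 5322 message texts from one thread.
    """
    senders = set()
    for raw in raw_messages:
        msg = message_from_string(raw)
        has_patch = any(
            (part.get_filename() or "").lower().endswith(PATCH_SUFFIXES)
            for part in msg.walk()
        )
        if has_patch and msg["From"]:
            senders.add(msg["From"].strip())
    return senders
```

Fetching the thread for each message-id is left out, since it depends on how the archive is accessed. As noted above, one person posting from two accounts would still be counted twice by this approach.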

In a quick and dirty attempt at this made from bits of Python I
already had lying around (which may of course later turn out to be
flawed and need refinement), I extracted, for example:

postgres=# select * from t where commit = '8d578b9b2e37a4d9d6f422ced5126acec62365a7';
                  commit                  |          time          |                   address
------------------------------------------+------------------------+----------------------------------------------
 8d578b9b2e37a4d9d6f422ced5126acec62365a7 | 2023-03-21 14:29:34+13 | Melanie Plageman <melanieplageman(at)gmail(dot)com>
 8d578b9b2e37a4d9d6f422ced5126acec62365a7 | 2023-03-21 14:29:34+13 | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
(2 rows)

You can really only go back about 5-7 years before that technique runs
out of steam, as the links run out. For what they're worth, these
numbers seem to suggest that around 260 distinct email addresses per
year sent patches to threads referenced by commits. Maybe we're in a
3-year-long plateau, but I don't see a peak back in r12:

postgres=# select date_trunc('year', time), count(distinct address)
from t group by 1 order by 1;
date_trunc | count
------------------------+-------
2015-01-01 00:00:00+13 | 13
2016-01-01 00:00:00+13 | 37
2017-01-01 00:00:00+13 | 144
2018-01-01 00:00:00+13 | 187
2019-01-01 00:00:00+13 | 225
2020-01-01 00:00:00+13 | 260
2021-01-01 00:00:00+13 | 256
2022-01-01 00:00:00+13 | 262
2023-01-01 00:00:00+13 | 119
(9 rows)

Of course 2023 is only just getting started. Zooming in closer, the
peak period for this measurement is March/April, as I guess a lot of
little things make it into the final push:

postgres=# select date_trunc('month', time), count(distinct address)
from t where time > '2021-01-01' group by 1 order by 1;
date_trunc | count
------------------------+-------
2021-01-01 00:00:00+13 | 83
2021-02-01 00:00:00+13 | 70
2021-03-01 00:00:00+13 | 100
2021-04-01 00:00:00+13 | 109
2021-05-01 00:00:00+12 | 54
2021-06-01 00:00:00+12 | 82
2021-07-01 00:00:00+12 | 86
2021-08-01 00:00:00+12 | 83
2021-09-01 00:00:00+12 | 73
2021-10-01 00:00:00+13 | 68
2021-11-01 00:00:00+13 | 66
2021-12-01 00:00:00+13 | 48
2022-01-01 00:00:00+13 | 68
2022-02-01 00:00:00+13 | 73
2022-03-01 00:00:00+13 | 110
2022-04-01 00:00:00+13 | 90
2022-05-01 00:00:00+12 | 47
2022-06-01 00:00:00+12 | 50
2022-07-01 00:00:00+12 | 72
2022-08-01 00:00:00+12 | 81
2022-09-01 00:00:00+12 | 105
2022-10-01 00:00:00+13 | 68
2022-11-01 00:00:00+13 | 74
2022-12-01 00:00:00+13 | 58
2023-01-01 00:00:00+13 | 65
2023-02-01 00:00:00+13 | 61
2023-03-01 00:00:00+13 | 64
(27 rows)

Perhaps the present March is looking a little light compared to the
usual 100+ number, but actually if you take just the 1st to the 21st
of previous Marches, the numbers were similar:

postgres=# select date_trunc('month', time), count(distinct address)
from t
where (time >= '2022-03-01' and time <= '2022-03-21') or
(time >= '2021-03-01' and time <= '2021-03-21') or
(time >= '2020-03-01' and time <= '2020-03-21') or
(time >= '2019-03-01' and time <= '2019-03-21')
group by 1 order by 1;
date_trunc | count
------------------------+-------
2019-03-01 00:00:00+13 | 57
2020-03-01 00:00:00+13 | 57
2021-03-01 00:00:00+13 | 77
2022-03-01 00:00:00+13 | 72
(4 rows)
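
The same partial-month comparison can be sketched in Python for any month and cutoff day, which avoids spelling out one date range per year. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the rows are assumed to be (commit time, address) pairs already loaded from a table like t. Unlike a comparison against '2022-03-21', which as a timestamp means midnight at the start of that day, this version includes the whole of the cutoff day, so its counts could come out slightly higher.

```python
from collections import defaultdict
from datetime import datetime


def distinct_in_window(rows, month, last_day):
    """Count distinct addresses per year within day 1..last_day of a month.

    rows is an iterable of (commit_time, address) pairs, where commit_time
    is a datetime. Returns a dict mapping year to the distinct count.
    """
    seen = defaultdict(set)
    for when, address in rows:
        if when.month == month and when.day <= last_day:
            seen[when.year].add(address)
    return {year: len(addresses) for year, addresses in sorted(seen.items())}
```

For the comparison above, the call would be distinct_in_window(rows, month=3, last_day=21).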

Another thing we could count is distinct names in the Commitfest app.
I count 162 names in Commitfest 42 today. Unfortunately I don't have
the data to hand to look at earlier Commitfests. That'd be
interesting. I've plotted that before back in 2018 for some
conference talk, and it was at ~100 and climbing back then.

> So I started a conversation about that and some people told me that it's
> very easy to be discouraged by our process. I don't need to mention
> that it's antiquated -- this in itself turns off youngsters. But in
> addition to that, I think newbies might be discouraged because their
> contributions seem to go nowhere even after following the process.

I don't disagree with your sentiment, though.

> This led me to suggesting that perhaps we need to be more lenient when
> it comes to new contributors. As I said, for seasoned contributors,
> it's not a problem to keep up with our requirements, however silly they
> are. But people who spend their evenings a whole week or month trying
> to understand how to patch for one thing that they want, to be received
> by six months of silence followed by a constant influx of "please rebase
> please rebase please rebase", no useful feedback, and termination with
> "eh, you haven't rebased for the 1001th time, your patch has been WoA
> for X days, we're setting it RwF, feel free to return next year" ...
> they are most certainly off-put and will *not* try again next year.

Right, that is pretty discouraging.
