Re: Large C files

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Peter Eisentraut <peter_e(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jan Urbański <wulczer(at)wulczer(dot)org>
Subject: Re: Large C files
Date: 2011-09-06 23:13:52
Message-ID: CAEYLb_XQf6JmppcNnEtN+T5mEuYw7YZxEe=LV_hyLWndGPfnWQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6 September 2011 21:07, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Sep 6, 2011 at 3:56 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> The only way I would entertain thoughts of major refactoring would be
>> if comprehensive tests were contributed, so we could verify everything
>> still works afterwards.
>
> That sounds like a really good idea.  Mind you, I don't have a very
> clear idea of how to design such tests and probably no immediate
> availability to do the work either, but I like the concept very much.

More or less off the top of my head, I don't think that it's much
trouble to write an automated test for this. Of course, it would be as
flawed as any test in that it can only prove the presence of errors by
the criteria of the test, not the absence of all errors.

Here's what could go wrong that we can test for when splitting a
monster translation unit into more manageable modules that I can
immediately think of, that is not already handled by compiler
diagnostics or the build farm (I'm thinking of problems arising when
some headers are excluded in new .c files because they appear to be
superfluous, but turn out to not be on some platform):

* Across TUs, we somehow fail to provide consistent linkage between
the old object file and the sum of the new object files.

* Within TUs, we unshadow a previously shadowed variable, so we link
to a global variable rather than one local to the original/other new
file. Unlikely to be a problem. Here's what I get when I compile
xlog.c in the usual way with the addition of the -Wshadow flag:

[peter(at)localhost transam]$ gcc -O0 -g -Wall -Wmissing-prototypes
-Wpointer-arith -Wdeclaration-after-statement -Wendif-labels
-Wformat-security -fno-strict-aliasing -fwrapv -Wshadow -g
-I../../../../src/include -D_GNU_SOURCE -c -o xlog.o xlog.c -MMD -MP
-MF .deps/xlog.Po
xlog.c: In function ‘XLogCheckBuffer’:
xlog.c:1279:12: warning: declaration of ‘lower’ shadows a global declaration
../../../../src/include/utils/builtins.h:793:14: warning: shadowed
declaration is here
xlog.c:1280:12: warning: declaration of ‘upper’ shadows a global declaration
../../../../src/include/utils/builtins.h:794:14: warning: shadowed
declaration is here
xlog.c: In function ‘XLogArchiveNotifySeg’:
xlog.c:1354:29: warning: declaration of ‘log’ shadows a global declaration
xlog.c: In function ‘XLogFileInit’:
xlog.c:2329:21: warning: declaration of ‘log’ shadows a global declaration
xlog.c: In function ‘XLogFileCopy’:
xlog.c:2480:21: warning: declaration of ‘log’ shadows a global declaration
xlog.c: In function ‘InstallXLogFileSegment’:
xlog.c:2598:32: warning: declaration of ‘log’ shadows a global declaration
xlog.c: In function ‘XLogFileOpen’:
xlog.c:2676:21: warning: declaration of ‘log’ shadows a global declaration
xlog.c: In function ‘XLogFileRead’:
*** SNIP, CUT OUT A BIT MORE OF SAME***

Having looked at the man page for nm, it should be possible to hack
together a shellscript for src/tools/ that:

1. Takes one filename as its first argument, and a set of 2 or more
equivalent split file names (no file extension - there is a
requirement that both $FILENAME.c and $FILENAME.o exist).

2. Looks at the symbol table for the original C file's corresponding
object file, say xlog.o, as output from nm, and sorts it.

3. Intelligently diffs that against the concatenated output of the
symbol table for each new object file. This output would be sorted
just as the the single object file nm output was, but only after
concatenation. Here, by intelligently I mean that we exclude undefined
symbols. That way, it shouldn't matter if symbols go missing when, for
example, Text section symbols from one file but show up in multiple
other new files as undefined symbols, nor should it matter that we
call functions like memcpy() from each new file that only appeared
once in the old object file's symbol table.

4. Do a similar kind of intelligent diff with the -Wshadow ouput
above, stripping out line numbers and file names as the first step of
a pipline, using sed, sorting the orignal file's compiler output, and
stripping, concatenating then sorting the new set of outputs. I think
that after that, the output from the original file should be the same
as the combined output of the new files, unless we messed up.

If you have to give a previously static variable global linkage, then
prima facie "you're doing it wrong", so we don't have to worry about
that case - maybe you can argue that it's okay in this one case, but
that's considered controversial. We detect this case because the
symbol type goes from 'd' to 'D'.

Of course, we expect that some functions will lose their internal
linkage as part of this process, and that is output by the shellscript
- no attempt is made to hide that. This test doesn't reduce the
failure to a simple pass or fail - it has to be interpreted by a
human. It does take the drudgery out of verifying that this mechanical
process has been performed correctly though.

I agree that C files shouldn't be split because they're big, but
because modularising them is a maintainability win. I also think that
10,000 line C files are a real problem that we should think about
addressing. These two views may seem to be in tension, but they're not
- if you're not able to usefully modularise such a large file, perhaps
it's time to reconsider your design (this is not an allusion to any
problems that I might believe any particular C file has) .

Anyone interested in this idea? I think that an actual implementation
will probably reveal a few more problems that haven't occurred to me
yet, but it's too late to produce one right now.

On 6 September 2011 08:29, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> I was thinking about splitting up plpython.c, but it's not even on that
> list. ;-)

IIRC the obesity of that file is something that Jan Urbański intends
to fix, or is at least concerned about. Perhaps you should compare
notes.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jan Urbański 2011-09-06 23:22:29 Re: Large C files
Previous Message Robert Haas 2011-09-06 22:57:07 Re: [PATCH] Log crashed backend's query (activity string)