fixing bookindex.html bloat

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter(at)eisentraut(dot)org>
Subject: fixing bookindex.html bloat
Date: 2022-02-13 20:16:18
Message-ID: 20220213201618.qz6p6noon3wagr3f@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

Sometime last year I was surprised to see (not on a public list unfortunately)
that bookindex.html is 657kB, with > 200kB just being repetitions of
xmlns="http://www.w3.org/1999/xhtml" xmlns:xlink="http://www.w3.org/1999/xlink"
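
A minimal sketch of how the overhead of the two repeated declarations could be measured. The helper name declaration_overhead is hypothetical, and the byte count assumes one separating space per declaration and ASCII-only attribute text:

```python
# Namespace declarations that show up repeatedly in bookindex.html.
XHTML_DECL = 'xmlns="http://www.w3.org/1999/xhtml"'
XLINK_DECL = 'xmlns:xlink="http://www.w3.org/1999/xlink"'


def declaration_overhead(html: str) -> dict:
    """Map each declaration to (occurrences, characters spent on it).

    The character count includes one leading space per occurrence,
    since each declaration appears as an attribute inside a tag.
    """
    result = {}
    for decl in (XHTML_DECL, XLINK_DECL):
        count = html.count(decl)
        result[decl] = (count, count * (len(decl) + 1))
    return result


# Tiny illustrative input; a real run would read html/bookindex.html instead.
sample = (
    f'<div {XHTML_DECL} {XLINK_DECL} class="index">'
    f'<dt {XHTML_DECL} {XLINK_DECL}>example</dt>'
    '</div>'
)
overhead = declaration_overhead(sample)
for decl, (count, chars) in overhead.items():
    print(f"{count:6d} x {decl} = {chars} characters")
```

To apply it to a real build, the sample string can be replaced with the contents of the generated file, e.g. read via pathlib.Path("html/bookindex.html").read_text(encoding="utf-8").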

Reminded of this, due to a proposal to automatically generate docs as part of
cfbot runs (which'd be fairly likely to update bookindex.html), I spent a few
painful hours last night trying to track this down.

The reasons for the two xmlns= are different. The
xmlns="http://www.w3.org/1999/xhtml" is afaict caused by confusion on our
part.

Some of our stylesheets use
xmlns="http://www.w3.org/TR/xhtml1/transitional"
others use
xmlns="http://www.w3.org/1999/xhtml"

It's noteworthy that the docbook xsl stylesheets end up with
<html xmlns="http://www.w3.org/1999/xhtml">
so it's a bit pointless to reference http://www.w3.org/TR/xhtml1/transitional
afaict.

Adding xmlns="http://www.w3.org/1999/xhtml" to stylesheet-html-common.xsl gets
rid of xmlns="http://www.w3.org/TR/xhtml1/transitional" in bookindex-specific
content.

Changing stylesheet.xsl from transitional to http://www.w3.org/1999/xhtml gets
rid of xmlns="http://www.w3.org/TR/xhtml1/transitional" in navigation/footer.

Of course we should likely change all http://www.w3.org/TR/xhtml1/transitional
references, rather than just the one necessary to get rid of the xmlns= spam.

So far, so easy. It took me way longer to understand what's causing all the
xmlns:xlink= appearances.

For a long time I was misdirected because if I remove the <xsl:template
name="generate-basic-index"> in stylesheet-html-common.xsl, the number of
xmlns:xlink drops drastically, to a handful. That made me think that their
existence is somehow our fault. And I tried and tried to find the cause.

But it turns out that this is originally caused by a still-existing buglet in
the docbook xsl stylesheets, specifically autoidx.xsl. It doesn't list xlink
in exclude-result-prefixes, but uses ids etc from xlink.

The reason that we end up with so many more xmlns:xlink is just that without
our customization there ends up being a single
<div xmlns:xlink="http://www.w3.org/1999/xlink" class="index">
and then everything below that doesn't need the xmlns:xlink anymore. But
because stylesheet-html-common.xsl emits the div, the xmlns:xlink is emitted
for each element that autoidx.xsl has "control" over.

Waiting for docbook to fix this seems a bit futile: I eventually found a
bug report about this, from 2016: https://sourceforge.net/p/docbook/bugs/1384/

But we can easily reduce the "impact" of the issue, by just adding a single
xmlns:xlink to <div class="index">, which is sufficient to convince xsltproc
to not repeat it.
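
The reason a single declaration is enough is XML namespace scoping: a prefix declared on an element is in scope for all of its descendants. The sketch below illustrates that with Python's standard-library parser rather than xsltproc. The dl/dt markup is made up for illustration and is not the actual bookindex.html structure:

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
XLINK = "http://www.w3.org/1999/xlink"

# Declaration repeated on every element below the div (illustrative markup).
repeated = (
    f'<div xmlns="{XHTML}" class="index">'
    f'<dl xmlns:xlink="{XLINK}"><dt xmlns:xlink="{XLINK}">abbrev</dt></dl>'
    f'<dl xmlns:xlink="{XLINK}"><dt xmlns:xlink="{XLINK}">abort</dt></dl>'
    '</div>'
)

# Same content, with the declaration made once on the enclosing div.
hoisted = (
    f'<div xmlns="{XHTML}" xmlns:xlink="{XLINK}" class="index">'
    '<dl><dt>abbrev</dt></dl>'
    '<dl><dt>abort</dt></dl>'
    '</div>'
)


def shape(elem):
    """Namespace-resolved view of an element: tag, attributes, text, children."""
    return (
        elem.tag,
        sorted(elem.attrib.items()),
        elem.text or "",
        [shape(child) for child in elem],
    )


# Both documents parse to the same namespace-qualified tree.
same_tree = shape(ET.fromstring(repeated)) == shape(ET.fromstring(hoisted))
saved = len(repeated) - len(hoisted)
print(f"same tree: {same_tree}, characters saved: {saved}")
```

The two inputs differ only in where the prefix is declared, so the parsed trees compare equal while the hoisted form is shorter by three declarations.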

Before:
-rw-r--r-- 1 andres andres 683139 Feb 13 04:31 html-broken/bookindex.html
After:
-rw-r--r-- 1 andres andres 442923 Feb 13 12:03 html/bookindex.html
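
For reference, the difference between the two listings works out as follows (sizes copied from the ls output above):

```python
before = 683139  # html-broken/bookindex.html, bytes
after = 442923   # html/bookindex.html, bytes

saved = before - after
pct = 100 * saved / before
print(f"{saved} bytes saved ({pct:.1f}%)")  # 240216 bytes saved (35.2%)
```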

While most of the savings are in bookindex, the rest of the files are reduced
by another ~100kB.

WIP patch attached. For now I just adjusted the minimal set of
xmlns="http://www.w3.org/TR/xhtml1/transitional", but I think we should update
all.

Greetings,

Andres Freund

Attachment: pg-html-stylesheet.diff (text/x-diff, 1.7 KB)
