Re: How to setup default value "0000-00-00" for "date"

From: Marco Colombo <marco(at)esi(dot)it>
To: Michal Taborsky <michal(at)taborsky(dot)cz>
Cc: Richard Huxton <dev(at)archonet(dot)com>, <pgsql-general(at)postgresql(dot)org>, <emilu(at)cs(dot)concordia(dot)ca>
Subject: Re: How to setup default value "0000-00-00" for "date"
Date: 2004-08-20 11:06:39
Message-ID: Pine.LNX.4.44.0408201201510.3512-100000@Megathlon.ESI
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Fri, 20 Aug 2004, Michal Taborsky wrote:

> Richard Huxton wrote:
> > Where you don't have a valid date to store you should use NULL. This
> > business of storing zeroes is a horrible MySQL design mistake.
>
> Well, yes and no. It certainly is a design mistake and introduces
> incosistency into the database, but after I was bitten several times by
> NULL values I'd go for solution like this any day. Let me explain.

He refers to MySQL design, not _your_ design. But by letting you
insert zeros, MySQL misguided you in your design...

> We had a table of messages, which was inserted to randomly and every few
> minutes we'd walk through the unprocessed messages and perform some work
> on them. I, trying to have the database as clean as possible, used this
> table definition (simplified):
>
> messages (
> id serial,
> data text,
> arrived timestamp default now,
> processed timestamp)
>
> So after the message arrived, it had the processed field set to null,
> which was perfectly consistent and represented what it realy was--an
> unknown value.

No. You're using one field to emulate two. First, you're using
processed as a flag, to indicate the unprocesses/processed status.
Then, you use is to store the timestamp. To clarify, you use that
field to put two different questions to the system:

1) has this message been processed?
2) _when_ was this message processed?

This design mistake introduces the need of an 'invalid' value you
have to put in the field, cause that's the 'flag' part I've mentioned.
And NULL does not mean 'invalid', it means 'not available'.

A timestamp field can be used to answer only to one question: 'when'.
Be careful what NULL means here: it does not mean 'never', it means
'I don't know when it was processed'. You cannot consider a message
with NULL timestamp as 'not processes yet', because that's only one
case. It could have been processed, but for some reason someone
forgot to store the timestamp.

So your claim:
"...which was perfectly consistent and represented what it realy was--an
unknown value."
does not hold.

It's a subtle design mistake. Because you seem to be sure that
if a message was processed, then the timestamp must be available,
you're implying that if the field is NULL, the message is unprocessed.
This kind of 'knowledge' is external to the DB. You'd better let
the DB know the full story if you want it to provide nice answers
to your queries. It's not just an index problem.

[...]
> SELECT * FROM messages WHERE processed='-infinity'.

There! Here you have to introduce your 'flag' value again. The dual
nature of the field has to be represented somehow... see? Since
NULL failed in its role of 'flag', you're using another value.
Now ask yourself what the meaning of processed is here:

1) if it's NULL, you can't tell _when_ (not _if_) the message was processed;
2) if it's -infinity (call it SPECIAL_FLAG_VALUE), the message was NOT
processed;
3) if it's different from -infinity, then the message was processed, and
you can tell when...

Compare it to the natural meaning of a timestamp field:

1) if it's NULL, you can't tell _when_ the message was processed;
2) otherwise you can tell _when_ the message was processed.

that simple.

[...]
> 1.1.1970 00:00). Just make sure you make a note of it somewhere and my
> suggestion is you write a COMMENT ON that column for future generations.

:) If you want to take care of future generations, my suggestion is
to design the database the right way from the start.
If you need to check for unprocessed messages, and thus you want the db
to answer to the question: 'has this message been processed?'
you should introduce a way to represent that kind of knowledge
explicitly, with a boolean field. Once you have two fields:

is_processed boolean,
processed_ts timestamp,

you can express queries more naturally (and they'll run fast).
'Saving' one field may lead to horrors at query time.

And of course, you may want to place contraints on those fields:
- if is_processed is false, then processed_ts must be NULL;
- if is_processed is true, then processed_ts must be NOT NULL;
- is_processed may be NOT NULL.

The first is natural, the second tricky: the system will force
anyone to specify a timestamp when changing the state of the
message to 'processed'. It means no partial records, no 'I will
fill it in later'.
The third one is tricker: it places a contraint on the knowledge you
need to have before you can insert a message in the system...
if someone walks in and tells you: 'let's put message XXX into the
system, but I don't know if it was processed or not' you'll have to
say 'I'm sorry no, I can't', and hope that answer gets accepted.
That message will have to stay outside the system, until someone
discovers if it was processed or not. If that message _has_ to be in
(maybe just because your boss expects it to be in) you'll have to
provide a value... that means the DB will provide potentially
wrong answers (think of count() on processed messages).

.TM.
--
____/ ____/ /
/ / / Marco Colombo
___/ ___ / / Technical Manager
/ / / ESI s.r.l.
_____/ _____/ _/ Colombo(at)ESI(dot)it

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Razvan 2004-08-20 11:12:04 Re: Postgres and JBOSS
Previous Message Michal Galusza 2004-08-20 10:15:35 Stored Procedures and libpq library