The second puzzle in Joe Celko’s SQL Puzzles is about tracking employee absenteeism.
The initial table structure is like this:
CREATE TABLE personnel ( emp_id integer PRIMARY KEY ); CREATE TABLE excuse_list ( reason_code varchar(40) UNIQUE ); CREATE TABLE absenteeism ( emp_id integer NOT NULL REFERENCES personnel (emp_id), absent_date date NOT NULL, reason_code varchar(40) NOT NULL REFERENCES excuse_list (reason_code), severity_points integer NOT NULL CONSTRAINT severity_points_range CHECK (severity_points BETWEEN 1 AND 4), PRIMARY KEY (emp_id, absent_date) );
There are already some nice properties to notice in this schema:
- All columns are
NOT NULL, which is ideal.
- The list of valid absence reasons is extracted into an explicit table that will be easier to work with than an enum column within the absenteeism table.
- There’s a simple but useful check constraint to ensure
severity_pointsis valid (this could also have been a foreign key to a separate table).
- There’s a natural primary key of
(emp_id, absent_date)on the
absenteeismtable. This will improve consistency and let the database optimise queries more effectively.
The goal in this SQL puzzle is to make (kind of nasty!) decisions about the consequences of employee absenteeism using this schema, with fiddly rules around how to count and rate absence:
- If an employee gets >40 severity points with a one-year period, they get fired.
- Absences of more than one day in a row are counted as illness, not as absence, i.e. we only count single isolated days.
- We do want to be able to count illness days separately, they just aren’t considered for the absence-firing rule above.
We can change the schema if it makes this easier to achieve.
First solution: allow severity points to be zero
This approach solves the issue of not counting subsequent contiguous days as absence, but as illness instead, by giving them a severity score of zero.
To do that we need to allow zero as a valid value for the
ALTER TABLE absenteeism DROP CONSTRAINT severity_points_range , ADD CONSTRAINT severity_points_range CHECK (severity_points BETWEEN 0 AND 4);
Note that this is a good reason to define check constraints with a name, i.e.
CONSTRAINT foobar CHECK (...) and not just
Celko makes a good point that it is a mistake to treat zero as a “waste of space” or not a real value. Zero is a number that is just as meaningful as any other. See also: zero, one, infinity rule.
We can then use a single update query to zero-out the days of absence that we want to treat as illness, i.e with zero severity:
INSERT INTO excuse_list (reason_code) VALUES ('long term illness'); UPDATE absenteeism SET severity_points = 0, reason_code = 'long term illness' WHERE EXISTS ( SELECT * FROM absenteeism AS A2 WHERE absenteeism.emp_id = A2.emp_id AND absenteeism.absent_date = (A2.absent_date - INTERVAL '1 DAY') );
Here we’re using a subquery to identify a day of absence for the same employee that is one day before the current one, i.e. all but the first in a contiguous series of days. If there is such a preceding day, we set severity points to zero and the reason code to “long term illness”.
It’s then straightforward to sum up the total absence severity points for all employees to see if anyone has accumulated 40 or more points:
SELECT emp_id, SUM(severity_points) FROM absenteeism GROUP BY emp_id;
This doesn’t seem to include the “within a one-year period” requirement described in the original problem statement, but maybe that is assumed.
With that you can then do a delete query like this to “fire” those employees (which seems a bit brutal but this is just an example after all):
DELETE FROM personnel WHERE emp_id = ( SELECT a.emp_id FROM absenteeism a WHERE a.emp_id = personnel.emp_id GROUP BY a.emp_id HAVING SUM(severity_points) >= 40 );
GROUP BY is not needed in SQL-92+, only in older SQL implementations, which is surprising to me. I would have thought it was essential here.
It’s quite nice that the business rule is described pretty naturally in the SQL query.
This solution seems fine, although it would be nice if the database actually enforced the “severity score is zero if there is a preceding day of absence” constraint, rather than requiring an update query to keep that correct.
Second solution: include the “one-year period” requirement
Now it’s acknowledged that the original firing victim selection query did not consider the one-year period requirement, which can be added like this:
SELECT emp_id FROM absenteeism WHERE absent_date >= (CURRENT_TIMESTAMP - INTERVAL '365 DAYS') AND absent_date < CURRENT_TIMESTAMP GROUP BY emp_id HAVING SUM(severity_points) >= 40;
That query uses a suggestion from SQL Performance Explained to use an explicit range condition for queries over continuous date periods.
The other issue raised in this second solution is that the previous solution left absences in place, or at least would try to – the first delete query would actually be rejected for breaking a foreign key constraint on the
As you might expect, the best way to deal with this is to let the database handle it, in this case with an
ON DELETE clause when creating the table:
CREATE TABLE absenteeism ( emp_id integer NOT NULL REFERENCES personnel (emp_id) ON DELETE CASCADE, absent_date date NOT NULL, reason_code varchar(40) NOT NULL REFERENCES excuse_list (reason_code), severity_points integer NOT NULL CONSTRAINT severity_points_range CHECK (severity_points BETWEEN 1 AND 4), PRIMARY KEY (emp_id, absent_date) );
Another issue is also raised: the zero-points-for-illness update will only work if weekend days are also marked as absences, which is likely not to happen. The suggested solution is to also allow routinely marking weekends as absences with a severity score of zero and a dedicated reason_code.
I’m not a fan of this idea, as I think it conceptually breaks what the
absenteeism table means in this domain. Anyone querying the
absenteeism table always has to know and remember to allow for this special weekends workaround in their query. It’s a bit like tomb-stoning or soft deletes, with similar problems.
Third solution: explicit calendar table
The above objection is now covered by the suggestion to use an explicit
calendar table that describes working days (similar to the problem that php-business-time and ts-business-time address, but solved with an SQL table).
CREATE TABLE calendar ( cal_date date NOT NULL UNIQUE, day_type varchar(40) );
With a calendar table that describes working days, you can then cleanly select the relevant days of absence:
SELECT a.emp_id FROM absenteeism AS a, calendar as c WHERE c.cal_date = a.absent_date AND c.day_type = 'working' AND a.absent_date >= (CURRENT_TIMESTAMP - INTERVAL '365 DAYS') AND a.absent_date < CURRENT_TIMESTAMP GROUP BY a.emp_id HAVING SUM(severity_points) >= 40;