Scripts to populate the date dimension in a PostgreSQL data warehouse

Anybody who’s worked on a data warehouse knows the date dimension is key. Designing it is complex and filling it might be tricky and time consuming. In this article I simply share SQL scripts I came up with to fill the date dimension in a PostgreSQL data warehouse. 

Step 1: create the table

The first step is to create the date dimension table. Kimball’s book [1] and this article might help designing this table. Here is what I came up with:

CREATE TABLE IF NOT EXISTS date (
   id integer PRIMARY KEY,
   date timestamp NOT NULL,
   -- Formatted dates
   date_uk_format text NOT NULL,       -- dd/mm/yyyy
   date_uk_short_format text NOT NULL, -- d/m/yyyy
   date_us_format text NOT NULL,       -- mm/dd/yyyy
   date_us_short_format text NOT NULL, -- m/d/yyyy
   date_iso_format text NOT NULL,      -- yyyy-mm-dd
   -- Year
   num_year integer NOT NULL,
   -- Quarter
   num_quarter_in_year integer NOT NULL,
   -- Month
   num_month_in_year integer NOT NULL,
   num_month_in_quarter integer NOT NULL,
   -- Week
   num_week_in_year integer NOT NULL,
   num_week_in_quarter integer NOT NULL,
   num_week_in_month integer NOT NULL,
   -- Day
   num_day_in_year integer NOT NULL,
   num_day_in_quarter integer NOT NULL,
   num_day_in_month integer NOT NULL,
   num_day_in_week integer NOT NULL,
   -- Holidays
   is_holiday_us boolean NOT NULL,
   -- Names
   name_month_en text NOT NULL,
   name_month_abbreviated_en text NOT NULL,
   name_day_en text NOT NULL,
   name_day_abbreviated_en text NOT NULL
);

More attributes might be added to the table: holidays in other countries, names translated in other languages, etc. You should also create index to ensure good performance when you access the data.

Step 2: define useful functions

Remember the date table will be filled only once. It means that every date between 2010 and 2024 will necessarily already be present in the table. Typically, when a date is used in another table, the primary key has to be retrieved by scanning the date table. Instead of doing so, I rather create a primary key which directly depends on the date. In other words, the primary key can be calculated from the date itself. In order to do this, the following function can be considered:

CREATE OR REPLACE FUNCTION get_date_primary_key(ts timestamp) RETURNS integer AS $$
   BEGIN
       RETURN 10000 * EXTRACT(YEAR FROM ts) +
              100 * EXTRACT(MONTH FROM ts) +
              EXTRACT(DAY FROM ts);
   END;
$$ LANGUAGE plpgsql;

For instance, get_date_primary_key(‘2017-08-01’::timestamp) returns 20170801.

This approach presents at least two advantages. First, it does not require to scan the date table to determine the primary key corresponding to a given date. Second, unlike generated keys, the primary keys generated thank to this approach are shared amongst environments (e.g., dev, qa and prod) and amongst data marts hosted in the same data warehouse.

Step3: populate the table

I came across a couple of scripts (including this one) to fill the date dimension but they didn’t seem advanced enough. So I created my own script. It is composed of 5 steps, which are materialised by 5 sub-queries:

  1. Generate all the dates between 1st January 2010 and 31st December 2024 (i.e., 5479 rows).
  2. Extract the main attributes out of each time date: id, date, date_*_format, and num_*.
  3. Extract the holidays.
  4. Extract names from numbers: name_*_en.
  5. Extract abbreviated names from names: name_*_abbreviated_en.

This step-by-step approach eases the comprehension and the maintenance of the query.

WITH date1 AS (
   SELECT generate_series('2010-01-01'::timestamp, 
                          '2024-12-31'::timestamp, 
                          '1 day') AS ts
), date2 AS (
   SELECT get_date_primary_key(ts) AS id,
      ts AS date,
      to_char(ts, 'DD/MM/YYYY') AS date_uk_format,
      ...
      EXTRACT(MONTH FROM ts) AS num_month_in_year,
      EXTRACT(ISODOW FROM ts) AS num_day_in_week
   FROM date1
), date3 AS (
   SELECT *,
      num_month_in_year = 1 AND num_day_in_month = 1 OR 
      num_month_in_year = 7 AND num_day_in_month = 4 OR 
      num_month_in_year = 12 AND num_day_in_month = 25 
         AS is_holiday_us
   FROM date2
), date4 AS (
    SELECT *,
       CASE
          WHEN num_month_in_year = 1 THEN 'January'
          ...
          WHEN num_month_in_year = 12 THEN 'December'
       END AS name_month_en,
       CASE
          WHEN num_day_in_week = 1 THEN 'Monday'
          ...
          WHEN num_day_in_week = 7 THEN 'Sunday'
       END AS name_day_en
    FROM date3
), date5 AS (
   SELECT *,
      substring(name_month_en from 1 for 3) AS name_month_abbreviated_en,
      substring(name_day_en from 1 for 2) AS name_day_abbreviated_en
   FROM date4
)
INSERT INTO date
   SELECT id, 
      date, 
      date_uk_format, 
      ...
      num_month_in_year, 
      num_day_in_week,
      is_holiday_us, 
      name_month_en, name_month_abbreviated_en, 
      name_day_en, name_day_abbreviated_en
   FROM date5;

There’s a big chance that you will need to adapt the scripts to your own needs, but feel free to download them to get the ball rolling. Following the same approach, I also wrote the scripts to fill the time dimension.

References

[1] The Data Warehouse Toolkit, 3rd Edition. Ralph Kimball and Margy Ross. Wiley, 2013.