One concept gaining huge momentum lately is file versionning (mostly Git and Subversion). It is quite interesting to track the evolution of the data contained in the files, and not only the last time the file was updated.

On the last post Databases: Meta-Data I was discussing about the merits of having a modification timestamp among with other various informations and more generally about the parallels between a database and a filesystem.

We can also adapt this model to our database by not only storing the last modification date, but also the old data.

Two main options are :

  • replicate the whole row in a history table
  • have a generic history table that only store the old values of the column that have changed

The approach is done in the way this ddj database article, but in a much less intrusive manner. We just have our trigger-based system evolve with every modification logging an insertion in the derived history table.

The history table and trigger become :

TABLE MD_ORDERS_HISTORY ( 
   SERIAL MD_ORDERS_HISTORY_ID PRIMARY KEY,
   TIMESTAMP CTIME,
   CHAR TYPE DEFAULT 'M',
   INT ORDERS_ID FOREIGN_KEY ON ORDERS(ORDERS_ID),
   MONEY PRICE_ADJUSTMENT
)

CREATE INSERT TRIGGER ON ORDERS o
BEGIN
  INSERT INTO MD_ORDERS (ORDERS_ID) VALUES (o.ORDERS_ID)
END

CREATE UPDATE TRIGGER ON ORDERS o
BEGIN
  UPDATE MD_ORDERS SET MTIME=NOW(), MLOGIN=CURRENT_USER() 
    WHERE ORDERS_ID = o.ORDERS_ID
  -- Log the change
  INSERT INTO MD_ORDERS_HISTORY (ORDERS_ID, PRICE_ADJUSTMENT) 
    VALUES (o.ORDERS_ID, o.PRICE_ADJUSTMENT)
END

CREATE DELETE TRIGGER ON ORDERS o
BEGIN
  -- Log the delete
  INSERT INTO MD_ORDERS_HISTORY (ORDERS_ID, PRICE_ADJUSTMENT, TYPE) 
    VALUES (o.ORDERS_ID, o.PRICE_ADJUSTMENT, 'D')
END

The interesting point is that you can now travel back in time in order to see what happened to a row in case of debugging the application or its (mis)usage.

The main issue of this system is that your database usage will be larger by several orders of magnitude, depending of how often your application update its data.

Several possibilities to limit the size exists :

  • Partitioning the historical data, and storing it on a slower (cheaper) array.
  • Pruning the old data since depending on the application, traveling back 1 whole year may be overkill
  • Live aggregating changes. If the data was modified less than 1 hour ago, just update the last history line. We can even have a history_start timestamp and history_stop timestamp to show that a aggregation has taken place.
  • Deferred aggregating changes. Like the live one, but on a scheduled basis. It can even have a dynamic granularity (a granularity of 1 day if the data is 1 year old, 1 hour if it is 1 month old).