2011-05-01

Automatically Record History of Field Changes in Postgres (Dynamic Triggers in PL/pgSQL)

Hoorah! I was able to complete my attempt at writing a single PL/pgSQL function in Postgres 9.0.4 to create history records tracking individual field value changes generically for all my tables. Some developers call this an "audit trail", though an accountant might say otherwise.

Special thanks to John DeSoi on the General Postgres mailing list for pointing me to a crucial code example to make PL/pgSQL interpret: 
   "OLD." || myColumnNameVar
as:
   OLD.first_name    (for example)

The crucial line is:
   EXECUTE 'SELECT ($1).' || ri.column_name || '::text' INTO t USING NEW;

I'm new to SQL and Postgres, so I may be wrong but my interpretation of that line is:
Since there appears no way to make the PL/pgSQL interpreter interpret our desired string, the trick is to look outside of PL/pgSQL and instead use the SQL interpreter. I was incorrectly thinking of the PL/pgSQL interpreter as the same thing as the SQL interpreter, or rather as incorporating the SQL interpreter. Once I saw that line of example code, I realized the two interpreters are separate. The language, the syntax, of PL/pgSQL is a superset of SQL, but the interpreters are separate with one calling the other.

Brilliant! It works, and it seems to be fast enough, at least for my needs.

I also found this blog post on working around the impossibility of dynamic triggers in PL/pgSQL.

I'll share my current code & table structure below. Caveat: This code has not yet been thoroughly tested, nor has it been deployed. I only finalized it today.

[A] I'm working on a greenfield project, where:
  • I've built every table to have a primary key of type uuid named exactly "pkey_".
  • Every table has a TIMESTAMPTZ field named "record_modified_".

My approach below hard-codes these assumptions.

[B] I have this "history_" table:

CREATE TABLE history_
(
 pkey_ uuid NOT NULL DEFAULT uuid_generate_v1mc(), -- The primary key for this table, though no primary key constraint was created (for the sake of performance and conservation). This column and timestamp_ column are the only two columns about this table itself. All other columns are about the inserted/modified/deleted record in some other table.
 table_name_ character varying(120) NOT NULL, -- Name of table whose row is being affected (inserted, deleted, or modified).
 column_name_ character varying(120) NOT NULL, -- Name of the column in some other table whose row value is being modified. This column's value is empty string if the operation was DELETE.
 timestamp_ timestamp with time zone NOT NULL DEFAULT clock_timestamp(), -- The moment this record was created. Using the clock_timestamp() function as a default, to capture the actual moment in time rather than moment when transaction began.
 db_user_name_ character varying(120) NOT NULL DEFAULT "current_user"(), -- The name of the Postgres user logged in to this database connection/session.
 app_name_ character varying(120) NOT NULL DEFAULT current_setting('application_name'::text), -- The name of the application connected to the database. May also include the version number of app, and the name of the human user authenticated within the app.
 old_value_ character varying(120) NOT NULL DEFAULT ''::character varying,
 new_value_ character varying(120) NOT NULL DEFAULT ''::character varying,
 uuid_ uuid NOT NULL, -- The UUID of the row being affected, the row being inserted, updated, or deleted. Assumes every table whose history is being recorded uses the 'uuid' data type as its primary key.
 operation_ character varying(120) NOT NULL, -- What database operation resulted in this trigger running: INSERT, UPDATE, DELETE, or TRUNCATE.
 table_oid_ oid NOT NULL, -- The oid of the table whose record is being modified. May be helpful if a table name changes over time.
 ordinal_position_of_column_ integer NOT NULL, -- The position of the affected column in its table. Every new column gets a number, incremented by one for each. This may be helpful when analyzing changes across a stretch of time during which a column's name was changed. Apparently columns have no oid, so we are recording this number instead.
 transaction_began_ timestamp with time zone NOT NULL DEFAULT transaction_timestamp() -- The time when the current transaction began. Can act like a transaction identifier, to group multiple "history_" rows of the same transaction together. This is not foolproof, as multiple transaction could possibly start in the same split second moment. Assuming the computer's clock has a fine resolution, this chance of a coincidence should be quite miniscule.
)


I do not have a primary constraint for this table. The "pkey_" column acts as a primary key, but there is no need for an index or uniqueness testing for this special table. I suppose I could even drop this column entirely, since this kind of logging table will never have related records nor do we ever care about identifying a particular row. I'll leave it for now, just for the heck of it.

[C] For every table I want to track field-level value changes, I create a trigger like this:

CREATE TRIGGER XXX_trigger_history_
AFTER INSERT OR UPDATE OR DELETE ON XXX_
FOR EACH ROW EXECUTE PROCEDURE make_history_();


where 'XXX' is the name of the particular table.

[D] I created this function:

CREATE OR REPLACE FUNCTION make_history_()
RETURNS TRIGGER
LANGUAGE plpgsql
AS
$BODY$
/*     Purpose: Make a history of changes to most fields in the table calling this trigger function.
       This kind of history tracking is also known as an "audit trail".
       This function works by detecting each change in value for important fields in a certain table.
       This trigger function then calls another function to create a row in the "history_" table.
    This kind of feature is often called an "audit trail" by software developers. I avoid using that term in this context as a real
    audit trail in accounting terms involves more than this simple field change tracking.
*/
/*    © 2011 Basil Bourque. This source code may be used freely forever by anyone taking full responsibility for doing so, without warranty.
  
    Thanks so very much to John DeSoi of the pgsql-general@postgresql.org mailing list for pointing me to this crucial code example:
    http://wiki.postgresql.org/wiki/PL/pgSQL_Dynamic_Triggers
    Before reading that example, my previous efforts led me to conclude a generic history facility written in PL/pgSQL was impossible.
*/
/*     We make these assumptions about any table using this function in its trigger:
           • Has a primary key named "pkey_" of type uuid.
           • Has a field tracking the datetime the record was last modified, named "record_modified_" of type timestamptz.
           • Is in the default/current schema.
       While it might be nice to rewrite this function to escape these assumptions, I've spent all my energies to get this far.
    I welcome feedback from anyone who want to take this further.
*/



/*

For each table on which you want history, create a trigger by executing SQL like this:
CREATE TRIGGER XXX_trigger_history_
AFTER INSERT OR UPDATE OR DELETE ON XXX_
FOR EACH ROW EXECUTE PROCEDURE make_history_();


where XXX is the name of the specific table.
*/

/*     Notes:
      
    The 'OLD' and 'NEW' variables represent the entire row whose INSERT/UPDATE/DELETE caused this trigger to run.
       The 'TG_xxx' variables are special variables created automatically by Postgres for the trigger function.
       For example, TG_OP indicates which modification operation is happening: INSERT, UPDATE, or DELETE.
       http://www.postgresql.org/docs/current/static/plpgsql-trigger.html
       "clock_timestamp()" gets the actual time at the moment of execution. In contrast, most other timestamp
       functions return the time when the current transaction began.
    For more information, see: http://www.postgresql.org/docs/current/static/functions-datetime.html
  
    The "history_" table also includes a column "transaction_began_" defaulting to "transaction_timestamp()". This timestamp can act
    like a transaction identifier, to group multiple "history_" rows of the same transaction together. This is not foolproof, as
    multiple transaction could possibly start in the same split second moment. Assuming the computer's clock has a fine resolution,
    this chance of a coincidence should be quite miniscule. If someone knows a way to get a true transaction id, please share.
*/

/*  History:
    2011-04-31    • Published on the general Postgres mailing list.
    2011-05-01    • Revised to not ignore the ".record_created_" field.
                  • Published on my blog at http://crafted-software.blogspot.com/.
*/

DECLARE
    ri RECORD; -- About this data type "RECORD": http://www.postgresql.org/docs/current/static/plpgsql-declarations.html#PLPGSQL-DECLARATION-RECORDS
    oldValue TEXT;
    newValue TEXT;
    isColumnSignificant BOOLEAN;
    isValueModified BOOLEAN;
BEGIN
    /*RAISE NOTICE E'\n    Running function: make_history_ ----------------\n\n    Operation: %\n    Schema: %\n    Table: %\n',
        TG_OP,
        TG_TABLE_SCHEMA,
        TG_TABLE_NAME;*/

    IF (TG_OP = 'INSERT') OR (TG_OP = 'UPDATE') THEN
        NEW.record_modified_ = clock_timestamp(); -- Record the moment this row is being saved.
      
        FOR ri IN
            -- Fetch a ResultSet listing columns defined for this trigger's table.
            SELECT ordinal_position, column_name, data_type
            FROM information_schema.columns
            WHERE
                table_schema = quote_ident(TG_TABLE_SCHEMA)
            AND table_name = quote_ident(TG_TABLE_NAME)
            ORDER BY ordinal_position
        LOOP
            -- For each column in this trigger's table, copy the OLD & NEW values into respective variables.
            -- NEW value
            EXECUTE 'SELECT ($1).' || ri.column_name || '::text' INTO newValue USING NEW;
            -- OLD value
            IF (TG_OP = 'INSERT') THEN   -- If operation is an INSERT, we have no OLD value, so use an empty string.
                oldValue := ''::varchar;
            ELSE   -- Else operation is an UPDATE, so capture the OLD value.
                EXECUTE 'SELECT ($1).' || ri.column_name || '::text' INTO oldValue USING OLD;
            END IF;
            -- Make noise for debugging.
            /*RAISE NOTICE E'\n    Column #: %\n    Name: %\n    Type: %\n    Old: %\n    New: %\n',
                ri.ordinal_position,
                ri.column_name,
                ri.data_type,
                oldValue,
                newValue;*/
              
            --    ToDo: Add code to throw an Exception if the primary key value is changing (other than from NULL on an INSERT).
          
            --     ToDo: Add code to ignore columns whose data type does not cast well to TEXT/VARCHAR.
          
            --    Ignore some columns:
            --         • Those whose names are marked with a trailing x.
            --        • The primary key.
            --         • Our timestamp field recording the row's  most recent modification.
            isColumnSignificant := (position( '_x_' in ri.column_name ) < 1) AND (ri.column_name <> 'pkey_') AND (ri.column_name <> 'record_modified_');
            IF isColumnSignificant THEN
                isValueModified := oldValue <> newValue;  -- If this nthField in the table was modified, make history.
                IF isValueModified THEN
                    /*RAISE NOTICE E'Inserting history_ row for INSERT or UPDATE.\n';*/
                    INSERT INTO history_ ( operation_, table_oid_, table_name_, uuid_, column_name_, ordinal_position_of_column_, old_value_, new_value_ )
                    VALUES ( TG_OP, TG_RELID, TG_TABLE_NAME, NEW.pkey_, ri.column_name::VARCHAR, ri.ordinal_position, oldValue::VARCHAR, newValue::VARCHAR );
                END IF;
            END IF;
        END LOOP;
  
        RETURN NEW;
    ELSIF (TG_OP = 'DELETE') THEN
        /*RAISE NOTICE E'Inserting history_ row for DELETE.\n';*/
        --    Similar to INSERT above, but refers to OLD instead of NEW, and passes empty values for last 4 fields.
        INSERT INTO history_ ( operation_, table_oid_, table_name_, uuid_, column_name_, ordinal_position_of_column_, old_value_, new_value_ )
        VALUES ( TG_OP, TG_RELID, TG_TABLE_NAME, OLD.pkey_, ''::VARCHAR, 0, ''::VARCHAR, ''::VARCHAR );
        RETURN OLD;
      
    END IF;
     /* Should never reach this point. Branching in code above should always reach a call to RETURN. */
    RAISE EXCEPTION 'Unexpectedly reached the bottom of this function without calling RETURN.';
END;

$BODY$;



[end of code]

The beauty of this is that a history record is created for each field edit regardless of the avenue by which the data was changed. If my desktop app modifies data, a history row is created. If my web app does so, a history row is created. Ditto if I use interactive SQL to fix some data, or use pgAdmin to affect data. In all cases, history is made.

7 comments:

  1. this whole concept is exactly what i needed in my line of work. it works very nice even when table changes are made with QGIS software. thanks for sharing and for your work.

    ReplyDelete
  2. Thank you very much!

    This trigger was very useful for me!

    ReplyDelete
  3. You asked:
    >If someone knows a way to get a true transaction id, please share.

    The most simplest way is to add sequence and use it.

    For example:

    1) Let's create sequence:
    CREATE SEQUENCE query_id_seq
    START 5
    INCREMENT 1
    MINVALUE 1
    MAXVALUE 9223372036854775807
    CACHE 1;

    2) Add new column to history table:
    ALTER TABLE history_ ADD COLUMN query_id bigint not null;

    3) In the DECLARE part of function we will add declaration for new variable:
    query_id bigint;

    4) Then in the body of function (first line) we will get new value for each query (not transaction!):
    query_id=nextval('query_id_seq');

    5) In the "insert" queries we will use this value:
    INSERT INTO history_ ("query_id", operation_, table_oid_, table_name_, obj_id, column_name_, ordinal_position_of_column_, old_value_, new_value_ )
    VALUES (query_id, TG_OP, TG_RELID, TG_TABLE_NAME, NEW.id, ri.column_name::VARCHAR, ri.ordinal_position, oldValue::VARCHAR, newValue::VARCHAR );

    ReplyDelete
  4. You asked:
    >If someone knows a way to get a true transaction id, please share.

    select txid_current();

    ReplyDelete
  5. Did you ever run this in production? I'm looking for a similar solution.

    ReplyDelete
  6. why :

    EXECUTE 'SELECT ($1).' || ri.column_name || '::text' INTO newValue USING NEW;

    is not equivalent to :

    EXECUTE 'SELECT (NEW).' || ri.column_name ||'::text' INTO newValue

    ReplyDelete
  7. thank you!!! the ($1) was the key!

    ReplyDelete