Database Audit Trails: How Would You Do It?

22 commentsWritten on October 19th, 2009 by
Categories: Software Development

For those of you wondering what the background story on the previous post is about: we had a meeting today to discuss 2 alternatives on how we should implement the database audit trail for a new project we're doing. These are the functional requirements:

  • We have to be able to easily retrieve what user X changed in the database on day Y or during a certain time frame.
  • We need to know what record X looked like at point Y in time, and be able to show a historic list of changes for a specific record.
  • We need to configure the duration for which auditing data must be kept in the database on a per-table (or per-schema) level

We've done this in the past, but never in a manner that is just perfect (or close enough to being perfect anyway) for both requirements. So we discussed two alternative approaches to tackling this problem.

The first approach is to have 3 tables. The first two tables would contain metadata about the database structure (tables and columns basically). The third table would have a reference to each column with a previous value, the new value, who changed it and a timestamp of when the change occurred. We would use database triggers that would store the appropriate auditing data in the third table for each action (insert/update/delete). The benefits of this approach would be that it would be pretty easy to set up and also easy to query which changes were made by a certain user. It would be a bit more complex to reconstruct the state of a record at a certain point in time, but it would still be doable. The biggest downside (in my opinion), is that for each insert/update you would have X amount of inserts into the auditing table, and you'd also have to either perform some queries to get the identifier values of the relevant tables/columns or hardcode those identifier values in the generated triggers to avoid the cost of the extra select statements. So if you have an entity with 12 columns, an insert into the entity's table would mean 12 inserts into the auditing table. Updating 4 columns of that entity would lead to 4 insert statements in the auditing table, etc...

The other alternative that we considered was to use a specific auditing table for each real table, and every time a user does an insert/update/delete, insert one record into the table's specific auditing table with the current state of the record along with a timestamp, the userid, and an indication of which action was performed (insert/update/delete). This insert would also be done with database triggers. The benefits of this approach would be that it would be very easy to show what the state of a certain entity was at a certain point in time, and that you'd only pay the cost of the single extra insert. There are however two big downsides to this approach. The first is that it'll lead to an explosion of tables. We'd probably store the auditing tables in a separate database but you still basically have twice the amount of tables (and scripts!) that you'd have to worry about compared to what you'd usually have. We can very realistically expect a couple hundred of real tables in a few years, so you'd have to multiply that by two if you'd go for this auditing approach. Another big downside is that it would be very expensive to perform a query which would tell you what user X changed at point Y in time (or during a time frame).

Neither solution appears to be ideal. Also, this database will definitely grow a lot (more tables and obviously data) as we add more functionality and it's pretty safe to consider this entire project as something that will be expanding in functionality for a couple of years to come.

So my question to you is: Which approach do you think is best, and why? Or better yet, is there a better approach that we simply haven't thought of yet? I would definitely appreciate some feedback on this (either on the validity of both approaches or alternative approaches) so please do feel free to leave your thoughts in the comments :)

  • Patrice

    What i have done in the past is:
    - having one table with a blob, and information about type of changes, user information, status, record identifier, database table name. This table is partitioned on a database level (based on month and on table name).

    - On each save/update we serialize the current record (and related object in case of nhibernate) and the new record in Json, and we get the difference between the two. The difference is saved into the blob.

    - When we delete a record, we save a line with a listener (in nhibernate) into the audit table, with a specific message.

    - For following a trace of the user (view for example), we log one line per access with a specific message with an httpmodule…

    With this, you can provide the complete life of the record, the user who did the changes with his historic.

  • Ryan Roberts

    I have never managed to come up with a solution to that problem at the database level that didn’t suck, especially when you start dealing with schema evolution.

    Have you considered other storage? Couchdb has versioning built right in.

  • Joseph Daigle

    ESRI has a “geodatabase” product which I am very familiar with. The spatial part isn’t important, however they have implemented an archiving solution into the database. Internally it uses a your 2nd solution of an audit table for every physical table that is being archived.

    uerying for the historical version of a particular table is very trivial and fairly easy. However you are correct in that in that is more expensive to query what a single user did. But it’s only N queries where N is the number of tables, which isn’t TOO bad for a single reporting-type query (you could cache or store the results).

    A third approach could be to combine both ideas. The audit tables are a REALLY GOOD IDEA for doing this sort of archiving. However you could also store in a separate table all of the user operations which were performed and when. This way you can easily query what data a user touched and when. Then you can drill down into the historical data as needed.

  • dave-ilsw

    We use a single change log table on a project that I work on. We have change log ID, time stamp, user name, table name, set ID, operation, record ID, field name, old value, new value fields.

    For inserts we just record a single entry. For updates, we record the field name and old and new values. We use a stored procedure to create each change log entry. The stored procedure only records change log entries for fields with a changed value.

    The set ID makes it easy to group the multiple change log entries that are created when multiple fields change in a single update.

    We use a Python script to auto-generate the various triggers that are needed for each field in each table that is subject to change logging (there are some tables that don’t get any change logs and we don’t log changes to memo fields, for example).

  • Justin Rudd

    My #0.02 – maybe you’ve already had these conversations but…

    Figure out what “Y” is first. Figuring out “Y” is much harder than figuring out how to store the data. You are (probably) thinking about storing data till the end of time. I once worked on a system that wanted to store 18 months of archive data. This business touted this as a feature of our platform. No customer talked to needed more than 30 days. But I designed for 18 months, fast queries, etc because the business guys couldn’t define a real “Y”.

    Once “Y” was set at 30 days, scaling it to 60 days or 90 days was trivial because the 30 day solution was so much more simply.

    With that said, I also got the business guys to give me -
    1.) a valid failure rate for archive data (basically on any given day given a normal flow of traffic, how much was I allowed to lose)
    2.) an amount of time before the archive data had to be searchable (i.e immediately, 2 hours, a day?).
    3.) What is the latency on searching? Does it have to be as fast as the real system? Or can it take some time?

    With those 3 pieces of information, in my case I was able to offload archiving and searching onto a messaging system, and I only saved the delta. I implemented VCDIFF (http://www.faqs.org/rfcs/rfc3284.html) on the textual representation. I didn’t do anything fancy with the deltas. Just kept them in a chain (1 -> 2 -> 3 -> etc). Had Eric Sink’s article (http://www.ericsink.com/entries/time_space_tradeoffs.html) been around, I probably would have gone with a combination of key frame chains and flower chains.

    Anyway – the deltas were stored in a BDB in a B-Tree database. I used a simple consistent hashing algorithm to pick the BDB to store to. Then I had a cron that ran every 1 hour to stop the message processing system, backup the BDB to a SAN, and restart the messaging system.

    Search was also done through messaging. But my search was really easy. The client would put in their entity IDs, I would issue the query to the BDBs and aggregate the results. It typically took about 250 to 700 milliseconds (depending on network saturation).

  • Phil Haselden

    Take a look at Temporal database patterns to see if that meets your needs. I blogged a little about this a while back here http://haselden.spaces.live.com/blog/cns!C7AD1671702F1899!118.entry. That entry contains links to Martin Fowler articles etc on the subject.

  • http://jonathan-oliver.blogspot.com Jonathan Oliver

    We kicked around temporal database design for a bit to show how data changed across time. It quickly made things way to complicated to be practice or even usable. You will definitely want to look at Greg Young’s patterns on the subject–Command Query Responsibility Separation. Take a look at Greg’s InfoQ presentation (starting at the 4 minute mark):
    http://www.infoq.com/presentations/greg-young-unshackle-qcon08

    I’ve compiled some resources on his patterns on my blog:
    http://jonathan-oliver.blogspot.com/2009/03/dddd-and-cqs-getting-started.html

  • http://computeristsolutions.com josh

    honestly, I would consider archiving policy a part of this question. I’ve done two auditing approaches, both with single tables. First, the really simple one just logged a message describing the change and the user. It was basically good for proving who did what, but not for point in time inspections because it was a string message. Second was actually similar to the first suggestion. With a single table having columns for timestamp info, user, entity, value before change, and value after change. This was effective for both proving who did what, and point in time inspection even though it took effort to get that.

    We did encounter an issue when it came to archiving policy. If you we going to say archive everything over 30 days, then it was fine. However, if you wanted to vary the archive terms by entity it was a problem since the logging for all entities was in a single table. If you need to vary archival term by entity, you need to have separate auditing tables for each entity or duplicates of each table for auditing. I suppose you could manually extract audit entries by entity type manually, but that’s error prone.

  • http://dotnet.org.za/cjlotz Carel Lotz

    If you are using Sql Server 2008, it has support for this out-of-the-box using its Change Data Capture mechanism. I doesn’t support all your requirements (like tracking the user) but it is interesting to see how they approach and solve the problem.

  • http://benpittoors.wordpress.com den Ben

    @Josh
    The first approach in the post uses table- and column metadata for use in the audit trail table. You can easily query for a specific entity so different archiving policies per entity wouldn’t be a problem. Don’t see why that would be error prone…

  • http://jack-fx.com/net/ Jack

    it is something like sql server’s system tables, sys.tables, sys.column and others.

  • James L

    Journal the changes to a text file

  • Dalibor Carapic

    If you are using Microsoft SQL I would recommend serializing objects into xml and creating one table which contains all the relevant version data (timestamp, user etc) and one xml column which contains xml representation of the record (example:

    100
    test
    … etc …

    You get one audit table (which I guess could be filled by some sort of a generic function) with the ability query for the data you would need (via MS SQL xml querying capabilities).

    If you are using another database which has no xml querying then I would still go with the same principle, but maybe not with Xml format (maybe ‘column=value’).

  • Dalibor Carapic

    Repost, forgot its html:
    If you are using Microsoft SQL I would recommend serializing objects into xml and creating one table which contains all the relevant version data (timestamp, user etc) and one xml column which contains xml representation of the record (example:
    <table name=”user”>
    <col name=”id”>100</col>
    <col name=”name”>test</col>
    … etc …
    </table>

    You get one audit table (which I guess could be filled by some sort of a generic function) with the ability query for the data you would need (via MS SQL xml querying capabilities).

    If you are using another database which has no xml querying then I would still go with the same principle, but maybe not with Xml format (maybe ‘column=value’).

  • http://thinkbeforecoding.com Think Before Coding

    Your needs seems more at business level (know what user did what when ?) than at database level.
    If you make a history of your database, you’ll only know the state at a given date, but you won’t be able to know why these changes happened.

    Follow Jonathan advice, storing domain events as your main storage, then derive RDBMS views from events. You can easily reconstruct any view of your data from you event stream then.

  • ijrussell

    Have you looked at http://www.codeplex.com/AutoAudit. It was written by Paul Nielsen, a Sql MVP who writes the Sql Server Bible books.

  • http://weblogs.sqlteam.com/mladenp/ Mladen

    my favorite way of auditing data is to use async triggers with service broker.
    you take the deleted and inserted pseudo tables do FOR XML PATH(”) on them and send them to a SB queue.
    you can have 2 queues: 1 for saving raw data from inserted and deleted tables and one for saving diff-ed data.
    this takes care of DML changes.

    for DDL changs you can use DDL triggers which also give you data in xml format.
    with that you can again do the same thing like above.
    you can have 2 tables: 1 for DML and 1 for DDL changes.

    using this async SB pattern you can have you auditing on its own server that is meant just for auditing.
    all child servers can all send audit data to this centra auditeing server.

    i’ve implemented this and it works great. also here’s my article on the subject:
    http://www.sqlteam.com/article/centralized-asynchronous-auditing-across-instances-and-servers-with-service-broker

  • Dathan Bennett

    I agree with Dalibor, with a small addition: we keep an audit schema similar to the first option you described, but instead of serializing changes per field, we serialize per record by encapsulating the changes in an XML document. Then, if using Microsoft SQL Server, querying the audit log can be done from within TSQL.

    On the other hand (this is the addition to Dalibor’s suggestion), to more efficiently support queries to find when row X was changed, we add a junction table to create a many-to-many relationship between the audit table and the table that stores column metadata (could also be done against the information schema, system views, etc. I guess, though I’ve never done it). This way, when working within a database that does not support XML querying, you can still efficiently trim your results to only changes that affected the value of the particular column you’re working with. If you’re working in a scenario where surrogate primary keys of a uniform data type are used, you can add the primary key as a field to your audit table – at that point it’s easy and fast to get a list of all audit entries that changed the value of a particular cell in your database, regardless of your particular RDBMS. You’ll still have to iterate through the results to find the particular change you’re looking for, but in my experience this is a very efficient and maintainable solution.

    Small caveat: creating links between audit rows and column metadata rows can make your auditing model somewhat painful when your schema starts to evolve.

  • http://nirajrules.wordpress.com Niraj

    Hi Davy,

    I posted my thoughts here. Let me know your thoughts.

    Niraj

  • Brandon Morales

    I like to keep the history in each table, and combine that with views and triggers. For each table I want an audit trail I have meta columns {date_created, date_active, date_end, date_replaced}, a primary_key column, and the history_key column. When a new record is added, the date_created, date_active columns receive the current date. The primary key(PK) is a GUID auto generated, and a trigger copies PK to the history key. If a record is updated, we replace the update command in a trigger to keep the old record and just update the date_replaced and date_end columns with the date the record was updated. Then to add a new record, with the updated data, with the previous records history key and the previous records date_created date, but with a new date active. Similar process with deletes, only we don’t populate the date_replaced column only the date_end.

    We then use a view to only show records that don’t have values in the date_end column so we only show current data, not history information.

  • Pingback: Implementing Audit Trail for your application « Niraj Bhatt – Architect's Blog

  • Khansadique123

    Audit Trail (Data Activity Monitoring) is a tool through which we can record all user’s activity on the database, where users are interacting with database directly or through application. It is very useful for those who wish to keep eyes on user activity with data in the database. Also it is helpful for Q.A. (quality assurance team) who is testing application and wish to know data flow before to approve application. This tool is quite easy to manage and view report than other tools. The person who is not DBA can play with this tool easily. This tool is designed for management/administration department to audit data in database.  It is just like a CC camera on database.
    visit http://www.bintobin.com for more detail.