Versioning in an Event Sourced System - TLDR

This is an extremely short version of Greg Young's excellent writing on versioning in Event Sourcing. Much of it applies to other kinds of systems as well.

Most of what is written here is my own interpretation of the original book, or is sometimes quoted as-is; as such, all credit goes to Greg Young. I only added minor notes and extended the GDPR erasure part a touch.

I want to write it in a "bullet point" style for myself as a reminder and a short summary.

Versioning techniques

Basic Versioning

A new version of an event must be convertible from the old version of the event.
If not, it is not a new version of the event but rather a new event.
  • Event.version (version on every event)
  • InventoryItemDeactivated, InventoryItemDeactivated_v2, InventoryItemDeactivated_v17 (class for every version)
Thus, any InventoryItemDeactivated_vNNN can be converted to InventoryItemDeactivated_CURRENT, and previous versions are never “visible” from the app’s perspective
  • Adding new version of an event can break consumers.
  • The workaround is to introduce a “schema version” server - more complexity.
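The upcasting described above can be sketched as a chain of converters, so that the rest of the application only ever sees the current version. This is a minimal sketch with assumed names; the `reason` field and its default are illustrative:

```python
# Sketch (assumed names): upcast old event versions to the current one,
# one converter per version step, chained v1 -> v2 -> ... -> current.
from dataclasses import dataclass

@dataclass
class InventoryItemDeactivated_v1:
    item_id: str

@dataclass
class InventoryItemDeactivated_v2:  # v2 (assumed) added a reason field
    item_id: str
    reason: str

UPCASTERS = {
    InventoryItemDeactivated_v1:
        lambda e: InventoryItemDeactivated_v2(e.item_id, reason="unknown"),
}

def upcast(event):
    """Apply converters until the event reaches the current version."""
    while type(event) in UPCASTERS:
        event = UPCASTERS[type(event)](event)
    return event

current = upcast(InventoryItemDeactivated_v1("item-42"))
# current is an InventoryItemDeactivated_v2 with a defaulted reason
```

A new version only requires registering one more converter; old events stored years ago still deserialize and upcast cleanly.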

Double Write (Double Push)

You should generally avoid versioning your system via types in this way.
Using types to specify schema leads to the inevitable conclusion that
all consumers must be updated to understand the schema before a producer is.
  • Producer writes both versions of events.
  • The old event is deprecated with time.
  • Fine in “stable situation” but will fail on “replays”.
  • Ok for distributed systems; not recommended for Event Sourcing.
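The double-write idea above might look like the following sketch (the bus, event shapes, and names are assumptions for illustration):

```python
# Sketch: during a transition window the producer writes both event
# versions; old consumers read v1 while migrated consumers read v2.
def publish_deactivated(bus, item_id, reason):
    # Old consumers keep working off the v1 shape...
    bus.append({"type": "InventoryItemDeactivated", "itemId": item_id})
    # ...while migrated consumers pick up v2; v1 is deprecated later.
    bus.append({"type": "InventoryItemDeactivated_v2",
                "itemId": item_id, "reason": reason})

events = []
publish_deactivated(events, "item-42", "damaged")
# two events were written for one logical occurrence
```

This is exactly why it fails on replays in an Event Sourced store: every logical occurrence now exists twice in history.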

Weak Schema

  • The options above are “Strong Schemas” (the schema is part of each event, likely indicated by a version field).
  • Rules to follow but more flexibility.
    • always follow the “mapping” algorithm below.
    • not allowed to rename, only add or remove.
    • some expectations may not stand (Id may not be present).
  • Does not require adding a new type of event or bumping up a version field.
  • Very simple to implement.
  • Instead of deserializing - map. The mapping rules:
    • exists on both (event JSON and class) -> use value from JSON
    • exists on event JSON but not on the class -> ignore
    • exists on class but not on JSON -> use a default value
So given JSON:
{
    "number": 3.9,
    "str": "hello",
    "other": 15
}
when mapped to a type
class Foo {
    decimal number;
    string str;
}
it would produce an output of Foo { number=3.9, str="hello" }
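The three mapping rules can be sketched in a few lines of Python (the `Foo` class and field defaults mirror the example above; this is illustrative, not a library API):

```python
# Sketch of weak-schema mapping: match by name, ignore unknown JSON
# fields, and fall back to defaults for fields missing from the JSON.
import json

def map_event(raw_json, cls):
    data = json.loads(raw_json)
    instance = cls()  # the class defines fields with default values
    for name in vars(instance):
        if name in data:
            instance.__dict__[name] = data[name]  # on both -> use JSON
        # exists only on the class -> keep the default value
    # fields that exist only in the JSON ("other") are simply ignored
    return instance

class Foo:
    def __init__(self):
        self.number = 0.0
        self.str = ""

foo = map_event('{"number": 3.9, "str": "hello", "other": 15}', Foo)
# foo.number == 3.9, foo.str == "hello"; "other" was dropped
```

Note there is no version field and no new type anywhere: adding or removing a field never breaks old events, which is the whole appeal.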

Hybrid Schema

  • Use Strong Schema for required fields.
  • Use Weak Schema for all other fields.
  • Example: Protobufs.

General Versioning Concerns

Versioning of Behavior

The thought here is that the domain logic used when hydrating state to validate a command can change over time.
If replaying events to get to the current state, how do you ensure that the same logic that was used
when producing the event is also used when applying the event to hydrate a piece of state?
The answer is: you don’t.
For example:
  • Apply ItemsSold(price=100) to add 9% tax.
  • The tax changes to 10% at a certain point.
  • WRONG you add a conditional to the “apply” logic of the aggregate to accommodate it.
  • OK include the tax into the event at the time that it is being created.
  • May have limitations if such data cannot be stored in the event (credit card for example).
If you find yourself putting branching logic or calculation logic in a projection,
especially if it is based on time, you are probably missing logic in the creation of that event.
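The tax example above can be sketched as follows; the event shape and `TAX_RATE` constant are assumptions for illustration:

```python
# Sketch: capture the tax rate (and computed amount) in the event when
# it is created, instead of branching on dates in the apply logic.
from dataclasses import dataclass

TAX_RATE = 0.09  # whatever the rate is *today*, at creation time

@dataclass(frozen=True)
class ItemsSold:
    price: float
    tax_rate: float
    tax: float

def sell(price):
    # The event records the rate in force when the sale happened, so
    # replaying it years later needs no time-based conditionals.
    return ItemsSold(price=price, tax_rate=TAX_RATE, tax=price * TAX_RATE)

event = sell(100.0)
# event.tax stays 9.0 no matter what the tax rate changes to later
```

When `TAX_RATE` changes to 0.10, only `sell` changes; every already-stored event replays exactly as it originally happened.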

Changing Semantic Meaning

Example:
  • 27 degrees.
  • 27 degrees.
  • 27 degrees.
  • 80.6 degrees.
  • 80.6 degrees.
Celsius vs Fahrenheit.
This becomes very challenging in Event Sourced systems as replaying must support multiple meanings of the data.
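A replay then has to normalize both meanings to one unit. A minimal sketch, assuming the cutover point in the stream is known (the position constant and conversion are illustrative):

```python
# Sketch: the meaning of "degrees" flipped from Celsius to Fahrenheit
# partway through the stream; normalize everything to Celsius on replay.
CUTOVER_POSITION = 3  # assumed: readings from this position onward are Fahrenheit

def normalize(position, degrees):
    if position >= CUTOVER_POSITION:
        return (degrees - 32) * 5 / 9  # Fahrenheit -> Celsius
    return degrees  # already Celsius

readings = [27, 27, 27, 80.6, 80.6]
celsius = [normalize(i, d) for i, d in enumerate(readings)]
# all five readings now mean the same thing: 27 °C
```

The real difficulty is that nothing in the data itself marks the cutover; that knowledge has to live somewhere (metadata, a marker event, or a version field), which is the argument for never silently changing semantics.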

Snapshots

  • Snapshots are often not worth implementing due to the conceptual and operational costs associated.
  • Snapshots have typical versioning problems found in any structured data.
  • When the snapshot schema changes, it likely needs to be rebuilt rather than “upgraded”.
    • example - add a new column with data from existing events
    • build the new snapshot, then swap, deprecating the previous one over time
    • watch out for deleting the snapshots - data could be very expensive to rebuild
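The rebuild-rather-than-upgrade rule can be sketched by stamping each snapshot with the schema version it was built with (all names and the state shape here are assumptions):

```python
# Sketch: a snapshot stores the schema version it was built with; on a
# mismatch it is ignored and state is rebuilt from the events instead.
SNAPSHOT_VERSION = 2  # bump whenever the snapshot shape changes

def load_state(snapshot, events, apply):
    state, start = {}, 0
    if snapshot and snapshot["version"] == SNAPSHOT_VERSION:
        state, start = dict(snapshot["state"]), snapshot["up_to"]
    # stale or missing snapshot -> fall through and replay everything
    for event in events[start:]:
        state = apply(state, event)
    return state

apply = lambda s, e: {**s, "count": s.get("count", 0) + e}
old = {"version": 1, "state": {"count": 99}, "up_to": 2}  # stale shape
state = load_state(old, [1, 1, 1], apply)
# the stale snapshot is ignored; state is rebuilt from all three events
```

In production the rebuild would run in the background while the old snapshot keeps serving reads, then the two are swapped.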

Suggestions to avoid versioning problems

  • TicketPaidForAndIssued - avoid “And” in the events - this looks like two separate events
  • Avoid “switch fields”. For example: IsOverride, IsAdmin etc.
  • Watch for 1-to-1 relations between commands and events:
    • in trading: Sell(400@0.99) may emit events TradeOccurred(100@1.00), TradeOccurred(300@0.99)
    • there is no PlaceTrade command here
  • avoid temporal coupling between the two things that are happening:
    • if later, these two concepts are split apart, there will be a messy versioning issue, likely requiring a “Copy-Replace” (see below)

Dealing with errors in Event Sourced systems

Errors are fixed with compensating actions, which can be:

  • Partial Reversal (I got $9000, should have $8000 -> reverse $1000) - hard to follow from an accounting/auditing perspective.
  • Full Reversal (I got $9000, should have $8000 -> reverse $9000 and start again) - simpler to understand.

How?

  • Create compensating actions (e.g. events) in advance and provide a way to use those
  • Create new ad-hoc event types, emit events directly to the Event Store (much like fixing data in an RDBMS)
    • may require updating some/all consumers first if this is a new type of event
    • ok for small systems; dangerous for larger ones.
  • Hybrid: introduce a special type of event (Corrected, Cancelled)
    • Cancelled - would include the id (and optionally body) of the event it cancels
However, not all Event Sourced systems have natural compensating actions.
  • Introducing compensating actions can be a useful exercise for understanding the domain.
  • Example: TruckLeft but the wrong truck was scanned.
  • A significant number of these compensating actions suggests that the domain model needs to be reconsidered.
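The hybrid approach above, with a generic Cancelled event pointing back at the mistake, might look like this sketch (the event shapes and field names are assumptions):

```python
# Sketch: a generic Cancelled compensating event that references the
# id (and optionally the body) of the event it reverses.
import uuid

def cancelled(bad_event, reason):
    return {
        "type": "Cancelled",
        "eventId": str(uuid.uuid4()),
        "cancelsEventId": bad_event["eventId"],  # points at the mistake
        "cancelledBody": bad_event,              # optional: include the body
        "reason": reason,
    }

wrong = {"type": "TruckLeft", "eventId": "evt-1", "truckId": "truck-7"}
fix = cancelled(wrong, "wrong truck was scanned")
# consumers that understand Cancelled can undo evt-1 generically
```

The advantage over ad-hoc event types is that consumers only need to understand one new concept, Cancelled, rather than a new type per mistake.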

What to fix?

Find what needs to be fixed:
  • Iterate through suspect aggregates, instantiate and apply the compensating action
    • often can’t know all ids of the affected aggregates
  • Create a (temporary) projection to identify the problem/problematic aggregates
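A temporary projection of this kind is just a scan over the store that collects suspect aggregate ids; a minimal sketch, with the event shape and time field assumed for illustration:

```python
# Sketch: a throwaway projection that finds the aggregates touched by
# a bad event type within a suspect window, so compensating actions
# can be applied only where needed.
def find_affected(all_events, bad_type, since):
    affected = set()
    for event in all_events:
        if event["type"] == bad_type and event["at"] >= since:
            affected.add(event["aggregateId"])
    return affected

store = [
    {"type": "TruckLeft",    "aggregateId": "truck-7", "at": 10},
    {"type": "TruckLeft",    "aggregateId": "truck-9", "at": 3},
    {"type": "TruckArrived", "aggregateId": "truck-7", "at": 11},
]
suspects = find_affected(store, "TruckLeft", since=5)
# only truck-7 left within the suspect window
```

Once the ids are known, each suspect aggregate can be hydrated and the compensating action applied; the projection is then thrown away.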

Dealing with big mistakes in Event Sourced systems

When something goes really wrong to the point that the events are completely wrong:
  • Copy-Replace (nuclear option of versioning)
    • popular Event Stores allow some sort of “map+merge/split” of events producing the more appropriate stream
    • became one of the most popular patterns replacing other versioning strategies
    • often not very simple
      • switching from one stream to another and eliminating race conditions
      • read models out of sync - let those be rebuilt
      • consumers need to be updated?
    • may have to emit additional events
      • pointing back to the original stream
      • a StreamInvalidated event allows projectors to rebuild themselves more efficiently upon receiving a new stream
    • cheat: instead of splitting aggregate stream - use one stream for multiple aggregates
    • cheat: instead of merging aggregate stream - use two streams for one aggregate
      • also allows automatic deletion of some events for legislative purposes by deleting/expiring one stream
  • Transform (e.g. Simple Copy-Replace)
    • read events out of the old stream
    • accumulate (in memory or some sort of persistence)
    • transform
    • write to a new stream
    • delete/deprecate the original stream
  • Copy-Transform
    • ~> “Copy-Replace” on the whole Event Store rather than just a stream
    • process:
      • bring the new Event Store
      • only the previous Event Store is accepting the writes
      • new Event Store follows the previous one
      • change the new Event Store as necessary + test
      • new Event Store tells the old one to drain the requests and stop accepting the writes (marker event can be used for that)
      • the load balancer is pointed to the new Event Store
    • could be costly to rebuild all projections from the new store
    • at least double the hardware necessary
  • In Place Copy-Replace - some Event Stores allow “truncating before” a certain event
    • similar to Copy-Replace but done on the same stream
    • instead of writing to a new stream the events are appended back to the end of the same stream
    • the truncation occurs, or the pointer indicating the beginning of the stream is moved
  • Add/Change/Delete the events
    • one of the simplest examples (no deletes): GDPR erasure - remove data from PII events, add metadata indicating the erasure - same event, just no data on it
    • consumers MUST be aware of the mutation and may require new types of events to “know” about it (PersonDataErased(personId))
    • in more complex scenarios - likely to require downtime due to migration
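The Transform steps above (read, transform, write, deprecate) can be sketched end to end; here the transform also performs a GDPR-style erasure. The store layout, event shapes, and `strip_pii` function are assumptions for illustration:

```python
# Sketch of Transform (simple Copy-Replace): read the old stream,
# transform each event, write a new stream, deprecate the old one.
def copy_replace(store, old, new, transform):
    transformed = [transform(e) for e in store[old]]            # read + transform
    store[new] = [e for e in transformed if e is not None]      # write (None = drop)
    store[old] = [{"type": "StreamInvalidated", "movedTo": new}]  # deprecate

def strip_pii(event):
    # e.g. GDPR erasure during the copy: same event, just no data on it
    if event["type"] == "UserRegistered":
        return {**event, "email": None, "erased": True}
    return event

store = {"user-1": [{"type": "UserRegistered", "email": "a@b.c"}]}
copy_replace(store, "user-1", "user-1-v2", strip_pii)
# old stream now holds only a StreamInvalidated marker pointing forward
```

The hard parts glossed over here are exactly the ones listed above: draining writers, eliminating races during the switch, and rebuilding read models.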

Versioning Bankruptcy (or what if we have too much data)

So how do accountants handle data over long periods of time?
They don’t.
  • Create Initialized events when migrating to event sourced systems.
  • Similar events to Initialized can be used to “truncate” the system at a certain point (could be an end of financial year in accounting).
  • Archive + Initialize for a period of time:
    • migrate the whole environment much like Copy-Transform
    • “archive” the old one (including projections, Event Store)
    • Initialize the new environment, much like it was migrated from a non Event Sourced system
  • Hard to analyze data over multiple periods
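The Archive + Initialize idea above amounts to closing a period into a single opening event; a minimal sketch with assumed names and shapes:

```python
# Sketch: close a period by archiving the old store and starting the
# new one from a single Initialized event carrying closing balances.
def close_period(store, period, balances):
    archive = store  # kept read-only for audits, then retired
    fresh = [{"type": "Initialized", "period": period,
              "openingBalances": dict(balances)}]
    return archive, fresh

old_store = [{"type": "Credited", "amount": 100}] * 10_000
archive, store = close_period(old_store, "FY2025", {"cash": 1_000_000})
# the new store starts from one event, not ten thousand replays
```

This mirrors how accountants close the books: balances carry forward, and anything older lives in the archive, at the cost of cross-period analysis.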

Versioning Process Managers

Basic Versioning

  • Do what the organization does.
  • A “change” to a business process is more likely to be a new business process - so create a new process manager.

Upgrading process managers - strategies

  • Upcasting State - replace in-place to change the future and currently running processes.
  • Direct to Storage - upgrade the state of the process manager instances. Very, very dangerous.
  • New Version Migrates - some frameworks support “mapping” state from old to new process manager and thus automatically migrating and switching (no downtime).
  • Takeover/Handoff (the cleanest of them all)
    • the existing process manager is told to quit (via an EndYourselfDueToTakeover message)
    • it emits a TakeoverRequested message with any relevant state
    • that message starts the new version of the process manager
    • the EndYourselfDueToTakeover and TakeoverRequested messages share the same CorrelationId
  • Event Sourced Process Managers
    • used by Akka.Persistence
    • helps when the previous process manager’s state does not have enough information for the new one
    • the state is not kept in the process manager - it is replayed from the messages
    • more complex
    • often used in conjunction with Takeover/Handoff except that the state is not included in the message
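The Takeover/Handoff sequence above can be sketched as two messages sharing a CorrelationId (all message shapes and names here are illustrative, not a framework API):

```python
# Sketch: the old process manager ends itself and hands its state to
# the new version via a TakeoverRequested message that shares the
# same CorrelationId.
def end_yourself(old_pm, correlation_id):
    # Old version quits and emits its relevant state for the successor.
    return {"type": "TakeoverRequested",
            "correlationId": correlation_id,
            "state": old_pm["state"]}

def start_new_version(message):
    # The TakeoverRequested message starts the v2 process manager.
    return {"version": 2,
            "correlationId": message["correlationId"],
            "state": message["state"]}

old_pm = {"version": 1, "state": {"step": "awaiting-payment"}}
handoff = end_yourself(old_pm, correlation_id="order-42")
new_pm = start_new_version(handoff)
# both messages correlate on "order-42"; v2 resumes where v1 stopped
```

In the Event Sourced variant mentioned above, the `state` field would be omitted and the new process manager would instead replay the correlated messages.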
In most circumstances if you are trying to version running Process Managers, you are doing it wrong.

Instead focus on releasing new processes in the same way the business does.