Anyone for T?
The term ETL (extract, transform, load) has dominated data conversations for half a century. The basic premise was that you get data from somewhere (E), do something to the data to increase its usefulness (T), and then put the data somewhere else (L).
Then we experienced ELT (extract and load the data quickly, into your control, before transforming it).
In reality there was a mix of Es, Ts and Ls along the data pipeline, each with its own idiosyncrasies and business rules. The ‘T’ rules were often hidden in the E or the L (for example, deciding not to load records whose fields contain null values is not a simple load decision but a transformation of the data through the application of a rule). Just think about the effect on your counts if you’re not aware of the rule, as the sketch below illustrates.
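As a minimal, hypothetical sketch (the record layout, field names and values are invented for illustration), consider a load routine that silently skips any record containing a null: the filter is a transformation rule in disguise, and the mismatch between extracted and loaded counts is exactly the effect described above.

```python
# Hypothetical example: a 'load' step that quietly applies a transformation rule.
# The records and field names below are illustrative only.

records = [
    {"customer_id": 1, "email": "a@example.com", "spend": 120.0},
    {"customer_id": 2, "email": None,            "spend": 75.5},   # null email
    {"customer_id": 3, "email": "c@example.com", "spend": None},   # null spend
]

def load(records):
    """Looks like a plain load, but the null check is really a hidden 'T' rule."""
    return [r for r in records if all(v is not None for v in r.values())]

loaded = load(records)

# The source count and the loaded count now disagree, and anyone unaware of
# the hidden rule will struggle to reconcile the two.
print(f"extracted: {len(records)}, loaded: {len(loaded)}")  # extracted: 3, loaded: 1
```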
Tool vendors told us that ‘they’ve got the T’: all we needed to do was apply their tool (with its ever-growing number of connectors) to handle the E and the L, and they’d see to the rest. It wasn’t that simple, and enterprises ended up with a morass of code and tools doing parts of the E and/or L and/or T, often many times over.
The result: lack of control and governance, repeated process failures, endless rework, poor lineage and traceability, and, most importantly, a loss of trust.
If anything, the ‘T’ challenge has grown: storage and compute have become less costly, allowing more data to be stored and processed; cloud computing adoption has accelerated; and IoT and the digital revolution create ever larger volumes of rich data in ever more disparate sources.
Unless we choose to ignore the evidence of our own experience, we have to change our approach and accept two basic truths:
- That there isn’t a monolithic T that can be miraculously wished away by the next extraordinary technology
- That T is inherently complex and requires thought and planning if it is to be effective.
We must challenge the way T has been managed (with such limited success for 50 years) and accept that T is actually a framework of design and quality principles. We should aim to shape a simplified data architecture that meets business needs and reduces the avoidable costs of T (such as cleansing, fixing, deduping and rework), thereby liberating human and technology resources to deliver the real value of T: enriching data to its most contextually valuable state for the purposes for which it is needed.
The framework has just three simple principles – ABC:
- Agility and autonomy
- Begin with quality
- Create once, use many times.
These three principles require a change of mindset, as they are predicated on delegating responsibility and accountability to the teams that deliver the data outcomes. While best delivered in a squad model, any enterprise can reap immediate rewards by applying the broad principles to any operating model, given the right level of commitment from the senior leadership team. The details of each principle will be shared in subsequent blogs and discussion sessions; only a high-level view of each is presented below.