Why good data practice matters

21 June 2021

At Opencast we know how important it is to get data management and strategies right. Avi Marco explains why data integration matters for modern business.

My experience of working with businesses down the years on data projects has shown just what can happen when they either fail to manage their data well enough, or more importantly don’t locate their data work sufficiently in the wider business context.

From the consumer client that found its marketing efforts hampered by unreliable data, to the SaaS company unable to do business reporting, to the financial business making haphazard changes that could have brought down an entire platform, poor data policies have threatened wider business operations. These firms found out the hard way that they should have been managing their data better.

I lived and shared the pain at these businesses. And I continue to be frustrated by companies that pour money down the drain on ill-considered data initiatives.

In all of its services - whether in software delivery, architecture, cloud, or user-centred design - Opencast maintains a broader perspective, with a clear focus on business outcomes. The same broad perspective is needed in the data space.

One key issue that’s of real interest to data professionals right now is data integration - which brings data together from disparate sources into a combined, usable view.

With the advent of multi-hybrid clouds this challenge has become more complex than ever before. Anyone engaging with the question of which data integration solution to adopt will be thinking about this challenge. Even before we consider tooling, there is a more basic architectural question of how we address data transformation.


Anyone for T?

The term ETL (extract, transform, load) has dominated data conversations for half a century. The basic premise was that you get data from somewhere (E), do something to the data to increase their usefulness (T) and then put the data somewhere else (L).

Then came ELT (extract and load the data quickly into your control before transforming).
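The E-T-L split above can be sketched in a few lines. The only difference between ETL and ELT is the order in which transform and load run; everything here (the hard-coded source, the in-memory "warehouse") is illustrative, not any particular tool.

```python
# Minimal ETL/ELT sketch. All names and data are illustrative;
# real pipelines would talk to databases, queues or object stores.

def extract():
    # E: get data from somewhere (here, a hard-coded source)
    return [{"id": 1, "spend": "10.5"}, {"id": 2, "spend": "3.0"}]

def transform(rows):
    # T: increase the data's usefulness (cast spend from text to a number)
    return [{**r, "spend": float(r["spend"])} for r in rows]

def load(rows, target):
    # L: put the data somewhere else (here, an in-memory list)
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)   # ETL: transform before loading

staging = []
load(extract(), staging)                # ELT: load the raw data first...
curated = transform(staging)            # ...then transform inside your control
```

Either way the same three concerns exist; ELT simply moves the T to after the data are under your control.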

In reality there was a mix of Es, Ts and Ls along the data pipeline, each with its own idiosyncrasies and business rules. The ‘T’ rules were often hidden in the E or the L (for example deciding not to load records if data fields have a null value is not a simple load decision but a transformation of the data through application of a rule). Just think about the effect on your counts if you’re not aware of the rule.
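The null-record example above can be made concrete. A loader that quietly skips incomplete rows changes your counts without telling anyone, while naming the same rule as an explicit transform keeps it visible. The field names and functions here are hypothetical.

```python
# Source rows; one record is incomplete. All names are illustrative.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
]

def load_with_hidden_rule(rows):
    # A 'load' step that silently skips rows with a null email:
    # really a T hiding inside the L. Downstream counts just look wrong.
    return [r for r in rows if r["email"] is not None]

def drop_null_emails(rows):
    # The same rule made explicit as a transform, reporting what it
    # rejected so downstream consumers can reconcile their counts.
    kept = [r for r in rows if r["email"] is not None]
    return kept, len(rows) - len(kept)

loaded = load_with_hidden_rule(rows)     # 2 rows arrive; why is a mystery
kept, rejected = drop_null_emails(rows)  # 2 rows arrive, 1 known rejection
```

The data that land are identical in both cases; the difference is whether the rule, and the record it discarded, are visible to anyone reconciling the numbers.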

Tool vendors told us that ‘they’ve got the T’ and all we needed to do was apply their tool (with its ever-growing number of connectors) to handle the E and the L, and they’d see to the rest. It wasn’t that simple, and enterprises ended up with a morass of code and tools doing parts of the E and/or L and/or T and often many times over.

The result: a lack of control and governance, repeated process failures, endless rework, poor lineage and traceability, and, most importantly, a loss of trust.

If anything, the ‘T’ challenge has grown as storage and compute have become less costly allowing for more data to be stored and processed, as cloud computing adoption has accelerated, and as IoT and the digital revolution create ever larger sources of rich data in ever more disparate sources.

Unless we choose to ignore the evidence of our own experience, we have to change our approach and accept two basic truths:

  1. That there isn’t a monolithic T that can be miraculously wished away by the next extraordinary technology.
  2. That T is inherently complex and requires thought and planning if it is to be effective.

We must challenge the way T has been managed (with such limited success for 50 years) and accept that T is actually a framework of design and quality principles. We should aim to shape a simplified data architecture that meets business needs, reduces the avoidable costs of T (such as cleansing, fixing, deduping and rework) and thereby liberates human and technology resources to deliver the real value of T: enriching data to its most contextually valuable state for the purposes for which it is needed.

The framework has just three simple principles – ABC:

Agility and autonomy
Begin with quality
Create once, use many times.

These three principles require a change of mindset, as they are predicated on delegating responsibility and accountability to the teams that deliver the data outcomes. While best delivered in a squad mode of delivery, any enterprise can reap immediate rewards by applying the broad principles to any operating model, given the right level of commitment from the senior leadership team. The details of each principle will be shared in subsequent blogs and discussion sessions; a high-level view of each is presented below.

Agility and autonomy

Business measures are often calculated by a central BI team that scrabbles around to produce them and is often challenged by other groups who claim their metrics are different, perhaps more up to date or calculated differently. The time and effort wasted in these conversations can be reduced by simply empowering the teams closer to the original data to do the job, while holding them accountable for the fidelity of the metrics they produce.

This approach will not only move delivery closer to the people with domain expertise, but it will also help them take ownership of the metrics by which their function is measured. It’s quite likely that these conversations will lead to better definitions of more meaningful metrics that are easier to adapt as market conditions change. Of course, there will remain some global metrics that need to be produced centrally, but that job will be easier by far if the constituting measures can be trusted.

Begin with quality

While data quality may be measured in different ways, the core principle is that quality should be established as early in the data lifecycle as possible, and ideally at the point the data are first created. For too long, the data industry has thrived on costly remedial actions to the problem of data quality. If a data element matters to your business, get it right up front.

It may be as simple as insisting that a development team validate input to a text field. It might be as difficult as overcoming the digital sales team’s insistence that each customer journey be as light as possible, losing valuable information for insight or data science, or persuading developers who decide they don’t want to transfer a data payload in the name of application efficiency.

Of course, the developers or sales team might be right in arguing that efficiency at that point outweighs the subsequent cost to the business, but they often have a parochial view and miss the bigger picture. So have the conversation and determine the right balance.
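As a concrete sketch of getting a data element "right up front": validating and normalising a field at the point of capture is far cheaper than loading bad values and cleansing them later. The email rule and function name below are illustrative assumptions, not a prescription.

```python
import re

# Deliberately simple illustrative rule: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def capture_customer(raw):
    # Establish quality at the point the data are first created,
    # rather than loading bad values and paying to cleanse them later.
    email = (raw.get("email") or "").strip().lower()
    if not EMAIL_RE.match(email):
        raise ValueError(f"invalid email at capture time: {raw.get('email')!r}")
    return {"email": email}

clean = capture_customer({"email": " Alice@Example.COM "})  # normalised up front
```

Every record that passes this gate arrives downstream already trusted; every record it rejects is a conversation with the source, not a remediation project months later.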

Create once, use many times

We saw that key business metrics are often produced in a reporting layer, right at the end of the data pipeline. Take a measure of customer sentiment, for example. That’s fine if your company is happy to look in the rear-view mirror to explain why customers churned. But what if you want to direct customers to a dedicated digital journey while they are engaging with your business, to increase stickiness and prevent churn?

Does your mobile app get a customer sentiment or value score from an API into your reporting platform (which incidentally only reports scores at close of business yesterday and misses all of today’s important interactions)? Does your mobile app recalculate the scores (and does the development team ensure that their definitions remain consistent with those in the reporting tool)? And the customer save team – where do they go to find customer value scores when dealing with a complaint – perhaps an iFrame that calls yesterday’s reporting data or a widget that accesses the app team’s calculation?

Spend time to identify the best point in the data pipeline to calculate these scores once, consistently, and authoritatively so that many applications or services can consume the same essential metric. This will produce a trusted and sustainable metric at a lower cost to your business.
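One way to sketch "create once, use many times" is a single authoritative function for the score that every consumer calls, instead of the app team and the reporting team each re-deriving their own version. The scoring formula and all names here are hypothetical.

```python
def customer_value_score(interactions):
    # The single, authoritative definition of the score (formula is
    # illustrative). Every consumer calls this; no team re-derives it.
    if not interactions:
        return 0.0
    return round(sum(i["value"] for i in interactions) / len(interactions), 2)

# Hypothetical consumers sharing the one definition:
def mobile_app_view(interactions):
    # the in-journey app shows the live score
    return {"score": customer_value_score(interactions)}

def daily_report_row(customer_id, interactions):
    # the reporting layer emits the same score, by construction
    return (customer_id, customer_value_score(interactions))

history = [{"value": 10.0}, {"value": 4.0}]
assert mobile_app_view(history)["score"] == daily_report_row("c1", history)[1]
```

Because both consumers call the same function, the app, the report and the customer save team can never drift apart on what the score means; changing the definition happens in exactly one place.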

Applying these principles to data transformation will profoundly impact outcomes by improving data quality and trust in a cost-efficient way that drives structured thinking to the design of a simplified data architecture. Then you’ll be ready to make your data tool choices.