A lot is happening in the data world these days, ever since Werner Vogels made the case for specialized databases for specialized workloads, arguing that a one-size-fits-all database fits no one [0]. A natural consequence of this specialization is the proliferation of databases [6] in a typical enterprise today, sometimes up to 10 different types. Data teams have built one-to-one, one-to-many and many-to-many connections between these databases, leading to hundreds of data pipelines. Jevons Paradox is at play here as well: as databases proliferate, the pipelines that connect them multiply, and because pipelines have become declarative and engineers now have better tools, each pipeline is cheaper to build, so teams end up creating even more of them.
A typical data landscape in an enterprise looks like the figure below – this one is from a leading ride-sharing company.
Chief Data Officers (CDOs) are under tremendous pressure to derive value out of this data, especially in times of crisis like today [1]. Analytics and AI use cases push the limits of data management as well, and the complexity and heterogeneity of enterprise data keep increasing. Case in point: data engineering is *the* fastest-growing tech job in the US [2,3].
Cloud data lakes have emerged as a popular architectural pattern to help organizations get value out of this data. Schema-on-read, the paradigm that data lakes introduced, lets organizations approach data challenges in a crawl-walk-run fashion, unlike data warehouses, which require more up-front planning [4]. However, there is no free lunch: the agility, flexibility and cost advantages of data lakes come at a price. The data in a lake often lacks context, doesn’t meet the quality bar that applications require, and is not easily understandable or discoverable by users. Problems of consistency and accuracy make it hard to derive value from data lakes and to trust the analytics built on this data. And the traditional methods of manually documenting, classifying and assessing data don’t scale to the volume of cloud-based data lakes.
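As a concrete illustration of the schema-on-read approach, here is a minimal sketch, assuming a PySpark environment and a hypothetical `s3://lake/raw/events/` landing path: the structure of the data is discovered when it is read, not declared before it is written.

```python
# Minimal schema-on-read sketch (PySpark assumed; the bucket path is hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Schema-on-read: raw JSON files land in the lake as-is, and the engine
# infers a schema at read time. No table DDL has to be declared up front.
events = spark.read.json("s3://lake/raw/events/")
events.printSchema()  # structure is derived from the files just read

# Contrast with schema-on-write (the warehouse pattern), where the schema
# must be defined and enforced before any data is loaded, e.g.
#   CREATE TABLE events (event_id STRING, user_id STRING, ts TIMESTAMP);
```

The flexibility is real, but so is the cost described above: nothing in this flow forces the data to be documented, consistent or complete before it lands in the lake.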
To exacerbate the problem, sometime during the mid-2000s, data (particularly as it relates to business decision making) crossed an important line. Previously, the majority of such data was sourced internally and its quality and reliability were in the hands of the internal systems and IT developers that created and maintained it. Since then, an increasing proportion of data comes from external sources. While internal data quality has often been questioned, it certainly far exceeds that of external data [5].
To summarize, enterprises are swimming in data. However, to become truly data-driven, an enterprise has to put this data into the hands of many more people, and those people have to be able to trust it. This is the problem that Data Observability aims to solve: it enables enterprises to run predictable data pipelines by providing contextual information for data monitoring and data quality.
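As a rough sketch of the kind of contextual signals a data observability layer tracks for a pipeline, consider basic freshness, volume and null-rate checks on a table. The table, columns, thresholds and the pandas-based implementation below are illustrative assumptions, not the API of any particular product.

```python
# Minimal data-observability sketch (pandas assumed; the table, columns and
# thresholds are illustrative, not taken from any specific product).
import pandas as pd

def profile_table(df: pd.DataFrame, ts_column: str) -> dict:
    """Collect basic health metrics: volume, freshness and per-column null rates."""
    now = pd.Timestamp.now(tz="UTC")
    latest = pd.to_datetime(df[ts_column], utc=True).max()
    return {
        "row_count": len(df),
        "freshness_hours": (now - latest).total_seconds() / 3600,
        "null_rate": df.isna().mean().to_dict(),  # fraction of nulls per column
    }

def check(metrics: dict, max_staleness_hours: float = 24, min_rows: int = 1000) -> list:
    """Flag conditions that would make downstream analytics hard to trust."""
    alerts = []
    if metrics["freshness_hours"] > max_staleness_hours:
        alerts.append("data is stale")
    if metrics["row_count"] < min_rows:
        alerts.append("unexpected drop in volume")
    alerts += [f"high null rate in column '{col}'"
               for col, rate in metrics["null_rate"].items() if rate > 0.2]
    return alerts

# Hypothetical usage against a curated orders table in the lake:
# orders = pd.read_parquet("s3://lake/curated/orders/")
# print(check(profile_table(orders, ts_column="updated_at")))
```

Running checks like these continuously, with the results surfaced in context, is what turns a sprawling set of pipelines into something predictable.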
The ultimate goal of this endeavor is to make Big Data work the way you’d imagine it worked if you’d only used Small Data before. In effect, to operationalize data trustworthiness in enterprises.
References
[0] Werner Vogels, A one size fits all database doesn’t fit anyone.
[1] Are we asking too much of Chief Data Officers and their data teams?
[2] Data Engineers are part of the Analytics “dream team”
[3] What does a Data Engineer do?
[4] Data Lakes and Data Warehouses – It’s not either/or but both!
[5] The Data Quality market is expected to reach $2.5B within the next 8 years.
[6] Database of Databases – www.dbdb.io