7 January 2022

The future history of Data Engineering

These get posted at groupby1.substack.com first and posted here as a backup. Comments enabled here and on substack, appreciate any questions.

Trigger warning: this may trip your thought leadership nerve, but I’m mostly riffing, and I think it is additive. This post was easy to start, a pain to finish and lots of fun in the middle.

This is a narrative for the near future of Data Engineering in startups, and I think it makes some interesting points. I do think the post has avenues for expansion, especially counter-arguments in the context of enterprise tech.

Hello to new subscribers, and a shoutout to anyone using the RSS feed.

Thanks to the LocallyOptimistic.com community for some lively discussions on an early draft of this post.

Intro

The core premise of this post is:

Most businesses' data engineering needs have been solved, or will shortly be solved, by managed services. 10 years ago the same needs would have required endless and extensive self-built ETL pipelines, databases and tools.

For the vast majority of businesses, this means they can and should focus on building capacity for business logic, analysis and predictions instead of data engineering.

The minority of businesses that need streaming services / low latency batch data will further push the boundary, using specialist Data Engineers.

The implications are that while Data Engineering is growing rapidly, so too are the forces that will undermine the need for Data Engineers, and the current under-supply of competent engineers will lead to an over-supply of junior engineers (this should ring a bell for anyone who watched the Web-dev, then Full-stack, then Data Science boot camps).

Let us break down the premise further, as that is a massive generalisation, and to the reader of this niche corner of opinion, may seem counterintuitive, inflammatory, and frankly, stupid.

6 points, expanded into 6 sections. Let's go.

1. Majority

Keep in mind the context for the majority of businesses - technology is often an expensive misdirect when implemented badly. This is especially true when the technology is not directly aligned with their competitive advantage. Majority here means all businesses investing in technology, not just the typecast "blitz scale tech business". All businesses should take advantage of data tooling, or they will in effect be flying blind relative to their peers (or get an advantage over their peers if they get it right).

2. Data engineering

Plumbing of the data - ETL, data warehouse, streaming, batch, orchestration, infra etc. Niche skills that are hard to hire for.

Distinct from Software Engineering. Mostly not backend developers, more commonly generalists, occasionally highly skilled specialists.

Data Engineering is quite contentiously named, as is made clear later.

3. Business logic

This is covered elsewhere - but in the most abstract terms - businesses should hope that their engineers’ primary focus is on improving the ability to represent the business's various states, enhancing the ability to interact with and modify these states, and even predicting the outcomes of modifications.

Businesses should strive not to have people worrying about managing infrastructure, plumbing, ops etc over and above what is strictly necessary. Playing on the margin of this point is what the CTO does.

4. Managed services

Think about Sysadmins of the mid-2000s: arcane knowledge that is now redundant in almost every business, due to AWS, then Heroku, now Vercel, Supabase etc flying up the stack. (Or Hadoop specialists. Big Data DBA anyone?).

Same with Data Engineering. Tech abstraction as a service. Managed Services are arriving fast with the likes of Snowflake, Fivetran and the commodifying follow-ons. They’re aggressively chasing down the almighty dollar, undercutting margins, offering better cost structures, as well as a flurry of bundling mergers and consolidation.

5. The minority

Many businesses will still have an exceedingly strong need to increase their advantage through data engineering. Take High-Frequency Trading as an example. These businesses will progress the field, and the best data specialists will be needed in those spaces.

6. Implications

This one is clear, don’t get caught on the wrong side of any sea change. I would (do) argue that the ETL engineer skill-set is mostly going to be marginalised until :skull:.

In much the same way that the market demand for boot-camp Data Scientists is low, due in part to oversupply, better tooling and additionally a reorientation around the expectations of a Data Scientist, so too do I propose that Data Engineering demand dynamics will change.

I'd like to hope that the rate of ETL code being written is in decline because most can rely on managed services or open-source ELT extractors.

This point gets some pushback, discussed further in part 6.

But more generally, here is something about changes in the tide:

When the tide turns, there is a definite moment when the tide has indeed turned, but that change in direction becomes apparent to different boats at different times. This depends on context, location, keel depth and distance from both the equator and the moon (not to mention the sun). The gravitational pull has changed, but the water doesn’t start moving everywhere at the same time.

As became clear through writing this post and quoting its sources, it also agrees with a certain viewpoint on specialisation. The link should become obvious later.

So, that is the intro, I’m going to stick with those 6 sections so that coherence abounds, and explain / expand the points in finer detail.

1. Majority

Nearly every company needs a data person. Any company that has ambitions to beat market returns on its investors’ capital and doesn't have someone in a broadly data-dedicated role will struggle to compete.

10-15 years ago an easy indication that a company wasn't keeping up was having no IT person; the modern equivalent is having no Data Person.

But the premise is that that person no longer needs to be a Data Engineer.

Consider the sysadmin/DBA type roles, which for 99.99% of businesses do not exist, because cloud providers hire those people and abstract their role into a service.

The thrust is that Data Engineering could go the way of the DBA. Niche, specialised.

Who/What is the data person then?

Data engineering in the ETL/ELT sense has historically been complex, difficult, emergent, at times chaotic, and required niche software engineering skills.

Now, Extract and Load for most businesses using generic SaaS tools, is solved. Using the standard set of CRM, HR, Finance, and Ops tools, 80% of your ELT work is done for you at a standard, predictable price.

Commoditised EL SaaS is ubiquitous, with the second wave (of EL providers) offering better services at more favourable terms than Fivetran, with multiple variants and mutations.

T for transform, with general best practices courtesy of dbt, is where the bulk of the analytics work lies. Critically this is where the data person should start.

They should be, to some degree, what is known as an Analytics Engineer, but possibly more usefully, not a specialist Data Engineer (nor a Data Scientist for that matter, but that bridge has been crossed).

They should be a purple data generalist:

The data world needs more purple people — generalists who can navigate both the business context and the modern data stack. Let's put aside skillset dichotomies, and learn to feel comfortable in the space between.

If you do need a Data Engineer, probably for some or other niche API that isn’t supported by your EL tool, then this is great work to outsource! (My day-to-day work: Supporting companies on their fringe data engineering needs when their internal team wants some extra capacity or capability).

But the first full time data hire needs to be obsessed with business impact.

2. Data Engineering

A tale of two types of Data Engineers: Again with the generalisations! In my view and from what I've seen in the job market, there are two types of Data Engineers at the moment:

(1) Data Engineers: Software engineers, Data

Described as: Software engineering specialists, with data as the core specialisation, who can focus on the niche areas of data engineering and can work with complex real-time data systems.

Needed When: Only required in tech businesses, and only when software engineers cannot assist. This is not needed for 99% of businesses and these candidates know what they want to work on and have the agency to decide.

Characteristics:

Tools-oriented
Computer scientists / Very good software engineers
Driven by curiosity
Driven towards perfection of the craft
Want the solutions to be elegant, optimised
A specialised role for a specialised business problem

Currently and into the future hired to do the following (quote from a slack group from someone who may or may not want the shoutout):

When building out some data-focused applications, like, say, a streaming data enrichment layer that serves up some curated data real-time to other micro-services, we need software engineers and data engineers. Occasionally you’ll find unicorns who can do it all (we have a few of them), but the vast majority of software engineers aren’t experienced enough with data to also be able to solve complex, big-data, non-SQL problems as well as someone more specialised could.

(2) Data Engineers: Solutions oriented engineers, Data

Described as: Business optimisers. Data engineers that engineer data because it is the biggest blocker in the optimisation of a bigger picture issue, namely analytics as it relates to business improvement efforts. I love this post from erikbern.com:

You work with the recruiting team to define a profile for a generalist data role, that emphasizes core software skills, but with a generalist attitude and a deep empathy for business needs. For now, you remove all the mentions of artificial intelligence and machine learning from the job posting.

Needed when: Data extraction and centralisation is identified as the key issue in a long line of issues, i.e. the primary bottleneck in the optimisation process.

Characteristics:

Goals oriented
Background in an adjacent engineering field
Driven by optimisation, the ultimate goal
Utilitarian problem solvers, relied upon to get the job done
Functionally broader skill set, maybe even new to the domain, and not (yet) experts in technology

Historically hired to do the following:

If you were on a “traditional data team” pre 2012, your first data hire was probably a data engineer. You needed this person to build your infrastructure: extract data from the Postgres database and SaaS tools that ran your business, transform that data, and then load it into your data warehouse.

Currently hired to:

Build data warehouse, pipelines, dimensional modelling, deploy analytics tools, string it together, but critically, to drive change in a business.

In short - Type 2 wants the solution to be cheaper, easier, faster, best fit, 80/20, is less intrinsically interested in the how and more interested in the impact on outcomes.

In my opinion, this is basically now Analytics Engineers, and if you disagree with my take on this concept, speak to an Analytics Engineer who had the title Data Engineer, and ask them if they can relate. Similar experience to those Data Scientists who preferred Analytics Engineering.

Another way to think about these distinctions is (erikbern.com again):

I often think of people as (and this is an unfair crude generalization etc) roughly on a spectrum between tools-oriented and goal-oriented.

Memed as:

3. Business Logic

Engineers as optimisation specialists

My background is in industrial engineering, which is broadly a stats'y engineering field incubated in the optimisation of systems (typically factories).

Layouts, flows, bottlenecks, JIT, supply chain etc. Mostly a solved field in many regards (shout out to the bullwhip effect 🚛 🚚 🚛).

The broad optimisation process for most of those businesses looks something like:

1. Collection of SaaS and ERP-like systems to track and account for things
2. Data engineering to extract the various states
3. Analytics on the states
4. Decisions to change the states
5. Track decisions in ERP (i.e. repeat)

When I left engineering school, ERP implementation was where the demand was, and large chunks of engineers ended up implementing/consulting/suggesting various guises of a(n) ERP / CRM / database / app / spreadsheet / chalkboard.

However, it became clear to me (with hindsight) that this was quickly becoming a commodity technology and skillset (i.e. outsource to contractors), and that Data Engineering was the real skill bottleneck.

Businesses were amassing large data sets but struggling to access them, let alone analyse them, and so having lucked into a DE role, I made this transition.

Optimising down the optimisation list

Data Engineering is no longer the bottleneck! This is a huge relief, because Data Engineering is not optimising, rather just a necessary lift and shift. It is purely an operational burden brought about by decisions made with siloed data as the tradeoff.

Now surely step 3 of the process above, "Analytics on the states", is the biggest hurdle and opportunity.

Analytics is currently a headache, which requires significant investment, and where I suggest the investment is made. There is much more value in time spent on the building of “Business Logic”. In this case, Analytics.

Analytics extends far beyond data modelling and analysis, encompassing business processes, people processes, management and communication.

Analytics also pushes back into software engineering, system designing and overall value chain analysis.

4. Managed services

The Data engineering "type (2)" makeover

My day to day is where I form this opinion. I have done much less data engineering as it relates to Data Warehouse fine tuning, and ELT troubleshooting since the tools became so much easier. I do a lot more analytics and a lot more modelling. The problem has moved, onwards, up-system. The old bottleneck has largely been removed and solved. Optimised.

To quantify this stance, consider why there is a tsunami of new spins on data products: metric-stores, reverse ETL, metadata, discovery, quality, etc. Great data from Benn Stancil:

In 2017, Y Combinator—an incubator of both startups and the Silicon Valley zeitgeist—funded 15 analytics, data engineering, and AI and ML companies. In 2021, they funded 100 (my emphasis)

These are viable partly because the EL bottleneck was eased, the storage got cheaper and dbt made the whole thing more manageable.

Suddenly the problem wasn't getting the data, it was using the data.

Using the data was typically the domain of the elite. The reason Airbnb, LinkedIn etc have needed a data catalog for nearly a decade is that they had the engineering clout to make it necessary.

The sudden simplification of this process has meant that the next, hitherto unknown bottleneck gets suddenly bashed into, and there is immense value to be gained by unlocking it.

Build it, will they come?

If offered, many businesses will jump at a SaaS subscription, rather than spending that money on hiring/expanding an engineering team.

The term engineering is derived from the Latin ingenium, meaning "cleverness" and ingeniare, meaning "to contrive, devise" (Wikipedia).

When the data is easy to centralise, combine and analyse, engineers won't be needed to devise and contrive data combining solutions.

They can go and contrive and devise something else, that is complex, and that gives the company a competitive advantage.

Eventually, analytics engineering could face the same turn of the tide, when the tooling gets so good that the team is composed entirely of analysts and product people, with no contriving engineers.

In the same way that structural engineers are only required when building on quicksand, data engineers are only required when building upon a dataswamp. As the tooling gets better, so do the foundations stabilise.

On the margin

The companies I advise and work with often have much less need for Data Engineering at the outset.

However, to clarify one point - when they do need Data Engineering, it is a requirement for specialisation. There is indeed more Data Engineering to be done, but this is increasingly specialised (this is a semi-deliberate contradiction to this entire post that I am OK with).

The companies need help with the edge-case, marginally viable solution, where something emerges, crucial to them, that falls through the cracks of the 80/20 SaaS solutions. The point is that these needs come later down the line. Not at the outset of a data project, but later, once the bulk of the crucial, impactful elements are working and generalist data practitioners have exhausted their options.

A caveat: the assemblage of the appropriate tools in the appropriate order to match business needs and maturity is a tricky problem indeed. Probably something that would benefit from the skills of a Data Engineer. More on that in section 6.

5. Minority

Data ENGINEERING isn't going anywhere.

I recently discussed this with someone from a quant hedge fund, and while they had a computer science background, they were "data" + "engineering" to a profound degree. They needed a real-time (real time real-time) data feed from all of the brokers, with extensive transformation across all of them. Multiple decision systems had to be integrated with predictive models, and orders then had to be reliably sent back into that system, in near real-time.

This system literally was the business. Complex, differentiating. Building this was one of maybe two things that the company needed to execute to beat the competition.

Data engineering in certain contexts is necessary, but likely to be a specialisation increasingly of interest to the minority.

The above point alone isn't that contentious.

What is contentious is the WHEN.

Has the tide turned, or is it still rising? Who is seeing the signs and who is missing them? Who is seeing evidence where there is none?

6. Implications and Evidence from the field

Hello?

2 more minutes, less hand waving I promise.

Implications for Engineers

This entire post makes the same point as this specialisation bombshell:

What is the right level of specialization? For data teams and anyone else.

It seems fair that, if tools didn't require so much knowledge to use (I'm looking at you, Kubernetes), then on the margin, the need for specialisation would be less.

The extension of this point is that because the data engineering toolset got so much better, the specialisation required is now less. Snowflake and BigQuery users agree.

The implication for engineers whose work is now easier is the following:

Either you move in the direction of the new business problem.

Or you move to a new business that still has the old problem.

Or you specialise further until you find another domain to play in, and wait for the tide to turn again.

Erik's blog above makes another point, which made me realise this is a mostly “deeply inspired” notion, so much so that I've used the tools/goals oriented concept in an earlier section.

I often think of people as (and this is an unfair crude generalisation etc) roughly on a spectrum between tools-oriented and goal-oriented. Some people have their favourite tools, and that's what they like to use. They make their whole career about honing a craft with those skills. Other people are more entrepreneurial, and don't care about what tools they use: they care about the ultimate goal.

This topic was quite contentious on Twitter. People made some very stern remarks about specialisation when Erik posted it initially, and I guess I'm not surprised. People are very likely going to fight against any concept that undermines their career domain.

However, this contentiousness further highlights the opportunity:

Contrarian ideas, when right, are "the valuable thing" from the Taleb and Zero to One books:

“What important truth do very few people agree with you on?”

Should this point be right, it will be proven right by (another) tool that reduces the need for specialisation and sells for ${LOTS} because it enables achieving Goal X (data-driven-whatnot) without hiring a team of 100 ludicrously demanding human specialists with endless needs.

Arguments, of which there are a few, against this, include that the startup ELT paradigm is a minority and that data engineering work is firmly entrenched in the structures of larger businesses, especially enterprises. The refinement I think is worth making, is that while this may be true, the hope is that it will become less so. Like the shift to the cloud, I would hope that what we describe as ELT now leads to us finding a better way of doing things, whatever it may end up being, that is as transformational for Data Teams as cloud computing was for Software Teams. (Noteworthy that “hybrid-cloud” has proven so popular with enterprise)

And a pushback to this: enterprises aren’t most businesses. Most businesses don’t have a large tech team, most businesses didn’t exist a decade ago. However most Data Engineers are not employed by most businesses, hence this disconnect. Most Data Engineers would disagree with this premise, but the point is that most businesses won’t need a Data Engineer.

Looking at history, this has happened before. Take a look at Data Science as a field, which may be due for a renaissance in the guise of ML. “Data Science” was a crutch for companies not knowing what to do with their data.

Implications for Businesses

The message from the communities and my experience is clear - Data Engineering as it once was is generally less of a challenge - but building a coherent “data platform” remains a chore.

What is possibly the most complex part of “Data”, and what they really need help with is, what I suppose quite fairly is called Data Platform Engineering:

An EL tool can start costing inordinate amounts relative to the value gained.
X tool sunsetting Y feature.
Adding a new business tool with an unsupported API that needs a Singer tap built. This work is typically open-sourced, so eventually there will be fewer teams needing to build Singer taps (pray).
Airflow proving to be a headache. According to Slack, 90% of Airflow users are using managed services, so less specialisation in Airflow will be needed (pray pray).

As an example of this, I’ve recently consulted on the best way to ELT some data from a few API sources unsupported by Fivetran, as well as Stitch/Airbyte. The decision complexity is quite high:

Is an orchestration tool such as Airflow/Prefect needed yet, and if so, which one?
- If Airflow, then the AWS instance, the Astronomer version, or self-host?
- Do we try at the outset to use Kubernetes? Is Airflow stable yet? It still feels overcomplicated.
- If Prefect, will they as a new entrant be more reliable or still have teething issues?
- What level of CI/CD for the tools?
- Would they benefit from Terraform?
Meltano, Airbyte or Singer extractor/tap spec?
- Meltano [1] seems to be making excellent progress, but requires some minor hosting effort, and also requires an orchestrator.
- Airbyte seems to (seems to) be making more of a commitment to quality.
- Both are wrangling with the ways of incentivising community maintainers.

This is just one “component” of the team’s ELT, not even the full picture of their Data Platform, and it is a subtly complex and consuming decision for those familiar.

A great way to frame this is the quasi-architectural, DataOps-flavoured, generalist Data Guru role of the Data Platform Engineer (DPE):

DPE are thinking about what data exists, who should have which access, how to make it available for usage by people and tools, how to make it redundant (disaster recovery), how to enable discovery (catalog), etc

Or another spin:

DPE just means that you are the Tech Lead of the Analytics Engineering.

While I don’t necessarily care for the DPE term over DE, I do think DPE aptly captures the key work that many Data Engineers now do, combining and ensuring cooperation between competing tools to build a coherent consumable data platform.

More than anything, the developer experience for most of the necessary Data Platforms tools is just garbage. Airflow is a nightmare, GCP really a frightening pain, and AWS is just so much worse. The correct abstractions over all of this is a huge opportunity and the thing that the DPE needs to keep an eye on.

[1] Worth your time to have a look at the Meltano SDK if you need to build an API extractor. Great team, developer experience and ambition. If you are a Data Engineer (either type), these open source projects are possibly the best intersection of your skills, interests and market demand. I set up the most lightweight way to run a Meltano ELT on AWS, using Terraform, and could use a review!
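To make the "unsupported API needs a Singer tap" work more concrete, here is a minimal sketch of what a hand-rolled tap boils down to under the Singer spec: JSON lines on stdout, in the form of SCHEMA, RECORD and STATE messages. The stream name `invoices`, the `fetch_invoices` function and its sample rows are hypothetical stand-ins for a real API call, and a production tap would more likely be built on something like the Meltano SDK than written by hand like this.

```python
import json
import sys
from typing import Iterable, Optional, TextIO

STREAM = "invoices"  # hypothetical stream name, for illustration only

# JSON Schema describing one record of the stream (assumed shape).
SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "amount": {"type": "number"},
        "updated_at": {"type": "string", "format": "date-time"},
    },
}


def fetch_invoices(since: str) -> Iterable[dict]:
    """Stand-in for a call to the unsupported API (hard-coded sample rows)."""
    rows = [
        {"id": 1, "amount": 120.0, "updated_at": "2022-01-03T09:00:00Z"},
        {"id": 2, "amount": 75.5, "updated_at": "2022-01-05T14:30:00Z"},
    ]
    # ISO-8601 timestamps in the same format compare correctly as strings.
    return [row for row in rows if row["updated_at"] > since]


def write_message(message: dict, out: TextIO) -> None:
    """Emit one Singer message as a single JSON line."""
    out.write(json.dumps(message) + "\n")


def run_tap(since: str = "", out: Optional[TextIO] = None) -> None:
    """Emit SCHEMA, then one RECORD per row, then a STATE bookmark."""
    out = sys.stdout if out is None else out
    write_message(
        {"type": "SCHEMA", "stream": STREAM, "schema": SCHEMA, "key_properties": ["id"]},
        out,
    )
    bookmark = since
    for row in fetch_invoices(since):
        write_message({"type": "RECORD", "stream": STREAM, "record": row}, out)
        bookmark = max(bookmark, row["updated_at"])
    # The STATE value is free-form; a bookmarks dict is a common convention.
    write_message(
        {"type": "STATE", "value": {"bookmarks": {STREAM: {"updated_at": bookmark}}}},
        out,
    )


if __name__ == "__main__":
    run_tap()
```

The output of a script like this is meant to be piped into a Singer target, which handles the load side. The extractor itself is small; the effort goes into pagination, auth, rate limits and keeping the state bookmark honest, which is where the decisions above about orchestration and hosting start to matter.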

Closing

In closing, I broadly see the below chart as usefully inflammatory and marginally useful.

As Data Science gave way to Data Engineering enthusiasm, I'll say that Data Engineering enthusiasm possibly will have to give way to Analytics, currently called Analytics Engineering.

Following this will be the traditional Data Analyst role, in whatever new guise, which will make some resurgence.

However, the core Data Engineering skill-set, technological awareness and systems thinking, will remain vitally important, but perhaps not in the historical and existing notion of a Data Engineer.

Appendix

Questions to ponder, hit the comments if you have some thoughts:

Will data science re-emerge now that the data wrangling tooling is getting so much better? What will this do to the hierarchy of data science? Maybe ML Engineer is a better candidate.
Where does MLOps sit? It has largely felt disconnected from the “Modern Data Stack”.
The enterprise dynamic is entirely different. Enterprise companies will need ETL engineers until the heat death of the sun, and no I don’t want to hear about it.
Is training Data Engineers a lost cause, along with training Data Scientists and Front-end devs?
Remember that 92% of startups disappear, but while we are stealing fun from tomorrow, we can satisfy ourselves knowing that someone will get it right, but for someone to be right someone else must be wrong.

Please consider subscribing for more on the subject of data systems thinking

What is group by 1

Who is Matt Arderne