
The Data Scientist’s Dilemma
In this installment of Data Engineering Demystified, I want to talk about the data scientist’s dilemma, and an enduring principle that refuses to go away no matter how advanced our tools get: garbage in, garbage out. See also Part 1: Why Data Engineering Feels Like a Black Box, and Part 2: The Upstream World: Software Engineers and Product Team.

Data scientists, machine learning engineers, and ML researchers are often handed an impossible mandate. They’re expected to wave a magic wand and make revenue go up, conversion rates double, costs disappear, and risks evaporate. They’re asked to uncover hidden value buried deep in the data and to do it fast.
That expectation exists largely because machine learning still feels mysterious to many organizations. The algorithms, the models, the predictive methods: all of it gets treated like a black box. People don’t really know how they work, only that they’ve heard stories about what they can do.
When Harvard Business Review famously labeled data scientist “the sexiest job of the 21st century,” it poured fuel on that perception. Suddenly, everyone wanted data scientists. Companies rushed to post job descriptions, often without a clear understanding of what they were actually hiring for or what success would require once that person showed up.
When Data Scientists Are Really Data Engineers
Here’s the uncomfortable truth: a lot of data scientists end up doing the job of a data engineer.
This isn’t because they lack skill. It’s because many organizations hire for machine learning before they’ve built the foundations needed to support it. They bring in someone with a strong statistics or modeling background and ask them to build ML models, fine-tune algorithms, and train predictive systems.
Then reality hits. The data isn’t ready.
Instead of modeling, that person spends 80% to 90% of their time cleaning data. Writing SQL. Writing Python. Fixing schemas. Transforming raw inputs into something that’s even remotely usable. This pattern is so common it’s almost a rite of passage.
In fact, many of the earliest data science roles were essentially data engineering roles. Lots of extraction. Lots of transformation. Feature engineering just to get to a baseline where basic descriptive or predictive signals could exist.
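That “cleaning and transforming” work is rarely glamorous. As a rough sketch (the table and column names here are hypothetical, not from any real system), it often looks like this in pandas: parse timestamps, coerce types, drop rows with missing keys, and deduplicate before any modeling can happen.

```python
import pandas as pd

# Hypothetical raw export: stringly-typed columns, duplicates, missing keys.
raw = pd.DataFrame({
    "customer_id": ["42", "42", "7", None],
    "event_time": ["2024-01-03 10:00", "2024-01-03 10:00",
                   "2024-01-04 09:30", "2024-01-05 12:00"],
    "amount": ["19.99", "19.99", "5", "3.50"],
})

def clean_events(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal cleaning pass: fix types, drop unusable rows, deduplicate."""
    out = df.copy()
    out["event_time"] = pd.to_datetime(out["event_time"])  # strings -> timestamps
    out["amount"] = out["amount"].astype(float)            # strings -> numbers
    out = out.dropna(subset=["customer_id"])               # no key, no join, no model
    out["customer_id"] = out["customer_id"].astype(int)
    return out.drop_duplicates()                           # exact duplicate events

clean = clean_events(raw)
print(clean)
```

None of this is modeling, yet all of it has to happen first, which is exactly the point.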
That’s honestly how I started, too. I thought I wanted to be a data scientist for a long time. But while doing that work, I realized I really enjoyed the engineering side of it. I liked building the infrastructure. I liked creating the foundation that made everyone else’s work easier. If you do that part well, everything downstream gets better.
Data Scientists Thrive on Prepared Data
In the previous installment, we talked about data analysts. Today, the focus is on data scientists, and the conditions they need to do their best work.
When prepared data assets are already in place, everything changes. When descriptive features already exist (transaction counts per customer, behavioral aggregates, meaningful time-based metrics), data scientists can actually do what they’re trained to do.
That’s when the “black magic” starts to look real.
With the right inputs, they can build models that increase conversion, drive revenue, reduce expenses, mitigate risk, and flag fraud before it happens. Those legendary wins you hear about don’t come from chaos. They come from mature data organizations that have already solved the fundamentals.
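The prepared features described above, per-customer counts and time-based aggregates, can be sketched with a simple groupby. This is a hedged illustration with hypothetical column names, not a prescription for any particular stack:

```python
import pandas as pd

# Hypothetical cleaned transaction table, already prepared upstream.
txns = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "txn_time": pd.to_datetime(
        ["2024-01-01", "2024-01-15", "2024-02-01", "2024-01-10", "2024-03-01"]
    ),
    "amount": [20.0, 35.0, 15.0, 100.0, 80.0],
})

# Descriptive features: counts, behavioral aggregates, time-based metrics.
features = txns.groupby("customer_id").agg(
    txn_count=("amount", "size"),
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    days_active=("txn_time", lambda s: (s.max() - s.min()).days),
).reset_index()

print(features)
```

When a table like `features` already exists, the data scientist starts at the modeling step instead of building this pipeline from scratch.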
That’s why this series matters. We’re walking through every role upstream and downstream of the data engineer to show how interconnected they really are. One role without the others is a dead end. Even if an organization doesn’t formally staff every role, the functions still exist, and they serve different purposes.
The Cost of Skipping the Foundations
If a data scientist joins an organization with weak foundations, they need to be clear-eyed about what lies ahead. Before predictive models, there will be data munging. Before forward-looking insights, there will be cleaning, transforming, reporting, and often even dashboarding.
Only after all of that can they begin the work they were hired to do.
When they finally get there, the work looks very different from what most people imagine. Data scientists often go into deep focus for weeks or months at a time. They think like detectives. They examine every possible interaction that might influence an outcome.
If the goal is increasing conversion, they’ll analyze how long a user stays on a page, how small design changes affect behavior, whether an animation or loading indicator buys three extra seconds of attention. Sometimes those three seconds increase completion rates by 15 or 20 percent, which might translate to a few percentage points overall. At scale, that can mean millions of dollars.
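To make that arithmetic concrete, here is a back-of-envelope version of the claim with entirely hypothetical numbers: if a checkout step completes 60% of the time and a small UX change lifts that step by 15% relative, the overall funnel moves by a few percentage points, and at scale the revenue difference reaches into the millions.

```python
# Hypothetical funnel: every number here is illustrative, not from the article.
visitors = 10_000_000          # annual visitors entering the funnel
reach_checkout = 0.30          # fraction who reach the checkout step
complete_checkout = 0.60       # baseline completion rate at that step
avg_order_value = 50.0         # dollars per completed order

baseline_orders = visitors * reach_checkout * complete_checkout
lifted_orders = visitors * reach_checkout * (complete_checkout * 1.15)  # 15% relative lift

extra_orders = lifted_orders - baseline_orders
overall_lift_pts = extra_orders / visitors * 100   # percentage points, whole funnel
extra_revenue = extra_orders * avg_order_value

print(f"Extra orders: {extra_orders:,.0f}")
print(f"Overall conversion lift: {overall_lift_pts:.1f} percentage points")
print(f"Extra revenue: ${extra_revenue:,.0f}")
```

A 15% lift on one step becomes roughly a 2.7 percentage-point lift overall and about $13.5M in extra revenue under these assumed numbers, which is the kind of leverage the detective work is chasing.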
But that level of thinking requires space. It requires focus. And it requires not being buried in data cleaning.
Plumbing, Then Cooking
Using the analogy from earlier installments, data engineering is the plumbing for the house. It ensures the water comes out clean, predictable, and usable. The right temperature. The right pressure.
Data scientists are what happens next. They take that water and cook with it. They create dishes you wouldn’t expect. Michelin-star results built from reliable ingredients.
They don’t need to worry about sourcing the water or filtering it themselves. They can focus on making something meaningful, something that solves a real business problem, delights users, increases revenue, or reduces cost.
Strong data engineering unlocks innovation with machine learning. Without it, data scientists get stuck doing everyone else’s job.
The Role Is Changing, but the Rule Isn’t
The data science and ML engineering landscape is changing fast. With the rise of AI and large language models, many practitioners who once focused on forecasting or classical modeling are now fine-tuning LLMs for specific domains like finance, compliance, or manufacturing.
But the foundation hasn’t changed.
Even generative AI runs on data. Clean data. Prepared data. Well-structured inputs. Without that, models get overwhelmed by noise and useless context. A long, rambling input is hard to turn into anything actionable until it’s synthesized and transformed.
That synthesis is still the job of data engineering.
Whether we’re talking about traditional machine learning or modern generative AI, the base remains the same. If you don’t want garbage models and garbage outputs, you have to make sure garbage never goes in.
That’s the data scientist’s dilemma. And that’s why strong data engineering matters more than ever, even as roles evolve, merge, and hybridize in the years ahead.
Thanks for joining us. Hope to see you for the next installment.



