What It’s Really Like to Use AI for Data Prep

Author: Robert Rouse

Today’s AI tools can save time on a wide range of everyday tasks: writing code, reshaping data, or even turning a rough outline into a readable post like this one (with help from a talented human editor, of course, to make sure it’s accurate, coherent, and worth reading by other humans).*

In recent updates, Dataiku rolled out AI-enabled features meant to help analysts build data prep flows more efficiently. I decided to test one: the Generate Steps With AI feature in a Prepare recipe. I wanted to see if it could clean up a messy dataset by automating common prep tasks. Here’s how that went.

[*The human editor made me add that.]

Prompt, Guess, Repeat

I started with World Bank Open Data—structured with years as column headers, a classic Excel-friendly but analysis-hostile format. I needed to convert it from wide to tall format: one column called “Year” with its corresponding values in a “Value” column. If you want to know why that structure matters, this Action video on Tidy Data will help.

My first prompt:

“I want to pivot the columns labeled with numbers into a taller dataset where the numbers are in rows under a field called ‘Year’ and the values are in a field called ‘Value’.”

The result? A confusing stack of “split” actions. None worked.

Next attempt:

“Convert all the columns to the right of the ‘indicator code’ field into rows.”

I expected the AI to “fold” those columns. Instead, it split the indicator code field and folded that. After a couple more tries, I ditched the AI and manually created a fold step. There, I found a “Find with Smart Pattern” feature that correctly used regex to select columns named like years. That worked.
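For readers who think in code, here is a minimal Python sketch of what the fold step does, assuming rows arrive as dictionaries. The sample rows, the values in them, and the function name `fold_year_columns` are made up for illustration, and the regex is my guess at the kind of pattern the smart-pattern feature produced, not its actual output.

```python
import re

# Hypothetical sample rows shaped like a wide export (values are invented).
wide_rows = [
    {"Country Name": "Country A", "Indicator Code": "IND.1", "2019": 100, "2020": 110},
    {"Country Name": "Country B", "Indicator Code": "IND.1", "2019": 200, "2020": 220},
]

# Assumed pattern: a column name made of exactly four digits is a year.
YEAR_COLUMN = re.compile(r"\d{4}")


def fold_year_columns(rows):
    """Turn every year column into its own row with 'Year' and 'Value' fields."""
    tall = []
    for row in rows:
        # Non-year columns are repeated on each output row.
        id_fields = {k: v for k, v in row.items() if not YEAR_COLUMN.fullmatch(k)}
        for column, value in row.items():
            if YEAR_COLUMN.fullmatch(column):
                tall.append({**id_fields, "Year": column, "Value": value})
    return tall


tall_rows = fold_year_columns(wide_rows)
```

Two wide rows with two year columns each become four tall rows, one per country and year.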

Takeaway: AI won’t help unless you speak the local dialect. What I called “pivot” or “unpivot,” Dataiku calls “fold.” That vocabulary mismatch tripped things up early.

A Win for Column Generation

Once I had a tidy dataset, I needed to prepend “YR” to each year value to match other datasets.

Prompt:

“Create a new column that adds ‘YR’ in front of the values in the ‘Year’ column.”

It nailed it:

"YR" + toString(val("Year"))

Text Splits and Custom Logic

Some values in the dataset were structured like this:

Economy: Employment

Prompt:

“Split the topic column into new columns based on the ‘:’ separator.”

It worked as expected—clean two-column output.
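A rough Python equivalent of that split, as a sketch only. The function name `split_topic` is hypothetical, and trimming the space after the colon and padding rows that have no separator are my assumptions about sensible behavior, not a description of what Dataiku does internally.

```python
def split_topic(topic):
    """Split a 'Category: Subcategory' string into two column values."""
    # Split on the first ':' only, then trim surrounding whitespace.
    parts = [part.strip() for part in topic.split(":", 1)]
    # Pad with an empty string so every row yields exactly two columns.
    while len(parts) < 2:
        parts.append("")
    return parts


category, subcategory = split_topic("Economy: Employment")
```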

Then I noticed blank values in the Topic field for some emissions metrics. I tried this:

“If the topic is null, check if the indicator name contains ‘emissions.’ If so, assign ‘Environment: Emissions’ as the Topic.”

And got this:


if (isNull(val("Topic")),
    if (val("Indicator Name").contains("emissions."),
        "Environment: Emissions",
        val("Topic")),
    val("Topic"))

That would’ve taken a few iterations to get right manually. The AI gave me a solid starting point in one shot.
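One thing to check before trusting it: the generated formula searches for "emissions." with a trailing period, which looks like it was carried over from the punctuation in my prompt. Here is the same logic as a small Python sketch, assuming blanks arrive as None or an empty string and that the match is case-sensitive. The function name `fill_topic` and the sample indicator names are illustrative only.

```python
def fill_topic(topic, indicator_name, needle="emissions"):
    """Fill a missing topic when the indicator name mentions emissions."""
    # Assumption: both None and "" count as a missing topic.
    if topic is None or topic == "":
        # Case-sensitive substring check, like the generated formula.
        if needle in indicator_name:
            return "Environment: Emissions"
    # Otherwise leave the topic exactly as it was.
    return topic


filled = fill_topic(None, "CO2 emissions (kt)")
# With the period included, the same indicator name no longer matches.
unfilled = fill_topic(None, "CO2 emissions (kt)", needle="emissions.")
```

The search string is a parameter here so the two variants are easy to compare on real indicator names.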

A Pattern Recognition Game

Finally, I tried to normalize units of measure. The column included all of these:

  • “Percent”
  • “Percentage”
  • “% of Total”
  • Blank/null (when indicator included “Percent”)
  • Just “%”

Prompt:

“If the unit of measure column is null, set it as ‘%’ if the indicator name column contains ‘%.’ If the unit is ‘Percent’, ‘Percentage’, or ‘% of Total’, replace it with ‘%’.”

Result: “We couldn’t find what you meant.”

Once I broke it into steps, it worked. Bonus: Dataiku logs the original prompt next to the step. That made it easy to troubleshoot.
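Here is one plausible way to express that two-step split, sketched in Python rather than as Dataiku steps. The function names are hypothetical, the blank check treats None and empty strings alike, and the rule follows the prompt (the indicator name contains "%"), which is an assumption about how the working steps were phrased.

```python
PERCENT_SYNONYMS = {"Percent", "Percentage", "% of Total"}


def fill_missing_unit(unit, indicator_name):
    """Step 1: a blank unit becomes '%' only if the indicator name mentions '%'."""
    if not unit and "%" in indicator_name:
        return "%"
    return unit


def standardize_unit(unit):
    """Step 2: collapse the known spellings of percent into '%'."""
    return "%" if unit in PERCENT_SYNONYMS else unit


def normalize_unit(unit, indicator_name):
    # Run the two steps in order, as two separate prompts would.
    return standardize_unit(fill_missing_unit(unit, indicator_name))
```

Each function maps to one prompt, so a failure points at one small rule instead of a compound sentence.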

What I Learned

Dataiku’s AI Prepare works well when:
✔️ You know what you want but don’t want to dig through menus
✔️ You’re creating or manipulating single columns
✔️ You’re okay experimenting and retrying
✔️ You know the platform’s internal vocabulary

It struggles when:
⚠️ The outcome is visual, but your prompt isn’t
⚠️ You’re trying to combine too many steps at once
⚠️ You expect the AI to grasp the dataset before you do
⚠️ You use terms that don’t match platform conventions

AI Isn’t the Point, Clarity Is

At Action, the goal isn’t just to use AI. It’s to build systems that help people think better with data. Sometimes that means using AI. Sometimes it means writing your own regex.

In both cases, clarity is the killer app. Whether you’re working with a teammate or an AI assistant, ambiguity will always slow you down.

Know what you mean. Say it clearly. And don’t be afraid to fold it manually.