2 Years of ML vs. 1 Month of Prompting
November 7, 2025
Recalls at major automakers cost hundreds of millions of dollars a year. To mitigate this, our company created an analytics department focused solely on categorizing warranty claims into actionable problems.
For decades, this team has relied on SQL queries to classify warranty data. But vehicles—and the language used to describe them—have evolved. SQL struggles with semantics, negations, and contextual nuance. Here’s a fictional example of a claim we might see in the wild:
“Customer reports oil on driveway, thought engine leak. Detailed inspection found no engine leaks. Traced oil to spill during last oil change. Oil on subframe dripping to ground. Cleaned subframe, verified no leaks from engine or drain plug. Customer advised.”
An oversimplified SQL query might try to capture this scenario:
SELECT
    claim_id,
    claim_text,
    CASE
        WHEN (
                LOWER(claim_text) LIKE '%leak%'
                OR LOWER(claim_text) LIKE '%leaking%'
                OR LOWER(claim_text) LIKE '%seep%'
            )
            AND (
                LOWER(claim_text) LIKE '%oil%'
                OR LOWER(claim_text) LIKE '%fluid%'
            )
            AND LOWER(claim_text) NOT LIKE '%no leak%'
            AND LOWER(claim_text) NOT LIKE '%not leaking%'
        THEN 1
        ELSE 0
    END AS is_leak
FROM warranty_claims;
What we can gather from this example is that the oil came from a service spill—not a vehicle leak. Yet this query would still flag the claim as a leak. In production, queries like this balloon into hundreds—if not thousands—of similar clauses. Over the years, the team created thousands of classification buckets, and many of these legacy buckets still siphon off claims today, creating unnecessary work for analysts and slowing the detection of new issues.
The classification project
In 2023, the company launched a major initiative to automate warranty classification using supervised models. Here’s how that went:
Data Collection: The first challenge was establishing a ground truth. Each team member had a different mental model of how claims should be categorized. After months of discussion, the team finally aligned on a set of core “symptoms” to categorize warranty claims by. Then came the hard part: manually labeling thousands of complex claims per symptom—work that only domain experts could handle. After many months, we had labeled just half the symptoms.

Preprocessing: Raw warranty text is messy—full of acronyms, error codes, and multilingual input. Consider:

“cust reports mil on with p0420. tech found a/c compressor clutch noise at idle. checked map sensor, reading normal. replaced cat converter per tsb. dtc cleared, road test ok.”

Translation: Customer reports a check engine light. Technician found an unrelated A/C compressor issue. The catalytic converter was replaced per a technical service bulletin. Problem resolved.

We built a nine-stage preprocessing pipeline: text sanitization, concatenation, tokenization, acronym expansion, stop-word removal, spell checking, service bulletin extraction, diagnostic code parsing, and translation (a sketch of a few of these stages appears below). That took another 6 months. Fun fact: translating French and Spanish claims into German first improved technical accuracy—an unexpected perk of Germany’s automotive dominance.

Modeling: We tried multiple vectorization and classification approaches. Our data was heavily imbalanced, skewed toward negative cases. TF-IDF with 1-gram features paired with XGBoost consistently emerged as the winner; a sketch of that setup follows the preprocessing example below. See the PR curve attached below [1].
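To make the preprocessing concrete, here is a minimal sketch of what a few of those stages might look like in Python: text sanitization, acronym expansion, and diagnostic code parsing. The acronym map, the function names, and the pandas layout are illustrative assumptions, not our actual pipeline.

import re
import pandas as pd

# Illustrative shop-shorthand map; the real dictionary was domain-curated
# and far larger.
ACRONYMS = {
    "cust": "customer",
    "mil": "malfunction indicator lamp",
    "a/c": "air conditioning",
    "cat": "catalytic",
    "tsb": "technical service bulletin",
    "dtc": "diagnostic trouble code",
}

# OBD-II codes look like P0420: one letter (P/B/C/U) plus four digits.
DTC_PATTERN = re.compile(r"\b[pbcu]\d{4}\b", re.IGNORECASE)

def sanitize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def expand_acronyms(text: str) -> str:
    """Replace known shop shorthand with full terms."""
    for short, full in ACRONYMS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    return text

def extract_dtcs(text: str) -> list[str]:
    """Pull diagnostic trouble codes like p0420 into their own field."""
    return DTC_PATTERN.findall(text)

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["clean_text"] = df["claim_text"].map(sanitize).map(expand_acronyms)
    df["dtc_codes"] = df["clean_text"].map(extract_dtcs)
    return df

claims = pd.DataFrame({"claim_text": [
    "cust reports mil on with p0420. replaced cat converter per tsb."
]})
print(preprocess(claims)[["clean_text", "dtc_codes"]])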
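And here is a minimal sketch of the winning modeling setup, assuming scikit-learn and xgboost. The toy data, the hyperparameters, and the scale_pos_weight imbalance correction are illustrative assumptions; only the combination of 1-gram TF-IDF features with XGBoost comes from our actual experiments.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Toy stand-ins for the labeled corpus; the real set had thousands of
# expert-labeled claims per symptom. 1 = symptom ("leak") present.
texts = [
    "oil leaking from engine drain plug",
    "oil on subframe traced to spill during service no leaks found",
    "coolant leak at water pump gasket",
    "customer reports rattle no fluid leak observed",
    "oil seep at valve cover gasket",
    "brake noise at low speed no leaks",
    "transmission fluid leaking at pan gasket",
    "wind noise from door seal no fluid leak",
] * 25  # repeated so the stratified split has enough rows
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0] * 25)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# 1-gram TF-IDF features, as in the winning configuration.
vectorizer = TfidfVectorizer(ngram_range=(1, 1), min_df=2)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Counter the negative-heavy imbalance by upweighting positives.
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()
model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    scale_pos_weight=neg / pos,
    eval_metric="aucpr",  # a PR-based metric suits imbalanced data
)
model.fit(X_train_vec, y_train)

# Average precision summarizes the PR curve in a single number.
scores = model.predict_proba(X_test_vec)[:, 1]
print("average precision:", average_precision_score(y_test, scores))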