Eighty-seven percent of AI projects never reach production (Gartner, 2022). The usual suspect is the model. The actual culprit, almost every time, is the data feeding it.
Founders pour months into model selection, architecture debates, and vendor evaluations. Then the product launches and the predictions are wrong, the recommendations are irrelevant, or the chatbot hallucinates answers that sound confident and mean nothing. The root cause traces back to the same place: the training data was incomplete, mislabeled, stale, or simply not the right data for the problem.
This article walks through what data you actually need before you write a single line of code, how much of it, how clean it has to be, and where to get it if you are starting from zero.
What types of data can power different kinds of AI products?
Every AI product consumes one or more of four broad data types: text, numbers, images, or user behavior logs. The product you want to build determines which types you need, and mixing that up is the fastest way to waste six figures on a project that never works.
A customer support chatbot needs thousands of real conversations between your support team and your customers. Not FAQ pages. Not marketing copy. Actual back-and-forth exchanges where a customer describes a problem and a human resolves it. The model learns from that resolution pattern.
A demand forecasting tool needs historical transaction records: what sold, when, how much, and alongside what external factors (weather, holidays, promotions). McKinsey's 2022 analysis found that companies using demand forecasting with clean historical data reduced inventory costs by 20-30%.
A fraud detection system needs labeled examples of both legitimate and fraudulent transactions. The tricky part: fraud is rare. In a typical financial dataset, fewer than 2% of transactions are fraudulent (Nilson Report, 2022). Your model needs enough examples of both to learn the difference.
| AI Product Type | Primary Data Needed | Minimum Useful Volume | Example Source |
|---|---|---|---|
| Customer support chatbot | Conversation transcripts | 2,000-5,000 exchanges | Help desk software exports |
| Demand forecasting | Transaction + time-series records | 12-24 months of history | POS system, ERP exports |
| Fraud detection | Labeled transactions (legit + fraud) | 10,000+ with at least 200 fraud cases | Payment processor logs |
| Recommendation engine | User behavior logs (views, clicks, purchases) | 50,000+ user interactions | Analytics platform, app events |
| Document classifier | Categorized documents | 500-1,000 per category | Internal file systems, email archives |
| Image recognition | Labeled photographs | 1,000-5,000 per object class | Camera feeds, uploaded photos |
The pattern: your data must mirror the decision your product will make. If the product recommends items, you need records of what people chose. If the product classifies documents, you need documents that are already sorted. If the product predicts churn, you need records of customers who stayed and customers who left, with every data point you had about them before they made that choice.
How does a machine learning model turn raw data into predictions?
A machine learning model is a pattern-matching engine. You feed it thousands of examples where you already know the right answer, and it figures out which patterns in the input data predict that answer. Then, when it sees new data without an answer, it applies those patterns.
Think of it like training a new employee. You sit them next to an experienced rep and let them watch 5,000 customer calls. They start noticing patterns: angry tone plus billing keyword usually means refund request. Short call plus tracking number usually means delivery check. After enough examples, they can handle new calls on their own because they have internalized the patterns.
That is what a model does with numbers. It finds correlations between inputs and outputs, weighted by how reliably each correlation predicts the correct answer. Stanford's 2022 AI Index report found that modern models can identify patterns across hundreds of variables simultaneously, something no human analyst could do manually.
Why does this matter for your data strategy? Because the model can only learn patterns that exist in your data. If your sales records do not include weather data, the model will never learn that umbrella sales spike when it rains. If your customer data does not include how long someone has been a customer, the model cannot learn that tenure predicts loyalty. The model is only as smart as the data you train it on.
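To make the correlation idea concrete, here is a toy Python sketch, not a real model: the ticket texts and categories are invented, and real models use far better math. It "trains" by counting which words co-occur with which label, then scores new text against those counts:

```python
from collections import Counter, defaultdict

# Hypothetical labeled examples: (ticket text, correct category).
examples = [
    ("refund for double billing charge", "billing"),
    ("charged twice on my card", "billing"),
    ("package tracking number not updating", "delivery"),
    ("where is my order tracking says pending", "delivery"),
]

# "Training": count how often each word co-occurs with each label.
word_label_counts = defaultdict(Counter)
label_counts = Counter()
for text, label in examples:
    label_counts[label] += 1
    for word in set(text.lower().split()):
        word_label_counts[word][label] += 1

def predict(text):
    """Score each label by how strongly the input's words correlate with it."""
    scores = Counter()
    for word in set(text.lower().split()):
        for label, count in word_label_counts[word].items():
            scores[label] += count / label_counts[label]
    return scores.most_common(1)[0][0] if scores else None

print(predict("charged twice need a refund"))  # billing
```

A production model does the same thing across hundreds of variables with learned weights, but the core loop is identical: find patterns in labeled examples, then apply them to new inputs.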
A 2022 IBM survey found that poor data quality costs organizations an average of $12.9 million per year. For AI products specifically, the cost is higher because bad data does not just produce bad reports. It produces confidently wrong predictions that people act on.
How much data is enough to train or fine-tune a model?
The honest answer: it depends on the complexity of the task. But "it depends" is not useful, so here are concrete ranges based on what works in production.
For classification tasks (sorting things into categories), you need a minimum of 500-1,000 labeled examples per category. A spam filter needs at least 1,000 spam emails and 1,000 legitimate emails. A sentiment analyzer needs at least 1,000 positive reviews and 1,000 negative reviews. Research from Google Brain (2022) showed that model accuracy improves sharply up to about 5,000 examples per category, then gains slow down.
For prediction tasks (forecasting numbers), you need 12-24 months of historical data at minimum. A demand forecasting model trained on three months of data will miss seasonal patterns entirely. Training on two years gives the model a chance to see every season twice and learn which patterns repeat.
For text generation (chatbots, content tools), the situation changed dramatically in late 2022 with the release of large language models. Instead of training from scratch, you can now fine-tune a pre-built model on a much smaller dataset. Fine-tuning a large language model for a specific domain requires as few as 500-2,000 high-quality examples (OpenAI, 2022).
| Task Type | Minimum Data | Recommended Data | Why This Range |
|---|---|---|---|
| Binary classification (yes/no) | 1,000 labeled examples | 5,000-10,000 | Model needs enough examples of both classes |
| Multi-class classification | 500 per class | 2,000-5,000 per class | Rare classes need oversampling if underrepresented |
| Numerical prediction | 12 months of history | 24+ months | Must capture seasonal and cyclical patterns |
| Fine-tuning a language model | 500 domain-specific examples | 2,000-5,000 | Quality matters more than volume at this scale |
| Recommendation engine | 50,000 interactions | 200,000+ | Sparse data (few interactions per user) needs volume |
| Image recognition | 1,000 per object class | 5,000+ per class | Varies with visual complexity of objects |
Here is the practical takeaway: you probably need less data than you think, and you probably have more than you realize. Most companies sit on years of transaction logs, support tickets, and user behavior data that has never been organized for training purposes. A good data audit often reveals that the raw material is already there. It just needs cleaning and labeling.
What data quality problems cause AI products to fail?
Bad data does not just make models less accurate. It makes them confidently wrong. And a product that gives wrong answers with high confidence is worse than one that admits it does not know.
Missing values are the most common problem. If 30% of your customer records are missing the "industry" field, the model either ignores industry entirely or learns the wrong relationship between industry and whatever you are predicting. MIT researchers found in 2022 that datasets with more than 20% missing values in any single column produced models with 15-40% lower accuracy than datasets with complete records.
Label errors are more dangerous because they are harder to spot. If a human reviewer tagged 5% of your training examples incorrectly, the model learns those mistakes as truth. A 2021 study from MIT found systematic label errors in 10 widely used AI benchmarks, including datasets used to train commercial products. When they corrected the labels, model performance improved by up to 6 percentage points without changing anything else.
Class imbalance trips up nearly every first-time AI builder. If your fraud detection dataset contains 98% legitimate transactions and 2% fraud, a model can achieve 98% accuracy by simply labeling everything as legitimate. It looks great on paper and catches zero fraud. The fix is to oversample the minority class, undersample the majority, or evaluate with metrics that account for imbalance, such as precision and recall, instead of raw accuracy.
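As a minimal illustration of the oversampling fix, here is a sketch with invented records. Real projects typically reach for a library such as imbalanced-learn, which also implements smarter techniques like SMOTE; random duplication is the simplest version of the idea:

```python
import random

random.seed(0)

# Hypothetical imbalanced dataset: 98 legitimate records, 2 fraud records.
legit = [{"amount": 20 + i, "label": "legit"} for i in range(98)]
fraud = [{"amount": 9000 + i, "label": "fraud"} for i in range(2)]

# Random oversampling: duplicate minority examples until the classes balance.
oversampled_fraud = [random.choice(fraud) for _ in range(len(legit))]
balanced = legit + oversampled_fraud
random.shuffle(balanced)

counts = {"legit": 0, "fraud": 0}
for row in balanced:
    counts[row["label"]] += 1
print(counts)  # {'legit': 98, 'fraud': 98}
```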
| Quality Problem | How It Breaks Your Model | How to Detect It | Fix |
|---|---|---|---|
| Missing values (>20% in a column) | Model ignores that variable or learns distorted patterns | Run a completeness check on every column | Fill gaps with median values, or drop the column if too sparse |
| Label errors (>3% mislabeled) | Model learns incorrect patterns as truth | Have two independent reviewers label a sample, compare disagreements | Re-label the disputed examples, audit the full dataset if error rate is high |
| Class imbalance (>10:1 ratio) | Model predicts the majority class almost every time | Check the distribution of your target variable | Oversample the minority class or use specialized evaluation metrics |
| Stale data (>6 months old for fast-changing domains) | Model learns patterns that no longer hold | Compare recent real-world outcomes to model predictions | Retrain quarterly or set up continuous learning |
| Duplicate records | Model overweights repeated examples | Count exact duplicates and near-duplicates | Deduplicate before training |
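The detection column above can largely be automated. A minimal Python sketch of the completeness and duplicate checks, run against an invented four-row dataset:

```python
import csv
import io

# Hypothetical export with one missing "industry" value and one duplicate row.
raw = """id,industry,churned
1,retail,no
2,,yes
3,finance,no
3,finance,no
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# Completeness check: fraction of missing values per column.
for col in rows[0]:
    missing = sum(1 for r in rows if not r[col].strip())
    print(col, f"{missing / len(rows):.0%} missing")

# Duplicate check: exact duplicates by full-row content.
seen, dupes = set(), 0
for r in rows:
    key = tuple(r.values())
    dupes += key in seen
    seen.add(key)
print("exact duplicates:", dupes)  # exact duplicates: 1
```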
A Gartner study from 2022 found that organizations that invested in data quality before building AI products were 2.5 times more likely to reach production deployment. The founders who skip this step end up rebuilding six months later with twice the budget.
Where can I find usable data if I do not have my own?
Starting from zero is more common than most founders admit. If your company is pre-launch or has been operating without systematic data collection, you have three practical options.
Public datasets are the fastest starting point. Kaggle hosts over 50,000 datasets across every industry. The US government's data.gov publishes economic and demographic data. The World Bank, EU Open Data Portal, and dozens of academic institutions publish freely usable datasets. A 2022 survey by Databricks found that 40% of production AI models incorporated at least one public dataset during initial development.
Synthetic data generation has become a credible alternative for companies that cannot share real customer information. Instead of using real records, you create artificial records that share the statistical properties of real data without containing any actual personal information. Gartner predicted in 2022 that by 2024, 60% of the data used for AI would be synthetically generated. That prediction tracked close to reality. Synthetic data works especially well for training fraud detection models, where real fraud examples are scarce and sensitive.
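The simplest form of the idea fits in a few lines: fit summary statistics on a real numeric column, then sample artificial values from them. Real synthetic-data tools model joint distributions and correlations across many columns; this single-column Gaussian sketch, with invented amounts, only illustrates the principle:

```python
import random
import statistics

random.seed(42)

# Hypothetical real transaction amounts (sensitive, cannot be shared).
real_amounts = [12.5, 48.0, 33.2, 27.9, 55.1, 19.4, 41.7, 30.8]

# Fit simple statistics, then sample artificial records from them.
mu = statistics.mean(real_amounts)
sigma = statistics.stdev(real_amounts)
synthetic_amounts = [round(random.gauss(mu, sigma), 2) for _ in range(1000)]

# The synthetic column preserves the distribution's shape, not any real record.
print(round(statistics.mean(synthetic_amounts), 1))
```

The validation warning from the table applies here: before training on synthetic records, confirm their statistics actually match the real data they are meant to stand in for.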
Data partnerships with complementary businesses can fill gaps that public data cannot. A logistics startup might partner with a weather data provider to improve delivery time predictions. A retail AI company might license anonymized transaction data from payment processors. These partnerships cost $5,000-$50,000 per year depending on the dataset, which is a fraction of the cost of collecting equivalent data yourself.
| Data Source | Cost | Best For | Watch Out For |
|---|---|---|---|
| Your own operational data | Free (already collected) | Any AI product built around your business | Gaps, labeling effort, privacy compliance |
| Public datasets (Kaggle, data.gov) | Free | Prototyping, benchmarking, supplementing your own data | May not match your specific domain closely |
| Synthetic data generation | $2,000-$15,000 setup | Privacy-sensitive domains, rare event modeling | Must validate that synthetic patterns match real ones |
| Data partnerships / licensing | $5,000-$50,000/year | Specialized datasets you cannot collect yourself | Contract terms, exclusivity, data freshness |
| Web scraping | $1,000-$10,000 setup | Market research, competitive intelligence | Legal gray areas, robots.txt compliance, data quality |
| User-generated (incentivized collection) | $0.10-$2.00 per labeled example | Building labeled datasets from scratch | Quality control, bias from incentive structure |
A practical starting path for most founders: begin with whatever internal data you have, supplement it with one relevant public dataset, build a prototype, and use the prototype's performance to identify exactly which data gaps to fill next. Trying to assemble the perfect dataset before writing any code is the planning equivalent of analysis paralysis.
How do I clean and label data so a model can learn from it?
Raw data is never ready for training. Even well-maintained databases contain inconsistencies that will confuse a model. The cleaning and labeling process typically consumes 60-80% of the total time spent on an AI project (CrowdFlower/Appen, 2022). Founders who budget two weeks for data preparation and two months for model building should reverse those numbers.
Cleaning means standardizing formats, handling missing values, removing duplicates, and fixing obvious errors. Dates stored as "12/9/2022" in one column and "2022-12-09" in another will confuse any model that tries to learn time-based patterns. Customer names with inconsistent capitalization, addresses with varying abbreviations, currencies mixed between dollars and euros: all of these need standardization before training.
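A minimal sketch of date standardization in Python. It assumes you know which formats your source systems emit; an ambiguous date like 12/9/2022 can only be resolved by knowing the source convention, so the format list and its order below are assumptions:

```python
from datetime import datetime

# Hypothetical export with inconsistent date formats across source systems.
raw_dates = ["12/9/2022", "2022-12-09", "09.12.2022"]

# Known source formats, in priority order (assumed for this example).
FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%d.%m.%Y"]

def to_iso(value):
    """Try each known format until one parses; standardize to ISO 8601."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value}")

print([to_iso(d) for d in raw_dates])
# ['2022-12-09', '2022-12-09', '2022-12-09']
```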
Labeling means telling the model what the correct answer is for each example in your training data. If you are building a support ticket classifier, someone has to read each ticket and tag it: billing issue, technical problem, feature request, complaint. If you are building a sentiment analyzer, someone has to mark each review as positive, negative, or neutral.
The cost of labeling depends on complexity. Simple binary labels (spam/not spam) cost $0.05-$0.10 per example through services like Amazon Mechanical Turk or Scale AI. Complex labels that require domain expertise (medical image annotation, legal document classification) cost $1-$5 per example because you need trained specialists. At 10,000 examples, that is the difference between $500 and $50,000.
A 2022 study from Stanford's HAI found that label quality had 3-5 times more impact on model performance than label quantity. Two thousand perfectly labeled examples consistently outperformed ten thousand examples with 10% label noise. The practical lesson: invest in labeling quality (clear instructions, multiple reviewers, disagreement resolution) rather than racing to label as many examples as possible.
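Measuring label quality starts with the disagreement check between independent reviewers. A minimal sketch with invented labels:

```python
# Hypothetical labels from two independent reviewers on the same sample.
reviewer_a = ["billing", "technical", "billing", "complaint", "billing"]
reviewer_b = ["billing", "technical", "feature", "complaint", "billing"]

# Disagreement rate: fraction of examples where the reviewers differ.
disagreements = [
    i for i, (a, b) in enumerate(zip(reviewer_a, reviewer_b)) if a != b
]
rate = len(disagreements) / len(reviewer_a)
print(f"disagreement rate: {rate:.0%}, disputed indices: {disagreements}")
# disagreement rate: 20%, disputed indices: [2]
```

The disputed indices are exactly the examples to route into a resolution step: a third reviewer, or a discussion that produces clearer labeling instructions.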
Timespade builds data preparation pipelines alongside AI products, handling the cleaning, labeling workflow setup, and validation before any model training begins. For a mid-size dataset, that preparation costs $3,000-$8,000, while a Western data consultancy charges $15,000-$30,000 for equivalent work. The preparation phase takes 2-4 weeks depending on data complexity.
What legal and privacy concerns apply to collecting training data?
Data privacy law caught up with AI faster than most startups expected. If your training data contains personal information, you are subject to privacy regulations whether you intended to build a "data product" or not.
GDPR (covering the European Union) requires explicit consent for collecting personal data and gives individuals the right to request deletion of their data from your systems, including your training datasets. A model trained on data that someone later requests deleted creates a technical and legal headache. The GDPR fines are not theoretical: regulators issued over 1.6 billion euros in fines during 2022 alone (DLA Piper GDPR Fines Report, 2023).
CCPA (covering California) gives consumers the right to know what personal data a company collects and to opt out of its sale. If your AI product uses California consumer data for training, you need disclosure mechanisms and opt-out functionality built into your product from day one.
Copyright is a less obvious but equally serious concern. Training a model on copyrighted content without permission is legally contested territory as of late 2022. Several lawsuits are pending against companies that trained models on scraped web content. The safest approach: use data you own, data you have licensed, or data that is explicitly public domain.
Practical steps that keep you out of legal trouble: anonymize personal data before training (strip names, emails, phone numbers, and any combination of fields that could identify an individual). Document the source of every dataset and its license terms. Build a deletion pipeline so you can retrain without specific records if someone exercises their right to be forgotten. These steps cost $2,000-$5,000 upfront with a team like Timespade, compared to $10,000-$25,000 at a Western compliance consultancy. The alternative, a GDPR fine, can reach up to 4% of global annual revenue for serious violations.
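A starting-point sketch of the anonymization step. Regex handles structured identifiers like emails and phone numbers; names and other free-text identifiers need a named-entity-recognition tool on top of this, so treat the example below as a first pass, not a complete solution:

```python
import re

# Hypothetical support ticket containing personal data.
text = "Hi, I'm Jane Doe, reach me at jane.doe@example.com or 555-867-5309."

# Strip emails and phone-like numbers before the text enters training data.
text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
text = re.sub(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b", "[PHONE]", text)

print(text)  # Note: the name "Jane Doe" survives; regex cannot catch it.
```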
How should I store and version data for ongoing model improvement?
An AI product is not a one-time build. The model degrades over time as the real world changes and the training data grows stale. Proper data storage and versioning are what separate a product that improves month over month from one that slowly breaks.
Version your datasets the same way software teams version code. Every time you add new training data, retrain the model, or correct labels, save that version with a timestamp and a description of what changed. When a new model version performs worse than the previous one, you need the ability to roll back to the exact dataset that produced the last good version. Without versioning, debugging becomes guesswork.
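Teams commonly use purpose-built tools such as DVC for this, but the core of dataset versioning is small: a content hash plus metadata for every snapshot. A minimal sketch (the function name and manifest fields are invented for illustration):

```python
import datetime
import hashlib
import json

def snapshot(dataset_bytes, description):
    """Record a content hash, timestamp, and note for one dataset version."""
    return {
        "sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "created": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "description": description,
    }

# Hypothetical usage: hash the raw bytes of a training file.
data = b"id,label\n1,spam\n2,ham\n"
manifest = snapshot(data, "added 500 new labeled tickets")
print(json.dumps(manifest, indent=2))
```

The hash gives you the rollback guarantee described above: if a retrain goes wrong, the manifest tells you exactly which bytes produced the last good model.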
Storage costs are often overestimated. A 2022 Cloudflare analysis found that cloud storage prices dropped 90% over the previous decade. Storing one terabyte of training data costs $20-$25 per month on Amazon S3 or Google Cloud Storage. Even large datasets with millions of records rarely exceed a few terabytes. The expensive part is not storing the data. It is organizing it so your team can find and reproduce any version on demand.
Retrain frequency depends on how fast your domain changes. E-commerce recommendations should retrain weekly because product catalogs and user preferences shift constantly. Fraud detection models should retrain monthly because fraud patterns evolve. Demand forecasting models need retraining at least quarterly to absorb seasonal patterns.
| Retraining Frequency | Best For | Why |
|---|---|---|
| Weekly | Recommendations, content ranking | User preferences and inventory change fast |
| Monthly | Fraud detection, pricing optimization | Attack patterns and market conditions evolve |
| Quarterly | Demand forecasting, churn prediction | Seasonal cycles need fresh data each period |
| On trigger (when accuracy drops below a threshold) | Any production model | Catches domain shifts that do not follow a calendar |
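The trigger row in the table reduces to a small check: compare rolling production accuracy against a threshold. A minimal sketch, where the threshold and window size are invented placeholders you would tune for your own product:

```python
# Hypothetical monitor: compare rolling production accuracy to a threshold.
THRESHOLD = 0.90
WINDOW = 5

def should_retrain(recent_outcomes, threshold=THRESHOLD, window=WINDOW):
    """recent_outcomes: list of booleans, True = prediction was correct."""
    if len(recent_outcomes) < window:
        return False  # not enough evidence yet
    tail = recent_outcomes[-window:]
    accuracy = sum(tail) / window
    return accuracy < threshold

outcomes = [True, True, True, False, False, True, False]
print(should_retrain(outcomes))  # True: last 5 are 40% accurate, below 90%
```

A real monitoring system adds alerting and kicks off a retraining job automatically, but the decision at its center is this comparison.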
Timespade sets up automated retraining pipelines as part of every AI product build. The system monitors model accuracy in production, flags when performance drops below a threshold you set, and kicks off retraining with the latest data. Setup costs $4,000-$10,000 depending on complexity. A Western consultancy charges $20,000-$40,000 for equivalent monitoring infrastructure. Without this system, most teams discover their model has degraded only when customers complain.
Can I start building with limited data and improve results later?
Yes, and for most startups this is the only practical approach.
Waiting until you have the "perfect" dataset before building anything is the number one reason AI projects stall before they start. A 2022 Harvard Business Review analysis found that companies taking an iterative approach to AI (starting with limited data and expanding) were 3 times more likely to reach production than companies that spent months assembling data before writing any code.
The practical path looks like this. Start with whatever data you have, even if it is only a few hundred examples. Build a prototype. Test it against real scenarios and measure where it fails. Those failure patterns tell you exactly what data you need next. Collect that specific data and retrain.
Pre-trained models have made this approach much more viable. In late 2022, large language models became available that already understand language and common knowledge from training on massive public datasets. Fine-tuning one of these models on 500-2,000 of your own examples can produce a product-ready tool for many use cases. You are not starting from zero. You are starting from a model that already knows how to read and reason, then teaching it the specifics of your business.
The cost of iteration is low. Each cycle of collecting 500-1,000 new labeled examples, cleaning them, and retraining costs $1,500-$4,000 with a team like Timespade. A Western agency charges $8,000-$15,000 per iteration. Over four iterations, that gap compounds: $6,000-$16,000 total versus $32,000-$60,000.
A staged approach also limits financial risk. If your first prototype shows the product concept does not work, you have spent $5,000-$10,000 learning that lesson, not $50,000. If it does work, each iteration makes it better with evidence guiding every improvement.
| Stage | Data Volume | Cost (Timespade) | Cost (Western Agency) | What You Learn |
|---|---|---|---|---|
| Prototype | 500-1,000 examples | $5,000-$8,000 | $20,000-$35,000 | Does the concept work at all? |
| V1 production | 2,000-5,000 examples | $8,000-$15,000 | $30,000-$50,000 | Where does the model fail in the real world? |
| V2 improvement | 5,000-15,000 examples | $4,000-$8,000 | $15,000-$25,000 | Is accuracy good enough for paying customers? |
| Ongoing iteration | Continuous collection | $1,500-$4,000/cycle | $8,000-$15,000/cycle | What edge cases still need attention? |
The founders who ship successful AI products in 2023 are not the ones with the most data. They are the ones who started building with imperfect data, learned from the gaps, and improved systematically. The data you collect after launch, from real users interacting with a real product, is always more useful than the data you imagine you need before launch.
Timespade builds AI products across all four stages: prototype through production. The team handles data preparation, model selection, training, deployment, and ongoing monitoring as a single engagement. Book a free discovery call to walk through your data situation and get a concrete plan for what to collect, what to clean, and what to build first.
