Running an AI model is cheap until it is not. At low volumes, the API bill is barely a rounding error. At scale, it can become the largest line item in your infrastructure budget.
The pricing structure is not complicated, but the comparison between commercial APIs and open-source alternatives is almost always framed badly. The raw token cost is only one number. The total cost of ownership, including the engineering time to deploy, manage, and optimize, is what actually determines which option makes sense for a given product.
Here is the full picture.
What factors determine the cost of running an AI model?
Three variables drive almost every AI inference bill: the model you choose, how many tokens you process, and whether you are calling a hosted API or running the model yourself.
Tokens are the unit of measurement. One token is roughly four characters of English text, so a typical paragraph is about 75–100 tokens. Every request you send to an AI model consumes input tokens (the text going in) and output tokens (the text coming back). Most providers charge more for output tokens than input, typically by a factor of two to four.
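The arithmetic above can be sketched in a few lines. This is a rough estimator, not a real tokenizer: the 4-characters-per-token ratio is the approximation from the paragraph above, and the rates in the example are the GPT-4o mini prices quoted later in this article.

```python
# Rough per-request cost estimate. Assumes ~4 characters per token
# (an approximation for English text); real tokenizers vary by model.

def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token of English."""
    return max(1, len(text) // 4)

def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Cost in dollars, given rates in dollars per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Example: 500 tokens in, 300 out at GPT-4o mini rates ($0.15 in / $0.60 out)
cost = request_cost(500, 300, 0.15, 0.60)
print(f"${cost:.6f}")  # prints $0.000255 -- a fraction of a cent per call
```

Note the asymmetry: at a 4x output premium, a response-heavy workload (long answers, short prompts) costs far more per call than a prompt-heavy one.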
Model size matters because larger models consume more compute per token. A model with 70 billion parameters needs roughly 140 GB of GPU memory at standard precision. A smaller 7 billion parameter model fits on a single consumer GPU. That size gap translates directly into cost.
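The 140 GB figure comes from a simple rule of thumb: parameter count times bytes per parameter. A minimal sketch, ignoring KV cache and activation overhead (which add more on top):

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """GPU memory needed just for the weights, in GB.
    FP16/BF16 precision = 2 bytes per parameter; quantized models use less.
    Does not include KV cache or activation memory."""
    return params_billions * bytes_per_param

print(weight_memory_gb(70))  # 140.0 -> needs multiple 80 GB datacenter GPUs
print(weight_memory_gb(7))   # 14.0  -> fits on a single consumer GPU
```

Quantization (4-bit or 8-bit weights) shrinks these numbers by 2–4x, which is how hobbyists squeeze large models onto smaller hardware, at some cost to quality.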
The hosting choice is where the tradeoff gets interesting. Commercial APIs handle the infrastructure entirely. Open-source models give you the weights and leave the rest to you. The API path trades money for simplicity. The self-hosting path trades engineering time for lower per-token cost.
According to a16z's 2025 AI infrastructure report, inference costs dropped 10x between 2023 and 2025 across all major providers. What cost $20 per million tokens in early 2024 costs $2 now.
How do token-based pricing models work in practice?
Every major commercial provider charges per million tokens. The table below shows current rates for the models most founders are actually considering.
| Model | Provider | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|---|
| GPT-4o | OpenAI | $2.50 | $10.00 | General-purpose, vision tasks |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | High-volume, cost-sensitive tasks |
| Claude 3.5 Sonnet | Anthropic | $3.00 | $15.00 | Reasoning, long documents |
| Claude 3 Haiku | Anthropic | $0.25 | $1.25 | Fast, cheap responses |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 | Long context, multimodal |
| Gemini 1.5 Flash | Google | $0.075 | $0.30 | Ultra-high-volume workloads |
What does this look like in practice? If your product generates 1,000 API calls per day, with an average of 500 input tokens and 300 output tokens per call, that is 500,000 input tokens and 300,000 output tokens daily. Running that on GPT-4o mini costs about $0.26 per day, or roughly $8 per month. Running the same volume on Claude 3.5 Sonnet costs about $6 per day, or roughly $180 per month.
But scale that to 100,000 daily calls and the numbers shift. GPT-4o mini runs to roughly $765 per month. Claude 3.5 Sonnet runs to about $18,000 per month. At that volume, model selection alone is worth a dedicated conversation.
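The scaling math above is worth making explicit, because it is the calculation you will repeat every time a rate or a volume changes. A minimal sketch, using the per-million-token rates from the table:

```python
# Monthly bill at a given call volume, from per-1M-token rates.
# Assumes a 30-day month; rates are those quoted in the table above.

def monthly_cost(calls_per_day: int, in_tokens: int, out_tokens: int,
                 in_rate: float, out_rate: float, days: int = 30) -> float:
    """Dollars per month for a steady daily call volume."""
    daily = (calls_per_day * in_tokens * in_rate
             + calls_per_day * out_tokens * out_rate) / 1_000_000
    return daily * days

# 100,000 calls/day, 500 tokens in and 300 out per call
for name, in_rate, out_rate in [("GPT-4o mini", 0.15, 0.60),
                                ("Claude 3.5 Sonnet", 3.00, 15.00)]:
    print(name, round(monthly_cost(100_000, 500, 300, in_rate, out_rate), 2))
# GPT-4o mini 765.0
# Claude 3.5 Sonnet 18000.0
```

The cost ratio between the two models tracks the rate ratio almost exactly, which is why model choice dominates every other optimization at high volume.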
OpenAI's 2025 usage data shows that the median production application spends under $200 per month on inference. The majority of founders overestimate their AI costs before launch and underestimate them after product-market fit.
Are open-source models actually cheaper than commercial APIs?
Sometimes. The honest answer depends on what you are measuring.
Open-source models like Meta's Llama 3 (70B), Mistral 7B, and Qwen 2.5 are free to download and run. The per-token cost on self-hosted infrastructure drops to $0.10–$0.30 per million tokens once hardware is factored in. Compared to $2.50–$15 for commercial APIs, that looks like an obvious win.
The catch is the denominator. Self-hosting a 70B model requires at least two A100 GPUs with 80 GB of VRAM each. That hardware costs $25,000–$30,000 to buy or $3,000–$4,000 per month to rent from AWS or Google Cloud. You also need an engineer to set up the inference server, handle scaling, manage updates, and debug failures. That engineer costs $120,000–$160,000 per year in the US, or $30,000–$50,000 with a global team.
For a startup processing fewer than 50 million tokens per month, the commercial API is almost always cheaper once engineering cost is included. The breakeven point, where self-hosting becomes financially rational, typically lands somewhere between 500 million and 1 billion tokens per month.
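The breakeven logic reduces to one equation: the fixed monthly cost of self-hosting divided by the per-token savings. A sketch with illustrative inputs drawn from the ranges above (the blended API rate, self-hosting rate, and fixed-cost figure in the example are assumptions, not quotes):

```python
# Break-even sketch: monthly token volume where self-hosting beats an API.
# fixed_monthly_cost = GPU rental plus amortized engineering time.

def breakeven_tokens_per_month(api_rate_per_1m: float,
                               selfhost_rate_per_1m: float,
                               fixed_monthly_cost: float) -> float:
    """Token volume at which API spend equals self-hosting spend."""
    saving_per_1m = api_rate_per_1m - selfhost_rate_per_1m
    return fixed_monthly_cost / saving_per_1m * 1_000_000

# Illustrative: $5.00/1M blended API rate (mixed input/output on a flagship),
# $0.25/1M self-hosted, $4,000/mo in GPU rental + a share of engineer time.
tokens = breakeven_tokens_per_month(5.00, 0.25, 4_000)
print(f"{tokens / 1e6:.0f}M tokens/month")  # ~842M, inside the 500M-1B range
```

Change any input and the breakeven moves a lot: a cheaper API target (GPT-4o mini at $0.15–$0.60) can push it past several billion tokens per month, which is why self-hosting only pays off against flagship-model workloads.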
Managed open-source options like Hugging Face Inference Endpoints or Together AI reduce this gap. They let you run open-source models on shared infrastructure for $0.40–$0.80 per million tokens, no GPU management required.
| Approach | Cost per 1M Tokens | Setup Time | Maintenance | Breakeven Volume |
|---|---|---|---|---|
| Commercial API (e.g. GPT-4o mini) | $0.15–$2.50 | Minutes | None | N/A |
| Managed open-source (e.g. Together AI) | $0.40–$0.80 | Hours | Low | ~100M tokens/mo |
| Self-hosted on rented GPU | $0.10–$0.30 | Weeks | High | ~500M tokens/mo |
| Self-hosted on owned GPU | $0.02–$0.10 | Weeks | Very high | 1B+ tokens/mo |
For a non-technical founder building a first product, self-hosting before product-market fit is a distraction. The engineering time has a higher opportunity cost than the API savings.
How does model size affect both cost and quality?
Bigger models are not always better for your use case. This is the part most AI vendors do not tell you.
A 7 billion parameter model like Mistral 7B handles simple tasks (classification, summarization, extraction) at a quality level that is indistinguishable from GPT-4 for most production workloads. Sequoia's 2025 AI benchmark study found that for structured tasks with clear inputs and outputs, smaller models matched larger ones 78% of the time. The gap appears on open-ended reasoning, multi-step problems, and creative tasks.
The cost difference is not small. A 7B model costs roughly 10–15x less per token than a 70B model on equivalent hardware. For a product where AI handles a defined, repeatable task, routing those calls to the smallest model that gets the job done is one of the highest-leverage cost optimizations available.
Practical model selection by task type:
- Summarizing a document, extracting data from a form, classifying customer feedback: a 7B or 13B model is usually sufficient.
- Writing a full marketing brief, answering complex customer questions, generating code: a 34B–70B model or a commercial flagship like GPT-4o or Claude 3.5 Sonnet.
- Long legal documents, scientific research, tasks requiring reasoning over many steps: Claude 3.5 Sonnet (200,000-token context window) or Gemini 1.5 Pro (context windows of 1 million tokens or more).
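The selection rules above can be wired into a simple routing layer, so each call goes to the cheapest model that can handle its task type. A minimal sketch; the model names and task categories here are illustrative, not a recommendation:

```python
# Route each request to the cheapest adequate model.
# Task categories and model choices are illustrative examples.

ROUTES = {
    "classification": "mistral-7b",        # structured task, small model
    "extraction":     "mistral-7b",
    "summarization":  "mistral-7b",
    "generation":     "gpt-4o",            # open-ended writing, code
    "long_context":   "claude-3-5-sonnet", # 100k+ token documents
}

def route(task_type: str, default: str = "gpt-4o-mini") -> str:
    """Pick a model for a task; fall back to a cheap general model."""
    return ROUTES.get(task_type, default)

print(route("classification"))  # mistral-7b
print(route("chitchat"))        # gpt-4o-mini (fallback)
```

Even a static lookup table like this captures most of the savings; more sophisticated routers classify the request itself before dispatching, which adds one cheap call but keeps expensive models reserved for work that needs them.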
The cost-quality tradeoff is not linear. You do not get twice the quality from a model that costs twice as much. You get marginally better performance on the hardest tasks, and roughly equal performance on everything else.
What should I budget for AI inference at different scales?
The numbers below assume a product where AI is used for user-facing features, not just internal tools, and where the average request is 500–800 tokens in and 300–500 tokens out.
| Stage | Daily API Calls | Monthly Token Volume | Estimated Monthly Cost | Recommended Model |
|---|---|---|---|---|
| Pre-launch / testing | Under 500 | Under 5M | Under $10 | GPT-4o mini or Claude Haiku |
| Early users (1–500 users) | 500–5,000 | 5M–50M | $10–$150 | GPT-4o mini or Gemini Flash |
| Growth (500–10,000 users) | 5,000–50,000 | 50M–500M | $150–$2,500 | Mix of mini + flagship |
| Scale (10,000+ users) | 50,000+ | 500M+ | $2,500–$25,000+ | Evaluate self-hosting or volume deals |
A few cost levers that most founders do not think about until the bill arrives:
Caching repeated responses cuts costs by 20–40% on products where users ask similar questions. If 1,000 users ask a variant of the same support question, you only need to generate the answer once and serve it from cache. OpenAI's prompt caching, launched in late 2024, does this automatically for prompts over 1,024 tokens.
Volume discounts start at relatively low thresholds. Anthropic's enterprise pricing reduces rates by 20–30% at $5,000 per month in spend. OpenAI offers committed-use discounts at similar levels. For a product at growth stage, negotiating a contract is often the cheapest optimization available.
Western AI consultancies typically charge $30,000–$80,000 to build an AI-powered feature, including model selection, prompt engineering, and integration. An AI-native team at Timespade ships the same feature for $8,000–$15,000 in under 28 days, with model routing and cost optimization built into the architecture from day one, not bolted on after the bill surprises you.
The operational cost of running the AI model is only one part of the equation. The larger variable is the engineering cost to integrate it well, build the caching layer, implement fallback logic, and monitor for quality drift. That is where the legacy tax shows up, not on the API invoice.
