Running an AI model is cheap until it is not. At low volumes, the API bill is barely a rounding error. At scale, it can become the largest line item in your infrastructure budget.
The pricing structure is not complicated, but the comparison between commercial APIs and open-source alternatives is almost always framed badly. The raw token cost is only one number. The total cost of ownership, including the engineering time to deploy, manage, and optimize, is what actually determines which option makes sense for a given product.
Here is the full picture.
What factors determine the cost of running an AI model?
Three variables drive almost every AI inference bill: the model you choose, how many tokens you process, and whether you are calling a hosted API or running the model yourself.
Tokens are the unit of measurement. One token is roughly four characters of English text, so a typical paragraph is about 75–100 tokens. Every request you send to an AI model consumes input tokens (the text going in) and output tokens (the text coming back). Most providers charge more for output tokens than input, typically by a factor of two to four.
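The arithmetic above can be sketched in a few lines. This is a rough estimator, not a real tokenizer: the 4-characters-per-token ratio is the approximation from the paragraph above, and the rates in the example are the GPT-4o mini prices quoted later in this article.

```python
# Rough per-request cost estimate. Assumes ~4 characters per token
# (an approximation for English text); real tokenizers vary by model.

def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token of English."""
    return max(1, len(text) // 4)

def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Cost in dollars, given rates in dollars per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Example: 500 tokens in, 300 out at GPT-4o mini rates ($0.15 in / $0.60 out)
cost = request_cost(500, 300, 0.15, 0.60)
print(f"${cost:.6f}")  # prints $0.000255 -- a fraction of a cent per call
```

Note the asymmetry: at a 4x output premium, a response-heavy workload (long answers, short prompts) costs far more per call than a prompt-heavy one.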
Model size matters because larger models consume more compute per token. A model with 70 billion parameters needs roughly 140 GB of GPU memory at standard precision. A smaller 7 billion parameter model fits on a single consumer GPU. That size gap translates directly into cost.
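The 140 GB figure comes from a simple rule of thumb: parameter count times bytes per parameter. A minimal sketch, ignoring KV cache and activation overhead (which add more on top):

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """GPU memory needed just for the weights, in GB.
    FP16/BF16 precision = 2 bytes per parameter; quantized models use less.
    Does not include KV cache or activation memory."""
    return params_billions * bytes_per_param

print(weight_memory_gb(70))  # 140.0 -> needs multiple 80 GB datacenter GPUs
print(weight_memory_gb(7))   # 14.0  -> fits on a single consumer GPU
```

Quantization (4-bit or 8-bit weights) shrinks these numbers by 2–4x, which is how hobbyists squeeze large models onto smaller hardware, at some cost to quality.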
The hosting choice is where the tradeoff gets interesting. Commercial APIs handle the infrastructure entirely. Open-source models give you the weights and leave the rest to you. The API path trades money for simplicity. The self-hosting path trades engineering time for lower per-token cost.
According to a16z's 2025 AI infrastructure report, inference costs dropped 10x between 2023 and 2025 across all major providers. What cost $20 per million tokens in early 2024 costs $2 now.
How do token-based pricing models work in practice?
Every major commercial provider charges per million tokens. The table below shows current rates for the models most founders are actually considering.
| Model | Provider | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|---|
| GPT-4o | OpenAI | $2.50 | $10.00 | General-purpose, vision tasks |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | High-volume, cost-sensitive tasks |
| Claude 3.5 Sonnet | Anthropic | $3.00 | $15.00 | Reasoning, long documents |
| Claude 3 Haiku | Anthropic | $0.25 | $1.25 | Fast, cheap responses |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 | Long context, multimodal |
| Gemini 1.5 Flash | Google | $0.075 | $0.30 | Ultra-high-volume workloads |
What does this look like in practice? If your product generates 1,000 API calls per day, with an average of 500 input tokens and 300 output tokens per call, that is 500,000 input tokens and 300,000 output tokens daily. Running that on GPT-4o mini costs about $0.26 per day, or roughly $8 per month. Running the same volume on Claude 3.5 Sonnet costs about $6 per day, or roughly $180 per month.
But scale that to 100,000 daily calls and the numbers shift. GPT-4o mini runs to roughly $765 per month. Claude 3.5 Sonnet runs to about $18,000 per month. At that volume, model selection alone is worth a dedicated conversation.
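The scaling math above is worth making explicit, because it is the calculation you will repeat every time a rate or a volume changes. A minimal sketch, using the per-million-token rates from the table:

```python
# Monthly bill at a given call volume, from per-1M-token rates.
# Assumes a 30-day month; rates are those quoted in the table above.

def monthly_cost(calls_per_day: int, in_tokens: int, out_tokens: int,
                 in_rate: float, out_rate: float, days: int = 30) -> float:
    """Dollars per month for a steady daily call volume."""
    daily = (calls_per_day * in_tokens * in_rate
             + calls_per_day * out_tokens * out_rate) / 1_000_000
    return daily * days

# 100,000 calls/day, 500 tokens in and 300 out per call
for name, in_rate, out_rate in [("GPT-4o mini", 0.15, 0.60),
                                ("Claude 3.5 Sonnet", 3.00, 15.00)]:
    print(name, round(monthly_cost(100_000, 500, 300, in_rate, out_rate), 2))
# GPT-4o mini 765.0
# Claude 3.5 Sonnet 18000.0
```

The cost ratio between the two models tracks the rate ratio almost exactly, which is why model choice dominates every other optimization at high volume.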
OpenAI's 2025 usage data shows that the median production application spends under $200 per month on inference. The majority of founders overestimate their AI costs before launch and underestimate them after product-market fit.
Are open-source models actually cheaper than commercial APIs?
Sometimes. The honest answer depends on what you are measuring.
Open-source models like Meta's Llama 3 (70B), Mistral 7B, and Qwen 2.5 are free to download and run. The per-token cost on self-hosted infrastructure drops to $0.10–$0.30 per million tokens once hardware is factored in. Compared to $2.50–$15 for commercial APIs, that looks like an obvious win.
The catch is the denominator. Self-hosting a 70B model requires at least two A100 GPUs with 80 GB of VRAM each. That hardware costs $25,000–$30,000 to buy or $3,000–$4,000 per month to rent from AWS or Google Cloud. You also need an engineer to set up the inference server, handle scaling, manage updates, and debug failures. That engineer costs $120,000–$160,000 per year in the US, or $30,000–$50,000 with a global team.
For a startup processing fewer than 50 million tokens per month, the commercial API is almost always cheaper once engineering cost is included. The breakeven point, where self-hosting becomes financially rational, typically lands somewhere between 500 million and 1 billion tokens per month.
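The breakeven logic reduces to one equation: the fixed monthly cost of self-hosting divided by the per-token savings. A sketch with illustrative inputs drawn from the ranges above (the blended API rate, self-hosting rate, and fixed-cost figure in the example are assumptions, not quotes):

```python
# Break-even sketch: monthly token volume where self-hosting beats an API.
# fixed_monthly_cost = GPU rental plus amortized engineering time.

def breakeven_tokens_per_month(api_rate_per_1m: float,
                               selfhost_rate_per_1m: float,
                               fixed_monthly_cost: float) -> float:
    """Token volume at which API spend equals self-hosting spend."""
    saving_per_1m = api_rate_per_1m - selfhost_rate_per_1m
    return fixed_monthly_cost / saving_per_1m * 1_000_000

# Illustrative: $5.00/1M blended API rate (mixed input/output on a flagship),
# $0.25/1M self-hosted, $4,000/mo in GPU rental + a share of engineer time.
tokens = breakeven_tokens_per_month(5.00, 0.25, 4_000)
print(f"{tokens / 1e6:.0f}M tokens/month")  # ~842M, inside the 500M-1B range
```

Change any input and the breakeven moves a lot: a cheaper API target (GPT-4o mini at $0.15–$0.60) can push it past several billion tokens per month, which is why self-hosting only pays off against flagship-model workloads.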
Managed open-source options like Hugging Face Inference Endpoints or Together AI reduce this gap. They let you run open-source models on shared infrastructure for $0.40–$0.80 per million tokens, no GPU management required.
| Approach | Cost per 1M Tokens | Setup Time | Maintenance | Breakeven Volume |
|---|---|---|---|---|
| Commercial API (e.g. GPT-4o mini) | $0.15–$2.50 | Minutes | None | N/A |
| Managed open-source (e.g. Together AI) | $0.40–$0.80 | Hours | Low | ~100M tokens/mo |
| Self-hosted on rented GPU | $0.10–$0.30 | Weeks | High | ~500M tokens/mo |
| Self-hosted on owned GPU | $0.02–$0.10 | Weeks | Very high | 1B+ tokens/mo |
For a non-technical founder building a first product, self-hosting before product-market fit is a distraction. The engineering time has a higher opportunity cost than the API savings.
How does model size affect both cost and quality?
Bigger models are not always better for your use case. This is the part most AI vendors do not tell you.
A 7 billion parameter model like Mistral 7B handles simple tasks (classification, summarization, extraction) at a quality level that is indistinguishable from GPT-4 for most production workloads. Sequoia's 2025 AI benchmark study found that for structured tasks with clear inputs and outputs, smaller models matched larger ones 78% of the time. The gap appears on open-ended reasoning, multi-step problems, and creative tasks.
The cost difference is not small. A 7B model costs roughly 10–15x less per token than a 70B model on equivalent hardware. For a product where AI handles a defined, repeatable task, routing those calls to the smallest model that gets the job done is one of the highest-leverage cost optimizations available.
Practical model selection by task type:
- Summarizing a document, extracting data from a form, classifying customer feedback: a 7B or 13B model is usually sufficient.
- Writing a full marketing brief, answering complex customer questions, generating code: a 34B–70B model or a commercial flagship like GPT-4o or Claude 3.5 Sonnet.
- Long legal documents, scientific research, tasks requiring reasoning over many steps: Claude 3.5 Sonnet (200,000-token context window) or Gemini 1.5 Pro (context windows of 1 million tokens or more).
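The selection rules above can be wired into a simple routing layer, so each call goes to the cheapest model that can handle its task type. A minimal sketch; the model names and task categories here are illustrative, not a recommendation:

```python
# Route each request to the cheapest adequate model.
# Task categories and model choices are illustrative examples.

ROUTES = {
    "classification": "mistral-7b",        # structured task, small model
    "extraction":     "mistral-7b",
    "summarization":  "mistral-7b",
    "generation":     "gpt-4o",            # open-ended writing, code
    "long_context":   "claude-3-5-sonnet", # 100k+ token documents
}

def route(task_type: str, default: str = "gpt-4o-mini") -> str:
    """Pick a model for a task; fall back to a cheap general model."""
    return ROUTES.get(task_type, default)

print(route("classification"))  # mistral-7b
print(route("chitchat"))        # gpt-4o-mini (fallback)
```

Even a static lookup table like this captures most of the savings; more sophisticated routers classify the request itself before dispatching, which adds one cheap call but keeps expensive models reserved for work that needs them.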
The cost-quality tradeoff is not linear. You do not get twice the quality from a model that costs twice as much. You get marginally better performance on the hardest tasks, and roughly equal performance on everything else.
What should I budget for AI inference at different scales?
The numbers below assume a product where AI is used for user-facing features, not just internal tools, and where the average request is 500–800 tokens in and 300–500 tokens out.
| Stage | Daily API Calls | Monthly Token Volume | Estimated Monthly Cost | Recommended Model |
|---|---|---|---|---|
| Pre-launch / testing | Under 500 | Under 5M | Under $10 | GPT-4o mini or Claude Haiku |
| Early users (1–500 users) | 500–5,000 | 5M–50M | $10–$150 | GPT-4o mini or Gemini Flash |
| Growth (500–10,000 users) | 5,000–50,000 | 50M–500M | $150–$2,500 | Mix of mini + flagship |
| Scale (10,000+ users) | 50,000+ | 500M+ | $2,500–$25,000+ | Evaluate self-hosting or volume deals |
A few cost levers that most founders do not think about until the bill arrives:
Caching repeated responses cuts costs by 20–40% on products where users ask similar questions. If 1,000 users ask a variant of the same support question, you only need to generate the answer once and serve it from cache. OpenAI's prompt caching, launched in late 2024, does this automatically for prompts over 1,024 tokens.
Volume discounts start at relatively low thresholds. Anthropic's enterprise pricing reduces rates by 20–30% at $5,000 per month in spend. OpenAI offers committed-use discounts at similar levels. For a product at growth stage, negotiating a contract is often the cheapest optimization available.
Western AI consultancies typically charge $30,000–$80,000 to build an AI-powered feature, including model selection, prompt engineering, and integration. An AI-native team at Timespade ships the same feature for $8,000–$15,000 in under 28 days, with model routing and cost optimization built into the architecture from day one, not bolted on after the bill surprises you.
The operational cost of running the AI model is only one part of the equation. The larger variable is the engineering cost to integrate it well, build the caching layer, implement fallback logic, and monitor for quality drift. That is where the legacy tax shows up, not on the API invoice.
