A split testing setup sounds like something only companies with data science teams can run. It is not. A founder with 500 monthly active users and a free tool can run a legitimate A/B test that changes how they build the next three features.
What makes most early-stage setups fail is not a lack of tools. It is skipping the math before writing a single line of code: running tests that are too short, too small, or measuring the wrong thing. Fix those three problems and A/B testing becomes one of the cheapest ways to make product decisions that are right more often than wrong.
What is A/B testing and why does it matter?
An A/B test shows two different versions of something (a button, a pricing page, an onboarding flow) to two separate groups of real users at the same time. One group sees the original (the control). The other sees the change (the variant). You measure which group does more of what you want: sign up, pay, return the next day.
The reason this matters is that founder intuition has a poor track record. Google ran a famous test in 2009 where they tested 41 shades of blue for a toolbar link color. The winning shade generated an extra $200 million in annual ad revenue. No one on the design team predicted it. That is not a story about Google being obsessive. It is a story about the gap between what people think users prefer and what users actually do.
A 2021 Harvard Business Review study found only 10–30% of A/B tests produce a statistically significant positive result. That number sounds discouraging. What it actually means: if you are changing things based on intuition alone, you are probably wrong 70–90% of the time and never finding out.
How does an A/B test work?
Every A/B test has four components that have to be defined before you start, not after you see the numbers.
Start with a single hypothesis: "Changing the CTA button from 'Sign up' to 'Start free' will increase registrations." One change. One metric. If you change the button text and the button color at the same time, you cannot tell which change caused the result.
Next is a primary metric, the one number that decides the winner. Clicks, sign-ups, first purchase, seven-day retention. Pick one. Looking at fifteen metrics after the fact is how you fool yourself into seeing a win that does not exist.
Third is a minimum detectable effect: how big does the difference need to be before you care? A 0.1% improvement in conversion is not worth building infrastructure around. A 10% improvement changes your unit economics. Decide this threshold before the test runs.
Fourth is a sample size calculated in advance. A free online sample size calculator takes two minutes and tells you how many users each variant needs to see before the result is reliable. Depending on the baseline conversion rate and the effect size you are trying to detect, that number ranges from a few hundred to tens of thousands of users per variant. Stopping a test the moment one variant pulls ahead is one of the most common ways to manufacture a false positive.
Once those four things are set, the technical implementation is straightforward. Your analytics tool or a feature flag system randomly assigns each new user to group A or group B, tracks what each group does, and surfaces the difference.
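Under the hood, that "random" assignment is usually a deterministic hash rather than a stored coin flip, so a returning user always lands in the same group. A minimal sketch of the idea (the function and experiment names here are illustrative, not any particular tool's API):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "variant")) -> str:
    """Deterministically bucket a user: hashing the experiment name together
    with the user ID yields the same group on every visit, with no
    assignment table to store or look up."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest[:8], 16) % len(variants)]

# Same user, same experiment -> same group, every time.
print(assign_variant("user-42", "cta-copy"))
```

Salting the hash with the experiment name matters: without it, the same users would fall into the same bucket in every test you ever run, and experiments would contaminate each other.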
What does A/B testing cost for a startup?
The real cost question is not which platform to use. It is whether the volume of decisions you are making justifies a dedicated tool at all.
At under 10,000 monthly active users, most startups run fine on free tools. Google Optimize filled this niche until Google sunset it in 2023; today PostHog's free tier and GrowthBook's free plan both support basic experimentation without a monthly bill. You lose some convenience features but not statistical validity.
Between 10,000 and 100,000 monthly active users, paid plans start making sense because the manual overhead of free tools exceeds their savings. Tools like PostHog, Statsig, and LaunchDarkly run $50–$500 per month at this scale.
Enterprise platforms (Optimizely, VWO, Adobe Target) start at $1,500–$5,000 per month and are designed for teams running dozens of concurrent experiments across multiple surfaces. A pre-product-market-fit startup paying $3,000 per month for an experimentation platform is burning runway on infrastructure that will not move the needle.
| Stage | Recommended tool | Monthly cost | What you give up |
|---|---|---|---|
| <10k MAU | GrowthBook (free), PostHog free tier | $0 | Advanced targeting, multi-armed bandits |
| 10k–100k MAU | PostHog, Statsig, LaunchDarkly | $50–$500 | Enterprise compliance, dedicated support |
| 100k+ MAU | Optimizely, VWO, Adobe Target | $1,500–$5,000 | Runway, if you adopt them before product-market fit |
Western agencies that set up experimentation infrastructure typically charge $8,000–$20,000 for the initial build plus $2,000–$5,000 per month for ongoing management. A global engineering team builds the same infrastructure for $3,000–$5,000, with open-source tooling covering the platform cost entirely. The 3–5x gap is not in the tools; it is in the hourly rate of the engineers configuring them.
What are the most common A/B testing mistakes?
Most failed tests are not failed by the technology. They are failed in the week before the test runs.
Peeking is the single most common error. A test is running, you check the dashboard on day three, variant B is winning by 15%, and you call it. The problem: at day three, your sample is too small to distinguish a real effect from random noise. Statistical tests assume you collect data until the pre-planned sample size is reached, then look. Looking continuously and stopping early inflates your false positive rate from 5% to 25–40% (Microsoft Research, 2019). A test you called on day three has a coin-flip chance of being wrong.
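The inflation is easy to reproduce yourself. The sketch below simulates A/A tests (both groups get the identical experience, so any "win" is by definition a false positive) and compares checking the dashboard daily against looking once at the end; the traffic numbers are arbitrary:

```python
import math
import random

def p_value(wins_a, n_a, wins_b, n_b):
    """Two-sided two-proportion z-test."""
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = abs(wins_a / n_a - wins_b / n_b) / se
    return math.erfc(z / math.sqrt(2))

DAILY, DAYS, RATE = 100, 14, 0.05  # users per variant per day; both arms convert at 5%

def false_positive(peek: bool) -> bool:
    wins_a = wins_b = n = 0
    for _ in range(DAYS):
        wins_a += sum(random.random() < RATE for _ in range(DAILY))
        wins_b += sum(random.random() < RATE for _ in range(DAILY))
        n += DAILY
        if peek and p_value(wins_a, n, wins_b, n) < 0.05:
            return True  # "significant" on an early look -> test called, wrongly
    return p_value(wins_a, n, wins_b, n) < 0.05

random.seed(1)
trials = 1000
peeking = sum(false_positive(True) for _ in range(trials)) / trials
one_look = sum(false_positive(False) for _ in range(trials)) / trials
print(f"daily peeking: {peeking:.1%} false positives")
print(f"single look:   {one_look:.1%} false positives")
```

With no real difference between the arms, the single look stays near the promised 5% error rate while daily peeking multiplies it several times over, which is the same mechanism behind the inflated rates cited above.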
A second mistake is testing too many things at once. A multivariate test (changing the headline, the image, and the CTA simultaneously) requires sample sizes that multiply with every additional variable: three binary changes produce 2³ = 8 combinations to fill with traffic, against two for a single-variable test, and the requirement grows further once you correct for comparing that many cells against each other. At most early-stage apps, that means a test that finishes in two weeks for one variable takes months for three.
There is also the novelty effect. Users behave differently when something is new. A fresh redesign of your checkout page will often see a short-term lift simply because users are paying more attention. Tests shorter than two weeks frequently capture the novelty bump, not the real behavioral change.
Then there is the survivorship problem. If your test measures only users who reach a certain page, but the variant itself causes fewer users to reach that page in the first place, you are measuring a biased sample. Always track the full funnel, not just the step you changed.
How long does an A/B test need to run?
Two weeks is the floor for almost every test, regardless of whether you hit your target sample size sooner.
The reason is weekly seasonality. User behavior on a Monday is systematically different from behavior on a Saturday. A test that runs for nine days captures a Monday-through-Tuesday cycle twice and a weekend once, which means the two groups are not getting the same mix of traffic patterns. Running through at least two full weeks evens this out.
Sample size drives the ceiling. A free sample size calculator requires four inputs: your baseline conversion rate, the minimum improvement you care about detecting, the confidence level you want (95% is standard), and the statistical power (80% is standard). Plug those in and you get the number of users per variant. Divide by your daily traffic per variant and you know how many days the test needs to run.
For a typical early-stage app with a 3% baseline conversion rate trying to detect a 20% relative improvement (from 3.0% to 3.6%), the required sample at 95% confidence and 80% power is roughly 13,900 users per variant. An app with 500 daily active users split evenly across two variants enrolls 250 users per variant per day, so the test needs about eight weeks. An app with 5,000 daily active users could hit the sample in under six days, but still runs for two weeks to clear the seasonality problem.

| Baseline conversion | Minimum effect to detect | Users needed per variant |
|---|---|---|
| 1% | 20% relative (1.0% → 1.2%) | ~42,600 |
| 3% | 20% relative (3.0% → 3.6%) | ~13,900 |
| 10% | 20% relative (10% → 12%) | ~3,800 |
| 30% | 20% relative (30% → 36%) | ~960 |
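These figures fall out of the standard two-proportion sample size formula; a sketch of the calculation the free calculators run, assuming a two-sided test at 95% confidence and 80% power:

```python
import math

def sample_size_per_variant(baseline: float, relative_mde: float) -> int:
    """Users needed per variant to detect a relative lift in a conversion
    rate, via the standard two-proportion sample size formula."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha, z_beta = 1.96, 0.84  # two-sided 95% confidence, 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

for baseline in (0.01, 0.03, 0.10, 0.30):
    n = sample_size_per_variant(baseline, 0.20)
    days = math.ceil(n / 250)  # e.g. 500 DAU split across two variants
    print(f"{baseline:.0%} baseline: ~{n:,} per variant ({days} days at 500 DAU)")
```

Note how the requirement explodes as the baseline falls: halving the effect you want to detect roughly quadruples the sample, which is why "detect any improvement at all" is not a workable test design.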
Conversion rate and traffic volume together determine how fast you can run experiments. An app with a 1% conversion rate and 200 daily active users cannot run a meaningful test in under six months. In that case, the right answer is not a faster tool. It is either improving baseline conversion first or running qualitative research (user interviews, session recordings) until traffic grows enough to support testing.
The operational setup (picking a tool, adding tracking, building the flag logic) takes one to three days for a competent engineering team. The strategy part (what to test, in what order, how to read the results) takes longer and matters more. An engineering team that has built experimentation infrastructure before can compress the setup to a day. A team doing it for the first time should budget a week.
If you want to build out a proper experimentation system (the tooling, the tracking, the result analysis workflow) without pulling your product team off roadmap work, that is a contained infrastructure project. Book a discovery call with Timespade and you can have a scoped plan in your inbox within 24 hours.
