Blazeway


A/B Test Significance Calculator

Same data, two methods, different answers. Enter your experiment numbers and see how chi-squared and Bayesian analysis interpret the results.

Control (A)

Rate: 5.0%

Variant (B)

Rate: 6.5%

Frequentist

Chi-Squared Test

χ² Statistic: 2.076
p-Value: 0.1496
Significant (p < 0.05): No
The observed difference could easily occur by chance (p = 0.150). Not enough evidence to declare a winner.

Bayesian

Beta-Binomial Model

P(B beats A): 92.4%
Expected Lift: +31.9%
95% Credible Interval: -9.1% to +85.4%
B is likely better (92.4% chance), but there's still meaningful uncertainty. The true lift could be anywhere from -9.1% to +85.4%.

Why two methods?

Most A/B testing tools show you one number and call it a day. But that number comes from a specific statistical framework, and the framework shapes what the number means. This calculator shows you both so you can see the difference yourself.

Chi-squared: "Can I reject the null hypothesis?"

The frequentist approach asks a narrow question: if there were no real difference between A and B, how likely is data this extreme? If unlikely enough (p < 0.05), you call it significant. If not, you can't conclude anything. It's binary. You either cross the threshold or you don't.

This works well when you need a strict decision rule. But it doesn't tell you how much better B is, or how confident you should be. P-values of 0.049 and 0.051 are practically identical, yet one is "significant" and the other isn't.
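In Python, a chi-squared test like the one above can be sketched with SciPy. The 1,000 visitors per variant below is an illustrative assumption, not a number from the widget, though it happens to reproduce the χ² = 2.076 and p = 0.1496 shown in the results:

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table: [conversions, non-conversions] per variant.
# Sample sizes are assumed for illustration: 50/1,000 (5.0%) vs 65/1,000 (6.5%).
table = [[50, 950],   # Control (A)
         [65, 935]]   # Variant (B)

# correction=False disables the Yates continuity correction,
# matching a plain chi-squared test on the observed counts.
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")  # chi2 = 2.076, p = 0.1496
```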

Bayesian: "How likely is it that B is better, and by how much?"

The Bayesian approach directly answers the question you actually care about. It gives you a probability that B beats A (e.g. 94.3%) and a credible interval for the lift (e.g. +5% to +25%). No arbitrary threshold. You decide what confidence level is enough for your decision.

This is more useful for product decisions. "There's a 92% chance this improves conversion by 8-20%" is a better basis for a shipping decision than "p = 0.07, not significant."

When they agree, when they don't

With large samples and clear winners, both methods tell the same story. The differences matter at the margins: small samples, ambiguous results, marginal effects. That's exactly where you need the extra nuance that Bayesian analysis provides.

What this calculator uses

The frequentist column uses a standard chi-squared test on a 2x2 contingency table with one degree of freedom. The Bayesian column uses a Beta-Binomial model with a uniform prior (Beta(1,1)) and 100,000 Monte Carlo samples to estimate the posterior distribution. All calculations run in your browser. No data is sent anywhere.
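The Bayesian column described above can be sketched in a few lines of NumPy: a uniform Beta(1,1) prior updated with the observed counts, then 100,000 Monte Carlo draws from each posterior. The counts below (50/1,000 vs 65/1,000) are an illustrative assumption consistent with the 5.0% and 6.5% rates in the widget:

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000  # Monte Carlo samples, as described above

# Beta(1,1) prior + binomial data -> Beta(1 + successes, 1 + failures) posterior
a_post = rng.beta(1 + 50, 1 + 950, n_draws)  # posterior draws for A's rate
b_post = rng.beta(1 + 65, 1 + 935, n_draws)  # posterior draws for B's rate

p_b_beats_a = (b_post > a_post).mean()       # P(B beats A)
lift = (b_post - a_post) / a_post            # relative lift per draw
lo, hi = np.percentile(lift, [2.5, 97.5])    # 95% credible interval

print(f"P(B beats A) = {p_b_beats_a:.1%}")
print(f"Expected lift = {lift.mean():+.1%}")
print(f"95% credible interval: {lo:+.1%} to {hi:+.1%}")
```

With these counts the output lands close to the widget's numbers (roughly a 92% chance B beats A, with a wide credible interval), up to Monte Carlo noise.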

FAQ

Frequently Asked Questions

What is the difference between chi-squared and Bayesian A/B testing?

A chi-squared test gives a binary answer: significant or not, based on a p-value threshold (typically 0.05). A Bayesian approach tells you the probability that one variant beats the other and the expected size of the difference. Chi-squared answers "can I reject the null hypothesis?" while Bayesian answers "how confident should I be that B is better, and by how much?"

Which method should I use for my A/B test?

For most product decisions, Bayesian analysis is more useful because it gives you a probability and a range rather than a binary yes/no. Chi-squared is better when you need a strict, conventional significance test, for example in academic research or regulated environments. Both methods converge on the same conclusion with large enough sample sizes.

What is a p-value?

A p-value is the probability of observing results as extreme as yours if there were no real difference between variants. A p-value of 0.03 means there is a 3% chance of seeing data at least this extreme if A and B were identical. It does not mean there is a 97% chance that B is better. That distinction is why Bayesian analysis exists.

What is a credible interval?

A 95% credible interval is the range that contains the true lift with 95% probability. If the interval is +5% to +25%, you can say: there is a 95% chance the real improvement is between 5% and 25%. This is more intuitive than a frequentist confidence interval, which answers a subtly different question.

How many visitors do I need?

It depends on your baseline conversion rate and the effect size you want to detect. As a rough guide: to detect a 20% relative lift on a 5% conversion rate with a conventional two-sided test (5% significance, 80% power), you need roughly 8,000 visitors per variant. The Bayesian approach can give useful signals earlier because it provides a probability rather than a binary threshold.
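The rough guide above can be reproduced with the standard two-proportion z-test sample-size formula. This is a sketch using conventional defaults (two-sided α = 0.05, 80% power), not something the calculator itself computes:

```python
from scipy.stats import norm

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect p1 -> p2 with a two-sided z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for significance level
    z_beta = norm.ppf(power)            # critical value for desired power
    p_bar = (p1 + p2) / 2               # pooled rate under the null
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p2 - p1) ** 2

# 5% baseline, 20% relative lift -> 6% target rate
n = sample_size_per_variant(0.05, 0.06)
print(round(n))  # roughly 8,000 per variant
```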

Turn numbers into product decisions

Blazeway helps you document the hypothesis behind every test and the learning behind every result. So significance becomes insight.

Start free