How We Predict F1 Races
A complete guide to our machine learning methodology
“We publish our probability for every driver before every race — then we score ourselves against reality. No hiding, no excuses.”
Why Probability, Not Picks
Most prediction sites give you a single name: "Russell will win at Suzuka." We give you a number: Russell has a 41% chance. The distinction matters.
A 41% probability means Russell is our strongest candidate — but also that there is a 59% chance someone else takes the chequered flag. That is not hedging. That is honesty about how motorsport works. One hydraulic failure, one safety car, one botched pit stop, and the race is rewritten.
Binary picks reward luck. Probabilities reward understanding. If we say a driver has a 15% chance of winning, and he wins three times across twenty races where we gave him 15%, we were right — not wrong. Calibration is the measure of a prediction system, not any single headline result.
This is why we publish a full probability distribution for every driver, every race. And it is why we score ourselves publicly using Brier scores — the same standard used in meteorology and quantitative finance.
The Four Signals
Our model fuses four distinct signals, each capturing a different dimension of performance:
1. Elo Ratings — Historical Strength
Every driver and team carries an Elo rating, updated after every race since 2010. The system is borrowed from chess: when a lower-rated driver beats a higher-rated one, the rating transfer is larger. Our implementation goes further than standard Elo — we maintain separate ratings for drivers, teams, and driver-circuit-type combinations (street circuits, high-speed tracks, technical layouts).
A driver's combined Elo is weighted: 60% global driver rating, 30% team rating, 10% circuit-type specialisation. This means a strong driver in a weak car is rated differently from a strong driver in a strong car — because in Formula 1, the car matters as much as the talent.
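For intuition, here is a minimal sketch of the Elo mechanics described above. The K-factor and function shapes are illustrative assumptions, not our production values; only the 60/30/10 blend comes from the text.

```python
# Minimal sketch of the Elo mechanics. K and the function shapes are
# illustrative; only the 60/30/10 blend is taken from the text above.
K = 32  # update step: larger K moves ratings faster after each race

def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo win expectancy of A over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating: float, expected: float, actual: float) -> float:
    """An upset (low expectancy, actual win) transfers more points."""
    return rating + K * (actual - expected)

def combined_elo(driver: float, team: float, circuit_type: float) -> float:
    """60% driver, 30% team, 10% circuit-type specialisation."""
    return 0.60 * driver + 0.30 * team + 0.10 * circuit_type
```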
2. Force Ensemble — Current Form
Elo is historical. It tells you who has been fast. Our ensemble model tells you who is fast right now. It combines three machine learning algorithms — XGBoost, LightGBM, and Ridge regression — weighted 45/40/15 to produce a "force score" between 0 and 1 for each driver.
The ensemble ingests 52 features: qualifying gaps, practice pace, recent finishing positions, championship standing, circuit history, tyre degradation patterns, weather sensitivity, and more. Each feature is engineered from primary source data — never from third-party APIs.
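The blend itself is simple to sketch; all of the difficulty lives in the feature engineering and model training, which this snippet assumes has already happened. The prediction arrays are hypothetical.

```python
import numpy as np

# Sketch of the 45/40/15 blend. Training and feature engineering happen
# elsewhere; the per-driver prediction arrays below are hypothetical.
WEIGHTS = {"xgb": 0.45, "lgbm": 0.40, "ridge": 0.15}

def force_score(preds: dict) -> np.ndarray:
    """Blend per-driver model outputs into one force score in [0, 1]."""
    blended = sum(WEIGHTS[name] * preds[name] for name in WEIGHTS)
    return np.clip(blended, 0.0, 1.0)

preds = {
    "xgb":   np.array([0.71, 0.55, 0.32]),
    "lgbm":  np.array([0.68, 0.59, 0.35]),
    "ridge": np.array([0.62, 0.50, 0.40]),
}
print(force_score(preds))  # one force score per driver
```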
3. Recent Form — Momentum
The third signal captures short-term momentum: results from the last three to five races. A driver who finished P1, P2, P1 in the last three races carries a different profile from one who finished P1, P8, P12, even if their season average is similar. We map finishing positions to a 0-1 scale and compute a rolling trend line.
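A minimal sketch of that mapping, assuming a 20-car grid and a simple linear fit for the trend; the exact normalisation and trend estimator in production may differ.

```python
import numpy as np

# Sketch: map finishing positions (1 = best) onto a 0-1 scale for a
# 20-car grid, then fit a per-race trend. Both choices are illustrative.
def position_score(pos: int, grid_size: int = 20) -> float:
    return (grid_size - pos) / (grid_size - 1)  # P1 -> 1.0, P20 -> 0.0

def form_trend(recent_positions: list) -> tuple:
    """Return (mean score, per-race slope) over the last 3-5 results."""
    scores = np.array([position_score(p) for p in recent_positions])
    slope = np.polyfit(np.arange(len(scores)), scores, 1)[0]
    return float(scores.mean()), float(slope)

print(form_trend([1, 2, 1]))   # strong and stable
print(form_trend([1, 8, 12]))  # same recent winner, fading momentum
```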
4. Team Car Performance — Constructor Strength
Formula 1 is a constructor championship disguised as a driver championship. The car's contribution to lap time dwarfs driver skill in most analyses. Our fourth signal isolates team performance: average finishing position, reliability record, pit crew speed, and upgrade trajectory across the season.
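A sketch of how such a composite might be assembled. The component names, normalisations, and equal weighting here are illustrative assumptions, not our production formula.

```python
# Sketch of a composite constructor-strength score. Component choices,
# normalisations, and equal weighting are illustrative assumptions.
def team_strength(avg_finish: float, finish_rate: float,
                  pit_rank: int, upgrade_trend: float) -> float:
    """All components normalised to [0, 1], then averaged."""
    finish_component = (20 - avg_finish) / 19  # lower average finish = better
    pit_component = (10 - pit_rank) / 9        # pit-crew rank among 10 teams
    trend_component = min(max(0.5 + upgrade_trend, 0.0), 1.0)
    return (finish_component + finish_rate + pit_component + trend_component) / 4

# Hypothetical front-running team: average P3 finish, 95% classification
# rate, second-fastest pit crew, mildly positive upgrade trajectory.
print(team_strength(avg_finish=3.0, finish_rate=0.95,
                    pit_rank=2, upgrade_trend=0.1))
```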
Dynamic Weighting
The four signals are not weighted equally across the season. Early in the year, when we have little data on the current season's cars, historical Elo and team strength carry more weight. As the season progresses and we accumulate session data, the ensemble model and recent form become dominant.
The weighting schedule:
- Rounds 1-2: Elo 20%, Model 30%, Form 20%, Team 30%
- Rounds 3-5: Elo 10%, Model 35%, Form 25%, Team 30%
- Round 6 onwards: Elo 5%, Model 40%, Form 30%, Team 25%
This approach means our early-season predictions carry wider uncertainty — which is exactly correct. After two races, we know less than we will after ten. The model acknowledges this rather than pretending otherwise.
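The schedule translates directly into a lookup, sketched below. The weights are the published ones; the function shape is illustrative.

```python
# The published weighting schedule as a lookup. Weights come from the
# table above; the function shape is a sketch.
def signal_weights(round_number: int) -> dict:
    if round_number <= 2:
        return {"elo": 0.20, "model": 0.30, "form": 0.20, "team": 0.30}
    if round_number <= 5:
        return {"elo": 0.10, "model": 0.35, "form": 0.25, "team": 0.30}
    return {"elo": 0.05, "model": 0.40, "form": 0.30, "team": 0.25}

def blended_score(signals: dict, round_number: int) -> float:
    """Weighted sum of the four per-driver signal scores."""
    w = signal_weights(round_number)
    return sum(w[k] * signals[k] for k in w)
```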
Monte Carlo Simulation
Raw force scores are not probabilities. To convert them, we run a Monte Carlo simulation: 10,000 virtual races for every Grand Prix.
Each simulation samples from uncertainty distributions around every driver's performance. It models random events — mechanical failures, safety cars, first-lap incidents — using historical base rates for each circuit. A circuit with a 65% historical safety car rate will produce more safety cars in the simulation than a circuit with a 25% rate.
From 10,000 simulated races, we extract:
- P(win): How often each driver won across all simulations
- P(podium): How often they finished in the top three
- P(points): How often they scored (top ten)
- P(DNF): How often they retired
- E[position]: Their average finishing position
- E[points]: Their average points haul
The simulation also produces a full position distribution — we know not just that a driver's expected position is 4.2, but that they finish P1-P3 in 45% of simulations and P6-P10 in 30%. This richness is what makes Monte Carlo valuable: it captures the shape of uncertainty, not just its centre.
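A stripped-down sketch of the simulation loop, with illustrative noise and DNF distributions standing in for our calibrated ones. Retired cars are simply classified last here, which is a simplification.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Stripped-down Monte Carlo loop. The noise scale and the DNF handling
# (retired cars classified last) are illustrative simplifications.
def simulate(force: np.ndarray, dnf_rate: float, n_sims: int = 10_000):
    n = len(force)
    wins = np.zeros(n)
    position_sum = np.zeros(n)
    for _ in range(n_sims):
        perf = force + rng.normal(0.0, 0.1, n)    # per-race performance noise
        perf[rng.random(n) < dnf_rate] = -np.inf  # mechanical failures etc.
        order = np.argsort(-perf)                 # strongest performance -> P1
        wins[order[0]] += 1
        finish = np.empty(n)
        finish[order] = np.arange(1, n + 1)       # finishing position per driver
        position_sum += finish
    return wins / n_sims, position_sum / n_sims  # P(win), E[position]

p_win, e_pos = simulate(np.array([0.80, 0.72, 0.55, 0.40]), dnf_rate=0.12)
print(p_win, e_pos)
```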
Calibration — Are We Actually Right?
A prediction system is only as good as its calibration. When we say a driver has a 30% chance of winning, that outcome should occur roughly 30% of the time across many predictions.
We measure calibration using the Brier score: the mean squared error between our predicted probabilities and the binary outcomes. A perfect Brier score is 0.000 (unattainable in practice). An uninformed baseline that spreads probability evenly across a 20-driver field scores approximately 0.090. Our target is to consistently beat both the grid-position baseline and the championship-standings baseline.
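As a minimal sketch (Brier conventions vary; this per-prediction mean over driver-level win markets is one common form):

```python
import numpy as np

# Brier score as the mean squared error between predicted probabilities
# and 0/1 outcomes. Conventions vary; this per-prediction mean over
# driver-level win markets is one common form.
def brier(probs: np.ndarray, outcomes: np.ndarray) -> float:
    return float(np.mean((probs - outcomes) ** 2))

probs = np.array([0.41, 0.22, 0.15, 0.08])  # hypothetical pre-race P(win)
outcomes = np.array([1, 0, 0, 0])           # the 41% favourite won
print(brier(probs, outcomes))               # ~0.106; lower is better
```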
We also decompose the Brier score into three components:
- Reliability: How well calibrated are our probabilities? (lower is better)
- Resolution: How much do our predictions differ from the base rate? (higher is better)
- Uncertainty: The inherent randomness of the sport (fixed, not in our control)
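This is the standard Murphy decomposition. Written out, with forecasts grouped into K bins, where bin k contains n_k forecasts of value f_k and has observed outcome frequency ō_k, and ō is the overall base rate:

```latex
\mathrm{BS}
= \underbrace{\tfrac{1}{N}\textstyle\sum_{k=1}^{K} n_k (f_k - \bar{o}_k)^2}_{\text{reliability}}
- \underbrace{\tfrac{1}{N}\textstyle\sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2}_{\text{resolution}}
+ \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}}
```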
Our calibration curve, updated after every scored race, is published on our Track Record page. When we predict 20%, the actual frequency should cluster around 20%. When we predict 60%, reality should be close to 60%. Deviations tell us where the model needs work.
Feature Engineering — 52 Dimensions of Performance
The ensemble model does not see raw lap times. It sees engineered features — transformations of raw data designed to capture meaningful patterns.
Qualifying features measure grid position and the gap to pole in percentage terms. A driver who qualifies 0.3% off pole at a circuit where the top ten are separated by 0.8% is in a very different position from one who is 0.3% off pole where the spread is 2.5%.
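A sketch of that field-spread normalisation, with hypothetical lap times:

```python
# Sketch of the field-spread-normalised qualifying gap described above.
# Times and the exact normalisation are illustrative.
def quali_gap_feature(driver_time: float, pole_time: float,
                      p10_time: float) -> float:
    """Gap to pole as a fraction of the top-ten spread."""
    gap_pct = (driver_time - pole_time) / pole_time
    spread_pct = (p10_time - pole_time) / pole_time
    return gap_pct / spread_pct if spread_pct > 0 else 0.0

# 0.3% off pole with a 0.8% top-ten spread vs. the same gap with a
# 2.5% spread: the first driver sits mid-pack, the second near the front.
print(quali_gap_feature(90.27, 90.0, 90.72))  # ~0.375
print(quali_gap_feature(90.27, 90.0, 92.25))  # ~0.12
```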
Practice pace features aggregate performance across FP1, FP2, and FP3, weighted by recency and adjusted for fuel loads and tyre compounds. Free practice is noisy, but it contains signal about car performance on a specific circuit.
Circuit history features look at each driver's record at the current circuit over the past three years: average finishing position, best result, and win rate. Some drivers consistently overperform at specific circuits — a pattern the model captures.
Recent form features compute rolling averages and trends over the last three to five races: finishing position, grid position, win rate, podium rate, and points per race. A driver trending upward is weighted differently from one trending downward.
Constructor features capture team-level dynamics: championship position, average team finishing position, and teammate performance. If both drivers from a team are consistently finishing in the top five, the car is clearly strong — and our predictions should reflect that for both drivers.
The full feature set spans 52 dimensions, each normalised and validated against historical data to ensure it carries genuine predictive signal rather than noise.
Data Independence
Every data point in our system comes from a primary source:
- Live timing and telemetry from livetiming.formula1.com via SignalR
- Official results, calendars, and driver data from formula1.com
- Regulations and stewards' decisions from fia.com
- Historical records from the Ergast CSV archive (a one-time import covering 1950-2024)
We do not depend on any third-party API. This is deliberate. Third-party APIs can change, go offline, or introduce errors we cannot control. By scraping and validating primary sources ourselves, we own the entire data pipeline — from raw timing data to final predictions.
This independence extends to our models. We do not use pre-trained embeddings, external prediction feeds, or crowd-sourced data. Every number on our site is generated from our own pipeline.
Scoring and Accountability
After every race, we score our predictions automatically. The scorer compares our pre-race probability distribution against the actual result and computes Brier scores, log loss, and accuracy metrics.
These scores are published on our Track Record page within hours of each race finishing. We show:
- Per-race Brier score (how good was this specific prediction?)
- Cumulative Brier score (how good are we across the season?)
- Comparison against baselines (do we beat grid position? Do we beat championship standings?)
- Whether our predicted winner actually won
- Whether the actual winner was in our top three
We also run a walk-forward backtest against historical data: the model is trained on data up to year N-1 and tested on year N, rolling forward from 2018 to 2025. This backtest validates that our model genuinely outperforms simpler baselines across multiple seasons — not just the current one.
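The split logic is simple to sketch; the years come from the text above, and the generator shape is illustrative.

```python
# Sketch of the walk-forward protocol: train on all seasons up to N-1,
# test on season N, rolling the boundary forward year by year.
def walk_forward_splits(first_test: int = 2018, last_test: int = 2025):
    for test_year in range(first_test, last_test + 1):
        train_years = list(range(2010, test_year))  # ratings history starts 2010
        yield train_years, test_year

for train, test in walk_forward_splits():
    print(f"train {train[0]}-{train[-1]} -> test {test}")
```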
Our commitment is simple: every prediction is published before the race. Every score is published after. If the model underperforms, the data shows it. There is nowhere to hide.
What the Model Cannot Do
Transparency means acknowledging limitations:
- The model cannot predict mechanical failures for specific drivers. It uses historical DNF rates as a base rate, but whether Verstappen's gearbox survives a particular race is outside the model's knowledge.
- First-lap incidents are modelled as random noise, not as driver-specific behaviour. A driver with an aggressive start style may gain or lose positions; the model captures the distribution but not the specific outcome.
- Strategic decisions during the race (when to pit, whether to undercut) are not modelled in advance. Our Monte Carlo simulation explores different strategy scenarios, but real-time strategy depends on tyre degradation, track position, and race director decisions.
- Regulation changes create prediction cliffs. The 2026 regulation reset is the most significant in a decade. Historical Elo ratings become partially obsolete when the car concept changes fundamentally. We handle this with a dampening factor, pulling all ratings toward the mean by 80%, but the early-season uncertainty is genuinely high.
- Rookies are difficult. A driver with zero F1 race history has no Elo, no form signal, and no circuit history. We assign a rookie penalty and rely more heavily on team performance and qualifying data, but rookie predictions carry wider uncertainty.
These limitations are not bugs. They are the honest boundary of what statistical modelling can achieve in a sport where a single raindrop can change the outcome.