Apr 11, 2026
EHS leading indicators are routinely adopted by convention, not validation. Toolbox talk completion, near-miss counts, and inspection frequencies are standard metrics primarily because they are administratively easy to count, not because they are predictive. They are inherited from legacy corporate spreadsheets and software vendor defaults without any verification of their local signal.
ISO 45001:2018 Annex A specifically calls for "statistical operations" to reveal relationships, patterns and trends, yet the standard lacks the verification logic to prove predictive value. Consequently, most programs operate on pure institutional habit, rolling over metrics because they "feel important" rather than because they demonstrate statistical utility. This is superstition with a spreadsheet, not a safety strategy.
To solve this, validation must be treated as a system diagnostic. The true value of lag-correlation testing is not finding a handful of perfect proactive metrics, but exposing the reality of the metrics management currently trusts.
An inspection metric that reliably predicts incidents at a Texas refinery will degenerate into useless noise at a plant in São Paulo if the local reporting culture is compromised. The corporate dashboard will display the exact same metric, but the mathematical signal will be dead. The difference between "data" and "indicator" lies in the site-specific interaction of risk, trust, and operational latency. When an enterprise deploys a generic, global metric list without site-level validation, it is managing an illusion.
This article establishes a validation methodology for safety metrics and provides a browser-based Safety Metrics Screener for lag-correlation testing. The Screener is a free browser tool embedded in this article — no installation or server-side data transfer required. The methodology functions as a diagnostic tool. It does not just identify leading indicators, it audits the relationship between every metric and its outcome to expose the true operational structure of your management system.
Since the early foundations of Industrial Accident Prevention, safety management has separated lagging indicators (fatalities, injuries) from leading indicators (activities that predict them). The problem is that this classification is almost entirely qualitative. A metric receives the "leading" label simply because it describes a proactive task. If you accept this label without statistical proof, you aren't tracking predictive risk. You are just counting administrative tasks and calling it proactive safety. It creates a dangerous illusion of oversight.
Lagging indicators (incident rates, lost-time injury rates, fatality counts) measure what has already happened. They are essential for reporting and trend analysis, but they cannot prevent the next event. A leading indicator is defined by its relationship to those lagging outcomes: it must move before them and in the opposite direction.
The core question it asks is simple: in the months where this metric was high, were incidents lower one, two, or three months later? This statistical relationship is called lag correlation, and it is what this methodology tests.
The Campbell Institute's expert panel — drawn from EHS leaders at Cummins, Honeywell, ExxonMobil, Fluor, and others — defined a leading indicator as a measure that is simultaneously proactive, preventive, and predictive. All three criteria must be satisfied. A metric that is only predictive — one that moves before incidents but does not prevent them — does not qualify as "leading".
Lag-correlation testing audits two of those three criteria statistically — the predictive and preventive criteria. The third, proactive, cannot be tested by correlation: it is a structural property of the metric itself — whether a drop in the metric can trigger a preventive intervention. That is a management design question, not a data question. It is addressed in Section 2 before the statistical test begins.
The predictive criterion is tested by lag structure: the metric's correlation with future incidents must be strictly stronger than its correlation with concurrent incidents. If the peak correlation occurs at lag 0 (the same month), the metric is concurrent — regardless of its proactive label. The preventive criterion is tested by direction: the correlation must be negative, meaning higher activity predicts fewer incidents. A metric that peaks at a future lag with a positive correlation is predictive but not preventive. It moves before incidents rise, not before incidents fall. That is not a leading indicator — the screener will classify it as a Forewarning signal. Forewarning is a separate classification: it has temporal structure but signals risk accumulation rather than prevention. Acting on it as a control measure will not reduce risk; it tells you the window in which to act before the system fails.
This methodology uses lag-correlation testing to empirically audit these assumptions. But before using math to dismantle false assumptions, candidate metrics must pass four structural tests.
Before validation comes selection. Most EHS programmes build metric sets by adopting industry defaults or copying previous sites. Approving a vendor's pre-configured list without testing for a local signal is the same strategic failure. A generic candidate list tells you what other programmes found useful. It does not confirm predictive utility in your specific operational context.
Four properties determine whether a candidate is even worth validating. The first two — Temporally Prior and Actionable — test whether a metric can theoretically predict and prevent incidents. The final two — Sensitivity and Consistency — are the data prerequisites that determine whether the math can run at all.
The metric must occur before the outcome it predicts. A metric that moves concurrently with incidents is not a leading indicator. If there is no temporal priority, there is nothing to predict.
Temporal priority is the predictive criterion. But direction is equally required: a metric that rises before incidents rise (such as high overtime spiking before fatigue-related injuries) has temporal priority but fails as a leading indicator. The screener classifies this as a Forewarning signal.
A change in the metric must trigger an intervention that can prevent the outcome. This is the proactive criterion: the metric must enable a preventive action, not just signal that conditions are worsening. If inspection completion drops to 60%, deploying supervisors closes the gap. Tracking a metric with no intervention pathway isn't just useless—it's a documented liability.
From a governance perspective, validating a predictive signal shifts the legal threshold of foreseeability. Once a risk precursor is statistically validated, it constitutes "documented knowledge" in your management system. If that signal is then ignored without a recorded intervention, it becomes significantly harder to argue that a subsequent incident was not foreseeable. Validation should follow Governance: do not run the math until you have the authority to pull the operational brake. (See Section 13).
A leading indicator must vary enough over time to produce a detectable signal. If a compliance rate remains at a constant 99% while incident rates fluctuate, the measurement has failed to capture the variability of the field. Mathematically, data without variation provides no predictive value; it cannot identify changes in operational risk because the data series contains no new information.
Definitional changes, system migrations, collection gaps, and target-chasing all corrupt the time series. A training rate calculated against total headcount one month, then against permanent staff the next, is not the same metric. The denominator changed. The time series is broken. No statistical method corrects for a broken measurement definition. This is the property most frequently violated in real EHS data.
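Both data prerequisites can be checked mechanically before any correlation is run. Below is a minimal pre-screen sketch in Python, assuming the data has been loaded into pandas with a `month` column in YYYY-MM format; the column names and the 0.05 flatness threshold are illustrative, not part of the screener.

```python
import pandas as pd

def prescreen(df: pd.DataFrame, metric: str) -> dict:
    """Check the Sensitivity and Consistency prerequisites for one metric column.

    Assumes a 'month' column in YYYY-MM format and one column per metric
    (illustrative layout); the 0.05 flatness threshold is a rough default.
    """
    months = pd.PeriodIndex(df["month"], freq="M")
    expected = pd.period_range(months[0], months[-1], freq="M")
    series = df[metric].astype(float)

    mean = series.mean()
    cv = series.std(ddof=1) / mean if mean else float("nan")  # coefficient of variation

    return {
        "rows": len(df),
        "gaps_or_disorder": not months.equals(expected),   # Consistency check
        "missing_values": int(series.isna().sum()),
        "coefficient_of_variation": round(float(cv), 3),
        "too_flat_to_test": bool(cv < 0.05),               # Sensitivity check
    }
```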
The following table crystallizes how our operational requirements map directly to Campbell's definitional criteria to ensure the validation math remains grounded in safety theory:
| Campbell Criterion | Article Property | Operational Test |
|---|---|---|
| Proactive | Actionable | Management Test: Does a drop in this metric trigger a mandatory, pre-authorized intervention? |
| Preventive | Direction | Performance Test: Does an increase in this activity demonstrate a reduction in incident rates? |
| Predictive | Temporally Prior | Forewarning Test: Does the signal occur early enough to allow for a response before accidents happen? |
Resources like the Practical Guide to Leading Indicators from the Campbell Institute provide a useful candidate pool. But they are a starting point, not a validation. Consensus ends exactly at the point where the list meets your data. Whether inspection completion at your site predicts incidents one month later — or not at all — is something only your time series can answer. That is where the validation framework begins.
The validation framework is a three-stage system diagnostic: candidate selection, statistical testing, and operational confirmation. The Screener implements the mathematical engine in Stage 2, while the framework provides the structural context needed to transform raw correlation into a management finding.
Apply the four properties from Section 2 as a prioritization filter. Candidates that lack sensitivity or actionability are technically eligible for screening but are pre-disqualified from "Leading" status. Testing these suspected reactive or static metrics is still often worthwhile: the results provide the empirical evidence needed to authorize retiring redundant indicators or reconfiguring the reporting systems behind them.
Do not test every metric simply because it is available. Testing dozens of variables simultaneously introduces statistical noise and unnecessary administrative burden. The objective is to identify a reliable set of predictive indicators, not to quantify every data point available in your records.
Lag correlation identifies the time-delay between a leading indicator today and an incident outcome in the future. By testing offsets—typically at 1, 2, and 3 intervals—we identify whether a metric truly precedes incidents or simply reacts to them. This identifies the maximum response window: the time you have to intervene before the historical incident pattern repeats.
The analysis validates two requirements: Timing and Direction. Timing confirms that the strongest relationship occurs in the future. Direction confirms whether the relationship is preventive (negative correlation) or represents an escalation of risk (positive correlation). A positive correlation at the best future lag produces a Forewarning classification regardless of the signal's strength. The strength of the relationship is expressed as the correlation coefficient r, where |r| denotes its absolute value — stripped of sign — ranging from 0 (no relationship) to 1 (perfect relationship). Direction is carried by the sign; magnitude is read from |r|.
Test reliability depends on data density. While 24 data points (e.g., 24 months) provide a baseline for identifying seasonal trends, the fundamental requirement is a dataset with enough incident variability to produce a valid signal. For shorter timelines, you can increase statistical power by increasing granularity—shifting from monthly to weekly records—provided the data is clean enough to support the higher frequency.
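To make the mechanics concrete, here is a minimal sketch of the lag test in Python, assuming a monthly pandas DataFrame with one column per metric and one outcome column. The column names and the example output are illustrative; the Screener implements the same logic in the browser.

```python
import pandas as pd

def lag_profile(df: pd.DataFrame, metric: str, outcome: str,
                max_lag: int = 3, method: str = "spearman") -> dict:
    """Correlate this month's metric with the outcome 0..max_lag months later.

    A negative correlation at a future lag means higher activity preceded
    fewer incidents -- the pattern a leading indicator must show.
    """
    profile = {}
    for lag in range(max_lag + 1):
        # shift(-lag) pulls the outcome forward, pairing month t's metric
        # with month (t + lag)'s incidents.
        profile[lag] = df[metric].corr(df[outcome].shift(-lag), method=method)
    return profile

# Illustrative output for a 36-month series:
# lag_profile(monthly, "hazard_closure_rate", "incident_rate")
# -> {0: 0.16, 1: -0.31, 2: -0.56, 3: -0.22}
```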
Statistical classification is a starting point, not a conclusion. Every result — Leading, Forewarning, Concurrent, or Weak — must be interrogated operationally before it becomes a management finding:
If a metric is classified as "Forewarning", verify whether it tracks a genuine inflation of risk (like overtime-induced fatigue) or is simply a reactive count. This confirmation determines whether you govern it as an early-warning signal or re-evaluate the measurement.
Before running the screener, understand what it measures: the time-relationship and direction of your safety data. Four technical constraints determine what the results can and cannot tell you.
1. Directional Scope. Correlation is tested across 1 to 3 months (Lag 1 to 3) into the future. The current month (Lag 0) is excluded because monthly data is too coarse to distinguish the sequence of events.
If a safety talk on the 1st prevents an incident on the 25th, the monthly record registers both in the same period. The math cannot tell if the talk prevented the accident, or if the accident triggered the talk as a reactive response. Because of this limitation, we treat the current month strictly as a measure of how the system reacts to events. For a metric to be validated as Leading, it must show strength in the months that follow, where a clear preventive sequence exists.
Metrics with inherently longer lead-times (such as annual competency cycles) will appear as Weak. This does not confirm the absence of a signal — it confirms the absence of a short-term one. Validating cycles beyond 90 days requires at least 48–60 months of continuous history.
2. Exposure Rate-Normalisation. Raw counts are measures of volume, not efficiency. A count of inspections or hazard reports tells you how much activity occurred, but it doesn't tell you how well you are managing the risk.
If man-hours increase, both activity counts and incident counts will rise together. This isn't a "leading" signal; it's just the byproduct of having more people on site. This shared exposure is not safety performance. To find a real predictive signal, you must normalize the data (e.g., incidents per 200,000 hours) to measure the efficiency of the control relative to the risk exposure.
3. Time-Series Integrity. Changes to reporting definitions or data collection methods mid-dataset invalidate the results. If a new hazard reporting app was introduced or the definition of "Near Miss" was changed during the period, the statistical relationship will track the change in administrative process rather than the change in risk. Do not run the validation on datasets that combine different reporting methodologies.
4. Sample Volume. A minimum of 12 months of data is required, though reliability requires 24–36 months. On smaller datasets, one or two incidents will skew the results and create false signals. If you have less than 2 years of data, use the findings only to investigate the metric — not as a justification to change your safety strategy.
With these constraints in place, the Screener can be applied directly to your data.
The Safety Metrics Screener runs entirely in your browser—no installation required, no data sent to a server.
The date column uses YYYY-MM format (e.g., 2024-01). Rows must be in chronological order with no gaps. To convert daily logs into the required monthly CSV, use SUM for event counts (e.g., total inspections) and AVG for rate-based metrics (e.g., monthly compliance percentage or average headcount) to preserve the proportional signal across disparate month lengths.
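One way to perform that conversion, assuming a daily log with a `date` column (the file name, column names, and the use of pandas are all illustrative):

```python
import pandas as pd

daily = pd.read_csv("daily_log.csv", parse_dates=["date"])  # illustrative file and column names

monthly = (
    daily.set_index("date")
         .resample("MS")                 # one row per calendar month
         .agg({
             "inspections": "sum",       # event counts: SUM
             "compliance_pct": "mean",   # rates and percentages: AVG
             "headcount": "mean",
         })
         .reset_index()
)
monthly["month"] = monthly["date"].dt.strftime("%Y-%m")     # YYYY-MM, as the screener expects
monthly.drop(columns="date").to_csv("screener_input.csv", index=False)
```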
All activity metrics are rate-normalised (per 100 employees or per 200,000 hours) — not raw counts — in line with the Exposure Rate-Normalisation constraint in Section 4.
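A minimal sketch of that normalisation step, assuming the raw counts and exposure figures are available as columns in the monthly table (the column names are illustrative):

```python
import pandas as pd

def add_rates(monthly: pd.DataFrame) -> pd.DataFrame:
    """Convert raw monthly counts into exposure-normalised rates.

    Assumes columns named incident_count, hazard_report_count, hours_worked and
    headcount (illustrative). 200,000 hours ~ 100 full-time workers for a year.
    """
    out = monthly.copy()
    out["incident_rate"] = out["incident_count"] / out["hours_worked"] * 200_000
    out["hazard_report_rate"] = out["hazard_report_count"] / out["headcount"] * 100
    return out
```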
The calculation identifies the relationship between every metric and its outcomes at lags of 1 to 3 months into the future. As described in Section 4, Lag 0 (the current month) is ignored for prediction because monthly data cannot separate cause from effect within the same 30-day period.
Before running the calculation, select the methodology that matches your data density. Pearson correlation measures linear relationships and works well for high-volume, proportional activity metrics. However, real incident counts at well-run sites are often "sparse" (e.g., 0, 0, 1, 0, 0, 2). Spearman rank correlation handles these rare events better by looking at the rank of the month rather than the raw number. If your data includes zeroes and sparse outcomes like Lost Time Injuries, use Spearman to prevent outliers from distorting the Pearson result.
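The difference is easy to see on a sparse series. The sketch below uses scipy on made-up numbers purely for illustration:

```python
from scipy.stats import pearsonr, spearmanr

# Sparse incident counts typical of a well-run site, plus an activity series.
incidents = [0, 0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 3]
inspections = [41, 44, 30, 42, 45, 22, 40, 33, 46, 44, 47, 18]

r_p, p_p = pearsonr(inspections, incidents)
r_s, p_s = spearmanr(inspections, incidents)

# Pearson is pulled around by the handful of non-zero months;
# Spearman ranks the months, so the zero-heavy outcome column distorts it less.
print(f"Pearson  r = {r_p:+.2f} (p = {p_p:.3f})")
print(f"Spearman r = {r_s:+.2f} (p = {p_s:.3f})")
```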
The screener assigns one of four classifications: Leading, Concurrent, Forewarning, or Weak. The result is based solely on the statistical evidence — it tells you the structure of the relationship, not whether the metric belongs in your management system.
1. Leading — peak |r| occurs at lag 1–3, is at least 0.08 stronger than lag 0, the correlation is negative (higher activity predicts fewer incidents), and peak |r| ≥ 0.30. Before treating this result as a validated leading indicator, confirm the Actionability check from Section 2: does a drop in this metric trigger a pre-approved intervention? If not, the statistical result stands, but the metric cannot be governed as a leading indicator. Acting on a signal without a pre-approved response procedure creates the legal exposure described in Section 13.
2. Concurrent — peak |r| occurs at lag 0, or peaks at a future lag but the gain over lag 0 is less than 0.08. In both cases the lead is too weak to distinguish from a reactive pattern — the metric moves with incidents, not reliably ahead of them. These are candidates for the Fix-or-Retire Review (Section 10): examine the reporting process before retiring the metric.
3. Forewarning — peak |r| occurs at lag 1–3, is at least 0.08 stronger than lag 0, the correlation is positive, and peak |r| ≥ 0.30. The metric moves ahead of the outcome — not because it prevents it, but because it tracks its build-up (e.g., overtime rising before a fatigue-related incident cluster).
4. Weak — peak |r| remains below 0.30 across all lags. No meaningful relationship between this metric and future incidents is detectable in the data.
The 0.08 gain threshold is not a fixed standard. On datasets under 24 months, raising it to 0.10–0.12 reduces the risk of classifying random variance as a predictive signal.
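Expressed as code, the classification rules reduce to a few comparisons. The sketch below assumes a lag profile of the form {lag: r}, such as the one sketched earlier; the thresholds are the defaults described above, exposed as parameters so they can be raised on short datasets.

```python
def classify(profile: dict, min_strength: float = 0.30, min_gain: float = 0.08) -> str:
    """Classify a lag profile {lag: r} covering lag 0 through lag 3."""
    peak = max(profile.values(), key=abs)                 # strongest |r| at any lag
    if abs(peak) < min_strength:
        return "Weak"                                     # no usable relationship

    best_future_lag = max((lag for lag in profile if lag >= 1),
                          key=lambda lag: abs(profile[lag]))
    best_r = profile[best_future_lag]
    if abs(best_r) - abs(profile[0]) < min_gain:
        return "Concurrent"                               # peaks at lag 0, or lead too weak to trust

    return "Leading" if best_r < 0 else "Forewarning"     # sign carries the preventive test

# classify({0: 0.16, 1: -0.31, 2: -0.56, 3: -0.22})  ->  "Leading"
```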
The table below shows how |r| magnitude maps to classification and what each range means in practice:
| \|r\| range | Classification | Interpretation |
|---|---|---|
| 0.00 – 0.30 | No signal | No predictive utility. The metric is either disconnected from risk drivers or the data collection is too inconsistent to produce a signal. |
| 0.30 – 0.50 | Moderate signal | Typical range for real EHS data. If negative (−r): validated leading indicator. If positive (+r): Forewarning — the metric has temporal structure but fails the preventive criterion. It tracks deterioration, not prevention. |
| 0.50 – 0.70 | Strong signal | High predictive confidence. If negative (−r): sufficient to justify management intervention based on metric trends alone. If positive (+r): strong Forewarning signal — do not treat as a control measure. Investigate the mechanism before acting. |
| > 0.70 | Suspect — verify data | Statistically improbable for activity metrics tested against incident outcomes. Direction still applies, but the magnitude warrants scrutiny: likely target chasing, data manipulation, or the metric being a direct mathematical derivative of the outcome. |
At low-incident sites, |r| = 0.30–0.50 is typically the realistic ceiling. Datasets with long runs of zero-incident months cause Pearson correlation to become unreliable — a handful of non-zero months will dominate the calculation and inflate or distort the coefficient. Use Spearman instead: it ranks months rather than using raw values, so the zero-heavy outcome column does not collapse the result. Also check that the direction of the signal holds consistently across rolling 12-month windows, rather than relying on a single full-period result.
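One way to run that rolling-window check, reusing the shift-and-correlate approach from the earlier lag-profile sketch (the window length, lag, and column names are illustrative):

```python
import pandas as pd

def rolling_direction(df: pd.DataFrame, metric: str, outcome: str,
                      lag: int = 2, window: int = 12) -> pd.Series:
    """Spearman r at one fixed lag, recomputed over each rolling 12-month window.

    If the sign flips between windows, do not trust the full-period result.
    """
    future = df[outcome].shift(-lag)
    values = [
        df[metric].iloc[start:start + window].corr(
            future.iloc[start:start + window], method="spearman")
        for start in range(len(df) - window - lag + 1)
    ]
    return pd.Series(values, name=f"{metric}_lag{lag}_r")
```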
The screener returns a ranked, color-coded table — Green: Leading, Orange: Forewarning, Yellow: Concurrent, Grey: Weak — with the full lag profile for each metric shown inline. Export the results to CSV for management review or audit documentation.
The dataset includes six candidate metrics tested against Incident Rate (per 200,000 hours) as the outcome: Hazard Closure Rate (%), Overtime Rate (hrs/employee), Near Miss Rate (per 100 employees), Observation Rate (per 100 employees), Training Rate (hrs/employee), and Total Working Hours.
This is not a sanitized example built to demonstrate a clean result. The dataset contains one validated leading indicator, metrics that respond to incidents rather than precede them, and signals too weak to act on — the full diagnostic range you would encounter in a real audit.
The screener does not produce a clean list of leading indicators. It classifies each metric by what the data shows — not by what the metric is assumed to do. All findings used Spearman Rank Correlation, the method suited to sparse, zero-heavy incident data (as explained in Section 5). For context, correlations in real EHS data rarely exceed |r| = 0.50 — a result of 0.56 is strong; 0.37, which this dataset returns for two of the six metrics, reflects a reactive pattern, not a predictive one.
Training Rate failed to show a meaningful signal, returning a Weak result across all lags (peak |r| = −0.10). While the negative direction is theoretically preventive, the effect size is negligible, explaining only 1% of incident variance (r² = 0.01). This suggests the training program operates as a pre-scheduled administrative activity rather than a responsive risk control. Because the monthly training volume remains nearly identical regardless of the site's shifting risk profile, it functions as a statistical constant. This lack of independent movement explains the "No Signal" result: a static activity cannot correlate with, or predict, volatile incident outcomes. Apply the Fix-or-Retire Review (Section 10) to the Content layer to align curriculum with field risk drivers.
Near Miss Rate returned a Concurrent result (r = +0.37 at lag 0, p < 0.05) with no forward predictive power. This identifies a reactive reporting pattern: reporting volume spikes concurrently with incident clusters—rising from a baseline of 93 to 107 during high-incident periods—rather than preceding them. This statistical co-occurrence confirms that near misses are being surfaced in response to heightened supervisory scrutiny following an event, not through a proactive routine. Apply the Fix-or-Retire Review (Section 10) to the Collection layer to investigate reporting suppression.
Observation Rate also returned a Concurrent result (r = +0.37, p < 0.05), with the signal collapsing immediately after the current month (Lag 1: r = +0.03). This sharp drop in correlation confirms a total lack of forward predictive structure. Observations are being recorded as a reaction to trouble or administrative pressure rather than as a proactive early warning. This identifies a system where volume-based quotas have decoupled reporting from reality, turning the process into a lagging documentation exercise. Apply the Fix-or-Retire Review (Section 10) to the Incentives layer to restore signal integrity.
Total Working Hours returned a Weak result (peak |r| = +0.27) and failed to reach statistical significance. The data confirms that headcount is not a proxy for risk: the total volume of work hours explains 7% of the site's incident variance (r² = 0.07). This demonstrates that aggregate hour totals are a measure of operational scale, not risk intensity. Because the metric does not distinguish between a high-risk maintenance shutdown and routine administration, it provides no predictive signal for incident clusters. Total hours should be managed on operational dashboards, as they offer no empirical value for safety forewarning.
Hazard Closure Rate is the site's primary early-warning signal, returning a Leading result (r = −0.56, p < 0.001) at a 60-day lag. The data reveals that administrative speed is a direct driver of field safety: resolving hazards today creates a measurable reduction in incidents two months later. The relationship is negligible in the current month (r = +0.16 at lag 0) but more than triples in magnitude, and reverses direction, by 60 days (r = −0.56 at lag 2). This confirms that closure speed isn't just an office metric; it is a proactive control that protects the site from future risk, explaining nearly one-third of the total incident variance (r² = 0.31).
Overtime Rate validated as a Forewarning indicator (r = +0.55, p < 0.001) at a 30-day lag. The data reveals that the impact of overtime is cumulative: it has almost no relationship to incidents in the current month (r = +0.14 at lag 0) but becomes a strong driver of risk 30 days later. This confirms that fatigue takes time to build. An overtime spike today is not an immediate crisis, but it marks the start of a 30-day "danger zone" in which incident clusters become significantly more likely as the system's capacity to absorb strain is exhausted.
The table below summarises all findings, sorted by classification result.
| Metric Name | Lag (mo) | Correlation (r) | Effect (r²) | Classification |
|---|---|---|---|---|
| Hazard Closure Rate (%) | 2 | −0.56 | 0.31 | Leading |
| Overtime Rate (hrs/emp) | 1 | +0.55 | 0.30 | Forewarning |
| Near Miss Rate (per 100 emp) | 0 | +0.37 | 0.14 | Concurrent |
| Observation Rate (per 100 emp) | 0 | +0.37 | 0.14 | Concurrent |
| Training Rate (hrs/emp) | 2 | −0.10 | 0.01 | Weak |
| Total Working Hours | 2 | +0.27 | 0.07 | Weak |
A Concurrent result means the metric fails the predictive criterion. This happens in two cases: the peak correlation occurs at Lag 0 (the same month as the incident), or it peaks at a future lag but the gain over Lag 0 is too small (below 0.08) to confirm a genuine lead. In both cases the metric is a record of what has already occurred rather than a forecast of what is coming. When a metric labeled "proactive" returns a Concurrent result, it is empirical evidence that the activity is triggered by incidents, not by a prevention routine.
The classification is site-specific, representing an audit of the local reporting culture. If Near Miss Rate is a validated leading indicator at Site A but Concurrent at Site B (as seen in Finding 2), Site A has built a reporting culture where risks surface independently. In contrast, Site B's reporting only spikes in response to trouble. For reactive cultures, the Fix-or-Retire Review (Section 10) provides the diagnostic steps needed to investigate whether the metric can be reconfigured or if the reporting system itself is broken.
While Concurrent metrics cannot predict the next failure, they are essential for verifying that the work is actually getting done. Tracking whether your administration fulfills its immediate promises—such as closing investigation actions on time—is a vital "health check" for safety performance. These metrics prove your management process is functioning, even if they hold no power to forecast the future.
A Forewarning indicator passes the predictive criterion but fails the preventive criterion—the metric moves before incidents, but it rises as risk increases. While the instinctive response is to discard these results, they provide a strategic warning of systemic strain. They identify precisely when the site's operational demand is exceeding its safe capacity.
Consider Finding 6 (Overtime Rate). Every spike in overtime represents a measurable accumulation of fatigue. The data shows that this strain does not cause immediate failure; instead, incident clusters peak one month later. In this context, the metric is an early warning of system exhaustion. It identifies the moment where the site has run out of "safety margin."
Strategic leaders use these results to determine their lead time for intervention. If a Forewarning indicator is validated with a 30-day lag, the system is giving you a one-month head start to act. This is the timeline to increase supervisory presence or suspend non-essential work before the accumulated fatigue results in an injury. These metrics tell a leader exactly how much time they have left to deploy controls before the system fails.
A Weak or Concurrent result does not automatically mean the metric is worthless. It may mean the data collection is broken, the reporting culture suppresses the signal, or the metric is measuring the wrong layer of activity. Before retiring a metric, investigate the source of the failure.
If a site has recorded zero incidents for 24 months, you cannot validate a metric as leading — there is no outcome variance to correlate against. In this case, apply the same lag-correlation method to the activity chain itself: test whether inspection frequency reliably precedes corrective action closure rates. If it does, the activity chain has temporal structure even if the incident outcome does not.
Three structural layers determine whether a metric can carry a signal: the Collection layer (how and when the data is captured), the Incentives layer (whether reporting is rewarded, penalised, or quota-driven), and the Content layer (whether the measured activity actually addresses the site's risk drivers).
Apply the protocol only if at least one of the following applies:
If none of these apply and the metric remains Weak after investigation, retire it. A metric that carries no predictive or compliance value is administrative overhead.
Once individual metrics are classified and weak signals retired, the next question is whether your validated metrics work better together than alone.
A single correlation isolates one metric, but it cannot detect where metrics overlap or miss context. Two metrics might appear strong individually but simply repeat the same information. Conversely, two weak metrics may form a strong predictive signal only when viewed together.
The Combined Metric Analyzer measures how much of the historical variation in a target outcome the selected metrics can collectively explain. The output is a score from 0 to 1: a score of 0.50 means the model accounts for half of the historical ups and downs in that outcome. The result is broken down into relative percentages for each metric; the metric with the highest share is the Primary driver. Any metric contributing less than 1% to the model is redundant, adding no new information.
To measure the gain from combining metrics, compare the combined score against the strongest individual baseline. If the combined score is higher, the signals are complementary. If it matches the single-metric baseline, the additional metrics add no explanatory value — use the simpler model. Always focus your investigation on the Primary driver, as it accounts for most of the model's explanatory power for that specific risk pattern.
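A minimal sketch of that comparison, assuming an ordinary least-squares model on standardised metrics. The relative weights below are shares of the absolute standardised coefficients, which is one simple heuristic and not necessarily the exact method the Analyzer uses.

```python
import numpy as np

def combined_score(X: np.ndarray, y: np.ndarray) -> tuple[float, np.ndarray]:
    """Return (R^2, relative weight per metric) for a combined linear model.

    X: one column per candidate metric (already rate-normalised, no constant
    columns); y: the outcome series. Weights are shares of the absolute
    standardised coefficients -- a simple heuristic, not a causal attribution.
    """
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)      # standardise predictors
    yz = (y - y.mean()) / y.std()                  # standardise outcome
    A = np.column_stack([np.ones(len(yz)), Xz])    # add intercept
    coef, *_ = np.linalg.lstsq(A, yz, rcond=None)  # ordinary least squares
    r2 = 1.0 - np.var(yz - A @ coef) / np.var(yz)  # share of variance explained
    betas = np.abs(coef[1:])
    return float(r2), betas / betas.sum()

# Compare r2 against the best single metric's r-squared (its peak correlation,
# squared). If the combined score is no higher, keep the simpler model.
```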
Using the specimen data against Incident Rate, the combined model identifies Overtime Rate as the Primary driver, carrying 77% of the total weight. The site leader should investigate the overtime trend and implement fatigue controls before the next incident spike.
Validation produces two types of output: signal classifications for each metric and, where multiple leading indicators exist, a combined model identifying the primary driver. Both feed directly into decisions — which metrics to act on, which to reassign, and which to retire. Implementation follows a five-step sequence:
The enterprise will maintain a global metric list — and should, for comparing performance across sites. That list is a reporting standard, not a predictive one. A metric that is Leading at a high-hazard manufacturing site may be Weak at a logistics hub with different risk drivers. Which metrics actually predict incidents varies by site — validate locally. The screener output, not the metric list, is what gets localised.
Intervene immediately on signal drops. If your leading metric has a 2-month lag, a drop today is a 60-day warning. The lag period is your maximum response window (see Section 3). Assign a named owner and a response deadline the same day the drop is detected.
The signal response rules above tell you which metrics to act on and when. They do not tell you how to read the overall health of a site at a glance. That requires combining your validated leading indicators into a single composite — the Safety Performance Index — which translates your screener output into a Green / Amber / Red operational signal. That is covered in the next article in this series: Building the Safety Performance Index.
Validation does not end with classification. Once the screener confirms that a metric reliably precedes incidents, you are no longer operating reactively — you have evidence that you knew risk was rising before the incident occurred. That shift carries legal and ethical weight, which is the subject of Section 13.
Operating a predictive framework brings a specific duty of care. Once you demonstrate that a metric reliably precedes incidents, you possess foreknowledge of risk. The core legal question is never whether you ran the validation, but what you operationally did once the predictive signal was identified.
A validated leading indicator constitutes a formal warning light in your management system. The duty of care is triggered once a risk is recognized or becomes mathematically foreseeable.
A recorded management response to a signal is your primary legal defence. The legal exposure sits in the gap between knowing a signal exists and closing the control it points to. This varies by jurisdiction — involving legal counsel is a necessary step. The legal risk is highest in the Interim Period — the window between the moment a signal is confirmed and the moment a control response is documented and deployed. If your validation audit identifies a new leading indicator, your duty of care is triggered immediately. Do not conduct validation audits as "informal research" without a pre-authorized plan to act on any positive results — at minimum, that plan should name who is responsible, what response is required, and within what timeframe.
True predictive capability is not a software feature. It begins with the integrity of your data and ends with a decision. How to turn your validated signals into a live operational dashboard — the Safety Performance Index — is covered in the next article in this series: Building the Safety Performance Index.