November 20, 2025
The concept of raw EHS data as "crude oil" highlights a critical reality: it is a substance with immense potential, but one that is messy, unstructured, and unusable in its natural state. An AI cannot learn from data filled with inconsistencies, subjective opinions, and undefined terms.
Attempting to run an AI on raw data yields hallucinations: the AI confidently invents facts because it lacks a clean baseline. This erodes trust and wastes resources. The system won't just fail; it will actively mislead you, identifying risks that don't exist while missing the ones that do.
The solution is not the software itself, but what you do with your data. Buying AI-enabled software only creates a container; the intelligence comes from the refinement process. This non-negotiable industrial process — The Refinery — is the critical EHS function for transforming raw data into a standardized, usable asset. This refinery process consists of three core components: governance, cleansing, and labeling.
Data governance is a term that causes many leaders to tune out, imagining endless meetings and binders of rules.
To make this practical, we need to reframe it. Think of governance not as IT bureaucracy, but as Digital 5S. Just as you wouldn't tolerate a cluttered, unsafe shop floor where tools are missing or mislabeled, you cannot tolerate a cluttered, unsafe dataset.
The core of this Digital 5S approach is building an EHS Classification System (Taxonomy) — a rigid ruleset designed to eliminate confusion. Its entire purpose is to ensure that when two people report the same event, they use the same language.
The most effective way to implement this taxonomy is to redesign your data collection forms: remove free-text fields wherever possible and replace them with mandatory, multi-level dropdown menus. This is the practical setup of a standardized recording framework, such as the ESAW methodology.
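To make this concrete, here is a minimal sketch of what such a taxonomy might look like behind the form, expressed as a Python config. The category names are illustrative placeholders, not official ESAW codes; the point is that each field has its own controlled vocabulary, and the second dropdown level depends on the first.

```python
# Illustrative multi-level dropdown taxonomy (category names are examples,
# not official ESAW codes). Each field has its own controlled vocabulary,
# so event type, severity, and deviation are never mixed in one list.
TAXONOMY = {
    "event_type": ["Injury", "Near Miss", "Property Damage", "Environmental"],
    "severity": ["First Aid", "Medical Treatment (MTC)", "Lost Time (LTI)"],
    "deviation": {
        "Loss of Control": ["Of Machine", "Of Hand Tool", "Of Object"],
        "Slip/Trip/Fall": ["Fall on Same Level", "Fall from Height"],
    },
}

def dropdown_options(field, parent=None):
    """Return the allowed options for a form field; pass a parent
    category to get the second dropdown level."""
    level = TAXONOMY[field]
    return level[parent] if parent else list(level)

print(dropdown_options("deviation"))                    # top-level choices
print(dropdown_options("deviation", "Slip/Trip/Fall"))  # second level
```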
Most legacy EHS reporting systems share the same fundamental flaw: they mix different types of data together. A frontline manager opens the incident form and sees one dropdown containing event types (Injury, Near Miss), severity levels (LTI, MTC, First Aid), and vague root cause categories (Personnel Factors) all jumbled together. This creates confusion and forces flawed decisions.
Here's what happens when you mix these categories: the rare, high-consequence events you actually want to predict (like a Laceration) get buried by the hundreds of low-consequence events (like Near Misses). In data science, this is called Noise vs. Signal (Class Imbalance).
Think of this as trying to hear a whisper (the injury pattern) in a crowded stadium (the thousands of near misses). The AI will naturally listen to the loudest sound — the near misses — and stop looking for the rare triggers that actually lead to injuries. This is why AI models fail in EHS: the machine learns that predicting "Near Miss" is the safe guess 99% of the time. You'll never find the pattern you're looking for.
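A toy simulation makes the trap visible. In this sketch (all numbers are invented for illustration), a "model" that always guesses Near Miss scores 99% accuracy while detecting zero injuries:

```python
# Toy illustration of class imbalance: with 99% "Near Miss" records,
# always guessing the majority class looks 99% accurate while never
# detecting a single injury.
import random

random.seed(42)
events = ["Near Miss"] * 990 + ["Laceration"] * 10
random.shuffle(events)

predictions = ["Near Miss"] * len(events)  # the "safe guess" strategy

accuracy = sum(p == e for p, e in zip(predictions, events)) / len(events)
injuries_caught = sum(p == e == "Laceration" for p, e in zip(predictions, events))

print(f"Accuracy: {accuracy:.0%}")              # 99% -- looks impressive
print(f"Injuries detected: {injuries_caught}")  # 0 -- useless for prediction
```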
And it gets worse: fields like "Description" and "Immediate Cause" become text dumps for narratives and notes (unstructured data) where people mix what happened (the mechanical event) with why it happened (the systemic cause). The AI gets noise instead of structure.
The fix isn't just a prettier form—it's using a structured Classification System (Taxonomy) based on proven frameworks like ESAW (European), OIICS (North American), or TOOCS (Australian). These frameworks force clarity and objectivity into your data.
This structured workflow isn't just better for AI — it's faster and easier for the people filling out the forms. It's a guided process where nobody's guessing anymore. When you build good data governance into your form design, people actually use it. And this separation between what happened and why it happened? That's what unlocks diagnostic analytics — the difference between simply recording incidents and actually understanding their root causes.
Governance (Section 1) ensures your future data is clean. But that only stops the bleeding. It does nothing about the massive "Data Debt" (years of poorly structured historical data) sitting in your database right now.
This is where Data Cleansing comes in. This isn't just about tidying up; it's about converting that Data Debt into equity you can actually use.
Your historical data is a strategic asset, but right now it's frozen in a state of disrepair. Data cleansing is how you standardize, repair, and curate that data at scale so you can actually use it to train AI models. You can't build a reliable model on a broken foundation.
Standardization is your most critical cleansing strategy because it answers the two fundamental questions your AI needs answered: what is happening, and where it is happening. Without it, your data is just a pile of isolated reports, useless for pattern recognition, and your predictive models will fail.
The Problem: Your database contains "slip," "slipped on wet floor," "fall due to water," and "water hazard." An AI sees these as four different things. Ask it "How many slip/trip hazards did we have?" and it can't answer.
The Solution: Create a mapping file—a simple two-column spreadsheet. Column A lists the messy terms (e.g., "slipped on wet floor"). Column B has the clean, standardized term from your taxonomy (e.g., "Slip/Trip - Wet Surface"). Hand this to your IT team. They write a find-and-replace script, and what would've been a six-month manual cleanup becomes a ten-minute automated job.
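As a rough sketch of that script, assuming the mapping lives in a `term_mapping.csv` with `messy_term` and `standard_term` columns and your export is an `incidents.csv` with a `hazard_type` column (all file and column names here are illustrative):

```python
# Minimal sketch of the find-and-replace job (file and column names are
# illustrative). Reads the two-column mapping file and rewrites every
# messy term to its standardized taxonomy equivalent.
import pandas as pd

mapping = pd.read_csv("term_mapping.csv")   # columns: messy_term, standard_term
lookup = dict(zip(mapping["messy_term"].str.lower(), mapping["standard_term"]))

incidents = pd.read_csv("incidents.csv")    # column: hazard_type
clean = incidents["hazard_type"].str.lower().map(lookup)
incidents["hazard_type_std"] = clean.fillna(incidents["hazard_type"])

# Anything that didn't map is flagged so an expert can extend the file.
unmapped = incidents.loc[clean.isna(), "hazard_type"]
print(f"Unmapped terms needing review: {sorted(unmapped.unique())}")

incidents.to_csv("incidents_standardized.csv", index=False)
```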
The Problem: One report says "Line 3," another says "L3-Conveyor," a third says "Conveyor, Line 3." The AI sees three different places and can't connect the dots. This is the biggest barrier to predictive modeling.
The Solution: Don't build your location list from scratch. This is where your Interoperability work pays off. Connect to the source of truth: pull the Master Asset List from your CMMS (Maintenance) or the Location Hierarchy from your operations system. Use that as your standardization dictionary.
Example Mapping File:
| Messy Term | Standardized ID |
|---|---|
| "Line 3" | L3-CONVEYOR |
| "L3" | L3-CONVEYOR |
| "Conveyor, Line 3" | L3-CONVEYOR |
| "Line 4" | L4-MIXER |
Once all variations map to the same ID (like `L3-CONVEYOR`), the AI can finally see the complete history of that asset across all your data.
Now you've given the AI the context it needs to find patterns. And because both EHS and Maintenance records are standardized to `L3-CONVEYOR`, you can ask high-value questions like: "How many of our 'Guarding' observations on this asset were followed by an 'Overdue PM' work order from the CMMS?" This is how you transform Safety Risk data into Asset Reliability intelligence: proving that EHS data is a leading indicator for maintenance failure and operational risk.
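A sketch of how that question might be answered once both systems share the ID, assuming two illustrative exports (`ehs_observations.csv` and `cmms_workorders.csv`) and a 30-day follow-up window; the file, column, and category names are placeholders:

```python
# Cross-system question: how many Guarding observations on L3-CONVEYOR
# were followed by an Overdue PM work order within 30 days?
import pandas as pd

obs = pd.read_csv("ehs_observations.csv", parse_dates=["date"])  # asset_id, date, category
wos = pd.read_csv("cmms_workorders.csv", parse_dates=["date"])   # asset_id, date, status

guarding = obs[(obs["asset_id"] == "L3-CONVEYOR") & (obs["category"] == "Guarding")]
overdue = wos[(wos["asset_id"] == "L3-CONVEYOR") & (wos["status"] == "Overdue PM")]

# Count guarding observations with an overdue PM in the following 30 days.
followed = sum(
    ((overdue["date"] > d) & (overdue["date"] <= d + pd.Timedelta(days=30))).any()
    for d in guarding["date"]
)
print(f"Guarding observations followed by an Overdue PM within 30 days: {followed}")
```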
After standardization, your next challenge is missing data. A single blank field (like a missing "Department" or "Location") can make an entire report useless for analysis, creating holes that break your models.
Don't guess at missing data. Repair it using context.
Remember the "Common Key" workshop from your interoperability work? This is where it pays off. You don't need to chase down a manager to fill in "Department" or "Tenure." Just pull it from HR using the data bridge you already built. It's a simple query: the system asks the HR bridge 'Who is Employee #7781?' and fills in the blanks automatically. This isn't just filling blanks. It's automated data repair that rescues thousands of records from the junk pile.
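A minimal sketch of that repair, assuming an `hr_extract.csv` keyed on `employee_id` (all file and column names are illustrative):

```python
# Automated repair via the HR bridge: pull Department and Tenure from
# the HR extract using the common key, filling only the blanks.
import pandas as pd

incidents = pd.read_csv("incidents.csv")  # has employee_id, department (with blanks)
hr = pd.read_csv("hr_extract.csv")        # employee_id, department, tenure_years

repaired = incidents.merge(
    hr[["employee_id", "department", "tenure_years"]],
    on="employee_id", how="left", suffixes=("", "_hr"),
)
# Fill blanks from HR, but never overwrite a value already on the report.
repaired["department"] = repaired["department"].fillna(repaired["department_hr"])
repaired = repaired.drop(columns=["department_hr"])

fixed = incidents["department"].isna().sum() - repaired["department"].isna().sum()
print(f"Departments repaired: {fixed}")
```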
This final cleansing strategy is non-negotiable. While the first two are about improving data, this one is about protecting your models from bad data.
In data science, this is called Choosing the Right Data (Feature Selection). The concept is simple: deliberately exclude biased or low-value fields from your analytical datasets.
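In practice this can be as blunt as a maintained exclusion list. The column names below are hypothetical examples of the kind of biased or low-value fields you might drop; your own list will differ:

```python
# Feature selection as a deliberate exclusion list (column names are
# hypothetical examples of biased or low-value fields).
import pandas as pd

EXCLUDE = [
    "reported_by",       # could teach the model to profile individuals
    "blame_assessment",  # subjective opinion, not an observation
    "free_text_notes",   # unstructured noise until it has been labeled
]

df = pd.read_csv("incidents_standardized.csv")
model_ready = df.drop(columns=[c for c in EXCLUDE if c in df.columns])
print(f"Columns excluded from training: {sorted(set(df.columns) - set(model_ready.columns))}")
```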
With your data governed (Section 1) and your historical records cleansed (Section 2), your foundation is solid. Now comes the final step: teaching your AI what to look for.
A common misconception is that modern Large Language Models (LLMs) make labeling obsolete because they "understand" text. They don't. They understand statistical patterns in generic data. Data Labeling is how you teach the machine your Organizational Dialect—the specific way your experts define risk at your sites.
This is the process where your EHS experts manually review a sample of narratives and notes (unstructured data) — like observation notes or inspection descriptions — and tag them with the standardized labels from your classification system.
The goal isn't just to predict accidents — it's to prescribe solutions. An AI that only predicts risk is just a fancy reporting tool. But an AI that learns which corrective actions actually work (and which ones fail) becomes a strategic asset. Data labeling is how you make that leap from prediction to prescription.
Don't focus only on incident reports. Your most valuable unstructured data is probably buried in safety observations, inspections, and risk assessments — your leading indicator data. This typically exists in two fields: the observation text describing the hazard that was seen, and the action text describing what was done about it.
You need to train the AI to understand both fields and learn the relationship between them.
Think of your labeled dataset as a textbook. You're creating a "gold standard" set of examples that shows the AI what to recognize. For instance: "When you see text like 'guard is missing,' classify that as 'Unsafe Condition: Guarding.' And when you see an action like 'install a hard-guard,' classify that as 'Engineering Control.'"
This labeling sprint is the first time your experts and your AI work together. This is the practical process of "Expert Supervision" — where you ensure the machine learns from your best practitioners, not just your largest datasets. You're not just tagging rows in a spreadsheet; you're codifying your organization's expertise into a format the machine can learn from.
Goal: Establish the scope of the sprint.
Action: Assemble a team of 3–5 experts. Your first task is to define a Dual Classification System: one set of labels for the hazards (e.g., "Unsafe Condition: Guarding") and a separate one for the actions taken (e.g., "Engineering Control").
Quality matters more than quantity here. Pull a random sample of 500 to 1,000 historical reports. This dataset will serve as the "textbook" your AI uses to learn.
Goal: Generate "gold standard" labels for the sample set.
Action: Have each expert independently label every report in the sample with both a Hazard and an Action classification. The most valuable part remains the disagreements. Bring the experts together to debate classification boundaries like "When an observation says 'operator removed guard to clear jam,' is this primarily a 'Machine Guarding' hazard or a 'Lockout/Tagout' failure?" These debates force your organization to define consistent classification rules. The AI learns from the consensus patterns that emerge.
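If you want a number on that disagreement, Cohen's kappa is a standard inter-rater agreement measure. A minimal sketch, with invented labels for illustration:

```python
# Measure where two experts disagree using Cohen's kappa (scikit-learn)
# on their hazard labels for the same sample of reports.
from sklearn.metrics import cohen_kappa_score

expert_a = ["Machine Guarding", "Lockout/Tagout", "Machine Guarding", "Housekeeping"]
expert_b = ["Machine Guarding", "Machine Guarding", "Machine Guarding", "Housekeeping"]

kappa = cohen_kappa_score(expert_a, expert_b)
print(f"Agreement (kappa): {kappa:.2f}")  # low scores flag categories to debate

# Surface the specific reports to bring to the consensus meeting.
disagreements = [i for i, (a, b) in enumerate(zip(expert_a, expert_b)) if a != b]
print(f"Reports needing debate: {disagreements}")
```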
Goal: Create the deliverable for your technical team.
Action: Once consensus is reached, create a three-column spreadsheet: (1) the original observation text, (2) the agreed-upon Hazard Category label, and (3) the Action Category label. This simple file becomes the "textbook" for your AI that reads text (the model that will eventually scan thousands of reports for you automatically).
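A sketch of what that deliverable and its first use might look like, with invented example rows; the classifier shown is a generic text-classification baseline for illustration, not a prescription for your technical team:

```python
# Build the three-column gold-standard file and show how a technical team
# might consume it (labels, file names, and example rows are illustrative).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

gold = pd.DataFrame({
    "observation_text": [
        "guard is missing on the infeed roller",
        "operator wedged the interlock open to clear a jam",
        "oil on floor near press 2",
    ],
    "hazard_label": ["Unsafe Condition: Guarding", "Unsafe Condition: Guarding",
                     "Unsafe Condition: Housekeeping"],
    "action_label": ["Engineering Control", "Administrative Control",
                     "Administrative Control"],
})
gold.to_csv("gold_standard_labels.csv", index=False)

# One classifier per label set; a real model needs the full 500-1,000 examples.
hazard_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
hazard_model.fit(gold["observation_text"], gold["hazard_label"])
print(hazard_model.predict(["missing guard on conveyor"]))
```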
You've now trained an AI to do something valuable: read future observations and classify both the hazards and the actions. But here's the thing. Every EHS professional already knows that engineering controls are better than administrative controls. You don't need AI to tell you that.
That's not the insight you're after. The real value comes from pattern recognition at scale. The AI can prove which actions are failing by finding the non-obvious correlations. For example: "The data shows that 'Guarding' hazards on the main line spike specifically on Tuesdays — the same day the Preventive Maintenance schedule pulls the lead technician away for audit prep."
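A finding like that can start from a very simple scan. A sketch, assuming the same illustrative observations export used above:

```python
# Simple pattern scan: count Guarding hazards by weekday. A Tuesday
# spike would show up here, ready to cross-check against the PM schedule.
import pandas as pd

obs = pd.read_csv("ehs_observations.csv", parse_dates=["date"])
guarding = obs[obs["category"] == "Guarding"]

by_weekday = guarding.groupby(guarding["date"].dt.day_name()).size()
print(by_weekday.sort_values(ascending=False))
```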
The AI isn't reciting theory. It's providing a data-driven business case for investing in more effective engineering solutions by proving exactly when and why the cheaper options are failing.
Building your EHS Data Refinery is the most intensive part of becoming AI-ready, but it's also the most valuable. This work — governance, cleansing, and labeling — is not a one-time cleanup; it is the new permanent infrastructure of your safety department.
Clean, standardized data gives you a reliable view of risk in your organization even before you run a single AI algorithm. That alone is worth the effort. It's also the only foundation that reliable predictive models can be built on.
Don't wait for AI software to arrive before starting this process. You can begin today with Excel and a team of SMEs. The organizations building their refinery now will be the ones predicting injuries next year.