November 20, 2025
The concept of raw EHS data as "crude oil" highlights a critical reality: it is a substance with immense potential, but one that is messy, unstructured, and unusable in its natural state. An AI cannot learn from data filled with inconsistencies, subjective opinions, and undefined terms.
Attempting to run an AI on raw data yields hallucinations: the AI confidently invents facts because it lacks a clean baseline. This erodes trust and wastes resources. The system won't just fail; it will actively mislead you, identifying risks that don't exist while missing the ones that do.
The solution is not the software itself, but what you do with your data. Buying AI-enabled software only creates a container; the intelligence comes from the refinement process. This non-negotiable industrial process — The Refinery — is the critical EHS function for transforming raw data into a standardized, usable asset. This refinery process consists of three core components: governance, cleansing, and labeling.
Data governance is a term that causes many leaders to tune out, imagining endless meetings and binders of rules.
To make this practical, we need to reframe it. Think of governance not as IT bureaucracy, but as Digital 5S. Just as you wouldn't tolerate a cluttered, unsafe shop floor where tools are missing or mislabeled, you cannot tolerate a cluttered, unsafe dataset.
The core of this Digital 5S approach is building an EHS Classification System (Taxonomy) — a rigid ruleset designed to eliminate confusion. Its entire purpose is to ensure that when two people report the same event, they use the same language.
The most effective way to implement this taxonomy is to redesign your data collection forms: remove free-text fields wherever possible and replace them with mandatory, multi-level dropdown menus. This is the practical setup of a standardized recording framework, such as the ESAW methodology.
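To make this concrete, here is a minimal sketch of what such a taxonomy might look like behind the form, expressed as a Python config. The category names are illustrative placeholders, not official ESAW codes; the point is that each field has its own controlled vocabulary, and the second dropdown level depends on the first.

```python
# Illustrative multi-level dropdown taxonomy (category names are examples,
# not official ESAW codes). Each field has its own controlled vocabulary,
# so event type, severity, and deviation are never mixed in one list.
TAXONOMY = {
    "event_type": ["Injury", "Near Miss", "Property Damage", "Environmental"],
    "severity": ["First Aid", "Medical Treatment (MTC)", "Lost Time (LTI)"],
    "deviation": {
        "Loss of Control": ["Of Machine", "Of Hand Tool", "Of Object"],
        "Slip/Trip/Fall": ["Fall on Same Level", "Fall from Height"],
    },
}

def dropdown_options(field, parent=None):
    """Return the allowed options for a form field; pass a parent
    category to get the second dropdown level."""
    level = TAXONOMY[field]
    return level[parent] if parent else list(level)

print(dropdown_options("deviation"))                    # top-level choices
print(dropdown_options("deviation", "Slip/Trip/Fall"))  # second level
```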
Most legacy EHS reporting systems share the same fundamental flaw: they mix different types of data together. A frontline manager opens the incident form and sees one dropdown containing event types (Injury, Near Miss), severity levels (LTI, MTC, First Aid), and vague root cause categories (Personnel Factors) all jumbled together. This creates confusion and forces flawed decisions.
Here's what happens when you mix these categories: the rare, high-consequence events you actually want to predict (like a Laceration) get buried by the hundreds of low-consequence events (like Near Misses). In data science, this is called Noise vs. Signal (Class Imbalance).
Think of this as trying to hear a whisper (the injury pattern) in a crowded stadium (the thousands of near misses). The AI will naturally listen to the loudest sound — the near misses — and stop looking for the rare triggers that actually lead to injuries. This is why AI models fail in EHS: the machine learns that predicting "Near Miss" is the safe guess 99% of the time. You'll never find the pattern you're looking for.
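A toy simulation makes the trap visible. In this sketch (all numbers are invented for illustration), a "model" that always guesses Near Miss scores 99% accuracy while detecting zero injuries:

```python
# Toy illustration of class imbalance: with 99% "Near Miss" records,
# always guessing the majority class looks 99% accurate while never
# detecting a single injury.
import random

random.seed(42)
events = ["Near Miss"] * 990 + ["Laceration"] * 10
random.shuffle(events)

predictions = ["Near Miss"] * len(events)  # the "safe guess" strategy

accuracy = sum(p == e for p, e in zip(predictions, events)) / len(events)
injuries_caught = sum(p == e == "Laceration" for p, e in zip(predictions, events))

print(f"Accuracy: {accuracy:.0%}")              # 99% -- looks impressive
print(f"Injuries detected: {injuries_caught}")  # 0 -- useless for prediction
```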
And it gets worse: fields like "Description" and "Immediate Cause" become text dumps for narratives and notes (unstructured data) where people mix what happened (the mechanical event) with why it happened (the systemic cause). The AI gets noise instead of structure.
The fix isn't just a prettier form—it's using a structured Classification System (Taxonomy) based on proven frameworks like ESAW (European), OIICS (North American), or TOOCS (Australian). These frameworks force clarity and objectivity into your data.
This structured workflow isn't just better for AI — it's faster and easier for the people filling out the forms. It's a guided process where nobody's guessing anymore. When you build good data governance into your form design, people actually use it. And this separation between what happened and why it happened? That's what unlocks diagnostic analytics — the difference between simply recording incidents and actually understanding their root causes.
Governance (Section 1) ensures your future data is clean. But that only stops the bleeding. It does nothing about the massive "Data Debt" (years of poorly structured historical data) sitting in your database right now.
This is where Data Cleansing comes in. This isn't just about tidying up; it's about converting that Data Debt into equity you can actually use.
Your historical data is a strategic asset, but right now it's frozen in a state of disrepair. Data cleansing is how you standardize, repair, and curate that data at scale so you can actually use it to train AI models. You can't build a reliable model on a broken foundation.
Standardization is your most critical cleansing strategy because it answers the two fundamental questions your AI needs answered: what is happening, and where it is happening. Without it, your data is just a pile of isolated reports, useless for pattern recognition, and your predictive models will fail.
The Problem: Your database contains "slip," "slipped on wet floor," "fall due to water," and "water hazard." An AI sees these as four different things. Ask it "How many slip/trip hazards did we have?" and it can't answer.
The Solution: Create a mapping file—a simple two-column spreadsheet. Column A lists the messy terms (e.g., "slipped on wet floor"). Column B has the clean, standardized term from your taxonomy (e.g., "Slip/Trip - Wet Surface"). Hand this to your IT team. They write a find-and-replace script, and what would've been a six-month manual cleanup becomes a ten-minute automated job.
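As a rough sketch of that script, assuming the mapping lives in a `term_mapping.csv` with `messy_term` and `standard_term` columns and your export is an `incidents.csv` with a `hazard_type` column (all file and column names here are illustrative):

```python
# Minimal sketch of the find-and-replace job (file and column names are
# illustrative). Reads the two-column mapping file and rewrites every
# messy term to its standardized taxonomy equivalent.
import pandas as pd

mapping = pd.read_csv("term_mapping.csv")   # columns: messy_term, standard_term
lookup = dict(zip(mapping["messy_term"].str.lower(), mapping["standard_term"]))

incidents = pd.read_csv("incidents.csv")    # column: hazard_type
clean = incidents["hazard_type"].str.lower().map(lookup)
incidents["hazard_type_std"] = clean.fillna(incidents["hazard_type"])

# Anything that didn't map is flagged so an expert can extend the file.
unmapped = incidents.loc[clean.isna(), "hazard_type"]
print(f"Unmapped terms needing review: {sorted(unmapped.unique())}")

incidents.to_csv("incidents_standardized.csv", index=False)
```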
The Problem: One report says "Line 3," another says "L3-Conveyor," a third says "Conveyor, Line 3." The AI sees three different places and can't connect the dots. This is the biggest barrier to predictive modeling.
The Solution: Don't build your location list from scratch. This is where your Interoperability work pays off. Connect to the source of truth: pull the Master Asset List from your CMMS (Maintenance) or the Location Hierarchy from your operations system. Use that as your standardization dictionary.
Example Mapping File:
| Messy Term | Standardized ID |
|---|---|
| "Line 3" | L3-CONVEYOR |
| "L3" | L3-CONVEYOR |
| "Conveyor, Line 3" | L3-CONVEYOR |
| "Line 4" | L4-MIXER |
Once all variations map to the same ID (like `L3-CONVEYOR`), the AI can finally see the complete history of that asset across all your data.
Now you've given the AI the context it needs to find patterns. And because both EHS and Maintenance records are standardized to `L3-CONVEYOR`, you can ask high-value questions like: "How many of our 'Guarding' observations on this asset were followed by an 'Overdue PM' work order from the CMMS?" This is how you transform Safety Risk data into Asset Reliability intelligence: proving that EHS data is a leading indicator for maintenance failure and operational risk.
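A sketch of how that question might be answered once both systems share the ID, assuming two illustrative exports (`ehs_observations.csv` and `cmms_workorders.csv`) and a 30-day follow-up window; the file, column, and category names are placeholders:

```python
# Cross-system question: how many Guarding observations on L3-CONVEYOR
# were followed by an Overdue PM work order within 30 days?
import pandas as pd

obs = pd.read_csv("ehs_observations.csv", parse_dates=["date"])  # asset_id, date, category
wos = pd.read_csv("cmms_workorders.csv", parse_dates=["date"])   # asset_id, date, status

guarding = obs[(obs["asset_id"] == "L3-CONVEYOR") & (obs["category"] == "Guarding")]
overdue = wos[(wos["asset_id"] == "L3-CONVEYOR") & (wos["status"] == "Overdue PM")]

# Count guarding observations with an overdue PM in the following 30 days.
followed = sum(
    ((overdue["date"] > d) & (overdue["date"] <= d + pd.Timedelta(days=30))).any()
    for d in guarding["date"]
)
print(f"Guarding observations followed by an Overdue PM within 30 days: {followed}")
```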
After standardization, your next challenge is missing data. A single blank field (like a missing "Department" or "Location") can make an entire report useless for analysis, creating holes that break your models.
Don't guess at missing data. Repair it using context.
Remember the "Common Key" workshop from your interoperability work? This is where it pays off. You don't need to chase down a manager to fill in "Department" or "Tenure." Just pull it from HR using the data bridge you already built. It's a simple query: the system asks the HR bridge 'Who is Employee #7781?' and fills in the blanks automatically. This isn't just filling blanks. It's automated data repair that rescues thousands of records from the junk pile.
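A minimal sketch of that repair, assuming an `hr_extract.csv` keyed on `employee_id` (all file and column names are illustrative):

```python
# Automated repair via the HR bridge: pull Department and Tenure from
# the HR extract using the common key, filling only the blanks.
import pandas as pd

incidents = pd.read_csv("incidents.csv")  # has employee_id, department (with blanks)
hr = pd.read_csv("hr_extract.csv")        # employee_id, department, tenure_years

repaired = incidents.merge(
    hr[["employee_id", "department", "tenure_years"]],
    on="employee_id", how="left", suffixes=("", "_hr"),
)
# Fill blanks from HR, but never overwrite a value already on the report.
repaired["department"] = repaired["department"].fillna(repaired["department_hr"])
repaired = repaired.drop(columns=["department_hr"])

fixed = incidents["department"].isna().sum() - repaired["department"].isna().sum()
print(f"Departments repaired: {fixed}")
```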
This final cleansing strategy is non-negotiable. While the first two are about improving data, this one is about protecting your models from bad data.
In data science, this is called Choosing the Right Data (Feature Selection). The concept is simple: deliberately exclude biased or low-value fields from your analytical datasets.
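In practice this can be as blunt as a maintained exclusion list. The column names below are hypothetical examples of the kind of biased or low-value fields you might drop; your own list will differ:

```python
# Feature selection as a deliberate exclusion list (column names are
# hypothetical examples of biased or low-value fields).
import pandas as pd

EXCLUDE = [
    "reported_by",       # could teach the model to profile individuals
    "blame_assessment",  # subjective opinion, not an observation
    "free_text_notes",   # unstructured noise until it has been labeled
]

df = pd.read_csv("incidents_standardized.csv")
model_ready = df.drop(columns=[c for c in EXCLUDE if c in df.columns])
print(f"Columns excluded from training: {sorted(set(df.columns) - set(model_ready.columns))}")
```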
With your data governed (Section 1) and your historical records cleansed (Section 2), your foundation is solid. Now comes the final step: teaching your AI what to look for.
A common misconception is that modern Large Language Models (LLMs) make labeling obsolete because they "understand" text. They don't. They understand statistical patterns in generic data. Data Labeling is how you teach the machine your Organizational Dialect—the specific way your experts define risk at your sites.
This is the process where your EHS experts manually review a sample of narratives and notes (unstructured data) — like observation notes or inspection descriptions — and tag them with the standardized labels from your classification system.
The goal isn't just to predict accidents — it's to prescribe solutions. An AI that only predicts risk is just a fancy reporting tool. But an AI that learns which corrective actions actually work (and which ones fail) becomes a strategic asset. Data labeling is how you make that leap from prediction to prescription.
Don't focus only on incident reports. Your most valuable unstructured data is probably buried in safety observations, inspections, and risk assessments — your leading indicator data. This typically exists in two fields: the observation text describing the hazard that was seen, and the action text describing what was done about it.
You need to train the AI to understand both fields and learn the relationship between them.
Think of your labeled dataset as a textbook. You're creating a "gold standard" set of examples that shows the AI what to recognize. For instance: "When you see text like 'guard is missing,' classify that as 'Unsafe Condition: Guarding.' And when you see an action like 'install a hard-guard,' classify that as 'Engineering Control.'"
This labeling sprint is the first time your experts and your AI work together. This is the practical process of "Expert Supervision" — where you ensure the machine learns from your best practitioners, not just your largest datasets. You're not just tagging rows in a spreadsheet; you're codifying your organization's expertise into a format the machine can learn from.
Goal: Establish the scope of the sprint.
Action: Assemble a team of 3–5 experts. Your first task is to define a Dual Classification System: one set of labels for the hazards (e.g., "Unsafe Condition: Guarding") and a separate one for the actions taken (e.g., "Engineering Control").
Quality matters more than quantity here. Pull a random sample of 500 to 1,000 historical reports. This dataset will serve as the "textbook" your AI uses to learn.
Goal: Generate "gold standard" labels for the sample set.
Action: Have each expert independently label every report in the sample with both a Hazard and an Action classification. The most valuable part remains the disagreements. Bring the experts together to debate classification boundaries like "When an observation says 'operator removed guard to clear jam,' is this primarily a 'Machine Guarding' hazard or a 'Lockout/Tagout' failure?" These debates force your organization to define consistent classification rules. The AI learns from the consensus patterns that emerge.
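If you want a number on that disagreement, Cohen's kappa is a standard inter-rater agreement measure. A minimal sketch, with invented labels for illustration:

```python
# Measure where two experts disagree using Cohen's kappa (scikit-learn)
# on their hazard labels for the same sample of reports.
from sklearn.metrics import cohen_kappa_score

expert_a = ["Machine Guarding", "Lockout/Tagout", "Machine Guarding", "Housekeeping"]
expert_b = ["Machine Guarding", "Machine Guarding", "Machine Guarding", "Housekeeping"]

kappa = cohen_kappa_score(expert_a, expert_b)
print(f"Agreement (kappa): {kappa:.2f}")  # low scores flag categories to debate

# Surface the specific reports to bring to the consensus meeting.
disagreements = [i for i, (a, b) in enumerate(zip(expert_a, expert_b)) if a != b]
print(f"Reports needing debate: {disagreements}")
```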
Goal: Create the deliverable for your technical team.
Action: Once consensus is reached, create a three-column spreadsheet: (1) the original observation text, (2) the agreed-upon Hazard Category label, and (3) the Action Category label. This simple file becomes the "textbook" for your AI that reads text (the model that will eventually scan thousands of reports for you automatically).
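A sketch of what that deliverable and its first use might look like, with invented example rows; the classifier shown is a generic text-classification baseline for illustration, not a prescription for your technical team:

```python
# Build the three-column gold-standard file and show how a technical team
# might consume it (labels, file names, and example rows are illustrative).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

gold = pd.DataFrame({
    "observation_text": [
        "guard is missing on the infeed roller",
        "operator wedged the interlock open to clear a jam",
        "oil on floor near press 2",
    ],
    "hazard_label": ["Unsafe Condition: Guarding", "Unsafe Condition: Guarding",
                     "Unsafe Condition: Housekeeping"],
    "action_label": ["Engineering Control", "Administrative Control",
                     "Administrative Control"],
})
gold.to_csv("gold_standard_labels.csv", index=False)

# One classifier per label set; a real model needs the full 500-1,000 examples.
hazard_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
hazard_model.fit(gold["observation_text"], gold["hazard_label"])
print(hazard_model.predict(["missing guard on conveyor"]))
```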
You've now trained an AI to do something valuable: read future observations and classify both the hazards and the actions. But here's the thing. Every EHS professional already knows that engineering controls are better than administrative controls. You don't need AI to tell you that.
That's not the insight you're after. The real value comes from pattern recognition at scale. The AI can prove which actions are failing by finding the non-obvious correlations. For example: "The data shows that 'Guarding' hazards on the main line spike specifically on Tuesdays — the same day the Preventive Maintenance schedule pulls the lead technician away for audit prep."
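A finding like that can start from a very simple scan. A sketch, assuming the same illustrative observations export used above:

```python
# Simple pattern scan: count Guarding hazards by weekday. A Tuesday
# spike would show up here, ready to cross-check against the PM schedule.
import pandas as pd

obs = pd.read_csv("ehs_observations.csv", parse_dates=["date"])
guarding = obs[obs["category"] == "Guarding"]

by_weekday = guarding.groupby(guarding["date"].dt.day_name()).size()
print(by_weekday.sort_values(ascending=False))
```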
The AI isn't reciting theory. It's providing a data-driven business case for investing in more effective engineering solutions by proving exactly when and why the cheaper options are failing.
Building your EHS Data Refinery is the most intensive part of becoming AI-ready, but it's also the most valuable. This work — governance, cleansing, and labeling — is not a one-time cleanup; it is the new permanent infrastructure of your safety department.
Clean, standardized data gives you a reliable view of risk in your organization even before you run a single AI algorithm. That alone is worth the effort. It's also the only foundation that reliable predictive models can be built on.
Don't wait for AI software to arrive before starting this process. You can begin today with Excel and a team of SMEs. The organizations building their refinery now will be the ones predicting injuries next year.