How Machine Learning Can Improve Workplace Incident Classification

May 29, 2025

In my previous articles, I discussed the ESAW methodology - a standardized system for classifying workplace safety incidents - and explored the role of diagnostic analytics in enhancing workplace safety. Accurate incident classification is the cornerstone of effective analysis and prevention strategies. Machine learning offers great potential to streamline and improve this critical task.

Machine Learning Incident Classification Triage System

Why Pattern Recognition Beats Keyword Matching

Most EHS software already includes basic automation—keyword-based "auto-tagging" or decision trees that categorize an incident based on specific words like "fall" or "chemical." However, these systems are brittle. They fail when the narrative is nuanced or when a worker uses non-standard terminology.
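To see the brittleness concretely, here is a minimal sketch of a keyword tagger. The rule table, category names, and the three narratives are invented for illustration and are not taken from any particular EHS product.

```python
# Hypothetical keyword rules, invented for illustration only.
KEYWORD_RULES = {
    "fall": "Slip/Trip/Fall",
    "chemical": "Chemical Exposure",
}

def keyword_tag(narrative: str) -> str:
    """Return the category of the first rule whose keyword appears in the text."""
    text = narrative.lower()
    for keyword, category in KEYWORD_RULES.items():
        if keyword in text:
            return category
    return "Unclassified"

reports = [
    # Contains the trigger word and really is a fall.
    "Worker had a fall from the second rung of a ladder.",
    # Describes a fall without ever using the word.
    "Worker lost footing on a wet ramp and landed on his knee.",
    # Contains the trigger word, but the injury is a strain.
    "Pallet was about to fall; worker strained back catching it.",
]

for report in reports:
    print(f"{keyword_tag(report):<16} <- {report}")
```

The second report goes unclassified because the worker never wrote "fall," and the third is tagged as a fall even though the injury is a back strain. Adding more keywords patches individual cases but does not remove the underlying fragility.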

Machine learning (ML) stops looking for specific "trigger words" and starts looking at patterns. Instead of following a strict rulebook, the system makes a calculated guess based on how thousands of similar incidents were handled in the past. This offers several structural advantages:

Contextual Pattern Recognition: Humans often miss the "hidden" signal in a narrative. For example, a report might be manually tagged as a "Slip/Trip," but an ML model can identify that the description of the uneven surface and the specific tool being carried matches a pattern of "Equipment Design Flaw" seen across other high-risk sites.
Consistent Logic: While a human manager’s classification can change based on their experience or the time of day, an ML model applies the same logic to every record. This makes errors systematic and, therefore, easier to audit and correct.
Scalable Triage: Automating the routine cases allows you to shift from manual data entry to data governance—focusing your expertise on the complex, high-potential incidents that the model flags as ambiguous.

The success of this approach depends on the quality of the training data and the choice of machine learning algorithm. When done correctly, machine learning is a powerful tool for improving both the accuracy and the speed of incident classification.

The Architecture of an Incident Classifier

Developing a classification model is a multi-stage process. While custom models were previously the only option, the industry is now shifting toward **Large Language Models (LLMs)**. These are systems, like the ones behind ChatGPT, that have been trained on vast amounts of text to understand human language in context. Even so, the classic workflow remains essential for understanding how the system learns:

Data Governance: The output is only as good as the field narrative. Cleaning data isn't just about fixing typos; it's about ensuring the training set reflects the operational reality of the site, not just the sanitized version of the report.
Feature Extraction: This is the process of identifying which parts of a report—timestamps, location, or specific phrases—carry the most predictive weight. It is the digital equivalent of an investigator deciding which clues matter most.
Model Selection: Choosing between a specialized classifier and a general-purpose LLM depends on your data privacy requirements and the complexity of your classification framework (e.g., OSHA vs. ESAW).
Continuous Validation: A model is never "finished." It requires ongoing auditing by EHS experts to ensure it hasn't drifted as site operations or reporting cultures change.
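
For readers who want to see the "specialized classifier" route at its smallest, the sketch below trains a bag-of-words Naive Bayes model in plain Python. It is not the model behind any tool discussed in this article. The six training narratives and the category names are invented, and a real system would train on thousands of reviewed records, typically with an established ML library rather than hand-written code.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Feature extraction at its simplest: lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)      # label -> number of reports
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict_proba(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.label_counts.values())
        log_scores = {}
        for label, n_docs in self.label_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # class prior
            for tok in tokens:
                count = self.word_counts[label][tok] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            log_scores[label] = score
        # Normalize the log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        exp_scores = {l: math.exp(s - top) for l, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {l: v / norm for l, v in exp_scores.items()}

    def predict(self, text):
        probs = self.predict_proba(text)
        return max(probs, key=probs.get)

# Invented toy training set; far too small for real use.
train_texts = [
    "worker slipped on wet floor and fell",
    "employee tripped over cable and fell down stairs",
    "hand caught in hydraulic press",
    "finger crushed between conveyor rollers",
    "inhaled solvent fumes while cleaning tank",
    "acid splashed on arm during transfer",
]
train_labels = [
    "Slip/Trip/Fall", "Slip/Trip/Fall",
    "Caught In/Between", "Caught In/Between",
    "Chemical Exposure", "Chemical Exposure",
]

model = NaiveBayesClassifier().fit(train_texts, train_labels)
print(model.predict("hand caught in press"))
print(model.predict_proba("hand caught in press"))
```

The probabilities returned by `predict_proba` are what a triage workflow would later treat as the model's confidence. On a training set this small they are only illustrative.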

Introducing the Incident Classification Tool

To demonstrate the practical application of this process, I've developed an incident classification tool that uses machine learning to automate and improve incident classification. It can be accessed at https://incident-classification-tool.streamlit.app

This tool was trained using a publicly available dataset of OSHA accident and injury data from Kaggle (https://www.kaggle.com/datasets/ruqaiyaship/osha-accident-and-injury-data-1517/data). This dataset contains detailed information about workplace incidents, including the nature of the injury, the part of the body affected, the event type, and the environmental factors involved.

How the Tool Works

The tool simplifies incident classification into a few key steps:

Data Input: Users input a textual description of the incident.
Preprocessing: The tool prepares the text for analysis, removing irrelevant information and extracting key features.
Classification: A trained machine learning model analyzes the processed text and predicts the appropriate incident category.

Example Scenario

To illustrate how this works in practice, imagine an incident report that describes an employee's hand being caught in a hydraulic press. The tool would analyze this description, considering attributes such as the environmental factor (pinch point), the nature of the injury (laceration), the body part affected (hand), and the event type (caught in or between). Based on this analysis, it would classify the incident according to the chosen classification system.
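
The three steps (input, preprocessing, classification) can be sketched for this scenario as follows. This is an assumption-laden illustration, not the tool's actual code: the stopword list, the `preprocess` and `classify_incident` helpers, and the `stand_in_model` placeholder are all hypothetical, and the placeholder is a simple rule standing in for a trained model.

```python
import re

# Illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "and", "was", "were", "with", "his", "her", "while"}

def preprocess(narrative: str) -> list:
    """Lowercase, drop digits and punctuation, remove short and filler words."""
    text = narrative.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    return [t for t in text.split() if len(t) > 2 and t not in STOPWORDS]

def stand_in_model(tokens: list) -> str:
    """Placeholder for a trained classifier; NOT a real ML model."""
    if "caught" in tokens or "crushed" in tokens:
        return "Caught In/Between"
    return "Needs review"

def classify_incident(narrative: str, model=stand_in_model) -> str:
    """Run the two automated steps: preprocessing, then classification."""
    tokens = preprocess(narrative)
    return model(tokens)

# Step 1: data input (free-text narrative from the report).
narrative = "The employee's hand was caught in a hydraulic press at 14:30."

# Steps 2 and 3: preprocessing and classification.
print(preprocess(narrative))
print(classify_incident(narrative))
```

Because `classify_incident` accepts any callable as `model`, the placeholder can be swapped for a trained classifier without changing the rest of the flow.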

The Triage Strategy: Dealing with Accuracy

It is a mistake to view machine learning as a binary success or failure. What makes a model usable in practice is its **Confidence Score**: a percentage that tells you how sure the model is about each classification, the digital equivalent of a "check engine" light. Confidence is not the same as accuracy, but the two are related. A model might achieve 90% accuracy on high-volume categories like "Trips and Falls," yet drop to 40% for rare, complex events, and a well-calibrated model will report lower confidence on exactly those events.

In a mature EHS data architecture, this 40% isn't a failure—it's a **triage signal**. Instead of a human reading 1,000 reports to find the 5 critical ones, the ML model handles the 600 routine cases with high confidence and flags the remaining 400 for expert review. This ensures your limited time is spent precisely where the data is most ambiguous.
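
A minimal sketch of that routing logic, assuming each prediction arrives with a confidence between 0 and 1. The `Prediction` class, the sample records, and the 0.8 threshold are hypothetical; in practice the threshold should be set by checking model output against expert-reviewed cases.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    report_id: str
    category: str
    confidence: float  # model's confidence, from 0.0 to 1.0

def triage(predictions, threshold=0.8):
    """Split predictions into auto-accepted and expert-review queues."""
    auto_accepted, needs_review = [], []
    for p in predictions:
        if p.confidence >= threshold:
            auto_accepted.append(p)
        else:
            needs_review.append(p)
    # Put the least certain reports at the top of the review queue.
    needs_review.sort(key=lambda p: p.confidence)
    return auto_accepted, needs_review

# Invented sample output from a classifier.
predictions = [
    Prediction("R1", "Slip/Trip/Fall", 0.93),
    Prediction("R2", "Caught In/Between", 0.41),
    Prediction("R3", "Chemical Exposure", 0.78),
    Prediction("R4", "Slip/Trip/Fall", 0.88),
]

auto_accepted, needs_review = triage(predictions, threshold=0.8)
print("Auto-accepted:", [p.report_id for p in auto_accepted])
print("Expert review:", [p.report_id for p in needs_review])
```

Raising the threshold sends more reports to experts and lowers the risk of an unreviewed misclassification. Lowering it does the opposite. The right setting depends on how costly a missed high-potential incident is for your organization.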

This highlights the inherent challenge: ML does not fix "Garbage In, Garbage Out." If a field worker cannot complete a detailed narrative while wearing cut-resistant gloves in the rain, the model will have no signal to process. The focus must remain on the quality of the field-level reporting system that feeds the model.

From Data Entry to Data Governance

By shifting to an ML-driven triage system, safety professionals move up the value chain. This transition offers three structural benefits:

Targeted Oversight: Automation handles the high-volume "noise," allowing you to focus on the high-potential "signals" that often get buried in the backlog.
Auditable Decisions: Because the model is consistent, you can audit the *logic* of the classification across 10,000 records simultaneously, rather than checking individual rows.
Proactive Risk Signal: Consistent classification is the prerequisite for predictive analytics. You cannot predict the next incident if your historical data is categorized differently by three different managers.
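
Auditing the model's logic at scale can be as simple as comparing its labels against an expert-reviewed sample and counting where the two disagree. The sketch below assumes such a sample exists as (model label, expert label) pairs; the data and function names are invented for illustration.

```python
from collections import Counter

def per_category_accuracy(pairs):
    """Share of expert-labeled cases in each category that the model matched."""
    totals, correct = Counter(), Counter()
    for model_label, expert_label in pairs:
        totals[expert_label] += 1
        if model_label == expert_label:
            correct[expert_label] += 1
    return {category: correct[category] / totals[category] for category in totals}

def confusion_counts(pairs):
    """Count each (model label, expert label) disagreement."""
    return Counter((m, e) for m, e in pairs if m != e)

# Invented audit sample: (model label, expert label).
audited = [
    ("Slip/Trip/Fall", "Slip/Trip/Fall"),
    ("Slip/Trip/Fall", "Slip/Trip/Fall"),
    ("Slip/Trip/Fall", "Equipment Design Flaw"),
    ("Slip/Trip/Fall", "Equipment Design Flaw"),
    ("Equipment Design Flaw", "Equipment Design Flaw"),
    ("Caught In/Between", "Caught In/Between"),
]

accuracy = per_category_accuracy(audited)
confusions = confusion_counts(audited)

for category, value in accuracy.items():
    print(f"{category}: {value:.0%}")
print("Most common disagreement:", confusions.most_common(1))
```

In this invented sample the model repeatedly labels equipment design flaws as slips and trips. A systematic pattern like that can be corrected once, at the model or training-data level, instead of row by row.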

Conclusion

By prioritizing data standardization and using machine learning to filter the noise, we can create safer and healthier workplaces. This technology doesn't replace the safety professional; it moves you from the filing cabinet to the decision table.