
Build a safety incident reviewer

An agent that watches an incident feed and flags the patterns that tend to precede a disaster.

Who deploys this

A safety team, a regulator, or an insurer. A signed reviewer leaves an accountability record nobody can quietly edit. This reference build uses the 737 MAX reports; the same shape works for medical devices, recalls, or industrial incidents.

The failure it’s built to catch

The preliminary report on the October 2018 Lion Air crash named 'uncommanded nose-down trim inputs.' The Ethiopian preliminary report the following spring named the same pattern. The FAA grounded the 737 MAX three days after the second crash. An agent reading the NTSB feed in late 2018 and flagging the MCAS pattern would have been on the right side of 346 deaths.

Design decisions

Each item below maps to a specific choice in the workspace. The workspace is the deployable artifact; this section explains why the choices are what they are.

Preliminary reports, not finalised ADs

An airworthiness directive records a failure that already happened. A preliminary incident report captures a failure still being characterised. The agent reads preliminaries because that's where the pattern shows up first. Flagging the Lion Air narrative in late 2018 sits upstream of the grounding order the FAA finally issued in March 2019.
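As a rough illustration of that choice, the sketch below picks the newest report that is still open. It assumes the feed has already been parsed into a list of dicts keyed by the field names listed in THESEUS.md (`ReportType`, `EventDate`, `Make`, `Model`, `EventNarrative`), that `EventDate` is an ISO 8601 string, and that 80 characters is a fair cutoff for a non-trivial narrative. `pick_report` and both constants are hypothetical names, not part of the workspace.

```python
from datetime import date

# Field names come from THESEUS.md; everything else here is an assumption.
OPEN_STATUSES = {"Preliminary", "Factual"}  # anything else (e.g. "Final") is skipped
MIN_NARRATIVE_CHARS = 80                    # hypothetical cutoff for "non-trivial"


def pick_report(results, family):
    """Return the most recent open report whose Make/Model mentions `family`, or None."""
    family = family.lower()
    candidates = []
    for r in results:
        make_model = f"{r.get('Make', '')} {r.get('Model', '')}".lower()
        narrative = (r.get("EventNarrative") or "").strip()
        if family not in make_model:
            continue  # different aircraft family
        if r.get("ReportType") not in OPEN_STATUSES:
            continue  # already finalised
        if len(narrative) < MIN_NARRATIVE_CHARS:
            continue  # too little text to review
        candidates.append(r)
    # Assumes EventDate is ISO 8601, so its first ten characters parse as a date.
    return max(
        candidates,
        key=lambda r: date.fromisoformat(r["EventDate"][:10]),
        default=None,
    )
```

In the workspace itself the model performs this selection from the prompt; the function only makes the rule explicit enough to test.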

Five triggers, not three, not seven

Each trigger covers a different shape of failure: automation override (MCAS), fuel or battery anomaly (787), incident cluster (the canary for systemic issues), FCOM contradiction (manual is wrong or system is wrong), and engine-out without service bulletin (LEAP fan blades). Three would miss real shapes; seven would dilute focus.
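Of the five, the incident cluster is the one trigger that needs more than a single narrative. A minimal sketch, assuming reports are dicts with `Make`, `Model`, and an ISO 8601 `EventDate`, that six months can be approximated as 183 days, and that two earlier look-alikes are enough to call a cluster. The source fixes neither the threshold nor what counts as "identical", so `MIN_PRIOR` and the caller-supplied `same_kind` predicate are hypothetical.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=183)  # "trailing 6 months", approximated in days
MIN_PRIOR = 2                 # hypothetical: earlier look-alikes needed for a cluster


def is_cluster(report, history, same_kind):
    """True if enough earlier same-Make/Model reports fall inside the window.

    `same_kind(a, b)` decides whether two reports describe the same kind of
    incident; the source does not define that comparison.
    """
    when = date.fromisoformat(report["EventDate"][:10])
    prior = 0
    for other in history:
        if other is report:
            continue  # do not count the report against itself
        if (other["Make"], other["Model"]) != (report["Make"], report["Model"]):
            continue
        gap = when - date.fromisoformat(other["EventDate"][:10])
        # Only earlier (or same-day) reports inside the trailing window count.
        if timedelta(0) <= gap <= WINDOW and same_kind(report, other):
            prior += 1
    return prior >= MIN_PRIOR
```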

Bias toward FLAG when the narrative is ambiguous

CLEAR is the routine call. Most reports are crew technique or weather. FLAG is the rare verdict where the cost of being wrong is a hull loss. The reviewer is meant to be the second pair of eyes; the OEM has every commercial reason to call the report 'crew issue.' When trigger words appear and the framing is ambiguous, the reviewer flags.
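One way to picture the bias is as a three-valued match per trigger, where the ambiguous case resolves to FLAG rather than CLEAR. The `Match` enum and `verdict` function below are illustrative names only; in the workspace the model makes this call from the prompt and the skill, not from code.

```python
from enum import Enum


class Match(Enum):
    NO = 0         # narrative clearly lacks the pattern
    AMBIGUOUS = 1  # trigger words appear, framing is unclear
    YES = 2        # narrative clearly describes the pattern


def verdict(matches):
    """Map per-trigger match levels to 'FLAG' or 'CLEAR', resolving doubt toward FLAG."""
    # Anything other than a clean NO on every trigger is enough to flag.
    return "FLAG" if any(m is not Match.NO for m in matches.values()) else "CLEAR"
```

Treating `AMBIGUOUS` like `YES` is the whole design choice: a neutral reviewer would need a tie-break rule, and this one breaks every tie toward the second look.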

The cautionary tale lives in the skill body

The skill walks the Lion Air report and the FAA's response day by day. When the skill activates, the model has the precedent in working memory and applies the trigger table with the right priors. The history is what teaches the agent what to look for; the table is just the index.

The four-file workspace

This is what the runtime compiles. Copy it into a fresh playground project (or a sibling directory in your CLI workspace), then deploy. Each tab is one file. The agent.rs file is the generic adapter; it’s byte-identical across every reference agent.

THESEUS.md
---
name: Aviation Safety Reviewer
id: aviation-v1
model: claude-sonnet-4-6
---

You are the Aviation Safety Reviewer. The user names an aircraft
family, an incident date, or asks for the latest. Your job: ONE
`fetch_url` call to the NTSB Aviation Investigation Search, then one
`FLAG` or `CLEAR` verdict. Do not narrate.

## Why NTSB preliminary reports, not FAA ADs

ADs are mandatory fixes the FAA already issued. The interesting
question is upstream of that: was a failure mode visible in the
incident record before the certification got rubber-stamped? The
737 MAX MCAS preliminary reports (Lion Air, Oct 2018; later
Ethiopian, Mar 2019) named "uncommanded nose-down trim" before the
FAA grounded the type. A signed agent watching the NTSB feed and
flagging that pattern in late 2018 would have been on the right
side of 346 deaths.

## Endpoint (use this exact URL)

```
https://data.ntsb.gov/carol-main-public/api/Query/Main?ResultSetSize=10&QueryGroups=%5B%7B%22Operator%22:%22AND%22,%22Filters%22:%5B%7B%22FieldName%22:%22Mode%22,%22Operator%22:%22is%22,%22Values%22:%5B%22Aviation%22%5D%7D%5D%7D%5D
```

The response has `Results[]` with `NtsbNo`, `ReportType`,
`EventDate`, `City`, `State`, `Country`, `Make`, `Model`,
`HighestInjuryLevel`, `ProbableCause`, `EventNarrative`. Filter by
`Make`/`Model` matching the user's named aircraft family. Pick the
most recent that's still in `Preliminary` or `Factual` status (not
`Final`) and has a non-trivial narrative.

## Flag triggers (each tied to a real failure pattern)

`FLAG` if the narrative contains any of:

- Uncommanded control input or automation override (MCAS shape,
  737 MAX 2018-2019).
- Fuel-system or battery anomaly the AD record does not address
  (Boeing 787 battery, 2013).
- Repeated identical incidents in the trailing 6 months on the same
  Make/Model (cluster shape; canary for systemic issue).
- Pilot reports of system behavior contradicting the FCOM (manual)
  description.
- Engine-out or thrust-loss anomaly with no published service
  bulletin from the OEM.

If none match, `CLEAR` with the narrative summary.

## Output rule (absolute)

Your entire response is the verdict block and nothing else. First
character is `F` or `C`. No preamble. No procedure narration. No
code fences. Any character outside the block is a discipline failure.

## Output format (strictly one of)

```
FLAG · <Make> <Model> · NTSB <NtsbNo> · <EventDate>
trigger: <one of the trigger patterns above>
narrative: <≤120-char excerpt>
```

```
CLEAR · <Make> <Model> · NTSB <NtsbNo> · <EventDate>
narrative: <≤120-char excerpt> · no trigger pattern matched
```

The `independent-second-opinion` skill carries the trigger patterns
and the bias-toward-FLAG discipline. The cost of a wrong CLEAR is a
hull loss; the cost of a wrong FLAG is a regulatory letter.
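Two parts of the file above are mechanical enough to check outside the agent: the endpoint URL and the verdict block. The sketch below rebuilds the URL from a filter list, which helps when you want to change the filter without hand-editing percent escapes, and validates a response against the two output formats. `build_url`, `is_valid_verdict`, and the regular expression are hypothetical helpers; the filter shape is read off the URL in the file, not taken from any NTSB documentation.

```python
import json
import re
from urllib.parse import quote

BASE = "https://data.ntsb.gov/carol-main-public/api/Query/Main"
AVIATION = {"FieldName": "Mode", "Operator": "is", "Values": ["Aviation"]}


def build_url(filters, size=10):
    """Rebuild the endpoint URL from a list of filter dicts."""
    groups = [{"Operator": "AND", "Filters": filters}]
    # Compact JSON, with ':' and ',' left unescaped to match the URL in the file.
    encoded = quote(json.dumps(groups, separators=(",", ":")), safe=":,")
    return f"{BASE}?ResultSetSize={size}&QueryGroups={encoded}"


# Header line shared by both verdict formats.
HEADER = re.compile(r"^(FLAG|CLEAR) · .+ · NTSB \S+ · .+$")
PREFIX = "narrative: "
SUFFIX = " · no trigger pattern matched"


def is_valid_verdict(text):
    """Check a response against the two output formats in THESEUS.md."""
    lines = text.split("\n")
    m = HEADER.match(lines[0])
    if not m:
        return False  # preamble, code fence, or malformed header
    if m.group(1) == "FLAG":
        return (
            len(lines) == 3
            and lines[1].startswith("trigger: ")
            and lines[2].startswith(PREFIX)
            and len(lines[2]) - len(PREFIX) <= 120
        )
    return (
        len(lines) == 2
        and lines[1].startswith(PREFIX)
        and lines[1].endswith(SUFFIX)
        and len(lines[1]) - len(PREFIX) - len(SUFFIX) <= 120
    )
```

Calling `build_url([AVIATION])` reproduces the URL in the file character for character, so a fork can append further filter dicts and keep the same encoding.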

Variations

Three directions you might push this shape in. Same file model, different thresholds or data sources.

  • Add AAIB (UK) and BFU (Germany) feeds for non-US incidents.
  • Pair with an OEM service-bulletin tracker so the agent knows when a fix has already been issued for a flagged pattern.
  • Re-aim at medical devices (FDA MAUDE), automotive (NHTSA recalls), or nuclear incidents. The pattern matching is the same; the trigger list changes.

Deploying your fork

The same four files compile via the in-browser playground or the CLI. The playground is the five-minute path. The CLI is the right path if you’re scripting deploys.


See the deployed reference agent end to end (signed credential, recent run grade, the four files inline) at /poa. Try it live at demo-agents.theseus.network/aviation.
