Emergency Detection in Veterinary AI: Our Safety-First Approach

When we started designing the emergency detection system, we had one non-negotiable requirement: it had to be more reliable than a single model inference. A language model that occasionally hallucinates drug names or invents a prognosis is annoying. A model that misses a gastric dilatation-volvulus (GDV) or flags a routine upset stomach as life-threatening is dangerous.

Here's how we built a multi-layer guardrail system that makes the emergency flag trustworthy enough to act on in a clinical setting.

Layer 1: Rule-based clinical markers

Before any AI inference, we scan the submitted input data for objective clinical emergency markers — terms that are unambiguously associated with life-threatening presentations regardless of context:

Cardiorespiratory: open-mouth breathing, cyanosis, SpO₂ readings, respiratory arrest, cardiac arrest
Neurological: loss of consciousness, status epilepticus, cluster seizures, coma, unresponsive
Haematological/GI: active haemorrhage, GDV, gastric dilatation, bloat, torsion
Systemic: anaphylaxis, anaphylactic shock, spinal cord compromise, paralysis/paresis

This rule-based check runs in microseconds and adds no inference cost. Its purpose is not to replace the model's assessment but to gate the emergency flag. If none of these markers appear in the input, the model cannot legitimately flag an emergency: any emergency_flag=true from the model without objective input markers is treated as a hallucination and stripped.

This catches the most common failure mode: the model pattern-matching on alarming but non-clinical language in the input (e.g., "the owner is very worried") and generating an unwarranted emergency flag.
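
The gating logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the production code: the names EMERGENCY_MARKERS, find_markers and gate_emergency_flag are hypothetical, the marker list is only a subset of the terms listed above, and SpO₂ readings are left out because this post does not say how a reading is evaluated.

```python
import re

# Illustrative subset of the markers listed above (assumption: the real
# lexicon is larger and maintained separately).
EMERGENCY_MARKERS = [
    "open-mouth breathing", "cyanosis", "respiratory arrest", "cardiac arrest",
    "loss of consciousness", "status epilepticus", "cluster seizures", "coma",
    "unresponsive", "active haemorrhage", "gdv", "gastric dilatation", "bloat",
    "torsion", "anaphylaxis", "anaphylactic shock", "paralysis", "paresis",
]

# A leading word boundary keeps "coma" from matching inside "glaucoma",
# while still matching inflected forms such as "bloated".
_MARKER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in EMERGENCY_MARKERS) + r")",
    re.IGNORECASE,
)


def find_markers(input_text: str) -> list[str]:
    """Return the distinct markers found in the submitted input, lowercased."""
    return sorted({m.group(0).lower() for m in _MARKER_RE.finditer(input_text)})


def gate_emergency_flag(input_text: str, model_flag: bool) -> bool:
    """Keep the model's flag only if the input contains an objective marker."""
    return model_flag and bool(find_markers(input_text))
```

With this sketch, gate_emergency_flag("Owner is very worried, dog vomited once", True) returns False, while the same call on an input mentioning "suspected GDV" keeps the flag. Plain keyword matching of this kind does not handle negation ("no cyanosis"), so a real lexicon would need more than a regex.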

Layer 2: Hallucination signal detection

We also scan the model's own output for trigger words — terms that indicate the model may be generating urgency language from its training distribution rather than from the actual clinical input:

"immediately", "urgently", "do not wait", "go now"
"critical", "life-threatening", "imminent death"
"euthanasia recommended" (a phrase that occasionally appears in training data without context)

If these terms appear in the model's summary or recommendations but no objective clinical markers were present in the input, we downgrade the confidence level to "low" and flag the response for review. We don't suppress the output — the veterinarian still sees it — but the confidence signal tells them to treat it critically.
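
As a rough sketch of this check, assuming a simple case-insensitive substring match: the names URGENCY_TRIGGERS, Review and review_output are hypothetical, and the input_has_markers argument stands in for the result of the input scan described earlier.

```python
from dataclasses import dataclass

# Trigger phrases taken from the list above.
URGENCY_TRIGGERS = (
    "immediately", "urgently", "do not wait", "go now",
    "critical", "life-threatening", "imminent death",
    "euthanasia recommended",
)


@dataclass
class Review:
    confidence: str       # confidence level shown to the veterinarian
    needs_review: bool    # True when the response is flagged for review
    triggers: list[str]   # trigger phrases found in the model output


def review_output(model_output: str, input_has_markers: bool,
                  confidence: str) -> Review:
    """Downgrade confidence when urgency language has no support in the input."""
    text = model_output.lower()
    hits = [t for t in URGENCY_TRIGGERS if t in text]
    if hits and not input_has_markers:
        # The output itself is left untouched; only the confidence signal changes.
        return Review("low", True, hits)
    return Review(confidence, False, hits)
```

For example, review_output("Take the dog to a clinic immediately.", False, "high") yields a "low" confidence and a review flag, whereas the same text with input_has_markers=True keeps its original confidence. Substring matching also catches variants such as "critically", which may or may not be desirable.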

Layer 3: Self-consistency sampling

Even with layers 1 and 2, a single model inference can be inconsistent: run the same prompt twice and you might get emergency_flag=true one time and false the next. For a flag this consequential, one sample isn't enough.

When an emergency flag survives the first two layers, we run additional inference samples on the same prompt (currently N=2 extra samples, configurable). We then require a supermajority (≥60% agreement) before keeping the flag. If agreement falls short of that threshold, the flag is stripped and the confidence level is downgraded to "moderate": the output still indicates a serious case, just not a confirmed emergency.

This adds latency (each extra sample is ~10 seconds), but only for the cases where the emergency flag survived the first two layers, which is a small fraction of all queries. For routine consultations, there's no self-consistency overhead.
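
The voting step might look like the sketch below. It assumes the original inference counts as one "yes" vote alongside the extra samples, which this post does not state explicitly. The function name confirm_emergency and the sample_flag callback (one extra inference returning the model's emergency flag) are hypothetical.

```python
from typing import Callable


def confirm_emergency(
    sample_flag: Callable[[], bool],
    extra_samples: int = 2,
    threshold: float = 0.6,
) -> tuple[bool, float]:
    """Return (keep_flag, agreement) after drawing extra samples.

    Only called when the first inference already returned
    emergency_flag=true, so that first result is counted as a vote
    (assumption).
    """
    votes = [True] + [bool(sample_flag()) for _ in range(extra_samples)]
    agreement = sum(votes) / len(votes)
    return agreement >= threshold, agreement
```

Under that assumption and the default settings there are three votes in total, so the 60% threshold is met when at least two of three agree (2/3 ≈ 67%) and missed when only the original sample says "emergency" (1/3 ≈ 33%). If only the two extra samples were counted, both would have to agree, since one of two is 50%.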

What we don't do

TrackerAI does not and cannot:

Replace clinical examination. The model works from submitted text data — it cannot palpate an abdomen or listen to heart sounds.
Override veterinary judgement. Every response includes a mandatory disclaimer to this effect.
Provide definitive diagnoses. All output is presented as differentials with associated likelihoods, not confirmed diagnoses.

The goal of the emergency detection system is to help a clinician triage faster, not to make the triage decision for them. When in doubt, the system is designed to err toward caution: a downgraded confidence level with a note about consistency disagreement tells the clinician to look harder, not to dismiss the case.

Ongoing calibration

We continuously monitor emergency flag precision and recall across our production traffic (in aggregate, with no patient data stored beyond what's necessary for the service). If either metric drifts below our thresholds, we recalibrate the guardrail thresholds before shipping any model update.
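
For readers less familiar with the two metrics, here is a small sketch of how such a drift check could be computed from aggregate counts. The function names and the min_precision and min_recall parameters are hypothetical; the actual thresholds are not published in this post.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall of the emergency flag.

    tp: flagged cases that were real emergencies
    fp: flagged cases that were not emergencies
    fn: real emergencies that were not flagged
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def needs_recalibration(tp: int, fp: int, fn: int,
                        min_precision: float, min_recall: float) -> bool:
    """True when either metric has fallen below its (hypothetical) threshold."""
    precision, recall = precision_recall(tp, fp, fn)
    return precision < min_precision or recall < min_recall
```

Precision answers "when the flag fires, how often is it right?", and recall answers "of the true emergencies, how many were flagged?". Both matter here because the two failure directions, false alarms and missed emergencies, carry different clinical costs.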

If you're a veterinary professional and you encounter a case where the emergency flag seemed wrong — in either direction — please report it via hello@trackerai.ai. These reports directly improve the calibration dataset.