# Prompt Guardrails

## Statement
ClaimGuard's prompt-level controls for the production Gemini integration sit in three places: a path-confined prompt loader, so a misconfigured environment variable cannot cause the application to read an attacker-supplied prompt file from disk; a set of standing instructions baked into the prompt template (scope limits, anti-accusation language, "INSUFFICIENT_DATA" defaults); and a structured-output contract under which model responses are parsed as JSON rather than rendered as free-form narrative. Together, these give a successful prompt injection a limited blast radius.
The control is partial: the in-prompt instructions are real, reviewed at PR time, and now covered by an automated regression test, but there is no input-side prompt-injection sanitizer (claim narratives flow into the prompt as-is) and no red-team eval set that exercises the prompt against representative injection payloads in CI.
## Implementation

### Path-confined prompt loading

`tools/master_tool/master_api.py` (`_safe_prompt_path`, lines ~113-130) resolves every prompt-file path it consumes through a confining helper:

- Allowed roots: the `master_tool` source directory and the repo's `docs/` tree.
- The helper rejects anything that resolves outside those roots, raising a `PathEscapeError`.
- Three prompt-file paths are configurable via env (`MASTER_PROMPT_PATH`, `CLAIM_SUMMARY_PROMPT_PATH`, `EVIDENCE_COMPARISON_PROMPT_PATH`); all three pass through the same confining helper.
This closes the path-traversal vector where a misconfigured
deployment (or an env-injection bug) could cause the application to
load a prompt from /etc/, /home/, or any other location the OS
user can read.
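The confinement logic can be sketched as follows. This is a hypothetical reconstruction, not the actual helper: the real `_safe_prompt_path` in `master_api.py` may differ in naming and detail, and the sketch assumes Python 3.9+ for `Path.is_relative_to`:

```python
# Hypothetical sketch of a path-confining prompt loader. The real helper
# (_safe_prompt_path in tools/master_tool/master_api.py) may differ in detail.
from pathlib import Path


class PathEscapeError(Exception):
    """Raised when a prompt path resolves outside the allowed roots."""


def safe_prompt_path(candidate: str, allowed_roots: list[Path]) -> Path:
    """Resolve candidate (following symlinks) and require it under a root."""
    resolved = Path(candidate).resolve()
    for root in allowed_roots:
        if resolved.is_relative_to(root.resolve()):
            return resolved
    raise PathEscapeError(f"prompt path escapes allowed roots: {resolved}")
```

Because `resolve()` follows symlinks before the containment check, a symlink inside an allowed root that points at `/etc/` is rejected along with plain traversal paths.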
### Standing instructions in the master prompt

`tools/master_tool/master_prompt.txt` opens with explicit behavior-shaping rules. Concrete examples:
- No accusatory language. "Do NOT accuse the customer of fraud. Phrase findings as 'inconsistency', 'requires clarification', 'elevated risk', 'strong indicator', etc."
- Single-indicator caution. "A single indicator (especially a probabilistic model) is NOT conclusive. Escalate only when signals are strong or corroborated."
- Insufficient-data fallback. "If tools are missing/errored, say INSUFFICIENT_DATA rather than guessing."
- Scope limitation. "STRICT SCOPE LIMITATION: Do NOT evaluate insurance coverage alignment. Do NOT speculate on the visual subject matter of the image beyond what the tools explicitly detect. Restrict all `key_findings` and cross-checks EXCLUSIVELY to tool-grounded signals: AI detection, metadata anomalies, GPS location consistency, timestamp validity, image provenance, and software fingerprints."
- Source-attribution rules that prevent the model from leaking proprietary tool names (C2PA, manifest, dtect_image) into the customer-facing output.
These are reviewed by inspection at PR time; an edit that weakens them is supposed to be caught in code review (see Change management).
### Structured-output contract
The prompt instructs Gemini to return JSON in a defined shape, and the calling code parses it as JSON. A successful prompt-injection attempt that makes the model emit free-form text outside the JSON shape simply fails at parse time and is logged; it does not silently become user-visible output.
There is also a fallback injection for one specific field
(image_investigation_narrative): if Gemini's response is missing
this field, the application substitutes a deterministic placeholder.
This is documented in the source as an explicit safeguard against
"Gemini occasionally drops the field" rather than as a security
control, but it has the side effect of preventing a missing/empty
field from surprising downstream callers.
### Template substitution, not string concatenation

The prompt is built by `string.replace()` on `{{PLACEHOLDER}}` tokens rather than f-string concatenation.

- The placeholders are `{{CURRENT_DATETIME}}`, `{{POLICY_TYPE}}`, `{{POLICY_JSON}}`, `{{CLAIM_JSON}}`, `{{LAYOVER_RESPONSE_GUIDE_TEXT}}`, and `{{LAYOVER_RESPONSE_JSON}}`.
- All placeholder values are JSON-stringified through `safe_json_dumps` before substitution, which preserves the JSON-quoting boundary so injected content reads as a JSON string value rather than as prompt syntax.
This is not a complete defense against prompt injection (a string inside JSON can still contain text the model decides to follow), but it does mean a claim narrative containing `}}` or `{{` characters cannot break the template structure.
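A minimal sketch of the substitution scheme; here `safe_json_dumps` is approximated by the stdlib `json.dumps`, and the template string is a stand-in for the real prompt file:

```python
# Minimal sketch of {{PLACEHOLDER}} substitution with JSON-stringified
# values; safe_json_dumps is approximated by json.dumps here.
import json

TEMPLATE = "Claim data (JSON): {{CLAIM_JSON}}"


def build_prompt(template: str, values: dict[str, object]) -> str:
    """Replace each {{KEY}} token with the JSON-stringified value."""
    prompt = template
    for key, value in values.items():
        prompt = prompt.replace("{{" + key + "}}", json.dumps(value))
    return prompt
```

A narrative containing `{{` or `}}` lands inside a quoted JSON string, so it cannot open or close a template token; only the keys the caller explicitly supplies are ever substituted.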
### Operator-side review on prompt edits
Every change to a prompt file is reviewed at PR time. The reviewer is expected to:
- Confirm the "IMPORTANT PRINCIPLES" and "STRICT SCOPE LIMITATION" sections remain intact.
- Confirm the structured-output schema is preserved.
- Confirm no new placeholder consumes an unsanitized free-form value.
This is documented under Change management as a security-relevant change class.
### Automated regression test

`server/tests/security/master-prompt-guardrails.test.js` reads `tools/master_tool/master_prompt.txt` directly and asserts that the load-bearing fragments stay intact: the IMPORTANT PRINCIPLES header, the no-fraud-accusation rule, the single-indicator caution, the INSUFFICIENT_DATA fallback, the full STRICT SCOPE LIMITATION block (including the enumeration of tool-grounded signals), the AI DETECTION ATTRIBUTION RULES section with the C2PA / manifest non-disclosure rules, the CRITICAL — NO HALLUCINATED NUMBERS rule, and the `{{CURRENT_DATETIME}}` template placeholder. The test runs under the same vitest pipeline as the rest of the security suite (`npm test --prefix server` or `vitest run --dir tests/security`). A prompt edit that weakens any of these clauses will fail CI before merge.
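The actual test is JavaScript under vitest; the technique itself is just a fragment-presence check over the prompt file, sketched here language-agnostically in Python with an abbreviated, illustrative fragment list:

```python
# Sketch of the guardrail regression check. The real test lives in
# server/tests/security/master-prompt-guardrails.test.js; this fragment
# list is abbreviated and illustrative, not the test's actual contents.
from pathlib import Path

REQUIRED_FRAGMENTS = [
    "IMPORTANT PRINCIPLES",
    "STRICT SCOPE LIMITATION",
    "INSUFFICIENT_DATA",
    "{{CURRENT_DATETIME}}",
]


def missing_guardrails(prompt_path: Path) -> list[str]:
    """Return the load-bearing fragments absent from the prompt file."""
    text = prompt_path.read_text(encoding="utf-8")
    return [frag for frag in REQUIRED_FRAGMENTS if frag not in text]
```

The check is deliberately literal: exact-substring matching means a reworded-but-equivalent clause fails CI and forces a human decision, which is the desired behavior for a guardrail file.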
## Known gaps and what's not in place
- No input-side prompt-injection sanitizer. Claim narratives, evidence summaries, and operator-supplied policy fields flow into the prompt as-is via JSON-stringified template substitution. The defense against "user wrote 'Ignore all previous instructions'" relies entirely on the model's adherence to the standing instructions.
- No automated red-team eval set for prompt-injection robustness. The regression test above guards the prompt itself against silent weakening; it does not exercise the prompt against adversarial inputs.
- No per-org prompt customization framework, so any future customer-supplied prompt content is a feature that would need a ground-up sanitization design (not a small add).
- No content-filter or moderation layer between Gemini's output and the user. Gemini's own safety features apply; we do not stack an additional filter.
- Output is shown without extra escaping. If a model output contained HTML or script content, the React UI's default escaping is the only defense. A targeted XSS via a hostile prompt injection that lands in a rendered field would still need to defeat React's escaping.
## Status
partial — verified 2026-04-29.
What's in place:
- Path-confined prompt loading with explicit allowed roots.
- Standing instructions in the master prompt covering accusation language, single-indicator caution, INSUFFICIENT_DATA defaults, strict scope limitation, and proprietary-source attribution rules.
- Structured-output contract with JSON parsing on the calling side.
- JSON-stringified placeholder substitution (`{{...}}`) preserves template integrity against quote-character injection.
- PR review on every prompt-file edit, scoped to the items above.
- Automated prompt-edit regression test (`server/tests/security/master-prompt-guardrails.test.js`) that fails CI if a future edit silently weakens the load-bearing guardrail clauses listed above.
### Known gaps
Documented in detail above. Headline items:
- No input-side prompt-injection sanitizer.
- No automated red-team / eval set for injection robustness.
- No additional content-filter layer between Gemini output and the UI.
## Roadmap
- Red-team eval set of representative prompt-injection inputs embedded in claim narratives, run as a CI step before any prompt edit ships.
- Output sanitization of any free-form narrative field rendered in the UI, beyond React's default escaping, before AI-generated content lands in any HTML-context render path.
- Pin model version (cross-listed on AI transparency) so guardrail behavior is reproducible.