Prompt Guardrails

Statement

ClaimGuard's prompt-level controls for the production Gemini integration sit in three places: a path-confined loader for on-disk prompt files, so a misconfigured environment variable cannot point the application at an attacker-supplied prompt file; a set of standing instructions baked into the prompt template (scope limits, anti-accusation language, "INSUFFICIENT_DATA" defaults); and a structured-output contract under which model responses are parsed as JSON rather than rendered as free-form narrative. Together, these give a successful prompt injection a limited blast radius.

The control is partial: the in-prompt instructions are real, reviewed at PR time, and now covered by an automated regression test, but there is no input-side prompt-injection sanitizer (claim narratives flow into the prompt as-is) and no red-team eval set that exercises the prompt against representative injection payloads in CI.

Implementation

Path-confined prompt loading

tools/master_tool/master_api.py (_safe_prompt_path, lines ~113-130) resolves every prompt-file path it consumes through a confining helper:

  • Allowed roots: the master_tool source directory and the repo's docs/ tree.
  • The helper rejects anything that resolves outside those roots, raising a PathEscapeError.
  • Three prompt-file paths are configurable via env (MASTER_PROMPT_PATH, CLAIM_SUMMARY_PROMPT_PATH, EVIDENCE_COMPARISON_PROMPT_PATH); all three pass through the same confining helper.

This closes the path-traversal vector where a misconfigured deployment (or an env-injection bug) could cause the application to load a prompt from /etc/, /home/, or any other location the OS user can read.
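
For illustration, here is a minimal sketch of the confinement pattern. PathEscapeError, _safe_prompt_path, and MASTER_PROMPT_PATH come from the source; the ALLOWED_ROOTS construction and the function body are assumptions about how such a helper is typically written, not a copy of the real code:

    import os
    from pathlib import Path

    class PathEscapeError(Exception):
        """Raised when a configured prompt path resolves outside the allowed roots."""

    # Assumed layout: this file lives at tools/master_tool/master_api.py,
    # so parents[1] of the source directory is the repo root.
    _MASTER_TOOL_DIR = Path(__file__).resolve().parent
    ALLOWED_ROOTS = (_MASTER_TOOL_DIR, _MASTER_TOOL_DIR.parents[1] / "docs")

    def _safe_prompt_path(candidate: str) -> Path:
        """Resolve a prompt-file path, rejecting anything outside the allowed roots."""
        resolved = Path(candidate).resolve()  # collapses ../ segments and symlinks
        for root in ALLOWED_ROOTS:
            if resolved.is_relative_to(root):  # Python 3.9+
                return resolved
        raise PathEscapeError(f"prompt path escapes allowed roots: {resolved}")

    # Every env-configurable prompt path passes through the same gate.
    prompt_path = _safe_prompt_path(
        os.environ.get("MASTER_PROMPT_PATH", str(_MASTER_TOOL_DIR / "master_prompt.txt"))
    )

The load-bearing step is resolve() before the containment check: it normalizes ../ sequences and symlinks, so the comparison runs against the real filesystem location rather than the string the attacker supplied.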

Standing instructions in the master prompt

tools/master_tool/master_prompt.txt opens with explicit behavior-shaping rules. Concrete examples:

  • No accusatory language. "Do NOT accuse the customer of fraud. Phrase findings as 'inconsistency', 'requires clarification', 'elevated risk', 'strong indicator', etc."
  • Single-indicator caution. "A single indicator (especially a probabilistic model) is NOT conclusive. Escalate only when signals are strong or corroborated."
  • Insufficient-data fallback. "If tools are missing/errored, say INSUFFICIENT_DATA rather than guessing."
  • Scope limitation. "STRICT SCOPE LIMITATION: Do NOT evaluate insurance coverage alignment. Do NOT speculate on the visual subject matter of the image beyond what the tools explicitly detect. Restrict all key_findings and cross-checks EXCLUSIVELY to tool-grounded signals: AI detection, metadata anomalies, GPS location consistency, timestamp validity, image provenance, and software fingerprints."
  • Source-attribution rules that prevent the model from leaking proprietary tool names (C2PA, manifest, dtect_image) into the customer-facing output.

These rules are reviewed by inspection at PR time; an edit that weakens them is expected to be caught in code review (see Change management), with the automated regression test below as a backstop.

Structured-output contract

The prompt instructs Gemini to return JSON in a defined shape, and the calling code parses it as JSON. A successful prompt-injection attempt that makes the model emit free-form text outside the JSON shape simply fails at parse time and is logged; it does not silently become user-visible output.

There is also a fallback substitution for one specific field (image_investigation_narrative): if Gemini's response omits the field, the application substitutes a deterministic placeholder. The source documents this as a safeguard against "Gemini occasionally drops the field" rather than as a security control, but it has the side effect of preventing a missing or empty field from surprising downstream callers.
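
A sketch of what the parse side can look like. parse_gemini_response and FALLBACK_NARRATIVE are hypothetical names; only the behavior (strict JSON parsing, a logged failure, a deterministic substitute for image_investigation_narrative) is taken from the description above:

    import json
    import logging

    logger = logging.getLogger(__name__)

    # Hypothetical placeholder text; the real deterministic fallback lives in the source.
    FALLBACK_NARRATIVE = "INSUFFICIENT_DATA: image investigation narrative unavailable."

    def parse_gemini_response(raw: str) -> dict:
        """Parse the model response as JSON; free-form text fails here, not in the UI."""
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            # An injection that pushes the model off the JSON shape dies at parse time.
            logger.warning("Gemini response was not valid JSON; discarding output")
            raise
        if not isinstance(parsed, dict):
            logger.warning("Gemini response was JSON but not an object; discarding output")
            raise ValueError("expected a JSON object")

        # Deterministic substitute for the one field Gemini occasionally drops.
        if not parsed.get("image_investigation_narrative"):
            parsed["image_investigation_narrative"] = FALLBACK_NARRATIVE
        return parsed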

Template substitution, not string concatenation

The prompt is built with str.replace() on {{PLACEHOLDER}} tokens rather than f-string concatenation. The placeholders are:

  • {{CURRENT_DATETIME}}, {{POLICY_TYPE}}, {{POLICY_JSON}}, {{CLAIM_JSON}}, {{LAYOVER_RESPONSE_GUIDE_TEXT}}, {{LAYOVER_RESPONSE_JSON}}.
  • All placeholder values are JSON-stringified through safe_json_dumps before substitution, which preserves the JSON-quoting boundary so injected content reads as a JSON string value rather than as prompt syntax.

This is not a complete defense against prompt injection (a string inside JSON can still contain text the model decides to follow), but it does mean a claim narrative with }} or {{ characters cannot break the template structure.
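
A sketch of the substitution step. safe_json_dumps is named in the source, but its body here is an assumption (a thin wrapper over json.dumps); build_prompt is a hypothetical name:

    import json

    def safe_json_dumps(value) -> str:
        # Assumption: the real helper is at least json.dumps. The quoting is the point:
        # injected {{ / }} and quote characters stay inside a JSON string value.
        return json.dumps(value, default=str)

    def build_prompt(template: str, claim: dict, policy: dict) -> str:
        """Fill {{PLACEHOLDER}} tokens with str.replace(); no f-string concatenation."""
        return (
            template
            .replace("{{CLAIM_JSON}}", safe_json_dumps(claim))
            .replace("{{POLICY_JSON}}", safe_json_dumps(policy))
            # The remaining placeholders ({{CURRENT_DATETIME}}, {{POLICY_TYPE}}, ...)
            # are filled the same way.
        )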

Operator-side review on prompt edits

Every change to a prompt file is reviewed at PR time. The reviewer is expected to:

  • Confirm the "IMPORTANT PRINCIPLES" and "STRICT SCOPE LIMITATION" sections remain intact.
  • Confirm the structured-output schema is preserved.
  • Confirm no new placeholder consumes an unsanitized free-form value.

This is documented under Change management as a security-relevant change class.

Automated regression test

server/tests/security/master-prompt-guardrails.test.js reads tools/master_tool/master_prompt.txt directly and asserts that the load-bearing fragments stay intact:

  • The IMPORTANT PRINCIPLES header.
  • The no-fraud-accusation rule.
  • The single-indicator caution.
  • The INSUFFICIENT_DATA fallback.
  • The full STRICT SCOPE LIMITATION block, including the enumeration of tool-grounded signals.
  • The AI DETECTION ATTRIBUTION RULES section, with the C2PA / manifest non-disclosure rules.
  • The CRITICAL — NO HALLUCINATED NUMBERS rule.
  • The {{CURRENT_DATETIME}} template placeholder.

The test runs under the same vitest pipeline as the rest of the security suite (npm test --prefix server or vitest run --dir tests/security). A prompt edit that weakens any of these clauses fails CI before merge.

Known gaps and what's not in place

  • No input-side prompt-injection sanitizer. Claim narratives, evidence summaries, and operator-supplied policy fields flow into the prompt as-is via JSON-stringified template substitution. The defense against "user wrote 'Ignore all previous instructions'" relies entirely on the model's adherence to the standing instructions.
  • No automated red-team eval set for prompt-injection robustness. The regression test above guards the prompt itself against silent weakening; it does not exercise the prompt against adversarial inputs.
  • No per-org prompt customization framework, so any future customer-supplied prompt content is a feature that would need a ground-up sanitization design (not a small add).
  • No content-filter or moderation layer between Gemini's output and the user. Gemini's own safety features apply; we do not stack an additional filter.
  • Output is shown without extra escaping. If model output contains HTML or script content, the React UI's default escaping is the only defense; a targeted XSS via a hostile prompt injection that lands in a rendered field would still have to defeat React's escaping.

Status

partial — verified 2026-04-29.

What's in place:

  • Path-confined prompt loading with explicit allowed roots.
  • Standing instructions in the master prompt covering accusation language, single-indicator caution, INSUFFICIENT_DATA defaults, strict scope limitation, and proprietary-source attribution rules.
  • Structured-output contract with JSON parsing on the calling side.
  • JSON-stringified placeholder substitution ({{...}}) preserves template integrity against quote-character injection.
  • PR review on every prompt-file edit, scoped to the items above.
  • Automated prompt-edit regression test (server/tests/security/master-prompt-guardrails.test.js) that fails CI if a future edit silently weakens the load-bearing guardrail clauses listed above.

Known gaps

Documented in detail above. Headline items:

  • No input-side prompt-injection sanitizer.
  • No automated red-team / eval set for injection robustness.
  • No additional content-filter layer between Gemini output and the UI.

Roadmap

  • Red-team eval set of representative prompt-injection inputs embedded in claim narratives, run as a CI step before any prompt edit ships.
  • Output sanitization of any free-form narrative field rendered in the UI, beyond React's default escaping, before AI-generated content lands in any HTML-context render path.
  • Pin model version (cross-listed on AI transparency) so guardrail behavior is reproducible.