Safety & Ethics Cheatsheet — AI Engineer Interview Prep

01

Prompt injection

Prompt Injection Attacks

Direct injection: user embeds instructions in their input to override system prompt
Indirect injection: malicious instructions embedded in retrieved documents the model processes
Agents are especially vulnerable because they act on retrieved content automatically
Defense: delimit user/document content clearly; run a classifier to detect injection attempts

✓ Treat all external content (user input, retrieved documents, tool outputs) as untrusted

✗ Assume a well-written system prompt is sufficient defense against injection

Privilege Separation for Agents

Instructions from the system prompt should have higher authority than user-supplied content
Never allow user-supplied content to expand model permissions or tool scope
Dual-check: verify high-stakes tool actions against the original system-level policy, not just the current turn

✓ Design agentic systems with explicit trust levels for instruction sources

02

Refusal Design

Hard refusals: never respond to certain categories regardless of context (CSAM, bioweapons)
Soft refusals: decline but offer alternatives or ask for clarification
Refusal calibration: measure false-positive rate on benign queries — over-refusal destroys UX
Explanation in refusals reduces user frustration and supports compliance audits

✓ Specify hard vs soft refusal policies in the system prompt and in the policy document

✗ Use a catch-all refusal for anything remotely sensitive — it makes the product unusable

03

PII Detection and Handling

Detect PII in inputs and outputs: names, email, phone, SSN, credit card, health data
Redact or pseudonymize PII before logging or storing prompts/responses
Inform users when their PII is processed; obtain consent where required by GDPR/CCPA
Test PII detection with adversarial formats (typos, Unicode substitution, split across tokens)

✓ Treat logs containing PII as sensitive data — apply retention limits and access controls

✗ Log raw prompts indefinitely without a PII scrubbing step

04

Usage Policy Design

Define acceptable and prohibited use cases in writing before building the product
Use tiered policies: global rules (safety team), product rules (PM), user-adjustable preferences
Policy should be encoded in system prompts, classifiers, and human review workflows
Regular policy review cadence (quarterly minimum) as capabilities and risks evolve

✓ Document the policy rationale alongside the policy rules — reasoning helps with edge cases

✗ Write policy only in the system prompt — policy also needs to live in classifiers and human review

05

Model Bias Evaluation

Measure disparate outcomes across demographic groups: gender, race, age, nationality
Counterfactual evaluation: change demographic words in prompts and measure output change
Benchmark suites: WinoBias, BBQ, ToxiGen for specific bias types
Bias in training data propagates to outputs — data audits are as important as eval audits

✓ Run bias benchmarks before any model goes to production in a sensitive domain

✗ Assume a commercially available model is free of bias — always verify on your use case

06

Red Teaming LLM Systems

Adversarial testing: attempt to elicit policy-violating outputs through various attack vectors
Automated red teaming: use another LLM to generate diverse adversarial prompts at scale
Domain-specific red teams: security researchers, ethicists, and community members bring different attack perspectives
Document every finding and track remediation in a safety backlog

✓ Red-team before every major model or policy update, not just at initial launch

✗ Consider red teaming complete after one round — new attack techniques emerge continuously