← All cheatsheets
Cheatsheet
Safety & Ethics
Principles and practices for building AI systems that are safe, fair, and trustworthy. Interviewers test your ability to identify risks and implement mitigations in real products.
01
Prompt injection
Prompt Injection Attacks
- Direct injection: user embeds instructions in their input to override system prompt
- Indirect injection: malicious instructions embedded in retrieved documents the model processes
- Agents are especially vulnerable because they act on retrieved content automatically
- Defense: delimit user/document content clearly; run a classifier to detect injection attempts
✓
Treat all external content (user input, retrieved documents, tool outputs) as untrusted
✗
Assume a well-written system prompt is sufficient defense against injection
Privilege Separation for Agents
- Instructions from the system prompt should have higher authority than user-supplied content
- Never allow user-supplied content to expand model permissions or tool scope
- Dual-check: verify high-stakes tool actions against the original system-level policy, not just the current turn
✓
Design agentic systems with explicit trust levels for instruction sources
02
Refusals
Refusal Design
- Hard refusals: never respond to certain categories regardless of context (CSAM, bioweapons)
- Soft refusals: decline but offer alternatives or ask for clarification
- Refusal calibration: measure false-positive rate on benign queries — over-refusal destroys UX
- Explanation in refusals reduces user frustration and supports compliance audits
✓
Specify hard vs soft refusal policies in the system prompt and in the policy document
✗
Use a catch-all refusal for anything remotely sensitive — it makes the product unusable
03
PII
PII Detection and Handling
- Detect PII in inputs and outputs: names, email, phone, SSN, credit card, health data
- Redact or pseudonymize PII before logging or storing prompts/responses
- Inform users when their PII is processed; obtain consent where required by GDPR/CCPA
- Test PII detection with adversarial formats (typos, Unicode substitution, split across tokens)
✓
Treat logs containing PII as sensitive data — apply retention limits and access controls
✗
Log raw prompts indefinitely without a PII scrubbing step
04
Policy design
Usage Policy Design
- Define acceptable and prohibited use cases in writing before building the product
- Use tiered policies: global rules (safety team), product rules (PM), user-adjustable preferences
- Policy should be encoded in system prompts, classifiers, and human review workflows
- Regular policy review cadence (quarterly minimum) as capabilities and risks evolve
✓
Document the policy rationale alongside the policy rules — reasoning helps with edge cases
✗
Write policy only in the system prompt — policy also needs to live in classifiers and human review
05
Bias checks
Model Bias Evaluation
- Measure disparate outcomes across demographic groups: gender, race, age, nationality
- Counterfactual evaluation: change demographic words in prompts and measure output change
- Benchmark suites: WinoBias, BBQ, ToxiGen for specific bias types
- Bias in training data propagates to outputs — data audits are as important as eval audits
✓
Run bias benchmarks before any model goes to production in a sensitive domain
✗
Assume a commercially available model is free of bias — always verify on your use case
06
Red teaming
Red Teaming LLM Systems
- Adversarial testing: attempt to elicit policy-violating outputs through various attack vectors
- Automated red teaming: use another LLM to generate diverse adversarial prompts at scale
- Domain-specific red teams: security researchers, ethicists, and community members bring different attack perspectives
- Document every finding and track remediation in a safety backlog
✓
Red-team before every major model or policy update, not just at initial launch
✗
Consider red teaming complete after one round — new attack techniques emerge continuously