
Coding

Engineering fundamentals for building AI systems in production Python. Interviewers often give live coding tasks covering async I/O, streaming, type safety, and API integration.

01

Async I/O

Async LLM API Calls

  • Use asyncio + an async HTTP client (httpx, aiohttp) or the SDK's async client
  • asyncio.gather() fans out multiple LLM calls concurrently — critical for parallel retrieval
  • Semaphores limit concurrent requests to respect API rate limits
  • Never mix sync and async in the same call stack without run_in_executor()
Do: use AsyncOpenAI / AsyncAnthropic clients and asyncio.gather for parallel LLM calls (sketch below)
Don't: call the sync client from inside an async function; it blocks the event loop
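A minimal sketch of the fan-out pattern with the OpenAI async client, assuming openai>=1.0 and OPENAI_API_KEY in the environment; the model name and the limit of 5 concurrent requests are illustrative:

    import asyncio
    from openai import AsyncOpenAI  # assumes openai>=1.0

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    async def ask(prompt: str, sem: asyncio.Semaphore) -> str | None:
        async with sem:  # cap in-flight requests to respect rate limits
            resp = await client.chat.completions.create(
                model="gpt-4o-mini",  # illustrative model choice
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content

    async def main() -> None:
        sem = asyncio.Semaphore(5)
        prompts = ["Summarize RAG in one sentence.", "Define semantic search."]
        # Fan out: all calls run concurrently; results come back in input order
        answers = await asyncio.gather(*(ask(p, sem) for p in prompts))
        for answer in answers:
            print(answer)

    if __name__ == "__main__":
        asyncio.run(main())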

Async Patterns for RAG

  • Fan out: embed query and retrieve from multiple sources concurrently with asyncio.gather
  • Async context managers: use async with for database connections and HTTP sessions
  • Background tasks: asyncio.create_task() for fire-and-forget logging or analytics
  • Timeout handling: asyncio.wait_for() wraps any coroutine with a deadline
Do: parallelize embedding, retrieval, and any independent preprocessing steps with asyncio.gather (sketch below)
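A sketch of concurrent retrieval with deadlines; the embed/search helpers are hypothetical stand-ins for your embedding client and stores:

    import asyncio

    async def embed(query: str) -> list[float]:
        await asyncio.sleep(0.1)   # stand-in for an embedding API call
        return [0.0] * 8

    async def search_vector_db(vec: list[float]) -> list[str]:
        await asyncio.sleep(0.2)   # stand-in for ANN search
        return ["dense-hit-1", "dense-hit-2"]

    async def search_keyword_index(query: str) -> list[str]:
        await asyncio.sleep(0.2)   # stand-in for BM25 search
        return ["sparse-hit-1"]

    async def retrieve(query: str) -> list[str]:
        vec = await embed(query)
        # Fan out: both sources are queried concurrently, each with a deadline
        dense, sparse = await asyncio.gather(
            asyncio.wait_for(search_vector_db(vec), timeout=2.0),
            asyncio.wait_for(search_keyword_index(query), timeout=2.0),
        )
        return dense + sparse

    print(asyncio.run(retrieve("what is hybrid search?")))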
02

Streaming

Streaming LLM Responses

  • Server-Sent Events (SSE) or WebSocket stream tokens incrementally to the client
  • SDK streaming: iterate over the stream object and yield each delta chunk
  • Buffer partial tokens before forwarding — some chunks may split multi-byte characters
  • Streaming improves perceived responsiveness by reducing time to first token, even though total generation time is unchanged
Do: implement streaming for all interactive interfaces (example below); users abandon non-streaming UIs faster
Don't: concatenate the full stream before returning to the client; it defeats the purpose of streaming
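A sketch of token streaming with the OpenAI async client, assuming openai>=1.0; the model name is illustrative:

    from openai import AsyncOpenAI  # assumes openai>=1.0

    client = AsyncOpenAI()

    async def stream_answer(prompt: str):
        # stream=True returns chunks as the model generates them
        stream = await client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                # Forward each delta to the client (SSE frame, WebSocket message, ...)
                yield chunk.choices[0].delta.content

With FastAPI, for example, such a generator can be passed to StreamingResponse with media_type="text/event-stream"; strict SSE clients also expect "data: ..." framing around each chunk.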

Streaming with Tool Calls

  • Tool call arguments stream as partial JSON — do not parse until the stream finishes
  • Accumulate the tool call delta until stop_reason == 'tool_use'
  • Some SDKs expose a stream helper that buffers tool call events automatically
Do: accumulate tool call argument chunks before JSON parsing (sketch below); partial JSON will raise a parse error
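A sketch of accumulating tool call deltas, assuming chunks shaped like the openai>=1.0 streaming format and already collected into an iterable:

    import json

    def accumulate_tool_calls(chunks) -> list[dict]:
        """Collect streamed tool-call deltas (OpenAI-style) into complete calls."""
        calls: dict[int, dict] = {}
        for chunk in chunks:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta
            for tc in delta.tool_calls or []:
                call = calls.setdefault(tc.index, {"name": "", "arguments": ""})
                if tc.function.name:
                    call["name"] = tc.function.name
                if tc.function.arguments:
                    call["arguments"] += tc.function.arguments  # partial JSON fragments
        # Parse only after the stream is exhausted; fragments are not valid JSON
        return [
            {"name": c["name"], "arguments": json.loads(c["arguments"] or "{}")}
            for c in calls.values()
        ]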
03

Typing

Type Safety in AI Code

  • Pydantic models validate LLM inputs and outputs at runtime with clear error messages
  • TypedDict and dataclasses document prompt structure; mypy/pyright catch type mismatches statically
  • Use Literal types for constrained string fields (e.g., role: Literal['user', 'assistant'])
  • Pydantic v2 is significantly faster than v1 and has better error messages — prefer it for new projects
Do: define Pydantic models for all LLM inputs and outputs (example below); never pass raw dicts through the full pipeline
Don't: use Any annotations to bypass type checking; it defeats the purpose
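A minimal Pydantic v2 sketch; the ExtractionResult fields are illustrative:

    from typing import Literal
    from pydantic import BaseModel, ValidationError

    class Message(BaseModel):
        role: Literal["system", "user", "assistant"]
        content: str

    class ExtractionResult(BaseModel):
        sentiment: Literal["positive", "neutral", "negative"]
        confidence: float

    def parse_llm_output(raw_json: str) -> ExtractionResult:
        try:
            # model_validate_json parses and validates in one step (Pydantic v2)
            return ExtractionResult.model_validate_json(raw_json)
        except ValidationError as err:
            # Field-level errors are useful for retry-with-feedback loops
            raise ValueError(f"LLM output failed validation: {err}") from err

    print(parse_llm_output('{"sentiment": "positive", "confidence": 0.92}'))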
04

Testing

Testing LLM Application Code

  • Mock LLM API calls in unit tests — don't make real API calls in CI
  • Use recorded responses (cassettes via pytest-recording or VCR.py) for deterministic tests
  • Separate unit tests (prompt construction, parsing) from integration tests (end-to-end pipeline)
  • Parameterize test cases to cover multiple input variants from a single test function
Do: test prompt construction and output parsing logic independently from the LLM call (sketch below)
Don't: write tests that call the real LLM API; they are flaky, slow, and expensive in CI
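A sketch of a mocked unit test with pytest's monkeypatch; the pipeline module, its module-level client, and answer_question are hypothetical names standing in for your own code:

    # test_pipeline.py
    from unittest.mock import MagicMock

    import pipeline  # hypothetical module that wraps the LLM call

    def test_answer_question_builds_prompt_and_parses(monkeypatch):
        fake_response = MagicMock()
        fake_response.choices[0].message.content = "42"

        fake_client = MagicMock()
        fake_client.chat.completions.create.return_value = fake_response
        monkeypatch.setattr(pipeline, "client", fake_client)  # no real API call

        assert pipeline.answer_question("What is 6 * 7?") == "42"

        # Inspect the prompt actually sent (assumes keyword arguments)
        sent = fake_client.chat.completions.create.call_args.kwargs["messages"]
        assert "6 * 7" in sent[-1]["content"]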

Property-Based Testing

  • Hypothesis generates random valid inputs and finds edge cases automatically
  • Useful for testing parsers, chunkers, and token-count estimators
  • Define invariants: 'chunking then joining should recover the original text'
Do: apply property-based tests to data transformation functions in your AI pipeline (example below)
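A sketch with Hypothesis; chunk_text is a stand-in for your own chunker:

    from hypothesis import given, strategies as st

    def chunk_text(text: str, size: int = 100) -> list[str]:
        """Hypothetical fixed-size chunker; replace with your own."""
        return [text[i : i + size] for i in range(0, len(text), size)]

    @given(st.text(), st.integers(min_value=1, max_value=500))
    def test_chunking_roundtrip(text, size):
        # Invariant: joining the chunks recovers the original text exactly
        assert "".join(chunk_text(text, size)) == text

    @given(st.text(min_size=1), st.integers(min_value=1, max_value=500))
    def test_chunks_respect_max_size(text, size):
        assert all(len(c) <= size for c in chunk_text(text, size))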
05

Profiling

Profiling AI Pipelines

  • Time each stage: embedding, ANN search, re-ranking, LLM call, parsing separately
  • cProfile and py-spy for CPU profiling; memory_profiler for RAM
  • Async profiling: pyinstrument supports async code; standard cProfile does not
  • Token count profiling: log input and output tokens per request to find expensive callers
Do: profile before optimizing (sketch below); LLM latency almost always dominates and masks other bottlenecks
Don't: micro-optimize Python code before verifying the LLM call is not the bottleneck
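A minimal sketch of per-stage timing with a context manager; the stage names are illustrative:

    import time
    from contextlib import contextmanager

    timings: dict[str, float] = {}

    @contextmanager
    def timed(stage: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            timings[stage] = time.perf_counter() - start

    # Wrap each pipeline stage separately (bodies elided here)
    with timed("embed"):
        ...  # embed the query
    with timed("retrieve"):
        ...  # ANN search + re-ranking
    with timed("llm_call"):
        ...  # the generation request

    total = sum(timings.values())
    for stage, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{stage:10s} {secs:6.3f}s  {100 * secs / max(total, 1e-9):5.1f}%")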
06

APIs

LLM API Integration Patterns

  • Exponential backoff with jitter on 429 and 5xx responses
  • Circuit breaker: after N failures, open the circuit and fail fast without hammering the provider
  • Retry budget: limit total retry time (e.g., 30 s) to bound end-user latency
  • Provider abstraction layer: wrap multiple LLM APIs behind a common interface for easy switching
Do: implement retry with exponential backoff from day one (sketch below); rate limit errors are routine at scale
Don't: retry indefinitely without a max retry count or total time limit
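A sketch of jittered exponential backoff with a retry budget; TransientAPIError is a stand-in for your SDK's rate-limit (429) and server-error (5xx) exceptions:

    import asyncio
    import random
    import time

    class TransientAPIError(Exception):
        """Stand-in for the SDK's 429/5xx exceptions."""

    async def call_with_backoff(make_request, max_attempts: int = 5, budget_s: float = 30.0):
        """Retry an async request factory with exponential backoff, jitter, and a time budget."""
        start = time.monotonic()
        for attempt in range(max_attempts):
            try:
                return await make_request()
            except TransientAPIError:
                delay = min(2 ** attempt, 8) * random.uniform(0.5, 1.5)  # 1s, 2s, 4s, 8s... with jitter
                out_of_time = time.monotonic() - start + delay > budget_s
                if attempt == max_attempts - 1 or out_of_time:
                    raise  # bound both retry count and total latency
                await asyncio.sleep(delay)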

API Key and Secret Management

  • Never hardcode API keys — use environment variables or a secrets manager (Vault, AWS Secrets Manager)
  • Rotate keys periodically and immediately on suspected compromise
  • Scope API key permissions to minimum required (read-only where possible)
  • Audit API key usage per service and alert on anomalous spend patterns
Do: store all API keys in a secrets manager and inject them at runtime via environment variables (sketch below)
Don't: commit API keys to git, even in private repositories; use pre-commit hooks to scan for secrets
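A small sketch of reading keys injected at runtime, failing fast when one is missing:

    import os

    def get_required_secret(name: str) -> str:
        """Read a secret injected at runtime; fail fast if it is missing."""
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        return value

    # The key never appears in source or in git; it is populated in the
    # runtime environment from a secrets manager during deployment.
    OPENAI_API_KEY = get_required_secret("OPENAI_API_KEY")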