Cheatsheet
Coding
Engineering fundamentals for building AI systems in production Python. Interviewers often give live coding tasks covering async I/O, streaming, type safety, and API integration.
01
Async I/O
Async LLM API Calls
- Use asyncio + an async HTTP client (httpx, aiohttp) or the SDK's async client
- asyncio.gather() fans out multiple LLM calls concurrently — critical for parallel retrieval
- Semaphores limit concurrent requests to respect API rate limits
- Never call blocking sync code inside a coroutine; offload it with asyncio.to_thread() or run_in_executor()
✓
Use AsyncOpenAI / AsyncAnthropic clients and asyncio.gather for parallel LLM calls
✗
Call the sync client from inside an async function — it blocks the event loop
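A minimal sketch of the fan-out pattern, assuming the openai package's AsyncOpenAI client; the model name and concurrency limit are illustrative:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()                 # reads OPENAI_API_KEY from the environment
semaphore = asyncio.Semaphore(5)       # cap in-flight requests to respect rate limits

async def complete(prompt: str) -> str:
    async with semaphore:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",       # illustrative model name
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

async def main(prompts: list[str]) -> list[str]:
    # gather() runs all calls concurrently and preserves input order
    return await asyncio.gather(*(complete(p) for p in prompts))

results = asyncio.run(main(["Summarize doc A", "Summarize doc B"]))
```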
Async Patterns for RAG
- Fan out: embed query and retrieve from multiple sources concurrently with asyncio.gather
- Async context managers: use async with for database connections and HTTP sessions
- Background tasks: asyncio.create_task() for fire-and-forget logging or analytics
- Timeout handling: asyncio.wait_for() wraps any coroutine with a deadline
✓
Parallelize embedding, retrieval, and any independent preprocessing steps with asyncio.gather
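A sketch of the fan-out-with-deadline pattern; the retrieval sources and logging call are placeholder stand-ins, not a specific library's API:

```python
import asyncio

# Placeholder async sources standing in for a real vector DB and keyword index
async def search_vector_db(embedding: list[float]) -> list[str]:
    await asyncio.sleep(0.1)
    return ["dense hit"]

async def search_keyword_index(query: str) -> list[str]:
    await asyncio.sleep(0.1)
    return ["sparse hit"]

async def log_retrieval(query: str) -> None:
    await asyncio.sleep(0)  # stand-in for an analytics write

_background_tasks: set[asyncio.Task] = set()

async def retrieve(query: str, embedding: list[float]) -> dict:
    # Fan out to independent sources concurrently, each wrapped with a deadline
    dense, sparse = await asyncio.gather(
        asyncio.wait_for(search_vector_db(embedding), timeout=2.0),
        asyncio.wait_for(search_keyword_index(query), timeout=2.0),
    )
    # Fire-and-forget logging; keep a reference so the task isn't garbage-collected
    task = asyncio.create_task(log_retrieval(query))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return {"dense": dense, "sparse": sparse}

print(asyncio.run(retrieve("what is RAG?", [0.1, 0.2])))
```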
02
Streaming
Streaming LLM Responses
- Server-Sent Events (SSE) or WebSockets stream tokens incrementally to the client
- SDK streaming: iterate over the stream object and yield each delta chunk
- Buffer partial tokens before forwarding — some chunks may split multi-byte characters
- Streaming improves perceived latency by cutting time-to-first-token; total generation time is unchanged
✓
Implement streaming for all interactive interfaces — users abandon non-streaming UIs faster
✗
Concatenate the full stream before returning to the client — it defeats the purpose of streaming
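A sketch of an SSE endpoint, assuming FastAPI and the openai AsyncOpenAI client; in real code the delta should be JSON-encoded so newlines don't break the SSE framing:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()

async def token_stream(prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",           # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:                      # role-only or tool-call chunks carry no text
            yield f"data: {delta}\n\n" # SSE framing: one event per delta
    yield "data: [DONE]\n\n"

@app.get("/chat")
async def chat(prompt: str):
    return StreamingResponse(token_stream(prompt), media_type="text/event-stream")
```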
Streaming with Tool Calls
- Tool call arguments stream as partial JSON — do not parse until the stream finishes
- Accumulate tool call deltas until the stream signals completion (stop_reason == 'tool_use' for Anthropic, finish_reason == 'tool_calls' for OpenAI)
- Some SDKs expose a stream helper that buffers tool call events automatically
✓
Accumulate tool call argument chunks before JSON parsing — partial JSON will raise a parse error
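A sketch in the OpenAI streaming style (the Anthropic analogue watches for stop_reason == 'tool_use'); the model name is illustrative and the tool schema is assumed to come from elsewhere:

```python
import json
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_tool_calls(messages: list[dict], tools: list[dict]) -> dict:
    stream = await client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, tools=tools, stream=True,
    )
    calls: dict[int, dict] = {}        # tool-call index -> accumulated name/arguments
    async for chunk in stream:
        delta = chunk.choices[0].delta
        for tc in delta.tool_calls or []:
            call = calls.setdefault(tc.index, {"name": "", "arguments": ""})
            if tc.function.name:
                call["name"] = tc.function.name
            call["arguments"] += tc.function.arguments or ""  # partial JSON fragments
    # Only after the stream ends is each arguments string complete, valid JSON
    return {c["name"]: json.loads(c["arguments"]) for c in calls.values()}
```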
03
Typing
Type Safety in AI Code
- Pydantic models validate LLM inputs and outputs at runtime with clear error messages
- TypedDict and dataclasses document prompt structure; mypy/pyright catch type mismatches statically
- Use Literal types for constrained string fields (e.g., role: Literal['user', 'assistant'])
- Pydantic v2 is significantly faster than v1 and has better error messages — prefer it for new projects
✓
Define Pydantic models for all LLM inputs and outputs; never pass raw dicts through the full pipeline
✗
Use Any type annotations to bypass type checking — it defeats the purpose
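A minimal Pydantic v2 sketch; the schema and example payload are illustrative:

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class Message(BaseModel):
    role: Literal["user", "assistant", "system"]
    content: str

class Classification(BaseModel):
    label: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0.0, le=1.0)

raw = '{"label": "positive", "confidence": 0.92}'   # pretend this came from the model
try:
    result = Classification.model_validate_json(raw)  # parse + validate in one step
except ValidationError as exc:
    print(exc)                          # field-level errors instead of a silent bad dict
else:
    print(result.label, result.confidence)
```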
04
Testing
Testing LLM Application Code
- Mock LLM API calls in unit tests — don't make real API calls in CI
- Use recorded responses (cassettes via pytest-recording or VCR.py) for deterministic tests
- Separate unit tests (prompt construction, parsing) from integration tests (end-to-end pipeline)
- Parameterize test cases to cover multiple input variants from a single test function
✓
Test prompt construction and output parsing logic independently from the LLM call
✗
Write tests that call the real LLM API — flaky, slow, and expensive in CI
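A sketch of the unit-test split, assuming pytest with the pytest-asyncio plugin; build_prompt, parse_label, and the mocked client interface are stand-ins for your own code:

```python
from unittest.mock import AsyncMock
import pytest

def build_prompt(text: str) -> str:                  # stand-in for real prompt code
    return f"Classify the sentiment: {text}"

def parse_label(completion: str) -> str:             # stand-in for real parsing code
    return completion.strip().lower()

@pytest.mark.parametrize("raw,expected", [("Positive\n", "positive"), (" NEUTRAL", "neutral")])
def test_parse_label(raw, expected):
    assert parse_label(raw) == expected

@pytest.mark.asyncio
async def test_pipeline_uses_mocked_client():
    fake_client = AsyncMock()
    fake_client.complete.return_value = "positive"   # hypothetical client interface
    completion = await fake_client.complete(build_prompt("great product"))
    assert parse_label(completion) == "positive"
    fake_client.complete.assert_awaited_once()       # no real API call was made
```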
Property-Based Testing
- Hypothesis generates random valid inputs and finds edge cases automatically
- Useful for testing parsers, chunkers, and token-count estimators
- Define invariants: 'chunking then joining should recover the original text'
✓
Apply property-based tests to data transformation functions in your AI pipeline
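A sketch with Hypothesis; chunk_text is an illustrative fixed-size chunker, not a specific library's:

```python
from hypothesis import given, strategies as st

def chunk_text(text: str, size: int = 50) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

@given(st.text())
def test_chunking_roundtrips(text):
    # Invariant: chunking then joining should recover the original text
    assert "".join(chunk_text(text)) == text

@given(st.text(min_size=1))
def test_chunks_respect_max_size(text):
    assert all(len(c) <= 50 for c in chunk_text(text))
```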
05
Profiling
Profiling AI Pipelines
- Time each stage: embedding, ANN search, re-ranking, LLM call, parsing separately
- cProfile and py-spy for CPU profiling; memory_profiler for RAM
- Async profiling: pyinstrument supports async code; standard cProfile does not
- Token count profiling: log input and output tokens per request to find expensive callers
✓
Profile before optimizing — LLM latency almost always dominates and masks other bottlenecks
✗
Micro-optimize Python code before verifying the LLM call is not the bottleneck
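A minimal per-stage timing sketch using only the standard library; the stage names are illustrative:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

with stage("embed"):
    ...  # embedding call
with stage("retrieve"):
    ...  # ANN search + re-ranking
with stage("llm"):
    ...  # model call; also log response usage tokens here if the SDK exposes them
with stage("parse"):
    ...  # output parsing

for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:<10} {seconds * 1000:8.1f} ms")
```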
06
APIs
LLM API Integration Patterns
- Exponential backoff with jitter on 429 and 5xx responses
- Circuit breaker: after N failures, open the circuit and fail fast without hammering the provider
- Retry budget: limit total retry time (e.g., 30 s) to bound end-user latency
- Provider abstraction layer: wrap multiple LLM APIs behind a common interface for easy switching
✓
Implement retry with exponential backoff from day one — rate limit errors are routine at scale
✗
Retry indefinitely without a max retry count or total time limit
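A sketch of backoff-with-jitter plus a retry budget; the error classes stand in for whatever your SDK raises:

```python
import asyncio
import random
import time

class RateLimitError(Exception): ...   # stand-ins for provider-specific errors
class ServerError(Exception): ...

async def call_with_retries(call, *, max_attempts: int = 5, budget_s: float = 30.0):
    deadline = time.monotonic() + budget_s          # bound total end-user latency
    for attempt in range(max_attempts):
        try:
            return await call()                     # zero-arg async callable
        except (RateLimitError, ServerError):
            # Exponential backoff with full jitter: up to 1s, 2s, 4s... capped at 10s
            delay = min(10.0, 2 ** attempt) * random.random()
            if time.monotonic() + delay > deadline:
                raise                               # budget exhausted, fail fast
            await asyncio.sleep(delay)
    raise RuntimeError("retry attempts exhausted")
```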
API Key and Secret Management
- Never hardcode API keys — use environment variables or a secrets manager (Vault, AWS Secrets Manager)
- Rotate keys periodically and immediately on suspected compromise
- Scope API key permissions to minimum required (read-only where possible)
- Audit API key usage per service and alert on anomalous spend patterns
✓
Store all API keys in a secrets manager and inject at runtime via environment variables
✗
Commit API keys to git even in private repositories — use pre-commit hooks to scan for secrets
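A minimal sketch of fail-fast key loading from the environment; the variable name is illustrative:

```python
import os

def load_api_key(var: str = "OPENAI_API_KEY") -> str:
    key = os.environ.get(var)
    if not key:
        # Fail at startup, not on the first request deep inside the pipeline
        raise RuntimeError(f"{var} is not set; inject it from your secrets manager")
    return key

api_key = load_api_key()
```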