
Coding

Engineering fundamentals for building AI systems in production Python. Interviewers often give live coding tasks covering async I/O, streaming, type safety, and API integration.

01

Async I/O

Async LLM API Calls

  • Use asyncio + an async HTTP client (httpx, aiohttp) or the SDK's async client
  • asyncio.gather() fans out multiple LLM calls concurrently — critical for parallel retrieval
  • Semaphores limit concurrent requests to respect API rate limits
  • Never mix sync and async in the same call stack without run_in_executor()
Do: use AsyncOpenAI / AsyncAnthropic clients and asyncio.gather for parallel LLM calls (sketch below)
Don't: call the sync client from inside an async function; it blocks the event loop
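A minimal sketch of the fan-out pattern with the OpenAI async client, assuming openai>=1.0 and OPENAI_API_KEY in the environment; the model name and the limit of 5 concurrent requests are illustrative:

    import asyncio
    from openai import AsyncOpenAI  # assumes openai>=1.0

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    async def ask(prompt: str, sem: asyncio.Semaphore) -> str | None:
        async with sem:  # cap in-flight requests to respect rate limits
            resp = await client.chat.completions.create(
                model="gpt-4o-mini",  # illustrative model choice
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content

    async def main() -> None:
        sem = asyncio.Semaphore(5)
        prompts = ["Summarize RAG in one sentence.", "Define semantic search."]
        # Fan out: all calls run concurrently; results come back in input order
        answers = await asyncio.gather(*(ask(p, sem) for p in prompts))
        for answer in answers:
            print(answer)

    if __name__ == "__main__":
        asyncio.run(main())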

Async Patterns for RAG

  • Fan out: embed query and retrieve from multiple sources concurrently with asyncio.gather
  • Async context managers: use async with for database connections and HTTP sessions
  • Background tasks: asyncio.create_task() for fire-and-forget logging or analytics
  • Timeout handling: asyncio.wait_for() wraps any coroutine with a deadline
Do: parallelize embedding, retrieval, and any independent preprocessing steps with asyncio.gather (sketch below)
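A sketch of concurrent retrieval with deadlines; the embed/search helpers are hypothetical stand-ins for your embedding client and stores:

    import asyncio

    async def embed(query: str) -> list[float]:
        await asyncio.sleep(0.1)   # stand-in for an embedding API call
        return [0.0] * 8

    async def search_vector_db(vec: list[float]) -> list[str]:
        await asyncio.sleep(0.2)   # stand-in for ANN search
        return ["dense-hit-1", "dense-hit-2"]

    async def search_keyword_index(query: str) -> list[str]:
        await asyncio.sleep(0.2)   # stand-in for BM25 search
        return ["sparse-hit-1"]

    async def retrieve(query: str) -> list[str]:
        vec = await embed(query)
        # Fan out: both sources are queried concurrently, each with a deadline
        dense, sparse = await asyncio.gather(
            asyncio.wait_for(search_vector_db(vec), timeout=2.0),
            asyncio.wait_for(search_keyword_index(query), timeout=2.0),
        )
        return dense + sparse

    print(asyncio.run(retrieve("what is hybrid search?")))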
02

Streaming

Streaming LLM Responses

  • Server-Sent Events (SSE) or WebSocket stream tokens incrementally to the client
  • SDK streaming: iterate over the stream object and yield each delta chunk
  • Buffer partial tokens before forwarding — some chunks may split multi-byte characters
  • Streaming improves perceived responsiveness by reducing time to first token, even though total generation time is unchanged
Do: implement streaming for all interactive interfaces (example below); users abandon non-streaming UIs faster
Don't: concatenate the full stream before returning to the client; it defeats the purpose of streaming
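A sketch of token streaming with the OpenAI async client, assuming openai>=1.0; the model name is illustrative:

    from openai import AsyncOpenAI  # assumes openai>=1.0

    client = AsyncOpenAI()

    async def stream_answer(prompt: str):
        # stream=True returns chunks as the model generates them
        stream = await client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                # Forward each delta to the client (SSE frame, WebSocket message, ...)
                yield chunk.choices[0].delta.content

With FastAPI, for example, such a generator can be passed to StreamingResponse with media_type="text/event-stream"; strict SSE clients also expect "data: ..." framing around each chunk.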

Streaming with Tool Calls

  • Tool call arguments stream as partial JSON — do not parse until the stream finishes
  • Accumulate the tool call delta until stop_reason == 'tool_use'
  • Some SDKs expose a stream helper that buffers tool call events automatically
Do: accumulate tool call argument chunks before JSON parsing (sketch below); partial JSON will raise a parse error
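A sketch of accumulating tool call deltas, assuming chunks shaped like the openai>=1.0 streaming format and already collected into an iterable:

    import json

    def accumulate_tool_calls(chunks) -> list[dict]:
        """Collect streamed tool-call deltas (OpenAI-style) into complete calls."""
        calls: dict[int, dict] = {}
        for chunk in chunks:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta
            for tc in delta.tool_calls or []:
                call = calls.setdefault(tc.index, {"name": "", "arguments": ""})
                if tc.function.name:
                    call["name"] = tc.function.name
                if tc.function.arguments:
                    call["arguments"] += tc.function.arguments  # partial JSON fragments
        # Parse only after the stream is exhausted; fragments are not valid JSON
        return [
            {"name": c["name"], "arguments": json.loads(c["arguments"] or "{}")}
            for c in calls.values()
        ]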
03

Typing

Type Safety in AI Code

  • Pydantic models validate LLM inputs and outputs at runtime with clear error messages
  • TypedDict and dataclasses document prompt structure; mypy/pyright catch type mismatches statically
  • Use Literal types for constrained string fields (e.g., role: Literal['user', 'assistant'])
  • Pydantic v2 is significantly faster than v1 and has better error messages — prefer it for new projects
Do: define Pydantic models for all LLM inputs and outputs (example below); never pass raw dicts through the full pipeline
Don't: use Any annotations to bypass type checking; it defeats the purpose
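A minimal Pydantic v2 sketch; the ExtractionResult fields are illustrative:

    from typing import Literal
    from pydantic import BaseModel, ValidationError

    class Message(BaseModel):
        role: Literal["system", "user", "assistant"]
        content: str

    class ExtractionResult(BaseModel):
        sentiment: Literal["positive", "neutral", "negative"]
        confidence: float

    def parse_llm_output(raw_json: str) -> ExtractionResult:
        try:
            # model_validate_json parses and validates in one step (Pydantic v2)
            return ExtractionResult.model_validate_json(raw_json)
        except ValidationError as err:
            # Field-level errors are useful for retry-with-feedback loops
            raise ValueError(f"LLM output failed validation: {err}") from err

    print(parse_llm_output('{"sentiment": "positive", "confidence": 0.92}'))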
04

Testing

Testing LLM Application Code

  • Mock LLM API calls in unit tests — don't make real API calls in CI
  • Use recorded responses (cassettes via pytest-recording or VCR.py) for deterministic tests
  • Separate unit tests (prompt construction, parsing) from integration tests (end-to-end pipeline)
  • Parameterize test cases to cover multiple input variants from a single test function
Do: test prompt construction and output parsing logic independently from the LLM call (sketch below)
Don't: write tests that call the real LLM API; they are flaky, slow, and expensive in CI
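A sketch of a mocked unit test with pytest's monkeypatch; the pipeline module, its module-level client, and answer_question are hypothetical names standing in for your own code:

    # test_pipeline.py
    from unittest.mock import MagicMock

    import pipeline  # hypothetical module that wraps the LLM call

    def test_answer_question_builds_prompt_and_parses(monkeypatch):
        fake_response = MagicMock()
        fake_response.choices[0].message.content = "42"

        fake_client = MagicMock()
        fake_client.chat.completions.create.return_value = fake_response
        monkeypatch.setattr(pipeline, "client", fake_client)  # no real API call

        assert pipeline.answer_question("What is 6 * 7?") == "42"

        # Inspect the prompt actually sent (assumes keyword arguments)
        sent = fake_client.chat.completions.create.call_args.kwargs["messages"]
        assert "6 * 7" in sent[-1]["content"]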

Property-Based Testing

  • Hypothesis generates random valid inputs and finds edge cases automatically
  • Useful for testing parsers, chunkers, and token-count estimators
  • Define invariants: 'chunking then joining should recover the original text'
Do: apply property-based tests to data transformation functions in your AI pipeline (example below)
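A sketch with Hypothesis; chunk_text is a stand-in for your own chunker:

    from hypothesis import given, strategies as st

    def chunk_text(text: str, size: int = 100) -> list[str]:
        """Hypothetical fixed-size chunker; replace with your own."""
        return [text[i : i + size] for i in range(0, len(text), size)]

    @given(st.text(), st.integers(min_value=1, max_value=500))
    def test_chunking_roundtrip(text, size):
        # Invariant: joining the chunks recovers the original text exactly
        assert "".join(chunk_text(text, size)) == text

    @given(st.text(min_size=1), st.integers(min_value=1, max_value=500))
    def test_chunks_respect_max_size(text, size):
        assert all(len(c) <= size for c in chunk_text(text, size))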
05

Profiling

Profiling AI Pipelines

  • Time each stage: embedding, ANN search, re-ranking, LLM call, parsing separately
  • cProfile and py-spy for CPU profiling; memory_profiler for RAM
  • Async profiling: pyinstrument supports async code; standard cProfile does not
  • Token count profiling: log input and output tokens per request to find expensive callers
Do: profile before optimizing (sketch below); LLM latency almost always dominates and masks other bottlenecks
Don't: micro-optimize Python code before verifying the LLM call is not the bottleneck
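A minimal sketch of per-stage timing with a context manager; the stage names are illustrative:

    import time
    from contextlib import contextmanager

    timings: dict[str, float] = {}

    @contextmanager
    def timed(stage: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            timings[stage] = time.perf_counter() - start

    # Wrap each pipeline stage separately (bodies elided here)
    with timed("embed"):
        ...  # embed the query
    with timed("retrieve"):
        ...  # ANN search + re-ranking
    with timed("llm_call"):
        ...  # the generation request

    total = sum(timings.values())
    for stage, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{stage:10s} {secs:6.3f}s  {100 * secs / max(total, 1e-9):5.1f}%")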
06

APIs

LLM API Integration Patterns

  • Exponential backoff with jitter on 429 and 5xx responses
  • Circuit breaker: after N failures, open the circuit and fail fast without hammering the provider
  • Retry budget: limit total retry time (e.g., 30 s) to bound end-user latency
  • Provider abstraction layer: wrap multiple LLM APIs behind a common interface for easy switching
Do: implement retry with exponential backoff from day one (sketch below); rate limit errors are routine at scale
Don't: retry indefinitely without a max retry count or total time limit
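A sketch of jittered exponential backoff with a retry budget; TransientAPIError is a stand-in for your SDK's rate-limit (429) and server-error (5xx) exceptions:

    import asyncio
    import random
    import time

    class TransientAPIError(Exception):
        """Stand-in for the SDK's 429/5xx exceptions."""

    async def call_with_backoff(make_request, max_attempts: int = 5, budget_s: float = 30.0):
        """Retry an async request factory with exponential backoff, jitter, and a time budget."""
        start = time.monotonic()
        for attempt in range(max_attempts):
            try:
                return await make_request()
            except TransientAPIError:
                delay = min(2 ** attempt, 8) * random.uniform(0.5, 1.5)  # 1s, 2s, 4s, 8s... with jitter
                out_of_time = time.monotonic() - start + delay > budget_s
                if attempt == max_attempts - 1 or out_of_time:
                    raise  # bound both retry count and total latency
                await asyncio.sleep(delay)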

API Key and Secret Management

  • Never hardcode API keys — use environment variables or a secrets manager (Vault, AWS Secrets Manager)
  • Rotate keys periodically and immediately on suspected compromise
  • Scope API key permissions to minimum required (read-only where possible)
  • Audit API key usage per service and alert on anomalous spend patterns
Do: store all API keys in a secrets manager and inject them at runtime via environment variables (sketch below)
Don't: commit API keys to git, even in private repositories; use pre-commit hooks to scan for secrets
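A small sketch of reading keys injected at runtime, failing fast when one is missing:

    import os

    def get_required_secret(name: str) -> str:
        """Read a secret injected at runtime; fail fast if it is missing."""
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        return value

    # The key never appears in source or in git; it is populated in the
    # runtime environment from a secrets manager during deployment.
    OPENAI_API_KEY = get_required_secret("OPENAI_API_KEY")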