Computer Vision API

An image-in, JSON-out service I actually trust.

I’d been writing notebooks for a while and wanted to feel what it’s like to put a model behind something other people can actually call. The model architecture wasn’t the interesting part. The interesting part was everything around it: preprocessing that doesn’t blow up on weird image dimensions, GPU memory that doesn’t leak by request 200, a worker pool that handles cold starts gracefully.

TorchScript and quantization gave me most of the latency win. Dynamic batching gave me the rest. None of these are exotic; they’re all in the docs. They’re just the kind of thing you don’t bother with until you’ve actually shipped something and watched the latency tail. There’s a moment in every ML project where you realize the difference between “fast on my benchmark” and “fast on a Tuesday afternoon under real load,” and it’s humbling.
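For readers who haven’t met dynamic batching, here is a minimal sketch of the idea, not the code from this service. It uses only asyncio. Requests wait in a queue, and a worker flushes a batch when it is full or when a short deadline passes. The names DynamicBatcher, run_batch, max_batch_size, and max_wait_s are all hypothetical, and the "model" is a stand-in function that doubles its inputs.

```python
import asyncio


class DynamicBatcher:
    """Groups single requests into batches bounded by size and wait time."""

    def __init__(self, run_batch, max_batch_size=8, max_wait_s=0.01):
        # run_batch is a hypothetical callable: list of inputs -> list of outputs.
        self._run_batch = run_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue = asyncio.Queue()

    async def submit(self, item):
        """Enqueue one request and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def run(self):
        """Worker loop: flush a batch when it is full or the deadline passes."""
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives, then start the clock.
            batch = [await self._queue.get()]
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(
                        await asyncio.wait_for(self._queue.get(), remaining)
                    )
                except asyncio.TimeoutError:
                    break
            items = [item for item, _ in batch]
            try:
                # A real model call would block the event loop here; in a
                # service it would normally be pushed to a thread or process.
                results = list(self._run_batch(items))
                if len(results) != len(items):
                    raise ValueError("run_batch returned the wrong number of results")
            except Exception as exc:
                # Fail every caller in the batch rather than leave them hanging.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), result in zip(batch, results):
                if not future.done():
                    future.set_result(result)


async def demo():
    """Send six requests through a batcher capped at four per batch."""
    batch_sizes = []

    def fake_model(batch):
        batch_sizes.append(len(batch))
        return [x * 2 for x in batch]

    batcher = DynamicBatcher(fake_model, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return results, batch_sizes
```

The two knobs pull against each other: a bigger max_batch_size improves throughput, while a longer max_wait_s adds latency to every request that arrives during a quiet period. That tradeoff is why the tail under real load matters more than the benchmark number.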

FastAPI was a great default. Pydantic validation alone caught a depressing number of bugs that would otherwise have made it to the model: wrong content types, malformed base64, requests with the wrong field name. Treating the API surface as a strict contract before anything reaches the model was the single best decision I made.
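To show what "strict contract" means in practice, here is a small sketch of those three checks written with the standard library only, so it runs without installing anything. In the real service this job belonged to Pydantic models. The field names content_type and image_b64, the allowed content types, and the names ImageRequest, ContractError, and parse_request are assumptions made up for illustration.

```python
import base64
import binascii
from dataclasses import dataclass

# Assumed for illustration; the real service's accepted types are not stated.
ALLOWED_CONTENT_TYPES = {"image/jpeg", "image/png"}
EXPECTED_FIELDS = {"content_type", "image_b64"}


class ContractError(ValueError):
    """Raised when a request does not match the API contract."""


@dataclass(frozen=True)
class ImageRequest:
    content_type: str
    image_bytes: bytes


def parse_request(payload: dict) -> ImageRequest:
    """Validate a raw JSON payload before anything reaches the model."""
    # Reject unknown and missing fields so a typo fails loudly.
    unknown = set(payload) - EXPECTED_FIELDS
    missing = EXPECTED_FIELDS - set(payload)
    if unknown or missing:
        raise ContractError(
            f"unexpected fields {sorted(unknown)}, missing fields {sorted(missing)}"
        )

    content_type = payload["content_type"]
    if not isinstance(content_type, str) or content_type not in ALLOWED_CONTENT_TYPES:
        raise ContractError(f"unsupported content type: {content_type!r}")

    image_b64 = payload["image_b64"]
    if not isinstance(image_b64, str):
        raise ContractError("image_b64 must be a string")
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        image_bytes = base64.b64decode(image_b64, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ContractError("image_b64 is not valid base64") from exc
    if not image_bytes:
        raise ContractError("image_b64 decodes to an empty payload")

    return ImageRequest(content_type=content_type, image_bytes=image_bytes)
```

A request that uses "image" instead of "image_b64" fails at the first check with a message naming both the unexpected and the missing field, which is much easier to debug than a shape error from inside the model.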

The part I underestimated was monitoring. A traditional API you can monitor with status codes. A model API degrades silently: wrong labels, weird confidences, drift you can’t see in p95. I’d invest more in that next time, and earlier. Logging the full distribution of confidences over time turned out to be more useful than any single accuracy number.
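One way to turn "log the distribution of confidences" into something you can alert on is sketched below. It bins confidences into a histogram and compares a recent window against a baseline with the population stability index (PSI). PSI is my choice for this example, not necessarily what the service used, and the function names, bin count, and smoothing constant are all assumptions.

```python
import math


def confidence_histogram(confidences, bins=10):
    """Return the fraction of confidences falling in each equal-width bin of [0, 1]."""
    counts = [0] * bins
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence out of range: {c!r}")
        # A confidence of exactly 1.0 goes in the last bin.
        counts[min(int(c * bins), bins - 1)] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no confidences to summarize")
    return [n / total for n in counts]


def population_stability_index(baseline, current, eps=1e-6):
    """PSI between two histograms: sum over bins of (cur - base) * ln(cur / base)."""
    if len(baseline) != len(current):
        raise ValueError("histograms must have the same number of bins")
    psi = 0.0
    for base, cur in zip(baseline, current):
        # Clamp empty bins to a small value so the logarithm stays defined.
        base = max(base, eps)
        cur = max(cur, eps)
        psi += (cur - base) * math.log(cur / base)
    return psi
```

In use, the baseline histogram would come from a period you trust (a validation set, say, or the first week after launch) and the current one from a rolling window of production traffic. PSI is zero when the two match and grows as they diverge. The alert threshold is something to tune against your own traffic rather than copy from a rule of thumb.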

The thing I’d do differently is build with ONNX from the start instead of native PyTorch in production. Decoupling the training framework from the serving framework makes it much easier to change either one without rebuilding everything around it. PyTorch is fantastic for research; for serving you want something that doesn’t change its mind every few releases.