We are Bagel Labs, a distributed machine learning research lab working towards open-source superintelligence.
We ignore years of experience and pedigree. If you have high agency, meaning your default assumption is that you can control the outcome of whatever situation you are in, we want to hear from you. Every requirement below is flexible for a candidate with high enough agency and tolerance for ambiguity.
Role Overview
You will design, build, and relentlessly optimize the infrastructure that trains and serves large diffusion models. Your job is to make GPUs go faster, make clusters behave, and make training and inference scale across multiple nodes, regions, and hardware types without turning into a reliability tax.
This role sits at the intersection of systems engineering, performance engineering, and research enablement. You will touch kernels, networking, orchestration, compilers, and model code when needed.
Key Responsibilities
- Build and operate distributed training stacks for diffusion models (U-Net, DiT, video diffusion, world-model variants) across multi-node GPU clusters.
- Implement and tune parallelism strategies for training and inference, including data parallel, tensor parallel, pipeline parallel, ZeRO/FSDP-style sharding, expert parallel, and diffusion-specific tricks (timestep-level scheduling, CFG parallelism, microbatching).
- Profile end-to-end GPU performance and remove bottlenecks across kernels, memory, comms, and I/O (CUDA graphs, kernel fusion, attention kernels, NCCL tuning, overlap of compute and comms).
- Own inference serving for diffusion workloads with high throughput and predictable latency, including dynamic batching, variable-resolution handling, caching, prefill/conditioning optimization, and multi-GPU execution.
- Design robust orchestration for heterogeneous and preemptible environments (on-prem, bare metal, cloud, spot), including checkpointing, resumability, and fault tolerance.
- Build observability that is actually useful for diffusion: step-time breakdowns, denoising throughput, VRAM headroom, NCCL health, queueing, tail latency, error budgets, and cost per sample.
- Implement pragmatic quantization and precision strategies for diffusion inference and training, balancing quality, speed, and stability (BF16/FP16/TF32/FP8, weight-only INT8/INT4 where it makes sense, selective quantization of submodules).
- Improve developer velocity through reproducible environments, CI for performance regressions, and automation for cluster bring-up and rollouts.
- Write clear internal docs and occasional public technical deep-dives on blog.bagel.com when doing so helps the community and hiring.
Who You Might Be
You are the person teammates call when GPUs underperform, distributed training deadlocks, or a “simple” deployment turns into a week of whack-a-mole. You like the ugly truth in traces and profiler timelines. You can move between high-level architecture and low-level debugging without getting lost.
You probably have scars from at least a few of these:
- chasing down NCCL hangs, stragglers, and clock drift
- fixing memory fragmentation and OOMs that should not happen
- turning a 2x slowdown into a 10 percent regression by changing one flag, then learning why
- shipping a system that stays up while people are actively trying to break it
Required Skills (flexible)
- Strong Linux fundamentals, networking basics, and the ability to debug production incidents without panic.
- Deep GPU performance instincts: profiling, memory behavior, kernel-level thinking, and practical CUDA tooling literacy (even if you are not writing CUDA daily).
- Hands-on experience scaling training and/or inference across multiple GPUs and nodes.
- Comfort implementing parallelism and sharding in modern frameworks (PyTorch, NCCL, torch.distributed, FSDP/ZeRO-style systems, or equivalent).
- Experience building reliable deployment pipelines (containers, rollouts, versioning, rollback, secrets, config management).
- The ability to read model code and change it when infrastructure and performance require it.
Bonus Skills
- Contributions to open-source performance or distributed systems projects (PyTorch internals, Triton kernels, xFormers/FlashAttention, NCCL tooling, Ray, Kubernetes operators, etc.).
- Experience with diffusion-specific serving and optimization (Diffusers, ComfyUI, custom schedulers/solvers, distillation, few-step generation, VAE decode optimization, tiled generation).
- TensorRT or compiler experience (torch.compile/Inductor, XLA, CUDA graphs), and a habit of measuring instead of guessing.
- Experience building multi-tenant GPU platforms with isolation, fair scheduling, and predictable QoS.
- Comfort with cost engineering: understanding where dollars burn in GPU clusters and how to reduce spend without adding fragility.
What We Offer
- Top-of-market compensation.
- A deeply technical culture where bold frontier ideas are debated, stress-tested, and built.
- High autonomy and direct ownership of critical systems.
- In-person role at our Toronto office.
- Work that can set the direction for decentralized AI.
- Paid travel opportunities to the top ML conferences around the world.