Member of Technical Staff, Training Infra Engineer
Cohere
Posted 2 days ago
Employment Type
Full Time
Location
Dubai
Required Skills
Python, JAX / PyTorch, XLA/MLIR, Distributed Training, Kubernetes, Slurm, Ray, Performance Tuning, Systems Debugging, Software Engineering
Job Description
Responsibilities
- Design and implement high-performance, scalable software for large-scale model training.
- Improve training infrastructure, codebase performance, and orchestration for faster iterations.
- Build tools and automation to speed up training cycles and improve reliability on supercompute resources.
- Research and prototype infrastructure and data-platform improvements (XLA/MLIR, compilation, I/O).
- Collaborate closely with research scientists and production engineers to ship state-of-the-art models.
- Support distributed training stacks (Kubernetes, Slurm, Ray) and debug issues at scale.
- Maintain and document training pipelines, benchmarks, and operational runbooks.
Requirements
- Strong software engineering skills
- Proficiency in Python
- Experience with JAX / PyTorch
- Experience with XLA/MLIR
- Experience with distributed training
- Experience with Kubernetes / Slurm
- Experience with Ray
- Experience with large-scale training
- Performance tuning skills
- Systems debugging skills
Preferred Qualifications
- Experience training large language models at scale
- Contributions to training tooling or infrastructure
- Publications in top ML/Systems venues (NeurIPS, ICLR, MLSys, etc.)
- Background in compiler/runtime optimization for ML
- Familiarity with supercompute and GPU/TPU fleets
- Experience bridging research and production systems
Benefits
- Competitive health and dental coverage
- Family medical insurance
- Generous paid leave and annual leave allowance
- Annual flight / ticket allowance
- Remote-flexible / hybrid working model with office presence in Dubai
- Parental leave top-up and personal enrichment stipends
About the Company
Cohere builds and ships frontier AI models and infrastructure to scale intelligence for developers and enterprises. We combine world-class research and engineering to power applications like content generation, semantic search, RAG, and agents. The team operates with a high compute-to-engineer ratio and encourages engineers to contribute across research and production. This opening is based in Dubai, UAE (hybrid / remote-friendly) and is ideal for engineers who enjoy working at the intersection of large-scale ML training, tooling, and systems engineering.