Member of Technical Staff, Training Infra Engineer

Cohere · Dubai

Hybrid: Dubai · Full Time · Information Technology
Posted 3 months ago

Job description

Responsibilities

  • Design and implement high-performance, scalable software for large-scale model training.
  • Improve training infrastructure, codebase performance, and orchestration for faster iterations.
  • Build tools and automation to speed training cycles and improve reliability on supercompute resources.
  • Research and prototype infrastructure and data-platform improvements (XLA/MLIR, compilation, I/O).
  • Collaborate closely with research scientists and production engineers to ship state-of-the-art models.
  • Support distributed training stacks (Kubernetes, Slurm, Ray) and debug them at scale.
  • Maintain and document training pipelines, benchmarks, and operational runbooks.

Requirements

  • Strong software engineering
  • Python proficiency
  • JAX / PyTorch
  • XLA/MLIR experience
  • Distributed training
  • Kubernetes / Slurm
  • Ray experience
  • Large-scale training
  • Performance tuning
  • Systems debugging

Preferred Qualifications

  • Experience training large language models at scale
  • Contributions to training tooling or infrastructure
  • Publications in top ML/Systems venues (NeurIPS, ICLR, MLSys, etc.)
  • Background in compiler/runtime optimization for ML
  • Familiarity with supercompute and GPU/TPU fleets
  • Experience bridging research and production systems

Benefits

  • Competitive health and dental coverage
  • Family medical insurance
  • Generous paid leave, including an annual leave allowance
  • Annual flight / ticket allowance
  • Remote-flexible / hybrid working model with office presence in Dubai
  • Parental leave top-up and personal enrichment stipends

About the Company

Cohere builds and ships frontier AI models and infrastructure to scale intelligence for developers and enterprises. We combine world-class research and engineering to power applications like content generation, semantic search, RAG, and agents. The team operates with a high compute-to-engineer ratio and encourages engineers to contribute across research and production. This opening is based in Dubai, UAE (hybrid / remote-friendly) and is ideal for engineers who enjoy working at the intersection of large-scale ML training, tooling, and systems engineering.

Skills & tools

Python, JAX, PyTorch, XLA, MLIR, Kubernetes, Slurm, Ray, Distributed Training, ML Infrastructure, Supercompute, Model Training, Research Engineering, Data Infrastructure

What the team is looking for

Use this list as a quick fit check before you apply.

  1. Python
  2. JAX / PyTorch
  3. XLA/MLIR
  4. Distributed Training
  5. Kubernetes
  6. Slurm
  7. Ray
  8. Performance Tuning
  9. Systems Debugging
  10. Software Engineering