Cohere

Member of Technical Staff, Training Infra Engineer

Posted 2 days ago

Employment Type

Full Time

Location

Dubai

Requirements

Python, JAX / PyTorch, XLA/MLIR, Distributed Training, Kubernetes, Slurm, Ray, Performance Tuning, Systems Debugging, Software Engineering

Job Description

Responsibilities

  • Design and implement high-performance, scalable software for large-scale model training.
  • Improve training infrastructure, codebase performance, and orchestration to enable faster iteration.
  • Build tools and automation to speed training cycles and improve reliability on supercompute resources.
  • Research and prototype infrastructure and data-platform improvements (XLA/MLIR, compilation, I/O).
  • Collaborate closely with research scientists and production engineers to ship state-of-the-art models.
  • Support distributed training stacks (Kubernetes, Slurm, Ray) and debug issues at scale.
  • Maintain and document training pipelines, benchmarks, and operational runbooks.

Requirements

  • Strong software engineering skills
  • Proficiency in Python
  • Experience with JAX / PyTorch
  • Experience with XLA/MLIR
  • Experience with distributed training
  • Experience with Kubernetes / Slurm
  • Experience with Ray
  • Experience with large-scale training
  • Performance tuning skills
  • Systems debugging skills

Preferred Qualifications

  • Experience training large language models at scale
  • Contributions to training tooling or infrastructure
  • Publications in top ML/Systems venues (NeurIPS, ICLR, MLSys, etc.)
  • Background in compiler/runtime optimization for ML
  • Familiarity with supercompute and GPU/TPU fleets
  • Experience bridging research and production systems

Benefits

  • Competitive health and dental coverage
  • Family medical insurance
  • Generous paid leave and annual leave allowance
  • Annual flight / ticket allowance
  • Remote-flexible / hybrid working model with office presence in Dubai
  • Parental leave top-up and personal enrichment stipends

About the Company

Cohere builds and ships frontier AI models and infrastructure to scale intelligence for developers and enterprises. We combine world-class research and engineering to power applications such as content generation, semantic search, retrieval-augmented generation (RAG), and agents. The team operates with a high compute-to-engineer ratio and encourages engineers to contribute across research and production. This role is based in Dubai, UAE (hybrid / remote-friendly) and suits engineers who enjoy working at the intersection of large-scale ML training, tooling, and systems engineering.
