Member of Technical Staff, Training Infra Engineer

Cohere · Dubai

Hybrid: Dubai · Full Time · Information Technology

Posted 6 months ago

This role is no longer accepting applications.

Job description

Responsibilities

Design and implement high-performance, scalable software for large-scale model training.
Improve training infrastructure, codebase performance, and orchestration for faster iterations.
Build tools and automation to speed up training cycles and improve reliability on supercompute resources.
Research and prototype infrastructure and data-platform improvements (XLA/MLIR, compilation, I/O).
Collaborate closely with research scientists and production engineers to ship state-of-the-art models.
Support distributed training stacks (Kubernetes, Slurm, Ray) and debug issues at scale.
Maintain and document training pipelines, benchmarks, and operational runbooks.

Requirements

Strong software engineering skills
Python proficiency
JAX / PyTorch experience
XLA/MLIR experience
Distributed training experience
Kubernetes / Slurm experience
Ray experience
Large-scale training experience
Performance tuning
Systems debugging

Preferred Qualifications

Experience training large language models at scale
Contributions to training tooling or infrastructure
Publications in top ML/Systems venues (NeurIPS, ICLR, MLSys, etc.)
Background in compiler/runtime optimization for ML
Familiarity with supercompute and GPU/TPU fleets
Experience bridging research and production systems

Benefits

Competitive health and dental coverage
Family medical insurance
Generous paid leave and annual leave allowance
Annual flight / ticket allowance
Remote-flexible / hybrid working model with office presence in Dubai
Parental leave top-up and personal enrichment stipends

About the Company

Cohere builds and ships frontier AI models and infrastructure to scale intelligence for developers and enterprises. We combine world-class research and engineering to power applications like content generation, semantic search, RAG, and agents. The team operates with a high compute-to-engineer ratio and encourages engineers to contribute across research and production. This opening is based in Dubai, UAE (hybrid / remote-friendly) and is ideal for engineers who enjoy working at the intersection of large-scale ML training, tooling, and systems engineering.

Skills & tools

Python · JAX · PyTorch · XLA · MLIR · Kubernetes · Slurm · Ray · Distributed Training · ML Infrastructure · Supercompute · Model Training · Research Engineering · Data Infrastructure

What the team is looking for

Use this list as a quick fit check before you apply.

01 Python
02 JAX / PyTorch
03 XLA/MLIR
04 Distributed Training
05 Kubernetes
06 Slurm
07 Ray
08 Performance Tuning
09 Systems Debugging
10 Software Engineering

Job details

Work model: Hybrid: Dubai
Commitment: Full Time
Category: Information Technology
Posted: 6 months ago
