DevOps / Platform Engineer (Fintech + AI Infrastructure)

OnHires · Dubai

Completely Remote · Full Time · Information Technology

Posted 4 months ago

This role is no longer accepting applications.

Job description

Responsibilities

GPU Infrastructure: Deploy and maintain high-performance GPU clusters.
AI Lifecycle: Manage the full lifecycle of AI services, including inference deployment (Triton, vLLM, custom services), autoscaling, and seamless rollout/rollback strategies.
Data Management: Manage model storage, artifact versioning, caching, and high-speed data access via S3-compatible storage.
Observability: Monitor performance metrics including latency, throughput, error budgets, resource limits, and cost/performance ratios.
High Availability: Ensure fault tolerance for payment services (SLA/SLO management, redundancy, Disaster Recovery planning, and regular recovery testing).
Fintech-Grade Security: Implement secrets management, HSM/managed KMS integration, infrastructure hardening, and audit logging.
Secure CI/CD: Build secure pipelines featuring artifact signing, vulnerability scanning, policy gates, and isolated environments.
Node Operations: Deploy and maintain crypto nodes (Full, Archive, RPC) across various networks.
Automation: Automate node updates, synchronization monitoring, and health checks.
Storage & Performance: Manage disk I/O (IOPS/RAID), protect RPC endpoints, and manage access controls.
Metrics: Monitor for sync lags, chain forks, and consensus issues.

Requirements

5+ years in DevOps, SRE, or Platform Engineering (Fintech experience is mandatory)
Deep expertise in Linux, networking (TCP/IP, DNS, TLS, routing), and complex troubleshooting
Production experience with K8s, Helm, Ingress, autoscaling, network policies, and resource management
Proficiency in GitHub Actions, GitLab CI, or Jenkins
Hands-on experience with Prometheus + Grafana, logging (Loki/ELK), and tracing (OpenTelemetry/Jaeger)
Experience with GPU clusters and ML stacks (NVIDIA drivers, CUDA, MIG, GPU monitoring)
Production-level operation of Postgres, Redis, Kafka, or RabbitMQ
Practical knowledge of Vault, KMS, RBAC, OPA/Gatekeeper/Kyverno, Trivy, and SBOM

About the Company

The company is a fintech innovator operating a proprietary Payment Service Provider (PSP) platform, advanced AI infrastructure (including on-prem GPU/bare-metal servers), and a dedicated crypto division focused on node infrastructure. It runs a multi-cloud environment (AWS/Hetzner/DigitalOcean) and is looking for a seasoned engineer to build and maintain a resilient, secure, and scalable platform that powers production payments and high-performance AI services.

Skills & tools

Linux, Networking, Kubernetes, CI/CD, Prometheus, Grafana, Triton, vLLM, Vault, AWS, Hetzner, DigitalOcean

What the team is looking for

Use this list as a quick fit check before you apply.

01 Linux
02 Networking
03 Kubernetes
04 CI/CD
05 Observability
06 AI Infrastructure
07 Security

Job details

Work model: Completely Remote
Commitment: Full Time
Category: Information Technology
Posted: 4 months ago
