DevOps / Platform Engineer (Fintech + AI Infrastructure)

OnHires · Dubai

Completely Remote · Full Time · Information Technology
Posted 1 month ago

Job description

Responsibilities

  • GPU Infrastructure: Deploy and maintain high-performance GPU clusters.
  • AI Lifecycle: Manage the full lifecycle of AI services, including inference deployment (Triton, vLLM, custom services), autoscaling, and rollout/rollback strategies.
  • Data Management: Manage model storage, artifact versioning, caching, and high-speed data access via S3-compatible storage.
  • Observability: Monitor performance metrics including latency, throughput, error budgets, resource limits, and cost/performance ratios.
  • High Availability: Ensure fault tolerance for payment services (SLA/SLO management, redundancy, Disaster Recovery planning, and regular recovery testing).
  • Fintech-Grade Security: Implement secrets management, HSM/managed KMS integration, infrastructure hardening, and audit logging.
  • Secure CI/CD: Build secure pipelines featuring artifact signing, vulnerability scanning, policy gates, and isolated environments.
  • Node Operations: Deploy and maintain crypto nodes (Full, Archive, RPC) across various networks.
  • Automation: Automate node updates, synchronization monitoring, and health checks.
  • Storage & Performance: Manage disk I/O (IOPS/RAID), protect RPC endpoints, and manage access controls.
  • Metrics: Monitor for sync lag, chain forks, and consensus issues.

Requirements

  • 5+ years in DevOps, SRE, or Platform Engineering (Fintech experience is mandatory)
  • Deep expertise in Linux, networking (TCP/IP, DNS, TLS, routing), and complex troubleshooting
  • Production experience with K8s, Helm, Ingress, autoscaling, network policies, and resource management
  • Proficiency in GitHub Actions, GitLab CI, or Jenkins
  • Hands-on experience with Prometheus + Grafana, logging (Loki/ELK), and tracing (OpenTelemetry/Jaeger)
  • Experience with GPU clusters and ML stacks (NVIDIA drivers, CUDA, MIG, GPU monitoring)
  • Production-level operation of Postgres, Redis, Kafka, or RabbitMQ
  • Practical knowledge of Vault, KMS, RBAC, OPA/Gatekeeper/Kyverno, Trivy, and SBOM

About the Company

The company is a fintech innovator operating a proprietary Payment Service Provider (PSP) platform, advanced AI infrastructure (including on-prem GPU/bare-metal servers), and a dedicated crypto division focused on node infrastructure. They operate a multi-cloud environment (AWS/Hetzner/DigitalOcean) and are looking for a seasoned Engineer to build and maintain a resilient, secure, and scalable platform that powers production payments and high-performance AI services.

Skills & tools

Linux · Networking · Kubernetes · CI/CD · Prometheus · Grafana · Triton · vLLM · Vault · AWS · Hetzner · DigitalOcean

What the team is looking for

Use this list as a quick fit check before you apply.

  1. Linux
  2. Networking
  3. Kubernetes
  4. CI/CD
  5. Observability
  6. AI Infrastructure
  7. Security