
DevOps / Platform Engineer (Fintech + AI Infrastructure)
OnHires · Dubai
Completely Remote · Full Time · Information Technology
Posted 1 month ago
Job description
Responsibilities
- GPU Infrastructure: Deploy and maintain high-performance GPU clusters.
- AI Lifecycle: Manage the full lifecycle of AI services, including inference deployment (Triton, vLLM, custom services), autoscaling, and seamless rollout/rollback strategies.
- Data Management: Manage model storage, artifact versioning, caching, and high-speed data access via S3-compatible storage.
- Observability: Monitor performance metrics including latency, throughput, error budgets, resource limits, and cost/performance ratios.
- High Availability: Ensure fault tolerance for payment services (SLA/SLO management, redundancy, Disaster Recovery planning, and regular recovery testing).
- Fintech-Grade Security: Implement secrets management, HSM/managed KMS integration, infrastructure hardening, and audit logging.
- Secure CI/CD: Build secure pipelines featuring artifact signing, vulnerability scanning, policy gates, and isolated environments.
- Node Operations: Deploy and maintain crypto nodes (Full, Archive, RPC) across various networks.
- Automation: Automate node updates, synchronization monitoring, and health checks.
- Storage & Performance: Manage disk I/O (IOPS/RAID), protect RPC endpoints, and manage access controls.
- Metrics: Monitor for sync lag, chain forks, and consensus issues.
Requirements
- 5+ years in DevOps, SRE, or Platform Engineering (Fintech experience is mandatory)
- Deep expertise in Linux, networking (TCP/IP, DNS, TLS, routing), and complex troubleshooting
- Production experience with K8s, Helm, Ingress, autoscaling, network policies, and resource management
- Proficiency in GitHub Actions, GitLab CI, or Jenkins
- Hands-on experience with Prometheus + Grafana, logging (Loki/ELK), and tracing (OpenTelemetry/Jaeger)
- Experience with GPU clusters and ML stacks (NVIDIA drivers, CUDA, MIG, GPU monitoring)
- Production-level operation of Postgres, Redis, Kafka, or RabbitMQ
- Practical knowledge of Vault, KMS, RBAC, OPA/Gatekeeper/Kyverno, Trivy, and SBOM
About the Company
The company is a fintech innovator operating a proprietary Payment Service Provider (PSP) platform, advanced AI infrastructure (including on-prem GPU/bare-metal servers), and a dedicated crypto division focused on node infrastructure. It runs a multi-cloud environment (AWS/Hetzner/DigitalOcean) and is looking for a seasoned engineer to build and maintain a resilient, secure, and scalable platform that powers production payments and high-performance AI services.
Skills & tools
Linux, Networking, Kubernetes, CI/CD, Prometheus, Grafana, Triton, vLLM, Vault, AWS, Hetzner, DigitalOcean
What the team is looking for
Use this list as a quick fit check before you apply.
- Linux
- Networking
- Kubernetes
- CI/CD
- Observability
- AI Infrastructure
- Security
