Senior Software Engineer, Data Processing

Protege

Completely Remote | Full Time | Engineering & Architecture
Posted Today

Job description

Responsibilities

  • Design, build, and operate ingestion systems that process large volumes of multimodal data into structured, AI-ready datasets
  • Own the end-to-end ingestion path, including data validation, processing, tracking, and downstream availability
  • Build modality-specific processing steps for imaging, audio, video, and other unstructured data formats
  • Develop parsers, validators, and normalization logic to handle messy and high-variance source data
  • Optimize systems for high throughput, reliability, and cost-efficiency using distributed and parallel compute
  • Implement rigorous data quality checks and security protocols to handle sensitive and regulated data (e.g., PHI)
  • Partner with Product and Data Lab teams to standardize reusable processing patterns and internal tooling

Requirements

  • 5+ years of experience building and operating production backend or data systems
  • Proven experience designing and running large-scale data pipelines
  • Strong programming skills in Python
  • Experience with distributed data processing
  • Strong proficiency with AWS
  • Ability to thrive in high-ambiguity environments with messy, high-volume data

Preferred Qualifications

  • Experience processing specific modalities such as medical imaging (DICOM), text, audio, or video
  • Background working in regulated data environments (HIPAA, healthcare compliance, PHI)
  • Experience with workflow orchestration tools like Airflow or Dagster
  • Experience with GCP or Azure
  • Prior experience as an early engineer in a startup environment
  • Familiarity with ML, NLP, or LLM-based systems, including embeddings and fine-tuning

About the Company

Protege is building a platform to solve AI's biggest unmet need: access to high-quality training data. We facilitate the secure, efficient, and privacy-centric exchange of AI training data, connecting organizations with high-value data to the AI builders who need it. We are a lean, fast-moving team of builders obsessed with velocity and impact.

Skills & tools

Python, AWS, Data Pipelines

What the team is looking for

Use this list as a quick fit check before you apply.

  1. 5+ years building production backend or data systems
  2. Experience designing large-scale data pipelines
  3. Strong Python programming skills
  4. Experience with distributed data processing
  5. Proficiency with AWS