Antino

SaaS / B2B Software · Containerized Dataset Registry At Scale

Ethara.ai

Human-in-the-loop AI data infrastructure platform.

30 weeks operated
3k+ hours of work
10 engineers

The brief

Accelerate development of aligned, safe, and powerful AI by building the highest-quality human-in-the-loop data infrastructure at scale — encompassing annotator training and certification, multi-project annotation execution across multiple AI lab clients, containerized reproducible dataset environments, and the internal tooling needed to manage annotation workforce quality and throughput at industrial scale.

What we built

A full-stack AI data infrastructure platform that serves as the operational backbone for human-in-the-loop model alignment work.

Phase 1 (Oct–Dec 2025): annotator onboarding on the Outlier platform, RLHF/SFT training data generation for Claude Sonnet, TTS model annotation (Guitar Pinstripe/Riff), and UI annotation (Meadow UI).

Phase 2 (Mar–May 2026): an industrial-scale Dockerized dataset registry (5,000+ Docker images across multiple programming languages, pushed to AWS ECR), large-scale agent trajectory generation for Claude and Gemini models (Jaeger/Jaeger 2.0 projects), an ARC-AGI-style reasoning game suite (Arc Agents), VINDEX response quality evaluation, KAIJU repository validation, Leviathan PRD generation for web-building AI agents, and ETP (Ethara Task Platform), an internal workflow tool with rubric-based QC flows.

Throughout: production EKS infrastructure with GPU/CPU Karpenter autoscaling, an Istio service mesh, multi-model AWS Bedrock hosting (Claude, Qwen, Kimi K2.5, MiniMax, GLM), and full GitOps CI/CD via ArgoCD and GitHub Actions.

Live in production

The production AWS EKS cluster with Bedrock-hosted multi-model inference (Claude, Kimi, MiniMax, GLM) is live. More than 5,000 Dockerized dataset environments have been pushed to ECR, and 10,000+ annotation tasks have been delivered to clients. Trajectory datasets for Claude and Gemini LHT benchmarks have been delivered. The ARC-style reasoning game suite (158+ games) has been QC'd and delivered. The ETP (Ethara Task Platform) UI designs are fully delivered for development implementation. Ethara's application stack (Arc Agents, ETP backend/frontend) is deployed on production and staging EKS clusters.
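
The delivery timeline describes the trajectory datasets as structured JSONL. As a rough illustration of what a pre-delivery check on such a file could look like, here is a minimal Python sketch. The field names (`task_id`, `model`, `steps`) are hypothetical placeholders, not the schema used on the project, and the check is deliberately shallow.

```python
import json

# Hypothetical required fields; the real trajectory schema is not given in the source.
REQUIRED_FIELDS = {"task_id", "model", "steps"}


def validate_jsonl(lines):
    """Return a list of (line_number, message) for records that fail basic checks."""
    errors = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((number, "record is not a JSON object"))
            continue
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            errors.append((number, "missing fields: " + ", ".join(sorted(missing))))
    return errors


sample = [
    '{"task_id": "t1", "model": "example-model", "steps": []}',
    '{"task_id": "t2"}',
    "not json",
]
for line_number, message in validate_jsonl(sample):
    print(line_number, message)
```

Because each JSONL record sits on its own line, a check like this can report the exact line of a bad record without loading the whole dataset into memory.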

Delivery timeline

How it was built, phase by phase.

8 workstreams across 30 weeks of operated delivery.

  1. Build · Weeks 1–7 (Oct–Nov 2025)

    RLHF & SFT Model Training / Rubric Engineering

    Deep hands-on work training LLMs via Supervised Fine-Tuning and Reinforcement Learning from Human Feedback.

    Trained annotators and engineers on rubric-based evaluation, SFT with Claude Sonnet, and RLHF preference data generation.

    Claude Sonnet · Python · RLHF frameworks · SFT pipelines · Rubric frameworks
  2. Discover · Weeks 1–4 (Nov 2025)

    Outlier / Third-Party AI Data Platform Onboarding

    Multiple engineers onboarded to Outlier (Scale AI contractor platform), completing structured training sessions on LLM post-training, prompt techniques, SFT, RLHF, and evals.

    Engineers certified on Outlier platform, enabling Ethara to route AI data annotation and model training tasks through an established contractor.

    Outlier platform · LLM evaluation tools · Python
  3. Build · Weeks 4–7 (Nov–Dec 2025)

    AI Data Project Execution (Guitar Pinstripe, Guitar Riff, Meadow UI, Happy Robots)

    Execution of multiple named AI data generation and UI annotation projects. Guitar Pinstripe/Riff focused on TTS (Text-to-Speech) model prompt/response correction and optimization.

    Delivered curated prompt-response pairs for TTS model training and UI annotation data via Figma.

    Figma · Outlier platform · TTS model evaluation tools · Python
  4. Build · Weeks 18–26 (Mar–May 2026)

    Dockerized Dataset Registry & LHT Data Pipeline

    Large-scale Phase 2 effort to build a multi-language Docker image registry for AI training datasets.

    5,000+ Docker images built and pushed to ECR, covering multiple programming language environments for LHT dataset reproducibility at scale.

    Docker · AWS ECR · Python · GitHub · Harness
  5. Build · Weeks 18–26 (Mar–Apr 2026)

    ARC-AGI / Grid-Based Reasoning Game Development (Arc Agents)

    A dedicated engineer (Arshia Parmar) spent 4+ weeks building a suite of grid-based reasoning games resembling ARC-AGI tasks, including symmetry detection, pathfinding, spatial reasoning, and object detection.

    Full suite of ARC-style reasoning games with QC validation, delivered with metadata and scoring infrastructure for AI model evaluation benchmarking.

    Python · BFS algorithms · Rule engines · Scoring frameworks · QC scripts
  6. Test · Weeks 18–27 (Mar–Apr 2026)

    VINDEX / KAIJU Repository Validation & Response Evaluation

    Deeksha Pathak executed two sustained annotation projects: VINDEX (response quality evaluation using rubrics on a 1–6 scale) and KAIJU (repository validation).

    Thousands of response ratings and repository validations delivered.

    Git · Python · Rubric frameworks · Evaluation spreadsheets
  7. Build · Weeks 22–30 (Apr–May 2026)

    Trajectory Generation for Claude & Gemini Models (Jaeger Project)

    Systematic generation of agent trajectories for Claude and Gemini models under the Jaeger and Jaeger 2.0 projects.

    Large-scale trajectory datasets generated for Claude and Gemini across LHT benchmarks, delivered as structured JSONL for downstream model training.

    Claude (Anthropic) · Gemini (Google) · Harness · Docker · Python · AWS ECR
  8. Build · Weeks 26–28 (Apr–May 2026)

    Leviathan Project — AI Website Training Data (PRD Generation)

    Deeksha Pathak worked on generating high-quality training data for AI agents that build award-winning websites by creating detailed Product Requirements Documents (PRDs).

    Structured PRD training corpus created for web-development AI agents, contributing to agentic coding capability training data.

    Python · PRD templates · LLM evaluation tools
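
Workstream 5 lists BFS algorithms and QC scripts among its tools for the pathfinding-style reasoning games. One common way such a QC check can work is to confirm that every generated grid is actually solvable and to record the shortest solution length as metadata. The sketch below is an illustration under stated assumptions, not the project's code: the grid encoding (`#` for walls, `.` for open cells) and the function name are hypothetical.

```python
from collections import deque


def shortest_path_length(grid, start, goal):
    """Return the minimum number of 4-directional moves from start to goal.

    grid is a list of equal-length strings where '#' marks a wall
    (a hypothetical encoding chosen for this sketch).
    Returns None when the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (row, col), distance = queue.popleft()
        if (row, col) == goal:
            return distance
        for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (row + d_row, col + d_col)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and grid[nxt[0]][nxt[1]] != "#" and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None


SAMPLE = [
    "..#.",
    ".##.",
    "....",
]

# A QC script might reject any puzzle for which this returns None.
print(shortest_path_length(SAMPLE, (0, 0), (0, 3)))
```

Breadth-first search visits cells in order of distance from the start, so the first time it reaches the goal the distance is guaranteed to be minimal. That makes the same routine usable both as a solvability gate and as a difficulty signal.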
