SaaS / B2B Software · Containerized Dataset Registry At Scale
Ethara.ai
Human-in-the-loop AI data infrastructure platform.
The brief
Accelerate development of aligned, safe, and powerful AI by building the highest-quality human-in-the-loop data infrastructure at scale — encompassing annotator training and certification, multi-project annotation execution across multiple AI lab clients, containerized reproducible dataset environments, and the internal tooling needed to manage annotation workforce quality and throughput at industrial scale.
What we built
A full-stack AI data infrastructure platform serving as the operational backbone for human-in-the-loop model alignment work.
Phase 1 (Oct–Dec 2025): annotator onboarding on the Outlier platform, RLHF/SFT training data generation for Claude Sonnet, TTS model annotation (Guitar Pinstripe/Riff), and UI annotation (Meadow UI).
Phase 2 (Mar–May 2026): an industrial-scale Dockerized dataset registry (5,000+ Docker images across multiple programming languages pushed to AWS ECR), large-scale agent trajectory generation for Claude and Gemini models (Jaeger/Jaeger 2.0 projects), an ARC-AGI-style reasoning game suite (Arc Agents), VINDEX response quality evaluation, KAIJU repository validation, Leviathan PRD generation for web-building AI agents, and ETP (Ethara Task Platform), an internal workflow tool with rubric-based QC flows.
Throughout: production EKS infrastructure with GPU/CPU Karpenter autoscaling, an Istio service mesh, multi-model AWS Bedrock hosting (Claude, Qwen, Kimi K2.5, MiniMax, GLM), and full GitOps CI/CD via ArgoCD and GitHub Actions.
The production AWS EKS cluster with Bedrock-hosted multi-model inference (Claude, Kimi, MiniMax, GLM) is live. 5,000+ Dockerized dataset environments have been pushed to ECR, and 10,000+ annotation tasks have been delivered to clients. Trajectory datasets for Claude and Gemini LHT benchmarks have been delivered. The ARC-style reasoning game suite (158+ games) has been QC'd and delivered. ETP UI designs have been fully delivered for development implementation. Ethara's application stack (Arc Agents, ETP backend/frontend) is deployed on production and staging EKS clusters.
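To make the registry layout a little more concrete, here is a minimal sketch of how per-language dataset images could be addressed in ECR. The only fixed part is the standard ECR reference format (`<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`). The `datasets/<language>` repository layout, the `<task_id>-r<revision>` tag scheme, and the function names are illustrative assumptions, not Ethara's actual convention.

```python
import re

# Docker tag rule: letters, digits, '_', '.', '-'; must not start with '.' or '-'; max 128 chars.
_TAG_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Compose an image reference in the standard ECR registry format."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are 12 digits")
    if not _TAG_RE.fullmatch(tag):
        raise ValueError(f"invalid image tag: {tag!r}")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def dataset_image_uri(account_id: str, region: str, language: str,
                      task_id: str, revision: int) -> str:
    """Hypothetical layout: one repository per language, one tag per task revision."""
    repository = f"datasets/{language.lower()}"  # assumed naming, for illustration only
    tag = f"{task_id}-r{revision}"               # assumed tag scheme, for illustration only
    return ecr_image_uri(account_id, region, repository, tag)


if __name__ == "__main__":
    # Placeholder account ID and task ID, not real values.
    print(dataset_image_uri("123456789012", "us-east-1", "Python", "task-0001", 3))
```

Encoding the revision in the tag is one way to keep an environment reproducible: a rebuilt image gets a new tag instead of silently replacing the one a dataset already points at.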
Delivery timeline
How it was built, phase by phase.
8 workstreams across 30 weeks of operated delivery.
- build · Week 1–7 (Oct–Nov 2025)
RLHF & SFT Model Training / Rubric Engineering
Deep hands-on work training LLMs via Supervised Fine-Tuning and Reinforcement Learning from Human Feedback.
Trained annotators and engineers on rubric-based evaluation, SFT fine-tuning with Claude Sonnet, and RLHF preference data generation.
Claude Sonnet · Python · RLHF frameworks · SFT pipelines · Rubric frameworks
- discover · Week 1–4 (Nov 2025)
Outlier / Third-Party AI Data Platform Onboarding
Multiple engineers onboarded to Outlier (Scale AI contractor platform), completing structured training sessions on LLM post-training, prompt techniques, SFT, RLHF, and evals.
Engineers certified on the Outlier platform, enabling Ethara to route AI data annotation and model training tasks through an established contractor.
Outlier platform · LLM evaluation tools · Python
- build · Week 4–7 (Nov–Dec 2025)
AI Data Project Execution (Guitar Pinstripe, Guitar Riff, Meadow UI, Happy Robots)
Execution of multiple named AI data generation and UI annotation projects. Guitar Pinstripe/Riff focused on TTS (Text-to-Speech) model prompt/response correction and optimization.
Delivered curated prompt-response pairs for TTS model training and UI annotation data via Figma.
Figma · Outlier platform · TTS model evaluation tools · Python
- build · Week 18–26 (Mar–May 2026)
Dockerized Dataset Registry & LHT Data Pipeline
Large-scale Phase 2 effort to build a multi-language Docker image registry for AI training datasets.
5,000+ Docker images built and pushed to ECR, covering multiple programming language environments for LHT dataset reproducibility at scale.
Docker · AWS ECR · Python · GitHub · Harness
- build · Week 18–26 (Mar–Apr 2026)
ARC-AGI / Grid-Based Reasoning Game Development (Arc Agents)
A dedicated engineer (Arshia Parmar) spent 4+ weeks building a suite of grid-based reasoning games resembling ARC-AGI tasks, including symmetry detection, pathfinding, spatial reasoning, and object detection.
Full suite of ARC-style reasoning games with QC validation, delivered with metadata and scoring infrastructure for AI model evaluation benchmarking.
Python · BFS algorithms · Rule engines · Scoring frameworks · QC scripts
- test · Week 18–27 (Mar–Apr 2026)
VINDEX / KAIJU Repository Validation & Response Evaluation
Deeksha Pathak executed two sustained annotation projects: VINDEX (response quality evaluation using rubrics on a 1–6 scale) and KAIJU (repository validation).
Thousands of response ratings and repository validations delivered.
Git · Python · Rubric frameworks · Evaluation spreadsheets
- build · Week 22–30 (Apr–May 2026)
Trajectory Generation for Claude & Gemini Models (Jaeger Project)
Systematic generation of agent trajectories for Claude and Gemini models under the Jaeger and Jaeger 2.0 projects.
Large-scale trajectory datasets generated for Claude and Gemini across LHT benchmarks, delivered as structured JSONL for downstream model training.
Claude (Anthropic) · Gemini (Google) · Harness · Docker · Python · AWS ECR
- build · Week 26–28 (Apr–May 2026)
Leviathan Project — AI Website Training Data (PRD Generation)
Deeksha Pathak worked on generating high-quality training data for AI agents that build award-winning websites by creating detailed Product Requirements Documents (PRDs).
Structured PRD training corpus created for web-development AI agents, contributing to agentic coding capability training data.
Python · PRD templates · LLM evaluation tools
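The Arc Agents workstream above lists BFS algorithms and QC scripts in its stack. As a rough illustration of how the two could fit together, the sketch below uses breadth-first search to check that a pathfinding grid is solvable and to report the length of a shortest path. The grid encoding (0 = open cell, 1 = wall), the 4-connected movement rule, and the function name are assumptions made for this example, not the project's actual game format.

```python
from collections import deque


def shortest_path_length(grid, start, goal):
    """Return the number of moves on a shortest 4-connected path, or None if unreachable.

    grid  -- list of equal-length rows; 0 is an open cell, 1 is a wall (assumed encoding)
    start -- (row, col) of the starting cell
    goal  -- (row, col) of the target cell
    """
    rows, cols = len(grid), len(grid[0])
    if grid[start[0]][start[1]] == 1 or grid[goal[0]][goal[1]] == 1:
        return None  # a puzzle whose endpoints sit inside walls is unsolvable

    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist  # BFS reaches each cell first via a shortest path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None  # frontier exhausted without reaching the goal


if __name__ == "__main__":
    demo = [
        [0, 0, 0],
        [1, 1, 0],
        [0, 0, 0],
    ]
    print(shortest_path_length(demo, (0, 0), (2, 0)))
```

A QC pass along these lines could reject any generated puzzle for which the function returns `None`, and could store the returned length in the game's metadata as a reference value for scoring.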
