Data Scientist / ML Engineer — New York [ available full-time · January 2027 ]

Arjun Varma

I build production ML and agentic systems — pipelines, evals, and answers that cite their sources.

Data science intern at Novo Nordisk, working on the rare-disease and Wegovy/Ozempic portfolios. M.S. in Data Science at Columbia. Before that, three years at ZS building ML for healthcare clients.

Email · GitHub · LinkedIn · Resume

01/ SELECTED WORK — four systems, told properly

Airbnb Data Analyst Agent

Five specialized agents that plan, write, validate, chart, and narrate SQL analytics — every number cited back to source rows.

5 agents on a typed message bus · ×3 retry budget, exponential backoff · 100% of numbers cited to source rows

FastAPI · LangChain · DuckDB / Postgres / Snowflake · OpenAI function calling · matplotlib · pytest


Biliary Tract Cancer Early Detection

A production model that flags likely BTC patients ~45 days before claims data confirms them — scoring 250M patient-claims every month.

~45d earlier identification · 250M patient-claims scored / month · PMSA '25 methodology presented

PySpark · XGBoost · SHAP · K-means / GMM · NLP clustering · MLflow

SunCulture Transaction Intelligence

A hybrid rules + LLM + retrieval pipeline that standardized 7M+ farmer transactions into the credit signal behind microloans.

99% classification accuracy · −95% manual-review volume · 7M+ transactions standardized

Python · RAG · LLM classification · REST

Financial RAG Chatbot

SEC-filing Q&A with line-level citations — multi-stage retrieval over ChromaDB, evaluated with Claude as judge, live on Cloud Run.

hrs→s time-to-insight · 100% answers grounded in retrieved context · 3 eval dimensions, LLM-as-judge

FastAPI · ChromaDB · LangChain · text-embedding-3-large · Streamlit · GCP Cloud Run

02/ MORE SHIPPED WORK — live links, no slides

World Cup 2026 Forecast · 2026

GAFFER — a team-strength model for the 2026 World Cup: Elo + a Dixon-Coles goal model + Transfermarkt squad value, run through 50,000 Monte Carlo simulations for live title odds that update as the games are played.

live ↗ · repo ↗

ClassPulse · 2026

Live classroom theme extraction — students answer via QR, an LLM clusters responses into themed cards every 10s over SSE, with a 5-model fallback chain.

live ↗ · repo ↗

SeanceAI · 2025

Conversations with 60+ historical figures under era-appropriate knowledge boundaries; Dinner-Party mode runs a multi-agent dialogue among 2–5 figures.

live ↗ · repo ↗

Citation Format Checker · 2026

Narrow-scope chatbot that flags APA 7 / MLA 9 / Chicago 17 violations with rule-IDs and quoted evidence; three-method eval suite, 30+ test cases.

live ↗ · repo ↗

Tweet Bot · 2026

Chrome extension that drafts X replies in three rhetorical angles — image-aware context extraction, voice learning from selections, streaming output.

repo ↗

Video Speed Controller · 2025

Fine-grained playback control (0.1×–16×) that persists per-site and survives player resets; published on the Chrome Web Store.

live ↗ · repo ↗

03/ EXPERIENCE

Novo Nordisk

New York · Summer 2026

current

Data Science Intern — Commercial Data Science

Summer 2026

Commercial Data Science team, working across the Rare Disease and Wegovy/Ozempic portfolios.

ZS Associates

Pune · Feb 2022 — Jun 2025

Advanced Data Science Associate Consultant

Feb 2025 — Jun 2025

Shipped the biliary tract cancer early-detection model — 250M patient-claims scored monthly, in production.
Unified 5+ data sources into an org-wide analytics platform used by 100+ stakeholders on a $10B oncology portfolio; cut weekly reporting from days to minutes.

Decision Analytics Associate Consultant

Jul 2024 — Jan 2025

Built and deployed Positive-Unlabeled learning models that lifted customer-journey coverage from ~40% to ~95% in medical transaction data.
Implemented feature and prediction drift monitoring plus CI unit tests for production pipelines; led a 5-member team modernizing legacy business rules (~50 hrs/mo saved, >99% first-pass quality).

Decision Analytics Associate

Feb 2022 — Jun 2024

Engineered PySpark/SQL ETL across healthcare sources covering millions of patients for $4B+ oncology drug analytics.
Defined audit-ready patient cohort inclusion/exclusion logic robust to missing and miscoded fields.

Promoted to Associate Consultant in 4 cycles (typical: 5). Expert Associate and Insight Illuminator awards.

Columbia University

New York · 2025 — present

Graduate Teaching Assistant

2025 — present

TA for Business Analytics II (Foundations of AI) and Hollywood & Big Data at Columbia Business School.

04/ ABOUT

I work on the part of machine learning that starts after the demo.

Pipelines, evals, drift, citations — the work that keeps a model trustworthy after it ships. That’s where I’ve spent most of my time. Right now I’m building agents that plan, query, and cite their sources.

Education

Columbia University

M.S. Data Science

Aug 2025 — Dec 2026

TA — Business Analytics II (Foundations of AI) · Hollywood & Big Data, Columbia Business School

Vellore Institute of Technology

B.Tech, Electronics & Communication Engineering

Jul 2018 — May 2022

Special Achiever Award · Merit Scholarship

Toolbox — Languages · ML / DS · LLM / Agents · Data / Cloud · Workflow

Python · SQL · C++ · R · PyTorch · scikit-learn · XGBoost · pandas · NumPy · SHAP · MLflow · RAG · LangChain · ChromaDB · OpenRouter

Vertex AI · Evals · PySpark · Databricks · AWS (S3, EMR, Athena, SageMaker) · GCP Cloud Run · Docker · Git · CI/CD · Jupyter · Streamlit · FastAPI · Cursor · Claude Code