Arjun Varma
Data Scientist & ML Engineer
MS Data Science @ Columbia University
About Me
Passionate about transforming data into actionable insights
My Story
Currently pursuing my Master's in Data Science at Columbia University, with 3+ years of prior experience at ZS Associates building ML platforms and analytics solutions for Fortune 500 healthcare clients. I also TA at Columbia Business School, teaching AI foundations and data-driven decision-making.
My coursework spans Applied Machine Learning, Agentic AI for Analytics, Statistical Inference and Modeling, and Probability and Statistics. I'm drawn to the areas where engineering meets real-world problem solving.
I'm looking for roles where I can make a genuine impact: environments that challenge me, push my thinking, and let me build things that matter.
Quick Facts
Education

Columbia University
New York, NY
Master of Science in Data Science
Aug 2025 - Dec 2026
- Coursework: Applied Machine Learning, Agentic AI for Analytics, Statistical Inference and Modeling, Probability and Statistics
- Teaching Assistant, Columbia Business School: Business Analytics II (Foundations of AI) and Hollywood and Big Data

Vellore Institute of Technology
Vellore, India
B.Tech in Electronics & Communication Engineering
Jul 2018 - May 2022
- Special Achiever Award | Merit Scholarship
Work Experience
3+ years of building data-driven solutions at scale
Advanced Data Science Associate Consultant
ZS Associates
Pune, India • Feb 2025 - Jun 2025
- Built and deployed an org-wide analytics and ML platform (Spark/SQL, dashboards) that unified 5+ data sources into territory and product KPIs used by 100+ stakeholders supporting a $10B oncology portfolio
- Cut weekly reporting time from days to minutes, replacing Excel workflows with automated pipelines and self-serve dashboards
Decision Analytics Associate Consultant
ZS Associates
Pune, India • Jul 2024 - Jan 2025
- Led a 5-member team to modernize legacy business rules; saved ~50 hrs/mo and improved first-pass quality to >99%
- Built and deployed Positive-Unlabeled learning models to infer missing categorical labels in medical transaction data, increasing customer-journey analytics coverage from ~40% to ~95% with consistent performance across tumor types and territories
- Implemented drift monitoring (feature + prediction drift) and CI unit tests for production pipelines, reducing silent failures
- Placed in the top ~10% in a company-wide hackathon and earned selection for a lateral transfer into the Data Science vertical
Decision Analytics Associate
Fast Track
ZS Associates
Pune, India • Feb 2022 - Jun 2024
- Engineered PySpark/SQL ETL pipelines integrating multiple healthcare data sources covering millions of patients for $4B+ oncology drug performance analytics
- Defined patient cohort inclusion and exclusion logic robust to missing and miscoded fields, enabling audit-ready reporting
- Delivered ad hoc analyses identifying care gaps and market opportunities to inform brand strategy across multiple launches
- Promoted to Associate Consultant in 4 cycles (typical: 5) and received Expert Associate and Insight Illuminator awards
Featured Projects
From ML models predicting cancer to LLM-powered chatbots
Biliary Tract Cancer (BTC) Early Detection
Predictive Analytics & NLP
ZS Associates • Jan 2025 - May 2025
Developed an early detection model across 250M patient claims, enabling ~45-day earlier identification compared to standard diagnosis lag. Engineered a hybrid feature pipeline combining clinical risk factors, K-means and GMM segmentation, and Transformer-based NLP clustering on diagnosis narratives.
- 250M patient claims analyzed
- Hybrid clinical + NLP feature pipeline
- Presented at PMSA 2025; adopted for territory planning
Built an LLM-powered RAG chatbot answering company financial questions from SEC filings with line-level citations. Implemented semantic retrieval with ChromaDB and text-embedding-3-large plus automatic ticker and period parsing.
- Line-level source citations
- Claude Opus evaluation pipeline
- Live demo on Streamlit Cloud
Built an AI chatbot enabling conversations with 60+ historical figures using multi-model LLM support and streaming responses. Implemented era-appropriate prompt engineering and "Dinner Party" mode for multi-figure conversations; deployed on Railway.
- 60+ historical figures
- Multi-model LLM support & streaming
- Deployed on Railway
Agricultural Product Standardization and Risk Detection
RAG and Classification System
SunCulture (Internship/Co-op) • Aug 2025 - Oct 2025
Built a RAG-augmented classification system at SunCulture (Series B Agtech) categorizing 7M+ farmer transactions across 500+ product categories to support creditworthiness assessment for microloans in East Africa. Achieved 99% accuracy on a 10,000-item holdout set using hybrid rule-based and LLM-assisted classification, reducing manual review volume by 95% and accelerating loan decisioning.
- 7M+ farmer transactions classified
- 99% accuracy on 10K holdout set
- 95% reduction in manual review
Built a Chrome extension for fine-grained video playback speed control across all websites. Features persistent speed memory and keyboard shortcuts, and works with YouTube, Netflix, Udemy, and more.
- Works on all major platforms
- 0.1x to 16x speed range
- Persistent speed memory
Built an AI-powered Chrome extension that generates tweet replies, quote tweets, and threads using Claude via OpenRouter. Features tone control, image understanding, voice learning that adapts to your style, and real-time streaming responses.
- 3 distinct suggestions with rhetorical strategy tags
- Voice learning adapts to your style
- Multi-model support (Opus, Sonnet, Haiku)
Built an academic citation format checker chatbot supporting APA 7th, MLA 9th, and Chicago 17th editions. Powered by Vertex AI (Gemini 2.0 Flash Lite) and FastAPI, it identifies specific formatting violations with rule IDs and quoted evidence; deployed on GCP Cloud Run.
- Supports APA 7th, MLA 9th, Chicago 17th
- Rule-ID based violation detection
- 30+ eval test cases across 3 methods
Built a real-time classroom feedback tool where professors post a question, students submit answers via QR code or link, and an LLM automatically summarizes responses into 4-6 themed cards with student attribution. Uses FastAPI with SSE for live updates and OpenRouter with a 5-model fallback chain; deployed on Railway as a single service.
- Real-time theme extraction every 10 seconds
- 5-model LLM fallback chain via OpenRouter
- Single service: FastAPI + React on Railway
Technical Skills
Technologies and tools I use to bring ideas to life
Programming
Core languages I work with daily
Analytics & ML
ML frameworks and data tools
Big Data & MLOps
Scalable infrastructure tools
Tools & Platforms
Development environment
Also experienced with
Get in Touch
Interested in collaborating or have a question? Feel free to reach out!
