Digital-Senior

Location: Kolkata

Other locations: Anywhere in Country

Salary: Competitive

Date: Jun 25, 2026

Job description

Requisition ID: 1719010

At EY, you’ll have the chance to build a career as unique as you are, with the global scale, support, inclusive culture and technology to become the best version of you. And we’re counting on your unique voice and perspective to help EY become even better, too. Join us and build an exceptional experience for yourself, and a better working world for all.

EY – Assurance – Senior – Digital

Role: GenAI / Agentic AI Evaluation Engineer (Quality, Safety & Reliability)

Position Details

As part of EY GDS Assurance Digital, you will help design, build, and scale a standardized evaluation capability, focused on evaluating GenAI, RAG-based, and Agentic AI solutions before deployment.

This role sits at the intersection of AI evaluation engineering, Responsible AI, and GenAI security/red teaming. The primary objective is to ensure GenAI/agentic systems are safe, reliable, robust, and fit-for-purpose, by designing evaluation strategies, building repeatable test harnesses, and generating auditable evidence that supports go/no-go decisions.

You will work with global stakeholders (product teams, solution architects, risk & compliance, and assurance leadership) to define evaluation requirements, request test datasets from product teams, execute rigorous evaluations (functional + non-functional), and recommend mitigations and controls to reduce risk.

This is a core full-time role that requires a hands-on AI development mindset, a strong evaluation mindset, and the ability to translate risk concerns into practical testing strategies and measurable acceptance criteria.

Responsibilities

Define and operationalize evaluation strategies for GenAI systems across use cases like Q&A assistants, summarization, extraction, drafting, agentic systems, and multi-step workflows.
Translate business use cases into a structured evaluation plan: scope, assumptions, success criteria, datasets, metrics, red-team scenarios, thresholds, and reporting requirements.
Drive standardization: reusable evaluation templates, test case libraries, scoring rubrics, and reporting formats across product teams.
Design structured dataset requirements for product teams and ensure coverage across:
- Core user journeys and primary business intents
- Edge cases (rare prompts, ambiguous queries, incomplete context)
- Adversarial cases (malicious prompts, jailbreak attempts, prompt injections)
- Bias & fairness cases (sensitive demographic proxies, protected attributes, stereotyping patterns)
Define guidance for dataset sufficiency and statistical coverage (e.g., minimum samples, distribution balance, scenario matrices, stratification by intent/risk).
Build reusable evaluation pipelines for:
- Answer quality (correctness, relevance, completeness, clarity)
- Grounding & faithfulness (RAG-specific: faithfulness, context precision/recall, hallucination rate, citation quality)
- Agentic behavior (tool-call accuracy, tool misuse, goal completion, step correctness, unnecessary actions, loop detection, safety of tool outputs)
- Operational quality (latency, cost/token budget, throughput, stability, retries, failure recovery)
Combine LLM-as-judge and human evaluation in a calibrated way (rubric design, sampling plans, agreement checks).
Implement automated evaluation harnesses in Python (preferred), enabling:
- batch runs on scenario suites
- configurable metric definitions
- reproducible runs with run IDs and artifacts
- storage of traces and outputs for auditability
Execute structured red teaming aligned to OWASP Top 10 for LLM Applications, covering (examples):
- Prompt injection (direct + indirect) and tool hijacking
- Sensitive data disclosure / PII leakage
- Insecure output handling (downstream injection)
- Training data leakage / memorization probes
- Model denial-of-service / denial-of-wallet patterns
Integrate evals into development lifecycle: pre-release regression gates, CI checks, benchmark comparisons across model versions/prompts/tools/retrievers.
Perform adversarial testing for agentic workflows:
- tool misuse / over-permissioned tool access
- unauthorized action execution
- exfiltration via tools/connectors
- prompt injection via retrieved documents (RAG poisoning)
Recommend mitigations: input validation, retrieval filtering, tool sandboxing, least-privilege permissions, guardrails, policy prompting, refusal logic, output encoding, monitoring alerts.
Produce high-quality evaluation reports that are auditable and decision-ready, including:
- methodology, datasets, metrics, thresholds
- quantitative results
- qualitative results
- risk assessment summary and recommended control actions
Present findings to stakeholders in a crisp, risk-informed manner; clearly explain residual risk, limitations, and rationale for go/no-go.

Key Requirements/Skills & Qualification:

Excellent academic background, including at a minimum a bachelor’s or master’s degree in Data Science, Statistics, Engineering, Operational Research, or another related field with a strong focus on modern data architectures, processes, and environments.
4–7+ years of relevant experience in one or more areas:
- ML/AI/GenAI/Agentic engineering (NLP/LLMs), evaluation engineering, applied research
- security testing / red teaming
- designing and building evaluation harnesses that ensure safety, reliability, and robustness
Strong hands-on Python for building evaluation harnesses (data processing, metric computation, orchestration, reporting pipelines).
Practical understanding of GenAI system architectures: RAG, embeddings/vector search, prompt orchestration, tool calling, multi-agent systems, memory, routing.
Experience designing metrics and evaluation methods (rubrics, automated scoring, sampling strategy, regression design).
Familiarity with LLM risks and mitigations, especially for enterprise contexts (data leakage, hallucinations, prompt injection, unsafe content, bias).
Security / Red Teaming Skills (Strong Preference)
Understanding of OWASP Top 10 for LLM Applications and how to translate it into test cases and controls.
Experience with adversarial testing approaches: jailbreak prompts, injection patterns, tool misuse scenarios, retrieval poisoning patterns.
Familiarity with secure-by-design practices for LLM apps: least privilege, safe tool invocation, output encoding/validation, monitoring.
Evaluation frameworks and tooling: RAGAS, DeepEval, LangSmith, Phoenix/Arize, custom eval harnesses.
Experimentation practices: A/B testing mindset, baseline comparisons, statistical rigor for sample sizes.
Observability/tracing: structured logging, OpenTelemetry, Langfuse-style traces, dashboards.
Basic DevOps practices: Git, CI/CD, containerization (Docker), reproducible environments.
Strong written communication to produce clear evaluation plans and reports for technical + non-technical stakeholders.
Ability to challenge assumptions constructively (“effective challenge”) and influence engineering teams toward remediation.
Comfort operating in ambiguity with fast-evolving GenAI tooling and risk landscape.

Preferred / Nice-to-Have

Experience in Assurance/Finance/Regulatory environments (model validation, risk acceptance workflows, audit evidence mindset).
Familiarity with responsible AI frameworks (NIST AI RMF, ISO/IEC 42001, EU AI Act concepts).
Experience evaluating multilingual systems or domain-heavy enterprise assistants.
Hands-on with Azure ecosystem (Azure OpenAI, AI Search, Function Apps, App Insights, Key Vault).

Additional skills requirements:

Excellent written, oral, presentation and facilitation skills
Ability to coordinate multiple projects and initiatives simultaneously through effective prioritization, organization, flexibility, and self-discipline.
Must have demonstrated project management experience.
Knowledge of firm’s reporting tools and processes.
Proactive, organized, and self-sufficient, with the ability to prioritize and multitask.
Ability to analyze complex or unusual problems and deliver insightful and pragmatic solutions.
Ability to quickly and easily create, gather, and analyze data from a variety of sources.
A robust and resilient disposition, able to encourage discipline in team behaviors.

What we look for

A team of people with commercial acumen, technical experience, and enthusiasm to learn new things in this fast-moving environment.
An opportunity to be part of a market-leading, multi-disciplinary team of 7,200+ professionals, in the only integrated global assurance business worldwide.
Opportunities to work with EY GDS Assurance practices globally with leading businesses across a range of industries.

What working at EY offers

At EY, we’re dedicated to helping our clients, from startups to Fortune 500 companies, and the work we do with them is as varied as they are.

You get to work on inspiring and meaningful projects. Our focus is education and coaching alongside practical experience to ensure your personal development. We value our employees, and you will be able to control your own development with an individual progression plan. You will quickly grow into a responsible role with challenging and stimulating assignments. Moreover, you will be part of an interdisciplinary environment that emphasizes high quality and knowledge exchange. Plus, we offer:

Support, coaching and feedback from some of the most engaging colleagues around
Opportunities to develop new skills and progress your career
The freedom and flexibility to handle your role in a way that’s right for you

EY | Building a better working world

EY exists to build a better working world, helping to create long-term value for clients, people and society and build trust in the capital markets.

Enabled by data and technology, diverse EY teams in over 150 countries provide trust through assurance and help clients grow, transform and operate.

Working across assurance, consulting, law, strategy, tax and transactions, EY teams ask better questions to find new answers for the complex issues facing our world today.
