How We Score AI Financial Services Apps: Our Testing Methodology

The five axes we use to evaluate every financial services app on ChatGPT, Claude, and Gemini, and why each one matters for AI distribution.
Financial services are being distributed through AI platforms right now. Insurance quotes on ChatGPT. Tax calculators on Gemini. Savings tools on Claude. These are live products, used by real consumers, handling real money decisions.
But nobody is systematically evaluating them. There is no shared standard for what “good” looks like in AI distribution for financial services. No framework that says: this app converts, this one misleads, this one dead-ends.
WaniWani builds the infrastructure that powers these apps. We work with carriers, brokers, and financial services providers to make their products distributable through conversational AI. That means we have a clear view of what separates a working distribution channel from a demo that goes nowhere.
We are publishing our methodology so the industry has a benchmark. Every app we review on this blog is scored using the same five axes, the same protocol, the same scale. Here is how it works.
Platform vs. App: How We Think About Scoring
These apps run on third-party platforms. The builder controls the tool: what data it returns, what disclaimers it includes, how the widget renders, where the handoff goes. The builder does not control the platform layer: how ChatGPT summarizes tool output, whether it adds recommendations, whether it introduces competitors.
We score the app for what the builder controls, but we hold builders accountable for how robust their integration is. Some tools are designed so the platform cannot easily bypass them. Neptune’s quoting widget fires on every address input and returns data ChatGPT cannot ignore. Tuio’s policy term search returns actual documentation that grounds ChatGPT’s answers. Other tools get bypassed routinely: TaxDown’s qualifying questions are skipped by ChatGPT in every session we tested. If your tool is easy to bypass, that is a design problem, not just a platform limitation.
When the platform overrides the tool (adding competitors, hiding tool responses, stripping context), we note it. But we do not excuse poor tool design by blaming the platform.
The Five Axes
Every app is scored from 1 to 5 on five axes, for a maximum of 25 points. Each axis targets a different dimension of what makes AI distribution effective (or broken).
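For readers who think in code, here is a minimal sketch of the scorecard as a data structure. The type and field names are our own, purely illustrative; the only constraints come from the rubric itself: five axes, each scored 1 to 5, summing to at most 25.

```ts
// Minimal model of the scorecard. All names are illustrative; the
// only constraints come from the rubric: five axes, each 1-5,
// for a maximum total of 25.

type AxisScore = 1 | 2 | 3 | 4 | 5;

interface Scorecard {
  productDepth: AxisScore;
  complianceRigor: AxisScore;
  conversationQuality: AxisScore;
  commercialEffectiveness: AxisScore;
  transparency: AxisScore;
}

// The total is a plain sum: 5 axes x 5 points = 25 maximum.
function totalScore(s: Scorecard): number {
  return (
    s.productDepth +
    s.complianceRigor +
    s.conversationQuality +
    s.commercialEffectiveness +
    s.transparency
  );
}
```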
1. Product Depth
How much can you actually do?
This axis measures whether the app delivers real, personalized value or just surfaces generic content.
- 1 = FAQ only, no personalized output
- 2 = Generic estimate, no customization
- 3 = Personalized output based on real inputs
- 4 = Personalized output with contextual follow-ups and edge case handling
- 5 = Near-complete journey: personalized output, parameter iteration, coverage design, and meaningful follow-up capability
Most apps cluster around 2 or 3. They can produce a number, but the number lacks the specificity a consumer needs to act on it. A score of 5 means the tool delivers a full interactive experience: it takes real inputs, returns a meaningful output, handles parameter changes and re-quoting, and supports genuine follow-up questions with real data.
2. Compliance Rigor
What safeguards did the builder put in, and do they survive the platform?
This axis measures the regulatory safeguards the builder designed into the tool, and how robustly those safeguards reach the user. A tool that includes disclaimers, refuses gray-area topics, and asks qualifying questions before answering scores well. A tool whose safeguards exist on paper but are designed so that ChatGPT skips them every time has a compliance design problem, not just a platform problem.
- 1 = No disclaimers, loose language, binding-sounding statements, crosses the advice boundary
- 2 = Minimal disclaimers, inconsistent language, advice boundary unclear
- 3 = Clear disclaimer language, present but generic; estimate vs. quote distinction exists
- 4 = Jurisdiction-aware, consistent placement, no binding language, licensing info present
- 5 = Jurisdiction-specific handling, licensing and regulatory info surfaced, proactive estimate vs. quote language, advice boundary clearly maintained throughout
We are not auditing for legal compliance. We are evaluating whether the builder designed regulatory awareness into the tool and whether those safeguards actually reach the user. An insurance app whose disclaimers appear on every widget render, and whose refusal logic ChatGPT respects, has strong compliance design; a tax tool whose qualifying questions never survive the conversation does not.
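One way to make the design point concrete: embed the safeguards in the structured payload the widget renders, rather than relying on the model to repeat them. The sketch below is hypothetical; the response shape and field names are ours, not any platform's actual API.

```ts
// Hypothetical quote-tool response. The disclaimer, the estimate-vs.-quote
// distinction, and the licensing info travel inside the payload the widget
// renders, so they appear on every render regardless of how the platform
// summarizes the surrounding conversation.

interface QuoteEstimate {
  monthlyPremiumUsd: number;
  jurisdiction: string;      // e.g. "US-FL"; determines which notice applies
  isBindingQuote: false;     // estimate vs. quote, stated in the data itself
  disclaimer: string;        // rendered by the widget, not left to the model
  licensingInfo: string;
}

function buildEstimate(premium: number, jurisdiction: string): QuoteEstimate {
  return {
    monthlyPremiumUsd: premium,
    jurisdiction,
    isBindingQuote: false,
    disclaimer:
      "This is a non-binding estimate, not an offer of coverage. " +
      "Final pricing requires underwriting review.",
    licensingInfo: `Licensed insurance producer in ${jurisdiction}.`,
  };
}
```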
3. Conversation Quality
Does it actually understand what it is selling?
An AI app distributing financial products needs to handle the same range of questions a knowledgeable human would face. This axis measures how well the conversation serves the user, and whether the quality comes from the tool’s data or from the platform improvising on its own.
- 1 = Breaks after one turn or returns canned responses
- 2 = Handles basic follow-ups but fails on edge cases
- 3 = Solid on the happy path, handles parameter changes, stumbles on adversarial questions
- 4 = Handles edge cases gracefully, admits limits, gives accurate domain context
- 5 = Feels like talking to a knowledgeable specialist; adapts, clarifies, educates
The gap between 3 and 5 is where most apps fall short. They can walk a user through a standard flow, but the moment someone asks a question the builder did not anticipate, the quality drops sharply. When the tool fires, is the data good enough to ground an accurate answer? When it does not fire, does the platform improvise well or poorly? The best apps provide data deep enough that the platform rarely needs to improvise.
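One lever builders have here is the tool definition itself. Platforms decide when to call a tool largely from its description, so a description scoped to every relevant input, including follow-up parameter changes, leaves the model less room to improvise. Below is a sketch in the common function-calling style; the wording and fields are our own, not any specific platform's schema.

```ts
// Illustrative tool definition in the generic function-calling style.
// A broad, concrete description plus rich structured output gives the
// platform a reason to call the tool instead of answering from priors.

const estimateTool = {
  name: "get_home_insurance_estimate",
  description:
    "Return a personalized home insurance estimate for any property " +
    "address the user mentions, including follow-ups where they change " +
    "coverage amounts, deductibles, or property details.",
  parameters: {
    type: "object",
    properties: {
      address: { type: "string", description: "Full property address" },
      coverageAmountUsd: { type: "number" },
      deductibleUsd: { type: "number" },
    },
    required: ["address"],
  },
};
```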
4. Commercial Effectiveness
Does it bring business back to the company that built it?
This axis goes beyond the click-through moment. It evaluates the full commercial picture: conversion path, brand preservation, context carry-over, and whether the platform undermines the business goal. An app that generates great answers but sends users to competitors, strips the brand, or loses all conversational context is commercially ineffective regardless of how good the product is.
- 1 = No conversion path, no CTA, brand diluted or absent
- 2 = Generic link to homepage, brand present but no context carried
- 3 = Link to relevant page with some context, conversion path exists but incomplete
- 4 = Deep link with context preserved, brand prominent, clear conversion funnel
- 5 = Seamless conversion: pre-filled form, brand intact, platform behavior does not undermine the business goal
Commercial effectiveness is where most AI financial services apps reveal whether they were built with a distribution strategy or just as a proof of concept. An app that ends with “visit our website for more information” has no distribution strategy. An app that carries context into a pre-filled application form, preserves the brand throughout, and does not leak users to competitors does.
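In practice, "carries context" can be as simple as a deep link that resumes the same quote. A hypothetical sketch; the domain and parameter names are invented for illustration.

```ts
// Hypothetical handoff: instead of linking to the homepage, carry the
// conversation's state into a pre-filled application URL.

interface HandoffContext {
  quoteId: string;
  address: string;
  coverageAmountUsd: number;
}

function buildHandoffUrl(ctx: HandoffContext): string {
  const url = new URL("https://apply.example-insurer.com/home");
  url.searchParams.set("quote_id", ctx.quoteId);    // resume the same quote
  url.searchParams.set("address", ctx.address);     // pre-fill the form
  url.searchParams.set("coverage", String(ctx.coverageAmountUsd));
  return url.toString();
}
```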
5. Transparency
Does the user understand what they are looking at and where it comes from?
Financial services outputs carry weight. Consumers make decisions based on the numbers they see. This axis measures two things: whether the output itself is clear (price breakdown, methodology, limitations), and whether the user can tell where the information comes from (tool data vs. AI-generated content).
- 1 = Numbers appear from nowhere. No breakdown, no methodology. User cannot tell if data comes from the tool or the AI.
- 2 = Mentions it is an estimate. No breakdown. No source distinction.
- 3 = Some component visibility. High-level explanation of output basis. Source partly distinguishable (e.g., branded widget for quotes, but follow-up answers unmarked).
- 4 = Clear breakdown of components, inclusions and exclusions stated. User can generally tell when the tool provides data vs. when the AI is filling gaps.
- 5 = Full breakdown: components, factors, exclusions, methodology, limitations. Clear source attribution throughout. User always knows where the information comes from.
A 5 on transparency means the user knows exactly what they are looking at and who provided it. They understand which inputs drove the output, what was excluded, how the number was calculated, what the output cannot tell them, and whether the answer came from the company’s tool or from the AI platform. This builds the kind of trust that converts.
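In output terms, this usually means the breakdown and the attribution travel with the number. A hypothetical shape, with field names of our own choosing:

```ts
// Hypothetical "transparent estimate" payload: everything a 5 on this
// axis requires is carried alongside the number itself.

interface TransparentEstimate {
  totalMonthlyUsd: number;
  breakdown: { component: string; amountUsd: number }[]; // e.g. dwelling, liability
  exclusions: string[];          // what the number does not cover
  methodology: string;           // how the number was calculated, in plain language
  limitations: string[];         // what this estimate cannot tell the user
  source: "tool" | "platform";   // lets the widget label tool data vs. model-filled gaps
}
```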
How We Test
Every app goes through the same protocol, regardless of category.
We start with a realistic first request (the happy path), then follow the conversation wherever it leads. Each subsequent turn is chosen based on what the previous response revealed and which scorecard axes still need coverage. We test parameter changes mid-conversation, edge cases, compliance-sensitive questions, the conversion path, and transparency. The conversation typically runs 2 to 7 turns, depending on the app’s depth.
Depending on the category, we add relevant scenarios. For insurance apps, these might include coverage questions, estimate vs. quote language, price verification, or jurisdiction handling. For tax apps: deduction calculations, advice boundaries on sensitive topics, or filing guidance. For banking and savings apps: offer freshness, personalization, or commercial transparency.
We test each app as a normal user would, on the same platform, with no special access or insider knowledge.
Every response is documented. We record what the tool returned (including tool responses the platform hides from the user, wherever we can observe them) and what the platform actually displayed.
All five axes are scored on every app, always. The maximum score is 25.
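For the curious, this is roughly the shape of what we record per turn. The structure below is our own illustration, not a published schema; the protocol only requires that tool output and platform display both be captured.

```ts
// Illustrative session record. One entry per conversational turn.

interface TestTurn {
  turn: number;                 // 1..n; sessions typically run 2 to 7 turns
  userMessage: string;
  toolFired: boolean;
  toolResponse?: unknown;       // raw tool output, where observable
  platformDisplay: string;      // what the user actually saw
  platformDeviations: string[]; // e.g. added competitors, stripped disclaimer
}

interface TestSession {
  app: string;
  platform: "ChatGPT" | "Claude" | "Gemini";
  testedOn: string;             // ISO date; every audit is dated
  turns: TestTurn[];
}
```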
What We Are Looking For
This is not a product review. We are evaluating whether AI distribution works as a channel.
That means we are looking past the app itself to the distribution strategy behind it.
Does the app have a distribution strategy? Some apps are clearly built to convert: they collect meaningful inputs, produce actionable outputs, and push the user toward a next step. Others are content tools repackaged as apps, with no path from conversation to transaction.
Can it convert? The full-funnel question. Can a user go from “I need home insurance” to a real quote, a real application, or a real next step? Or does the journey dead-end after a generic estimate?
Does the platform help or hurt? ChatGPT, Claude, and Gemini each introduce their own behaviors. The platform may add recommendations, introduce competitors, hide tool responses, or strip disclaimers. We note these behaviors and evaluate whether the builder designed their tool to be robust against them. Some apps are built so the platform reinforces the experience. Others are built in a way the platform easily overrides.
The invisible boundary. This is the most important thing we track. Every AI app sits on top of a large language model. At some point, the structured tool stops and the LLM starts improvising. When that boundary is invisible to the user, risks multiply: hallucinated numbers, fabricated policy details, invented regulatory claims. The best apps are designed so this boundary rarely matters: the tool fires often enough, and the data is deep enough, that the platform has less room to improvise. We score this under Transparency.
Where We Publish
We publish results on the WaniWani blog, one app at a time. Each review includes the full score breakdown across all five axes, specific evidence from the testing session, and our assessment of the app’s distribution viability.
The goal is straightforward: give the industry a shared reference point for evaluating AI distribution in financial services. Providers building these apps deserve to know how they compare. Consumers using them deserve to know what they are getting. And the platforms hosting them need a quality signal that goes beyond download counts.
We will keep testing. The methodology stays the same. Every audit is dated so readers know exactly when a given app was tested; results reflect the app’s state at that point in time. The scores speak for themselves.