🧠 Will AI Take Your Job? Watch the Benchmarks

🧠 The Benchmark Will Eat Finance

AI will transform whatever we can measure first. And the first measures are here. The discourse tends towards AI displacing jobs, or AI not delivering ROI. But the benchmarks are getting better, creating a ladder. And when there’s a ladder, AI starts to climb it.

The CEO of Citadel said a few weeks ago that “tasks once requiring teams of PhD-level professionals are now being completed by AI in a matter of days rather than months.” Ken Griffin is not alone. Most CEOs have had, or are about to have, that moment. And that was before Anthropic dropped Fable 5 (the public release of Mythos).

For every “holy crap AI can do an entire person’s job now” moment, there’s a hangover where in production, outside of test conditions, it can’t do the job at all, and worse, it pretends it can and is confidently wrong. For example: The best frontier LLM gets 1 in 3 financial crime risk ratings wrong. The other three models tested ranged from 46% to 53% accuracy.

So which is it? A productivity miracle or an overhyped slop machine?

The answer is neither, and it turns on one idea that runs through everything below.

Anything you can benchmark, the labs will benchmaxx. Give them a ladder, and they climb it, fast, every time. We watched it happen with code. It's now happening with finance.

So the real question was never "is AI good at finance?" Rather, it’s “which parts of your work sit on a ladder someone is already building a benchmark for, and which parts will never fit on a ladder at all?”

Today:

The base models are not yet expert enough at most expert finance workflows
But specialist companies can get much better performance today
The labs are getting better at every task, starting with coding
When we have benchmarks, the AI labs will race up the ladder of performance
So we’ll have a two-tier market: what the base models can do, and what the specialists add by building on top of those models.
The slop outputs and job cutting are lazy short-term narratives that hide a bigger shift. Not being as good today isn’t the same as not being good ever.

1. Base models aren’t good enough at finance workflows (yet)

The industry standard for deploying automation in fraud workflows is getting more than 90% of risk ratings correct, according to CoveLabs' new FinCrimeBench, the first independent benchmark testing frontier AI on fraud and AML tasks.

The methodology is fascinating. They took the 4 leading models and 110 expert-authored scenarios, and ran 342 evaluations between January and April 2026. The prompts were paired with expert answer keys and draw on guidance from OCC, FFIEC, FATF, FinCEN, and NIST.

The findings make grim reading if you were hoping the AI labs would fix AML and fraud: The best model topped out at 67%, while the collective average was 53.8%. And perhaps the most important finding: When models got it wrong, they did it with confident, structured rationale. If this becomes a tool in an institution, you can imagine a senior investigator spotting the flaw, while a junior analyst who trusts the tool as a copilot might not.
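To make the gap concrete, here is a minimal sketch of the arithmetic behind those headline numbers. It assumes the 53.8% collective average is a simple unweighted mean across the four models, which the figures quoted here don't confirm; the variable names are purely illustrative.

```python
# Figures quoted above from FinCrimeBench; everything else is derived.
DEPLOYMENT_BAR = 90.0   # % of risk ratings correct needed for operational use
BEST_MODEL = 67.0       # % accuracy of the best model
COLLECTIVE_AVG = 53.8   # % average across all four models
N_MODELS = 4

# Assumption: the collective average is an unweighted mean of the four models.
others_mean = (COLLECTIVE_AVG * N_MODELS - BEST_MODEL) / (N_MODELS - 1)

# Distance in percentage points between each score and the deployment bar.
gap_best = DEPLOYMENT_BAR - BEST_MODEL
gap_avg = DEPLOYMENT_BAR - COLLECTIVE_AVG

print(f"Implied mean of the other three models: {others_mean:.1f}%")
print(f"Best model's gap to the bar: {gap_best:.1f} points")
print(f"Average model's gap to the bar: {gap_avg:.1f} points")
```

Under that assumption, the other three models average about 49.4%, which sits inside the 46% to 53% range quoted earlier, and even the best model is 23 points short of the bar.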

Consider that Canadian and US institutions spent $60B on financial crime compliance in 2024, according to LexisNexis. The automation pressure is enormous. Nobody doubts AI will transform financial crime operations, but you have to question whether anyone is measuring readiness independently.

CoveLabs says that closing the performance gap requires two layers of training that are very familiar from my Sardine days:

Domain-specific: financial crime has its own typologies, terminology, and reasoning chains that general-purpose models haven't absorbed.
Company-specific: every institution has a different threat surface, channel mix, and risk appetite. The model needs to learn your policies, not generic ones.

FinCrimeBench is the first ladder anyone's built for financial crime. The score is ugly today. But now there’s a ladder for the models to climb.

2. Specialist tools can achieve much better results.

Fraud vendors and banks have significantly outperformed their existing operational setups using agentic workflows and fine-tuned LLMs.

Revolut lifted its credit scoring metrics by 130% and its fraud recall by 65% against its existing baseline using its custom LLM (I broke it down when it landed).
Sardine* is helping a bank client resolve 95+% of false positive alerts on sanctions screening automatically, correctly.
Bretton, the YC company that used to be called Greenlite, cut compliance review times at First Internet Bank by 87% with its AI agents.
Hebbia found that 63% of finance professionals at Centerview and Blackstone save over 6 hours a week (and 27% save over 10 hours a week) with their product.

These are production use cases of agentic AI performing at a level that is as good as, or better than, the human staff, and lifting the overall capability of the organization. It’s also well ahead of what the labs can do.

None of these companies has a better base model than OpenAI or Anthropic. What they have is the stuff the labs structurally can't reach. Your data, your workflow, and increasingly, the scoreboard itself.

Sardine's CEO, Soups Ranjan, laid it out in a LinkedIn post:

❝

There are two aspects to building an Agentic AI product.

The model the product works on top of.
The evals, the data, the governance, the workflows, the rules, and the ground truth the Agent is trained on. (the domain-specific and client-specific stuff CoveLabs talked about)

For great products, these two work hand in hand to create a product more useful than either could independently. Most LLMs, when left alone to themselves, think everything is fraudulent. You get a ton of false positives. [...] the most critical aspect is rigorously validating if the Agent is doing the right thing. We label true positives as true, false positives as false, and this way the Agent improves its performance over time. An AI Agent without human oversight is not useful. You need expertise to know what good looks like

Soups Ranjan - CEO of Sardine

If you deconstruct that, you see why the specialists win. They have the evals (benchmarks), governance, workflows, and, critically, data to understand what good looks like. But will this moat last?
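The labelling loop Soups describes can be sketched in a few lines. This is a minimal illustration, not Sardine's implementation: the `ReviewedAlert` type, the `score` function, and the sample data are all hypothetical. It shows why a model that "thinks everything is fraudulent" looks fine on recall and terrible on precision once a human has labelled the outcomes.

```python
from dataclasses import dataclass


@dataclass
class ReviewedAlert:
    agent_flagged: bool  # did the agent call this alert fraudulent?
    truly_fraud: bool    # ground-truth label from a human investigator


def score(alerts):
    """Compare agent decisions to human labels and return basic metrics."""
    tp = sum(a.agent_flagged and a.truly_fraud for a in alerts)
    fp = sum(a.agent_flagged and not a.truly_fraud for a in alerts)
    fn = sum(not a.agent_flagged and a.truly_fraud for a in alerts)
    tn = sum(not a.agent_flagged and not a.truly_fraud for a in alerts)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical sample: 10 alerts, only 2 are real fraud,
# and an over-eager model flags every single one of them.
alerts = [ReviewedAlert(agent_flagged=True, truly_fraud=(i < 2)) for i in range(10)]
result = score(alerts)
print(result)
```

In this made-up sample the agent catches every real case (recall of 1.0), but only 2 of its 10 flags are correct (precision of 0.2). You only see that once an expert has supplied the ground truth, which is the part the specialists own.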

3. The labs are getting better at every task, including finance.

This is benchmaxxing in the open. A ladder appeared for code, the labs put their full weight on it, and the capability jumped. In December, anyone who writes code noticed a marked jump in the capability of coding agents. This has only extended since Anthropic launched Fable 5, where 9 or even 12-hour tasks are becoming the norm.

Did you have the oh shit moment yet?

This is coming to every economic task where a benchmark exists. GDPval, OSWorld-Verified, BrowseComp, MCP Atlas, FinanceAgent, Humanity’s Last Exam, and APEX-Agents have helped models improve at all kinds of knowledge work, computer use, web research, tool use, financial analysis, frontier reasoning, and professional-services tasks.

And now that these benchmarks exist, the labs are racing up the ladder. From the Fable 5 release blog: “On Hebbia’s Finance Benchmark for senior-level reasoning, Fable 5 has the highest score of any model,” understanding documents, tables, and solving problems across them better than anything before it.

We got Claude Mythos before GTA 6 - But then they took it away from non-US users :(

There’s another tidbit in the Fable 5 release blog that relates to finance.

❝

IMC noted that Fable 5 aced their trading analysis evaluations nearly across the board, including factual lookup, conceptual reasoning, root cause analysis, and expected value analysis.

Anthropic

AI is getting really good at your job.

Or more accurately, it’s getting really good at some tasks, in some companies, where benchmarks exist.

Finance Agent v2 is one of Anthropic's priority benchmarks. It’s not surprising that Ken Griffin is feeling the shock, because it's that kind of non-deterministic, complex analysis where AI is starting to make real leaps.

4. If you can benchmark it, you can benchmaxx it

Benchmarks matter.

But they’re worth understanding, especially if you want to know what these models can and cannot do. Not all benchmarks measure the same thing.

For example, Harvey, the legal tech start-up taking that industry by storm, has its own LAB (Legal Agent Benchmark), based on 1,250 tasks in a legal practice's distribution, focused on transactional, advisory, regulatory, and litigation work.

Legal tasks are getting eaten, slowly, then suddenly.

By creating this benchmark, Harvey, the entire industry, and the labs themselves have an objective set of measures to improve against, just as we had with coding. FinCrimeBench caught my attention because it was the first time I’d seen an independent financial services benchmark outside of trading or quant analysis.

Most of finance doesn't have that number yet. And the few benchmarks that do exist are being built by the specialists, not the labs. Hebbia graded the models because Hebbia has the workflows and the customers fussy enough to demand it. CoveLabs graded financial crime because somebody finally had to. Whoever writes the benchmark gets to define what good even means, and that is a quietly enormous amount of power to be holding.

Now play it forward a couple of years. Imagine every corner of financial services that leans on expert judgment gets its own FinCrimeBench. Underwriting, surveillance, disputes, treasury, each one turned into a scored exam with an answer key. The moment that exam exists, the labs have something concrete to optimize against, and history says they will. As a bonus, you finally get to hold vendor marketing up against an independent number instead of a wall of customer logos.

Give a model a clear, expert-graded target and reinforcement learning does the rest. Right now there's almost no optimization pressure on financial-crime performance, because there was no ladder. FinCrimeBench is the first rung.

Which surfaces the rule underneath all of this, the one worth tattooing somewhere.

A benchmark commoditizes whatever it can measure. The value flows to whatever it can't.

The benchmark measures raw model capability, so raw model capability is exactly what gets cheap. The labs will climb that ladder until the base layer is shockingly good and nearly free. What no benchmark can measure is your data, your workflow, your ground truth, your sense of what good even looks like. That's the layer the score can't reach, so that's where the margin goes. It's why the specialists clear the bar today, and it's why they don't get wiped out tomorrow. They never owned a better model. They own the part nobody can download.

Does that mean specialist AI firms are cooked?

No.

5. Four approaches to winning in the age of F-AI-nance.

The frontier coding models got dramatically better over the last year, and Cursor kept growing regardless, because Cursor owns the harness, the data, and the loop the model runs inside. The model is the engine. Cursor is the car you drive.

Remember Soups talked about needing 1) the model and 2) the evals and data? Cursor has both.

Which leaves every incumbent, tech company, and fund standing in front of the same choices.

If you already have an enterprise subscription, riding the base layer costs nothing (except maybe long-term contract lock-in), and will start to eat more tasks over time. With specialists, you rent their data, governance, and know-how, and you get agentic AI benefits next quarter instead of in two years. But now you have a supplier dependency.

With forward deployment, a lab's or a specialist's engineers come in and build a custom agent on your own stack. You own the IP and get good results, but now you arguably have even greater supplier dependency.

And the hard mode option is building your own model. If you have the data, the talent, and the patience, it is a real and durable moat. It is also the door most people should walk straight past (unless they’re massive and great with data), precisely because it's the most seductive one in the room. Most teams will overshoot their actual capabilities here.

Today, options 2 or 3 (renting from a specialist, or forward deployment) are valid trade-offs, but there will come a point where the base-layer models get so good that you can’t ignore them. They might never be as good as the specialists, but for some people, good enough is good enough.

Every choice is really a bet on the ladder. How fast does the base layer climb toward my use case, and what do I do in the meantime?

6. AI Doomers and AI cheerleaders are both wrong.

"AI can't do financial crime prevention today" is true, and temporary.

"AI is coming for the jobs" is loud, and lazy.

The quieter, more useful insight is that there are several useful ways to get more from AI depending on your internal capability. Buy from a specialist? A lab? Build your own model?

What made Ken Griffin speak out is that he has already chosen his path. Citadel sits on exactly the kind of proprietary data the base models are starving for. Everyone clipped his quote as a threat.

It reads better as a map.

The data you own and the work you wrap around the model is the game, because it's the one part nobody else can download.

Find your moat. Recognize your weaknesses. Play to your strengths.

4 Fintech Companies 💸

1. Shatterdome Energy - Power Supply Hedging Platform for AI data centers

Shatterdome Energy bundles renewables like solar panels, batteries, and wind farms into one virtual power plant for flexible electricity users (like data centres that can dial load up or down). They do this by creating structured derivatives tied to power prices, volatility, and system stress. This means buyers or sellers can get a fixed, consistent price, while traders can capitalize on any volatility.

🧠 Imagine buying a “power rate swap” where you can lock in a fixed price as a data center, but the market can trade the variable price if they see an opportunity to monetize the gaps. AI data centres switching on and off plus weather-dependent renewables means the grid sometimes pays you to take power, and sometimes charges $1,000+/MWh. Will solar and battery owners actually hand control of pricing their assets to a startup, or stick with the incumbents?
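A minimal sketch of how a fixed-for-floating power swap nets out, assuming a plain-vanilla structure rather than Shatterdome's actual product (which isn't detailed here). The prices, the load, and the `swap_settlement` function are hypothetical.

```python
def swap_settlement(fixed_price, floating_prices, mwh_per_hour):
    """Net amount the fixed-price payer receives from the swap counterparty.

    Positive when spot is above the fixed price, negative when it is below.
    """
    return sum((p - fixed_price) * mwh_per_hour for p in floating_prices)


FIXED_PRICE = 60.0                      # hypothetical locked-in price, $/MWh
SPOT_PRICES = [-5.0, 40.0, 1000.0, 65.0]  # hypothetical hourly spot, $/MWh
LOAD_MWH = 10.0                         # hypothetical data center load per hour

# What the data center pays the grid at spot, hour by hour.
physical_cost = sum(p * LOAD_MWH for p in SPOT_PRICES)

# What the swap pays back (or costs) over the same hours.
settlement = swap_settlement(FIXED_PRICE, SPOT_PRICES, LOAD_MWH)

# Net cost after the swap: spot payments minus swap receipts.
net_cost = physical_cost - settlement

print(f"Paid at spot: ${physical_cost:,.0f}")
print(f"Swap settlement received: ${settlement:,.0f}")
print(f"Net cost: ${net_cost:,.0f}")
```

Whatever the spot market does, including the negative hour and the $1,000 spike, the data center's net cost collapses to the fixed price times its load. The counterparty carries the volatility, which is the opportunity the traders are paying for.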

2. Farther - Tech-first RIA for breakaway advisors

Farther recruits advisors away from big “wirehouses” like Morgan Stanley, UBS, and Merrill onto a modern RIA platform with tax-loss harvesting, direct indexing, private markets access, a document hub, and back-office automation built in. The pitch is higher payouts than wirehouses pay, automation that takes the admin grunt-work off the advisor's plate, and a polished UHNW client experience. They claim a 1-3% after-tax return improvement just from the tax-intelligent layer.

🧠 At $23B in recruited assets and a fresh Series D, Farther is well past "is this a company or a feature." But direct indexing and tax-loss harvesting are table stakes: competitors like Schwab Personalized Indexing and Goldman PWM all do something similar. Can Farther continue to win over RIAs, or will the incumbents fight back in time with similar features?

3. Squid Router - Cross-chain swap router for wallets and apps

Squid lets users (and the apps embedding it) swap any token on any chain for any token on any other chain in a single transaction, e.g. ETH on Ethereum → USDC on Solana, or EURC → USDC across Base, Avalanche, and Celo. It powers MetaMask, Ripple, Ledger, MiniPay, and Circle's EURC ↔ USDC stablecoin rails, with $6B+ routed since 2022 across 100+ blockchains.

🧠 Cross-chain bridges have been the worst category in crypto by realized losses because of hacks like Wormhole ($325M), Ronin ($625M), and Nomad ($200M+). So $6B routed since 2022 is noteworthy. Can they win institutions? Cross-chain stablecoin FX (EURC ↔ USDC across chains) is the use case that could gain traction. But when those institutions build direct access to the top 4 or 5 chains, do routers become useful only for the long tail?

4. Truflation - Real-time inflation data, 45 days ahead of official figures.

Used by hedge funds, DeFi protocols, and macro retail traders to get accurate price data on millions of specific SKU-level items. Truflation aggregates daily price data into 338+ live indexes, including a real-time CPI that claims 0.99 correlation with the Bureau of Labor Statistics monthly print.

🧠 SKU level data is more valuable than the headline figure for many market actors. Inflation varies wildly by sector. If you’re a trader, knowing how prices are changing in real-time is an edge, vs prediction markets which tend to track larger compound figures, with simpler resolutions.

Things to know 👀

1. US Judge OK’s Visa & Mastercard $38bn swipe fee settlement

A Brooklyn judge gave preliminary approval to a $38bn deal with ~12m merchants, ending a case that started in 2005 over the fees merchants pay to accept cards. Under the terms, Visa and Mastercard lower fees by 0.1 percentage point for five years, and standard consumer rates get capped at 1.25% for eight. Judge Brian Cogan called it "fair, reasonable, and adequate" and signalled final approval is likely. (Reuters)

🧠 There's a huge catch. Honor All Cards is gone. It's the rule that forces your corner coffee shop to take a Chase Sapphire Reserve (a Visa Infinite, one of the priciest cards to accept) at the same till as a basic debit card even though it's more expensive. Interchange funds the rewards. Rewards drive the spend. (Merchants can’t block entire issuers, just the card types, so other Chase cards would likely work even if they didn’t like Sapphire).

🧠 Merchants picking and choosing could break the UX. Cards are now split into categories merchants can accept or decline: commercial, premium consumer (including most rewards cards), and standard consumer. If a merchant decided tomorrow not to accept your card, that promise of “when you see this logo, your card will work” gets broken.

🧠 No merchant turns away their best customers at the register. In practice, nearly 90% of credit card spend sits on rewards cards (Bankrate). Which is exactly why the networks could afford to hand it over.

🧠 That 10bps cut might never reach merchants. On flat pricing (Square, Stripe), the processor keeps the spread. Only interchange-plus (really big) merchants feel it. The little guy everyone lobbies for loses.

🧠 Rewards won't die, they'll change shape. Squeeze premium interchange and issuers shift from points to annual-fee-and-credit models. Sapphire Reserve and Amex Platinum already did.

🧠 This is a war that's never really over. The 2013 settlement got approved too. The appeals court threw it out in 2016. It drags on, and merchants are never happy. Lawfare is a reality in payments.
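To put numbers on the point about the 10bps cut not reaching small merchants, here is a rough sketch. Every rate and volume below is hypothetical, chosen only to show the mechanics, and it ignores per-transaction fixed fees and network fees for simplicity.

```python
def fee(volume_usd, rate_bps):
    """Card acceptance cost in dollars for a rate given in basis points."""
    return volume_usd * rate_bps / 10_000


VOLUME = 100_000            # hypothetical monthly card volume, USD
INTERCHANGE_BEFORE = 180    # hypothetical interchange, bps
INTERCHANGE_AFTER = INTERCHANGE_BEFORE - 10  # the 0.1 percentage point cut
FLAT_RATE = 290             # hypothetical flat rate a small merchant pays, bps
MARKUP = 30                 # hypothetical processor markup on interchange-plus, bps


def flat_merchant_cost(interchange_bps):
    # Flat pricing: the merchant pays the same rate whatever interchange is.
    return fee(VOLUME, FLAT_RATE)


def interchange_plus_merchant_cost(interchange_bps):
    # Interchange-plus: the merchant pays interchange plus a fixed markup.
    return fee(VOLUME, interchange_bps + MARKUP)


def flat_processor_spread(interchange_bps):
    # What the processor keeps on a flat-priced merchant.
    return fee(VOLUME, FLAT_RATE - interchange_bps)


flat_saving = flat_merchant_cost(INTERCHANGE_BEFORE) - flat_merchant_cost(INTERCHANGE_AFTER)
ip_saving = (interchange_plus_merchant_cost(INTERCHANGE_BEFORE)
             - interchange_plus_merchant_cost(INTERCHANGE_AFTER))
processor_gain = (flat_processor_spread(INTERCHANGE_AFTER)
                  - flat_processor_spread(INTERCHANGE_BEFORE))

print(f"Flat-priced merchant saves: ${flat_saving:,.0f}")
print(f"Interchange-plus merchant saves: ${ip_saving:,.0f}")
print(f"Processor's extra spread on the flat merchant: ${processor_gain:,.0f}")
```

On these made-up numbers, the interchange-plus merchant saves $100 a month, the flat-priced merchant saves nothing, and the processor's spread on the flat-priced merchant widens by the same $100, unless the processor chooses to pass it on.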

2. Kalshi to require employer disclosures before trading

Kalshi will require users to disclose their employer before trading in selected markets judged especially vulnerable to insider information or manipulation, including contracts tied to corporate performance, product launches, national security and geopolitical events. The exchange will risk score markets, screen out likely insiders, add easier whistleblower reporting and investigate employment claims when suspicious trading emerges.

🧠 Kalshi has seen high profile cases in the press and wants to be on the front foot. The policy follows recommendations from an independent surveillance committee and mounting scrutiny after Kalshi referred more than 20 cases to federal authorities in early 2026, including accounts linked to George Santos and military spouses.

🧠 Kalshi wants to argue regulated onshore > offshore prediction markets. If they demonstrate they can police insider information with compliance infrastructure and Polymarket does not, it puts clear distance between the two.

🧠 It also avoids the bigger concern many of us have. Everything is gambling now. And for vulnerable people, gambling is bad. I worry our consumer protection & frankly humanity gets lost in the debate about the blurred line between prediction & gambling, or the legal tussles between state and federal.

Good Reads 📚

1. Why stablecoin FX never took off

This piece argues that non-USDC or USDT liquidity doesn’t have enough momentum, and rather than trying to build a market in multiple local currencies, the better path is to reserve USDC/T and denominate in local fiat. It also says “single currency neobanks don’t get traction,” pointing to Wise, Revolut, and Airwallex. The stablecoin equivalent would use synthetic fiat derivatives, pegged to the real fiat rate. This synthetic hedge could then be honored by an off-ramp/liquidity provider just as FX swaps work today in TradFi. The merchant gets stability, and the liquidity provider gets alpha opportunity and / or off ramping fees.

🧠 While I don’t agree single currency neobanks don’t succeed (Nubank did pretty well), I do think the core argument here is solid.

🧠 If you’re offering someone a multi-currency account (like Stripe or Ramp do onchain), then you can help them hedge local FX risk with synthetic fiat derivatives onchain.


That's all, folks. 👋

Remember, if you're enjoying this content, please do tell all your fintech friends to check it out and hit the subscribe button :)

Want more? I also run the Tokenized podcast and newsletter.

(1) All content and views expressed here are the authors' personal opinions and do not reflect the views of any of their employers or employees.

(2) All companies or assets mentioned by the author in which the author has a personal and/or financial interest are denoted with a *. None of the above constitutes investment advice, and you should seek independent advice before making any investment decisions.

(3) Any companies mentioned are top of mind and used for illustrative purposes only.

(4) A team of researchers has not rigorously fact-checked this. Please don't take it as gospel—strong opinions weakly held

(5) Citations may be missing, and I’ve done my best to cite, but I will always aim to update and correct the live version where possible. If I cited you and got the referencing wrong, please reach out
