repyh-labs/delta-mandate-shopify-benchmarks

Analysis updated 2026-05-18

★ 1 | Audience · pm founder | Complexity · 1/5 | Setup · easy

Mindmap

mindmap
  root((Shopify benchmarks))
    Benchmark setup
      100 purchase intents
      Easy to hard difficulty
      Shopify UCP CLI
    Agents compared
      Standard Shopify agent
      delta Mandate agent
    Results
      27.3% purchase error rate
      0% with verification
      42.9% on hard intents
    Error types
      Missing constraint evidence
      Product-type assumptions
      Incorrect catalog values
    Contents
      intents.md
      Results markdown files
      Comparison analysis

What do people build with it?

USE CASE 1

Read the comparison analysis to understand why AI agents fail on multi-constraint shopping tasks and use the findings to design better purchase verification in your own AI shopping flow.

USE CASE 2

Use the 100 purchase intents as a test set to evaluate how well your own shopping agent handles constrained product searches.

USE CASE 3

Reference the error categorization to build prompts or guardrails that prevent AI agents from asserting constraint satisfaction without explicit evidence.
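For use case 2, scoring your own agent against the intents could look roughly like the sketch below. It computes the purchase error rate the same way the benchmark reports it: wrong purchases divided by all purchases, with passes excluded from the denominator. The record format (`decision`, `correct`) and the sample data are assumptions made for illustration. They are not the repository's file format or its results.

```python
from collections import Counter


def score(records):
    """Tally agent outcomes and compute the purchase error rate.

    records: iterable of (decision, correct) pairs, where decision is
    "buy" or "pass" and correct is a bool judged against ground truth.
    This pair format is hypothetical, not taken from the repository.
    """
    counts = Counter()
    for decision, correct in records:
        counts[(decision, correct)] += 1

    # Only purchases count toward the error rate; passes are excluded.
    purchases = counts[("buy", True)] + counts[("buy", False)]
    error_rate = counts[("buy", False)] / purchases if purchases else 0.0
    return counts, error_rate


# Made-up sample outcomes, for illustration only.
sample = [("buy", True), ("buy", True), ("buy", False), ("pass", True)]
counts, error_rate = score(sample)
print(f"purchase error rate: {error_rate:.1%}")
```

The `counts` tally doubles as a small confusion matrix (buy or pass, crossed with correct or incorrect), which is the kind of breakdown the comparison analysis reports.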

What is it built with?

Markdown · Shopify UCP CLI

How does it compare?

	repyh-labs/delta-mandate-shopify-benchmarks	195516184-a11y/esp32-mcp-parenting-robot	a-bissell/unleash-lite
Stars	1	1	1
Language	—	—	Python
Setup difficulty	easy	moderate	hard
Complexity	1/5	3/5	4/5
Audience	pm founder	general	researcher

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · easy | Time to first run · 5 min

This is a data-only repository of markdown files; there is no code to install or run.

The repository has no standard open-source license. The benchmark data is published for verification and reproducibility, and questions are directed to delta.

In plain English

This repository contains the data and analysis from a benchmark comparing two approaches to AI-assisted shopping on Shopify: a standard AI agent using Shopify's own tools, and the same agent with an additional verification layer called delta Mandate placed in front of it.

The setup was straightforward. One hundred purchase requests were created, ranging from simple ones like "bamboo cutting board under $30" to complex ones with five or more requirements at once, such as a leather journal cover in a specific color, pattern, and material, under a certain price, with a particular closure type. Both agents were given access to Shopify's product search and told to find the right product or pass if none existed.

The core finding concerns purchase errors. Of the products the standard Shopify agent actually bought, 27.3 percent violated at least one of the stated requirements. For the hardest requests, those with five or more constraints, the error rate reached 42.9 percent. The delta Mandate agent, which checks every candidate product's details against the requirements before approving a purchase, had a 0 percent error rate across all 56 products it bought.

The README explains what went wrong in the failed cases. The standard agent mostly asserted that a requirement was satisfied without any evidence from the product data. In 12 of the 18 errors, the constraint was not present anywhere in the product listing and the agent approved the purchase anyway. A handful of cases involved the agent making assumptions based on product type rather than reading the catalog data.

The repository does not contain code. It is a collection of markdown files with the full list of purchase intents, both agents' detailed results, and a comparison analysis that includes a confusion matrix. The benchmark data is published for verification and reproducibility. No open-source license is specified; the repository says to contact delta with questions.
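The verification idea described above, approving a purchase only when every requirement has explicit evidence in the product data, can be sketched as follows. This is a minimal illustration under stated assumptions, not delta Mandate's implementation: the `Constraint` class, the `verify` function, and the product field names are all hypothetical, and real Shopify catalog data would need its own field mapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Constraint:
    field: str                         # hypothetical attribute name, e.g. "material"
    expected: Optional[str] = None     # exact value required (case-insensitive)
    max_value: Optional[float] = None  # exclusive upper bound, e.g. a price ceiling


def verify(product: dict, constraints: list) -> tuple:
    """Return (approved, problems); approve only with evidence for every constraint."""
    problems = []
    for c in constraints:
        value = product.get(c.field)
        if value is None:
            # Missing data is a failure, never an implicit pass.
            problems.append(f"{c.field}: no evidence in product data")
        elif c.expected is not None and str(value).strip().lower() != c.expected.lower():
            problems.append(f"{c.field}: expected {c.expected!r}, found {value!r}")
        elif c.max_value is not None and float(value) >= c.max_value:
            problems.append(f"{c.field}: {value} is not under {c.max_value}")
    return (not problems, problems)


# Illustrative product record for "bamboo cutting board under $30".
board = {"title": "Bamboo cutting board", "material": "bamboo", "price": 24.99}
wanted = [Constraint("material", expected="bamboo"), Constraint("price", max_value=30)]
approved, problems = verify(board, wanted)
print(approved, problems)
```

The key design choice is the `value is None` branch: a constraint with no supporting data blocks the purchase. That targets the most common failure the README describes, where the agent claimed a requirement was met although the listing said nothing about it.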

Copy-paste prompts

Prompt 1

I am building an AI shopping assistant. Based on the delta Mandate Shopify benchmark findings, what types of purchase constraints cause the most errors, and how should I structure my verification step?

Prompt 2

Using the error categories from this Shopify benchmark, write a prompt instruction that tells an AI agent it must cite explicit evidence from product data before claiming a constraint is satisfied.

Prompt 3

I want to reproduce this Shopify AI shopping benchmark. Given the methodology described in the README, help me write a test harness that takes a list of purchase intents and evaluates an agent's responses.

Frequently asked questions

What is delta-mandate-shopify-benchmarks?

Benchmark data showing that 27.3% of a standard Shopify AI agent's purchases on constrained requests violated at least one stated requirement, versus 0% with delta Mandate's verification layer added.

What license does delta-mandate-shopify-benchmarks use?

It has no standard open-source license. The benchmark data is published for verification and reproducibility, and questions are directed to delta.

How hard is delta-mandate-shopify-benchmarks to set up?

Setup difficulty is rated easy, with roughly 5 minutes to get started. The repository is data-only, so there is nothing to install or run.

Who is delta-mandate-shopify-benchmarks for?

Mainly product managers and founders.

Verify against the repo before relying on details.