Analysis updated 2026-05-18
Read the comparison analysis to understand why AI agents fail on multi-constraint shopping tasks and use the findings to design better purchase verification in your own AI shopping flow.
Use the 100 purchase intents as a test set to evaluate how well your own shopping agent handles constrained product searches.
Reference the error categorization to build prompts or guardrails that prevent AI agents from asserting constraint satisfaction without explicit evidence.
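One way to turn the "no assertion without explicit evidence" idea into a guardrail is to make every constraint name the catalog field that must supply its evidence, and to treat a missing field as a reason to pass on the purchase. The sketch below is illustrative only: the `Constraint` class, the `verify` and `should_buy` helpers, and the product field names are hypothetical and are not taken from the repository or from Shopify's tools.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Constraint:
    """One stated requirement (hypothetical structure, for illustration)."""
    name: str                      # label for the requirement
    field: str                     # product field that must provide the evidence
    check: Callable[[Any], bool]   # predicate applied to that field's value


def verify(product: dict[str, Any], constraints: list[Constraint]) -> dict[str, str]:
    """Return 'pass', 'fail', or 'no_evidence' for each constraint."""
    verdicts: dict[str, str] = {}
    for c in constraints:
        value = product.get(c.field)
        if value is None:
            # The listing says nothing about this requirement: never assume.
            verdicts[c.name] = "no_evidence"
        elif c.check(value):
            verdicts[c.name] = "pass"
        else:
            verdicts[c.name] = "fail"
    return verdicts


def should_buy(product: dict[str, Any], constraints: list[Constraint]) -> bool:
    """Approve only when every constraint is backed by evidence and passes."""
    return all(v == "pass" for v in verify(product, constraints).values())


# Made-up example data, not drawn from the benchmark.
board = {"title": "Bamboo cutting board", "price": 24.99, "material": "bamboo"}
board_rules = [
    Constraint("under_30", "price", lambda p: p < 30),
    Constraint("is_bamboo", "material", lambda m: m.lower() == "bamboo"),
]

journal = {"title": "Leather journal cover", "price": 42.0, "material": "leather"}
journal_rules = [
    Constraint("is_leather", "material", lambda m: m.lower() == "leather"),
    Constraint("snap_closure", "closure", lambda c: c == "snap"),
]

print(should_buy(board, board_rules))
print(verify(journal, journal_rules))
```

In this sketch the journal is rejected because its listing has no `closure` field, which mirrors the failure mode where a constraint is absent from the product data. Separating `no_evidence` from `fail` also makes it easy to log which requirements the catalog could not confirm.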
| | repyh-labs/delta-mandate-shopify-benchmarks | 195516184-a11y/esp32-mcp-parenting-robot | a-bissell/unleash-lite |
|---|---|---|---|
| Stars | 1 | 1 | 1 |
| Language | — | — | Python |
| Setup difficulty | easy | moderate | hard |
| Complexity | 1/5 | 3/5 | 4/5 |
| Audience | PM / founder | general | researcher |
Figures from each repo's GitHub metadata at analysis time.
This is a data-only repository of markdown files; there is no code to install or run.
This repository contains the data and analysis from a benchmark comparing two approaches to AI-assisted shopping on Shopify: a standard AI agent using Shopify's own tools, and the same agent with an additional verification layer called delta Mandate placed in front of it.

The setup was straightforward. One hundred purchase requests were created, ranging from simple ones like "bamboo cutting board under $30" to complex ones with five or more requirements at once, such as a leather journal cover in a specific color, pattern, and material, under a certain price, with a particular closure type. Both agents were given access to Shopify's product search and told to find the right product or pass if none existed.

The core finding concerns purchase errors. Of the products that the standard Shopify agent actually bought, 27.3 percent violated at least one of the stated requirements. For the hardest requests, those with five or more constraints, that error rate reached 42.9 percent. The delta Mandate agent, which checks every candidate product's details against the requirements before approving the purchase, had a 0 percent error rate across all 56 products it bought.

The README explains what went wrong in the failed cases. The standard agent mostly asserted that a requirement was satisfied without any supporting evidence in the product data. In 12 of the 18 errors, the constraint did not appear anywhere in the product listing, yet the agent reported it as satisfied. A handful of cases involved the agent making assumptions based on product type rather than reading the catalog data.

The repository does not contain code. It is a collection of markdown files with the full list of purchase intents, both agents' detailed results, and a comparison analysis that includes a confusion matrix. The benchmark data is published for verification and reproducibility. No open-source license is specified; the repository says to contact delta with questions.
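The reported figures can be sanity-checked with a little arithmetic. The sketch below treats the purchase error rate as violating purchases divided by total purchases. The standard agent's total purchase count is not stated in this summary, so the code infers it from the 18 errors and the 27.3 percent rate; that inferred count is an assumption to confirm against the repository's results files. The function name is hypothetical.

```python
def purchase_error_rate(violating: int, purchased: int) -> float:
    """Percentage of purchased products that violate at least one requirement."""
    if purchased == 0:
        return 0.0  # an agent that buys nothing makes no purchase errors
    return 100.0 * violating / purchased


# Figures quoted in the summary above.
standard_errors = 18
standard_rate_pct = 27.3
delta_errors = 0
delta_purchases = 56

# ASSUMPTION: the standard agent's purchase count is inferred, not quoted.
implied_standard_purchases = round(standard_errors / (standard_rate_pct / 100.0))

standard_rate = purchase_error_rate(standard_errors, implied_standard_purchases)
delta_rate = purchase_error_rate(delta_errors, delta_purchases)

print(implied_standard_purchases, round(standard_rate, 1), delta_rate)
```

Note that this metric only counts products that were actually bought. Requests where an agent passed are not purchase errors under this definition, which is why a confusion matrix is needed to see the full picture.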
Benchmark data showing that about 27% of the products a standard Shopify AI agent buys on constrained purchase requests violate at least one stated requirement, versus 0% with delta Mandate's verification layer added.
No standard open-source license is specified. The benchmark data is published for verification and reproducibility, with questions directed to delta.
Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.
Mainly PMs and founders.
Verify against the repo before relying on details.