Documentation

About This Benchmark

We believe the best way to evaluate AI models is through real-world usage, not synthetic benchmarks. Our data comes from actual agentic tasks performed on the Zip platform.

Our Mission

To provide the most accurate and transparent benchmark data for AI models by leveraging real-world usage patterns from production environments. We want to help developers and businesses make informed decisions about which AI models to use for their specific needs.

Why This Matters

Traditional benchmarks often use curated datasets that don't reflect real-world usage patterns. Models that perform well on synthetic benchmarks may struggle with actual production tasks. By testing with real agentic tasks from Zip, we provide insights that matter for real applications.

Section 1

Methodology

How we collect, process, and present benchmark data from the Zip platform.

How We Obtain Data

The Senrie Zip Benchmark does not test with sample data. Every benchmark result comes from real user interactions on the Zip platform, not from curated test sets.

What We Use

  • Telemetry Data

    Aggregated usage metrics including token quantities, API response times, and task completion rates.

  • Token Metrics

    Input token count, output token count, and total token consumption per task type.

  • Performance Metrics

    API latency measurements (TTFT, P50, P95), throughput rates, and reliability scores.

  • Task Classification

    Token consumption patterns based on task type (coding, research, analysis, planning).
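
As a rough illustration of the metadata-only approach listed above, the sketch below aggregates token counts per task type. The `TaskTelemetry` record and its field names are hypothetical, not the actual Zip telemetry schema; the point is only that such a record can carry counts and a task label without any prompt or response text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTelemetry:
    """Hypothetical metadata-only record; it has no field for prompt or response text."""
    task_type: str      # e.g. "coding", "research", "analysis", "planning"
    input_tokens: int
    output_tokens: int
    completed: bool


def tokens_per_task_type(records):
    """Sum input, output, and total tokens for each task type."""
    totals = defaultdict(lambda: {"input": 0, "output": 0, "total": 0})
    for r in records:
        bucket = totals[r.task_type]
        bucket["input"] += r.input_tokens
        bucket["output"] += r.output_tokens
        bucket["total"] += r.input_tokens + r.output_tokens
    return dict(totals)


def completion_rate(records):
    """Fraction of tasks marked completed (0.0 when there are no records)."""
    records = list(records)
    if not records:
        return 0.0
    return sum(r.completed for r in records) / len(records)
```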

What We Never Use

  • Your Inputs & Outputs

    We never access, store, or process the actual content of your prompts or AI responses.

  • Business Logic & Code

    Your proprietary code, business logic, and intellectual property are never used for benchmarks.

  • Personal Data

    No personally identifiable information is collected or used in our benchmark calculations.

  • Private Context

    Project context, team discussions, and confidential business information remain completely private.

Intelligence Index

Composite score from four weighted evaluation categories based on real-world agentic tasks.

Agentic Coding 30%
Agentic Research 25%
Agentic Analysis 25%
Agentic Planning 20%
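
The weighting above can be sketched as a simple weighted sum. This assumes each category already has a score on a shared scale (for example 0 to 100); how those per-category scores are produced is not described here, and the names `WEIGHTS` and `intelligence_index` are illustrative.

```python
# Category weights as listed above; they sum to 1.0.
WEIGHTS = {
    "agentic_coding": 0.30,
    "agentic_research": 0.25,
    "agentic_analysis": 0.25,
    "agentic_planning": 0.20,
}


def intelligence_index(category_scores):
    """Weighted sum of per-category scores (assumed to share one scale)."""
    missing = set(WEIGHTS) - set(category_scores)
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)
```

With made-up category scores of 80, 60, 70, and 50, the index works out to 0.30 x 80 + 0.25 x 60 + 0.25 x 70 + 0.20 x 50 = 66.5.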

Speed Measurement

Measured in output tokens per second using OpenAI tokens as a standard unit for fair comparison.

TTFT Time to first token
P50 Median latency
P95 95th percentile
tok/s Tokens per second
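
A minimal sketch of how these figures could be derived from raw timings. Two assumptions are made that the text does not confirm: percentiles use the nearest-rank method (so P50 on an even number of samples is the lower of the two middle values), and tokens per second is measured over the generation phase only, after the first token arrives.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def output_tokens_per_second(output_tokens, ttft_s, total_s):
    """Output speed over the generation phase (assumption: TTFT is excluded)."""
    generation_s = total_s - ttft_s
    if generation_s <= 0:
        raise ValueError("total time must exceed time to first token")
    return output_tokens / generation_s
```

For example, a response of 500 output tokens with a 0.5 s TTFT and 5.5 s total time gives 500 / 5.0 = 100 tokens per second under these assumptions.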

Price Calculation

Prices based on provider-listed rates with a blended calculation for easy comparison.

Cache Hit 70%
Input 20%
Output 10%
Currency USD / 1M tokens

Section 2

Data Safety & Transparency

Our commitment to protecting your data while providing accurate benchmarks.

Our Transparency Guarantee

Our benchmarks represent a transparent view of token usage efficiency and data safety across AI models. We measure how efficiently models use tokens for different task types — not what you're building or what you're saying.

The data we present is verifiable by providers. AI model providers can independently validate our benchmark results against their own telemetry data, as our metrics are based on standard API usage patterns and publicly available performance characteristics.

Privacy First

Your prompts, responses, code, and business logic are never accessed, stored, or used in our benchmarks.

Provider Verifiable

Our metrics use standard API patterns that providers can independently verify against their own data.

Full Transparency

We openly document our methodology, data sources, and calculation methods for complete transparency.

Zip Privacy Policy

For complete details on what data we collect and how we use it, please review the Zip platform privacy policy.

Section 3

Definitions

Key terms used throughout this benchmark and their precise meanings.

01
Model
A large language model (LLM), including proprietary, open-source, and open-weights models.
02
Model Creator
The organization that developed and trained the model. For example, OpenAI is the creator of GPT-4 and Meta is the creator of Llama 3.
03
Endpoint
A hosted instance of a model that can be accessed via an API. A single model may have multiple endpoints across different providers.
04
System
A dedicated compute environment for running AI models, typically provisioned infrastructure such as a VM, which can be benchmarked for performance under load.
05
Provider
A company that hosts and provides access to one or more model endpoints or systems. Examples include OpenAI, AWS Bedrock, and Together.ai. A company is often both a Model Creator and a Provider.
06
Serverless
A cloud service provided on an as-used basis. For LLM inference APIs, this generally means pricing per token of input and output. Serverless cloud products still run on servers.
07
Open Weights
A model whose weights have been released publicly by the model's creator. We refer to 'open weights' or just 'open' models rather than 'open-source' as many open LLMs have been released with licenses that do not meet the full definition of open-source software.
08
Token
Modern LLMs are built around tokens — numerical representations of words and characters. LLMs take tokens as input and generate tokens as output. Input text is translated into tokens by a tokenizer. Different LLMs use different tokenizers.
09
OpenAI Tokens
Tokens as generated by OpenAI's o200k_base tokenizer (the encoding used by GPT-4o), generally measured for benchmarking with OpenAI's tiktoken package for Python. We use OpenAI tokens as a standard unit of measurement to allow fair comparisons between models. All 'tokens per second' metrics refer to OpenAI tokens.
10
Native Tokens
Tokens as generated by an LLM's own tokenizer. We refer to 'native tokens' to distinguish from 'OpenAI tokens'. Prices generally refer to native tokens.
11
Price (Input/Output)
The price charged by a provider per input token sent to the model and per output token received from the model. Prices shown are the current prices listed by providers.
12
Price (Blended)
To enable easier comparison, we calculate a blended price assuming a 7:2:1 ratio of cache hit, input, and output tokens.
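
The 7:2:1 blend can be written out as a short calculation. The function name and the example prices below are illustrative only, not real provider rates; all prices are in USD per 1M tokens.

```python
def blended_price(cache_hit_price, input_price, output_price):
    """Blend three per-1M-token prices using a 7:2:1 ratio (cache hit : input : output)."""
    return (7 * cache_hit_price + 2 * input_price + 1 * output_price) / 10
```

With made-up prices of 0.50 (cache hit), 2.00 (input), and 10.00 (output), the blended price is (3.50 + 4.00 + 10.00) / 10 = 1.75 USD per 1M tokens. Because cache hits carry 70% of the weight, the blended figure sits much closer to the cache-hit price than to the output price.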
