About This Benchmark
We believe the best way to evaluate AI models is through real-world usage, not synthetic benchmarks. Our data comes from actual agentic tasks performed on the Zip platform.
Our Mission
To provide the most accurate and transparent benchmark data for AI models by leveraging real-world usage patterns from production environments. We want to help developers and businesses make informed decisions about which AI models to use for their specific needs.
Why This Matters
Traditional benchmarks often use curated datasets that don't reflect real-world usage patterns. Models that perform well on synthetic benchmarks may struggle with actual production tasks. By testing with real agentic tasks from Zip, we provide insights that matter for real applications.
Methodology
How we collect, process, and present benchmark data from the Zip platform.
How We Obtain Data
The Senrie Zip Benchmark does not test with sample data. Every benchmark result comes from real user interactions on the Zip platform, not from curated test sets.
What We Use
- Telemetry Data
Aggregated usage metrics, including token counts, API response times, and task completion rates.
- Token Metrics
Input token count, output token count, and total token consumption per task type.
- Performance Metrics
API latency measurements (time to first token, or TTFT, plus P50 and P95 latency percentiles), throughput rates, and reliability scores.
- Task Classification
Token consumption patterns based on task type (coding, research, analysis, planning).
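As an illustration of how metrics like these can be aggregated without touching prompt or response content, here is a minimal sketch. The `TaskRecord` fields, the `summarize` helper, and the nearest-rank percentile method are assumptions made for this example, not the actual Zip telemetry schema or pipeline.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """Hypothetical telemetry record: counts and timings only, no content."""
    task_type: str        # e.g. "coding", "research", "analysis", "planning"
    input_tokens: int
    output_tokens: int
    ttft_ms: float        # time to first token, in milliseconds
    completed: bool


def percentile(values, pct):
    """Nearest-rank percentile (an assumed method; others interpolate)."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records):
    """Group records by task type and compute token and latency aggregates."""
    by_type = {}
    for record in records:
        by_type.setdefault(record.task_type, []).append(record)

    summary = {}
    for task_type, group in by_type.items():
        ttfts = [r.ttft_ms for r in group]
        input_total = sum(r.input_tokens for r in group)
        output_total = sum(r.output_tokens for r in group)
        summary[task_type] = {
            "input_tokens": input_total,
            "output_tokens": output_total,
            "total_tokens": input_total + output_total,
            "completion_rate": sum(r.completed for r in group) / len(group),
            "ttft_p50_ms": percentile(ttfts, 50),
            "ttft_p95_ms": percentile(ttfts, 95),
        }
    return summary
```

Every field in the record is a count, a timing, or a label, which is what allows aggregation of this kind to run without access to what users actually wrote.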
What We Never Use
- Your Inputs & Outputs
We never access, store, or process the actual content of your prompts or AI responses.
- Business Logic & Code
Your proprietary code, business logic, and intellectual property are never used for benchmarks.
- Personal Data
No personally identifiable information is collected or used in our benchmark calculations.
- Private Context
Project context, team discussions, and confidential business information remain completely private.
Intelligence Index
Composite score from four weighted evaluation categories based on real-world agentic tasks.
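A composite score of this kind is typically a weighted average. The sketch below shows the arithmetic only; the category names and weights are hypothetical placeholders, since the actual four categories and their weights are not stated here.

```python
def intelligence_index(scores, weights):
    """Weighted average of category scores.

    `scores` and `weights` are dicts keyed by category name. Both the
    category names and the weights used below are illustrative assumptions.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same categories")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Hypothetical example: four categories scored 0-100, with made-up weights.
example_scores = {"coding": 80, "research": 70, "analysis": 90, "planning": 60}
example_weights = {"coding": 4, "research": 2, "analysis": 2, "planning": 2}
index = intelligence_index(example_scores, example_weights)
```

Dividing by the total weight means the weights do not have to sum to 1, so they can be adjusted without rescaling the others.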
Speed Measurement
Measured in output tokens per second, with OpenAI tokens as the standard unit so that results are comparable across models.
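The speed figure reduces to a simple rate. This is a minimal sketch that assumes the output token count has already been expressed in the standard token unit and that the caller decides which time window counts as generation time (for example, whether it includes time to first token); neither detail is specified here.

```python
def output_tokens_per_second(output_tokens, generation_seconds):
    """Output speed as tokens per second.

    `output_tokens` is assumed to be counted in the standard token unit;
    `generation_seconds` is the measured generation window chosen by the caller.
    """
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return output_tokens / generation_seconds


# Hypothetical example: 500 output tokens generated in 4 seconds.
speed = output_tokens_per_second(500, 4.0)
```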
Price Calculation
Prices are based on provider-listed rates, blended into a single figure for easy comparison.
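One common way to blend input and output prices is a fixed-ratio weighted average. The sketch below assumes a 3:1 input-to-output token ratio and prices quoted per million tokens; both are illustrative assumptions, not the documented formula for this benchmark.

```python
def blended_price(input_price, output_price, input_parts=3, output_parts=1):
    """Blend input and output prices into one figure.

    Prices are assumed to be per million tokens. The default 3:1
    input:output ratio is a hypothetical choice for this example.
    """
    total_parts = input_parts + output_parts
    if total_parts <= 0:
        raise ValueError("ratio parts must sum to a positive number")
    return (input_price * input_parts + output_price * output_parts) / total_parts


# Hypothetical rates: $3.00 per million input tokens, $15.00 per million output.
price = blended_price(3.0, 15.0)
```

A single blended number makes models easy to rank by cost, at the expense of hiding how a workload's real input-to-output mix would shift the result.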
Data Safety & Transparency
Our commitment to protecting your data while providing accurate benchmarks.
Our Transparency Guarantee
Our benchmarks offer a transparent view of token usage efficiency and data safety across AI models. We measure how efficiently models use tokens for different task types, not what you're building or what you're saying.
The data we present is verifiable by providers. AI model providers can independently validate our benchmark results against their own telemetry data, as our metrics are based on standard API usage patterns and publicly available performance characteristics.
Privacy First
Your prompts, responses, code, and business logic are never accessed, stored, or used in our benchmarks.
Provider Verifiable
Our metrics use standard API patterns that providers can independently verify against their own data.
Full Transparency
We openly document our methodology, data sources, and calculation methods.
Zip Privacy Policy
For complete details on what data we collect and how we use it, please review the Zip platform privacy policy.
Definitions
Key terms used throughout this benchmark and their precise meanings.
Ready to explore the data?
View our comprehensive leaderboard to see how different models perform across all metrics.