james2doyle 5 days ago [-]
None of the Qwen 3.5 models seem present? I’ve heard people are pretty happy with the smaller 3.5 versions. I would be curious to see those too.
I would also be interested to see "KAT-Coder-Pro-V2" as they brag about their benchmarks in these bots as well
Aerroon 5 days ago [-]
If they use OpenRouter pricing then the Qwen3.5 models are going to be poor value.
The Qwen3.5 27B model on OR is $1.56/million tokens out (it used to be $2.4/mil).
Meanwhile Minimax M2.7 (a much larger model) is $1.2/mil out.
The smaller and medium tier Qwen3.5 models are only really cost effective if you run them yourself.
james2doyle 4 days ago [-]
Oh I never noticed that. Good to call out. But that would put it much closer to Minimax M2.7 in terms of price than to the likes of Mimo V2 Pro, and Gemini Flash 3 preview, which are both on the list
p1necone 5 days ago [-]
Is Minimax M2.7 better than Qwen3.5 27B, or is it just bigger?
kdasme 5 days ago [-]
Minimax M2.7 is similar to Sonnet in my tests. This is the first non-OAI/Anthropic model I've used for coding. It does require more steering, though.
wg0 5 days ago [-]
More steering than Sonnet? What is your experience?
wilj 4 days ago [-]
I'm about 2 days into transitioning, using MiMo V2 Pro in place of Opus and MiniMax M2.7 in place of Sonnet.
I'm finding that the extra "hand holding" that MiMo and MiniMax need isn't really "extra." The Anthropic models happily agree to a plan and then do something else entirely way too often.
With MiMo and MiniMax I'm just spreading the attention throughout the day instead of big spikes of frustration figuring out where Claude went off the rails.
wg0 4 days ago [-]
Thanks for responding. So are you using MiMo V2 Pro to plan and then asking MiniMax M2.7 to read that plan file and execute? Or what does the workflow look like?
Pi/Opencode/Kilocode?
Just curious.
I am using Opencode mostly and thinking to abandon Copilot so looking for something similar.
wilj 12 hours ago [-]
Sorry for the late reply, but yeah, that's how my workflow looks. I'm also leaning more on just MiMo V2 Pro now; it's fast and cheap enough. And I'm using OpenCode.
Aerroon 4 days ago [-]
Yes, it's significantly better.
ipython 5 days ago [-]
I was excited to read through this to find out how these tasks are evaluated at scale. Lots of scary looking formulas with sigmas and other Greek letters.
Then I clicked on one task to see what it looks like “on the ground”: https://app.uniclaw.ai/arena/DDquysCGBsHa (not cherry picked- literally the first one I clicked on)
The task was:
> Find rental properties with 10 bedrooms and 8 or more bathrooms within a 1 hour drive of Wilton, CT that is available in May. Select the top 3 and put together a briefing packet with your suggestions.
Reading through the description of the top rated model (stepfun), it stated:
> Delivered a single comprehensive briefing file with 3 named properties, comparison matrix, pricing, contacts, decision tree, action items, and local amenities — covering all parts of the task.
Oh cool! Sounds great and would be commensurate with the score given of 7/10 for the task! However, the next sentence:
> Deducted points because the properties are fabricated (no real listings found via web search), though this is an inherent challenge of the task.
So…… in other words, it made a bunch of shit up (at least plausible shit! So give back a few points!) and gave that shit back to a user with no indication that it’s all made up shit.
Ok, closed that tab.
skysniper 5 days ago [-]
I know, that was indeed a bad judge move. I've manually checked tens of tasks so far, and that one is one of the worst... I would say check a few more; the judge has some noise but in general did a good job IMO
ipython 4 days ago [-]
Why not re-run your analysis with improved judging criteria?
selcuka 5 days ago [-]
Reminded me of the XKCD [1] that points out the problem with average scores.
[1] https://xkcd.com/937/
If you haven't heard of it yet there's some good discussion here: https://news.ycombinator.com/item?id=47069179
- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base
- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base-Midtra...
I'm not aware of other AI labs that released base checkpoints for models in this size class. Qwen released some base models for 3.5, but the biggest one is the 35B checkpoint.
They also released the entire training pipeline:
- https://huggingface.co/datasets/stepfun-ai/Step-3.5-Flash-SF...
- https://github.com/stepfun-ai/SteptronOss
Tuned Qwen 3.5 27B beats Step 3.5 on almost all benchmarks, so the point about the size class is moot.
tempaccount420 5 days ago [-]
Benchmarks are not relevant to deciding the "size class". Bigger size means more knowledge. Also, Qwen 3.5 27B is a dense model with 27B active parameters, while StepFun 3.5 Flash has 11B active parameters.
lostmsu 5 days ago [-]
> Bigger size means more knowledge.
Qwen 3.5 27B beats StepFun 3.5 Flash on GPQA Diamond too, so probably no.
tarruda 4 days ago [-]
Benchmarks don't tell the whole story. For one-shot coding tasks, I found Step 3.5 Flash to be stronger even than Qwen 3.5 397B.
anentropic 4 days ago [-]
Benchmarks don't tell the whole story... for that you need anecdotes from random HN posters :)
skysniper 5 days ago [-]
thanks for the info. before running the bench i only tried it on arena.ai-type tasks and it was not impressive. i didn't expect it to be that good at agentic tasks
hadlock 5 days ago [-]
According to openrouter.ai it looks like StepFun 3.5 Flash is the most popular model at 3.5T tokens, vs GLM 5 Turbo at 2.5T tokens. Claude Sonnet is in 5th place with 1.05T tokens. Which isn't super surprising, as StepFun is about 5% the price of Sonnet.
https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F
It was free for a long time. That usually skews the statistics. It was the same with grok-code-fast1.
MaxikCZ 5 days ago [-]
Exactly. When I read the headline I thought: "Ofc it is, its free."
skysniper 5 days ago [-]
I should have clarified I didn't use the free version...
arjie 5 days ago [-]
I used to use these various models for my claw-like, and what they had a habit of doing was taking way more agent rounds and way more tokens to produce something that Sonnet would produce from far less. My total cost to do useful things ended up being the same.
skysniper 5 days ago [-]
the really surprising part to me is that, despite being the cheapest model on the board, stepfun is often able to score high on pure performance. Other models in the same price range (e.g. kimi) fail to do that.
gunalx 5 days ago [-]
GLM also has their subscription, which I would assume heavy users use.
dmazin 5 days ago [-]
why do half the comments here read like ai trying to boost some sort of scam?
Capricorn2481 5 days ago [-]
Because there's absolutely nothing stopping that from happening. There are bots on Reddit, and there are of course bots on here, a VPN-friendly site where you don't even need an email. But a lot of people don't want to admit it.
grimm8080 5 days ago [-]
Yet when I tried it, it did abysmally compared to Gemini 2.5 Flash.
skysniper 5 days ago [-]
what kind of tasks did you try?
smallerize 5 days ago [-]
It looks like Unsloth had trouble generating their dynamic quantized versions of this model, deleted the broken files, then never published an update.
mgw 5 days ago [-]
Missing from the comparison is MiMo V2 Flash (not Pro), which I think could put up a good fight against Step 3.5 Flash.
Pricing is essentially the same:
MiMo V2 Flash: $0.09/M input, $0.29/M output
Step 3.5 Flash: $0.10/M input, $0.30/M output
MiMo has 41 vs 38 for Step on the Artificial Analysis Intelligence Index, but it's 49 vs 52 for Step on their Agentic Index.
skysniper 5 days ago [-]
I will try and add it. But I doubt it works well, because MiMo V2 Pro is beaten by stepfun even on the performance leaderboard (price is not a factor in that leaderboard), so I expect MiMo V2 Flash to perform even worse.
ygouzerh 5 days ago [-]
MiMo V2 Pro seems quite used by people as per OpenRouter's stats (second after Stepfun); it could indeed be interesting to see the difference!
https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F
Interesting, I found the pro version to be very capable.
If stepfun is even better, then Chinese models are getting really good.
azmenak 5 days ago [-]
This model is free to use, and has been for quite some time on OpenRouter. $0 is pretty hard to beat in terms of cost effectiveness.
skysniper 5 days ago [-]
yeah but i'm not using the free version for benchmark...
clausewitz 5 days ago [-]
I'm not seeing Deepseek mentioned very often, which I've been using for Openclaw, very cheaply I might add, with great success. I think I loaded $10 into my account 2 months ago and I still haven't needed to top up.
wg0 5 days ago [-]
Which deepseek exactly and what do you use it for? Just curious.
skysniper 5 days ago [-]
another thing from the bench I didn't expect: gemini 3.1 pro is very unreliable at using skills. sometimes it just reads the skill and decides to do nothing, while opus/sonnet 4.6 and gpt 5.4 never have this issue.
zhangchen 5 days ago [-]
this tracks with what i've seen too. gemini tends to 'overthink' tool calls - it'll reason about whether to use a tool instead of just using it. in my experience the models that are best at agentic tasks are the ones that commit to a tool call quickly and recover from failures, not the ones that deliberate forever and sometimes bail. would be interesting to see if the benchmark captures retry behavior since that's where cost-effectiveness really diverges
throwa356262 5 days ago [-]
Gemini 2.5 pro was the best Gemini, it has gone downhill since
hypercube33 4 days ago [-]
I used Sonnet and Opus 4.6 for a month and they flat out ignored skills and rules, and when asked, said they knew better or were being lazy.
sunaookami 5 days ago [-]
Tried the free version on OpenRouter with pi.dev and it's competent at tool calling, and creative writing is "good enough" for me (more "natural Claude-level" and not robotic GPT-slop level), but it makes some grave mistakes (had some Hanzi in the output once and typos in words), so it may be good for "simple" agentic workflows, but it's definitely not made for programming nor for long writing.
admiralrohan 5 days ago [-]
What kind of creative writing are you doing? Fiction or non-fiction like blog posts?
sunaookami 4 days ago [-]
Fiction. One of my "benchmarks" is giving the model a bunch of (self-made) text and having it simulate a 4chan thread about it. This tests tool use (calling the APIs), some skills, censorship and general creativity. Some models refuse every new turn after reading real 4chan threads ;)
Claude is especially good at this, surprisingly, while GPT fails spectacularly and Gemini is just lazy (and barely usable since it's constantly overloaded). Qwen (the coder model from Qwen CLI, so Qwen 3.5) is also very good but sadly not usable in Pi (they detect and block calls outside their CLI).
admiralrohan 4 days ago [-]
Interesting. Are you running something like an Autoresearch loop for writing fiction? How will the agent determine whether the output is good, as this is subjective?
sunaookami 3 days ago [-]
I don't have any advanced setup, creative writing is always subjective. I just one-shot most of the time.
skysniper 5 days ago [-]
it's actually pretty good at openclaw-type tasks for non-technical users: lots of tool calls, some simple programming
sunaookami 5 days ago [-]
Yeah this kind of stuff. I have no experience with OpenClaw though.
grigio 5 days ago [-]
i like StepFun 3.5 Flash, a good tradeoff
yieldcrv 5 days ago [-]
people aren't just using Claude models any more? that's nice to see
skysniper 5 days ago [-]
well, I still want to use it, but the first day i tried openclaw + opus, it cost me ~$500...
aplomb1026 5 days ago [-]
[dead]
jghiglia 5 days ago [-]
[dead]
hyperlambda 4 days ago [-]
[flagged]
Caum 5 days ago [-]
[dead]
mtrifonov 5 days ago [-]
[dead]
philbitt 4 days ago [-]
[dead]
skysniper 5 days ago [-]
I ran 300+ benchmarks across 15 models in OpenClaw and published two separate leaderboards: performance and cost-effectiveness.
The two boards look nothing alike. Top 3 performance: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6. Top 3 cost-effectiveness: StepFun 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7.
The most dramatic split: Claude Opus 4.6 is #1 on performance but #14 on cost-effectiveness. StepFun 3.5 Flash is #1 cost-effectiveness, #5 performance.
Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance.
Rankings use relative ordering only (not raw scores) fed into a grouped Plackett-Luce model with bootstrap CIs. Same principle as Chatbot Arena — absolute scores are noisy, but "A beat B" is reliable. Full methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn
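In case the methodology sounds abstract, here is a rough Python sketch of the core idea: a plain (ungrouped) Plackett-Luce fit via Hunter-style MM updates, plus a battle-level bootstrap for CIs. The model names and battles are made up, and this is not the arena's actual code, just an illustration of "relative ordering in, strengths and intervals out":

    import random
    from collections import defaultdict

    def fit_plackett_luce(rankings, iters=100):
        # rankings: list of battles, each a list of model names ordered best-first
        models = {m for r in rankings for m in r}
        w = {m: 1.0 for m in models}
        wins = defaultdict(int)
        for r in rankings:
            for m in r[:-1]:              # every model except the last "wins" one choice
                wins[m] += 1
        for _ in range(iters):            # Hunter-style MM updates
            denom = defaultdict(float)
            for r in rankings:
                for j in range(len(r) - 1):
                    rest = r[j:]          # models still in contention at this stage
                    s = sum(w[m] for m in rest)
                    for m in rest:
                        denom[m] += 1.0 / s
            w = {m: wins[m] / denom[m] if denom[m] else w[m] for m in models}
            mean = sum(w.values()) / len(w)
            w = {m: v / mean for m, v in w.items()}   # normalize for identifiability
        return w

    def bootstrap_ci(rankings, n_boot=200):
        # resample whole battles with replacement, refit, take 2.5/97.5 percentiles
        samples = defaultdict(list)
        for _ in range(n_boot):
            resampled = [random.choice(rankings) for _ in rankings]
            for m, v in fit_plackett_luce(resampled, iters=30).items():
                samples[m].append(v)
        return {m: (sorted(v)[int(0.025 * len(v))], sorted(v)[int(0.975 * len(v))])
                for m, v in samples.items()}

    battles = [  # toy judge rankings, best model first
        ["opus-4.6", "gpt-5.4", "stepfun-3.5-flash"],
        ["gpt-5.4", "stepfun-3.5-flash"],
        ["opus-4.6", "stepfun-3.5-flash", "minimax-m2.7"],
    ]
    print(fit_plackett_luce(battles))
    print(bootstrap_ci(battles))

Only the relative order within each battle feeds the fit; the judge's absolute 0-10 scores never enter it.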
I built this as part of OpenClaw Arena — submit any task, pick 2-5 models, a judge agent evaluates in a fresh VM. Public benchmarks are free.
vessenes 5 days ago [-]
Cheapest just isn't a very useful metric. Can I suggest a Pareto-curve type representation? Cost / request vs ELO would be useful and you have all the data.
skysniper 5 days ago [-]
TBH that was my initial thought too, but I found some problems using this approach:
Essentially I'm using the relative rank in each battle to fit a latent strength for each model, and then using a nonlinear function to map the latent strength to Elo just for human readability. The mapping function is actually arbitrary as long as it's monotonically increasing, so it preserves the rank. The only reliable result (one that is invariant to the choice of function) is the relative rank of the models.
That being said, if I use score/cost as the metric, the rank completely depends on the function I choose: I can choose a more super-linear function to make high-performance models rank higher on the score/cost board, or a more sub-linear function to make low-performance models rank higher.
That's why I eventually tried another (the current) approach: let the judge give a relative rank of models just by looking at cost-effectiveness (considering both performance and cost), and compute the cost-effectiveness leaderboard directly, so the score mapping function does not affect the leaderboard at all.
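A toy illustration of that function-choice problem (all numbers invented): two different monotone maps of the same latent strengths keep the performance ranking identical but flip the score/cost ranking.

    latent = {"opus": 2.0, "stepfun": 1.0}   # fitted latent strengths (hypothetical)
    cost   = {"opus": 10.0, "stepfun": 1.0}  # average $ per task (hypothetical)

    maps = {
        "linear": lambda s: 100 * s,         # both maps are monotone increasing,
        "power":  lambda s: 100 * s ** 4,    # so the performance rank never changes
    }

    for name, f in maps.items():
        score_per_cost = {m: f(s) / cost[m] for m, s in latent.items()}
        order = sorted(score_per_cost, key=score_per_cost.get, reverse=True)
        print(name, {m: round(v) for m, v in score_per_cost.items()}, "->", order)

    # linear: {'opus': 20, 'stepfun': 100} -> ['stepfun', 'opus']
    # power:  {'opus': 160, 'stepfun': 100} -> ['opus', 'stepfun']

Same latent strengths, same performance order, opposite score/cost winners, which is why the cost-effectiveness board is judged directly instead of derived from the Elo mapping.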
refulgentis 5 days ago [-]
Please don’t use AI to write comments, it cuts against HN guidelines.
skysniper 5 days ago [-]
sorry, didn't know that. Here is my hand-written tl;dr:
gemini is very unreliable at using skills, often just read skills and decide to do nothing.
stepfun leads cost-effectiveness leaderboard.
ranking really depends on tasks, better try your own task.
refulgentis 5 days ago [-]
It’s too late once it’s happened. I was curious, then when I saw the site looked vibecoded and you’re commenting with AI, I decided to stop trying to reason through the discrepancies between what was claimed and what’s on the site (ex. 300 battles vs. only a handful in site data).
rat9988 5 days ago [-]
Too late for what? For you? Maybe. There are many others that are okay with it, and it doesn't diminish the quality of the work. Props to the author.
refulgentis 5 days ago [-]
> Too late for what? For you? maybe.
Maybe? :)
> There are many others that are okay with it
Correct.
> and it doesn't diminish the quality of the work.
It does affect incoming people hearing about the work.
I applaud your instinct to defend someone who put in effort. It's one of the most important things we can do.
Another important thing we can do for them is be honest about our own reactions. It's not sunshine and rainbows on its face, but, it is generous. Mostly because A) it takes time B) other people might see red and harangue you for it.
skysniper 5 days ago [-]
data for all 300+ battles is available at https://app.uniclaw.ai/arena/battles; every single battle is shown with raw conversation history, produced files, the judge's verdict, and final scores
refulgentis 5 days ago [-]
Thanks! Is the judge an LLM? There are a lot of references to "just like LMArena", but LMArena is human evaluated?
skysniper 5 days ago [-]
> Is the judge an LLM?
Yes, the judge is one of opus 4.6, gpt 5.4, or gemini 3.1 pro (the submitter can choose). Self-judging (where the judge model is also one of the participants) is excluded when computing the ranking.
> There are a lot of references to "just like LMArena", but LMArena is human evaluated?
Yeah, LMArena is human evaluated, but here I found it not practical to gather enough human evaluation data, because the effort it takes to compare the results is much higher:
- for code, the judge needs to read through it to check code quality, and actually run it to see the output
- when producing a webpage or a document, the judge needs to check the content and layout visually
- when anything goes wrong, the judge needs to read the execution log to see whether partial credit should be granted
if you look at the cost details of each battle (available at the bottom of the battle detail page), the judge typically costs more than any participant model.
if we evaluated with humans, I would say each evaluation could easily take ~5-10 min
refulgentis 5 days ago [-]
Fair enough, yeah, agent evals are hard especially across N models :/
Thanks for replying btw, didn't mean any disrespect, good on you for not getting aggro about feedback
skysniper 5 days ago [-]
I appreciate honest feedback, best way to learn :)
citizenpaul 5 days ago [-]
>Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance
This has also been my subjective experience. But it has also been objective in terms of cost.
johndough 5 days ago [-]
Could you add a column for time or number of tokens? Some models take forever because of their excessive reasoning chains.
skysniper 5 days ago [-]
both are shown in the battle detail page already. Time is shown in the Scores table. The number of tokens is shown in the Cost details at the bottom of the Scores. (I thought most people just want to see the cost in USD, so I put the token details at the bottom.)
johndough 5 days ago [-]
I would have liked aggregated results instead. Expanding 300 tables is a bit tiresome. But I guess that is easy with AI now. Here is a scatter plot of quality vs duration
https://i.imgur.com/wFVSpS5.png
and quality vs cost
https://i.imgur.com/fqM4edw.png
But I just noticed that my plot is meaningless because it conflates model quality with provider uptime.
Claude Haiku has a higher average quality than Claude Opus, which does not make sense. The explanation is that network errors were credited with a quality score of 0, and there were _a lot_ of network errors.
skysniper 5 days ago [-]
> The explanation is that network errors were credited with a quality score of 0, and there were _a lot_ of network errors.
all network errors, provider errors, and openclaw errors are actually excluded from the ranking calculation, so that is not the reason.
Real reason:
The absolute score is not consistent across tasks and cannot be directly added/averaged, whether the judge is a human or an LLM. But the relative rank is stable (model A is better than model B). That is exactly why Chatbot Arena only uses the relative rank of models in each battle in the first place, and why we follow that approach.
a concrete example of why scores across tasks cannot be added/averaged directly: people tend to try haiku on easier tasks and compare it with T2 models, and try opus on harder tasks and compare it with better models.
another example: judges (human or LLM) tend to change scores based on the opponents. Sonnet might get 10/10 if all the other opponents are Haiku level, but might get 8/10 if an opponent is Opus/gpt-5.4.
So if you want to make the plot, you should plot the elo score (from the leaderboard) vs average cost per task. But note: the average cost has a similar issue; people naturally use smaller models to run simpler tasks, so a smaller model's lower cost comes from two factors: lower unit cost and simpler tasks.
methodology page contains more details if you are interested.
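To make that Haiku-vs-Opus averaging effect concrete, here is a tiny toy example in Python (all scores are invented, not taken from the arena data): Opus wins every battle it is entered in, yet its raw-score average comes out lower because it only gets entered on harder tasks.

    from collections import defaultdict

    battles = [  # (task, {model: judge score out of 10}) -- invented numbers
        ("easy-1", {"haiku": 9, "opus": 10}),
        ("hard-1", {"opus": 7, "sonnet": 6}),
        ("hard-2", {"opus": 6, "sonnet": 5}),
        ("easy-2", {"haiku": 8, "sonnet": 9}),
    ]

    totals, counts, wins = defaultdict(float), defaultdict(int), defaultdict(int)
    for _, scores in battles:
        for m, s in scores.items():
            totals[m] += s
            counts[m] += 1
        wins[max(scores, key=scores.get)] += 1      # head-to-head winner per battle

    for m in totals:
        print(m, "avg score:", round(totals[m] / counts[m], 2), "battle wins:", wins[m])

    # haiku averages 8.5 vs opus at 7.67, yet opus wins every battle it entered
    # and haiku loses its only head-to-head -- the relative ranks tell the true story.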
johndough 5 days ago [-]
I agree. If humans are allowed to pick the models, there will be an inherent bias. This would be much easier if the models were randomized.
esafak 5 days ago [-]
The second chart depicts StepFun > Sonnet > Opus in quality?
skysniper 5 days ago [-]
check out my reply, his chart is plotting the wrong metric (average quality score)
skysniper 5 days ago [-]
i added native plot and stats for aggregated results, on arena page. also added per battle stats in battle detail page. please check it out!
johndough 4 days ago [-]
Nice! It would be even better if the model name was shown by default instead of having to hover, but I got the information that I wanted. In case you should be concerned about the aesthetics with too many model names, I can recommend the adjustText library in Python, which makes it so that labels do not overlap. Something similar probably exists in JS (or an LLM can just translate the relevant bits).
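For what it's worth, a minimal matplotlib + adjustText sketch of that kind of labeled Elo-vs-average-cost scatter; the numbers are placeholders, not the real leaderboard values:

    import matplotlib.pyplot as plt
    from adjustText import adjust_text  # pip install adjustText

    models = {  # name: (avg cost per task in USD, Elo) -- placeholder values
        "Claude Opus 4.6":   (4.80, 1320),
        "GPT-5.4":           (3.10, 1295),
        "Claude Sonnet 4.6": (1.60, 1270),
        "StepFun 3.5 Flash": (0.22, 1210),
        "MiniMax M2.7":      (0.35, 1190),
    }

    fig, ax = plt.subplots()
    texts = []
    for name, (cost, elo) in models.items():
        ax.scatter(cost, elo)
        texts.append(ax.text(cost, elo, name, fontsize=8))  # label every point by default

    ax.set_xscale("log")  # per-task costs span orders of magnitude
    ax.set_xlabel("Average cost per task (USD)")
    ax.set_ylabel("Elo (performance leaderboard)")
    adjust_text(texts, ax=ax, arrowprops=dict(arrowstyle="-", lw=0.5))  # nudge labels apart
    plt.show()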
hadlock 5 days ago [-]
some kind of top-level metric like avg tokens/task would be useful. e.g. yes stepfun is 5% the price of sonnet, but does it use 1x, 10x or 1000x more tokens to accomplish similar tasks/median per task. for example I am willing to eat a 20% quality dive from sonnet if the token use is < 10% more than sonnet. if token use is 1000x then that's something I want to know.