Hacker News

Respectfully, from my experience and a few billion tokens consumed, some open-source models really are strong and useful. Specifically StepFun-3.5-Flash: https://github.com/stepfun-ai/Step-3.5-Flash

I'm working on a fairly complex Rust codebase right now, with hundreds of integration tests and nontrivial concurrency, and StepFun powers through.

I have no affiliation with StepFun; I'm saying this purely out of deep respect for the team that managed to pack this performance into a 196B-total/11B-active envelope.




What coding agent do you use with StepFun-3.5-Flash? I just tried it from SiliconFlow's API with opencode. Tool calling is broken: AI_InvalidResponseDataError: Expected 'function.name' to be a string.
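One way around a broken-tool-call crash like that is to validate the model's tool calls before the client library sees them. A minimal sketch, assuming the OpenAI-style tool-call shape (`function.name` / `function.arguments`); `sanitize_tool_calls` is a hypothetical helper, not part of opencode:

```python
# Hypothetical guard for the failure mode behind
# "Expected 'function.name' to be a string": the model sometimes
# emits a tool call whose function.name is null or missing.

def sanitize_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Drop malformed tool calls instead of letting the client crash."""
    valid = []
    for call in tool_calls:
        fn = call.get("function") or {}
        name = fn.get("name")
        if isinstance(name, str) and name:
            valid.append(call)
        # else: skip the call (or re-prompt the model for a retry)
    return valid


calls = [
    {"function": {"name": "read_file", "arguments": "{}"}},
    {"function": {"name": None, "arguments": "{}"}},  # malformed
]
print([c["function"]["name"] for c in sanitize_tool_calls(calls)])
# → ['read_file']
```

In practice you'd put this in a small proxy between the provider and the agent, since the crash happens inside the client before you can catch it.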

I use pi, but I'm almost done writing a better alternative that doesn't have pi's stability issues. 80K Rust SLOC and a few hundred tests btw.

Any place we can look for you to release this?

Yeah, my github is in the profile. Soon (tm). Feel free to follow.

Are you using stepfun mostly because it's free, or is it better than other models at some things?

I think we're at a point where the hard ceiling of a strong model is difficult to delineate reliably (at least in coding; in research work it's clearer ofc) - and in a good sense: with suitable task decomposition, a test harness, or a good abstraction, you can make the model do what you thought it could not. StepFun is a strong model, and I really enjoyed studying it and comparing it to others by coding fairly complex projects semi-autonomously (will do a write-up on this soon tm).

Even purely pragmatically, StepFun covers 95% of my research+SWE coding needs, and for the remaining 5% I can access the large frontier models. I was surprised StepFun is even decent at planning and research, so it is possible to get by with it and nothing else (1), but ofc for minmaxing the best frontier model is still the best planner (although the latest deepseek is surprisingly good too).

Finally we are at a point where there is a clear separation of labor between frontier and strong+fast models, but tbh shoehorning StepFun into this "strong+fast" category feels limiting; I think it has greater potential.


I pay for Copilot to access Anthropic, Google, and OpenAI models.

Claude Code always gives me rate limits. Claude through Copilot is a bit slow, and Copilot has constant network-request issues or something, but at least I don't get rate-limited as often.

At least local models always work, are faster (50+ tps with Qwen3.5 35B-A4B on a 4090), and, most importantly, never hit a rate limit.


> Claude Code always gives me rate limits

> 50+ tps with Qwen3.5 35B-A4B on a 4090

But Qwen3.5 35B is worse than even Claude Haiku 4.5. You could switch your Claude Code to use Haiku and never hit rate limits. It also gets a similar 50 tps.


I haven't tried Haiku 4.5 much, but I was not impressed with previous Haiku versions.

My go-to proprietary model in Copilot for general tasks is Gemini 3 Flash, which is priced the same as Haiku.

The Qwen model is, in my experience, close to Gemini 3 Flash, but Gemini Flash is still better.

Maybe it's somewhat related to what we're using them for. In my case I'm mostly using LLMs to write Lua. One project is a typed LuaJIT language, and the other is a 3D framework written entirely in LuaJIT.

I forget exactly how many tps I get with Qwen, but GLM 4.7 Flash, which is really good (for a local model), gets me 120 tps and a 120K context.

Don't get me wrong, proprietary models are superior, but local models are getting really good AND useful for a lot of real work.


I also started playing with 3.5 Flash and was impressed.

It’s 2× faster than its competitors. For tasks where “one-shotting” is unrealistic, a fast iteration loop makes a measurable difference in productivity.


TDD is really what delineates success from failure when using [local] LLMs.

> some opensource models really are strong and useful

To be clear I never said they weren’t strong or useful. I use them for some small tasks too.

I said they’re not equivalent to SOTA models from 6 months ago, which is what is always claimed.

Then it turns into a motte-and-bailey game where that argument is replaced with the simpler claim that they're useful open-weights models. I'm not disagreeing with that part. I disagree with the first assertion that they're equivalent to Sonnet 4.5.


They are not 1:1 equivalent, especially in knowledge coverage (given the order-of-magnitude difference in parameter count) and in taste (Sonnet wins, though for taste one can also use Kimi K2.5). But in my hardcore use (high-performance realtime simulations of various kinds), I would strongly prefer StepFun-3.5-Flash to Sonnet 4, and to 4.5 often enough that there's no decisive advantage in using Sonnet 4.5 exclusively. For truly hard tasks or specifications I would turn to 5.2 or 5.3-codex, of course - but one KPI for the quality of my work as a lead engineer is ensuring that truly hard tasks are known, bounded, and planned for in advance.

Maybe my detailed, requirement-based/spec-based prompting style shrinks the difference between Anthropic's and OSS models, and people just like how good Anthropic's models are at reading the programmer's intent from short, concise prompts.

Frankly, I think 1:1 equivalence is an impossible standard given the set of priorities and decisions frontier labs make when setting up their pre-, mid-, and post-training pipelines, but benchmark-wise it is achievable for a smaller OSS model to align with Sonnet 4.5 even on hard benchmarks.

Given the relatively underwhelming Sonnet 4.5 benchmarks [1], I think StepFun might have an edge over it, especially in Math/STEM [2] - even an old DeepSeek-3.2 (not Speciale!) had a similar aggregate score. With 4.6, Anthropic of course vastly improved their benchmark game, and it now truly looks like a frontier model.

1. https://artificialanalysis.ai/models/claude-4-5-sonnet-think...
2. https://matharena.ai/models/stepfun_3_5_flash


What are you running that model on?

I just use OpenRouter; it's free for now. But I would pay $30-100 to use it 24/7.

Ah, I thought you meant you were running it locally.

Have you tried Minimax M2.5? How did it compare?

A 3-bit quant will run on a 128 GB MacBook Pro; it works pretty well.
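The sizing checks out on paper. A rough sketch, assuming the 196B total parameter count mentioned upthread and an idealized, even 3-bit quantization with no metadata or KV-cache overhead (real quant formats carry some extra bits per weight):

```python
# Rough weight-memory estimate for a quantized model.
# 196e9 params is taken from the thread; everything else is assumption.

def weights_gb(params: float, bits_per_param: float) -> float:
    """Gigabytes needed to hold the weights alone."""
    return params * bits_per_param / 8 / 1e9


print(round(weights_gb(196e9, 3), 1))   # → 73.5 GB at 3-bit
print(round(weights_gb(196e9, 16), 1))  # → 392.0 GB at bf16
```

~73.5 GB of weights leaves headroom for KV cache and the OS inside 128 GB of unified memory, which is why the 3-bit quant is about the largest that fits on that machine.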

A 3-bit quant is quite a lot weaker than the OpenRouter version the OP is using.



