• 1 Post
  • 51 Comments
Joined 6 months ago
Cake day: August 27th, 2025

  • Yeah, me too. Opus 4.5 is awesome, but my god… om nom nom go my daily/weekly quotas. Probably I shouldn’t yeet the entire repo at it lol.

    4.6 is supposedly about 2x the quota burn for not much better output.

    Viewed against that, Codex 5.3 @ medium is daylight robbery, at OAI’s expense.

    I was just looking at benchmarks, and even smaller 8-10B models are now around 65-70% of Sonnet’s level (Qwen3-8B, Nemotron 9B, Critique) and 110-140% of Haiku’s.

    If I had the VRAM, I’d switch to local Qwen3-Next (which scores almost 90% of Opus 4.5 on SWE-bench) and just git gud. Probably I’ll just look at smaller models, API calls, and the git gud part.

    An RTX 3060 (probably what you need for a decent Qwen3-Next setup) is $1500 here :(

    For that much $$$ I can probably get 5 years of surgical API calls via OpenRouter + actual skills. (Quick sketch below of what I mean by “surgical”.)

    PS: how are you using batch processing? How did you set it up?
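
    (A minimal sketch of a “surgical” call: one small, targeted request instead of yeeting the whole repo. This assumes OpenRouter’s OpenAI-compatible chat endpoint; the model slug is just an example, pick whatever fits the task.)

    ```python
    import json
    import os
    import urllib.request

    # OpenRouter exposes an OpenAI-compatible chat completions endpoint.
    payload = {
        "model": "qwen/qwen3-8b",  # example slug; swap for your model of choice
        "messages": [{
            "role": "user",
            "content": "Explain just this one function:\n<paste snippet here>",
        }],
    }
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
    ```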


  • Ah, but a subscription to OpenAI’s ChatGPT (USD $20/month) gives you ChatGPT 5.3 Codex bundled in, with some really generous usage allowances (well, compared to Claude).

    I haven’t looked recently, but API calls to Codex 5.2 via OpenRouter were silly expensive per million tokens; I can’t imagine 5.3 is any cheaper.

    To be fair to your point: I doubt many people sign up specifically for this (let’s say 20%, if we’re making up numbers). It’s still a good deal, though. I can chew through 30 million tokens in pretty much a day when I’m going hammer and tongs at stuff.

    Frankly, I don’t understand how OAI remains solvent. They’re eating a lot of shit in their “undercut the competition to take over the market” phase. But hey, if they’re giving it away, sure, I’ll take it.


  • Let’s be fair - not all of the masses are so ignorant.

    If you compare API vs subscription, you probably get more bang for your buck paying $20/month than paying per million tokens via API calls, at least for OAI models. It’s legitimately a good deal for heavy users (rough math at the end of this comment).

    For simpler stuff and/or if you have decent hardware? For sure, go local. Qwen3-4B 2507 Instruct matches or surpasses ChatGPT 4.1 nano and mini on almost all benchmarks… and you can run it on your phone. I know because it (or the abliterated version) is my go-to at home. It’s stupidly strong for a 4B.

    But if you need SOTA (or near it) and are rocking typical consumer-grade hardware, then $20/month for basically unlimited tokens is the reason to subscribe.
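
    The rough back-of-envelope, with made-up prices purely to show the shape of the comparison (check current pricing before trusting any of this):

    ```python
    # Hypothetical numbers for illustration only.
    sub_cost = 20.0    # USD/month, flat subscription
    api_price = 10.0   # USD per million tokens via API (made up)

    breakeven_m_tokens = sub_cost / api_price
    print(f"Subscription wins past ~{breakeven_m_tokens:.0f}M tokens/month")

    # At 30M tokens on a heavy day (per the comment above):
    print(f"One heavy day via API: ~${30 * api_price:,.0f}")
    ```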


  • You’re over-egging it a bit. A well-written SOAP note, HPI, etc. should distill to a handful of possibilities, that’s true. That’s the point of them.

    The fact that the LLM can interpret those notes 95% as well as a medically trained individual (per the article) and come up with the correct diagnosis is being a little undersold.

    That’s not nothing. Actually, that’s a big fucking deal™ if you think through the edge-case applications. And remember, these are just general LLMs, and pretty old ones at that (ChatGPT 4 era). We’re not even talking about medical domain-specific LLMs.

    Yeah; I think there’s more here to think on.


  • Agreed!

    I think (hope) the next application of this tech is in point-of-care testing. I recall a story of someone in Sudan(?) using a small, locally hosted LLM with vision abilities to scan handwritten doctor notes and come up with an immunisation plan for their village. I might be misremembering the story, but the anecdote was along those lines.

    We already have PoC testing for things like ultrasound… but some interpretation workflows rely on a strong net connection, iirc. It’d be awesome to have something on-device that can be used for imaging interpretation where there is no other infra.

    Maybe someone can finally win that $10 million XPRIZE for the first viable tricorder (pretty sure that one wrapped up years ago? Too lazy to look)… one that isn’t smoke and mirrors like Theranos.


  • Funny how people overlook that bit en route to dunking on LLMs.

    If anything, that 90% result supports the idea that Garbage In = Garbage Out. I imagine a properly used domain-tuned medical model with structured inputs could exceed those results in some diagnostic settings (task-dependent).

    Iirc, the 2024 Nobel Prize in Chemistry was won on the basis of using an ML expert system to investigate protein folding. ML != LLM, but at the same time, let’s not throw the baby out with the bathwater.

    EDIT: for the lulz, I ran my above comment through my locally hosted bespoke LLM. It politely called my bullshit out (AlphaFold is technically not an expert system; I didn’t cite my source for the Med-PaLM 2 claims). As shown, not all LLMs are tuned to be sycophantic yes-men; there might be a sliver of hope yet lol. (Sketch of the local call at the end of this comment, for the curious.)


    The statement contains a mix of plausible claims and minor logical inconsistencies. The core idea—that expert systems using ML can outperform simple LLMs in specific tasks—is reasonable.

    However, the claim that “a properly used expert system LLM (Med-PALM-2) is even better than 90% accurate in differentials” is unsupported by the provided context and overreaches from the general “Garbage In = Garbage Out” principle.

    Additionally, the assertion that the 2024 Nobel Prize in Chemistry was won “on the basis of using ML expert system to investigate protein folding” is factually incorrect; the prize was awarded for AI-assisted protein folding prediction, not an ML expert system per se.

    Confidence: medium | Source: Mixed
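
    (The local call above is nothing fancy. A minimal sketch, assuming an Ollama-style local endpoint; the model name is hypothetical, not my actual setup.)

    ```python
    import json
    import urllib.request

    # Assumes a local Ollama server on the default port; model name is hypothetical.
    payload = {
        "model": "qwen3:8b",
        "prompt": ("Fact-check the following comment. Be blunt about "
                   "unsupported claims and missing sources:\n\n<comment text>"),
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
    ```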



  • I don’t think it’s their information per se, so much as how the LLMs tend to use said information.

    LLMs are generally tuned to be expressive and lively. A part of that involves “random” (i.e. roll the dice) output based on inputs + training data. (I’m skipping over the technical details for the sake of simplicity; there’s a tiny sampling sketch at the end of this comment.)

    That’s what the masses have shown they want: friendly, confident-sounding chatbots that can give plausible answers that are mostly right, sometimes.

    But for certain domains (like med) that shit gets people killed.

    TL;DR: they’re made for chitchat engagement, not high-fidelity expert systems. You have to pay $$$$ to access those.
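
    A tiny sketch of the “roll the dice” part (temperature-scaled sampling over next-token logits; the numbers are made up for illustration):

    ```python
    import numpy as np

    def sample_next_token(logits, temperature=1.0, rng=None):
        """Higher temperature flattens the distribution (livelier, less
        predictable); lower temperature sharpens it (safer, more boring)."""
        rng = rng or np.random.default_rng()
        scaled = np.asarray(logits, dtype=float) / temperature
        probs = np.exp(scaled - scaled.max())  # numerically stable softmax
        probs /= probs.sum()
        return rng.choice(len(probs), p=probs)

    logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
    print(sample_next_token(logits, temperature=0.2))  # almost always token 0
    print(sample_next_token(logits, temperature=1.5))  # much more of a dice roll
    ```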



  • Agree.

    I’m sorta kicking myself that I didn’t sign up for Google’s Med-PaLM 2 when I had the chance. Last I checked, it scored ~86% on USMLE-style questions (MedQA) and 88% on radiology interpretation / report writing.

    I remember looking at the sign-up and seeing it requested credit card details to verify identity (I didn’t have a Google account at the time). I bounced… but gotta admit, it might have been fun to play with.

    Oh well; one door closes, another opens.

    In any case, I believe this article confirms GIGO. The LLMs appear to have been vastly more accurate when clinicians fed them correct inputs versus what laypeople fed them.