KETTLE We've been experimenting with LLMs for a while here at The Register, and if you ask our systems editor Tobias Mann and senior reporter Tom Claburn, locally installed coding assistants have actually become so good they could relieve some of the compute load that's pushing AI companies to raise their prices. This week on The Kettle, host Brandon Vigliarolo is joined by Mann and Claburn to discuss their work with locally-hosted LLMs, why we're revisiting the topic at all, how to do local LLMs safely, and whether there's orbital relief coming for the compute crunch. You can listen to The Kettle here, as well as on Spotify and Apple Music, or read the full transcript of this episode below. ® --- Brandon (00:01) Welcome back to another episode of The Register's Kettle podcast. I'm Reg reporter Brandon Vigliarolo and with me this week are systems editor Tobias Mann and senior reporter Tom Claburn to talk about some experiments they've been doing with AI coding assistants, but not just any AI coding assistant mind you, we're talking about local ones that live right on your own machine. Guys, thanks for joining me this week. Tobias Mann (00:24) Good to be here. Thomas Claburn (00:25) Thank you. Brandon (00:29) So before we jump into what we learned during these experiments and how effective local large language models actually are as coding assistants, let's talk a bit about why we're having this discussion in the first place. And I understand that AI coding assistants are about to become way more expensive. And I think, Tom, these were stories that you wrote recently. So can you walk us through a bit what's going on with the current cloud-hosted ones? Thomas Claburn (00:52) Back in November, I think around Opus 4.5, pretty much all the developers started to realize that these models were actually getting pretty good, and vibe coding was less of a joke and more like, you know, maybe this will work. And then by the time, you know, around February with the OpenClaw craze, there was a lot more demand for sort of coding agents and people would start running these for long periods of time. And it sort of caught Anthropic and others unaware, Google and OpenAI as well. There were a lot of capacity constraints, a lot more people were trying these things out, and they ended up having to find ways to limit demand through session limits, which made a lot of people unhappy, but they basically just didn't have the compute available to serve capacity. And on top of that, they're serving a lot of these at a price that is loss-leading. They're trying to get people into the business, but these are unprofitable workloads for them. And if you look at something like Mythos, which came out as their big security model, it was too good for anybody but large companies with expensive payrolls to run. Brandon (02:08) Right, right. Thomas Claburn (02:10) It's clear that they're looking for ways to increase their revenue because they're investing a lot in the infrastructure to make this run, but they don't yet have the recurring revenue that justifies all this. The ramps look good. They're bringing more people on, but they invested a lot of money in this. Brandon (02:29) OpenAI famously has never actually turned a profit in its history. I don't know about Anthropic personally, but I can't imagine they're doing a whole lot better. And so I understand the two specific examples you had were that Anthropic recently yanked Claude Code from Pro plans, but only for some people. Is that correct?
Thomas Claburn (02:49) Yeah, and they wrote that off as an A/B test. Basically they were doing live A/B testing and people noticed, and they were saying, oh, well, no, that doesn't apply to everyone. We're not going to change or take away from existing Pro users. But clearly there is someone there saying, hey, can we get away with charging this much but providing less service? And that doesn't happen unless you're trying to figure out a way to increase your revenue and reduce the demand on your services. Brandon (02:53) Okay. Totally. Did they backtrack on that at all, or is that A/B test still going on? Thomas Claburn (03:23) I don't think it's still going. Tobias Mann (03:24) They really do do a lot of A/B testing. I think I have a Claude Code Max subscription that has a 50% discount on it right now. So I'm a little hesitant to give it up because, yeah, it's a hundred bucks a month and I don't use it nearly enough to justify that. But also if I cancel and decide I want it back, it'd be 200. Brandon (03:46) Yes, that's the reason I'm still an Nvidia GeForce Now gaming cloud subscriber, right? Because I was there in the beta test and I've never given that discount up, even if I haven't used it in a while. So I understand. Claude did that, Anthropic did that, and then GitHub also has just straight up jumped to metered billing for AI, I think. Correct? Thomas Claburn (04:05) Yeah, and they were taking a huge loss on things because they would give you a flat rate, but then people would use the most expensive models. And of course, those things are billed at different rates, and offering a flat rate versus these very inflated Opus 4.7 models, which also take a lot longer to process stuff, even if they're a little bit more efficient, they'll think for longer periods. It's just they're losing money. So everyone has to go to metered billing. And once that happens, it's going to cost people a lot of money. You can look at it now, even on a subscription plan, you'll write up a little widget and you look at the thing and it's, you know, $2 worth of whatever. You think, well, is that worth it? Maybe. And then if it's a more substantial project, you know, people spend, you know, hundreds of thousands of dollars on stuff. And if that's not returning you any revenue, are you still going to do that? So it's going to be interesting to see how this goes. Brandon (04:59) Maybe local LLMs like what we're here to talk about today are kind of the market control, right? I'm sure there are gonna be people who are using these paid services, or were at least, that are gonna say, I don't care what the justification is, whether they're trying to make more money, sure, they might deserve to, or whether they just need to reserve compute resources. Either way, I can't afford to pay for this, so I'm going local. Maybe that'll be the cost control, right? Maybe there'll be some balance that kind of equals out there between, we're losing customers, so we got to make this cheaper versus we need to actually get some return on our investment someday. But I guess either way, right, this discussion is kind of indicative of why we're talking about using local LLMs. Specifically, I believe, coding assistants, which is what the two of you have been kind of spending some time working with. And I understand you've both had success in various ways with this. Let's talk a bit about I guess the one large story you wrote this week about local LLMs and just kind of more broadly what you guys think of them.
Tobias Mann (06:05) Many of us on the team have been playing with local LLMs in some shape or fashion for a couple of years now. And probably within the last year, certainly in the last six months, the models that are small enough that you can run on consumer hardware – and I'm not talking cheap consumer hardware, I'm talking about high-end consumer GPUs, quasi-workstation mini PCs, higher-end MacBooks and Macs – the quality of those models has jumped from being kind of like toys, tech demonstrators, to being really rather competent. At the same time, we've also seen the rise of these agentic coding frameworks. That's the other part of the equation. These are things like Claude Code. Claude Code is a framework that connects to models running in Anthropic's various data centers and cloud providers, and is what's actually orchestrating the generation of the code, the testing of the code, the validation of the code, and allowing developers to kind of use these as actually useful tools rather than just getting a code snippet that may or may not work out of a model as you might have done with ChatGPT four years ago. Right around the time that Microsoft was going to usage-based billing and Anthropic was toying around with kicking the $20 a month Pro users off of Claude Code entirely to save on compute, Alibaba's Qwen team popped in with a relatively small 27 billion parameter LLM. Brandon (08:05) Relatively small. I just think it's funny how quick the parameters have grown over the years. It's small. It's only a few billion, you know. Tobias Mann (08:08) Yeah, it's only 27 billion. You know, they popped in and they presented this as being frontier-quality coding out of a pretty small model. And so with all of the harnesses you need to do this and now a model that is supposedly competent, it was just kind of the perfect storm, so to speak, to start looking into whether or not these small models could be a replacement for some part of the development flow, or for the entire development flow. And it's surprising just how good these small models have gotten. Thomas Claburn (08:53) I was experimenting just recently with the Qwen 3.6 and it's like a, whatever, 35 billion parameters ... but it's like a mixture of experts, so it's actually only like 3 billion, I think, when it's running. And it's an 8-bit quantization. And it's actually, it's working pretty speedily. And I was doing a sort of comparison test to see whether it would do a drag-and-drop metadata removal app on a Mac, which is like a very particular kind of thing. And initially it kind of suggested some things that were wrong. And I sort of cross-checked that with Claude and OpenAI and they both came up with things that were like not really right either, and then when I sort of rephrased the question to it more carefully, it basically came up with the same answer as Claude. And what it tells me is, to your point about the harnesses, I think a lot of what makes local coding work is how good the local harness is. And this was a point that came up yesterday in a piece I was working on about Mozilla when they were talking about all the bugs they fixed with Mythos. One of the people I was talking to, Davi Ottenheimer, argued pretty strongly that you can do Mythos-quality work with a much smaller model as long as you have a good harness. Unfortunately, a lot of the setup of that is very kind of...there's not a standard way to do it.
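The "harness" Tom and Tobias keep coming back to is, at heart, a generate-test-retry loop wrapped around the model. Here is a minimal sketch of that loop in Python, assuming a local model served through an OpenAI-compatible endpoint such as the ones Ollama or llama.cpp's server expose; the endpoint URL, model name, and test command below are placeholder assumptions, not any particular product's setup.

```python
import subprocess
import requests

LOCAL_ENDPOINT = "http://localhost:11434/v1/chat/completions"  # placeholder: an Ollama-style OpenAI-compatible server
MODEL = "qwen-coder"                                           # placeholder model name

def ask_model(messages):
    """Send a chat request to the locally served model and return the reply text."""
    resp = requests.post(LOCAL_ENDPOINT, json={"model": MODEL, "messages": messages})
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def harness(task, test_cmd="pytest -q", max_rounds=3):
    """Crude agent loop: generate code, run the tests, feed failures back until they pass."""
    messages = [{"role": "user", "content": f"Write Python code for: {task}. Reply with code only."}]
    for _ in range(max_rounds):
        code = ask_model(messages)
        with open("solution.py", "w") as f:       # naive: real harnesses edit files far more carefully
            f.write(code)
        result = subprocess.run(test_cmd.split(), capture_output=True, text=True)
        if result.returncode == 0:
            return code                           # tests pass; hand the result back to the developer
        # Otherwise, show the model its own failure output and let it try again
        messages += [
            {"role": "assistant", "content": code},
            {"role": "user", "content": f"The tests failed:\n{result.stdout}{result.stderr}\nFix the code."},
        ]
    raise RuntimeError("No passing solution within the round limit")
```

Real harnesses like Claude Code or Cline layer permission prompts, file editing, and much richer context management on top of this basic loop, which is where most of the quality difference comes from.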
So people will either figure out a way that makes it work or they'll set something up and it just doesn't work. But it's not really clear why that happens. And there's a lot of just sort of arcana about like what skills you have and what the pipeline looks like. People are still figuring it out. But I think that local is where it will go because there's nothing that beats the price of being able to run this for next to nothing, excluding your very expensive hardware. Brandon (10:59) And it's improving to the point where it's not like a while ago, when it was like, this doesn't really work. Now we're reaching the point where these local models are viable, right? Well, like you said, you've got to word things carefully. I mean, that feels like anything that was the early days of AI, right? It's like, OK, you got to word it carefully. But eventually, it's going to get better to the point where it's not going to have to be so particular. And you get the same results, hopefully. Tobias Mann (11:24) Yeah, there are two key technologies that I think have really helped these smaller models compete. The first, as Tom mentioned, is mixture-of-experts models. They only use a subset of the total parameter count for each token generated, which reduces the barrier to entry for hardware. The larger the models get, the more memory bandwidth you need in a consumer or even workstation class of product. It gets absurdly expensive as your memory bandwidth requirements increase. Brandon (12:01) Even for doing some of the basic ones here, I think you wrote in your story that the things you need, you need an M5 Mac with 32 gigabytes of memory. Or 24 gigabytes with multiple GPUs. You need a beefy machine from a consumer perspective to run this stuff. I've got an M1 Mac; I wonder if I could run some of these. My Mac's pretty fast. I haven't needed to think about upgrading it in several years. And I looked at them and there's no way. Tobias Mann (12:29) So older Macs can do it. You will run into issues where the prompt processing side of it, that's the part where you hit enter on your prompt and then you wait, gets to be problematic. Like you're talking several minutes of waiting for it to start generating a response because older Macs lacked the matmul acceleration necessary for this. So they were brute-forcing a lot of the compute on the GPU. Starting with the M5 Macs, they integrated the matmul acceleration into the GPU. It makes a huge, huge difference in terms of performance. That's why we recommended newer Macs. Tom and I, I think, are both testing on older M-series Macs. Yes, it can work, and especially with the 35 billion parameter mixture-of-experts model, performance is a little bit better, but the quality is generally worse than the dense 27 billion parameter model. Brandon (13:39) I guess I can understand that, right? I mean, the more processing you can get done the faster, the better the response is. Tobias Mann (13:48) That's a really important part of this because the other piece, the thing that has changed that lets small models be this competitive, is something called test-time scaling. We saw this first with DeepSeek and OpenAI o1, which is this: you hit enter on your prompt and then you see the model thinking, and the model can work through different paths and then choose which path it wants to present to the user at the end.
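A quick back-of-the-envelope on Tobias's memory-bandwidth point: at generation time, a model roughly has to stream its active weights from memory for every token, so fewer active parameters means more tokens per second on the same machine. The figures below are illustrative assumptions, not benchmarks from the piece.

```python
def decode_ceiling(active_params_billion, bits_per_weight, mem_bandwidth_gb_s):
    """Very rough upper bound on tokens/sec: memory bandwidth divided by the bytes of
    weights that must be streamed per generated token. Ignores KV-cache traffic,
    activations, and every other overhead, so real numbers will be lower."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return mem_bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical machine with ~400 GB/s of memory bandwidth, 8-bit weights:
print(decode_ceiling(27, 8, 400))  # dense 27B model            -> ~15 tokens/sec ceiling
print(decode_ceiling(3, 8, 400))   # MoE with ~3B active params -> ~130 tokens/sec ceiling
```

Same hypothetical machine, same quantization; the mixture-of-experts model's smaller active set is what buys back the speed.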
So the idea behind test-time scaling is that you can take a smaller model and have it think for longer in order to make up for the lack of parameters in that model. And so we have both of those things coming together in models like Qwen 3.6 27B or Qwen 3.6 35B. Brandon (14:30) Okay, cool. Now, I mean, for those who are interested in setting this up and go, okay, I've got some hardware that's beefy enough and I think I'm willing to give this a shot. This has also gotten a lot easier. I think in the past year, year and a half, two years, it's gotten much simpler to actually set up one of these things and run them locally. Is that accurate to say? It seems like it's gotten a lot simpler to configure this. Thomas Claburn (15:10) People often use Ollama or Unsloth, I'm using OMLX, which uses the Mac MLX framework. And these are basically the model serving platforms. You can get your model from a variety of places. Hugging Face is a very common one. But a lot of the model platforms like Ollama will fetch the model for you and handle all the installation stuff. The trick is a lot of them have different formats. And if you're using llama.cpp directly on your computer, which is the C-based model runner, it's going to have a different format than say something else. And they'll all talk to each other, but it tends to lock you into one particular way of doing it and you get used to it. There's not really a right way of doing it right now, and that's part of the problem: everyone's kind of figuring out, what's the right way to do this? Which one do I want to use, how do I configure it? Even just looking at the model and trying to decipher the quantization and the features it has isn't always clear to everybody. That I think hopefully will become more standardized as you get sort of more common knowledge about, yeah, this one works really well for me. Throughout the forums every week there's someone saying, yeah, this model is great for XYZ, and we'll try that out. I mean, that's really the experience you have to have: figure out what you're gonna use it for and try it and see what other people are doing. And you can probably arrive at something that's useful locally. Brandon (16:45) Useful locally, I guess, also implies the need to do some security legwork. Right. I know when we first started writing about local LLMs, things like OpenClaw, right. I mean, the going headline for any of those, right, was, this local LLM has caused chaos for somebody again. Right. I think, Tom, you wrote a couple of stories recently about running local LLMs safely. Has it gotten to the point where it's easier to do that safely or is that still going to be a big concern for anyone doing this? Thomas Claburn (17:17) It is easier to do. The setup can be pretty complicated for these anyway. I just spent an evening building a sandbox for the Py agent because Py is sort of a very permissive agent that comes out of the box in YOLO mode. It can sort of do anything. It has a very limited command set, but it has very few limitations. And that's by design. It's sort of like in the same way Flask is a very open Python framework. It's not this sort of "batteries included" thing, you know, compared to Django. Something like Claude will come with a bunch of sort of predefined ways to do things. Claude has its own sort of sandboxing system and you can add a lot of safety through things like hooks. You know, there are people who will write hooks that will intercept dangerous commands like, you know, rm.
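Hook mechanisms differ from harness to harness, but the idea Tom mentions, intercepting dangerous commands before they execute, is simple to sketch. This is a generic illustrative filter, not Claude Code's actual hook API; the patterns are assumptions you would tune to your own setup.

```python
import re

# Patterns an unattended agent should never be allowed to run; extend to taste
DENYLIST = [
    r"\brm\s+-[a-zA-Z]*(rf|fr)",     # recursive force deletes (rm -rf, rm -fr)
    r"\bmkfs\b",                     # formatting disks
    r"\bdd\s+if=",                   # raw disk writes
    r"curl[^|]*\|\s*(ba)?sh",        # piping a download straight into a shell
    r"\bgit\s+push\s+--force\b",     # rewriting remote history
]

def review_command(cmd: str) -> bool:
    """Return True if the agent's proposed shell command may run, False to block it."""
    for pattern in DENYLIST:
        if re.search(pattern, cmd):
            print(f"Blocked command: {cmd!r} (matched {pattern!r})")
            return False
    return True

# A harness would call this from its pre-execution hook before running anything the model proposes
assert review_command("pytest -q") is True
assert review_command("rm -rf /") is False
```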
So there's a lot of ways to do it. Docker has a sandboxing system. That's what I tried to build on: basically figure out a way to do a Docker sandbox that runs Py and protects the local file system but leaves the internet space open, and those are kind of the security decisions you have to make, because if this thing is totally enclosed in a VM and there's no way out, it can't really do anything! I mean, you can do anything with what you stick in the VM, but if you wanted to work on a project on your own system, you have to break that boundary somehow to get the file across and give it access, and then if you need to update something you have to open it up to a code repo somewhere. So there are a lot of security decisions you have to make, and for me the biggest one was just like making sure it doesn't mess with my local files, and that gives me a little bit more confidence to run a model that I don't really know how well it will perform. Having had Claude for a long time, I'm a little bit more confident that it behaves well, but the risk is there for all of them. Tobias Mann (19:10) So we looked at, I think, three different agent harnesses in the piece. Claude Code, which you would think is just for working with Anthropic's stuff, but it works just fine with local models. It's two additional commands and you're up and running. It's very heavy. The system prompt is enormous. And so if you have lesser hardware, you might struggle a little bit with it. We also looked at Cline, which is a VS Code extension that is very easy to install, pretty fast to configure. And then we looked at PyCodingAgent, which Tom had suggested that we discuss as well. Out of the box, Claude Code and Cline both default to user-in-the-loop, deny-by-default kind of situations where it'll ask for permission before performing any commands or writing any code. It'll say, "I want to write this code. What do you think? Do you want to proceed?" But they can be made to go fully automatic and just say, you know, I'm not worried, YOLO, let's go. And so that is a different security model than what we saw with PyCodingAgent, which to Tom's point is just pure YOLO mode out of the box. And so the security models differ wildly depending on which agent harness you're using or which sandbox that you're trying to play in, so to speak. There are several kind of agent sandboxes that have emerged that default to blocking all outbound network activity, which really limits the capabilities of the agent and forces you to be deliberate about what you do and don't want it talking to. Others are just, you know, focused on kind of limiting the blast radius if the agent decides to go AWOL and do rm -rf on, you know, the root file structure and just take the whole thing out. That's fine if it's in the container and it destroys the container, because you run two commands and you're back up and running again. It's less okay if you're running bare metal. Brandon (21:32) So security considerations, it seems like the core is basically just know what you're working with, right? Like don't deploy an agent that you don't at least have some idea how the security apparatus built into it functions by default, right? And just what you can do with it. But I guess whether we think about security or not, a lot of the conversation around the need to run LLMs locally seems to boil down to compute resources and the cost to maintain them, the cost to operate them, the cost to serve them. And I guess, Anthropic, speaking of Claude, right?
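The Docker approach Tom describes comes down to a few flags: mount only the project the agent is allowed to touch, keep the rest of the filesystem out of reach, and make network access an explicit choice. A hedged sketch follows; the image name and agent command are placeholders for whatever you actually run.

```python
import subprocess
from pathlib import Path

def run_agent_sandboxed(project_dir: str, allow_network: bool = False):
    """Launch a coding agent inside Docker with only one host directory writable.
    The image name and agent command are placeholders, not a real product's defaults."""
    project = Path(project_dir).expanduser().resolve()
    cmd = [
        "docker", "run", "--rm", "-it",
        "--read-only",                      # the container's own filesystem is immutable
        "--tmpfs", "/tmp",                  # scratch space the agent can write to freely
        "-v", f"{project}:/workspace",      # the one host path the agent is allowed to modify
        "-w", "/workspace",
    ]
    if not allow_network:
        cmd += ["--network", "none"]        # block all outbound traffic unless you opt in
    cmd += ["my-agent-image:latest", "agent", "--auto"]  # placeholder image and entry command
    subprocess.run(cmd, check=True)

# Example: give the agent the project and internet access, but nothing else on the host
# run_agent_sandboxed("~/code/side-project", allow_network=True)
```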
Anthropic's big longshot this week, I guess, was a plan or a partnership they signed with SpaceX to occupy some space on the fleet of orbital data centers that Elon Musk seems intent on building. Tom, so is that gonna happen? Thomas Claburn (22:27) [Laughs.] I don't know. I would think that they would put them in the ocean before they would put them in space. And, you know, they talk about data centers, but I think I'll wait and see if they actually build them on land first, because there's a lot of terrestrial construction that is planned and hasn't happened. And we'll see. Tobias Mann (22:49) Yeah, the whole idea is that in space, you put the satellites in a sun-synchronous orbit, then they have basically unlimited power. The problem is that you have to get them there in the first place, which you need a launch vehicle for, which, last I checked, Starship still does not work. Brandon (23:09) I was gonna say this seems awfully familiar to me if we just change orbital data centers to Mars colonization, right? Like, same problem here. We've got to have a vehicle that can get us there, and we do not. Thomas Claburn (23:21) The Hyperloop will be the way they'll take it out there. Brandon (23:24) Yeah, right. Tobias Mann (23:28) And once we get the orbital cluster in place, Elon wants to put a mass driver on the moon so that we can put even more of these things into deep space, for reasons, I guess. Brandon (23:42) It just seems like there's a lot of, I don't know, it feels like the idea that Anthropic is gonna get on board with these SpaceX data centers in orbit. It feels to me a lot like when a data center company is like, hey, we just signed a huge deal with this company that makes nuclear reactors that don't exist yet. And it's kind of like, cool guys, well, let us know when we've actually got a real solution for the compute crisis that you guys are dealing with right now that you caused. Thomas Claburn (24:05) I kind of interpret the whole space thing as like, we made a deal with SpaceX and we have to say something nice about their future plans. Brandon (24:18) Right. Yeah. Tobias Mann (24:19) This really boils down to Anthropic getting access to Colossus One, this massive, what, 150-megawatt AI factory, purpose-built for GPU training and inference. And so I think really what they need is compute and they cannot get enough of it. The inflection point has hit and we're seeing adoption, which means we need compute for inference, and we need more compute for inference than we've had in the past. And so I think really what this is, is: we'll say whatever you want. We will say that we will ride along on your Starship into the heavens and live in your space data centers. Just give us access to Colossus, please, because we're dying for compute. Brandon (25:15) We need it now and it'd be great if it happened someday in orbit, right? So in the meantime, I guess, basically, have we reached the point where localized AI, local LLM coding agents, right? Are we at the point now where they might be able to ease some of the compute stress that these companies are feeling, or is this still early days, something that's going to have to be developed, not worth it for the average developer? Thomas Claburn (25:41) I think they're going to be useful for sort of prototyping stuff. One of the things I've done is, I'll run it through the local one and then I'll have Claude check it. You often get a lot of, you know, code fixes that way. So it is a way to offload some less important jobs.
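Tom's local-draft-then-Claude-check workflow, and the routing idea Tobias raises next, both amount to the same pattern: generate with the cheap local model, and only spend serious cloud tokens when a quick check says the draft isn't good enough. A rough sketch, again assuming OpenAI-compatible endpoints on both ends; the URLs, model names, and yes/no heuristic are illustrative assumptions, not how Codex or ChatGPT actually route.

```python
import requests

LOCAL = {"url": "http://localhost:11434/v1/chat/completions", "model": "qwen-coder", "key": None}            # placeholder local server
CLOUD = {"url": "https://api.cloud-provider.example/v1/chat/completions", "model": "big-frontier", "key": "YOUR-API-KEY"}  # placeholder cloud API

def chat(backend, prompt):
    """Minimal OpenAI-compatible chat call against whichever backend we're given."""
    headers = {"Authorization": f"Bearer {backend['key']}"} if backend.get("key") else {}
    resp = requests.post(backend["url"], headers=headers,
                         json={"model": backend["model"],
                               "messages": [{"role": "user", "content": prompt}]})
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def local_first(prompt):
    """Draft with the small local model, let the big model grade it with a cheap yes/no,
    and only pay for a full cloud generation when the draft doesn't pass."""
    draft = chat(LOCAL, prompt)
    verdict = chat(CLOUD, f"Task: {prompt}\n\nProposed answer:\n{draft}\n\n"
                          "Reply YES if this fully and correctly solves the task, otherwise reply NO.")
    if verdict.strip().upper().startswith("YES"):
        return draft            # local output was good enough; the cloud only spent a few tokens checking
    return chat(CLOUD, prompt)  # regenerate the hard ones on the big model
```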
I mean, you don't need a frontier model for everything. Brandon (25:49) Right. Right. I think that was kind of an argument you made, Tobias, about, you know, using a massive data center to build an HTML page is not a good use of resources. Tobias Mann (26:09) Right. Using the biggest, baddest model to write some HTML is probably not the most efficient thing to do, and it's certainly well within the capabilities of these small models. The other thing I'll say is, if you look at how GPT-5 works, if you go to ChatGPT, not Codex, when you first enter a prompt, it gets routed to one of three models based on the complexity of that prompt. Conceivably, we could do the same thing with local models, where you sign into Codex, it does a check. If you have sufficient hardware, it will run some portion of that query through the local model, do a yes/no check on the big model in the cloud, and decide at that point whether it needs to be regenerated via the API, or it can move forward with what's generated locally. So there's definitely a path forward for local playing a bigger role in reducing the amount of compute required to scale it.... Brandon (27:25) I guess the only key caveat there would be that if you're gonna install local LLMs on people's machines to split your compute load, you should probably let them know first, right Google? Tobias Mann (27:38) You probably should. Brandon (27:43) Probably. Or you can just do it and ask for forgiveness later on. Who's gonna uninstall Chrome? You? Ha ha ha. Tobias Mann (27:49) Yeah, the other thing I would point out is that, while a 24- or 32-gigabyte GPU is very expensive, we're talking anywhere from $1,000 to, you know, $4,000-plus for GPUs with that memory, those GPUs could serve that model to an entire team, realistically. And so if you were thinking about this from an enterprise adoption standpoint, you could buy one machine that sits in the corner, basically silent, that could serve an entire dev team with this smaller model. Or you could spend a whole lot more, but still something that fits on a desktop in the corner, that runs a big model, like a trillion-parameter model, locally on that system and for that team. We're not just limited to these small models. You and I might be, but from an enterprise standpoint, a $70,000 DGX Station, for example, is capable of running very large models, trillion-parameter scale models. And that's less than the cost of one developer for a year. Brandon (29:06) Yeah, so maybe that's the case now, right? Maybe we've just reached a point where there's enough value in these local models as a sort of prototyping testbed, as an entry-level dev replacement to do the first work before someone more experienced or with more parameters reviews it. Yeah, so it might be there. That's interesting. I'll be interested to see how the evolution of AI models and, like you said, the kind of linking between cloud-based versus local develops. It could be the next phase of the AI industry's evolution. We'll see. We'll see. Something's got to give with compute, right? No matter what it is, we are going to be sure we're here on The Register to write about it and here at The Kettle to talk about it. And until then, we will see you next week on the next episode.