Chasing AI: running ollama on my old AMD RX470 GPU

AI technology still seems rather new, yet much of the groundwork we’re working with today has been around for many years, built up by many contributors layering on each other’s work.  We read articles about the immense cost of training new models; we read the news about billions being invested in AI infrastructure; releases seem to come out even more often than Prime Day; we live in fear: can we keep up with AI?  Hence this series of posts, Chasing AI, to try to bring things back down to earth.

I was hesitant to start using ChatGPT when it came out, but a couple of years in I’m to the point where I use it reflexively, even more often than I run a DDG/Google search. Yes, I’m one of those people who uses DuckDuckGo. If you’re weird about privacy like I am, it’s hard to shake the suspicion that AI was invented to siphon off your data.  That’s one of the reasons to explore running LLMs locally: it gives me a safe place to try LLMs without fear that someone else is making a buck off my data.

So what’s actually happening on the server side of all those magical LLM API calls? I think OpenAI got it annoyingly right when they added little bits of haptic feedback (a vibration in the phone app) to emphasize the feeling of cost for every word generated; these models are not just expensive (millions of dollars) to create, the hardware needed to answer your questions with them is substantial too. Not insanely large like “big data”, which is typically petabytes, but often more than 100 GB of weights that have to be loaded into expensive GPU VRAM before the model can give you an answer in a reasonable amount of time.  This means the datacenter computers answering your questions need huge racks full of machines with roughly 100 GB worth of GPU VRAM apiece, which is why Nvidia is doing so well.  Or do they use clusters of smaller GPUs?  Someone tell me, now I’m curious.  But let’s get back down to earth.
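
To put rough numbers on that: the weights alone take roughly parameter count times bytes per parameter, before you even count KV-cache and runtime overhead. Here’s a back-of-the-envelope sketch in Python; the model sizes and quantization levels are illustrative, not measurements of any particular model.

```python
# Back-of-the-envelope VRAM math: weights ~= parameters x bytes per parameter.
# The sizes and quantization levels below are illustrative, not measurements.

def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate size of the weights in GB, ignoring KV cache and activations."""
    return params_billions * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB

examples = [
    ("1B model, 4-bit quantized", 1, 0.5),
    ("7B model, 4-bit quantized", 7, 0.5),
    ("70B model, 16-bit", 70, 2.0),
    ("~400B frontier-scale model, 16-bit", 400, 2.0),
]

for name, params_b, bpp in examples:
    print(f"{name}: ~{weight_gb(params_b, bpp):.1f} GB of weights")
```

The bigger entries land comfortably past 100 GB, which is why they only fit on datacenter-class GPUs (or clusters of them), while a 4 GB card like my RX470 is limited to the small, heavily quantized end of the spectrum.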

What is all this llama business anyway?

  • llama is a series of large language models produced and released by Meta
  • llama.cpp is an open source application which runs inference (queries) on an LLM on disk, originally built to run Meta’s llama models.
  • ollama is a nice wrapper around llama.cpp which simplifies pulling and loading a variety of different models and serves them behind a small local HTTP API (see the sketch just after this list)
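
Because ollama serves everything on localhost (port 11434 by default), talking to it from code is about as simple as it gets. A minimal sketch, assuming the server is running and you’ve already pulled a model (e.g. ollama pull tinyllama):

```python
# Minimal query against a locally running ollama server.
# Assumes `ollama serve` is running and the model has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",   # ollama's default local endpoint
    json={
        "model": "tinyllama",                # any model you've pulled
        "prompt": "In one sentence, what is a GPU?",
        "stream": False,                     # one JSON blob instead of a token stream
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])               # the generated text
```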

I decided to jump into running ollama on some machines here at home to see what I can do – and learn.

Trying ollama on a CPU

I didn’t really understand how any of this worked until this summer, when I decided to try running ollama on a headless x86-64 machine I use to self-host apps like Home Assistant. It has an old, mediocre, 15-year-old CPU and no GPU available.  I was able to run tinyllama, which fits in about 1 GB of RAM, and it generated a few words of text per second.  It was rather underwhelming, and incredibly taxing on a machine that was busy doing other stuff.  But…it worked, so that’s something, right?
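
If you want to put a number on “a few words per second”, the non-streaming /api/generate response carries some timing metadata (eval_count for tokens generated and eval_duration in nanoseconds, as I understand the API), so you can compute tokens per second yourself:

```python
# Rough tokens-per-second measurement using ollama's response metadata.
# Assumes the non-streaming /api/generate response includes eval_count
# (tokens generated) and eval_duration (in nanoseconds).
import requests

data = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "tinyllama", "prompt": "Explain VRAM in two sentences.", "stream": False},
    timeout=600,
).json()

tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"generated {data['eval_count']} tokens at ~{tokens_per_sec:.1f} tokens/s")
```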

Wait, don’t I have a GPU in my basement?

Quite a few years back, before GPUs became popular for machine learning, they were popular for something else: mining crypto-currencies!  During the GPU shortage caused by this craze, I snatched up a pair of old AMD RX470 4GB GPUs and mined…well…just enough to pay for the hardware and the electricity.  They’ve been sitting dusty in my basement for more than five years.

These GPUs are the better part of 10 years old and not officially supported by the ollama project; however, a few side projects exist to make them work.  I ran with this one, and it worked! https://github.com/robertrosenbusch/gfx803_rocm/

I was so beside myself that I recorded how it compares to CPU-based model performance in another screencast:

Much like the CPU-based demo above, it’s nothing compared to what you get out of modern, cloud-based models, but it’s interesting, educational, and might be useful.  The difference between CPU mode and GPU mode is huge, and this is on rather antiquated hardware.
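
One thing worth checking when coaxing an unsupported GPU into service is whether the model actually landed in VRAM or quietly fell back to the CPU. The ollama ps command reports this, and as far as I can tell the same information is exposed by the /api/ps endpoint, so here’s a quick sketch (field names assumed from the API docs):

```python
# Quick check of where loaded models are living (system RAM vs GPU VRAM).
# Assumes ollama's /api/ps endpoint returns a "models" list whose entries
# carry "name", "size", and "size_vram" fields.
import requests

ps = requests.get("http://localhost:11434/api/ps", timeout=10).json()
for m in ps.get("models", []):
    total, in_vram = m["size"], m["size_vram"]
    pct_gpu = 100 * in_vram / total if total else 0
    print(f"{m['name']}: {in_vram / 2**30:.1f} GiB of {total / 2**30:.1f} GiB in VRAM ({pct_gpu:.0f}% GPU)")
```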

PS: Naturally, this old machine burns 35W so I power it off when I’m not using it, to save my precious solar energy for important things like making breakfast for my kids :)

Highlights – what I’ve learned

The most obvious thing: I knew that LLMs ran largely on GPUs, but I didn’t appreciate how much VRAM it takes just to load a model.

Through all my reading I’ve found that Apple apparently had a stroke of brilliance (or was it luck?) in sharing CPU and GPU memory on Apple silicon devices.  This means that a moderately spec’d MacBook can run some rather hefty LLMs entirely on the GPU!  I haven’t owned a Mac for a few years now, but I see a purchase coming, once I find a use case more useful than drawing large birds on a bicycle.

There are many, many models out there, far more than just the ones from the big-name providers, and new ones are released regularly. What I didn’t expect is that many models are published on exchange sites so you can download and try them yourself, most notably HuggingFace.co (great name). The leading models from Anthropic and OpenAI are not openly published, but many others are.
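
If you want to poke at one of those openly published models directly, the huggingface_hub Python package will fetch the weights for you. A hedged sketch, with the repository and file names purely illustrative; you’d substitute a real GGUF build that fits your hardware:

```python
# Download a GGUF model file from Hugging Face for use with llama.cpp / ollama.
# The repo_id and filename below are illustrative placeholders, not real artifacts.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="someone/some-model-GGUF",     # illustrative placeholder
    filename="some-model.Q4_K_M.gguf",     # illustrative placeholder
)
print(f"downloaded to {path}")
```

From what I’ve read, ollama can also import a local GGUF file via a Modelfile, but that’s a rabbit hole for another post.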

OpenRouter is cool. It’s an abstraction in front of all the major LLM API providers; it tags which models are free to use, which is handy, and it has privacy controls to make sure you don’t unwittingly send prompts to models that are permitted to learn from your inputs and outputs.
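
Because OpenRouter exposes an OpenAI-compatible endpoint, the standard openai Python client works against it by just swapping the base URL. A minimal sketch; the model slug is illustrative and you’d supply your own API key:

```python
# Minimal OpenRouter call through its OpenAI-compatible API surface.
# The model slug below is illustrative; check OpenRouter's catalog for real ones.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # OpenRouter's OpenAI-compatible endpoint
    api_key="sk-or-REPLACE_ME",                # your OpenRouter API key
)

resp = client.chat.completions.create(
    model="meta-llama/llama-3.1-8b-instruct",  # illustrative model slug
    messages=[{"role": "user", "content": "Give me one fun fact about llamas."}],
)
print(resp.choices[0].message.content)
```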

It’s great when paired with llm, a handy CLI utility (written by the ever-astute Simon Willison) for running LLMs at the command line.  Sometimes I just want a quick answer, and a terminal is cheaper than a browser tab!
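
For what it’s worth, llm is also usable as a Python library, not just from the shell. A tiny sketch, assuming you’ve already configured a key (for example with llm keys set openai) and with the model name here purely illustrative:

```python
# Using Simon Willison's llm package from Python instead of the CLI.
# Assumes the relevant API key is already configured (e.g. `llm keys set openai`)
# and that the model name below is one your install actually knows about.
import llm

model = llm.get_model("gpt-4o-mini")  # illustrative model name
response = model.prompt("Give me a one-line summary of what ollama does.")
print(response.text())
```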

What’s next?

I’ve been hacking endlessly with Claude Code…I’m curious to see if I can get Ollama Code working…we’ll see!

Until next time – may your models be grounded, and your prompts be precise!

