Improving LLM Resource Usage
Large Language Models (LLMs) can be run with less resources than we currently are doing. Smaller models are cheaper and faster to run, while simultaneously taking less time to train. However, they are not as proficient in general purpose tasks as the bigger LLMs. This stems from it's architecture and how it is trained, fine-tuned, and used in production which I cover here in LLM Architecture and Costs. Let's go over some ways to reduce the resource usage of your own LLM usage.
Use Smaller Models
The obvious answer to this is to use smaller models. Many models can be hosted on your local computer assuming you have enough RAM to load the weights and run the encoders. The problem with this is that performance is bottlenecked by your hardware, increasing latency if you are not running it on a powerful enough computer. Also, smaller models do worse in general purpose tasks. Out of the box, you are going to have a worse experience compared to running inference on a relatively smaller (but still very large) model like Claude Haiku.
Smaller model performance can be improved for your specific task through fine-tuning if you have a strong access pattern and very confined space. This comes at cost in time, power, etc for doing this training step. And you take this on as an ongoing need before you can use any other model. With the pace of models coming out both in the open source space and in the large provider space, this can be a drag on your business. Still worth noting, especially if you have high scale in terms of inference amounts! Model Distillation, below covers this in some more detail.
Context Management - Limit Prompt Size
The main control you as a customer have when using any given language model is how much context you give the model. LLMs have a token limit in the context window. The prompt can be only as long as the context window. This is usually a sufficiently large amount for many tasks, but the more tokens you use in the prompt, the more you are billed for and the longer your inference takes. Also, using very large contexts causes the "lost in the middle" problem where important pieces of the prompt can be missed in the middle, causing worse performance.
Limit your context to reduce your token usage and increase your performance.
Prompt Caching
This is a powerful technique to reduce the inference requirements when processing prompts by caching parts of the prompt and reusing the pre-computed output rather than continually re-processing it. The major providers tend to offer this in their APIs for you to apply to your systems:
This is to me the biggest thing that makes Claude Code such a useful tool. According to my analytics as visualized by the analytics tool I use, npx claude-code-templates@latest --analytics, I have ~98.5% cache hit rate on the cached prompts with Claude Code (500k input tokens vs 370 million cache tokens). Claude code is utilizing the prompt caching functionality through how it structures the prompts to a great effect.
If you have repeated patterns of prompts, define these (in code, skills, etc) and make use of prompt caching. It can drastically reduce your token cost and latency. Check out the docs for your provider and make sure you are taking advantage of what is offered and if there are ways to better structure prompts to hit the cache better.
Quantization
Quantization is the act of reducing the precision in the model weights either through truncating the floating point weights or through more complex processes to modify those weights to reduce the precision. Sometimes this is necessary even if the model is trained on a machine that can handle higher precision than the inference machine. This is very similar to lossy compression algorithms where the data is mostly preserved, but not guaranteed in total. By reducing the precision on the smaller bits, you can drastically reduce the cost in inference while mostly preserving the output and performance because you are doing many fewer arithmetic operations on the chip, but only where the precision matters the least. Consider quantization if running your own model for inference.
Model Distillation
Model Distillation is where you take a larger model (sometimes called "teacher" model) and use it to generate labelled training data for a smaller "student" model to be trained on. In this blog from google research, they use "chain of thought" techniques to create a dataset to train a smaller model using rationales as labels.
Chain of thought is where the model first describes it's "reasoning" (usually within
<thinking></thinking>tags) in how it comes to a conclusion, a process which may change the conclusion, itself. This is something that can be done with any model as a prompting technique, but some reasoning models come with this as a fine-tuned feature. Tags are used to make it easy to parse for display purposes in an LLM client, or in this case for creating labeled training data.
Distillation can be a great way to take a larger model and create a more useful and much more computationally efficient smaller model for inference. This process is itself computationally expensive requiring model inference to generate the set followed by smaller parameter model training, but for high-use applications with defined patterns, the tradeoffs might be worth it. Still much less intensive than training new LLMs with higher parameters and potentially less than fine-tuning such a high parameter model while improving results in a specialized domain. Before using this technique, check with your model provider on what permissions you may need (Anthropic requires written approval, for instance).
Conclusion
LLMs are inherently resource intensive systems due to their architecture. They work due to the massive scale of information used in training, and carry a sustained cost of inference which goes up the "bigger" the model is. Companies need to spend ever larger amounts of money for more chips, power, water access for cooling, access to datasets and storing those processed datasets. Training alone cannot be relied on, making grounding techniques essential. Prompt caching, quantization, and distillation are some techniques that are widely applied in LLM systems, but still cannot fully mitigate the cost of running and building ever larger systems.