The Ollama team released an awesome parallelisation update in v0.1.33 of the software. Two new parameters, OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS, let a single server handle concurrent work and make much fuller use of the GPU's compute and VRAM.
OLLAMA_NUM_PARALLEL
: Handle multiple requests simultaneously for a single model

OLLAMA_MAX_LOADED_MODELS
: Load multiple models simultaneously
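To make the two settings concrete, here is a small sketch of the workload they unlock: four requests across two models, all in flight at once against a server started as shown at the end of this section. The /api/generate endpoint and default port 11434 are Ollama's standard API; the model names llama3 and phi3 are only examples and assume those models have already been pulled.

```sh
#!/usr/bin/env sh
# Fire four requests concurrently: two models, two prompts each.
# With OLLAMA_NUM_PARALLEL=4 and OLLAMA_MAX_LOADED_MODELS=4 these
# run side by side instead of queueing behind one another.
for model in llama3 phi3; do              # example models, assumed pulled
  for prompt in "Why is the sky blue?" "Write a haiku about GPUs."; do
    curl -s http://localhost:11434/api/generate \
      -d "{\"model\": \"$model\", \"prompt\": \"$prompt\", \"stream\": false}" &
  done
done
wait                                      # block until all requests return
```

With the old defaults (one request at a time, one loaded model) the same loop still completes, but requests are processed sequentially and loading the second model evicts the first from VRAM.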
The settings can be applied either as environment variables or passed inline with the serve command:
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve
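If Ollama runs as a systemd service (the standard Linux install) rather than from an interactive shell, the same values can be made persistent through a unit override; a minimal sketch, assuming the service is named ollama:

```sh
# Open an override file for the service:
sudo systemctl edit ollama
# In the editor, add under the [Service] section:
#   Environment="OLLAMA_NUM_PARALLEL=4"
#   Environment="OLLAMA_MAX_LOADED_MODELS=4"
# Then reload and restart so the new values take effect:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```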