Concurrency for Ollama Server

The Ollama team shipped a great concurrency update in v0.1.33. Two new parameters, OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS, let a single server handle requests in parallel and make fuller use of the GPU's compute and VRAM.

  • OLLAMA_NUM_PARALLEL: Handle multiple requests simultaneously for a single model
  • OLLAMA_MAX_LOADED_MODELS: Load multiple models simultaneously

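To see OLLAMA_NUM_PARALLEL in action, you can fan several requests out to one server at once. The sketch below is an illustration, not an official client: it assumes a server started with OLLAMA_NUM_PARALLEL=4 on the default port 11434, and "llama3" stands in for whatever model you have pulled.

```python
"""Sketch: send several prompts to one Ollama server concurrently.

Assumes the server was started with OLLAMA_NUM_PARALLEL=4 and listens on
the default port; the model name "llama3" is an example placeholder.
"""
from concurrent.futures import ThreadPoolExecutor
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint


def make_payload(model: str, prompt: str) -> dict:
    # stream=False asks for one JSON object instead of a chunked stream
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(make_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]


if __name__ == "__main__":
    prompts = [
        "Why is the sky blue?",
        "What is VRAM?",
        "Define concurrency in one sentence.",
        "Name a GPU vendor.",
    ]
    # With OLLAMA_NUM_PARALLEL=4 these four requests are served in
    # parallel instead of queueing one behind the other.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for answer in pool.map(lambda p: generate("llama3", p), prompts):
            print(answer[:80])
```

Without the parallel setting, the same four requests would be processed one at a time against the single loaded model.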
The settings can be applied either as environment variables or inline with the serve command, for example:

OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve
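On Linux installs where Ollama runs as a systemd service, the same variables can be set persistently with a drop-in override (created via systemctl edit ollama.service). The fragment below is a sketch mirroring the one-liner above; the file path is the conventional override location, not something the update mandates:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=4"
```

After saving the override, reload systemd and restart the service (systemctl daemon-reload, then systemctl restart ollama) so the new environment takes effect.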