To run the worker, you need Docker installed, including the NVIDIA container runtime if you want to use a GPU. We provide a convenience script you can download and run to start the worker:
```bash
curl -sL https://raw.githubusercontent.com/LAION-AI/Open-Assistant/main/inference/worker/run_worker_container.sh | bash
```

This will download the latest version of the worker and start it.
You can configure the script by setting the following environment variables (they go before the `bash`; see the example after this list):

- `IMAGE_TYPE` (default: `full`): Set to `llama` for llama models.
- `CUDA_VISIBLE_DEVICES` (default: `0,1,2,3,4,5,6,7`): Set to the GPUs you want to use.
- `MODEL_CONFIG_NAME`: Set to the name of the model config you want to use, see here, for example `OA_SFT_Llama_30Bq`.
- `API_KEY`: Set to the API key you want to use for the worker.
- `MAX_PARALLEL_REQUESTS` (default: `1`): Set to the maximum number of parallel requests the worker should handle. Only set this if you know what you're doing.
- `BACKEND_URL` (default: `wss://inference.prod2.open-assistant.io`): Set to the URL of the backend websocket endpoint you want to connect to.
- `LOGURU_LEVEL` (default: `INFO`): Set to the log level you want to use.
- `OAHF_HOME` (default: `$HOME/.oasst_cache/huggingface`): Set to the directory where you want to store the Hugging Face cache. New files in it will be owned by root, so be careful if you point this at your own `$HOME/.cache/huggingface` (though doing so can save disk space).
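For example, to run a quantized llama worker, the invocation could look like this (the API key value is a placeholder; use the key you were issued):

```bash
# Environment variables go between the pipe and `bash`.
curl -sL https://raw.githubusercontent.com/LAION-AI/Open-Assistant/main/inference/worker/run_worker_container.sh \
  | IMAGE_TYPE=llama MODEL_CONFIG_NAME=OA_SFT_Llama_30Bq API_KEY=your-api-key bash
```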
Here is how to tell whether your GPU can run a given model config: take the model's number of parameters in billions and multiply it by 2.5, or by 1.25 if the config name ends in "q" (quantized to `int8`). That number is the minimum amount of GPU memory in gigabytes you need. For example, the `OA_SFT_Llama_30B` model config has 30 billion parameters, so it needs at least 75 GB of memory, while the `OA_SFT_Llama_30Bq` config only needs around 40 GB thanks to the quantization.
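If you want to check this rule of thumb against your hardware, here is a minimal sketch (assuming `nvidia-smi` and `bc` are available; the parameter count and factor are the numbers you plug in for your chosen config):

```bash
# Rule-of-thumb check: does GPU 0 have enough memory for the model?
PARAMS_B=30    # parameters in billions (e.g. 30 for OA_SFT_Llama_30Bq)
FACTOR=1.25    # 2.5 for full-precision configs, 1.25 for "q" (int8) configs
NEEDED_GB=$(echo "$PARAMS_B * $FACTOR" | bc)
# nvidia-smi reports total memory in MiB; take the first GPU
GPU_MIB=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)
echo "Model needs ~${NEEDED_GB} GB; GPU 0 has $((GPU_MIB / 1024)) GiB"
```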
If you have a lot of spare GPU memory compared to what your model needs, you can increase `MAX_PARALLEL_REQUESTS` to improve the worker's throughput. We have an OOM test program in the worker to figure out how far you can go, but it's best to contact us on Discord if you want to do that.
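As a sketch, a worker with plenty of headroom might be started like this (the value 4 is purely illustrative, not a recommendation):

```bash
# Same convenience script as above, with more parallel requests
curl -sL https://raw.githubusercontent.com/LAION-AI/Open-Assistant/main/inference/worker/run_worker_container.sh \
  | MODEL_CONFIG_NAME=OA_SFT_Llama_30Bq API_KEY=your-api-key MAX_PARALLEL_REQUESTS=4 bash
```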