OpenAssistant Workers

Running the worker

To run the worker, you need Docker installed, including the NVIDIA container runtime if you want to use a GPU. We provide a convenience script you can download and run to start the worker:

curl -sL https://raw.githubusercontent.com/LAION-AI/Open-Assistant/main/inference/worker/run_worker_container.sh | bash

This will download the latest version of the worker and start it.
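Before starting, you may want to check that Docker can actually see your GPU. One way to do this is to run nvidia-smi inside a CUDA base image (this check is just a suggestion and not part of the worker; the image tag below is only an example and may differ on your system):

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi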

You can configure the script by setting the following environment variables (they go before the bash at the end of the command; see the example after this list):

  • IMAGE_TYPE (default: full): Set to llama for llama models
  • CUDA_VISIBLE_DEVICES (default: 0,1,2,3,4,5,6,7): Set to the GPU you want to use
  • MODEL_CONFIG_NAME: Set to the name of the model config you want to use (see the model configs defined in the Open-Assistant repository), for example OA_SFT_Llama_30Bq.
  • API_KEY: Set to the API key you want to use for the worker
  • MAX_PARALLEL_REQUESTS (default: 1): Set to the maximum number of parallel requests the worker should handle. Only change this if you know what you're doing.
  • BACKEND_URL (default: wss://inference.prod2.open-assistant.io): Set to the URL of the backend websocket endpoint you want to connect to
  • LOGURU_LEVEL (default: INFO): Set to the log level you want to use.
  • OAHF_HOME (default: $HOME/.oasst_cache/huggingface): Set to the directory where you want to store the Hugging Face cache. New files in this directory will be owned by root, so be careful if you point it at your own $HOME/.cache/huggingface (though doing so can save disk space).
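For example, to run a quantized LLaMA worker on a single GPU with your own API key, you could invoke the script like this (the model config name and the key are placeholders, substitute your own values):

curl -sL https://raw.githubusercontent.com/LAION-AI/Open-Assistant/main/inference/worker/run_worker_container.sh | IMAGE_TYPE=llama MODEL_CONFIG_NAME=OA_SFT_Llama_30Bq CUDA_VISIBLE_DEVICES=0 API_KEY=your-api-key bash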

Choosing a model config

Here is how to know whether your GPU supports a model config: take the number of parameters of the model in billions and multiply it by 2.5, or by 1.25 if the model config ends in "q" (quantized to int8). That number is the minimum amount of GPU memory in gigabytes you need. For example, the OA_SFT_Llama_30B model config has 30 billion parameters, so it needs at least 75 GB of memory, while the OA_SFT_Llama_30Bq model only needs roughly 40 GB thanks to the quantization.
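The same rule of thumb as a small shell sketch (the function name is purely illustrative and not part of the worker):

estimate_gpu_mem() {
  # usage: estimate_gpu_mem <parameters in billions> [q]
  # pass "q" for an int8-quantized config (name ends in "q")
  local params=$1 factor=2.5
  [ "$2" = "q" ] && factor=1.25
  awk -v p="$params" -v f="$factor" 'BEGIN { printf "%.1f GB minimum\n", p * f }'
}

estimate_gpu_mem 30      # 75.0 GB minimum  (OA_SFT_Llama_30B)
estimate_gpu_mem 30 q    # 37.5 GB minimum  (OA_SFT_Llama_30Bq)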

Choosing MAX_PARALLEL_REQUESTS

If you have a lot of spare GPU memory compared to what your model needs, you can increase MAX_PARALLEL_REQUESTS to raise the worker's throughput, for example as shown below. The worker includes an OOM test program to figure out how far you can go, but it's best to contact us on Discord if you want to do that.
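As an illustration (the value 4 is arbitrary, not a recommendation), you would pass the variable the same way as the other settings:

curl -sL https://raw.githubusercontent.com/LAION-AI/Open-Assistant/main/inference/worker/run_worker_container.sh | MODEL_CONFIG_NAME=OA_SFT_Llama_30Bq MAX_PARALLEL_REQUESTS=4 bash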