Installation with CPU
Aphrodite implements CPU support using multiple backends. The most performant is OpenVINO, but we also support inference via IPEX (Intel Extension for PyTorch). The supported architectures are AVX2, AVX512, and PPC64LE.
The only CPU backend that supports quantization is OpenVINO, which can quantize FP16 Hugging Face models to INT8 at load time.
OpenVINO Backend
Requirements
- Linux
 - Python 3.9 - 3.11
- Instruction set architecture: at least AVX2 (see the check below)
 
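To confirm that your CPU supports the required instruction sets, you can inspect /proc/cpuinfo:

```sh
# Prints the SIMD flags your CPU advertises; look for avx2 or avx512f.
grep -oE 'avx2|avx512[a-z0-9]*' /proc/cpuinfo | sort -u
```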
Dockerfile
```sh
docker build -f Dockerfile.openvino -t aphrodite-openvino .
docker run -it --rm aphrodite-openvino
```

Building from Source
First, install Python. On Ubuntu 22.04 machines, you can run:
```sh
sudo apt-get update
sudo apt-get install python3
```

Then, install the requirements for Aphrodite:
```sh
python3 -m pip install -U pip
python3 -m pip install -r requirements/build.txt --extra-index-url https://download.pytorch.org/whl/cpu
```

Finally, install Aphrodite:
```sh
PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" APHRODITE_TARGET_DEVICE=openvino python3 -m pip install -e .
```

Performance tips for OpenVINO
The OpenVINO backend uses the following environment variables:
- `APHRODITE_OPENVINO_KVCACHE_SPACE`: specifies the KV cache size in GB (e.g., 40 means 40 GB of KV cache space). Larger values allow more parallel requests, but the space is taken from RAM, so set it with care.
- `APHRODITE_OPENVINO_CPU_KV_CACHE_PRECISION=u8`: sets the KV cache precision to INT8. By default, FP16/BF16 is used.
- `APHRODITE_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON`: enables INT8 weight compression during model loading. By default, this is turned on. You can also export your model with different compression techniques using `optimum-cli` and pass the exported folder as the model ID to Aphrodite (a sketch follows below).
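For example, an offline export with INT8 weight compression might look like the following. This is a sketch using optimum-intel's `optimum-cli`; the model ID and output directory are placeholders, and the exact flags can vary across optimum-intel versions:

```sh
# Sketch: export a Hugging Face model to OpenVINO IR with INT8 weights.
# Requires optimum-intel (e.g. `pip install optimum[openvino]`).
optimum-cli export openvino --model <hf-model-id> --weight-format int8 ./my-openvino-model
```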
To enable further performance improvements, use `--enable-chunked-prefill`. The recommended batch size for chunked prefill in OpenVINO is `--max-num-batched-tokens 256`.
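Putting these settings together, a launch might look like the following sketch. The `aphrodite run` entry point and the model placeholder are assumptions; adapt them to your deployment:

```sh
# Illustrative launch; adjust the entry point, model, and cache size to your setup.
export APHRODITE_OPENVINO_KVCACHE_SPACE=40           # 40 GB of KV cache space
export APHRODITE_OPENVINO_CPU_KV_CACHE_PRECISION=u8  # INT8 KV cache
aphrodite run <your-model> --enable-chunked-prefill --max-num-batched-tokens 256
```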
Limitations
- LoRA is not supported.
- Only decoder-only LLMs are supported; vision and embedding models are not.
- Tensor and pipeline parallelism are not supported.
 
CPU Backend
We also support basic CPU inference for x86_64 platforms. The only supported data types are FP32 and BF16.
Requirements
- Linux (or WSL on Windows)
 - Compiler: gcc/g++ >= 12.3.0
- Instruction set: AVX2 or AVX512 (recommended)
 
Dockerfile
```sh
docker build -f Dockerfile.cpu -t aphrodite-cpu --shm-size=4g .
docker run -it \
  --rm \
  --network=host \
  --ipc=host \
  -p 2242:2242 \
  aphrodite-cpu
```

Optionally, pin the container to specific cores and memory nodes with `--cpuset-cpus=<cpu-id-list>` and `--cpuset-mems=<memory-node>`.
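Once the container is up, you can sanity-check it from the host. This assumes Aphrodite's OpenAI-compatible API is listening on the mapped port 2242:

```sh
# Should return a JSON list of the models the container is serving.
curl http://localhost:2242/v1/models
```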
Building from Source

First, install the compiler to avoid potential issues. On Ubuntu 22.04:
```sh
sudo apt-get update
sudo apt-get install -y gcc-12 g++-12 libnuma-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
```

Then, install the requirements:
```sh
pip install -U pip
pip install wheel packaging ninja "setuptools>=49.4.0" numpy
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
```

Finally, install Aphrodite:
```sh
APHRODITE_TARGET_DEVICE=cpu python setup.py install
```
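To verify the build, a quick import check should succeed (this assumes the package exposes `__version__`; adjust if it does not):

```sh
# Fails loudly if the CPU build did not install correctly.
python -c "import aphrodite; print(aphrodite.__version__)"
```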
Intel Extension for PyTorch

You can massively boost the performance of the CPU backend by installing IPEX. Installation instructions are provided in Dockerfile.cpu.
Performance tips
- The Aphrodite CPU backend uses the environment variable `APHRODITE_CPU_KVCACHE_SPACE` to specify the KV cache size in GB.
- We highly recommend using TCMalloc for high-performance memory allocation and better cache locality. For example, on Ubuntu 22.04 you'd run:

```sh
sudo apt-get install libtcmalloc-minimal4
# Locate the shared library (typically under /usr/lib/x86_64-linux-gnu/).
sudo find / -name "*libtcmalloc*"
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD
```

- The CPU backend uses OpenMP for thread-parallel computation. For the best performance, isolate the CPU cores running OpenMP threads from other thread pools (such as a web service's event loop) to avoid CPU oversubscription.
 - If using Aphrodite CPU backend on bare-metal, it’s recommended to disable hyper-threading.
- If using the Aphrodite CPU backend on a multi-socket machine with NUMA, make sure to pin CPU cores and memory nodes to avoid remote memory node access; `numactl` is a useful tool for this (see the sketch below).
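A sketch illustrating the hyper-threading, NUMA, and KV cache tips above; the node IDs, cache size, and the `aphrodite run` entry point are placeholder assumptions for your setup:

```sh
# Check the NUMA topology first.
numactl --hardware
# Disable SMT (hyper-threading) at runtime, on kernels that expose this knob.
echo off | sudo tee /sys/devices/system/cpu/smt/control
# Pin Aphrodite to NUMA node 0's cores and memory, with 40 GB of KV cache.
export APHRODITE_CPU_KVCACHE_SPACE=40
numactl --cpunodebind=0 --membind=0 aphrodite run <your-model>
```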