Getting Started
Aphrodite can be used for several purposes, and can be run in several ways. This guide will show you how to:
- launch an OpenAI-compatible API server,
- run batched inference on a dataset,
- and build an API server for an LLM yourself.
Be sure to read the installation instructions for your device before continuing.
OpenAI API server
Aphrodite implements the OpenAI API protocol with almost perfect feature parity, so it can be used as a drop-in replacement for nearly any application that uses the OpenAI API. Aphrodite launches the server at http://localhost:2242 by default, making the base URL http://localhost:2242/v1.
The server currently runs one model at a time, and implements the following endpoints:
- /v1/models: Shows a list of the available models. This can include the primary LLM, and adapters (e.g. LoRA).
- /v1/completions: Provides a POST endpoint to send text completion requests to. The model field in the body is mandatory.
- /v1/chat/completions: Provides a POST endpoint to send chat completion requests to. The model field in the body is mandatory.
There are two ways to start the server: using a YAML config file, or the CLI. In this guide, we assume you want to run the Meta-Llama-3.1-8B-Instruct model on 2 GPUs.
CLI
Start the server:
```bash
export HUGGINGFACE_HUB_TOKEN=<your hf token>  # only if using private or gated repos
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct -tp 2
```
To see the full list of supported arguments, run `aphrodite run -h`.
By default, the server will use the chat template (for /v1/chat/completions) stored in the model’s tokenizer. You can override this by adding the --chat-template argument:
```bash
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct -tp 2 --chat-template ./examples/chat_templates/chatml.jinja
```
You may also provide direct download URLs to the argument.
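For instance (the URL below is only a placeholder for wherever your template file is hosted):
```bash
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct -tp 2 --chat-template https://example.com/templates/chatml.jinja
```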
You can launch the server with API key authentication enabled by either exporting the APHRODITE_API_KEY environment variable, or by passing your key to the --api-keys argument.
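For example, here’s a minimal sketch of launching with a key and sending an authenticated request. The key value is a placeholder, and the request assumes the server expects the usual OpenAI-style Authorization: Bearer header:
```bash
# launch the server with an API key (placeholder value)
export APHRODITE_API_KEY=sk-example-key
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct -tp 2

# query it; assumes the standard OpenAI-style Bearer header
curl http://localhost:2242/v1/models \
  -H "Authorization: Bearer sk-example-key"
```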
YAML Config
Aphrodite allows its users to define a YAML config for easier repeated launches of the engine. We provide an example here. You can use this to get started by filling out the fields with your required parameters. Here’s how launching Llama-3.1-8B-Instruct would look:
```yaml
basic_args:
  # Your model name. Can be a local path or huggingface model ID
  - model: meta-llama/Meta-Llama-3.1-8B-Instruct
  # The tensor parallelism degree. Set this to the number of GPUs you have.
  # Keep in mind that for **quantized** models, this will typically only work
  # with values of 1, 2, 4, or 8.
  - tensor_parallel_size: 2
```
You can save this to a config.yaml file, then launch Aphrodite like this:
```bash
aphrodite yaml config.yaml
```
As noted in the sample config, tensor parallelism for quantized models typically only works with 1, 2, 4, or 8 GPUs. If you need to launch a quantized model on 3, 5, 6, or 7 GPUs, it’s recommended to use pipeline_parallel_size instead.
Example Usage
Query the /v1/models endpoint like this:
```bash
curl http://localhost:2242/v1/models | jq .
```
Send a prompt and request completion tokens like this:
```bash
curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Once upon a time",
    "max_tokens": 128,
    "temperature": 1.1,
    "min_p": 0.1
  }' | jq .
```
These curl commands assume you have jq installed to prettify the output JSON in the terminal. If you don’t wish to install it, or don’t have it installed already, remove the | jq . at the end of each command.
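The /v1/chat/completions endpoint can be queried the same way. The request below is a minimal sketch that assumes the standard OpenAI chat request format; the message content is just an illustration:
```bash
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Tell me a short story."}
    ],
    "max_tokens": 128,
    "temperature": 1.1
  }' | jq .
```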
You can also use the endpoints via the openai python library:
```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:2242/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="Once upon a time",
    temperature=1.1,
    extra_body={"min_p": 0.1},
)

print("Completion result:", completion)
```
Offline Batched Inference
You can use Aphrodite to process large datasets, or generate large amounts of data using a list of inputs.
To get started, first import the LLM and SamplingParams classes from Aphrodite. LLM is used to create a model object, and SamplingParams is used to define the sampling parameters for the requests.
```python
from aphrodite import LLM, SamplingParams
```
Then you can define your prompt list and sampling params. For the sake of simplicity, we won’t load a dataset but rather define a few hardcoded prompts here:
```python
prompts = [
    "Once upon a time",
    "A robot may hurt a human if",
    "To get started with HF transformers,",
]

sampling_params = SamplingParams(temperature=1.1, min_p=0.1)
```
Now, initialize the engine using the LLM class with your model of choice. We will use Meta-Llama-3.1-8B-Instruct on 2 GPUs for this example.
```python
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", tensor_parallel_size=2)
```
The LLM class has a generate() method that we can now use. It adds the input prompts to the engine’s waiting queue and executes them in parallel to generate outputs with high throughput. The outputs are returned as a list of RequestOutput objects, which include all the output tokens.
```python
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r},\nGenerated text: {generated_text!r}")
```
Please view the examples/ directory for a full list of examples covering various use-cases.