Low-rank Adaptation (LoRA)
Aphrodite allows loading LoRA adapters on top of supported LLMs.
We use the method proposed in S-LoRA: Serving Thousands of Concurrent LoRA Adapters to efficiently serve LoRA adapters to thousands of users.
Python Usage
We can load LoRA adapters directly from Hugging Face or from disk. They're served on a per-request basis.
from aphrodite import LLM, SamplingParams
from aphrodite.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf",
          enable_lora=True)
With this, we've launched the model with LoRA support enabled. We can now submit our prompts and call llm.generate() with the lora_request parameter. The first parameter of LoRARequest is a human-identifiable name, the second is a globally unique ID for the adapter, and the third is the path to the LoRA adapter.
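The sql_lora_path used below must point at the adapter files on disk. One way to obtain it is to download the adapter from Hugging Face first; a minimal sketch, assuming huggingface_hub is installed and that the adapter lives in the alpindale/l2-lora-test repository referenced later on this page (a plain local directory works just as well):

from huggingface_hub import snapshot_download

# Download the adapter and keep the local path for the LoRARequest below.
# Assumption: huggingface_hub's snapshot_download; any local adapter path works.
sql_lora_path = snapshot_download(repo_id="alpindale/l2-lora-test")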
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["[/assistant]"]
)
prompts = [
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("alpindale/l2-lora-test", 1, sql_lora_path)
)
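To inspect the generations, we can iterate over the returned outputs. A short sketch, assuming the usual request-output objects that expose the original prompt and a list of completions:

for output in outputs:
    # Each result carries the prompt it was generated for and its completions.
    print("Prompt:", output.prompt)
    print("Generated:", output.outputs[0].text)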
Serving with OpenAI API server
LoRA adapters can be served with our OpenAI-compatible API server. To do so, specify --lora-modules {name}={path} {name}={path} ... in the CLI.
aphrodite run meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules l2-lora-test=alpindale/l2-lora-test
The YAML config:
model: meta-llama/Llama-2-7b-hf
lora_modules:
- l2-lora-test=alpindale/l2-lora-test
The server accepts all LoRA configuration parameters (max_loras, max_lora_rank, max_cpu_loras), which will apply to all forthcoming requests.
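For example, these can be set when the server is launched. A sketch with illustrative values; the hyphenated flag spellings of these engine arguments are an assumption:

aphrodite run meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules l2-lora-test=alpindale/l2-lora-test \
    --max-loras 4 \
    --max-lora-rank 64 \
    --max-cpu-loras 8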
Upon querying the /v1/models endpoint, you should see the LoRA adapter along with its base model:
{
  "object": "list",
  "data": [
    {
      "id": "meta-llama/Llama-2-7b-hf",
      "object": "model",
      ...
    },
    {
      "id": "l2-lora-test",
      "object": "model",
      ...
    }
  ]
}
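For reference, the listing above can be fetched with a plain GET request; the port here follows the completion example below:

curl http://localhost:2242/v1/models | jq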
Requests can specify the LoRA adapter as if it were any other model via the model request parameter. The requests will be processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and potentially other LoRA adapter requests if they were provided and max_loras is set high enough).
Example curl request:
curl http://localhost:2242/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "l2-lora-test",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }' | jq
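The same completion can be requested through the official openai Python client. A sketch, where the base URL mirrors the curl example above and the API key is a placeholder (adjust it if the server was started with authentication):

from openai import OpenAI

# Point the client at the local Aphrodite server (port taken from the curl
# example above); the API key is a placeholder assumption.
client = OpenAI(base_url="http://localhost:2242/v1", api_key="EMPTY")

completion = client.completions.create(
    model="l2-lora-test",          # the LoRA adapter name registered above
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)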