Low-Rank Adaptation (LoRA)

Aphrodite allows loading LoRA adapters on top of supported LLMs.

We use the method proposed in S-LoRA: Serving Thousands of Concurrent LoRA Adapters to serve LoRA adapters efficiently to thousands of users.

Python Usage

We can load LoRA adapters directly from Hugging Face or from disk; adapters are applied on a per-request basis. In the example below, we download the alpindale/l2-lora-test adapter (the same one used in the server example further down) and pass its local path with each request.

py
from huggingface_hub import snapshot_download
from aphrodite import LLM, SamplingParams
from aphrodite.lora.request import LoRARequest

# Download the LoRA adapter used in the generate() call below.
sql_lora_path = snapshot_download(repo_id="alpindale/l2-lora-test")

llm = LLM(model="meta-llama/Llama-2-7b-hf",
          enable_lora=True)

With this, we've launched the model with LoRA support enabled. We can now submit prompts and call llm.generate() with the lora_request parameter. The first parameter of LoRARequest is a human-identifiable name, the second is a globally unique ID for the adapter, and the third is the path to the LoRA adapter.

py
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["[/assistant]"]
)

prompts = [
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]",
]

outputs = llm.generate(
    prompts,
    sampling_params,
    # LoRARequest(name, globally unique adapter ID, path to the adapter)
    lora_request=LoRARequest("alpindale/l2-lora-test", 1, sql_lora_path)
)
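
The generate() call returns one result per prompt. Here is a minimal sketch for printing the completions (assuming the usual RequestOutput structure, with the original prompt and a list of generated outputs per request):

py
# Print the prompt and the generated SQL for each request.
for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)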

Serving with OpenAI API server

LoRA adapters can be served with our OpenAI-compatible API server. To do so, specify --lora-modules {name}={path} {name}={path} ... in the CLI.

sh
aphrodite run meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules l2-lora-test=alpindale/l2-lora-test

The equivalent YAML config:

yaml
model: meta-llama/Llama-2-7b-hf
lora_modules:
    - l2-lora-test=alpindale/l2-lora-test

The server accepts all LoRA configuration parameters (max_loras, max_lora_rank, max_cpu_loras), which apply to all subsequent requests. When you query the /v1/models endpoint, you should see the LoRA adapter listed alongside its base model:

json
{
    "object": "list",
    "data": [
        {
            "id": "meta-llama/Llama-2-7b-hf",
            "object": "model",
            ...
        },
        {
            "id": "l2-lora-test",
            "object": "model",
            ...
        }
    ]
}
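
You can also list the served models programmatically, for example with the requests library (a minimal sketch, assuming the server listens on localhost:2242 as in the curl example below):

py
import requests

# List the served models; LoRA adapters appear alongside the base model.
models = requests.get("http://localhost:2242/v1/models").json()
for model in models["data"]:
    print(model["id"])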

Requests can specify the LoRA adapter as if it were any other model via the model request parameter. The requests will be processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and potentially other LoRA adapter requests if they were provided and max_loras is set high enough).

Example curl request:

sh
curl http://localhost:2242/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "l2-lora-test",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }' | jq
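
The same completion request can also be made with the OpenAI Python client (a minimal sketch, assuming the openai package is installed and the server is reachable at localhost:2242):

py
from openai import OpenAI

# Point the client at the Aphrodite server; the API key can be any
# placeholder unless the server enforces one.
client = OpenAI(base_url="http://localhost:2242/v1", api_key="EMPTY")

completion = client.completions.create(
    model="l2-lora-test",  # name of the LoRA adapter registered above
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)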