How we compared S-LoRA, LoRAX, and vLLM PEFT
The comparison across S-LoRA, LoRAX, and vLLM hinges on five criteria that actually determine whether a serving stack survives contact with a real workload: how many adapters it can serve concurrently, how it batches heterogeneous requests, how hard it is to deploy, how production-ready its operational surface is, and what hardware it demands. Throughput headlines are secondary unless they come with those qualifiers.
The LMSYS team frames the scale problem directly: "Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude." That claim is benchmark-specific and compares against baselines that predate current vLLM LoRA optimizations — treat it as a directional indicator, not a universal guarantee.
| Criterion | S-LoRA | LoRAX | vLLM PEFT |
|---|---|---|---|
| Adapter scale target | Thousands concurrent | Thousands on one GPU | Per-request, unspecified ceiling |
| Batching model | Heterogeneous batching | Heterogeneous continuous batching | Standard continuous batching |
| Deployment complexity | High (custom CUDA-heavy stack) | Medium (Docker + NVIDIA Container Toolkit) | Low (native vLLM extension) |
| Production ops surface | Research-grade | Docker, Helm, Prometheus, OpenTelemetry | vLLM's existing ops surface |
| Primary differentiator | Unified paging + throughput | Dynamic loading + multi-tenant tooling | Simplicity inside existing vLLM |
At a glance: adapter count, batching, and deployment fit
LoRAX and vLLM differ most on where adapter management lives. vLLM's official LoRA docs describe a model where "LoRA adapters can be used with any vLLM model that implements SupportsLoRA" and "adapters can be efficiently served on a per-request basis with minimal overhead" — adapter support is native to the runtime but not architecturally specialized for multi-tenant scale. LoRAX takes the opposite approach: it is purpose-built for "serving thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency", with adapter exchange scheduling and heterogeneous continuous batching as first-class mechanisms.
| Feature | LoRAX | vLLM PEFT |
|---|---|---|
| Dynamic adapter loading | Yes, just-in-time per request | No hot-swap; adapters pre-registered |
| Batching model | Heterogeneous continuous batching | Standard PagedAttention continuous batching |
| OpenAI-compatible API | Yes | Yes |
| Multi-tenant isolation | Per-request tenant isolation | Not explicitly documented |
| Additional runtime deps | Docker, NVIDIA Container Toolkit | None beyond vLLM itself |
Which stack is optimized for thousands of adapters?
Both S-LoRA and LoRAX target thousand-adapter scale, but via different mechanisms and with different operational costs. S-LoRA's arXiv title states the goal directly: "Serving Thousands of Concurrent LoRA Adapters", achieved through unified paging and custom CUDA kernels that schedule adapter weights alongside KV cache. LoRAX targets the same scale from a product angle: "a framework that allows users to serve thousands of fine-tuned models on a single GPU", with dynamic loading as the key mechanism — adapters arrive just-in-time without blocking concurrent requests.
| Stack | Adapter scale design target | Mechanism | Ops complexity |
|---|---|---|---|
| S-LoRA | Thousands concurrent | Unified paging, heterogeneous batching, tensor parallelism | High — research-grade deployment |
| LoRAX | Thousands on a single GPU | Dynamic JIT loading, heterogeneous continuous batching | Medium — containerized production stack |
| vLLM PEFT | Per-request, no stated ceiling | Native SupportsLoRA model integration | Low — zero added deps |
If your workload genuinely operates hundreds to thousands of distinct adapters concurrently, both S-LoRA and LoRAX are defensible choices. S-LoRA maximizes raw throughput efficiency at the cost of systems complexity; LoRAX packages comparable scale in a deployable product.
Which option is the lightest path if you already run vLLM?
vLLM PEFT requires no additional runtime, no container toolkit changes, and no new deployment surface. Adapter support is built into vLLM's existing model abstraction — any model implementing SupportsLoRA can serve adapters per-request with what vLLM describes as "minimal overhead." If your infrastructure already runs vLLM and your adapter count fits within what vLLM can pre-register, there is no operational reason to add LoRAX or S-LoRA.
| Operational factor | vLLM PEFT | Alternative (LoRAX / S-LoRA) |
|---|---|---|
| New runtime required | No | Yes |
| Additional container deps | No | NVIDIA Container Toolkit (LoRAX); custom CUDA stack (S-LoRA) |
| Adapter hot-loading | No | Yes (LoRAX); limited (S-LoRA) |
| Observability tooling | vLLM's existing | LoRAX adds Prometheus + OpenTelemetry |
| Migration cost from vLLM | Zero | Medium to high |
The vLLM docs do not define a hard adapter-count ceiling, so the practical limit is VRAM: base model footprint plus the adapter weight footprint for however many adapters you hold in memory simultaneously.
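As a concrete sketch of that workflow — the base model name and adapter paths below are placeholders, but `--enable-lora` and `--lora-modules` are vLLM's documented flags — adapters are registered at server startup and selected per request through the OpenAI-compatible API:

```shell
# Start the OpenAI-compatible server with two pre-registered adapters
vllm serve meta-llama/Llama-2-7b-hf \
  --enable-lora \
  --lora-modules legal-v1=/adapters/legal-v1 support-v1=/adapters/support-v1

# A request selects an adapter by passing its registered name as the model
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "legal-v1", "prompt": "Draft a confidentiality clause:", "max_tokens": 64}'
```

Changing the adapter set means restarting the server with a different `--lora-modules` list — which is exactly the static-catalog constraint described above.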
S-LoRA: when research-grade adapter scheduling pays off
S-LoRA is the right answer when serving throughput at multi-adapter scale is the dominant requirement and your team can absorb the systems engineering cost. The LMSYS benchmark claims — up to 4× throughput improvement and several-orders-of-magnitude more served adapters versus HuggingFace PEFT and naive vLLM LoRA serving — are the strongest source-backed throughput numbers in this comparison. Those gains are tied specifically to the multi-adapter high-concurrency regime; they do not describe single-adapter or low-concurrency serving.
| S-LoRA capability | Source-backed claim |
|---|---|
| Throughput vs. PEFT / naive vLLM | Up to 4× improvement |
| Adapter scale vs. PEFT / naive vLLM | Several orders of magnitude more adapters |
| Core mechanism | Unified paging for KV cache + adapter weights |
| Parallelism support | Tensor parallelism |
| Batching model | Heterogeneous batching |
Unified paging and heterogeneous batching
S-LoRA's throughput advantage is mechanistic, not incidental. The LMSYS system "is designed for scalable serving of many LoRA adapters using unified paging for KV cache and adapter weights, heterogeneous batching, and tensor parallelism". Unified paging extends the PagedAttention memory management model to cover adapter weight storage alongside KV cache — both are treated as pageable memory pools rather than statically allocated buffers. Heterogeneous batching allows requests that use different adapters to share the same batch, eliminating the serialization penalty of adapter-per-batch approaches.
| Mechanism | What it solves | Benefit regime |
|---|---|---|
| Unified paging (KV + adapter) | Memory fragmentation across adapters | High adapter count, variable sequence length |
| Heterogeneous batching | Adapter-switching serialization | Mixed-adapter concurrent request traffic |
| Tensor parallelism | Single-GPU memory ceiling | Large base models requiring multi-GPU |
These mechanisms matter when you have many adapters active simultaneously with mixed request traffic. They add no value — and add substantial complexity — for single-adapter or low-concurrency deployments.
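A toy scheduler makes the serialization penalty concrete. This is an illustrative model, not S-LoRA's actual scheduler: it only counts scheduler steps, assuming every request costs one step and batches have fixed capacity:

```python
import math
from collections import Counter

def batch_steps(adapter_ids, batch_size, heterogeneous):
    """Count scheduler steps needed to drain a queue of requests.

    adapter_ids: one adapter id per pending request.
    heterogeneous: True lets any mix of adapters share a batch;
    False forces each batch to serve a single adapter.
    """
    if heterogeneous:
        return math.ceil(len(adapter_ids) / batch_size)
    counts = Counter(adapter_ids)
    return sum(math.ceil(n / batch_size) for n in counts.values())

# 32 pending requests spread evenly across 8 adapters, batch capacity 16:
mixed = [f"adapter-{i % 8}" for i in range(32)]
print(batch_steps(mixed, 16, heterogeneous=True))   # 2 steps
print(batch_steps(mixed, 16, heterogeneous=False))  # 8 steps, each batch 4/16 full
```

The homogeneous case wastes most of each batch's capacity — the gap widens as adapter count grows and per-adapter traffic thins, which is the regime S-LoRA and LoRAX target.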
Where S-LoRA is a bad fit on a constrained GPU budget
S-LoRA is the most systems-heavy option in this comparison. Its design assumes CUDA-heavy custom serving infrastructure that diverges from standard HuggingFace or vLLM deployment patterns. On a single 24GB consumer GPU running one domain-specific adapter, the memory management sophistication of unified paging provides no meaningful benefit — base model VRAM plus a single adapter fits comfortably, and batching is homogeneous by definition.
| Constraint | S-LoRA behavior | Implication |
|---|---|---|
| Deployment complexity | High — custom CUDA runtime | Requires systems engineering investment |
| Single-adapter workloads | Over-engineered | Simpler stacks dominate on ease-of-use |
| Constrained VRAM (24GB) | All stacks face the same base-model floor | Complexity cost is not recovered |
| Research-to-production path | Not turnkey | Expect non-trivial integration work |
Watch Out: If your primary goal is straightforward domain fine-tuning served from a single GPU, S-LoRA's deployment overhead will cost more engineering time than its throughput gains recover. Use vLLM PEFT or LoRAX instead.
LoRAX: production multi-LoRA serving with dynamic loading
Yes, LoRAX supports dynamic loading of adapters — it is the defining feature of the stack. The README documents it explicitly: "Dynamic Adapter Loading: include any fine-tuned LoRA adapter from HuggingFace, Predibase, or any filesystem in your request, it will be loaded just-in-time without blocking concurrent requests." This just-in-time loading model means you do not need to pre-register or pre-load adapters at server startup, which directly enables SaaS and multi-tenant deployment patterns where the adapter set is dynamic and per-customer.
LoRAX also ships a production tooling stack that S-LoRA does not: "Ready for Production: prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry." The same stack supports "multi-turn chat conversations combined with dynamic adapter loading through an OpenAI compatible API."
| LoRAX production feature | Status |
|---|---|
| Dynamic JIT adapter loading | Yes — per-request, non-blocking |
| OpenAI-compatible API | Yes |
| Docker images (prebuilt) | Yes |
| Helm charts for Kubernetes | Yes |
| Prometheus metrics | Yes |
| Distributed tracing (OpenTelemetry) | Yes |
| Multi-turn chat with adapter switching | Yes |
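The just-in-time flow can be sketched end to end. The container invocation and `/generate` payload shape follow the LoRAX README; the base model and adapter id below are placeholders:

```shell
# Launch LoRAX with a base model only — adapters are NOT pre-registered
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/predibase/lorax:main \
  --model-id mistralai/Mistral-7B-Instruct-v0.1

# Each request names its adapter; LoRAX loads it just-in-time
# without blocking concurrent requests for other adapters
curl http://127.0.0.1:8080/generate \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "[INST] Summarize this support ticket [/INST]",
    "parameters": {"adapter_id": "some-org/some-adapter", "max_new_tokens": 64}
  }'
```

The adapter id can point at HuggingFace Hub, Predibase, or a local filesystem path — nothing about the adapter set is fixed at server startup.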
Why LoRAX is attractive for multi-tenant deployments
LoRAX's heterogeneous continuous batching packs requests for different adapters into the same batch, which the docs claim keeps "latency and throughput nearly constant with the number of concurrent adapters." Per-request tenant isolation means each customer's adapter is scoped to their request — a necessary property for B2B SaaS products where adapter leakage between tenants is a correctness problem, not just a performance one.
| Multi-tenant requirement | LoRAX behavior |
|---|---|
| Adapter isolation per tenant | Per-request tenant isolation |
| Adapter source flexibility | HuggingFace Hub, Predibase, or any filesystem path |
| Concurrent adapter batching | Heterogeneous continuous batching |
| Latency scaling with adapter count | Near-constant (per product claims) |
| API compatibility for existing clients | OpenAI-compatible |
This profile maps directly to customer-specific fine-tuned models — legal writing assistants per firm, support tone adapters per brand, code-completion adapters per internal language — where adapter count grows with customer count.
Hardware and runtime prerequisites you cannot ignore
LoRAX's official docs list non-negotiable prerequisites: an NVIDIA GPU in the Ampere generation or newer, CUDA 11.8-compatible device drivers or above, Linux OS, and Docker. The Docker docs add: "To use GPUs, you need to install the NVIDIA Container Toolkit." These are baseline requirements for the runtime to start — actual VRAM needs scale with base model size plus the adapter footprint of however many adapters you hold hot in memory.
| Prerequisite | LoRAX requirement | S-LoRA | vLLM PEFT |
|---|---|---|---|
| GPU generation | NVIDIA Ampere or newer | Ampere+ recommended | Ampere+ recommended |
| CUDA driver version | 11.8-compatible or above | CUDA-heavy custom stack | CUDA 11.8+ typical |
| Container runtime | Docker + NVIDIA Container Toolkit | Not containerized turnkey | Not required |
| OS | Linux | Linux | Linux |
| VRAM floor | Base model + adapter overhead | Base model + adapter overhead | Base model + adapter overhead |
Watch Out: Consumer-grade GPUs (Ampere-generation RTX 30xx, and the newer Ada-generation RTX 40xx) satisfy the architecture requirement but may not satisfy the VRAM floor for larger base models. A 7B model at fp16 consumes roughly 14GB before adapters; a 13B model roughly 26GB. Budget accordingly.
vLLM PEFT: the pragmatic adapter path inside vLLM
vLLM's native LoRA support is not a thin wrapper over HuggingFace PEFT — it is integrated into vLLM's continuous batching and PagedAttention runtime. The question "is vLLM better than PEFT for LoRA serving?" resolves in vLLM's favor for throughput-sensitive workloads: the S-LoRA paper places both HuggingFace PEFT and "vLLM with naive support of LoRA serving" in the baseline category, implying vLLM already outperforms raw PEFT in the multi-request regime before any specialized adapter system is added. vLLM's docs describe adapter serving as per-request with minimal overhead — the integration point is the SupportsLoRA model interface, which most major architectures implement.
| Comparison axis | vLLM PEFT | HuggingFace PEFT (standalone) |
|---|---|---|
| Batching model | Continuous batching (PagedAttention) | Request-by-request, no batching optimization |
| Throughput | Higher (batching) | Lower (no multi-request optimization) |
| Adapter serving overhead | Minimal (per vLLM docs) | Higher at scale |
| Integration complexity | Native to vLLM | Separate library and serving setup |
When official vLLM adapter support is enough
vLLM PEFT is sufficient when your adapter count is small (single digits to low tens), your team already operates vLLM in production, and adapter hot-loading is not a requirement. The per-request adapter model means you declare which adapters the server can serve at startup; requests then specify which adapter to use. This covers domain-specific serving for a bounded set of use cases without adding a new runtime layer.
| Workload signal | vLLM PEFT fit |
|---|---|
| Team already runs vLLM | Strong fit — zero additional ops surface |
| Adapter count < ~10 | Strong fit — no need for specialized scheduler |
| Static adapter set (no hot-swap) | Strong fit |
| Single-tenant deployment | Strong fit |
| Adapter set changes at runtime | Poor fit — requires restart or external adapter loading |
When vLLM PEFT starts to fall behind dedicated multi-adapter servers
The S-LoRA paper explicitly uses vLLM's LoRA support as a baseline, reporting the 4× throughput and orders-of-magnitude adapter-scale improvements against it. That positions vLLM's native adapter support as the performance floor, not the ceiling, for multi-adapter serving. LoRAX adds multi-tenant isolation, dynamic loading, and production observability that vLLM's docs do not expose in the same form.
| Limitation signal | vLLM PEFT behavior | Alternative |
|---|---|---|
| Hundreds of concurrent adapters | No specialized scheduler | S-LoRA or LoRAX |
| Dynamic adapter set (customer-driven) | No hot-swap documented | LoRAX |
| Multi-tenant isolation per request | Not explicitly documented | LoRAX |
| Throughput at high adapter concurrency | Baseline vs. S-LoRA's claims | S-LoRA |
| Production observability (Prometheus, OTel) | vLLM's own metrics only | LoRAX |
The key threshold: once adapter heterogeneity, tenant count, or throughput requirements push past what vLLM's per-request model handles comfortably, moving to LoRAX or S-LoRA recovers capability that vLLM PEFT is not designed to provide.
Benchmarks and workload-fit matrix
The available benchmark data is fragmented across a research paper, a product README, and runtime documentation — there is no single head-to-head benchmark suite covering all three stacks under identical conditions. The numbers below are source claims drawn from each project's own reporting; treat them as directional indicators with different methodological provenance, not apples-to-apples measured parity.
| Stack | Throughput claim | Adapter scale claim | Baseline compared against |
|---|---|---|---|
| S-LoRA | Up to 4× improvement | Several orders of magnitude more adapters | HuggingFace PEFT + naive vLLM LoRA |
| LoRAX | Near-constant throughput / latency with concurrent adapters | Thousands of models on a single GPU | Internal product benchmarks (not published head-to-head) |
| vLLM PEFT | Minimal overhead per-request | Unspecified ceiling | Not benchmarked against the other two by vLLM |
Throughput and adapter-scale claims from the source material
S-LoRA's headline numbers — up to 4× throughput improvement and several orders of magnitude more served adapters — are the strongest quantified claims in this comparison. They justify the extra systems complexity when your workload actually operates at hundreds-to-thousands of concurrent adapters. Below that threshold, the complexity cost is not recovered by measurable throughput gain.
| S-LoRA claim | vs. which baseline | Applicability |
|---|---|---|
| Up to 4× throughput | HuggingFace PEFT + naive vLLM LoRA | High-concurrency, many-adapter workloads |
| Orders-of-magnitude more adapters | Same baselines | Systems with large, dynamic adapter catalogs |
| Unified paging benefit | High adapter count only | Marginal at low adapter concurrency |
No absolute requests-per-second figure appears in the source material — only the relative gain. Engineers evaluating S-LoRA for production should run it against their own traffic shape to translate the relative claim into an absolute one.
A decision matrix for legal, support, code, and hobby workloads
| Workload type | Adapter count | Tenants | Recommended stack | Rationale |
|---|---|---|---|---|
| Legal writing (per-firm adapters) | 10–1000+ | Many | LoRAX | Dynamic loading, tenant isolation, production ops |
| Customer support (per-brand tone) | 10–1000+ | Many | LoRAX | Same multi-tenant profile |
| Internal code completion | 1–10 | Single team | vLLM PEFT | Small static adapter set; simplicity wins |
| Research / hobby (single 24GB GPU) | 1–3 | Single user | vLLM PEFT | No multi-tenant need; minimal overhead |
| High-scale API with 100+ adapters | 100–10,000 | Many | S-LoRA or LoRAX | Throughput and scale specialization required |
Pro Tip: VRAM math always comes first. For any stack, the base model at fp16 consumes roughly 2 bytes × parameter count. A 7B model needs ~14GB; a 13B needs ~26GB. Each LoRA adapter at rank 16 for a 7B model typically adds tens to low hundreds of megabytes — trivial per adapter, but it accumulates when holding many hot in memory simultaneously.
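That arithmetic is easy to script. A rough sizing helper — the layer count and projection shapes in the example assume a Llama-7B-class architecture adapting only the q and v projections, so treat the adapter figure as an order-of-magnitude estimate:

```python
def fp16_model_gb(params_billions: float) -> float:
    # fp16 stores 2 bytes per parameter; GB here means 1e9 bytes
    return params_billions * 2.0

def lora_adapter_mb(n_layers: int, shapes, rank: int = 16) -> float:
    # Each adapted weight W (d_out x d_in) gains two low-rank factors:
    # A (rank x d_in) and B (d_out x rank)
    params = n_layers * sum(rank * (d_out + d_in) for d_out, d_in in shapes)
    return params * 2 / 1e6  # fp16 bytes -> MB

print(fp16_model_gb(7))    # 14.0 GB base model
print(fp16_model_gb(13))   # 26.0 GB base model
# 32 layers, q_proj and v_proj at 4096x4096, rank 16:
print(round(lora_adapter_mb(32, [(4096, 4096), (4096, 4096)])))  # ~17 MB per adapter
```

On a 24GB card the 7B base leaves roughly 10GB for everything else, so hundreds of rank-16 adapters fit by weight alone — in practice the KV cache, not adapter storage, tends to become the binding constraint.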
When to choose each stack
The choice between S-LoRA, LoRAX, and vLLM PEFT resolves along three axes: scale requirements, operational maturity, and deployment constraints. No stack wins across all three simultaneously.
| Stack | Best fit threshold | Choose this when | Avoid this when |
|---|---|---|---|
| S-LoRA | Hundreds to thousands of concurrent adapters | Throughput efficiency is the primary cost lever | You need turnkey deployment or small-scale serving |
| LoRAX | Dynamic multi-tenant adapter catalogs | You need just-in-time loading and production ops | You already have a stable low-adapter vLLM deployment |
| vLLM PEFT | Single digits to low tens of static adapters | Simplicity and zero new runtime surface matter most | Adapter sets change frequently or tenant count grows fast |
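The three axes can be collapsed into a small routing helper. This is a sketch of the decision matrix above, not anyone's official guidance — the thresholds and signal names are illustrative:

```python
def choose_stack(adapter_count: int,
                 dynamic_adapters: bool,
                 multi_tenant: bool,
                 can_run_research_stack: bool) -> str:
    """Map workload signals to a serving stack, per the matrix above."""
    # Small, static, single-tenant catalogs: stay inside vLLM
    if adapter_count <= 20 and not dynamic_adapters and not multi_tenant:
        return "vLLM PEFT"
    # Dynamic or multi-tenant catalogs need LoRAX's JIT loading and isolation;
    # at very high scale with systems capacity, S-LoRA is also defensible
    if dynamic_adapters or multi_tenant:
        if adapter_count >= 100 and can_run_research_stack:
            return "S-LoRA or LoRAX"
        return "LoRAX"
    # Large static catalogs: throughput specialization pays off
    if adapter_count >= 100 and can_run_research_stack:
        return "S-LoRA"
    return "LoRAX"

print(choose_stack(3, False, False, False))    # vLLM PEFT
print(choose_stack(500, True, True, False))    # LoRAX
print(choose_stack(1000, False, False, True))  # S-LoRA
```

The helper encodes the same tie-breaker as the prose: once dynamic loading or tenant isolation enters the picture, LoRAX is the default, and S-LoRA only re-enters when scale and team capability both clear their thresholds.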
Choose S-LoRA when adapter count and throughput dominate
Choose S-LoRA when: your production workload serves hundreds to thousands of concurrent LoRA adapters, throughput efficiency is the primary cost lever, and your team can invest in custom CUDA-heavy serving infrastructure.
| Selection criterion | S-LoRA threshold |
|---|---|
| Concurrent adapter count | Hundreds to thousands |
| Throughput priority | Primary — 4× claim vs. baselines |
| Team systems capability | Able to operate research-grade serving stack |
| Deployment environment | Custom infrastructure (not simple Docker + Helm) |
| Use case | Large-scale API serving with many model variants |
S-LoRA is not appropriate for teams that need a turnkey deployment path, teams operating below the adapter-count threshold where unified paging provides measurable benefit, or teams with a constrained GPU budget and a single-adapter use case.
Choose LoRAX when production ops and adapter loading matter most
Choose LoRAX when: dynamic adapter loading at request time, multi-tenant isolation, and production observability are requirements — and you can meet the Ampere GPU + Docker + NVIDIA Container Toolkit prerequisites.
| Selection criterion | LoRAX threshold |
|---|---|
| Dynamic adapter set | Yes — adapters change at runtime per customer |
| Multi-tenant isolation | Yes — per-request adapter scoping required |
| Production ops maturity | Docker and Kubernetes environment available |
| Observability requirements | Prometheus + OpenTelemetry needed |
| Adapter source | HuggingFace Hub, Predibase, or filesystem paths |
LoRAX's "Dynamic Adapter Loading" — just-in-time, non-blocking, per-request — makes it the correct choice for any SaaS or multi-tenant product where the adapter catalog grows with the customer list and adapters cannot be pre-registered at server startup.
Choose vLLM PEFT when simplicity beats specialization
Choose vLLM PEFT when: your team already operates vLLM, adapter count is small and static, and adding a specialized adapter serving layer would create more operational surface than it eliminates.
| Selection criterion | vLLM PEFT threshold |
|---|---|
| Existing vLLM deployment | Yes — already in production |
| Adapter count | Small (single digits to low tens) |
| Adapter set changes | Static or infrequent |
| Tenant model | Single-tenant or small internal team |
| Additional runtime tolerance | None — avoid new deps |
The simplicity argument is real: zero new runtime dependencies, zero new deployment patterns, and the existing vLLM operational surface already covers the team's observability and scaling needs. For workloads that do not need thousands of adapters or dynamic loading, the specialization in S-LoRA and LoRAX creates complexity without proportional return.
FAQ
What is the difference between LoRAX and vLLM?
LoRAX is a purpose-built multi-adapter inference server with dynamic just-in-time adapter loading, per-request tenant isolation, heterogeneous continuous batching, and a full production tooling stack (Docker, Helm, Prometheus, OpenTelemetry). vLLM is a general-purpose high-throughput LLM serving runtime whose native LoRA support allows per-request adapter serving with minimal overhead inside the existing vLLM runtime — no dynamic loading at request time, no specialized multi-tenant orchestration.
Is vLLM better than PEFT for LoRA serving?
Yes, for throughput-sensitive multi-request workloads. The S-LoRA paper groups HuggingFace PEFT and vLLM's naive LoRA support as the baseline tier, implying vLLM already outperforms standalone PEFT in the batched request regime. vLLM's continuous batching and PagedAttention provide throughput advantages that standalone PEFT does not have.
How many LoRA adapters can S-LoRA serve?
S-LoRA is designed to serve thousands of concurrent adapters, as stated in the paper title and the LMSYS benchmark. The practical ceiling depends on GPU memory, adapter rank, and base model size — unified paging manages adapter weight memory dynamically, but the GPU VRAM pool is still finite.
Does LoRAX support dynamic loading of adapters?
Yes. LoRAX loads adapters just-in-time per request without blocking concurrent requests. You specify the adapter as the model parameter in the OpenAI-compatible API call; the server loads it from HuggingFace Hub, Predibase, or a local filesystem path.
What hardware do I need for multi-LoRA serving?
For LoRAX: an NVIDIA Ampere-or-newer GPU, CUDA 11.8-compatible drivers or above, Linux, Docker, and the NVIDIA Container Toolkit. For S-LoRA: compatible CUDA hardware with enough VRAM for your base model plus adapter overhead. For vLLM PEFT: any hardware that runs vLLM, typically Ampere or newer for best PagedAttention support. Across all three stacks, VRAM is the gating constraint — base model at fp16 plus adapter weight overhead determines how many adapters you can hold hot simultaneously.
| Question | LoRAX | vLLM PEFT | S-LoRA |
|---|---|---|---|
| Dynamic adapter loading | Yes, JIT per request | No | Limited |
| Min GPU generation | Ampere | Ampere+ recommended | Ampere+ recommended |
| Container toolkit required | Yes | No | No |
| Multi-tenant isolation | Yes | Not documented | Not a primary feature |
| Hard adapter-count ceiling stated | No (memory-bound) | No (memory-bound) | No (memory-bound) |
Sources and references
- LMSYS Blog — "Recipe for Serving Thousands of Concurrent LoRA Adapters" — S-LoRA throughput claims, unified paging, heterogeneous batching
- arXiv — S-LoRA paper — Primary research reference for S-LoRA mechanisms and scale target
- predibase/lorax GitHub README — LoRAX feature set: dynamic loading, heterogeneous continuous batching, tenant isolation, API compatibility
- LoRA eXchange docs — LoRAX production feature confirmation: Docker, Helm, Prometheus, OpenTelemetry
- LoRAX development environment prerequisites — Hardware and driver requirements: Ampere+, CUDA 11.8+, Linux, Docker
- LoRAX Docker getting started — NVIDIA Container Toolkit requirement
- vLLM LoRA documentation — vLLM native adapter support: SupportsLoRA interface, per-request minimal-overhead serving