Architecting Low-Power Edge AI: Implementing SLMs on Alif Ensemble E-Series MCUs

18 min read · Published Apr 17, 2026, 6:05 PM

The inference workload is moving to silicon that runs on milliwatts. Cloud-bound large language models are architecturally incompatible with always-on IoT endpoints—the bandwidth cost, latency floor, and privacy exposure are non-negotiable disqualifiers. The operative question is no longer whether to run generative AI at the edge, but how to partition silicon resources precisely enough to sustain transformer inference without exceeding a 40mW power budget.

The Alif Ensemble E-Series answers that question with a specific architectural bet: pair dual-core Arm Cortex-M55 with an Ethos-U85 NPU, back it with 9.75MB of tightly coupled on-die SRAM, and eliminate the external memory bottleneck that kills latency in every competing edge-MCU design. This article is the engineering implementation guide for that bet.


The Architectural Shift to Edge-Native Generative AI

SLM deployment on MCUs represents a qualitative departure from classical TinyML. Attention mechanisms, KV-cache management, and autoregressive decoding impose memory access patterns that expose every weakness of external-flash-dependent architectures. The Alif Ensemble E4 SKU sustains SLM execution at approximately 36mW—a figure that becomes meaningful only when you understand what silicon architecture makes it achievable.

For context, benchmark data from Alif confirms that the Ethos-U85 NPU "is able to uplift inference performance and improve power efficiency by two orders of magnitude for popular open source models" (Alif Semiconductor, 2026). A two-orders-of-magnitude efficiency gain over CPU-only MCU inference changes the deployment calculus for battery-constrained endpoints entirely.

The table below compares representative edge platforms against the Ensemble E-Series across the dimensions that matter most for sustained SLM inference:

| Platform         | NPU Architecture | Typical Power (mW) | Peak Inference Latency (ms/token) | On-Die SRAM |
|------------------|------------------|--------------------|-----------------------------------|-------------|
| Alif Ensemble E4 | Ethos-U85        | ~36                | ~45–80                            | 4.5 MB      |
| Alif Ensemble E8 | Ethos-U85        | ~40–60             | ~30–55                            | 9.75 MB     |
| STM32H7 (No NPU) | CPU-only (M7)    | ~180–220           | ~800–1500                         | 1 MB        |
| Nordic nRF9160   | CPU-only (M33)   | ~90–140            | Not viable                        | 256 KB      |
| NXP i.MX RT1170  | CPU-only (M7+M4) | ~200–350           | ~600–1200                         | 2 MB        |

Technical Warning: Latency figures for CPU-only platforms assume quantized INT8 CMSIS-NN kernels. Without NPU offload, transformer attention heads execute serially on the Cortex core, making real-time token generation impractical at MCU clock rates.

The constraint that forces every deployment decision is the 32-bit MCU power envelope, which mandates quantization so that model weights fit into integrated memory. Floating-point weight storage is not an option: the SRAM budget simply does not accommodate it.
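Running the numbers before selecting a model makes the budget concrete. The sketch below is a hypothetical sizing helper, not part of any Alif SDK; it computes the weight footprint for a given parameter count and precision so a candidate model can be checked against the 5.5MB MRAM and 9.75MB SRAM figures up front.

```c
#include <stddef.h>

/* Hypothetical sizing helper (not from any Alif SDK): weight footprint in
 * bytes for param_count parameters stored at bits_per_param precision,
 * rounded up so sub-byte formats (e.g. INT4) account for packing. */
size_t weight_footprint_bytes(size_t param_count, unsigned bits_per_param)
{
    return (param_count * bits_per_param + 7U) / 8U;
}
```

For a 25M-parameter model this yields 25MB at INT8 and 100MB at FP32, which is why floating-point storage is ruled out from the start.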


Deconstructing the Alif Ensemble E-Series Hardware Stack

The Ensemble E8 is a heterogeneous compute system on a single die. Understanding the data path between its components is a prerequisite for any memory partitioning strategy.

graph TD
    subgraph "Alif Ensemble E8 SoC"
        subgraph "High-Performance Subsystem"
            M55_HP["Cortex-M55 HP Core\n(TCM: 1.25MB ITCM + DTCM)"]
            NPU["Ethos-U85 NPU\n(HW Accelerated Inference)"]
        end

        subgraph "High-Efficiency Subsystem"
            M55_HE["Cortex-M55 HE Core\n(TCM: 0.5MB TCM)"]
        end

        subgraph "Memory Subsystem"
            SRAM["Bulk SRAM\n(8MB Shared)"]
            MRAM["Integrated MRAM\n(5.5MB NVM)"]
        end

        subgraph "External Interfaces"
            FLASH["External SPI/OSPI Flash\n(High Latency)"]
        end
    end

    M55_HP -- "Dispatches command stream" --> NPU
    NPU -- "DMA reads/writes operands" --> SRAM
    M55_HP -- "AXI bus: loads weights/activations" --> SRAM
    M55_HE -- "Application logic / sensor fusion" --> SRAM
    SRAM -- "Weight staging from NVM" --> MRAM
    MRAM -.->|"Avoid for hot inference paths"| FLASH

    style NPU fill:#4A90D9,color:#fff
    style SRAM fill:#27AE60,color:#fff
    style FLASH fill:#E74C3C,color:#fff

The Ensemble E8 integrates 9.75MB of total SRAM: 8MB of bulk SRAM accessible via the AXI interconnect, plus 1.25MB (HP core) and 0.5MB (HE core) of Tightly Coupled Memory directly wired to each Cortex-M55. The TCM banks operate at zero-wait-state latency—one clock cycle for load/store operations—making them the highest-value memory for hot code paths and inference dispatch routines.
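One way to exploit the zero-wait-state property is to pin dispatch hot paths into ITCM through a dedicated linker section. This is a sketch under stated assumptions: the section name .itcm_code is ours and must be mapped to the ITCM address range in your BSP linker script, and the stub body is illustrative only.

```c
/* Sketch: place a hot-path routine in the HP core's ITCM via a custom
 * linker section. ".itcm_code" is an assumed section name -- map it to the
 * ITCM region in your linker script so instruction fetches complete in a
 * single cycle. */
__attribute__((section(".itcm_code"), used))
int npu_dispatch_hotpath_stub(void)
{
    /* Command-stream construction would live here; returns 0 as a placeholder. */
    return 0;
}
```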

The Ethos-U85 NPU is tightly coupled with the HP Cortex-M55 subsystem. It does not execute instructions autonomously; the Cortex-M55 HP core constructs and dispatches a command stream to the NPU, which then drives its own DMA engine to fetch operands directly from the shared SRAM. This architecture means that if operand buffers reside in external flash, every NPU fetch stalls on the external memory controller—a latency trap that destroys any throughput advantage.

As Alif's own product documentation states: "The Ensemble E8 series is the specialized fusion processor, capable of delivering more performance per square millimeter than anything in its class." The architectural basis for that claim is the elimination of the external-memory bottleneck through integrated MRAM and the 9.75MB SRAM pool.


Mastering Memory Partitioning: SRAM vs. External Flash

The total 9.75MB SRAM budget is finite and non-negotiable. Application heap, RTOS stack allocations, NPU operand buffers, and KV-cache storage for autoregressive decoding all compete for the same physical resource. The only viable strategy is compile-time static partitioning—runtime malloc-based allocation will cause non-deterministic failures under peak inference load.

The Ensemble series supports up to 5.5MB of integrated MRAM for non-volatile storage. Use MRAM for read-only weight tensors and model constants. Reserve all 9.75MB SRAM for runtime operands, activations, and KV-cache. As Arm's ExecuTorch documentation states: "Intermediate tensors are stored in the SRAM, leveraging its low-latency and high-bandwidth."

The following linker script fragment and C allocation demonstrate compile-time SRAM segmentation on the Ensemble E8. Weight tensors are placed in MRAM; all mutable inference buffers are pinned to specific SRAM regions via custom sections:

/* memory_config.h - Compile-time SRAM partition constants for Ensemble E8 */
/* Total on-die SRAM: 9.75MB = 10,223,616 bytes */

#ifndef MEMORY_CONFIG_H
#define MEMORY_CONFIG_H

/* Bulk SRAM base address for Ensemble E8 - verify against your BSP linker script */
#define SRAM_BASE_ADDR          0x02000000UL
#define SRAM_TOTAL_BYTES        (9984UL * 1024UL)   /* 9.75MB = 10,223,616 bytes */

/*
 * NPU_OPERAND_SIZE: Budget for Ethos-U85 activation/scratch buffers.
 * The Ethos-U85 Vela compiler reports required 'scratch_fast' size;
 * set this to that value plus a 128-byte alignment margin.
 */
#define NPU_OPERAND_SIZE        (4096UL * 1024UL)   /* 4MB for NPU operands */

/*
 * KV_CACHE_SIZE: Autoregressive decoding requires per-token KV storage.
 * For a 25M-parameter SLM at INT8: (2 * layers * heads * head_dim * max_seq_len)
 * Example: 12 layers, 8 heads, 64 head_dim, 128 seq_len = 12*8*64*128*2 = 1,572,864 bytes
 */
#define KV_CACHE_SIZE           (1536UL * 1024UL)   /* 1.5MB for KV-cache */

/*
 * APP_RUNTIME_SIZE: RTOS heap + stacks for application threads.
 * HE core TCM (0.5MB) handles the HE runtime; this budget covers HP subsystem.
 */
#define APP_RUNTIME_SIZE        (3072UL * 1024UL)   /* 3MB for app runtime */

/* Remaining: ~1.25MB maps to Cortex-M55 HP TCM for dispatch hotpath code */

/* Compile-time assertion: partitions must not exceed total SRAM */
_Static_assert(
    (NPU_OPERAND_SIZE + KV_CACHE_SIZE + APP_RUNTIME_SIZE) <= SRAM_TOTAL_BYTES,
    "Memory partition exceeds total SRAM budget"
);

#endif /* MEMORY_CONFIG_H */
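The KV-cache arithmetic in the comment above is worth keeping as a checked function rather than a hand-computed constant. A hypothetical helper (not part of any SDK) that encodes the same formula:

```c
#include <stddef.h>

/* Bytes needed for an INT8 KV-cache: K and V tensors (factor of 2) per
 * layer, heads * head_dim bytes per token, retained for max_seq_len tokens. */
size_t kv_cache_bytes(size_t layers, size_t heads, size_t head_dim,
                      size_t max_seq_len)
{
    return 2U * layers * heads * head_dim * max_seq_len;
}
```

The 12-layer example from the header evaluates to 1,572,864 bytes, which is exactly the 1.5MB KV_CACHE_SIZE budget (1536 × 1024).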
/* tensor_arena.c - Static allocation of NPU operand arena in dedicated SRAM section */
#include "memory_config.h"
#include <stdint.h>

/*
 * Place npu_tensor_arena in a dedicated linker section ".npu_sram".
 * The linker script must map ".npu_sram" to the NPU-accessible SRAM bank.
 * __attribute__((aligned(16))) satisfies Ethos-U85 DMA alignment requirements.
 */
__attribute__((section(".npu_sram"), aligned(16)))
static uint8_t npu_tensor_arena[NPU_OPERAND_SIZE];

/*
 * kv_cache_buffer is placed in the general SRAM region.
 * This buffer is managed by the inference runtime to store K/V projections
 * across autoregressive decoding steps.
 */
__attribute__((section(".sram_data"), aligned(16)))
static uint8_t kv_cache_buffer[KV_CACHE_SIZE];

/* Expose pointers for the inference driver to bind at init time */
uint8_t *get_npu_arena(void)       { return npu_tensor_arena; }
size_t   get_npu_arena_size(void)  { return NPU_OPERAND_SIZE; }
uint8_t *get_kv_cache(void)        { return kv_cache_buffer; }
size_t   get_kv_cache_size(void)   { return KV_CACHE_SIZE; }

Memory Constraint: Running Vela with --optimise Performance reports the required scratch_fast size in its summary output. That value, plus the alignment margin, must be used to set NPU_OPERAND_SIZE. Undersizing this buffer causes the Ethos-U85 driver to silently fall back to slower memory tiers.
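Deriving NPU_OPERAND_SIZE from the Vela report can be automated with a round-up helper. This is a hypothetical function, not a Vela or Ethos-U driver API; align is the DMA alignment margin, e.g. 128 bytes.

```c
#include <stddef.h>

/* Round the Vela-reported scratch_fast size up to a DMA-friendly boundary.
 * align must be a power of two. Hypothetical helper, not a Vela/driver API. */
size_t npu_arena_budget(size_t vela_scratch_fast_bytes, size_t align)
{
    return (vela_scratch_fast_bytes + align - 1U) & ~(align - 1U);
}
```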

Avoiding External Latency Traps

The Ethos-U85 NPU's DMA engine operates on physical addresses. If operand buffers are not in the SRAM address range the NPU's DMA can reach, the driver inserts CPU-mediated copy operations—negating the zero-copy architecture entirely. Eliminating this requires configuring the Ethos-U driver to bind directly to SRAM-resident buffers and triggering DMA from those addresses without intermediate copies.

As Arm's ExecuTorch documentation notes: "The dedicated SRAM acts as a software managed cache, improving performance by pre-fetching frequently accessed tensors." The following pattern implements this pre-fetch trigger using the Ethos-U driver's DMA callback interface:

/* npu_dma_dispatch.c - Zero-copy DMA dispatch for Ethos-U85 on Ensemble E8 */
#include "ethosu_driver.h"       /* Arm Ethos-U driver API */
#include "tensor_arena.h"        /* get_npu_arena() declarations */
#include <string.h>
#include <stdint.h>

/* Driver handle - initialized once at system startup */
static struct ethosu_driver npu_driver_handle;

/*
 * npu_init_zero_copy: Registers the pre-allocated SRAM arena with the
 * Ethos-U85 driver. Passing the SRAM base pointer here ensures the NPU's
 * internal DMA engine maps directly to SRAM, bypassing any MRAM or flash
 * address ranges that would trigger the external memory controller.
 */
int npu_init_zero_copy(void)
{
    void  *arena_ptr  = (void *)get_npu_arena();
    size_t arena_size = get_npu_arena_size();

    /* ethosu_init: base_addr is NPU MMIO; arena args set fast-memory window */
    int ret = ethosu_init(
        &npu_driver_handle,
        (void *)ETHOS_U85_BASE_ADDR,  /* NPU MMIO register base - from BSP */
        arena_ptr,                     /* fast_memory_ptr: must be in SRAM  */
        arena_size,                    /* fast_memory_size                   */
        1U,                            /* secure_enable: 1 = secure operation */
        1U                             /* privilege_enable: 1 for bare-metal/RTOS */
    );

    /* A non-zero return indicates the NPU could not map the fast memory window.
     * Most common cause: arena_ptr is not in NPU-accessible SRAM address range. */
    return ret;
}

/*
 * npu_invoke_zero_copy: Runs inference on a pre-staged command stream.
 * 'cmd_stream' resides in MRAM (read-only, Vela-compiled binary).
 * 'base_addrs' are SRAM pointers for input/output/scratch tensors.
 * No memcpy occurs in this path - the NPU DMA fetches directly from SRAM.
 */
int npu_invoke_zero_copy(
    const uint8_t  *cmd_stream,
    size_t          cmd_stream_size,
    const uint64_t *base_addrs,
    size_t          num_base_addrs)
{
    return ethosu_invoke(
        &npu_driver_handle,
        cmd_stream,
        cmd_stream_size,
        base_addrs,
        num_base_addrs,
        NULL   /* No user-data callback required for synchronous invocation */
    );
}

Pro-Tip: Cache coherency is your silent failure mode. Cortex-M55 has an optional L1 D-cache. If enabled, call SCB_CleanInvalidateDCache_by_Addr() on any buffer the CPU writes before NPU DMA reads it, and invalidate after NPU writes before CPU reads. Skipping this produces non-deterministic inference outputs.
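The CMSIS cache-maintenance calls operate on whole 32-byte cache lines, so the span you pass must first be widened to line boundaries. The bookkeeping can be sketched as below; the helper name is ours, while SCB_CleanInvalidateDCache_by_Addr itself is the standard CMSIS-Core call.

```c
#include <stdint.h>
#include <stddef.h>

#define DCACHE_LINE 32U  /* Cortex-M55 L1 D-cache line size in bytes */

/* Widen [addr, addr + size) to whole cache lines, as the CMSIS
 * SCB_*DCache_by_Addr maintenance functions expect. Hypothetical helper. */
void dcache_span(uintptr_t addr, size_t size,
                 uintptr_t *out_addr, size_t *out_size)
{
    uintptr_t start = addr & ~(uintptr_t)(DCACHE_LINE - 1U);
    uintptr_t end   = (addr + size + DCACHE_LINE - 1U)
                      & ~(uintptr_t)(DCACHE_LINE - 1U);

    *out_addr = start;
    *out_size = (size_t)(end - start);
}
```

On target, run dcache_span over each CPU-written buffer and pass the result to SCB_CleanInvalidateDCache_by_Addr((void *)span_addr, (int32_t)span_size) before dispatching the NPU command stream.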


Deploying Small Language Models on Embedded Constraints

Quantization is not optional on the Ensemble E-Series—it is a hard architectural requirement. The ExecuTorch backend mandates symmetric INT8 quantization for optimized Ethos-U85 execution. Per the PyTorch/ExecuTorch documentation: "Currently, the symmetric int8 config is the main config available to use with the Ethos-U quantizer."

The throughput-vs-precision trade-off for the Ethos-U85 follows directly from the NPU's MAC array width and the memory bandwidth equation:

$$\text{Throughput (OPS/s)} = \frac{\text{MAC\_count} \times 2}{\text{latency (s)}}$$

$$\text{Memory\_BW\_required} = \frac{\text{param\_count} \times \text{bytes\_per\_param}}{\text{inference\_time (s)}}$$

Moving from INT8 to INT16 doubles the memory bandwidth cost, and the Ethos-U85 MAC array throughput halves for 16-bit operands on supported operations. Concretely, for a 25M-parameter SLM:

  • INT8: 25MB weight footprint at near-full MAC utilization; this already exceeds the on-die 5.5MB MRAM and 9.75MB SRAM, so weights must be streamed layer-by-layer into SRAM from non-volatile storage
  • INT16: 50MB weight footprint at ~50% MAC utilization, forcing a full external-flash dependency that is unacceptable for latency targets
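Plugging the 25M-parameter case into the bandwidth equation shows why precision dominates feasibility. A hypothetical helper encoding the equation:

```c
/* Sustained memory bandwidth (bytes/s) needed to stream every weight once
 * per inference, per the bandwidth equation above. Hypothetical helper. */
double required_bandwidth_bps(double param_count, double bytes_per_param,
                              double inference_time_s)
{
    return (param_count * bytes_per_param) / inference_time_s;
}
```

At INT8 with a 50ms/token target, 25M parameters demand roughly 500MB/s of sustained weight bandwidth; INT16 doubles that to about 1GB/s, which only the on-die SRAM path can deliver.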

Quantization must follow TOSA (Tensor Operator Set Architecture) standards for Vela compiler compatibility. Any operator that does not map to a TOSA-compliant representation will not generate an NPU command stream entry—it falls back to Cortex-M55 software execution. Verify TOSA compliance before committing to a quantization scheme: the Vela compiler's summary report explicitly flags operators that cannot be delegated to the NPU.

Optimizing Inference Kernels with CMSIS-NN

CMSIS-NN provides the software acceleration layer for operations that the Ethos-U85 does not hardware-accelerate. CMSIS-NN kernels are specifically optimized for Armv8.1-M architectures, using the Helium SIMD instruction set present on the Cortex-M55 to accelerate fallback layer execution. This matters because only specific transformer operators receive NPU hardware acceleration; unsupported layers fall back to Cortex-M55 execution, incurring measurable performance penalties.

The full pipeline from a trained PyTorch SLM to an NPU binary follows this sequence:

flowchart LR
    A["Trained PyTorch SLM\n(fp32 checkpoint)"] --> B["ExecuTorch Export\nwith Ethos-U Quantizer\n(symmetric INT8 / TOSA)"]
    B --> C["TFLite Flatbuffer\nor TOSA Graph"]
    C --> D["Vela Compiler\n(ethos-u-vela)"]
    D --> E{"Operator\nDelegation Check"}
    E -->|"TOSA-compliant ops"| F["NPU Command Stream\n(.tflite w/ ethosu delegate)"]
    E -->|"Fallback ops"| G["CMSIS-NN Kernels\n(Cortex-M55 / Helium)"]
    F --> H["Ethos-U85 Runtime\non Ensemble E8"]
    G --> H
    H --> I["Inference Output\n(tokens/s)"]

    style D fill:#8E44AD,color:#fff
    style F fill:#4A90D9,color:#fff
    style G fill:#E67E22,color:#fff
    style H fill:#27AE60,color:#fff

The Vela compiler's --accelerator-config ethos-u85-256 flag targets the 256 MAC configuration of the Ethos-U85. Running Vela with --verbose-performance produces per-layer cycle estimates that directly identify which transformer components benefit most from NPU delegation.


Advanced Runtime Management in ZephyrOS

On Zephyr RTOS, NPU inference threads require deterministic scheduling guarantees. Background telemetry, sensor polling, and connectivity stacks compete for the same CPU time as the NPU dispatch thread. The Zephyr scheduler operates as a preemptive priority system; NPU threads must be pinned to cooperative or high-priority preemptive tiers to prevent scheduling gaps that stall the command stream pipeline.

/* inference_thread.c - Zephyr RTOS NPU thread configuration for Ensemble E8 */
#include <zephyr/kernel.h>
#include "npu_dma_dispatch.h"
#include "tensor_arena.h"

/* Stack size: 4KB minimum for NPU driver call stack depth.
 * Increase if using ExecuTorch runner which has deeper call chains. */
#define NPU_THREAD_STACK_SIZE   4096
#define NPU_THREAD_PRIORITY     2    /* Priority 2: below IRQ handlers (0,1),
                                      * above all application logic (>= 5).
                                      * Never set to K_PRIO_COOP range unless
                                      * you need strict non-preemptibility. */

K_THREAD_STACK_DEFINE(npu_stack_area, NPU_THREAD_STACK_SIZE);
static struct k_thread npu_thread_data;

/* Semaphore gates inference jobs: inference_sem is given by the producer
 * (e.g., tokenizer thread) and taken by the NPU dispatch thread.
 * Binary semaphore prevents queuing multiple overlapping inference jobs. */
K_SEM_DEFINE(inference_sem, 0, 1);

/* Shared job descriptor - written by producer, read by NPU thread.
 * Protect writes with a mutex if multiple producers are possible. */
static struct {
    const uint8_t  *cmd_stream;
    size_t          cmd_stream_size;
    const uint64_t *base_addrs;
    size_t          num_base_addrs;
    volatile int    result;
} npu_job;

static void npu_inference_thread_fn(void *p1, void *p2, void *p3)
{
    ARG_UNUSED(p1); ARG_UNUSED(p2); ARG_UNUSED(p3);

    while (1) {
        /* Block here until a job is submitted - no busy-waiting, no CPU waste */
        k_sem_take(&inference_sem, K_FOREVER);

        /* Execute inference synchronously; Ethos-U85 driver blocks until done */
        npu_job.result = npu_invoke_zero_copy(
            npu_job.cmd_stream,
            npu_job.cmd_stream_size,
            npu_job.base_addrs,
            npu_job.num_base_addrs
        );
    }
}

/* Call once at system startup, after npu_init_zero_copy() succeeds */
void npu_thread_start(void)
{
    k_thread_create(
        &npu_thread_data,
        npu_stack_area,
        K_THREAD_STACK_SIZEOF(npu_stack_area),
        npu_inference_thread_fn,
        NULL, NULL, NULL,
        NPU_THREAD_PRIORITY,
        0,              /* No special thread options */
        K_NO_WAIT       /* Start immediately */
    );
    k_thread_name_set(&npu_thread_data, "npu_inference");
}

Pro-Tip: Set CONFIG_MAIN_THREAD_PRIORITY=7 in your prj.conf and keep all sensor/telemetry threads at numeric priority 8 or above (in Zephyr, a higher number means lower urgency). This ensures the NPU dispatch thread at priority 2 is never preempted by application logic, while still yielding to IRQ-driven drivers.


Bridging the Competitive Gap: Practical Partitioning

Running concurrent application logic alongside NPU inference is where most edge AI designs fail. The Alif Semiconductor claim that they are "the first silicon provider to offer the Arm Ethos-U85 NPU which supports transformer-based ML networks" (EETasia, 2026) is meaningful only if the memory map is partitioned to exploit the dual-subsystem architecture. The HE Cortex-M55 must handle application logic while the HP Cortex-M55 drives the NPU—otherwise, context switches on the HP core stall the inference command stream.

Static partitioning of the 9.75MB SRAM is mandatory. The following memory map schematic shows the required separation between application runtime, NPU operand storage, and KV-cache across the Ensemble E8 address space:

graph TD
    subgraph "Ensemble E8 On-Die Memory Map"
        direction TB

        subgraph "Cortex-M55 HE TCM [0x50000000] — 512KB"
            HE_STACK["HE Core Stack + Heap\n(Sensor fusion, telemetry,\nconnectivity drivers)"]
        end

        subgraph "Cortex-M55 HP TCM [0x20000000] — 1.25MB"
            HP_TCM_CODE["NPU Dispatch Code\n(npu_dma_dispatch, hot paths)\n~256KB"]
            HP_TCM_STACK["HP Core Stack\n(RTOS threads, IRQ stacks)\n~256KB"]
            HP_TCM_KV["KV-Cache — Active Seq Window\n~768KB"]
        end

        subgraph "Bulk SRAM [0x02000000] — 8MB"
            NPU_ARENA["NPU Tensor Arena\n(Activations, scratch_fast)\n4MB  ← Ethos-U85 DMA Only"]
            MODEL_IO["Model I/O Buffers\n(Input tokens, output logits)\n512KB"]
            APP_HEAP["Application Heap\n(ZephyrOS pool, FreeRTOS heap)\n2MB"]
            RESERVED["Reserved / Guard Pages\n1.5MB"]
        end

        subgraph "Integrated MRAM [0x60000000] — 5.5MB"
            WEIGHTS["Quantized Model Weights\n(Read-only, INT8)\n~3.5MB partition (larger models\nstream per-layer)"]
            CMD_STREAM["Vela Command Stream\n(NPU binary, read-only)\n~512KB"]
            FW["Firmware + RTOS Image\n~1.5MB"]
        end
    end

    style NPU_ARENA fill:#4A90D9,color:#fff
    style WEIGHTS fill:#8E44AD,color:#fff
    style APP_HEAP fill:#27AE60,color:#fff
    style HE_STACK fill:#E67E22,color:#fff

The guard regions between the NPU arena and the application heap are not optional. Without them, a stack overflow in any application thread will silently corrupt NPU operand buffers mid-inference, producing logically invalid but syntactically well-formed token outputs—the worst class of embedded AI failure.
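The Cortex-M55 has no MMU, so "guard pages" here are software constructs: regions filled with a known pattern at boot and audited at runtime. A minimal sketch follows; the pattern value and function names are ours.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define GUARD_PATTERN 0xDEADBEEFU

/* Fill a guard region with a known pattern at boot. */
void guard_init(uint32_t *guard, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        guard[i] = GUARD_PATTERN;
    }
}

/* Returns false if any guard word was overwritten -- evidence that a stack
 * or heap overrun crossed into the guard region. Call periodically from a
 * low-priority watchdog thread. */
bool guard_intact(const uint32_t *guard, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        if (guard[i] != GUARD_PATTERN) {
            return false;
        }
    }
    return true;
}
```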

Benchmarking Real-Time Inference Performance

On-device profiling must track cycles per layer to identify compute-bound operations inside the transformer attention path. The Ethos-U85 driver exposes hardware performance counters, accessible via JTAG/SWD through the DK-E8 development kit (Arm's Education Developer Labs cover the underlying low-latency profiling methodology). The following Python script parses the structured profiling log output from a DK-E8 session:

#!/usr/bin/env python3
"""
parse_ethosu_profile.py — Parse Alif DK-E8 Ethos-U85 inference profiling logs.
Usage: python parse_ethosu_profile.py --log inference_profile.txt
The DK-E8 generates per-command-stream-entry cycle counts over UART/SWD.
Log format (CSV): layer_name, npu_cycles, cpu_cycles, memory_bytes_read, memory_bytes_written
"""

import argparse
import csv
import sys
from dataclasses import dataclass, field
from typing import List

NPU_CLOCK_HZ = 400_000_000  # Ensemble E8 NPU clock: 400MHz (verify against BSP config)


@dataclass
class LayerProfile:
    name: str
    npu_cycles: int
    cpu_cycles: int
    mem_read_bytes: int
    mem_write_bytes: int

    @property
    def npu_latency_us(self) -> float:
        """Convert NPU cycle count to microseconds at configured NPU clock rate."""
        return (self.npu_cycles / NPU_CLOCK_HZ) * 1e6

    @property
    def cpu_latency_us(self) -> float:
        """CPU fallback layers run on the Cortex-M55 HP core at up to 400MHz."""
        cpu_clock_hz = 400_000_000
        return (self.cpu_cycles / cpu_clock_hz) * 1e6

    @property
    def is_npu_delegated(self) -> bool:
        """Layers with zero CPU cycles ran entirely on the Ethos-U85."""
        return self.cpu_cycles == 0


def parse_log(filepath: str) -> List[LayerProfile]:
    profiles = []
    with open(filepath, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # Skip header row
        for row in reader:
            if len(row) < 5:
                continue  # Skip malformed lines
            try:
                profiles.append(LayerProfile(
                    name=row[0].strip(),
                    npu_cycles=int(row[1]),
                    cpu_cycles=int(row[2]),
                    mem_read_bytes=int(row[3]),
                    mem_write_bytes=int(row[4]),
                ))
            except ValueError as e:
                print(f"Warning: skipping row {row} — {e}", file=sys.stderr)
    return profiles


def report(profiles: List[LayerProfile]) -> None:
    total_npu_us = sum(p.npu_latency_us for p in profiles)
    total_cpu_us = sum(p.cpu_latency_us for p in profiles)
    total_latency_us = total_npu_us + total_cpu_us

    print(f"\n{'Layer':<40} {'NPU (µs)':>10} {'CPU (µs)':>10} {'Delegated':>10}")
    print("-" * 72)
    for p in sorted(profiles, key=lambda x: x.npu_latency_us + x.cpu_latency_us, reverse=True):
        print(f"{p.name:<40} {p.npu_latency_us:>10.1f} {p.cpu_latency_us:>10.1f} "
              f"{'YES' if p.is_npu_delegated else 'NO (FALLBACK)':>10}")

    print("-" * 72)
    print(f"{'TOTAL INFERENCE LATENCY':<40} {total_npu_us:>10.1f} {total_cpu_us:>10.1f}")
    print(f"\nEnd-to-end latency: {total_latency_us / 1000:.2f} ms")

    # Identify bottlenecks: CPU-fallback layers consuming >10% of total time
    fallbacks = [p for p in profiles if not p.is_npu_delegated and
                 p.cpu_latency_us > (total_latency_us * 0.10)]
    if fallbacks:
        print("\n⚠  High-cost CPU fallback layers detected (>10% of total latency):")
        for p in fallbacks:
            print(f"   - {p.name}: {p.cpu_latency_us:.1f} µs "
                  f"({p.cpu_latency_us/total_latency_us*100:.1f}% of total)")
        print("   Action: Verify TOSA compliance and re-run Vela with --verbose-operators")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Parse Alif DK-E8 Ethos-U85 profiling logs")
    parser.add_argument("--log", required=True, help="Path to profiling log CSV file")
    args = parser.parse_args()
    profiles = parse_log(args.log)
    if not profiles:
        print("Error: no valid profile entries found in log.", file=sys.stderr)
        sys.exit(1)
    report(profiles)

Pro-Tip: Any layer flagged as NO (FALLBACK) in the profiling report is a direct optimization target. Cross-reference the operator name against the TOSA specification and ExecuTorch's Ethos-U delegate documentation to determine whether a graph transformation can make it NPU-delegatable.


Future-Proofing Edge AI Infrastructure

The Ethos-U85's TOSA-compliant execution model is its most strategically important property for long-term deployments. TOSA-compliant graph updates allow newer SLMs to execute on existing E-Series silicon without hardware changes—the NPU command stream format remains stable across model generations. As Alif's communications put it, "these transformer-ready NPUs are the first to run transformer networks in microcontrollers using the Arm Ethos-U85 NPU."

Future firmware optimization work should target the following hardware-level priorities, in order of impact:

  1. NPU Command Stream Compression: Vela's next-generation command stream encoding reduces MRAM read bandwidth for the command stream fetch, directly lowering idle power between inference calls.
  2. Memory Bus Arbitration Policy Tuning: The Ensemble E8's AXI interconnect supports configurable arbitration between HP/HE subsystems and the NPU DMA engine. Profiling-driven arbitration weighting eliminates bus contention during concurrent NPU inference and HE sensor processing.
  3. Weight Streaming from MRAM: For SLMs larger than the available SRAM budget, implement layer-by-layer weight streaming from MRAM into the NPU tensor arena, with explicit double-buffering in the DMA dispatch layer so the next layer's weights load while the current layer executes.
  4. KV-Cache Compression: INT4 quantization of stored KV projections reduces KV-cache memory footprint by 50% versus INT8, extending maximum sequence length within the same SRAM budget. This is not yet in the stable ExecuTorch Ethos-U backend but is on the roadmap.
  5. Cooperative Dual-Core Scheduling: Use the HP core exclusively for NPU dispatch and the HE core for all peripheral I/O. Cross-core communication via shared SRAM message queues adds minimal latency while maintaining full NPU command stream throughput.
  6. MRAM Wear Leveling for Frequent Model Updates: If the deployment scenario requires OTA model updates, implement a dual-bank MRAM write strategy to distribute write cycles across the 5.5MB MRAM array and maintain update atomicity.
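The INT4 KV-cache saving in item 4 comes from nibble packing: two signed 4-bit values per byte. A host-side sketch of the packing arithmetic follows; the on-target layout is an assumption, since the stable ExecuTorch Ethos-U backend does not yet expose INT4 KV storage.

```c
#include <stdint.h>

/* Pack two signed 4-bit values (range -8..7) into one byte, low nibble first. */
uint8_t pack_int4_pair(int8_t lo, int8_t hi)
{
    return (uint8_t)((lo & 0x0F) | ((hi & 0x0F) << 4));
}

/* Recover the low nibble, sign-extending from 4 bits. */
int8_t unpack_int4_lo(uint8_t b)
{
    int8_t v = (int8_t)(b & 0x0F);
    return (v & 0x08) ? (int8_t)(v - 16) : v;
}

/* Recover the high nibble, sign-extending from 4 bits. */
int8_t unpack_int4_hi(uint8_t b)
{
    int8_t v = (int8_t)((b >> 4) & 0x0F);
    return (v & 0x08) ? (int8_t)(v - 16) : v;
}
```

Halving bytes-per-element doubles the sequence length that fits in the same KV-cache partition, at the cost of a de/quantization step on each attention read.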

The Ethos-U85's 4x performance uplift over previous-generation Ethos NPUs is not a ceiling—it is the baseline for an architecture that was explicitly designed to scale with the TOSA operator set as transformer models evolve. The engineering work is in ensuring your memory map, quantization pipeline, and RTOS scheduling policy are structured to exploit every cycle of that headroom.


Keywords: Arm Ethos-U85 NPU, Alif Ensemble E-Series, Cortex-M55, Zephyr RTOS, CMSIS-NN, Tightly Coupled SRAM, Small Language Models (SLMs), Transformer Inference, Memory-mapped I/O, Quantization-aware training, ExecuTorch