AxiomLogica
Category

AI & ML

Everything about AI and machine learning: the latest articles and advances in the field.

All articles

Qdrant vs pgvector vs pgvectorscale for billion-vector filtering workloads

On a 50M-vector benchmark, pgvectorscale/Postgres delivered 11.4x higher throughput than Qdrant at 99% recall (471.57 QPS vs 41.47 QPS), while Qdrant kept lower tail latency. The result is workload-dependent, though, and the Tiger Data comparison notes that index build speed and operational trade-offs still matter.

16 min read
MoE++ with zero-computation experts: how the routing and gating residuals work

MoE++ adds zero-computation experts (zero, copy, constant) so tokens can discard, skip, or replace the MoE path, while gating residuals inject the previous layer’s routing signal to stabilize expert selection — but the design only pays off when FFN experts are the real bottleneck and zero-cost experts are deployed locally on every GPU to avoid communication overhead.

21 min read
Should enterprises migrate from naive RAG to modular or GraphRAG architectures?

Naive RAG is fast and cheap for localized FAQ-style queries, but GraphRAG and modular RAG become the better investment when questions require multi-hop reasoning, cross-document relationships, or stronger governance — the catch is that GraphRAG adds ontology/graph-maintenance overhead and can be slower to operate.

24 min read
How to enable FP8 KV cache quantization in vLLM without breaking prefix caching

vLLM’s FP8 KV cache can coexist with prefix caching because automatic prefix cache keys are still block-hash based, but on ROCm/W7900 the combination has a documented crash path, so this guide shows the exact FP8 calibration path and the validation checks that confirm prefix cache reuse still works.

18 min read
Why MultiHop-RAG exposes the limits of naive retrieval in multi-hop question answering

MultiHop-RAG shows that naive top-k retrieval breaks down when answers require chaining evidence across documents — the practical result is markedly weaker multi-hop QA accuracy than graph-augmented approaches, but the benchmark demonstrates failure modes more than it proves a single production architecture is universally superior.

18 min read
How to use vLLM for Mixtral and DeepSeek-V3 serving with expert parallelism

vLLM’s support for Mixtral and DeepSeek-V3 pairs expert parallelism with PagedAttention, continuous batching, and distributed inference so MoE serving can stay memory-efficient — but the deployment path is constrained by model-specific parallelism settings, supported hardware backends, and the need to tune GPU memory utilization and batching for expert-heavy traffic.

18 min read
