You know how we developers usually get caught up in the fancy world of APIs and model capabilities? Today, we’re diving deep into something brewing at the very bottom of the AI stack. And trust me, it’s juicier than your morning standup meeting!
As software developers increasingly integrate AI into their workflows, understanding the complete AI value chain—from hardware to application—becomes essential.
While most developers focus on APIs and model capabilities, the real breakthroughs often begin at the foundation: hardware infrastructure.
This article explores a revolutionary shift in AI hardware architecture through SambaNova’s SN40L chip and its Composition of Experts (CoE) approach.
This technology allows up to 150 specialized AI models to run simultaneously on a single card. It opens new possibilities in code generation, automated testing, and system optimization.
These hardware innovations at the bottom of the AI stack are set to reshape what’s possible at the application layer, making it crucial for developers to understand how these changes will impact their daily work.
Jevons Paradox & Consuming More AI (& GPUs)
GPGPUs sit at the bottom of the AI supply chain. They are the most critical infrastructure building block for training and fine-tuning LLMs, yet they remain the scarcest resource so far.
Yes, electricity and building enough data centers are other problems that infrastructure companies need to worry about. However, we still don’t have enough GPUs to perform the necessary computing.
The cost per million tokens has indeed been falling rapidly, but the Jevons Paradox is clearly taking effect. The paradox is becoming increasingly relevant in the AI space, particularly for large language models (LLMs) like GPT-4.
As the cost of processing tokens decreases—prices have dropped from $36 per million tokens in March 2023 to as low as $2 in 2024—the overall use of AI technologies is expanding.
This aligns with Jevons Paradox, where increased efficiency leads to higher consumption rather than reducing overall usage.
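To make the paradox concrete, here is a quick back-of-the-envelope calculation. The per-million-token prices are the figures quoted above; the usage volumes are hypothetical placeholders purely for illustration, not real data.

```python
# Back-of-the-envelope illustration of the Jevons Paradox for LLM usage.
# Prices come from the figures above; usage volumes are hypothetical placeholders.

price_2023 = 36.0   # USD per million tokens (March 2023)
price_2024 = 2.0    # USD per million tokens (2024)

tokens_2023 = 50e6       # hypothetical monthly usage: 50M tokens
tokens_2024 = 2_000e6    # hypothetical monthly usage: 2B tokens

spend_2023 = price_2023 * tokens_2023 / 1e6
spend_2024 = price_2024 * tokens_2024 / 1e6

print(f"2023 spend: ${spend_2023:,.0f}")   # -> $1,800
print(f"2024 spend: ${spend_2024:,.0f}")   # -> $4,000
# The price fell ~18x, yet total spend grew: efficiency gains were outpaced by demand.
```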
Major LLM-as-a-service providers will keep lowering the cost per million tokens with their software optimizations. Meanwhile, a new generation of GPUs and AI accelerators is emerging, promising not only further cost reductions but also introducing an entirely new paradigm.
SambaNova Systems is one of them. SambaNova is best known for its Reconfigurable Dataflow Unit (RDU) architecture. Its unique design enables its hardware to outperform traditional GPUs in handling large AI workloads.
The SN40L chip, released in September 2023, is one of SambaNova's most notable innovations. It can handle 5 trillion parameters and supports sequence lengths of up to 256k tokens, offering unprecedented performance for LLMs.
The SambaNova SN40L is a Reconfigurable Dataflow Unit (RDU) specifically engineered for AI workloads, serving as an alternative to traditional GPUs. Its architecture is optimized for tasks like the Composition of Experts (CoE), which involves managing several specialized models in parallel (more on that below).
The SambaNova SN40L employs a three-tier memory system designed to address the memory bottlenecks in AI workloads, particularly when dealing with large models.
This system consists of on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. These provide the high bandwidth, low latency, and storage capacity needed for efficient model execution and model switching in CoE tasks.
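To picture how a hierarchy like this changes what fits on one card, here is a tiny, purely illustrative sketch. The tier names follow the description above, but every capacity and bandwidth number is a placeholder assumption, not an SN40L specification.

```python
from dataclasses import dataclass

# Toy model of the three-tier memory hierarchy described above.
# All capacity/bandwidth values are hypothetical placeholders, not SN40L specs.

@dataclass
class MemoryTier:
    name: str
    capacity_gb: float      # how much model state the tier can hold
    bandwidth_gb_s: float   # how fast weights can be streamed out of it

def placement(model_size_gb: float, tiers: list[MemoryTier]) -> str:
    """Pick the fastest tier that can hold a model's weights."""
    for tier in sorted(tiers, key=lambda t: -t.bandwidth_gb_s):
        if model_size_gb <= tier.capacity_gb:
            return tier.name
    return "does not fit on a single card"

tiers = [
    MemoryTier("on-chip SRAM", 0.5, 10_000),       # placeholder numbers
    MemoryTier("on-package HBM", 64, 1_000),
    MemoryTier("off-package DDR DRAM", 1_500, 100),
]

print(placement(8, tiers))     # a small expert fits in HBM
print(placement(900, tiers))   # the bulk of a large CoE lives in DDR
```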
Central to the SN40L is its streaming dataflow architecture, which goes beyond the conventional operator fusion used by GPUs. Key features include Pattern Memory Units (PMUs) for flexible memory management, Pattern Compute Units (PCUs) for high-density computing, and a Reconfigurable Dataflow Network (RDN) that enables efficient data routing and reduces movement overhead.
These components allow the SambaNova SN40L to fuse numerous operations into a single kernel, providing 2× to 13× speedups over traditional methods. The SN40L is particularly suited for applications demanding high parallelism, offering a robust solution for running complex AI models, which could be revolutionary for AI in software development.
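If operator fusion is new to you, the toy NumPy sketch below shows the idea. It is not SN40L code: the "fused" version just shows the single-pipeline shape a dataflow compiler aims to turn into one kernel, instead of writing each intermediate result back to memory (NumPy itself still evaluates it step by step).

```python
import numpy as np

# Conceptual illustration of operator fusion (not SN40L code).

def unfused_ffn(x, w1, w2):
    h = x @ w1                 # kernel 1: intermediate materialized in memory
    h = np.maximum(h, 0.0)     # kernel 2: full read/write pass for the ReLU
    return h @ w2              # kernel 3: another pass

def fused_ffn(x, w1, w2):
    # The whole expression as one pipeline: what a dataflow compiler targets.
    return np.maximum(x @ w1, 0.0) @ w2

x = np.random.randn(4, 128)
w1 = np.random.randn(128, 512)
w2 = np.random.randn(512, 128)
assert np.allclose(unfused_ffn(x, w1, w2), fused_ffn(x, w1, w2))
```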

The SambaNova Reconfigurable Dataflow Architecture (RDA) creates custom processing pipelines that allow data to flow through the complete computation graph. This minimizes data movement and results in extremely high hardware utilization.
On the smallest 8B model, SambaNova outperforms the best GPU offering by 5.2×. What's interesting is that it can run up to 150 instances of this model concurrently on a single card! What does that mean? CoEs unleashed!

Output tokens per second across providers for Llama 3.1 70B Instruct, from artificialanalysis.ai at the time of this newsletter.
A New Parallel LLM Approach – Composition of Experts (CoE)
The Composition of Experts (CoE) is a modular approach for building AI systems that contrasts with monolithic models like GPT-4. It involves using multiple smaller, specialized expert models, each focused on a particular task or domain.
How CoE Works

CoE Pipeline from prompt to completion.
CoE systems are composed of several main components:
- Expert Models: Each model specializes in a specific task, allowing for heterogeneous architectures and parameter counts.
- Router Model: Directs incoming tasks to the most relevant expert model.
- Pipeline: Manages the flow of tasks from routing to execution, optimizing resource use.
Key benefits of CoE include:
- Lower Cost and Complexity: Managing smaller models is more affordable and resource-efficient.
- Enhanced Accuracy: Fine-tuning specialized models for specific tasks allows them to outperform larger models in those areas.
- Modularity and Flexibility: CoE enables the independent development and updating of models, adding agility and adaptability.
- Scalability and Efficiency: CoE supports executing many parallel small models, optimizing performance and scalability.
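Here is a minimal, hypothetical sketch of that prompt-to-completion pipeline. The expert names and the keyword-based router are stand-ins; a real CoE deployment would use fine-tuned specialist models and a small routing model.

```python
from typing import Callable

# Minimal sketch of the CoE pipeline: route a prompt, pick a specialist, execute.
# Experts and the keyword router are hypothetical stand-ins.

Expert = Callable[[str], str]

EXPERTS: dict[str, Expert] = {
    "code":  lambda prompt: f"[code expert] completion for: {prompt}",
    "sql":   lambda prompt: f"[sql expert] completion for: {prompt}",
    "legal": lambda prompt: f"[legal expert] completion for: {prompt}",
}

def route(prompt: str) -> str:
    """Toy router: picks the specialist that should handle the prompt."""
    if "SELECT" in prompt or "JOIN" in prompt:
        return "sql"
    if "def " in prompt or "class " in prompt:
        return "code"
    return "legal"

def complete(prompt: str) -> str:
    expert_name = route(prompt)      # 1. routing
    expert = EXPERTS[expert_name]    # 2. pick the specialist
    return expert(prompt)            # 3. execution

print(complete("def fibonacci(n):"))
```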
Why didn’t CoE take off until now?
Low Operational Intensity and Fusion Limitations.
Smaller expert models inherent to CoE often exhibit lower operational intensity; that is, the ratio of compute operations to memory accesses is less favorable. This characteristic arises because smaller models require relatively more data movement for a given amount of computation.
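A rough way to see what "operational intensity" means is to count FLOPs per byte moved. The sketch below uses the standard estimate for a matrix multiply; the shapes and the 16-bit weight assumption are illustrative, not tied to any particular model or chip.

```python
# Rough arithmetic-intensity estimate (FLOPs per byte moved) for a matrix multiply,
# the workhorse of LLM inference. Shapes and data types are illustrative.

def gemm_intensity(m: int, k: int, n: int, bytes_per_el: int = 2) -> float:
    flops = 2 * m * k * n                                  # one multiply + one add per pair
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)   # read A and B, write C (ideal single pass)
    return flops / bytes_moved

print(gemm_intensity(1, 4096, 4096))    # token-by-token decode: ~1 FLOP/byte, memory bound
print(gemm_intensity(256, 4096, 4096))  # large batched matmul: ~228 FLOPs/byte, compute bound
```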
Traditional GPUs face challenges executing such models efficiently because they lack the capability for effective operator fusion. This technique combines multiple operations into a single kernel, boosting operational intensity and reducing data movement overhead.
- GPUs often face restrictions in fusing operators with complex access patterns involving shuffles and transposes, common in smaller models.
- The rigid memory hierarchy and programming model of GPUs can create data movement bottlenecks during fusion.
- Limited on-chip SRAM capacity necessitates storing intermediate results in off-chip memory, further hindering fusion opportunities.
CoE involves managing and switching between a multitude of expert models. This poses challenges in terms of:
- Hosting a large number of models: Conventional hardware with limited high-bandwidth memory (HBM) capacity struggles to simultaneously accommodate all expert weights.
- Dynamically switching between models: Transferring model weights between the main memory and the accelerator when switching experts can be slow and inefficient.
These challenges necessitate either (1) Employing multiple machines to provide sufficient memory capacity, increasing cost and complexity, or (2) Utilizing slower host memory, leading to increased switching latency and reduced performance.
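A quick back-of-the-envelope calculation shows the scale of the problem. The roughly one trillion combined parameters comes from the CoE scale discussed in this article; the bytes per parameter and per-GPU HBM capacity are illustrative assumptions.

```python
# Why hosting many experts strains HBM-only designs (illustrative assumptions).

combined_params = 1e12    # ~1 trillion parameters across all experts
bytes_per_param = 2       # assuming 16-bit weights
weights_tb = combined_params * bytes_per_param / 1e12

hbm_per_gpu_gb = 80       # illustrative HBM capacity of a high-end GPU
gpus_needed = weights_tb * 1000 / hbm_per_gpu_gb

print(f"Combined expert weights: ~{weights_tb:.0f} TB")                    # ~2 TB
print(f"GPUs needed just to keep the weights in HBM: ~{gpus_needed:.0f}")  # ~25
# A deep DDR tier lets one card keep every expert resident and stream the active
# one into faster memory on demand, instead of sharding weights across machines.
```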
The new breed of GPUs and accelerators betting on CoE, such as SambaNova's, is addressing these challenges with specialized solutions such as:
- Deeper Memory Hierarchy: The SambaNova SN40L, for example, uses a three-tier memory system to handle expert models and their switching efficiently.
- Operator Fusion: The SambaNova SN40L enables fusion and parallelism, maximizing performance.
- Model Switching: High-bandwidth memory interfaces minimize the performance impact of switching between models.
Such novel architectures enable up to 150 expert models and a trillion parameters of combined LLM models running on a single card!
What does this mean for software development?
I know you have been waiting for this part. But remember, it pays to understand what's behind the technology first.
1. Accelerated Agentic Workflows in Software Development
In software development, agentic workflows refer to tasks that intelligent agents can automate, such as code generation, automated testing, and debugging.
For example, a developer working on a complex web application could rely on AI agents to generate sections of code based on high-level descriptions.
Tools like GitHub Copilot or Cursor already provide this functionality by suggesting code snippets in real time based on user input.
Using the SN40L’s parallelism, these workflows can now run much faster. For example, multiple specialized models could analyze different aspects of the code at once. One model might handle syntax correction, another focus on security vulnerabilities, and a third optimize performance.
This drastically reduces the time spent on iterative tasks, freeing the developer to focus on higher-level problem-solving. The SN40L’s streaming dataflow ensures that these models work concurrently without causing performance bottlenecks.
This is especially the case during critical workflows such as continuous integration (CI) and deployment (CD) pipelines.
Let’s take a continuous integration environment as an example. Here, AI agents can quickly analyze pull requests for potential issues, validate unit tests, and suggest improvements across multiple services simultaneously.
Without the performance benefits of the SN40L, this type of task might cause delays due to context-switching between different models or processing stages.
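As a sketch of what that fan-out could look like, here is a hypothetical CI step that sends one pull-request diff to several specialized reviewers concurrently. The expert names and the `call_expert` stub are invented for illustration; in practice each call would hit a model served on the accelerator.

```python
import asyncio

# Hypothetical CI step: fan one pull-request diff out to several specialists at once.

EXPERTS = ["syntax-fixer", "security-auditor", "performance-tuner", "test-validator"]

async def call_expert(expert: str, diff: str) -> str:
    await asyncio.sleep(0.1)   # stand-in for inference latency
    return f"{expert}: reviewed {len(diff)} characters of diff"

async def review_pull_request(diff: str) -> list[str]:
    # All experts run concurrently, so total latency is roughly that of the
    # slowest expert rather than the sum of all of them.
    return await asyncio.gather(*(call_expert(e, diff) for e in EXPERTS))

if __name__ == "__main__":
    findings = asyncio.run(review_pull_request("def handler(request): ..."))
    print("\n".join(findings))
```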
2. Unprecedented Parallelism for Solving Parallelizable Problems
In scenarios like optimizing parallel algorithms or modifying microservices architecture, SN40L’s ability to run multiple models in parallel provides significant advantages.
For example, in a large microservices architecture, making changes to multiple microservices often requires understanding the dependencies between them.
AI agents can analyze and update these services in parallel, identifying areas for improvement and suggesting changes. This could include optimizing API request handling, database queries, or load-balancing configurations.
Imagine a situation where a developer is trying to improve the efficiency of a parallel algorithm processing large datasets (such as map-reduce operations). Typically, the bottleneck occurs when analyzing how different algorithm parts interact with the dataset.
With the SN40L, AI models can simultaneously assess the computational complexity of different algorithm sections, optimize memory usage, and even re-architect the data flow without waiting for one analysis to finish before starting another.
This allows developers to optimize quickly, especially in high-performance computing (HPC) environments where micro-optimizations are critical.
In a DevOps context, parallelism can also be applied to automate the deployment and scaling of multiple microservices. AI models can predict potential failures during deployment, suggest fixes, and even apply those fixes in parallel. This minimizes downtime and ensures that updates can be rolled out quickly and with fewer errors.
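Here is a minimal sketch of that parallel fan-out across services, with a hypothetical `analyze_service` standing in for an AI agent call. The point is that independent services are assessed side by side rather than one after another.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of parallel microservice analysis; analyze_service is a hypothetical
# stand-in for a call to a specialized model.

SERVICES = ["auth", "billing", "search", "notifications"]

def analyze_service(name: str) -> str:
    # In practice: send the service's code and config to a specialized model and
    # collect suggestions (API handling, query tuning, load-balancer settings).
    return f"{name}: no blocking issues, 2 optimization suggestions"

with ThreadPoolExecutor(max_workers=len(SERVICES)) as pool:
    for report in pool.map(analyze_service, SERVICES):
        print(report)
```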
3. Overcoming Benchmark Limitations in Real-world Software Development
In AI-assisted software development, benchmarks traditionally measure performance through fixed tests like code generation or algorithm efficiency. However, these benchmarks often fail to capture real-world development workflows’ iterative, unpredictable nature.
When developers use AI for debugging, code optimization, or software refactoring, they often work in cycles of experimentation and revision. Feedback from the AI must be seamlessly integrated into their broader development workflow.
This iterative style of real-world software development is often far more chaotic and less structured than what standard benchmarks reflect.
The SN40L’s Composition of Experts (CoE) architecture is designed to address this gap by offering dynamic model switching and high-parallelism features that better support iterative workflows.
For example, when debugging a large codebase, developers often need to test multiple potential fixes across different code segments in parallel.
Traditional benchmarks might evaluate the AI’s ability to debug isolated code snippets but fail to account for the need to switch between multiple contexts (e.g., different code modules or dependencies) seamlessly.
SN40L’s architecture overcomes this by allowing many specialized AI models to be run simultaneously, with minimal switching costs, enabling developers to interact with the system fluidly, as they would in real-world conditions.
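To make the idea tangible, here is a toy sketch of a debugging loop where each candidate fix is checked by whichever resident specialist matches its code context. Everything in it is hypothetical; the point is that switching experts becomes a lookup rather than a model reload.

```python
# Toy debugging loop: each candidate fix is routed to a resident specialist by
# code module. All names and checks here are hypothetical.

CANDIDATE_FIXES = [
    ("db/queries.py",  "add an index hint to the slow query"),
    ("api/routes.py",  "validate the payload before deserializing"),
    ("core/cache.py",  "evict stale entries on write"),
]

RESIDENT_EXPERTS = {
    "db":   lambda fix: f"sql expert: '{fix}' looks safe",
    "api":  lambda fix: f"security expert: '{fix}' closes the injection path",
    "core": lambda fix: f"performance expert: '{fix}' should reduce p99 latency",
}

for path, fix in CANDIDATE_FIXES:
    context = path.split("/")[0]             # choose an expert by code module
    print(RESIDENT_EXPERTS[context](fix))    # no weight reload between contexts
```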
Final Thoughts
As AI technology evolves, so does the infrastructure that powers it. SambaNova’s SN40L and the Composition of Experts model signal a new era for developers, where specialized tasks are managed with unmatched efficiency.
By understanding the foundational shifts in AI hardware, developers can anticipate and harness these advancements to optimize workflows, from code generation to real-time debugging.
The future of AI lies not only in the models we build but also in the hardware that enables them, making it essential for developers to stay informed and adapt to these ground-breaking tools.