Machine learning chips that use analog circuits instead of digital ones have long promised huge energy savings. But in practice they’ve mostly delivered modest savings, and only for modest-sized neural networks. Silicon Valley startup Sageance says it has the technology to bring the promised power savings to tasks suited for massive generative AI models. The startup claims that its systems will be able to run the large language model Llama 2-70B at one-tenth the power of an Nvidia H100 GPU-based system, at one-twentieth the cost and in one-twentieth the space.
“My vision was to create a technology that was very differentiated from what was being done for AI,” says Sageance CEO and founder Vishal Sarin. Even back when the company was founded in 2018, he “realized power consumption would be a key impediment to the mass adoption of AI…. The problem has become many, many orders of magnitude worse as generative AI has caused the models to balloon in size.”
The core power-savings prowess for analog AI comes from two fundamental advantages: It doesn’t have to move data around and it uses some basic physics to do machine learning’s most important math.
That math problem is multiplying vectors and then adding up the result, called multiply and accumulate. Early on, engineers realized that two foundational rules of electrical engineering did the same thing, more or less instantly. Ohm’s Law (voltage multiplied by conductance equals current) does the multiplication if you use the neural network’s “weight” parameters as the conductances. Kirchhoff’s Current Law (the sum of the currents entering and exiting a point is zero) means you can easily add up all those multiplications just by connecting them to the same wire. And finally, in analog AI, the neural network parameters don’t need to be moved from memory to the computing circuits, a step that usually costs more energy than the computing itself, because they are already embedded within the computing circuits.
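To make the physics concrete, here is a minimal numerical sketch of an idealized multiply-and-accumulate on one shared wire. It is an illustration only, not a model of any real chip: the function names and the voltage and conductance values are hypothetical, and real circuits add noise, nonlinearity, and data-converter error that this ignores.

```python
# Idealized analog multiply-accumulate (MAC) on a single output wire.
# All names and numbers are illustrative assumptions, not vendor data.

def analog_mac(voltages, conductances):
    """Return the total current (amps) flowing onto one shared wire."""
    if len(voltages) != len(conductances):
        raise ValueError("need one conductance per input voltage")
    # Ohm's law: each branch current is voltage times conductance (I = V * G).
    branch_currents = [v * g for v, g in zip(voltages, conductances)]
    # Kirchhoff's current law: currents joining the same wire simply add.
    return sum(branch_currents)

def analog_matvec(voltages, conductance_columns):
    """One output wire per column yields a full matrix-vector product."""
    return [analog_mac(voltages, column) for column in conductance_columns]

inputs = [0.5, 1.0, 0.25]   # input activations encoded as volts
weights = [2.0, 4.0, 8.0]   # weights encoded as siemens (unrealistically large)
total_current = analog_mac(inputs, weights)
print(total_current)        # 0.5*2 + 1*4 + 0.25*8
```

The multiplications and the sum involve no explicit arithmetic unit in hardware; in this sketch they are simply the list comprehension and the call to `sum`.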
Sageance uses flash memory cells as the conductance values. The kind of flash cell typically used in data storage is a single transistor that can hold 3 or 4 bits, but Sageance has developed algorithms that let cells embedded in their chips hold 8 bits, which is the key level of precision for LLMs and other so-called transformer models. Storing an 8-bit number in a single transistor instead of the 48 transistors it would take in a typical digital memory cell is an important cost, area, and energy savings, says Sarin, who has been working on storing multiple bits in flash for 30 years.
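The arithmetic behind those figures can be sketched as follows. This is a rough illustration under two stated assumptions: that the digital comparison is a conventional six-transistor SRAM cell per bit (which gives the 48-transistor figure), and that weights are mapped onto cell levels by simple uniform quantization. The company's actual programming algorithms are not described here, so the function names and the weight range are hypothetical.

```python
# Rough arithmetic for storing one 8-bit weight per flash cell.
# Assumptions: 6-transistor SRAM per bit; uniform quantization of weights.

BITS_PER_WEIGHT = 8
LEVELS = 2 ** BITS_PER_WEIGHT        # 256 distinguishable levels per cell
SRAM_TRANSISTORS_PER_BIT = 6         # assumed conventional 6T SRAM cell

def quantize_weight(w, w_min=-1.0, w_max=1.0):
    """Map a real-valued weight to an integer level in 0..LEVELS-1."""
    clipped = min(max(w, w_min), w_max)
    step = (w_max - w_min) / (LEVELS - 1)
    return round((clipped - w_min) / step)

def dequantize_level(level, w_min=-1.0, w_max=1.0):
    """Recover the weight value that a stored level represents."""
    step = (w_max - w_min) / (LEVELS - 1)
    return w_min + level * step

digital_transistors = BITS_PER_WEIGHT * SRAM_TRANSISTORS_PER_BIT
analog_transistors = 1               # one flash transistor holds all 8 bits

level = quantize_weight(0.3)
print(level, dequantize_level(level))
print(digital_transistors, "transistors vs", analog_transistors)
```

Distinguishing 256 levels in one cell, rather than the 8 or 16 needed for 3 or 4 bits, is what makes the single-transistor approach demanding.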
Figure: Digital data is converted to analog voltages [left]. These are effectively multiplied by flash memory cells [blue], summed, and converted back to digital data [bottom]. (Credit: Analog Inference)
Adding to the power savings is that the flash cells are operated in a state called “deep subthreshold.” That is, they are working in a state where they are barely on at all, producing very little current. That…
The post “Analog AI Startup Aims to Lower the Power of Gen AI” by Samuel K. Moore was published on 11/19/2024 by spectrum.ieee.org