Here’s How to Make Deep Learning More Sustainable - IEEE Spectrum



Substituting equations for neural network operations can yield whole new efficiencies

Look under the hood of a car and you’ll see an efficient modular design with each component performing specific actions: The fuel injectors squirt fuel into the cylinders (or intake manifold); the coil packs send high-voltage pulses to the spark plugs, the catalytic converter removes problematic exhaust gases, and so forth. Look under the hood of a large deep neural network, the workhorse of modern AI, and you’ll see something that looks a lot more generic and disorganized: a mathematical model with billions of parameters whose inner workings are often mysterious even to its creators.

But allowing all that staggering complexity has been crucial to the amazing feats deep learning has achieved. Deep neural networks can detect cancer more accurately than doctors do. Such a network has beaten the world champion in the game of Go. And these networks can figure out appropriate tax policies when traditional economic models would be too complex to solve. The downside is that the enormous computational and energy costs needed to use these systems are making AI based on deep learning unsustainable.

But it is possible to lower these costs by changing when and how deep learning is used. Currently, people often use huge neural networks to learn about the world. The many parameters in these networks provide the flexibility needed to discover new patterns in how things work. But, once these underlying patterns have been unearthed, an overwhelming share of the parameters in the model are unnecessary. These extra parameters just impose overhead, which should be avoided whenever possible.

Practitioners already attempt to reduce overhead by “pruning” their networks. That can help reduce the amount of computation (and hence energy) used in operating deep networks because many calculations can be avoided. Pruning techniques can even be interjected into the training process to reduce costs there as well. But recent research that we’ve been involved in has introduced new techniques that have the potential to help shrink the environmental footprint of AI much more dramatically.

Modern deep neural networks were first popularized in 2012, when AlexNet shattered records in image recognition, identifying objects in pictures far more accurately than did other models. As the name implies, deep neural networks were inspired by the neural connections in brains. Just as bigger-brained animals can solve more complex problems, the ever-bigger neural networks that researchers have fashioned can solve increasingly complicated problems—these huge networks have set records in computer vision, natural-language understanding, and other domains.

The problem now is that deep neural networks are growing too fast. Today, deep neural networks for image recognition require almost 100,000 times as many calculations as AlexNet. The sizes of modern neural networks built for natural-language processing are similarly worrisome: Researchers have estimated that training the state-of-the-art language generation model GPT-3 took weeks and cost millions of dollars. It also required 190,000 kilowatt-hours of electricity, producing the same amount of CO2 as driving a car a distance roughly equivalent to a trip to the moon and back!

We want a two-step process: Use the flexibility of deep learning to discover things, and then jettison the unneeded complexity by distilling the insights back into concise, efficient equations.

Spending huge amounts of energy in return for the feats that deep neural networks can provide is a trade-off that people have made many times in the last 10 years. And that’s understandable, since deep learning often finds solutions that are better than those human ingenuity has produced. But the flexibility that makes this possible requires neural networks to be enormous and inefficient.

If we continue to develop ever bigger networks, as researchers are doing now, society will end up with major new sources of greenhouse gases. We’ll also end up with economic monopolies because only a few companies and organizations will be able to afford to create these systems.

One way that specialists could make their neural networks more efficient is to imitate what’s been done since the dawn of science, which is to turn large amounts of data (observations of the physical world) into equations that describe concisely how the world works. One of us (Udrescu) recently did this, as part of a team that worked out an algorithmic approach for iteratively distilling neural networks down to efficient computational nuggets. The result was exemplified in something called AI Feynman, which turns a set of input-output data into succinct equations that summarize how those data are related.
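To make the idea concrete, here’s a toy Python sketch—a deliberately simplified illustration of the principle, not the AI Feynman algorithm itself, which searches a vastly larger space of candidate expressions:

```python
# Toy sketch (not AI Feynman): given input-output samples, try a small library
# of candidate formulas and keep the simplest one that reproduces the data.
import numpy as np

rng = np.random.default_rng(0)
G = 6.674e-11

# Input-output data: two masses and a separation -> gravitational force.
m1, m2, r = rng.uniform(1, 10, (3, 1000))
F = G * m1 * m2 / r**2

# Hypothetical candidate library; a real system searches far more broadly.
candidates = {
    "G*m1*m2/r**2": lambda m1, m2, r: G * m1 * m2 / r**2,
    "G*(m1+m2)/r":  lambda m1, m2, r: G * (m1 + m2) / r,
    "G*m1*m2/r":    lambda m1, m2, r: G * m1 * m2 / r,
}

for name, f in candidates.items():
    err = np.max(np.abs(f(m1, m2, r) - F))
    print(f"{name:15s} max error = {err:.3e}")

# The first candidate fits the data exactly: a one-line equation can now stand
# in for whatever large model was used to fit the same observations.
```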

To better understand how this worked, imagine trying to model gravity using a neural network. To teach the network, you would show it many examples of gravity in action—apples falling from trees, cannon balls flying through the air, maybe satellites circling Earth.

Eventually, the network would become a gravity calculator, predicting with high accuracy how objects fall. But there is a catch. The “neural equation for gravity” found by the network would be voluminous, including those many parameters that make deep learning so flexible. Indeed, when the AI Feynman researchers modeled gravity this way, the resulting equation would have filled 32,000,000 chalkboards!
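To get a feel for the scale mismatch, here is a minimal, hypothetical PyTorch sketch—not the network the researchers actually trained—showing that even a tiny model fitted to gravity data carries thousands of parameters, where the underlying law needs one line:

```python
# Hypothetical sketch: fit a small neural network to gravitational-force data.
import torch
import torch.nn as nn

torch.manual_seed(0)
G = 6.674e-11

# Training data: (m1, m2, r) -> force, with inputs drawn from [1, 10].
x = torch.rand(10_000, 3) * 9 + 1
y = G * x[:, 0:1] * x[:, 1:2] / x[:, 2:3] ** 2

model = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
print(sum(p.numel() for p in model.parameters()), "parameters")  # 4481

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2_000):
    opt.zero_grad()
    # Rescale the tiny targets so the loss is numerically well behaved.
    loss = nn.functional.mse_loss(model(x), y * 1e9)
    loss.backward()
    opt.step()
```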

While deep networks are immensely flexible in what they can model, they are more expensive to run than are systems that use equations, which are preferable when the phenomenon at hand is amenable to being described by them. [Illustration: Alison Walsh]

The irony, of course, is that we know that there is a simple and efficient way to model gravity: Newton’s equation, which is taught in high school physics. All the rest of the complexity in the neural-network version came from having to discover this formula using deep learning. That is overhead you’d like to shed so you don’t have to pay the computational costs of running it.
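For reference, that one-line law is Newton's law of universal gravitation,

$$F = G\,\frac{m_1 m_2}{r^2},$$

where $F$ is the attractive force between two masses $m_1$ and $m_2$ separated by a distance $r$, and $G$ is the gravitational constant.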

What we are suggesting is a two-step process: First, use the flexibility of deep learning to discover things; then jettison the unneeded complexity by distilling the insights gained from the network back into concise, efficient equations.

The AI Feynman team did that exact experiment, training 120 neural networks to predict various natural phenomena, including gravity. In 118 of the 120 cases, this “symbolic distillation” technique was able to cut the oversized, millions-of-parameters models all the way down to the underlying (extremely efficient) physics equations.

Symbolic distillation is more sophisticated than pruning because of how it learns from data and deduces the overall relationship between inputs and outputs. So, in our gravity example, the algorithm would analyze how particular inputs, like the mass of an object, affect the answer the network predicts. Whole parts of the network can then be simplified based on a variety of telltale signs, such as when two variables don’t interact. Even better, as the network gets simpler, it becomes feasible to test whether whole sections can be replaced with mathematical equations, which are easier to calculate and are more interpretable than what’s normally inside these networks.
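As a hypothetical illustration of one such telltale sign—this is a generic numerical probe, not the specific procedure used in AI Feynman—the sketch below checks whether two inputs of a model interact at all. If they don't, the model splits into independent, simpler pieces:

```python
# Hypothetical sketch: probe a black-box model for additive separability in
# two of its inputs. If f(a, b) = g(a) + h(b), the mixed second difference
# f(a,b) - f(a,b') - f(a',b) + f(a',b') vanishes for every choice of points.
import numpy as np

rng = np.random.default_rng(1)

def model(a, b):
    # Stand-in for a trained network's prediction function.
    return np.sin(a) + b**2           # additively separable by construction

a, a2 = rng.uniform(-3, 3, (2, 1000))
b, b2 = rng.uniform(-3, 3, (2, 1000))

mixed_diff = model(a, b) - model(a, b2) - model(a2, b) + model(a2, b2)
print("max |mixed difference| =", np.abs(mixed_diff).max())
# A value near zero suggests a and b do not interact, so the model can be
# split into two simpler pieces, each a candidate for its own equation.
```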

Of course, this technique won’t always be as effective as it is for gravity. Some phenomena may be inherently complex—the turbulent flow of fast-moving water, say, or natural language—and simple underlying equations may not emerge. But where it does work, this technique promises to reduce the computational cost (and hence environmental cost) of these models by many orders of magnitude.

Thus far what we’ve described only helps with deployment of a deep-learning system, because the full network must still be trained normally in the first place. But symbolic distillation can also help with future training costs by taking advantage of modularity. Already, trainable or premade modules are being used in network design. For example, researchers have designed networks with separate modules so that each can become an expert in a particular subtask. Convolutions and attention layers can also be seen as trained modules and are in widespread use in image recognition.

The problem is that you can’t build an efficient module that you haven’t yet discovered. This limits the ability of traditional neural networks to build on the results of similar networks that came before them. But symbolic distillation suggests an alternative: You could learn concepts, transform these into modules, and then use them as premade building blocks for subsequent networks. Indeed, you can even create these modules as part of the training process.

How does this work? Imagine a neural network that predicts where a baseball will land once it has been hit. It will clearly need to understand concepts like gravity and air resistance, either explicitly or implicitly. With symbolic distillation, these two concepts could first be learned as modules, and the number of calculations could then be drastically reduced by replacing those network modules with equations. What’s more, these modules could then be reused as part of future networks that predict where meteorites, say, will land, which would save the cost of training such functionality over and over.
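Here’s a hypothetical sketch of what that composition might look like—not the authors' implementation, just an illustration: an analytic, drag-free projectile module does the heavy lifting, and a small trainable network only learns the residual effects, such as air resistance.

```python
# Hypothetical sketch: compose an analytic physics module with a small learned
# correction, so the network no longer has to rediscover gravity from scratch.
import torch
import torch.nn as nn

g = 9.81  # m/s^2

def vacuum_range(speed, angle):
    """Distilled 'gravity module': drag-free projectile range."""
    return speed**2 * torch.sin(2 * angle) / g

class HybridRangeModel(nn.Module):
    def __init__(self):
        super().__init__()
        # A small network learns only the residual (e.g., air resistance).
        self.correction = nn.Sequential(
            nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, speed, angle):
        base = vacuum_range(speed, angle).unsqueeze(-1)
        residual = self.correction(torch.stack([speed, angle], dim=-1))
        return base + residual

model = HybridRangeModel()
speed = torch.tensor([30.0, 45.0])                 # m/s
angle = torch.deg2rad(torch.tensor([35.0, 50.0]))
print(model(speed, angle))                         # predicted distances (m)
```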

The authors suggest a new approach to deep learning, called symbolic distillation, by which parts of an otherwise complicated deep network are replaced by modules based on concise equations. [Illustration: Alison Walsh]

We believe that such a modular approach will lessen the cost of AI based on deep learning. It requires three distinct steps:

First, test whether existing systems can be modularized and simplified without losing accuracy, robustness, or other desirable features. If so, use these streamlined systems for implementations. If the system (or a part of it) can be represented by a simple formula, you also learn what aspect of the world the neural network was modeling. This may help develop an understanding of the fundamental properties of complex systems, something not possible with the current black-box approach.

Second, use the new streamlined modules to help train subsequent neural networks. For example, include a module that calculates the simple formula for gravity in a system designed to detect planetary systems from astronomical measurements. This strategy would help reduce the cost of future training as researchers develop more and more such modules.

Finally, create replicable machine-learning modules and systems by drawing on the emerging area of machine-learning operations (MLOps), which applies best practices from software engineering to machine-learning models. This effort could include version control and continuous integration to assemble large machine-learning systems from reusable modules and trainable parts.

The pressure to change how we do machine learning is growing rapidly. Already, many companies cannot innovate in important areas for which deep learning could be applied because they cannot afford the price tag. Even for those that can stomach the expense, expanding carbon footprints threaten to make deep learning environmentally unsustainable.

But deep learning can become more benign by using the complexity of neural networks only to learn about new aspects of the world, while invoking efficient, pretrained modules elsewhere. Streamlining modules, ideally down to simple mathematical equations, could slash the cost of running many AI systems and, we hope, reduce the training cost for these future systems as well.

Ultimately, we see the biggest benefit of symbolic distillation coming from its ability to address Canadian computer scientist Richard Sutton’s bitter lesson—that the history of AI is one of people trying to incorporate their expertise into systems, but then more-powerful computers overtake these efforts. We believe that a hybrid approach may provide some sweetener for Sutton’s bitter lesson by using AI systems first to learn new results, but then also to find an efficient way to implement them. Such an approach can help us make the AI enterprise sustainable.

Neil C. Thompson is a research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory.

The AI pioneer says it’s time for smart-sized, “data-centric” solutions to big issues

Andrew Ng was involved in the rise of massive deep learning models trained on vast amounts of data, but now he’s preaching small-data solutions.

Andrew Ng has serious street cred in artificial intelligence. He pioneered the use of graphics processing units (GPUs) to train deep learning models in the late 2000s with his students at Stanford University, cofounded Google Brain in 2011, and then served for three years as chief scientist for Baidu, where he helped build the Chinese tech giant’s AI group. So when he says he has identified the next big shift in artificial intelligence, people listen. And that’s what he told IEEE Spectrum in an exclusive Q&A.

Ng’s current efforts are focused on his company Landing AI, which built a platform called LandingLens to help manufacturers improve visual inspection with computer vision. He has also become something of an evangelist for what he calls the data-centric AI movement, which he says can yield “small data” solutions to big issues in AI, including model efficiency, accuracy, and bias.

The great advances in deep learning over the past decade or so have been powered by ever-bigger models crunching ever-bigger amounts of data. Some people argue that that’s an unsustainable trajectory. Do you agree that it can’t go on that way?

Andrew Ng: This is a big question. We’ve seen foundation models in NLP [natural language processing]. I’m excited about NLP models getting even bigger, and also about the potential of building foundation models in computer vision. I think there’s lots of signal to still be exploited in video: We have not been able to build foundation models yet for video because of compute bandwidth and the cost of processing video, as opposed to tokenized text. So I think that this engine of scaling up deep learning algorithms, which has been running for something like 15 years now, still has steam in it. Having said that, it only applies to certain problems, and there’s a set of other problems that need small data solutions.

When you say you want a foundation model for computer vision, what do you mean by that?

Ng: This is a term coined by Percy Liang and some of my friends at Stanford to refer to very large models, trained on very large data sets, that can be tuned for specific applications. For example, GPT-3 is an example of a foundation model [for NLP]. Foundation models offer a lot of promise as a new paradigm in developing machine learning applications, but also challenges in terms of making sure that they’re reasonably fair and free from bias, especially if many of us will be building on top of them.

What needs to happen for someone to build a foundation model for video?

Ng: I think there is a scalability problem. The compute power needed to process the large volume of images for video is significant, and I think that’s why foundation models have arisen first in NLP. Many researchers are working on this, and I think we’re seeing early signs of such models being developed in computer vision. But I’m confident that if a semiconductor maker gave us 10 times more processor power, we could easily find 10 times more video to build such models for vision.

Having said that, a lot of what’s happened over the past decade is that deep learning has happened in consumer-facing companies that have large user bases, sometimes billions of users, and therefore very large data sets. While that paradigm of machine learning has driven a lot of economic value in consumer software, I find that that recipe of scale doesn’t work for other industries.

It’s funny to hear you say that, because your early work was at a consumer-facing company with millions of users.

Ng: Over a decade ago, when I proposed starting the Google Brain project to use Google’s compute infrastructure to build very large neural networks, it was a controversial step. One very senior person pulled me aside and warned me that starting Google Brain would be bad for my career. I think he felt that the action couldn’t just be in scaling up, and that I should instead focus on architecture innovation.

“In many industries where giant data sets simply don’t exist, I think the focus has to shift from big data to good data. Having 50 thoughtfully engineered examples can be sufficient to explain to the neural network what you want it to learn.” —Andrew Ng, CEO & Founder, Landing AI

I remember when my students and I published the first NeurIPS workshop paper advocating using CUDA, a platform for processing on GPUs, for deep learning—a different senior person in AI sat me down and said, “CUDA is really complicated to program. As a programming paradigm, this seems like too much work.” I did manage to convince him; the other person I did not convince.

I expect they’re both convinced now.

Ng: I think so, yes.

Over the past year as I’ve been speaking to people about the data-centric AI movement, I’ve been getting flashbacks to when I was speaking to people about deep learning and scalability 10 or 15 years ago. In the past year, I’ve been getting the same mix of “there’s nothing new here” and “this seems like the wrong direction.”

How do you define data-centric AI, and why do you consider it a movement?

Ng: Data-centric AI is the discipline of systematically engineering the data needed to successfully build an AI system. For an AI system, you have to implement some algorithm, say a neural network, in code and then train it on your data set. The dominant paradigm over the last decade was to download the data set while you focus on improving the code. Thanks to that paradigm, over the last decade deep learning networks have improved significantly, to the point where for a lot of applications the code—the neural network architecture—is basically a solved problem. So for many practical applications, it’s now more productive to hold the neural network architecture fixed, and instead find ways to improve the data.
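A toy illustration of the shift Ng describes (a hypothetical sketch, not anything from Landing AI): hold the model fixed, improve only the labels, and watch test accuracy move.

```python
# Hypothetical sketch of the data-centric loop: the model architecture stays
# fixed; only the training data changes between iterations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Simulate noisy labeling on a quarter of the training set.
rng = np.random.default_rng(0)
noisy = rng.random(len(y_tr)) < 0.25
y_noisy = np.where(noisy, 1 - y_tr, y_tr)

fixed_model = LogisticRegression(max_iter=1000)   # architecture never changes

acc_before = fixed_model.fit(X_tr, y_noisy).score(X_te, y_te)
acc_after = fixed_model.fit(X_tr, y_tr).score(X_te, y_te)  # labels cleaned
print(f"same model, noisy labels: {acc_before:.3f}")
print(f"same model, clean labels: {acc_after:.3f}")
```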

When I started speaking about this, there were many practitioners who, completely appropriately, raised their hands and said, “Yes, we’ve been doing this for 20 years.” This is the time to take the things that some individuals have been doing intuitively and make them a systematic engineering discipline.

The data-centric AI movement is much bigger than one company or group of researchers. My collaborators and I organized a data-centric AI workshop at NeurIPS, and I was really delighted at the number of authors and presenters that showed up.

You often talk about companies or institutions that have only a small amount of data to work with. How can data-centric AI help them?

Ng: You hear a lot about vision systems built with millions of images—I once built a face recognition system using 350 million images. Architectures built for hundreds of millions of images don’t work with only 50 images. But it turns out, if you have 50 really good examples, you can build something valuable, like a defect-inspection system. In many industries where giant data sets simply don’t exist, I think the focus has to shift from big data to good data. Having 50 thoughtfully engineered examples can be sufficient to explain to the neural network what you want it to learn.

When you talk about training a model with just 50 images, does that really mean you’re taking an existing model that was trained on a very large data set and fine-tuning it? Or do you mean a brand new model that’s designed to learn only from that small data set?

Ng: Let me describe what Landing AI does. When doing visual inspection for manufacturers, we often use our own flavor of RetinaNet. It is a pretrained model. Having said that, the pretraining is a small piece of the puzzle. What’s a bigger piece of the puzzle is providing tools that enable the manufacturer to pick the right set of images [to use for fine-tuning] and label them in a consistent way. There’s a very practical problem we’ve seen spanning vision, NLP, and speech, where even human annotators don’t agree on the appropriate label. For big data applications, the common response has been: If the data is noisy, let’s just get a lot of data and the algorithm will average over it. But if you can develop tools that flag where the data’s inconsistent and give you a very targeted way to improve the consistency of the data, that turns out to be a more efficient way to get a high-performing system.

“Collecting more data often helps, but if you try to collect more data for everything, that can be a very expensive activity.” —Andrew Ng

For example, if you have 10,000 images where 30 images are of one class, and those 30 images are labeled inconsistently, one of the things we do is build tools to draw your attention to the subset of data that’s inconsistent. So you can very quickly relabel those images to be more consistent, and this leads to improvement in performance.
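A hypothetical sketch of the kind of consistency check Ng describes (not the LandingLens tooling): compare labels from multiple annotators and surface the images where they agree least.

```python
# Hypothetical sketch: flag images whose labels are inconsistent across
# annotators, so they can be reviewed and relabeled first.
from collections import Counter

# image_id -> labels assigned by different annotators (made-up data).
labels = {
    "img_001": ["scratch", "scratch", "scratch"],
    "img_002": ["scratch", "dent", "scratch"],
    "img_003": ["dent", "pit_mark", "scratch"],
}

def agreement(votes):
    # Fraction of annotators who chose the most common label.
    return Counter(votes).most_common(1)[0][1] / len(votes)

for image_id, votes in sorted(labels.items(), key=lambda kv: agreement(kv[1])):
    print(f"{image_id}: agreement {agreement(votes):.2f}  labels {votes}")
# Review the lowest-agreement images first and update the labeling guideline.
```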

Could this focus on high-quality data help with bias in data sets? If you’re able to curate the data more before training?

Ng: Very much so. Many researchers have pointed out that biased data is one factor among many leading to biased systems. There have been many thoughtful efforts to engineer the data. At the NeurIPS workshop, Olga Russakovsky gave a really nice talk on this. At the main NeurIPS conference, I also really enjoyed Mary Gray’s presentation, which touched on how data-centric AI is one piece of the solution, but not the entire solution. New tools like Datasheets for Datasets also seem like an important piece of the puzzle.

One of the powerful tools that data-centric AI gives us is the ability to engineer a subset of the data. Imagine training a machine-learning system and finding that its performance is okay for most of the data set, but its performance is biased for just a subset of the data. If you try to change the whole neural network architecture to improve the performance on just that subset, it’s quite difficult. But if you can engineer a subset of the data you can address the problem in a much more targeted way.

When you talk about engineering the data, what do you mean exactly?

Ng: In AI, data cleaning is important, but the way the data has been cleaned has often been in very manual ways. In computer vision, someone may visualize images through a Jupyter notebook and maybe spot the problem, and maybe fix it. But I’m excited about tools that allow you to have a very large data set, tools that draw your attention quickly and efficiently to the subset of data where, say, the labels are noisy. Or to quickly bring your attention to the one class among 100 classes where it would benefit you to collect more data. Collecting more data often helps, but if you try to collect more data for everything, that can be a very expensive activity.

For example, I once figured out that a speech-recognition system was performing poorly when there was car noise in the background. Knowing that allowed me to collect more data with car noise in the background, rather than trying to collect more data for everything, which would have been expensive and slow.
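That kind of error analysis can be as simple as breaking a metric down by a metadata tag, as in this hypothetical sketch:

```python
# Hypothetical sketch: break word error rate down by a metadata tag to decide
# where more data is worth collecting.
from collections import defaultdict

# (condition tag, per-utterance word error rate) for a held-out test set.
results = [
    ("quiet", 0.06), ("quiet", 0.05), ("cafe", 0.09),
    ("car_noise", 0.21), ("car_noise", 0.24), ("cafe", 0.08),
]

by_condition = defaultdict(list)
for condition, wer in results:
    by_condition[condition].append(wer)

for condition, wers in sorted(by_condition.items(),
                              key=lambda kv: -sum(kv[1]) / len(kv[1])):
    print(f"{condition:10s} mean WER {sum(wers) / len(wers):.2f}  (n={len(wers)})")
# The worst slice (here, car_noise) shows where targeted data collection will
# pay off, instead of collecting more of everything.
```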

What about using synthetic data, is that often a good solution?

Ng: I think synthetic data is an important tool in the tool chest of data-centric AI. At the NeurIPS workshop, Anima Anandkumar gave a great talk that touched on synthetic data. I think there are important uses of synthetic data that go beyond just being a preprocessing step for increasing the data set for a learning algorithm. I’d love to see more tools to let developers use synthetic data generation as part of the closed loop of iterative machine learning development.

Do you mean that synthetic data would allow you to try the model on more data sets?

Ng: Not really. Here’s an example. Let’s say you’re trying to detect defects in a smartphone casing. There are many different types of defects on smartphones. It could be a scratch, a dent, pit marks, discoloration of the material, other types of blemishes. If you train the model and then find through error analysis that it’s doing well overall but it’s performing poorly on pit marks, then synthetic data generation allows you to address the problem in a more targeted way. You could generate more data just for the pit-mark category.

“In the consumer software Internet, we could train a handful of machine-learning models to serve a billion users. In manufacturing, you might have 10,000 manufacturers building 10,000 custom AI models.” —Andrew Ng

Synthetic data generation is a very powerful tool, but there are many simpler tools that I will often try first. Such as data augmentation, improving labeling consistency, or just asking a factory to collect more data.
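In code, the targeted step for the pit-mark example might look something like this hypothetical sketch, which augments only the weak category rather than the whole data set:

```python
# Hypothetical sketch: generate extra training examples only for the category
# that error analysis flagged as weak (here, "pit_mark").
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    # Stand-in for real augmentation or synthetic rendering of defects.
    noisy = image + rng.normal(0, 5, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

dataset = [  # (image, label) pairs; images are toy 8x8 grayscale arrays.
    (rng.integers(0, 256, (8, 8), dtype=np.uint8), label)
    for label in ["scratch"] * 50 + ["dent"] * 50 + ["pit_mark"] * 5
]

weak_class = "pit_mark"
extra = [(augment(img), lbl) for img, lbl in dataset if lbl == weak_class
         for _ in range(10)]
dataset += extra
print(sum(1 for _, lbl in dataset if lbl == weak_class), "pit_mark examples now")
```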

To make these issues more concrete, can you walk me through an example? When a company approaches Landing AI and says it has a problem with visual inspection, how do you onboard them and work toward deployment?

Ng: When a customer approaches us we usually have a conversation about their inspection problem and look at a few images to verify that the problem is feasible with computer vision. Assuming it is, we ask them to upload the data to the LandingLens platform. We often advise them on the methodology of data-centric AI and help them label the data.

One of the foci of Landing AI is to empower manufacturing companies to do the machine learning work themselves. A lot of our work is making sure the software is fast and easy to use. Through the iterative process of machine learning development, we advise customers on things like how to train models on the platform, when and how to improve the labeling of data so the performance of the model improves. Our training and software supports them all the way through deploying the trained model to an edge device in the factory.

How do you deal with changing needs? If products change or lighting conditions change in the factory, can the model keep up?

Ng: It varies by manufacturer. There is data drift in many contexts. But there are some manufacturers that have been running the same manufacturing line for 20 years now with few changes, so they don’t expect changes in the next five years. Those stable environments make things easier. For other manufacturers, we provide tools to flag when there’s a significant data-drift issue. I find it really important to empower manufacturing customers to correct data, retrain, and update the model. Because if something changes and it’s 3 a.m. in the United States, I want them to be able to adapt their learning algorithm right away to maintain operations.
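A minimal, hypothetical sketch of such a drift check (not Landing AI's tooling): compare a model input's distribution on recent production data against a reference window and alert when they diverge.

```python
# Hypothetical sketch: flag data drift by comparing recent inputs against a
# reference window with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# E.g., mean image brightness per inspected part.
reference = rng.normal(loc=120, scale=10, size=5000)   # training-time data
recent = rng.normal(loc=135, scale=10, size=500)       # lighting has changed

stat, p_value = ks_2samp(reference, recent)
if p_value < 0.01:
    print(f"drift detected (KS statistic {stat:.2f}): review the data, "
          "relabel if needed, and retrain the model")
```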

In the consumer software Internet, we could train a handful of machine-learning models to serve a billion users. In manufacturing, you might have 10,000 manufacturers building 10,000 custom AI models. The challenge is, how do you do that without Landing AI having to hire 10,000 machine learning specialists?

So you’re saying that to make it scale, you have to empower customers to do a lot of the training and other work.

Ng: Yes, exactly! This is an industry-wide problem in AI, not just in manufacturing. Look at health care. Every hospital has its own slightly different format for electronic health records. How can every hospital train its own custom AI model? Expecting every hospital’s IT personnel to invent new neural-network architectures is unrealistic. The only way out of this dilemma is to build tools that empower the customers to build their own models by giving them tools to engineer the data and express their domain knowledge. That’s what Landing AI is executing in computer vision, and the field of AI needs other teams to execute this in other domains.

Is there anything else you think it’s important for people to understand about the work you’re doing or the data-centric AI movement?

Ng: In the last decade, the biggest shift in AI was a shift to deep learning. I think it’s quite possible that in this decade the biggest shift will be to data-centric AI. With the maturity of today’s neural network architectures, I think for a lot of the practical applications the bottleneck will be whether we can efficiently get the data we need to develop systems that work well. The data-centric AI movement has tremendous energy and momentum across the whole community. I hope more researchers and developers will jump in and work on it.

This article appears in the April 2022 print issue as “Andrew Ng, AI Minimalist.”