The Embedded Blog: Research tool uses machine learning to predict how fast code will run

Thursday, February 27, 2020


By Nick Flaherty www.flaherty.co.uk

Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) in the US have developed a machine-learning tool that predicts how fast computer chips will execute code from various applications.

Compilers typically rely on performance models that simulate how code runs on a given chip architecture, and use the results to guide code optimisation. Developers can also use these models to find and work on the bottlenecks that slow down execution.

However, the performance models for machine code are handwritten by a relatively small group of experts and are not always fully validated. As a result, simulated performance measurements often deviate from real-life results.

The researchers built a machine-learning pipeline that automates this process, making it easier, faster, and more accurate. The tool, called Ithemal, is a neural-network model that trains on labelled data in the form of “basic blocks” (fundamental snippets of computing instructions) to automatically predict how long it takes a given chip to execute previously unseen basic blocks. The results suggest it performs far more accurately than traditional hand-tuned models.
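A basic block is a straight-line run of instructions with one entry point and one exit: control enters only at the first instruction and leaves only after the last, typically at a jump. The Python sketch below splits a textual assembly listing into basic blocks under simplifying assumptions: a line ending in `:` is a label that starts a new block, and any instruction whose mnemonic starts with `j`, or is `ret`, ends one. The names `split_basic_blocks` and `is_terminator` and the x86-style listing are hypothetical; this is not how Ithemal or BHive extract their blocks.

```python
def is_terminator(instr: str) -> bool:
    """Assumed rule: jumps (mnemonics starting with 'j') and 'ret' end a block."""
    mnemonic = instr.split()[0].lower()
    return mnemonic.startswith("j") or mnemonic == "ret"


def split_basic_blocks(listing: str) -> list[list[str]]:
    """Split an assembly listing into basic blocks (lists of instruction strings)."""
    blocks: list[list[str]] = []
    current: list[str] = []
    for raw in listing.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A label is a potential jump target, so it starts a new block.
            if current:
                blocks.append(current)
            current = []
            continue
        current.append(line)
        if is_terminator(line):
            # Control may leave here, so the block ends after this instruction.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


listing = """
loop:
    add rax, rbx
    imul rcx, rax
    dec rdx
    jnz loop
    mov rdi, rax
    ret
"""

for i, block in enumerate(split_basic_blocks(listing)):
    print(i, block)
```

Running it prints two blocks: the four-instruction loop body ending in `jnz loop`, and the `mov`/`ret` pair that follows it.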

The researchers also presented a benchmark suite of basic blocks from a variety of domains, including machine learning, compilers, cryptography, and graphics, that can be used to validate performance models. They pooled more than 300,000 of the profiled blocks into an open-source dataset called BHive. During their evaluations, Ithemal predicted how fast Intel chips would run code even better than a performance model built by Intel itself using over 3,000 pages describing its chips’ architectures.

“Intel’s documents are neither error-free nor complete, and Intel will omit certain things, because it’s proprietary,” says co-author Charith Mendis, a PhD student at CSAIL. “However, when you use data, you don’t need to know the documentation. If there’s something hidden you can learn it directly from the data.”

In training, the Ithemal model analyses millions of automatically profiled basic blocks to learn how different chip architectures execute computation. Importantly, Ithemal takes raw text as input and does not require manually adding features to the input data. In testing, Ithemal can be fed previously unseen basic blocks and a given chip, and will generate a single number indicating how fast the chip will execute that code.
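Taking raw text as input means the model sees the instructions themselves rather than hand-engineered features such as instruction counts or latencies. A minimal sketch of that first step, assuming a simple scheme in which each instruction is split into mnemonic and operand tokens and each distinct token is mapped to an integer ID (the kind of input an embedding layer expects). The functions `tokenize_instruction` and `encode_block` are hypothetical, and this tokenization is not necessarily the one Ithemal uses.

```python
import re


def tokenize_instruction(instr: str) -> list[str]:
    """Split one instruction into lower-case mnemonic and operand tokens.

    Memory operands such as 'qword ptr [rsp+8]' are broken into their parts so
    that registers, offsets and brackets become separate tokens; commas are dropped.
    """
    return re.findall(r"[a-z0-9_]+|[\[\]+*-]", instr.lower())


def encode_block(block: list[str], vocab: dict[str, int]) -> list[list[int]]:
    """Map every token to an integer ID, adding unseen tokens to the vocabulary.

    The result stays nested per instruction, so a downstream model could first
    summarise each instruction and then the sequence of instructions.
    """
    encoded = []
    for instr in block:
        ids = []
        for tok in tokenize_instruction(instr):
            if tok not in vocab:
                vocab[tok] = len(vocab)
            ids.append(vocab[tok])
        encoded.append(ids)
    return encoded


vocab: dict[str, int] = {}
print(tokenize_instruction("mov rax, qword ptr [rsp+8]"))
print(encode_block(["add rax, rbx", "mov rcx, rax"], vocab))
```

The first line prints the tokens `mov`, `rax`, `qword`, `ptr`, `[`, `rsp`, `+`, `8`, `]`; the second prints `[[0, 1, 2], [3, 4, 1]]`, with `rax` reusing ID 1 because it was already in the vocabulary.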

To do so, the researchers automatically measured the average number of clock cycles a given microprocessor takes to execute the instructions in a basic block (basically, the sequence of boot-up, execution, and shutdown) without human intervention. Automating the process enables rapid profiling of hundreds of thousands or millions of blocks.
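Timing a single execution of a short block is dominated by measurement overhead and noise, so profilers generally run the block many times and divide. The sketch below shows one such scheme, assuming (hypothetically) that the block is timed once unrolled `n_small` times and once unrolled `n_large` times, with each measurement repeated. Subtracting the two cancels the fixed overhead both share, and taking the median of the repeats suppresses outliers such as interrupts. The function and the synthetic numbers are illustrative assumptions, not the researchers' actual profiler.

```python
from statistics import median


def cycles_per_iteration(
    runs_small: list[float], runs_large: list[float], n_small: int, n_large: int
) -> float:
    """Estimate the average cycles for one execution of a basic block.

    runs_small / runs_large are repeated total-cycle measurements of the block
    unrolled n_small and n_large times. Subtracting the two medians cancels the
    fixed overhead (timer reads, setup) that both measurements share.
    """
    if n_large <= n_small:
        raise ValueError("n_large must exceed n_small")
    return (median(runs_large) - median(runs_small)) / (n_large - n_small)


# Synthetic example: a block costing 2.5 cycles plus 40 cycles of fixed overhead,
# with one noisy outlier in each set of repeats.
small = [40 + 2.5 * 100, 40 + 2.5 * 100, 40 + 2.5 * 100 + 300]
large = [40 + 2.5 * 200, 40 + 2.5 * 200 + 500, 40 + 2.5 * 200]
print(cycles_per_iteration(small, large, 100, 200))  # 2.5
```

Using the median rather than the mean is a design choice here: a single interrupted run would otherwise drag the estimate upwards.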

The researchers found that Ithemal halved the error rate of traditional hand-crafted models: across a variety of basic blocks from multiple domains, Ithemal's error rate was 10 percent, while that of Intel's performance-prediction model was 20 percent.
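The article does not say exactly how the error rate was computed; a common choice for this kind of comparison is the mean relative error between predicted and measured cycle counts, and that is assumed below. Going from 20 percent to 10 percent is then a relative reduction of (20 − 10) / 20 = 0.5, which is the "50 percent" figure. The two-block input is a toy example, not data from the study.

```python
def mean_relative_error(predicted: list[float], measured: list[float]) -> float:
    """Average of |predicted - measured| / measured over all blocks."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need equal-length, non-empty inputs")
    return sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)


def relative_reduction(old_error: float, new_error: float) -> float:
    """Fraction by which new_error improves on old_error."""
    return (old_error - new_error) / old_error


# Toy values (not from the study): two blocks measured at 10 and 20 cycles,
# each predicted with a 10 percent error.
print(mean_relative_error([11.0, 18.0], [10.0, 20.0]))  # 0.1
# Error rates reported in the article: 20 percent (Intel model) vs 10 percent (Ithemal).
print(relative_reduction(20.0, 10.0))  # 0.5
```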

The tool should allow developers to generate code that runs faster and more efficiently on an ever-growing number of diverse and “black box” chip designs, says Mendis. For instance, domain-specific architectures, such as Google’s Tensor Processing Unit used specifically for neural networks, can be analysed. “If you want to train a model on some new architecture, you just collect more data from that architecture, run it through our profiler, use that information to train Ithemal, and now you have a model that predicts performance,” said Mendis.
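The collect, profile, train, predict loop Mendis describes can be sketched in a few lines. To keep the example self-contained, it swaps in a deliberately simple model for Ithemal's neural network: it assumes each opcode has a fixed cost and that a block's cycle count is the sum of those costs, then fits the costs to profiled (block, cycles) pairs by gradient descent. A real predictor must capture effects this linear model cannot, such as dependency chains and contention for execution units, which is the motivation for a learned neural model. The chip, opcode costs and data are all synthetic, and `train_opcode_costs` and `predict` are hypothetical names.

```python
from collections import Counter


def train_opcode_costs(
    data: list[tuple[list[str], float]], lr: float = 0.05, epochs: int = 2000
) -> dict[str, float]:
    """Fit per-opcode costs so that sum(cost[op] for op in block) ~ measured cycles.

    Uses full-batch gradient descent on the mean squared error.
    """
    costs = {op: 0.0 for block, _ in data for op in block}
    counts = [Counter(block) for block, _ in data]
    n = len(data)
    for _ in range(epochs):
        grad = dict.fromkeys(costs, 0.0)
        for cnt, (_, cycles) in zip(counts, data):
            error = sum(costs[op] * k for op, k in cnt.items()) - cycles
            for op, k in cnt.items():
                grad[op] += 2.0 * error * k / n
        for op in costs:
            costs[op] -= lr * grad[op]
    return costs


def predict(costs: dict[str, float], block: list[str]) -> float:
    """Predict cycles for a (possibly unseen) block; unknown opcodes cost 0."""
    return sum(costs.get(op, 0.0) for op in block)


# Synthetic "profiled" data for a hypothetical new chip on which
# add costs 1 cycle, mul 3 cycles and load 4 cycles.
true_cost = {"add": 1.0, "mul": 3.0, "load": 4.0}
blocks = [["add"], ["mul"], ["load"], ["add", "mul"],
          ["add", "load", "load"], ["mul", "mul", "add"]]
data = [(b, sum(true_cost[op] for op in b)) for b in blocks]

costs = train_opcode_costs(data)
print({op: round(c, 3) for op, c in costs.items()})
print(round(predict(costs, ["load", "mul", "add", "add"]), 3))  # close to 9.0 (4 + 3 + 1 + 1)
```

Retargeting to another chip means only replacing `data` with measurements from that chip and retraining; nothing in the model encodes a particular architecture.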

“Modern computer processors are opaque, horrendously complicated, and difficult to understand. It is also incredibly challenging to write computer code that executes as fast as possible for these processors,” says co-author Michael Carbin, an assistant professor in the Department of Electrical Engineering and Computer Science (EECS). “This tool is a big step forward toward fully modeling the performance of these chips for improved efficiency.”

In a separate paper presented at the NeurIPS conference, the team proposed a new technique for automatically generating compiler optimizations. Specifically, they automatically generated an algorithm, called Vemal, that converts certain scalar code into vector instructions, which operate on multiple data elements in parallel. Vemal outperforms hand-crafted vectorization algorithms used in the LLVM compiler.
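Vectorization replaces several identical scalar operations with one vector (SIMD) instruction that performs them side by side. The toy sketch below illustrates the kind of packing decision such an algorithm has to make: it greedily groups consecutive statements that use the same operator and do not depend on each other into packs of up to `width` lanes. It is loosely in the style of superword-level parallelism (SLP) vectorization, it is not Vemal (which the researchers generated automatically), and it ignores checks a real vectorizer needs, such as whether operands sit in adjacent memory locations. `Stmt` and `pack_statements` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """A scalar statement of the form dst = lhs op rhs."""
    dst: str
    op: str
    lhs: str
    rhs: str


def pack_statements(stmts: list[Stmt], width: int = 4) -> list[list[Stmt]]:
    """Greedily group consecutive, independent, same-operator statements.

    A statement joins the current pack only if it uses the same operator,
    does not read or overwrite a value the pack writes, and the pack is not
    already `width` lanes wide.
    """
    packs: list[list[Stmt]] = []
    current: list[Stmt] = []
    for s in stmts:
        written = {t.dst for t in current}
        can_join = (
            bool(current)
            and len(current) < width
            and s.op == current[0].op
            and s.lhs not in written
            and s.rhs not in written
            and s.dst not in written
        )
        if can_join:
            current.append(s)
        else:
            if current:
                packs.append(current)
            current = [s]
    if current:
        packs.append(current)
    return packs


# Four independent additions followed by a multiply that uses their results.
program = [Stmt(f"a{i}", "+", f"b{i}", f"c{i}") for i in range(4)]
program.append(Stmt("d", "*", "a0", "a1"))

for pack in pack_statements(program):
    kind = f"vector x{len(pack)}" if len(pack) > 1 else "scalar"
    print(kind, pack[0].op, [s.dst for s in pack])
```

Here the four additions become one 4-wide vector add, while the multiply stays scalar because it reads results produced by that pack.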

Next, the researchers are studying methods to make models interpretable. Much of machine learning is a black box, so it’s not really clear why a particular model made its predictions. “Our model is saying it takes a processor, say, 10 cycles to execute a basic block. Now, we’re trying to figure out why,” said Carbin. “That’s a fine level of granularity that would be amazing for these types of tools.”

They also hope to use Ithemal to make Vemal more effective, so that it automatically generates even faster code.

www.csail.mit.edu
