Microsoft has developed a real-time engine for artificial intelligence using Intel's latest 14nm Stratix 10 field programmable gate arrays (FPGAs). The technique removes software from the deep learning loop to reduce latency.
Project Brainwave was disclosed at this week's Hot Chips conference as a major leap forward in both performance and flexibility for cloud-based serving of deep learning models. It was designed for real-time AI, which means the system processes requests as fast as it receives them, with ultra-low latency. Real-time AI is becoming increasingly important as cloud infrastructures process live data streams, whether they be search queries, videos, sensor streams, or interactions with users.
Competitors Xilinx in FPGAs and NVIDIA in graphics processing units have both been pushing accelerated deep learning into embedded and IoT applications.
The Project Brainwave system is built with three main layers:
- A high-performance, distributed system architecture;
- A hardware deep neural network (DNN) engine synthesized onto FPGAs; and
- A compiler and runtime for low-friction deployment of trained models.
Project Brainwave also uses a powerful “soft” DNN processing unit (DPU), synthesized onto commercially available FPGAs, rather than a hardened, fixed DPU. Although hardened chips can have high peak performance, they must choose their operators and data types at design time, which limits their flexibility. Project Brainwave takes a different approach, providing a design that scales across a range of data types, with the desired data type being a synthesis-time decision.
The design combines the FPGA’s hard ASIC digital signal processing (DSP) blocks with synthesisable soft logic to provide a larger and more optimised set of functional units. This exploits the FPGA’s flexibility with highly customised, narrow-precision data types that increase performance without real losses in model accuracy. It also allows research innovations to be incorporated quickly through the synthesisable elements, with updates rolled out in a matter of weeks.
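As a rough illustration of the synthesis-time data type idea, the Python sketch below generates a description of a dot-product kernel specialised for one numeric format. The DataType class and generate_dot_product_kernel function are invented for this example and are not part of Microsoft's toolchain; a real flow would emit RTL for the FPGA, but the key point survives: narrower types buy more parallel lanes from the same fabric, and changing the type means re-synthesising the design rather than re-coding the model.

```python
# A minimal sketch, assuming a made-up DataType/generator API, of the
# "data type as a synthesis-time decision" idea described above. A real
# flow would emit RTL for the FPGA; this just prints a kernel description.

from dataclasses import dataclass

@dataclass(frozen=True)
class DataType:
    name: str
    bits: int  # storage bits per element

def generate_dot_product_kernel(dtype: DataType, fabric_bits: int) -> str:
    # The numeric format is baked in here, when the design is generated,
    # not when a request is served. Narrower types fit more parallel
    # lanes into the same amount of FPGA fabric.
    lanes = fabric_bits // dtype.bits
    return (f"dot-product kernel: {lanes} lanes of {dtype.name} "
            f"({dtype.bits} bits/element)")

for dt in [DataType("float32", 32), DataType("ms-fp8", 8)]:
    print(generate_dot_product_kernel(dt, fabric_bits=512))
```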
The system incorporates a software stack designed to support a wide range of popular deep learning frameworks. Microsoft has developed a graph-based intermediate representation to which models trained in those frameworks are converted, allowing them to be compiled down to the FPGA infrastructure.
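The article does not describe the intermediate representation itself, so the following is a minimal, hypothetical sketch of what such a graph IR could look like, here encoding a single GRU-style gate. The IRNode and IRGraph names are assumptions for illustration only.

```python
# A hypothetical sketch of a graph-based IR; the node set and class names
# below are invented for illustration. Here it encodes one GRU-style
# gate, z = sigmoid(x @ W + h @ U).

from dataclasses import dataclass, field

@dataclass
class IRNode:
    name: str
    op: str                 # e.g. "matmul", "add", "sigmoid"
    inputs: list = field(default_factory=list)

@dataclass
class IRGraph:
    nodes: list = field(default_factory=list)

    def add(self, name: str, op: str, inputs: list) -> str:
        self.nodes.append(IRNode(name, op, inputs))
        return name

g = IRGraph()
g.add("xW", "matmul", ["x", "W"])    # input projection
g.add("hU", "matmul", ["h", "U"])    # recurrent projection
g.add("pre", "add", ["xW", "hU"])
g.add("z", "sigmoid", ["pre"])       # update gate

# A framework-specific front end would produce this same IR from, say,
# a TensorFlow or CNTK graph; a back end would then map each node onto
# the FPGA's matrix-vector and activation units.
for n in g.nodes:
    print(f"{n.name} = {n.op}({', '.join(n.inputs)})")
```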
Even on early Stratix 10 silicon, the ported Project Brainwave system ran a large gated recurrent unit (GRU) model, five times larger than the Resnet-50 network Microsoft showed a few years ago, with no batching, and achieved record-setting performance.
The demo used Microsoft’s custom 8-bit floating point format (“ms-fp8”), which does not suffer accuracy losses (on average) across a range of models. The system achieved 39.5 Teraflops on this large GRU, completing each request in under one millisecond. At that level of performance, the Brainwave architecture sustains over 130,000 compute operations per cycle, driven by one macro-instruction issued every 10 cycles.
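The exact bit layout of ms-fp8 is not given here, so the toy quantiser below simply keeps a 2-bit mantissa (plus sign and exponent) to show the flavour of a narrow float format; Microsoft's actual format may differ. It also sanity-checks the quoted figures: 39.5 Tflops sustained over roughly 130,000 operations per cycle implies a clock near 300MHz.

```python
# A toy narrow-float quantiser; the 2-bit mantissa is an assumption for
# illustration only, not the real ms-fp8 layout.

import numpy as np

def quantize_narrow_float(x, man_bits=2):
    m, e = np.frexp(x)                   # x = m * 2**e, 0.5 <= |m| < 1
    scale = 2.0 ** man_bits
    m = np.round(m * 2 * scale) / scale  # normalised mantissa in [1, 2)
    return np.ldexp(m / 2, e)

print(quantize_narrow_float(np.array([0.1234, -3.7, 42.0])))
# -> coarse but usable approximations: [0.125, -3.5, 40.0]

# Sanity check on the quoted figures: 39.5 Tflops sustained by about
# 130,000 operations per cycle implies a clock near 300MHz, and one
# macro-instruction every 10 cycles covers ~1.3 million operations.
print(f"{39.5e12 / 130_000 / 1e6:.0f} MHz")  # ~304 MHz
```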
As the system is tuned over the next few quarters, the team expects significant further performance improvements.
Microsoft shortly plans to detail how Azure customers will be able to run their most complex deep learning models at this level of performance.
Related stories:
- Another startup aims for embedded AI
- ARM takes aim at embedded AI
- NVIDIA pushes artificial intelligence into embedded designs with Jetson TX2
- Echelon system uses AI to collect traffic data through street lighting
- Edge analytics vital for security says Greenwave
- Xilinx pushes machine learning and AI to the edge for embedded applications
- Startup aims to bring artificial intelligence to IoT nodes