Four keys to machine learning on the edge

Nov. 12, 2019
Machine learning is hard. Moving your ML model to embedded devices can be even harder.

By John Fogarty, advisory software engineer, Base2 Solutions

Machine learning is hard but moving your ML model to your embedded device can be even harder. Here, we’ll discuss a few pain points in this process, and some up-front planning that can save you a world of hurt. Addressing these issues early in the design process is key to getting your new gadget out the door. 

Most likely you will develop and train your machine-learning models using one of the big four (Google, Amazon, Microsoft, IBM) service stacks, one of the many MLaaS platforms (C3, BigML, WandB, Databricks, Algorithmia, OpenML, Paperspace, PredictionIO, DeepAI, DataRobot, etc.), or you’ll roll your own using some variant of Anaconda/Jupyter and ML frameworks such as Keras, TensorFlow, PyTorch, Caffe, MXNet, Theano, CNTK, Chainer, or Scikit-Learn.

At the end of the day, you’ll produce training datasets, your ML model, the model’s parameters (bias and weight matrices), and various APIs and data formats for executing the model. 

How do you get from this set of tools, code and data using many different formats, sources, licenses and execution environments into something that you can execute entirely inside some little box—one that may (or may not) be connected to the internet ever again? 

The initial code for your model will be written in Python, R, MATLAB, Lua, Java, Scala, C, or C++. Python is the most common; engineers and data scientists often start with R or MATLAB; the others tend to be used for niche applications. This source is, in turn, reliant on specific APIs and functions provided by numerical and ML frameworks; these frameworks (or at least the subset of functionality you rely on) must also be ported to the target hardware.

For your embedded system, there are several specific concerns:

Can you convert your model to C/C++ or interpret it using a C-based interpreter?

Can you take advantage of whatever parallel hardware is available?

Will it run fast enough to satisfy the application needs?

Will it fit into the limited hardware’s RAM and persistent storage?

Will there be enough power/battery life to make this all viable?
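
The RAM and storage question above can often be roughed out long before any porting work starts: a model’s parameter count, times the bytes per parameter, approximates the weight storage it needs. A back-of-the-envelope sketch (the layer shapes here are hypothetical placeholders, not a recommendation):

```python
# Rough sizing: parameter count x bytes-per-parameter approximates weight
# storage, before framework overhead and activation buffers are added.
def dense_params(inputs, outputs):
    """Weights plus biases for one fully connected layer."""
    return inputs * outputs + outputs

# Hypothetical layer shapes; substitute your own model's dimensions.
layers = [(1024, 256), (256, 64), (64, 10)]
params = sum(dense_params(i, o) for i, o in layers)

for name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: {params * bytes_per_param / 1024:.1f} KiB of weights")
```

Numbers like these also feed the power question: fewer bytes moved per inference generally means less energy spent.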

Let’s start by talking about the box itself. The constraints imposed by the low-level hardware/software stack bubble all the way up to the model-development process.

1 – One size does not fit all

Embedded ML now covers a wide range of devices and deployment modes. We can’t begin to cover all the device variants in this article, but here’s how I generally classify them:

  1. AMD64 or Intel x86-64 CPU; essentially a desktop in a funny box
  2. ARM or ARM derived system with an embedded Linux
  3. Custom hardware with Yocto/OE/Buildroot/LFS Linux
  4. Custom hardware with a non-Linux OS
  5. FPGAs and ASICs

These design choices are roughly in order of increasing development costs, but (potentially) decreasing final product price—especially if the volumes are sufficiently high.

AMD64 / Intel x86-64 and Linux

If your embedded platform runs a full-function Linux, there are few barriers to running your model. You’ll be more concerned with installation and performance, and, to a limited degree, any UI tweaks needed for your unusual interfaces.

When model inference is computationally expensive (such as video-image recognition), it needs significant hardware acceleration (usually a GPU). This requires installing CUDA and cuDNN, which can be tricky on non-standard systems. When using Nvidia GPUs, few, if any, code changes will be needed to run your model.

If you need limited acceleration, and your budget will handle it, you can add an Intel Movidius Neural Compute Stick. Greater acceleration may require additional hardware such as an FPGA like Flex Logix’s InferX or Intel’s Arria. If you must go this route, plan to do so early, as few ML frameworks will port easily. Expect significant rewriting of the model, either to TensorFlow Lite/ONNX or to the TensorFlow/Caffe subset supported by Intel’s OpenVINO toolkit.
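
If the TensorFlow Lite route is open to you, the mechanical conversion step is short; the harder part is confirming that every op in your graph is actually supported. A minimal sketch, assuming TensorFlow 2.x and using a throwaway Keras model as a stand-in for your own:

```python
import tensorflow as tf

# Throwaway stand-in for your trained model; substitute your own tf.keras model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Convert the Keras graph into a flat TensorFlow Lite buffer.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_bytes = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_bytes)

print(f"TFLite model size: {len(tflite_bytes) / 1024:.1f} KiB")
```

If convert() fails on an unsupported op, that is your early warning that the model, not the deployment script, needs rework.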

ARM systems with a supported Linux distro

Many ARM SoC-based systems are released with relatively complete distros (Raspbian on the Raspberry Pi, Ubuntu with the JetPack/L4T stack on the Nvidia Jetson), or you can readily build one; both Gentoo and Debian (see catalyst and debootstrap, respectively) provide toolchains for constructing the OS. Lightweight ARM distros such as Fedora ARM, Arch Linux ARM and others can be adapted to many devices with a bit more work.

While most Python libraries and many frameworks can be made to run on ARM platforms, there are some hurdles. Machine-vision libraries such as DLIB and OpenCV will often need to be built from source, with subtle dependency and compiler issues to overcome.

If an ARM Mali GPU can be integrated into the device, you will see significant power savings and inference acceleration. Caffe and TensorFlow are supported (to a limited degree) through OpenCL.

ARM developers get C-level support from Project Trillium, ARM’s machine-learning tools platform. These tools are based on CMSIS, ARM’s hardware-abstraction layer. CMSIS-NN is a C library (see the CMSIS-NN paper) for implementing NN layers.

Of special note: Android can be a viable choice if your device supports a touchscreen. This is a Google-centric offering, with an emphasis on TensorFlow Lite and the Android NN API. Expect to recode your model into C/C++.

Custom hardware with a customized Linux

When your system is based on less common CPU architectures such as PowerPC, Xtensa, Blackfin, ARC, SPARC, Microblaze, or NIOSII, you will probably build a custom Linux distribution with one of these toolsets:

  • Yocto—a project that combines the OpenEmbedded build system (OE-Core) and the Poky reference distro with a set of SoC/CPU manufacturer BSPs (board support packages)
  • Buildroot—less general than Yocto, but easier to use in many cases
  • LFS/BLFS—Linux from Scratch / Beyond Linux from Scratch

This OS work can have a surprising impact on your ML requirements. Even C/C++ source code proves troublesome when porting to exotic platforms. To plan for this, you should have source access and rights to all frameworks and tools involved, as well as significant expertise to handle customization and versioning issues that will arise.

Custom hardware with a non-Linux OS

Some specialized applications simply can’t run a Linux variant; these are typically regulated markets and safety-critical applications (DO-178C DAL A, IEC 61508 SIL 3, IEC 62304, and ISO 26262 ASIL D), as well as those with severe environmental or real-time constraints (military, biomedical, aerospace, marine).

Wind River’s VxWorks dominates the space for both real-time and certified embedded systems. ML on VxWorks has, until recently, required model conversion to C; however, VxWorks 7 now supports Python 3.8. The further the OS is from Linux, the more effort you’ll need to put into the port.

FPGAs and ASICs

Field programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs) are used in specialized environments, or very high-volume applications.

In this case, your CPU(s) and any accelerators will be composed of IP cores (purchased and programmed RTL source in Verilog or VHDL). Design verification and licensing costs will dominate your budget; your ML component may be insignificant in comparison.

For this level of system, we need another level of abstraction, which leads us to the next section.

2 – Models are math

Machine learning models are math, not code. Once you’ve trained a model, most of the interesting math is no longer needed, leaving you with a remarkably simple set of functions. These functions are parameterized by the large bias and weight matrices that allow the function chain to produce meaningful (rather than random) results. Many frameworks support conversion of the model graph and code into an ONNX (Open Neural Network Exchange) format that can be compiled by device-specific tools into C, or even RTL.
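
To make that concrete, here is a toy sketch of everything an edge device has to execute for a small fully connected network once training is finished. The weights are random placeholders standing in for your trained parameters, and the whole inference path is a few matrix multiplies, bias adds and activations:

```python
import numpy as np

# Random placeholder parameters; in a real deployment these are the trained
# weight and bias matrices exported from your framework of choice.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((128, 64)), rng.standard_normal(64)
W2, b2 = rng.standard_normal((64, 10)), rng.standard_normal(10)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def infer(x):
    """The entire inference path: two matmuls, two bias adds, two activations."""
    hidden = relu(x @ W1 + b1)
    return softmax(hidden @ W2 + b2)

probs = infer(rng.standard_normal(128))
print(probs.argmax(), float(probs.max()))
```

None of the training machinery (optimizers, gradients, data pipelines) appears here, which is exactly why the deployed artifact can be so much smaller than the development environment.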

With proper planning, your specific model’s math and matrices can be made significantly more portable to edge devices. Sure, you can recode simple models to C with libraries like CMSIS-NN, but automated translators are better…when they work. Nvidia’s TensorRT is optimized for its own products, but offers a general C++ API that can be implemented for most devices. Intel’s OpenVINO is specialized to machine vision, but can translate models to a general form for multiple CPU/GPU/VPU and FPGA systems.
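
Each of those translators starts from some interchange representation of the model, and ONNX is the most common entry point. A hedged sketch of exporting a small PyTorch stand-in model to ONNX (the model, shapes and file name are all illustrative):

```python
import torch
import torch.nn as nn

# Placeholder network standing in for your trained model.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# A dummy input fixes the graph's input shape for the exporter.
dummy = torch.randn(1, 128)

# Trace the model and write an ONNX graph that downstream tools
# (TensorRT, OpenVINO, TVM, vendor compilers) can consume.
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["logits"])
```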

The most exciting new development in model compilation is Apache TVM, an open, general compiler stack for optimized, embedded model generation. Keras-, MXNet-, PyTorch- and TensorFlow-based models are directly supported. The VTA (Versatile Tensor Accelerator) extends TVM to generate code for FPGAs and ASICs.
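
A rough sketch of what a TVM compile pass can look like, assuming a recent TVM release, the model.onnx file from the previous sketch, and an illustrative 32-bit ARM Linux target (the target triple and optimization level are examples, not recommendations):

```python
import onnx
import tvm
from tvm import relay

# Load the exported ONNX graph and lift it into TVM's Relay IR.
onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, shape={"input": (1, 128)})

# Example cross-compilation target for a 32-bit ARM Linux board.
target = "llvm -mtriple=armv7l-linux-gnueabihf"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# The result is a deployable library plus graph and parameters; for a real
# board you would point export_library at your ARM cross-toolchain.
lib.export_library("model.so")
```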

3 – Faster is slower

Moving a model to your edge device is easy when your Python code is small enough, and the limited performance of a standard software stack is fast enough. Unfortunately, embedded applications are usually constrained by limited hardware and grand expectations for performance, so you need to perform more inferences with bigger models than off-the-shelf solutions can handle.
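
Before reaching for extra hardware, measure what the plain software stack actually delivers; a simple latency loop against the TensorFlow Lite interpreter (reusing the hypothetical model.tflite from earlier) makes the gap between expectation and reality visible:

```python
import time
import numpy as np
import tensorflow as tf

# Load the flat buffer produced earlier; any .tflite file works here.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
x = np.random.random_sample(tuple(inp["shape"])).astype(inp["dtype"])

# Warm up, then time repeated invocations to estimate per-inference latency.
for _ in range(10):
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()

runs = 200
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()
elapsed = time.perf_counter() - start

print(f"{1000 * elapsed / runs:.2f} ms per inference "
      f"({runs / elapsed:.1f} inferences/s)")
```

Run the same loop on your development machine and on the target; the ratio between the two numbers is often the first honest estimate of how much acceleration you really need.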

Here’s where hard management tradeoffs between desires, expectations, resources, features and (of course) time are required. Speed costs money; how fast are you willing to go? Increasing ML performance either increases complexity or adds more hardware; these take significant time to integrate, which can easily break schedules and cause missed release dates.

Whenever possible, overbuild your first few prototypes. Add more capability than you think you’ll need. Then see if you can reduce this through tuning and the various optimizations we’ve been discussing.
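
One of the cheapest of those optimizations is post-training quantization, which shrinks the weights (and often speeds up inference on integer-friendly hardware) without retraining. A hedged sketch of the TensorFlow Lite flavor, again using the placeholder Keras model from earlier; in practice the representative dataset must come from your real input data, and accuracy has to be re-verified afterwards:

```python
import numpy as np
import tensorflow as tf

# Rebuild (or load) the placeholder Keras model from the earlier sketch.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

def representative_data():
    # A sample of realistic inputs lets the converter choose quantization
    # ranges; random data here is purely a placeholder.
    for _ in range(100):
        yield [np.random.random_sample((1, 128)).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
quantized_bytes = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(quantized_bytes)
print(f"Quantized model size: {len(quantized_bytes) / 1024:.1f} KiB")
```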

4 – You don’t own that

ML systems are complex, and your organization won’t own all the intellectual property used in your model and its associated tools. Everything from the underlying algorithms to the specifics of the application your product will serve could be covered by various forms of IP, such as: 

  • Patents on hardware, algorithms and applications
  • Copyrights and Copyleft licenses for software components
  • Export controls on high technology
  • Attributions for science and technology in the public domain 

Recent successes in neural-network-based machine learning are paralleled by an equally rapid patent-filing goldrush. According to a 2019 WIPO report, more than half of all AI-related patents have been filed since 2013. Initiate patent searches early in project development and negotiate terms or workarounds before you begin shipping products.

Pay careful attention to, and document, licenses for every open-source project used when building your model. GitHub repositories make this easy since virtually every significant tool is published with one or more permissive licenses: TensorFlow (Apache 2.0); Caffe (BVLC shared copyrights); Keras (MIT); CNTK (MIT derivative); Torch (BSD derivative); scikit-learn (BSD).

Embedded operating systems are composed of vast numbers of components. Some of the tools used to build them help you manage licensing requirements: Yocto and Buildroot will create a list of used licenses, and both can detect license changes. Yocto projects can be configured to exclude GPLv3 modules to avoid potential copyleft issues.

AI-related technologies are increasingly subject to export controls. Of special note are the American AI Initiative and China Technology Transfer Control Act of 2019, both of which could result in significant barriers to freely distributing and manufacturing ML-based products abroad.

Being a good citizen means that, even where it’s not required by law, your ML code and documentation should reflect and reference any papers and supporting science used in developing your model. As a side benefit, as your market develops, this helps your engineers locate sources of inspiration for future projects. 

So, what can we conclude from all this?

First, you should match the model to tools and frameworks supportable on the target hardware. You shouldn’t over-constrain the iterative design and tuning process, but once a model works well, its code should be ported to supported tools, side-by-side with the more general solution running in your servers. In most cases this means a limited Keras/TensorFlow Lite subset, and cuDNN for GPUs. When selecting a model from a zoo, put special effort into verifying ONNX compatibility; otherwise, select simpler models that are more portable to C.

Every solution accelerator has its own toolstack; both immediate and long-term costs and time are associated with selecting, testing and adapting your model to that stack. After selecting a framework, expect significant effort before your inference results match those in your servers.

And finally, remember that a lot of this work is a one-way street. Each retraining of the model will require some level of manual hand-holding before it will run again on the device. You can automate some of this, but the development pace of ML tools is rapid; you’ll already be working with a new toolstack well before you’ve optimized your current one. DevOps for embedded ML development is still in its infancy.