Welcome to the first part of an ongoing series I’m calling DevOps in a Regulated and Embedded Environment. This first part looks at the challenges posed by one particularly demanding embedded environment. Future posts will dig into the details of how the more interesting problems were solved and what we should have done differently given time and resources. In all, I expect this to be a four-part series, including this part.
To this point in my career, every DevOps project I’ve worked on has involved web applications. Web apps are, by and large, fairly similar. They’re built into a package and deployed behind a web server. Whether they’re written in Java, Scala, or pure JavaScript, they’re a fairly well-defined problem space for a DevOps engineer, and countless articles have been written about the right way to do deployments, the right way to test, the right way to do everything, really. But DevOps principles apply much more broadly, to software projects of all types. My most recent project required me to throw all of that out the window and develop a DevOps pipeline for an extremely challenging environment. The answer, as it turns out, is to go back to basic principles.
Background
The project involves the manufacture of medical devices, and naturally these devices need software to operate, both at the firmware level, to control custom hardware, and at the user interface level. These devices go in people. There’s no room for any kind of mistake. Certainty that everything runs exactly as expected is crucial, and there are legal requirements in place to provide a check against sloppiness. The business is regulated by the FDA, which is beginning to recommend an agile development methodology for organizations under its purview. We came in to help the organization evaluate its current progress towards agile adoption and to run a pilot agile project to move the software development team towards more predictability. One part of this endeavor was to introduce Continuous Integration / Continuous Delivery (CICD) to the software organization and provide the infrastructure and training needed for them to carry it forward themselves.
Challenges
The challenges for putting DevOps and automated CICD into a medical environment included:
- Legal and regulatory constraints
- Software that required cross-compilation to the embedded platform
- Non-existent deployment infrastructure and restricted tooling
- Long feedback cycles
- Scalability and hardware resource constraints
There were several challenges here that I’ll talk about in depth over a couple of posts. First, legal constraints over what code can and can’t be released require a change to standard git workflows. Though our standard recommendation will always be that branches should be short-lived and that teams should make an effort to maintain a single line of development, that just wasn’t possible in this case. I’ll cover the specific problems and what can be done in a future post.
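To give a rough sense of the shape such a workflow can take, here is a hypothetical sketch of a long-lived, gated release branch. The branch names, the cherry-pick based promotion, and the tag are illustrative assumptions, not the team’s actual workflow (that’s a topic for the later post).

```sh
# Hypothetical branch layout -- names and the promotion mechanism are
# illustrative assumptions, not the team's actual workflow.
# 'develop' holds all ongoing work; 'release/approved' may only ever contain
# changes that have cleared the legal/regulatory review gate.

git checkout -b feature/alarm-handling develop   # day-to-day work still uses short-lived branches
# ... commit, review, merge back to develop as usual ...

# Promotion is deliberately NOT a merge of develop: only commits that have been
# reviewed and approved for release are cherry-picked across the boundary.
git checkout release/approved
git cherry-pick <sha-of-approved-commit>
git tag -a v1.4.0-rc1 -m "Candidate build for verification"
```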
Second, the software is written primarily in C. That means that in order to get binaries into the target environment we have to cross-compile, touching multiple toolchains. Further, the build infrastructure is dictated by QNX, consisting of a set of Makefiles distributed with the QNX SDK as well as generated Makefiles inserted into the software projects themselves. In all, this spells fragility, and setting up an environment capable of doing builds for the organization was a step we only wanted to take once. Fortunately, a helpful IT department let us duplicate that environment as needed.
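To make the shape of that problem concrete, here is a minimal sketch of what the corresponding CI build step can look like, assuming a typical QNX 6.x install with the standard recursive-Makefile structure. The install paths, the Jenkins WORKSPACE variable, and the project directory name are assumptions for illustration and will differ from site to site.

```sh
#!/bin/sh
# Minimal sketch of a CI build step for a QNX cross-compile.
# Install paths and the project directory are assumptions for illustration.

# The QNX-supplied Makefiles locate the cross toolchain and the target
# headers/libraries through these environment variables.
export QNX_HOST=/opt/qnx650/host/linux/x86
export QNX_TARGET=/opt/qnx650/target/qnx6
export PATH="$QNX_HOST/usr/bin:$PATH"

# Point make at the QNX-distributed include files (recurse.mk, common.mk, ...)
# that the generated project Makefiles rely on.
export MAKEFLAGS="-I$QNX_TARGET/usr/include"

# The recursive Makefiles walk the CPU/variant subdirectories themselves;
# from CI we just kick off the top level and fail the job on any error.
cd "$WORKSPACE/control-app" || exit 1
make clean all || exit 1
```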
Next, deployment infrastructure up to this point simply did not exist. The scripts necessary to load code onto the target platform over the network had been written at some point, but were never maintained. Instead, a separate application was used to prepare USB sticks that could flash a new build, along with all required firmware and operating system components, onto the target. This required a physical dongle to be inserted into the device, as well as the USB stick, and required manual intervention during the 30-minute deployment process.
These things added up to long feedback cycles. Builds were performed infrequently, and the entirely manual testing process was conducted over a period of multiple man-weeks.
But trying to shorten that loop wasn’t trivial either. I already mentioned the trickiness of getting a build environment set up, but much worse was actually getting network-driven deployments going. The target environment is locked down and fairly restrictive. QNX powers BlackBerry phones, so you might expect a healthy amount of tooling to be available. However, due to security concerns, the QNX image being loaded onto these devices is extremely stripped down. Only the bare essentials to run the control application and associated firmware are included. This means sh, ksh, telnet, and ftp were available, but not bash, ssh, ruby, or python. Further, there are many peripheral components requiring firmware built by the team, each with its own finicky dependency concerns. In truth, the team believed this to be insurmountable. Network deployments were just never going to happen. But thanks to the magic of expect, we were able to get this working in a fairly clean way. I’ll talk about this in depth in the third post of this series.
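To give a flavor of that approach, here is a heavily simplified sketch of an expect-driven deployment step. The host names, credentials, prompts, and the on-target install script are all placeholders, and the real scripts (covered in part three) deal with firmware components and error handling that are omitted here.

```sh
#!/bin/sh
# Heavily simplified sketch of a network deployment driven by expect.
# TARGET_HOST, BUILD_HOST, the credentials, and /tmp/install.sh are placeholders.
expect <<'EOF'
    set timeout 300
    spawn telnet $env(TARGET_HOST)

    expect "login:"     { send "root\r" }
    expect "Password:"  { send "$env(TARGET_PASSWORD)\r" }
    expect "# "

    # Pull the freshly built image from the build server over ftp -- one of the
    # few transfer tools present on the stripped-down QNX image.
    send "ftp $env(BUILD_HOST)\r"
    expect "Name"       { send "builduser\r" }
    expect "Password:"  { send "$env(FTP_PASSWORD)\r" }
    expect "ftp> "      { send "binary\r" }
    expect "ftp> "      { send "get app-image.tar /tmp/app-image.tar\r" }
    expect "ftp> "      { send "bye\r" }

    # Hand off to a placeholder install script on the device and log out.
    expect "# "         { send "sh /tmp/install.sh /tmp/app-image.tar\r" }
    expect "# "         { send "exit\r" }
    expect eof
EOF
```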
There were further challenges because the dev team was being introduced to entirely new (to them) practices. For example, the team was not used to unit testing, and we introduced the practice with coaching and mentoring. However, this brought up further technical challenges because the unit tests had to be built against the same QNX and in-house libraries used to build the software itself. This meant building the unit tests produced an executable that could only be run in the target environment instead of on the Continuous Integration server. This, along with other automated testing needs, creates a scalability and resource problem that I’ll talk about in the fourth and final part of this series.
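As an illustration of what “unit tests on the target” ends up meaning for the pipeline, here is a rough sketch of the corresponding CI step. The Makefile target, binary path, result format, and the deploy-and-run helper are all hypothetical; the transfer and execution step would reuse the same expect-over-telnet trick sketched above.

```sh
#!/bin/sh
# Rough sketch of a CI step that runs cross-compiled unit tests on the device.
# The Makefile target, paths, summary format, and deploy-and-run.sh helper are
# hypothetical placeholders.

# 1. Build the tests with the same QNX toolchain as the product code, so they
#    link against the real QNX and in-house libraries (and therefore only run
#    on the target).
make -C tests all || exit 1

# 2. Ship the test binary to the device, run it there, and capture its output
#    (deploy-and-run.sh wraps the expect-over-telnet approach shown earlier).
./deploy-and-run.sh tests/armle-v7/unit_tests > test-output.log 2>&1

# 3. Fail the Jenkins job if any test failed, assuming the harness prints a
#    summary line such as "FAILED: <count>".
if grep -q "FAILED: [1-9]" test-output.log; then
    echo "Unit tests failed on target" >&2
    exit 1
fi
```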
Solutions?
Several of the problems I’ve outlined so far are going to take a lot to explain in enough detail for me to present a solution. I’ve therefore opted, as hinted at throughout this post, to break this up into a series where we’ll consider specific problems and solutions in all their gory detail. But there are a couple of points I want to re-emphasize because they don’t require nearly as much detail.
First, those unit tests: yes, it’s annoying that we can’t run them on Jenkins. But we’re going to need hardware for various other kinds of testing anyway. This makes unit testing, for a project of this kind, much more similar to the other kinds of testing a team might wish to employ (e.g. functional, regression, performance, etc.) in that the tests get run against the target platform and consume resources in the CI stack. That’s ok. As long as we’re actually doing Continuous Integration in its entirety we don’t lose anything, except maybe that developers might be a little more inclined to check in code without running the unit tests themselves. Make sure they know how to build the unit tests on their development platform, emphasize that breaking the build is a blocker for the whole team, and this problem can be minimized as a team process issue.
Second, that really long feedback cycle I mentioned is ultimately something that’s never going to go away. You can’t get real feedback from users until various regulatory approvals have been granted and the government is never going to be as fast as we’d like. But the real takeaway here is that we shouldn’t allow this to lengthen the parts of the feedback cycle that remain within the team’s control. We can find out what is and isn’t broken quickly and regularly by following solid testing principles – automate the tests that can be automated and run them as often as is feasible – as well as by maintaining a robust CICD stack that minimizes the headache of getting software into production. My real hope is that this series helps your team get there too.
This was ultimately a very challenging project, with the lack of available tooling, non-existent infrastructure, and ever-present regulatory concerns creating an environment without much available support from the larger DevOps community. Thus, the entire CICD stack that got built out was necessarily ad-hoc, hand-rolled, and entirely custom. There was, after all, nothing out there to lean on (that I could find, anyway!). Despite that, I want to ask in this series: what can we learn? This was my first attempt at doing DevOps for embedded projects and probably Coveros’ most advanced project in that space. Going back over what we did and what we’d do differently will be an enormous learning opportunity for me. So I hope you’ll join me as we explore DevOps in a Regulated and Embedded Environment.