Like many people nowadays, I develop applications in my spare time for fun and education. Naturally, I constantly bump into various open source development communities. Everybody knows the big ones: the Linux community, the Ruby on Rails community, and many others that are not nearly as large or widely known.
As part of my day job providing software R&D services, I field consulting requests from our clients from time to time. They want to know what we would recommend improving in their software engineering organization, which typically has a number of products and projects running in parallel, from sustaining legacy products to developing platforms for future generations of products.
Beginning: Complicated from the very start
I started wondering whether replicating the open source community approach might help in building an efficient software engineering organization. It doesn’t apply to every case, but there are certain cases where it’s worth considering. Here is what I’m talking about. Some organizations (especially device vendors that ship various versions of hardware for different markets and purposes, all running their custom embedded software) have developed the following setup:
- A common platform consisting of software components for typical tasks: UI widget libraries, communication stacks, components encapsulating proprietary algorithms, and abstraction/realization layers for various data inputs, third-party devices, or applications
- Device-specific software builds based on that common platform but (1) augmented with device-specific algorithms and code and (2) integrating, configuring, and customizing the required common components in a device-specific way
Often, there are more than two specialization/abstraction layers, some of which can be shared among a group of devices and some of which are unique to a specific device (e.g., an OS abstraction layer that gives common algorithms a single interface across VxWorks, Linux, and Windows Embedded; several communication stacks; hardware abstraction layer implementations for various generic hardware platforms; an implementation of a common extension API that third-party applications can use; etc.).
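To make the layering more concrete, here is a minimal sketch of what such an OS abstraction layer might look like in C. All the names (osal.h, osal_mutex_t, and so on) are hypothetical, invented for this example; the point is that common components code against a single interface, while each supported OS gets its own realization:

```c
/* osal.h -- hypothetical OS abstraction layer (OSAL) interface.
 * Common platform components call only this API; each supported OS
 * (VxWorks, Linux, Windows Embedded) provides its own realization. */
#ifndef OSAL_H
#define OSAL_H

#include <stdint.h>

typedef struct osal_mutex osal_mutex_t;  /* opaque, defined per OS */

osal_mutex_t *osal_mutex_create(void);
void osal_mutex_lock(osal_mutex_t *m);
void osal_mutex_unlock(osal_mutex_t *m);
void osal_mutex_destroy(osal_mutex_t *m);

/* Monotonic time in milliseconds, hiding each OS's native clock API. */
uint64_t osal_monotonic_ms(void);

#endif /* OSAL_H */
```

A Linux build would then link a realization like this one, while VxWorks and Windows Embedded builds would each link their own file implementing the same header:

```c
/* osal_linux.c -- the Linux realization of the same interface. */
#include "osal.h"
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

struct osal_mutex { pthread_mutex_t impl; };

osal_mutex_t *osal_mutex_create(void) {
    osal_mutex_t *m = malloc(sizeof *m);
    if (m) pthread_mutex_init(&m->impl, NULL);
    return m;
}

void osal_mutex_lock(osal_mutex_t *m)   { pthread_mutex_lock(&m->impl); }
void osal_mutex_unlock(osal_mutex_t *m) { pthread_mutex_unlock(&m->impl); }

void osal_mutex_destroy(osal_mutex_t *m) {
    pthread_mutex_destroy(&m->impl);
    free(m);
}

uint64_t osal_monotonic_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000u + (uint64_t)ts.tv_nsec / 1000000u;
}
```

Everything above this layer, i.e., the common algorithms, stays untouched when a new OS is added to the list of supported targets.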
Evolution: Loss of simplicity
Over time, the number of devices grows. Even when they use the same common component, they tend to use different versions of it. The engineering team working on the set of products is split into teams that work on specific sub-platforms, specific device revisions, and special projects. Device release teams report defects and make change requests to platform teams. Platform teams work according to their own roadmaps and deadlines and can’t devote enough resources to quickly understand what somebody outside the team wants fixed. They don’t have time to implement the fix, or can’t do it in the middle of a “code freeze,” or need time to make sure the fix works for all supported devices, not just the one it was requested for. You get the picture.
And don’t get me wrong: this happens not only to embedded software teams dealing with variations of hardware platforms. It can also happen to a purely software product that has several custom versions created and maintained for key clients, or to a mobile app with a common core and platform-specific code for Android, iOS, Windows Phone, Tizen, Blackberry, Bada, and whatnot.
But it’s hard to explain the details in fully generic terms, so I will continue with my “single common platform, multiple target devices” example and keep referring to the “common platform” and “device-specific code.”
Crisis: Signs of failure
As a result, the issues that such software engineering organizations face are often similar:
- The re-use ratio is low. Teams don’t share code or knowledge, and different teams continuously re-invent the wheel. Algorithms implemented by one team are not encapsulated in re-usable components, are not propagated to the common platform, and thus are not available to other teams.
- The lifecycle for defects and new functionality that involve changing the common platform is very long. Even when a team knows exactly what needs to change in the common code, it takes time and a lot of red tape to get the change onto the roadmap and then implemented.
- The code is more fragile. Teams can’t wait for common platform changes, so they fix common platform problems in the device-specific code and implement potentially common algorithms there as well (see the sketch after this list). As a result, the device-specific part grows in size, and when the common platform changes, there is often at least something that breaks in that part.
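To illustrate that last point, here is a hypothetical C sketch of such a device-specific workaround; the platform API names (platform_net.h, net_connect(), platform_link_is_up()) are all invented for the example:

```c
/* device_net.c -- a hypothetical device-specific workaround.
 * The platform's net_connect() hangs on this device when the link is
 * down. Instead of waiting for a platform fix, the device team wraps
 * the call and patches the behavior locally. */
#include "platform_net.h"  /* hypothetical common platform header */

int device_net_connect(const char *host) {
    /* Local fix: bail out early if the link is down, which the frozen
     * platform version does not do on this target. */
    if (!platform_link_is_up())
        return -1;
    return net_connect(host);
}
```

The workaround works today, but it encodes assumptions about the platform’s internals; when the next platform version changes how net_connect() behaves, this is exactly the kind of code that breaks.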
Uncharted shores: Where the wild things are
One way to deal with this is to borrow certain practices from open source communities. Here is the proposed approach:
- Break all software into components. Assign each component a team that develops and maintains it according to a roadmap of its own.
- The assigned team is responsible for making the component work according to spec on all supported targets (what a target is depends on the specific case; basically, it is a combination of hardware, underlying OS, and main third-party components such as graphics libs).
- Each component should be easy to re-use on all supported targets, which requires a unified build environment covering all of them. It must also be easy to verify that introduced changes don’t break anything, which requires good coverage with automated tests.
- All other teams that use this component in the products they build may make changes in the source code in a controlled manner.
- There is not a single “top of the trunk” version of each component available to other teams but a whole set of supported versions. Support for old versions is dropped as new ones are developed, but at least a couple of major and a few minor versions are supported by the assigned team at any time.
- Each team working on a particular version of an external product for a specific target takes a specific version of the component and freezes it for the duration of the work on the external release (a compile-time sketch of such pinning follows this list). Migrating to a later version of the component is a separate task that should be planned for, as the new version may introduce interface or logic changes that break something in the external product.
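As a rough sketch of what “freezing” can look like in practice (beyond pinning the exact component version in the build system), a product can also assert the pin at compile time so that an accidental upgrade fails loudly. The widgetlib component and its version macros are invented for this example:

```c
/* product_deps.h -- hypothetical compile-time version pin for one
 * external product release. "widgetlib" is an invented component name. */
#include "widgetlib_version.h"  /* defines WIDGETLIB_VERSION_MAJOR/MINOR */

/* This release is frozen on widgetlib 2.4+. Moving to 3.x may bring
 * interface or logic changes, so that migration is planned separately. */
#if WIDGETLIB_VERSION_MAJOR != 2 || WIDGETLIB_VERSION_MINOR < 4
#error "This product release is pinned to widgetlib 2.4+; migrating to a newer major version is a separate planned task"
#endif
```

The build-system pin is the real gate; the compile-time check is just a cheap safety net that catches a mismatched checkout before anyone wastes a test cycle on it.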
Whenever a non-assigned team wants to make a change to a component, the following happens:
- [“Pull”] This team makes a separate branch of the version it uses for the target.
- It makes all the changes it needs (fixes for discovered target-specific bugs, features it needs now that are not on the near-term roadmap, etc.). The usual quality assurance process for the specific product applies (peer reviews, test traceability, test coverage, etc.).
- It uses the modified component in its product as much as it wants.
- It runs the standard set of automated tests provided for the component by the assigned team, which exercises the component thoroughly and across multiple targets.
- [“Push”] It submits all changes to the assigned team for incorporation into the mainline, together with the automated tests developed for testing the changes (see the example after this list).
- The assigned team reviews the changes, discusses them with the submitting team, and accepts, rejects, or reworks them. It also incorporates the automated tests (after possible rework) into the component’s main test suite.
- The assigned team performs the necessary testing across all targets.
- The assigned team includes the change in the officially released version available to all teams.
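To illustrate the “push” step, here is what a submitted automated test might look like. The widgetlib API and the bug being fixed are invented for this example; the test sticks to plain assert() so the assigned team can run it unchanged on every supported target:

```c
/* test_widgetlib_connect.c -- a hypothetical test submitted together
 * with a device team's fix ("push" step). It pins down the behavior
 * the fix introduces so regressions surface on every target. */
#include <assert.h>
#include <stddef.h>
#include "widgetlib.h"  /* hypothetical component under test */

static void test_zero_timeout_is_rejected(void) {
    /* The submitted fix: widget_connect() must fail fast on a zero
     * timeout instead of blocking forever on some targets. */
    widget_conn_t *conn = widget_connect("dev0", /* timeout_ms = */ 0);
    assert(conn == NULL);
}

int main(void) {
    test_zero_timeout_is_rejected();
    return 0;
}
```

Once accepted (possibly after rework), such a test joins the component’s main suite, so the fix stays protected on all targets, not just the one it was written for.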
Synopsis: Per aspera ad astra
There are obviously more details to consider for this approach (such as freeze periods for certain versions, review procedures, testing cycles for longer suites, etc.), but it is impractical to cover them all in this post.
What are the benefits? This approach:
- Improves integration between different teams and products inside the organization.
- Increases code re-use by promoting access to the code across the organization.
- Decreases time to market (TTM): new functionality and bug fixes can be made “on the spot” by the teams that need them, and new functionality is built on top of existing code rather than developed from scratch.
- Supports agile methodology by increasing the number of people who understand various parts of the code (and thus can estimate it and be involved in working on it). Reduces dependency on individual engineers.
- Still keeps changes under control and introduces a natural “second opinion” code review by engineers specializing in a specific area.
- Allows “centralization” of architectural decisions without turning that centralization into a bottleneck. Architectural decisions for a given component are ultimately made by a very narrow group of tech architects on the assigned team.
- Reduces tension between the teams, as each has a set of components that they primarily work on and have architectural control over.
I really like this approach, so I decided to outline it in this post. Maybe it will make somebody stop and think about how their parallel teams/components setup could be improved by adopting certain aspects of the open source community model.