Spend more on delivery, less on risk mitigation

Let’s do a simple Lean analysis of government IT system delivery projects. How much of our spend is on activities that directly create value, and how much is additional overhead? What percentage of our spend is value-creating?

The value-creating part of a software development project is primarily the actual development and testing of the software. Add to that the cost of the infrastructure on which it is run, the cost of designing and building that infrastructure, and perhaps the cost of any software components from which it is built. I mean to include in these costs the salaries of everyone who is a hands-on contributor to those activities.

The part that does not directly create value is primarily management overhead and risk mitigation activities. Add to these the costs of the contracting process, documentation, and a range of other activities. Let’s call all of this overhead. A great deal of this overhead is for risk mitigation – oversight to make sure the project is under control; management to ensure that developers are doing a good job; contract terms to protect the government against non-performance.

No one would claim that these overhead categories are bad things to spend money on. The real question is what a reasonable ratio would be between the two. Let’s try a few scenarios here. An overhead:value ratio of 1:1 would mean that for every $10 we spend creating our product, we are spending an additional $10 to make sure the original $10 was well-spent. Sounds wrong. How about 3:1? For every $10 we spend, we spend $30 to make sure it is well spent? Unfortunately – admittedly without much concrete evidence to base it on – I think 3:1 is actually pretty close to the truth.

Why would the ratio be so lopsided? One reason is that we tend to outsource most of the value-add work. The government’s role is management overhead and the transactional costs of contracting. Management overhead is duplicative: the contractor manages the project and charges the government for it, and the government also provides program management. Another reason is the many layers of oversight and the diverse stakeholders involved. Oversight has a cost, as does all the documentation and risk mitigation activity that is tied to it. When something goes wrong, our tendency is to add more overhead to future projects.

A thought exercise. Let’s start with the amount we are currently spending on value-creating activity, and $0 for overhead. Now let’s add incremental dollars. For each marginal dollar, let’s decide whether it should be spent on overhead or on additional value creation (that is, programmers and testers). Clearly we will get benefit from directing some of those marginal dollars to overhead. But very soon we will start facing a difficult choice: investing in more programmers will allow us to produce more. Isn’t that better than adding more management or oversight?

To produce better results, we need to maintain a strong focus on the value creating activities – delivery, delivery, delivery.

Who needs requirements?

On my other blog, I posted an entry on how agile approaches in a way dispense with the idea of requirements; instead a business need is translated directly into code (skipping the requirements step), with tests providing an objective way to see whether the result is acceptable.

This idea disturbs many government IT and procurement professionals. It shouldn’t.

Perhaps it will ease people’s minds to think of an agile process as something like a procurement done with a Statement of Objectives. In place of system requirements the government, throughout the course of development, presents the contractor with business needs, and the contractor is free to provide a solution without constraints. For the same reason that this is often good practice in contracting, it is also good practice in software development. I am not saying that agile procurements should be done through a Statement of Objectives (though that is a good idea in some cases), just pointing out the underlying similarity in concept.

One objection I hear is that without requirements, we cannot contract for services. Even if we could, how could we have a fair competition, since contractors bid on how they would address requirements? The trick here, I believe, is to distinguish between contractual requirements and system requirements. There is no rule that says that the contract or the RFP must include system requirements. Of course it must include some sort of requirements. The requirements depend on the basis for the competition – for example, if a procurement is for development services, we can state requirements for the services – required skills and experience, management approach, etc. Or we can state requirements for the business needs to be fulfilled. Perhaps the following comparison is in order: if I wanted security guard services I could specify that the security guards need to prevent people we don’t trust from entering the building. The solicitation does not need to list the names of the particular people we don’t trust.

A second objection is that we need the requirements to know whether the contractor or the project team has performed well. That seems to miss the point. If the requirements are satisfied but the product doesn’t meet the business need, then no one has been successful. We should gauge success by business value produced, business needs met, quality of work, customer service, and so on. Or we can judge the contractor’s success at meeting the business needs developed in the “conversations” with users. We don’t need system requirements in the solicitation to do this.

The main point to keep in mind is that better results are obtained by working directly from business needs to system development. Best results are what we want. We might have to change how we set up our contracts to get there. There is no conflict, from what I can see, with the Federal Acquisition Regulation.

DevOps and FISMA, part 2

In my last post I discussed how rapid feedback cycles from production can support FISMA goals of continuous monitoring and ongoing authorization. Today I’d like to discuss FISMA compliance and DevOps from another perspective.

In order to support frequent, rapid, small deployments to production, we must ensure – no surprise – that our system is always deployable, or “potentially shippable.” That means that our system must always be secure, not just in production, but also in the development pipeline. With a bit of effort, the DevOps pipeline can be set up so as to achieve this.

I find it helpful to think of security vulnerabilities or flaws as simply a particular kind of defect. I would treat privacy flaws, accessibility flaws (“Section 508 compliance”), and other non-functional flaws the same way. I believe this is consistent with the ideas behind the Rugged DevOps movement. We want to move to a zero-defect mentality, and that includes all of these non-functional types of defects.

Clearly, then, we need to start development with a hardened system, and keep it hardened – that way it is always deployable and FISMA compliant. This, in turn, requires an automated suite of security tests (and privacy, accessibility, etc.). We can start by using a combination of automated functional tests and static code analysis that can check for typical programming errors. We can then use threat modeling and “abuser stories” to generate additional tests, perhaps adding infrastructure and network tests as well. This suite of security tests can be run as part of the build pipeline to prevent regressions and ensure deployability.
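To make the “abuser story” idea concrete, here is a minimal sketch of how one might become an automated security regression test. Everything here – the password policy, the function name, the thresholds – is invented for illustration, not drawn from any particular NIST control:

```python
# Illustrative sketch: an "abuser story" turned into an automated
# security regression test. All policy values here are invented.

MIN_LENGTH = 12

def password_is_acceptable(password: str) -> bool:
    """Reject passwords that violate our (illustrative) hardening policy."""
    if len(password) < MIN_LENGTH:
        return False
    if password.lower() in {"password1234", "correcthorse"}:  # tiny deny-list
        return False
    has_letter = any(c.isalpha() for c in password)
    has_digit = any(c.isdigit() for c in password)
    return has_letter and has_digit

# Abuser story: "As an attacker, I try short or well-known passwords."
def test_rejects_short_passwords():
    assert not password_is_acceptable("abc123")

def test_rejects_known_bad_passwords():
    assert not password_is_acceptable("Password1234")

def test_accepts_compliant_password():
    assert password_is_acceptable("blue-Tangerine-42")

if __name__ == "__main__":
    test_rejects_short_passwords()
    test_rejects_known_bad_passwords()
    test_accepts_compliant_password()
    print("security regression tests passed")
```

Tests like these run on every build, so a later change that weakens the control fails the pipeline immediately rather than surfacing in an audit years later.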

How can we start with a hardened system, when we almost always need to develop security controls, and that takes time and effort? I don’t have a perfect answer, but our general strategy should be to use inherited controls – by definition, controls that are already in place when we start development. These controls may be inherited from a secure cloud environment, an ICAM system (Identity, Credential, and Access Management) that is already in place, libraries for error logging and pre-existing log analysis tools, and so on. These “plug and play” controls can be made to cover entire families of the controls described in NIST Special Publication 800-53.

Start hardened. Stay hardened. Build rugged.

How DevOps supports FISMA (Federal Information Security)

The DevOps model is based on rapid and constant feedback, both from the development process and from the system in production. Continuous integration, user review, and automated testing provide feedback during development; production monitoring, alerting, and user behavior provide feedback in production.

The Federal Government has been moving toward an interpretation of FISMA (the Federal Information Security Management Act) that is very much consistent with this feedback-based approach. The National Institute of Standards and Technology (NIST) publishes guidance on how agencies should implement FISMA. Its Special Publication 800-137 promotes the use of Information Security Continuous Monitoring (ISCM) and makes it the cornerstone of a new Ongoing Authorization (OA) program. A later NIST publication (June 2014) titled “Supplemental Guidance on Ongoing Authorization: Transitioning to Near Real-Time Risk Management” provides additional details. DHS and GSA have worked to create a Continuous Diagnostics and Mitigation (CDM) framework and a contract vehicle through which agencies can procure CDM services.

The core idea is that federal information systems should be continuously monitored for vulnerabilities while in production. Those vulnerabilities should be rapidly remediated and can be used to “trigger” security reviews based on the agency’s risk posture. In other words, we are moving from a process where security is tested and documented every few years to a process based on continuous feedback from production to a team that is charged with remediating and optimizing. It is, in other words, a DevOps system.

The title of the NIST publication indicates that there is more here than meets the eye. The intention is to move to a “near real-time risk management” approach that is based on frequent reassessments of risks, threats, and vulnerabilities. It moves the focus of security activities from documenting that required controls have been implemented (a compliance focus) to one of responding to a changing landscape of real, emerging threats (a risk-based, dynamic focus).

DevOps provides an ideal way to implement this new security approach. Continuous monitoring for security vulnerabilities is just another type of production monitoring in the DevOps world. A rapid feedback cycle enables the DevOps team to respond quickly to newly discovered vulnerabilities. Since the DevOps team has already shortened cycle time and automated its deployments, each vulnerability can be addressed as quickly as possible. As an added bonus, the system in production doesn’t need to be patched; instead the source can be modified, the entire system rebuilt and deployed to a new set of VMs, and the old ones torn down.

The influence can go both ways: by incorporating the ideas of triggers and business-based risk assessments, DevOps can be extended to include risk-based decision making.

Good technical practices are critical for government contracting

Good technical practices (such as those typical in DevOps environments) can help the government in contracting for information technology services. We should require these technical practices in our IT services contracts, and if we are investing in QA and independent verification, we should invest first in validating good technical practices. Let me give a few examples. Readers without a technical background should be able to find more information about these practices online.

Good, state-of-the-art testing practices are important for more than the obvious reasons. Most tests should be automated and should follow the classic “testing pyramid” (many unit tests, somewhat fewer integration tests, and fewer still at the user interface level). The automated tests themselves are just as important a deliverable from the contractor as the code itself.

There are many reasons why such automated tests are important in our contracting environment. The automated tests serve as regression tests that will speed later work on the system. If a second contractor does something that “breaks” the first contractor’s code, it will be spotted immediately; in essence, the tests “protect” the first contractor’s code. If a new contractor is brought in for O&M or future development, the automated tests serve as documentation of the requirements and allow the new contractor to make changes or refactor with confidence – the changes are safe as long as the regression tests continue to pass.
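As a small sketch of how tests act as documentation and protection – the business rule below is entirely invented for illustration – a successor contractor could rewrite the function any way they like, provided these tests keep passing:

```python
# Illustrative only: a business rule and the unit tests that document it.
# A new contractor can reimplement calculate_late_fee however they like,
# as long as these regression tests continue to pass.

def calculate_late_fee(days_late: int) -> int:
    """Fee in whole dollars: a grace period, then $5/day, capped at $50."""
    GRACE_DAYS = 3
    DAILY_FEE = 5
    MAX_FEE = 50
    if days_late <= GRACE_DAYS:
        return 0
    return min((days_late - GRACE_DAYS) * DAILY_FEE, MAX_FEE)

# Each test states a requirement more precisely than most prose documents.
def test_grace_period_is_free():
    assert calculate_late_fee(3) == 0

def test_fee_accrues_daily_after_grace():
    assert calculate_late_fee(5) == 10

def test_fee_is_capped():
    assert calculate_late_fee(100) == 50

if __name__ == "__main__":
    test_grace_period_is_free()
    test_fee_accrues_daily_after_grace()
    test_fee_is_capped()
    print("regression tests pass")
```

If a second contractor’s change makes any of these assertions fail, the break is caught at build time, with the failing test naming exactly which requirement was violated.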

Scripted deployments and “infrastructure as code” serve a similar function. By providing automated scripts to set up the production environment and deploy code, the contractor is documenting the deployment process (and reducing the amount and cost of paper documentation!). No longer is the knowledge just in their heads (making it costly to replace the contractor). Deployment scripts can be tested, making them an even more valuable form of documentation. They can be placed under version control and audited, increasing security.
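A minimal sketch of the “deployment process as reviewable artifact” idea – every step name and description here is invented, and a real script would call actual provisioning tools rather than just logging:

```python
# Illustrative sketch of "infrastructure as code": the deployment process
# is captured as data plus a small runner, so it can be version-controlled,
# code-reviewed, and tested. All step names here are invented.

DEPLOY_STEPS = [
    ("provision", "create application server from base image"),
    ("configure", "apply hardened OS baseline"),
    ("deploy",    "install application build artifact"),
    ("verify",    "run smoke tests against the new instance"),
]

def run_deployment(dry_run: bool = True) -> list:
    """Execute (or, in dry-run mode, just describe) each step in order."""
    log = []
    for name, description in DEPLOY_STEPS:
        if dry_run:
            log.append(f"WOULD RUN {name}: {description}")
        else:
            # A real runner would shell out to provisioning tooling here.
            log.append(f"RAN {name}: {description}")
    return log

if __name__ == "__main__":
    for line in run_deployment(dry_run=True):
        print(line)
```

Because the steps are ordinary source code, they can sit in the same repository as the application, be audited line by line, and be exercised in a test environment before anyone runs them against production.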

Continuous integration increases our ability to work with multiple contractors and gives us more confidence in a contractor’s status reports. By continuously integrating code we ensure that code from multiple contractors will interoperate, and we avoid last-minute surprises when a contractor’s “100% finished” work fails to integrate.

A zero-defect mentality where user stories are tested immediately and defects are remediated immediately ensures that code the contractor says is finished really is finished. It avoids passing defective code from one contractor to another; reduces finger-pointing; and makes integrating code simpler. If we are comparing contractor performance it serves as an equalizer – if one contractor finishes 10 stories and leaves 15 defects while another contractor finishes 8 similarly sized stories and leaves only 12 defects, which has performed better? We can’t know. Zero known defects should be our expectation.

The last practice I will mention is the use of good design patterns and architectures that feature loose coupling. Good use of design patterns makes it easier for a new contractor to understand the code they inherit. By encapsulating pieces of the system it can make it easier to have multiple contractors work in parallel and even at different paces.

Together, these practices can make it easier to judge contractor performance, allow us to partition work between a number of contractors, and make it easy to switch contractors over time.

(thanks to Robert Read at 18F for some of these ideas)

The “business value” of government

Agile delivery approaches focus on maximizing business value rather than blindly adhering to pre-determined schedule and scope milestones. On the definition of “business value” the agile literature is appropriately vague, for business value is defined differently in different types of organizations. I would even argue that it is necessarily different in every organization – each company, for example, is trying to build a unique competitive advantage, and results that contribute to that advantage can be valuable (“net” value, of course, would have to consider other factors as well). A publicly held company needs to maximize shareholder value; a closely-held private company values … well, whatever the owners value. A nonprofit values mission accomplishment. What does the government value and how does it measure value?

The answer is not obvious. Mission accomplishment is certainly valued. But different agencies have different missions and for some agencies measuring mission accomplishment is difficult (James Q. Wilson’s book Bureaucracy is great reading on the topic of agency missions). If the Department of Homeland Security values keeping Americans safe, how can it measure how many Americans were not killed because of its actions? In an agile software development project, how can we weigh cost against that sort of negative value to determine which features are important to build?

To make matters more complicated, the government values many things besides mission accomplishment. Controlling costs, obviously. Transparency to the public and to oversight bodies. Implementation of social or economic goals (small business preferences, veterans preferences, etc.). Auditability – evidence that projects are following policies. Fairness to any business that wants to bid on a project. Security, which in the government IT context can extend to keeping the entire country safe. And through appointed political agency leadership, political goals can also be a source of value. Each of these values may add cost and effort to a project.

To maximize business value, we must consider all of these sources of value. If we limit ourselves to the value of particular features of our software, we are missing the point. Rather, as IT organizations in the government, we need to self-organize to deliver the most value possible, given all of these sources of value. The government context determines what is valuable. What we must do is find the leanest, most effective way to deliver this value. This is no different from the commercial sector – only the values are different.

Government as a low-trust environment

The US government is, deliberately and structurally, a low trust environment. Think about why we have a “system of checks and balances.” We have proudly created a government structure that is self-correcting and that incarnates our distrust of each branch of the government. Why is freedom of the press such an important value to us? Because we all want transparency into the government’s actions – not to celebrate its fine management practices, but to know when it is doing something wrong. Within the government, we have Inspectors General to investigate misbehavior, Ombudsmen to make sure we are serving the public, and a Government Accountability Office. To work in the government is to work in an environment where people are watching to make sure you do the right thing. It is a culture of mistrust.

That sounds horrible, and from the standpoint of classic agile software development thinking, it is unworkable. But take a step back – don’t we sort of like this about the government? “Distrust” has unpleasant connotations, but as a systematic way of setting up a government, there is a lot to be said for it. It is another way of saying that the government is accountable to the people. You could almost say – you might want to hold on to your seats here, agile thinkers – that mistrust is actually a value in the government context. So where does that leave us if agile thinking wants us to deliver as much value as possible, but believes that agile approaches require trust?

It might sound academic, but I think solving this dilemma is critical to finding ways to bring agile thinking into the federal government. A typical IT project experiences this structural distrust over and over: in the reams of documentation it is required to produce, in the layers of oversight and reviews it must face, and in the constraints imposed on it.

I will argue that even in a low trust environment, agile approaches are still the best way to deliver IT systems. And that certain tools – borrowed primarily from DevOps – actually help us resolve the dilemma. Waterfall approaches fit well with mistrustful environments by holding out the promise of accountability and control – but they just don’t work. So how can we bring agile, lean, team-based processes into an environment that is structurally mistrustful, and realize our goal of a lean bureaucracy?