When Engineering Success Is Silent

Often when we write or talk about a project, it is because it has a fascinating new feature, uses a new technology, or seems exciting in some other way. And although we would rather not talk about it, when a project causes a major failure, we hear about that too. Sometimes, however, some of the best engineering work is unglamorous, and no one even notices it. In fact, it is even better if no one notices: it means we reduced the risk.

How you reduce risk will vary from project to project, which is what makes it difficult. If we knew how to never write bugs or break anything, we would all do it. It is equally important to understand which risk you are reducing. You can spend an entire year preventing a whole category of failures, but if those failures don't really matter, it may not have been the best use of your time. There are four steps in this process:

Understand your appetite for risk. Can you break some things? Can you afford to run things a little slower?

Determine what types of problems you may run into. Will you bring down a service? Will you cause slow page load times? Will you lose data? Will you expose data?

Find ways to mitigate each of these problems. What can you do to prevent it from happening? If you can't prevent it, is there a way to detect it early or reduce the number of people affected?

For each problem and mitigation, decide whether it is worth it to the business. Is the time needed to mitigate the risk justified by what would happen if the worst occurred? Could mitigating one risk introduce an even greater one?
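To make steps two through four concrete, here is a minimal sketch of a risk register. Everything in it is an assumption made for illustration: the `Risk` class, the example problems and mitigations, and the numbers are invented, and comparing expected loss to mitigation cost is only one simple way to frame the "is it worth it" question.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    problem: str            # step 2: what could go wrong
    mitigation: str         # step 3: how we would prevent or detect it
    worst_case_cost: float  # hypothetical estimate, arbitrary units
    likelihood: float       # hypothetical probability between 0 and 1
    mitigation_cost: float  # hypothetical effort, same units as worst_case_cost

    def expected_loss(self) -> float:
        return self.worst_case_cost * self.likelihood

    def worth_mitigating(self) -> bool:
        # Step 4: mitigate only when the expected loss exceeds the cost.
        return self.expected_loss() > self.mitigation_cost


# Made-up entries; real estimates would come from the team and the business.
risks = [
    Risk("lose data", "write to both systems and compare", 1000.0, 0.05, 20.0),
    Risk("slow page loads", "cache results", 10.0, 0.5, 30.0),
]

worth = [r.problem for r in risks if r.worth_mitigating()]
```

With these invented numbers, only the data-loss risk clears the bar. Step one, your appetite for risk, is what sets how high that bar should be.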

Background

I was recently on a project where we changed the entire framework used to check permissions on files and folders, but no one heard much about it because we also didn't break anything. We pulled it off. This was one piece of a much larger project that I have talked about before.

Like almost every other company of Box's size and age, we are working to break our monolith into microservices. As part of that, the first thing we did was build a new permission service (I touch on that here and here).

It's great to have a new service, and it's also great to build all the new things that use it, but until we get the existing permissions logic out of our monolith, we aren't actually meeting our goal. So my team decided to take on the project of migrating the permission checks for files and folders from the monolith to the new service.

Understanding our risks and understanding our problems

Our permissions framework lets us check whether the currently logged-in user is able to perform a given action on a given object. For example, we can check whether the logged-in user can view a specified file or CREATE_A_COMMENT on that file.
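As a rough illustration of what such a check looks like, here is a minimal sketch. It is not our actual framework: the `can` function, the `Action` enum, the `GRANTS` table, and the user and file IDs are all hypothetical, and only the CREATE_A_COMMENT name comes from the example above.

```python
from enum import Enum


class Action(Enum):
    VIEW = "view"
    CREATE_A_COMMENT = "create_a_comment"


# Hypothetical in-memory grants: (user_id, item_id) -> allowed actions.
GRANTS = {
    ("alice", "file-1"): {Action.VIEW, Action.CREATE_A_COMMENT},
    ("bob", "file-1"): {Action.VIEW},
}


def can(user_id: str, action: Action, item_id: str) -> bool:
    """Return True only if the user has an explicit grant; deny by default."""
    return action in GRANTS.get((user_id, item_id), set())
```

In this sketch, `can("bob", Action.CREATE_A_COMMENT, "file-1")` is false because bob was only granted VIEW, and an unknown user is denied everything.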

At Box, files and folders are the heart of what we do: if you can't use either, you can't do much with our product. Similarly, our big selling point is that we are enterprise-grade, providing fine-grained access control and security.

You can probably guess where I am going with this: if we deny access to a file or folder that a user should have, that is a problem, and if we grant access to a file or folder that a user should not have, that is an even bigger problem. In addition, we check these permissions in a large part of our code, so if the service responsible for checking them were to go down, it would effectively bring down the entire application.

The only good news is that our main application is slow enough that we have a small amount of tolerance for added latency. So, as we reasoned it, we can slow things down a bit, but we can't change the results of permission checks at all, and we can't fail to check permissions at all.

Our mitigation plan

While we would have preferred to migrate everything at once, the reality was that our existing item permissions code is about five thousand lines of poorly understood fun. Additionally, as we have just established, we can't afford to get anything wrong.

Limit our change set

Our first strategy for limiting mistakes was to limit how much we tried to change at any one time. We decided to start the migration with only a small slice of the five thousand lines of code. The old code is currently divided by the reason a permission might be granted or denied. For example, for any one reason I might be able to view a file, all the code that decides it lives together.
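One way to picture migrating a single slice is sketched below. This is only an illustration under stated assumptions: the "owner" and "collaborator" reasons, the function names, and the lookup tables are hypothetical, and the "new service" is a local stand-in rather than a real remote call.

```python
from typing import Callable, Dict

Check = Callable[[str, str], bool]

# Hypothetical data standing in for whatever the real checks consult.
OWNERS = {"file-1": "alice"}
COLLABORATORS = {"file-1": {"bob"}}


def legacy_owner_check(user: str, item: str) -> bool:
    # Old monolith logic for one reason; kept here only for comparison.
    return OWNERS.get(item) == user


def legacy_collaborator_check(user: str, item: str) -> bool:
    return user in COLLABORATORS.get(item, set())


def new_service_owner_check(user: str, item: str) -> bool:
    # Stand-in for a call to the new permission service.
    return OWNERS.get(item) == user


# Only the "owner" slice is migrated; every other reason stays on legacy code.
CHECKS_BY_REASON: Dict[str, Check] = {
    "owner": new_service_owner_check,
    "collaborator": legacy_collaborator_check,
}


def can_view(user: str, item: str) -> bool:
    # Access is granted if any single reason grants it.
    return any(check(user, item) for check in CHECKS_BY_REASON.values())
```

Because each reason is an independent entry, swapping one entry changes one slice of behavior while the rest of the checks keep running exactly as before.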