Description
Tired of firefighting? Observability is the key to shifting to proactive problem prevention in Salesforce DevOps. Discover why observability is essential, which areas of your Salesforce applications you should monitor, and how gaining this visibility can save time, reduce stress, and help you deliver more.
Transcript
I'm Andy Barrick, and we're gonna talk about observability. I work for Gearset as a DevOps Architect, but that's not the purpose of the session. So let's dive into it.
If you've worked outside Salesforce, and potentially in DevOps, you're probably familiar with observability. We heard in the keynote just now how it's a topic at Salesforce, and you may be hearing more about it. But if this is the first time you've come across it, let's take a second to define it so that we're all on the same page. I don't think this will be radically different to what you've heard from Karen and Kevin just now, but fundamentally you can think of it as instrumentation for your application. Right? We'll use information that your application generates, or that Salesforce generates, to get an understanding of its internal state.
Typically, we'd be looking at logs, traces, you know, any data the application generates.
You'd leverage those to highlight issues, performance changes, and so on.
A common parallel is with the dashboard of a car. It displays the state of various critical elements of the car and how they're doing, and warning lights switch on if things require some urgent attention. But, equally, knowing that things are working fine is just as important as knowing that things are not quite right and teetering on the edge of failure.
So you do get a bit of this on the Salesforce platform as standard. It's not necessarily that single pane of glass, though, and the information is largely org level: storage usage and API limit usage are on the company information screen, object limits are on each of the object pages, and so on. It's scattered around, and it's not always in one place. Sometimes you get notifications as an admin that you're approaching some of these limits, those car dashboard warning light moments: you've used ninety percent of your storage, for example. So it's an experience which could be enhanced.
In this session, though, we're gonna dive into the transaction level: limit usage, or the health of transactions, we'll say instead, where there's certainly less in the way of standard functionality.
And as per the title, we're gonna take a little journey from things which help you react once failures have happened, right, when the car's warning light is on or the car's broken down, to get you back going again. And we're gonna move through towards things that can actually help stop those fires starting in the first place.
So, again, this was slightly previewed in the keynote, quite coincidentally. Kev mentioned it, but this is quite a leading question, isn't it? We don't develop in production, right? We develop in development environments, and changes go through various stages of QA sandboxes, potentially, so that we can test in increasingly realistic versions of production, and we know that when it gets to production, it's all good. So if all that's true, will there be bugs in production? Well, of course there are. I think everybody on this call has probably encountered a production bug. So what happens with those?
Well, in terms of managing them, first we've got this venerable old piece of metadata: the validation rule. It's been there forever, but it's actually incredibly effective, because it stops the error from even happening in the first place. You set conditions as to what valid data entry should be, and if the input doesn't comply, then no processing happens. Salesforce stops anything happening, and the user gets a, hopefully, very informative message as to why that was and what they can do to make it work next time. So that's really cool. However, if we get into the transaction, if we actually start submitting stuff and it's not quite valid, or some sort of problem happens halfway through the processing, there are a couple of tools that you can use.
We won't get too deep into the technicalities, but you've got the concept of try-catch blocks in Apex, or fault paths in flows, that can spot various types of exceptions in the case of Apex, or just any sort of unexpected scenario in a flow, and you can define logic that allows your application to react to that. There are really two streams out of that. There's, again, that idea of getting a valid, usable message back to the user so they know how to avoid it or submit it correctly. And there's the idea of potentially logging some information that allows somebody, a support engineer or developer, to replicate it and understand exactly what's going on and why it happened in the first place.
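To make that concrete, here's a minimal sketch of what that pattern might look like in Apex. The class, method, and messages are hypothetical, and the "friendly message plus logged detail" split is just one way to structure it:

    public with sharing class OpportunityDiscountController {
        @AuraEnabled
        public static void applyDiscount(Id opportunityId, Decimal discountPercent) {
            try {
                Opportunity opp = [SELECT Id, Amount FROM Opportunity WHERE Id = :opportunityId];
                opp.Amount = opp.Amount * (1 - (discountPercent / 100));
                update opp;
            } catch (DmlException e) {
                // Stream 1: keep the technical detail somewhere a developer or support engineer can find it
                System.debug(LoggingLevel.ERROR, 'Discount update failed: ' + e.getMessage());
                // Stream 2: give the user a message they can actually act on
                throw new AuraHandledException('The discount could not be applied. Please check the opportunity and try again.');
            }
        }
    }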
There's then the category of governor limit exceptions, which you've probably encountered. We can't do anything about these. They're just gonna blow up and stop the transaction. We can't handle them, we can't process them, we can't react. They are just going to stop stuff happening.
Now, what happens in each of these cases? Well, if we're thinking about the unhandled exceptions, or the exceptions in Apex that you haven't defined a handling process for, which is a slight nuance, you will typically find that a user has to report it back to you. Somebody gets in touch via some internal mechanism, a ticket or an email, whatever it might be, saying, hey, this went wrong, here's a screenshot possibly. Or maybe an overnight scheduled Apex job has failed, or a scheduled flow fails, and there's an email somewhere with a message saying, hey, this didn't work. Salesforce potentially generates that email if you haven't handled it in a fault path or in your catch block, and we'll dig into those emails a little more shortly. But, ultimately, at that point you need to triage and potentially fix the issue. And because we're talking about production here, some process that's critical to the business, you assume, otherwise it wouldn't exist, has not worked, and things are in an unusual state.
So with a bug, then, we can think of three stages in its life cycle. Visibility: knowing that it happened in the first place, right, the user sends me an email or you get an email from Salesforce. Then, once you're aware the thing has happened, you need to validate it. Is it user error? Is it a genuine issue? What was the root cause, essentially? And then, if all that points to, say, a bug or user error, there's a resolution: you either tell the user what went wrong, or you implement some sort of fix. Now, validation and resolution are essentially the same process regardless of where the issue occurs, whether it's one of these governor limits or an actual logic error.
The value we're gonna get from observability, to bring it back to that, is really, to start with, in this visibility stage. We agreed at the start, right, that production bugs can happen. This is not some obscure thing that never happens to anyone. So, really, we need to know about them as soon as possible. If this is real customers trying to do real stuff, and they can't do it, that's a problem. So the first stage of observability is gonna enhance that experience and shorten the time from the issue occurring to it being validated and getting into that middle block where we're actually doing something about it. We can summarize that, I think, quite nicely by suggesting that the first phase of observability is improving reactivity, and this is the firefighting section we spoke about at the start. So let's have a look at a practical example of what we mean by this. Think about flows, and specifically about the emails that you might get when a flow fails in some manner.
Salesforce will generate an email, and it will send it to whoever it thinks is best suited. That's either a defined list of users or, if that list is not set, whoever last deployed the flow. Now, in a world of potentially service-style deployment accounts without a monitored email inbox, sending an email to that account probably isn't particularly helpful. Even if you define the person for it to be sent to, is that person available to do anything about it? They might be on holiday for two weeks or out of the office for some reason. Or they might have an email inbox that's already full of ten thousand of these. So is that really the best mechanism for this? What we're doing here at Gearset with flow observability is providing a mechanism, a solution, to increase that visibility, that first stage of the life cycle, and shorten the time from the issue occurring to somebody knowing about it. We'll see a screenshot of it next. Safe harbor and all of that, but the general availability schedule is for next week.
When you see it, hopefully you'll agree that it provides a much more usable and functional process for surfacing these scenarios and getting the visibility out, regardless of the mechanism for how it's then validated and resolved for now, although later slices may well cover more of that. And, ultimately, that time from occurrence to action is reduced, because all that time the clock's ticking with some broken business process that a user, a customer, is actually trying to get something done with. So let's have a look at that. If you're familiar with those flow exception emails, this hopefully has obvious benefits over a clogged inbox. We've got the types of exception and the dates, and we're capturing information around how many times you've seen it, trends, etcetera. You can dig in on the right-hand side of the view to actually go and have a look at this stuff. But the principle here is all about increasing visibility of these exceptions.
Email is not a particularly pleasant mechanism for handling this, but Salesforce can't really do huge amounts more by default, I think. That's a fair expectation. But we can improve the experience, improve the visibility, that observability: think about the single pane of glass, that dashboard that tells you whether everything is working well. In the spirit of how we defined observability, this is far easier for observing the state of all your flows and how they're getting on than a collection of emails that are not tied together.
So that's flows. What about Apex? Well, again, if we're thinking about unhandled exceptions within Apex: the flow emails have a lot of information, and we're able to parse those in Gearset to extract a lot of it. If you're familiar with this, there's actually a record in Salesforce called the flow interview, where a lot of state gets saved, so that information is recorded nicely for us. With Apex transactions, that doesn't really exist. When you get the email, there's a stack trace, which gives you the details of the last lines of code executed before the exception occurred, and you can sort of build up a picture of where the code might have gone and how it might have got there, so this value must have been between that line and that line, that sort of thing. But it doesn't give you huge insight. There's only so many of those last lines, and you don't get any state of variables, etcetera. So the picture isn't quite as clear with Apex.
When you're developing, if you hit one of these scenarios, you'd switch on the Salesforce debug logs. These contain far more detail: they track variable changes and all that sort of stuff. But, ultimately, they're reactive. You encounter an error in development, turn the logs on, retry the action, and then see what the logs show. There are limits around the logs that you'll probably be familiar with: you can only have so much space taken up by logs, each one can only be twenty megabytes, I think, and they only last a maximum of twenty four hours. So there are a lot of restrictions, and it's not feasible to have these running all the time in production. Equally, you can't really safely ask people to say, oh, here's an exception, I'll revert the data in production and try it again. You know, revert it, let me switch the logs on, and try it again. In development environments, there's obviously far less risk around doing that sort of thing, so it's much more workable there. And sometimes the exception comes from something like pressing a button, so there's not necessarily any data to revert. But if there's that delay in visibility, the user gets an exception, has to send an email to you or whatever the mechanism is, and then you switch the logs on and go back to them and say, right, let's try it, they might not be available to do that. Or, even worse, they try it and the error doesn't happen, so you've lost the context of what was going on at the time, and presumably the bug still exists in the system.
So to get around that, a lot of people have taken the approach that proactively logging the state of the application is ultimately a better option than trying to replicate the issue. You store more data, but the complications around replication can be avoided. And you can do this, of course, for handled exceptions too. So we might say here that logging improves reactivity by being proactive, which may sound like a bit of a tautology, but, ultimately, we're logging information in advance so that it's there if we need it to react to a particular situation, given those restrictions we spoke about around Salesforce's standard logging. There's obviously a trade-off here in the amount of data you're storing, but that trade-off is hopefully offset by the ease and the reduction in time to actually get issues analyzed, validated, and resolved.
So what do we log? Well, it's essentially your decision. As it says here, I don't like reading slides out, but I think that's a very important point. Ultimately, though, the Salesforce debug logs are a good guide to this stuff. There's lots of other information in them, around what the platform is actually doing at any time, which you often try to filter out when you're looking through them. But if you think of the process as some sort of data input, a transformation, and ultimately an output at the end, then it seems reasonable to say you should log the data that's going in, monitor it as it's being transformed, and then capture the result as well, so you can build up that picture of the process it went through, the code that was called, the flows that were executed.
As we said, there's a trade-off with platform limits. We spoke about data storage, obviously. We won't get too much into platform events, but they're a very handy mechanism for logging this stuff without it all getting rolled back when an unhandled exception occurs. They're limited as well, though: platform event consumption has its own limits. But, again, we're talking about using some of this capacity to cut down the expensive bug triage and replication process in production. And you can obviously purge your log storage quite regularly to keep it down because, let's face it, hopefully ninety nine point something percent of the time you're not gonna need any of this data, because everything worked perfectly.
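As a rough sketch of that idea, assuming a hypothetical custom platform event called Log_Event__e, configured to publish immediately rather than after commit, with a single Message__c text field, publishing from a catch block might look like this:

    public with sharing class LoggedOpportunityUpdater {
        public static void updateWithLogging(List<Opportunity> opportunities) {
            try {
                update opportunities;
            } catch (Exception e) {
                // An immediately-published platform event is not rolled back with the failed
                // transaction, so this log entry survives even though the update did not.
                EventBus.publish(new Log_Event__e(Message__c = e.getMessage()));
                throw e; // rethrow so the caller still sees the failure
            }
        }
    }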
If you're wondering about the effort of implementing something like this, there are frameworks like Nebula Logger, to choose one very popular one, that offer all of this. You don't need to go and reinvent the wheel and create it all yourself.
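A minimal sketch of what using it can look like, based on Nebula Logger's documented Logger.info, Logger.error, and Logger.saveLog entry points; the class, the discount logic, and the messages here are hypothetical:

    public with sharing class DiscountService {
        public static void applyDiscount(List<Opportunity> opportunities, Decimal discountPercent) {
            // Log the input state up front, so it's already there if something breaks later
            Logger.info('Applying ' + discountPercent + '% discount to ' + opportunities.size() + ' opportunities');
            try {
                for (Opportunity opp : opportunities) {
                    opp.Amount = opp.Amount * (1 - (discountPercent / 100));
                }
                update opportunities;
                Logger.info('Discount applied successfully');
            } catch (Exception e) {
                Logger.error('Discount failed: ' + e.getMessage());
                throw e;
            } finally {
                // Nebula Logger buffers entries and persists them when saveLog() is called
                Logger.saveLog();
            }
        }
    }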
I think what's important, to speak to a point that Karen mentioned in the keynote just there, is that ultimately this is all gonna end up as data in Salesforce. And the really great thing about that is that you can report on it. You can create dashboards, you can process it, you can put triggers and flows on it yourself to slice it up however you wish. But, ultimately, this is data that your application generates that tells you about its own internal health.
Here's a picture of Nebula Logger. There's a huge amount of information that it captures and stores for you, which is all really useful, but I wanted to focus on this tab in particular for the purposes of observability.
If you can see, hopefully you can, there are a number of items here. This is an implementation I did with a very simple process, as you can probably tell from the small amount of the limits I've used. But it's captured for me a snapshot, within the transaction, of the amount of CPU time, DML rows, DML statements, heap size, SOQL queries, SOQL rows retrieved, and so on that had been used at that point of the transaction. And I'm completely in control of when these snapshots are taken. When we think about constructing applications as sequences of components, as we'll touch on very soon, taking a snapshot at the start and the end of a particular activity and then calculating the delta between them, so you know how much that activity consumed, can become really useful information when it comes to working out application health, scalability, and stability.
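Here's a rough sketch of that snapshot-and-delta idea using the standard Limits class. The LimitSnapshot name is made up, and how you persist the delta is up to you; this version just writes a debug line:

    public with sharing class LimitSnapshot {
        public Integer cpuTime;
        public Integer soqlQueries;
        public Integer dmlStatements;
        public Integer dmlRows;

        public static LimitSnapshot capture() {
            LimitSnapshot snap = new LimitSnapshot();
            snap.cpuTime = Limits.getCpuTime();
            snap.soqlQueries = Limits.getQueries();
            snap.dmlStatements = Limits.getDmlStatements();
            snap.dmlRows = Limits.getDmlRows();
            return snap;
        }

        // The delta between two snapshots is what the component in between actually consumed
        public static void logDelta(String componentName, LimitSnapshot startSnap, LimitSnapshot endSnap) {
            System.debug(componentName + ' consumed ' +
                (endSnap.cpuTime - startSnap.cpuTime) + ' ms CPU, ' +
                (endSnap.soqlQueries - startSnap.soqlQueries) + ' SOQL queries, ' +
                (endSnap.dmlStatements - startSnap.dmlStatements) + ' DML statements and ' +
                (endSnap.dmlRows - startSnap.dmlRows) + ' DML rows');
        }
    }

You'd call LimitSnapshot.capture() immediately before and after the component and pass both snapshots to logDelta, or, better, persist the delta as a log entry or record so it can be averaged and reported on later.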
So let's dig into that a little bit more for the last section. We've moved now into the proactive part. We've looked so far at enabling better reactions to production errors, whether that be the flow observability that we saw, reading the emails, or logging so that we can react better. That last one sits sort of in the middle ground between the two: we're doing something a little bit proactive, logging, so that we can react better. Production errors will happen, as we discussed at the start, so we want to reduce the time taken in each of those three stages of visibility, validation, and resolution. But as with good DevOps, as we said at the start, we can shift some of this left, right? And if we understand the business processes that operate within the org, then we can use the principles around composition and observability, and the data collected through that logging, as we've just seen, to identify some potential unhandleable exceptions before they even occur. So we don't even get the email. We're actually spotting them and stopping them beforehand.
And what's underlying this is the idea of composition. The point here is that an application is really a collection of processes, each of which is a series of components arranged in a specific order. Now, I'm no electrical engineer, but I think these cylindrical ones are capacitors. I haven't chosen those at random: a capacitor has a defined job. It has some sort of input into it and some sort of output. You can measure what it does. It's got an operating range for what it takes in and a range for what it'll put out. But on its own, as an independent thing, it doesn't really do much, right? It relies on an input in order to give an output. And each of these circuits, if you think of them as modular things, exists and works because it's a sequence of these individual components put together to take some overall input and generate some specific output. But each of the components is capable of doing that as well. Bear that thought in mind.
However, if we're gonna relate this to code and Salesforce applications, we don't actually want multiple instances. We don't want fifteen capacitors, or whatever that might map to, because that's gonna introduce the risk of divergence and make things difficult to maintain. If we think about some logic, let's say, to discount an opportunity, we might want that to be invoked when the opportunity reaches a certain amount, or potentially when a particular opportunity line item is added, or maybe even when its parent account gets to a particular tier and all its open opportunities get discounted. That shouldn't be a different bit of code or flow or functionality each time. It should be the same one because, ultimately, the process is exactly the same. Therefore, if we think about this a bit more, the functionality becomes disconnected from the business process, just as a capacitor isn't really a thing on its own. It doesn't have any real function outside of a wider circuit.
Take the logic to discount an opportunity. Let's say for the purposes of this that it's a method in Apex, but it could be a flow. It could even be an agent at some point in the future, maybe not necessarily that particular piece of functionality, but an agent can equally be a component. It will take some input, and it will generate some output. We can test that, and we can put all sorts of conditions around it. And when these things are isolatable and identifiable like that, we can measure them.
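As a sketch of that idea, with hypothetical names, the discount logic lives in one place, and every entry point calls the same component:

    public with sharing class OpportunityDiscountComponent {
        // One reusable component: defined input, defined output, measurable in isolation
        public static List<Opportunity> applyDiscount(List<Opportunity> opportunities, Decimal discountPercent) {
            for (Opportunity opp : opportunities) {
                if (opp.Amount != null) {
                    opp.Amount = opp.Amount * (1 - (discountPercent / 100));
                }
            }
            return opportunities;
        }
    }

The amount-threshold trigger, the opportunity line item trigger, and the account tier flow (via a thin invocable wrapper, say) would all call OpportunityDiscountComponent.applyDiscount rather than re-implementing the logic.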
So let's have a think about that.
Being able to measure these components is the foundation of what we're looking at now. Processes are arrangements of these components, and we can take those snapshots at the start and the end of each component's usage and, as we'll see in a second, widen that out to get a full picture of what each component does and its impact on an overall process. Because your application is built up of defined processes like that, you can measure these individually and still have a whole process-level view of the impact, of what's needed to actually do that operation. Now, there are gonna be a lot of these in your application. If you think about when an opportunity is closed, for example, there's probably all sorts of conditional logic that may be invoked depending on the state of the data. The number of permutations might grow to a point that's very difficult to maintain at the process level, because any potential state of an opportunity might cause a different path to be taken. Testing those can become quite complex, and we may not be able to cover all of them in our testing processes. Some of them may only emerge in real life, and that's how we end up with production defects.
But if you understand what each of the components is doing, then we can get a much better handle on what's gonna take place. So let's try and explain it using a picture. Let's say that within the opportunity before update and after update, and the before update on the account trigger, we've got some logic. Something happens, right? If we take snapshots, as we saw in Nebula Logger, and calculate the usage of all those limits as each operation starts and finishes, then we have something of an idea of how expensive that particular operation is. It's not necessarily deterministic. Well, it isn't deterministic on Salesforce: we might run it a thousand times and get different values. But as that's naturally invoked over the course of your application being used in production, or even before, in test environments, you can gather this data and average it out, and get a much more consistent value for what you might expect this operation to take. You can then roll that up, of course. We can combine those and get a picture of the opportunity trigger consumption as a whole, if we want to capture that, or the account trigger consumption, and then the overall consumption of the saving of an opportunity. So we can build these up and understand exactly how they're composed and what each of the components needs to do its job.
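For instance, if each snapshot delta were persisted to a hypothetical Component_Usage__c custom object with the component name and its CPU delta, averaging and rolling it up is just reporting on data, something like this, run in anonymous Apex, say:

    // Average CPU consumed per component, aggregated across every real invocation
    List<AggregateResult> usage = [
        SELECT Component_Name__c, AVG(CPU_Time_Delta__c) avgCpu, COUNT(Id) runs
        FROM Component_Usage__c
        GROUP BY Component_Name__c
    ];
    for (AggregateResult row : usage) {
        System.debug(String.valueOf(row.get('Component_Name__c')) + ': ' +
            String.valueOf(row.get('avgCpu')) + ' ms average CPU over ' +
            String.valueOf(row.get('runs')) + ' runs');
    }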
So if we go back to that opportunity discount idea, let's say there's some new requirement within the flow that runs when an account tier is updated and triggers the opportunity discount process, and it adds some logic for that particular case. What you can do in this world is assess the impact of the change to that bit of functionality on the other two use cases. You aren't gonna get to a point where you develop the thing specifically in the context of the account flow, test the account flow, yes, this is all fine, get to production, and then, oh no, when the discount is applied from an opportunity line item, we've gone over the CPU limit or something like that. You've actually got data you can extrapolate from. Even if you're working in a test environment, or you want to try it in a full copy before production, you could see that the change is gonna increase that particular component's CPU usage by twenty percent. And we already know the dashboard warning light is coming on for the opportunity line item flow. With that sort of insight, you can make sure that, in advance, you're fully aware of the impact of changing that bit of logic, because you know where it's used. Intrinsically, it isn't expensive, potentially, but it's being used somewhere where the end-to-end process is expensive, and so you don't really have a lot of scope to increase it.
So, ultimately, you don't want transactions which are quietly using ninety five, ninety eight, ninety nine percent of some limit or other, because they'll work fine right up until the day they don't. Without observability, you will not have that warning light flashing saying, hey, this process uses ninety nine percent of the CPU limit. You won't get that visibility, because you won't have logs on in production, for the reasons we said. It's like driving a car without any warning lights, and one day it just fails. That's not what you want.
This concept enables you to do really effective volume tests as well. In your full copy environment, or if you wanna do formal scale testing, you've actually got really good information on the processes to be able to say, that's the impact of these changes. Or you can extrapolate that data from smaller datasets to your production volumes and see how it will hold up.
So we're within the last minute, so I just wanted to wrap up with these three points about observability. Fundamentally, it's things your application tells you, right? As we mentioned at the start, the data within your application is giving you insight into how it's working. It's that instrumentation dashboard for your application's health. We wanna shortcut the time to resolution: that clock's ticking when there's a production issue, and the quicker you can get it fixed, the better. That is based on having really early visibility of it happening in the first place. And finally, the really mature, exciting end of the scale is being able to generate data out of your application and understand what that means for your business processes.
Okay, well, thanks very much for your time. It's been a bit of a whistle-stop tour, but it's quite a new field within Salesforce for sure, and one that, as we heard Karen say, they're taking a lot of interest in. So this is a really developing area. Thanks very much for your attention, and enjoy the rest of the summit.