In love with OTel and observability panel recap

Watch this OpenTelemetry expert panel to learn about the most loved (and hated) parts of everyone’s favorite observability framework, and to play “Fling, Marry, Kill” with logs, metrics, and traces.

Recently, a fun group of OTel experts and enthusiasts gathered to share their love for OpenTelemetry and observability.

Since the event happened during the week of Valentine’s Day, we also decked ourselves out in red and pink clothing and ran some love-themed polls. Because let’s be honest, when love is involved, subtlety goes right out the window. Here are some of the findings from our polls:

  • The audience’s favorite way to say “I love you” to the terminal was: echo “I love you”.
  • When asked to play “Fling, Marry, Kill” for logs, metrics, and traces, the audience universally wanted to kill logs. Our panelists, however, were split: two agreed with the audience, two opted to kill metrics instead, and one refused to kill any signal at all.
  • The audience’s favorite secret crush (i.e., a software skill they’d like to learn more about) was machine learning and AI. However, none of our panelists were tempted by those sweet LLMs.

Also, fair warning that at one point polyamory was mentioned in reference to observability, and how we should be going beyond o11y, to po11y, when building the future of observability.

Wherever you are on your OpenTelemetry or observability journey, there’s definitely something for you in this wide-ranging discussion. Here’s just a small sample of topics we covered:

  • How getting started with OpenTelemetry can be a challenge, including some pros and cons to relying on auto-instrumentation as a first step.
  • Why the community is so important when it comes to OpenTelemetry, and why you should share feedback with the SIGs and contribute back wherever possible.
  • What our Observability Prince Charming looks like, including greater support for OTel-native instrumentation, easier ways to connect OTel components together, and diverse options at the SDK and API levels without compromising on interoperability.
  • Where AI can provide the most value when it comes to observability.
  • What some good approaches are for migrating legacy apps to OpenTelemetry.

Check out the full panel below and scroll past the video to see a few of the best quotes from our discussion as well as to access the full transcript. See you at the next one!

Watch the full video here

Key quotes from the panel

Dan Gomez Blanco

  • “If I were to wake up tomorrow and all my problems were solved, I think what I’d like to see is a bit, well, a lot more of OTel-native instrumentation. So what that means is that, you know, we talked about recently in multiple sort of online threads of what’s better – compile time instrumentation or eBPF or, you know, any type of agent. I think what I would love to see is…less and less of that type of instrumentation that is added later by somebody else and more of instrumentation that’s added by library owners themselves.”
  • [Answering Fling, Marry, Kill on logs, traces, and metrics] “But the problem with metrics is that I’m not sure if that’s the same problem with flings… that they’re used in the wrong way. So like people sometimes relying on metrics for what they shouldn’t be relying on, which is, you know, debugging like specific transaction-level signals, right? What you should be using traces for. And I’m not sure I would kill logs either because if you think about logs as like log records or like to describe events that are discrete events, then you still need them as well. So I’m not sure I would kill anybody.”

Marylia Gutierrez

  • [Answering Fling, Marry, Kill on logs, traces, and metrics] “I would also like to marry traces because I feel like that is the thing that has the most longevity. You have the best relationship there. You see the full picture. But I see metrics as a fling just because it’s the first thing usually you see. They get your attention and you get excited about it because it’s just like… I think it’s the easier one to get into. Usually people wouldn’t say like, monitoring, observability. They usually go to a metrics number kind of thing.”
  • [On having a love-hate relationship with OpenTelemetry] “You are able to do so much with OpenTelemetry, and that is something that is great. And at the same time, that is the hate thing like… ‘Should I use OpenTelemetry for everything that I’m touching?’ Now, I don’t know, I’m monitoring the water on my plants. I’m gonna use OTel for that, too? Like everything in my life now is OTel.”

Hazel Weakly

  • “One of my gripes with OTel is that OTel feels a lot of the time like it was lovingly designed by a bunch of people who think that running a Plex server on a Linux desktop is user-friendly for printing out files.”
  • “And that is actually one of the biggest things that I saw as a disappointing factor of the observability 2.0 versus 1.0 sort of messaging, which is that observability 1.0 was almost tried to be phrased as the stepping stone to 2.0. They’re both useful, but the thing that’s actually massively different is observability 1.0 tends to be zero configuration drop-in. […] And then observability 2.0 is like, ‘We’re so much better, we’re so much superior. All we need to do is read this 80 page manual, do these 15 different steps, download these 15 packages, and then go on the IRC. Not the other one, this IRC server.’ […] You’ve already lost me.”
  • “And then one of the other things that really gets me about AI is a lot of people think that AI is a creator, like a tool of creation. It’s not a creation tool, it’s an amplification tool. It amplifies what you have. It works really well with what you have. And one thing that it does well is it takes a semantic representation of a concept from one technical domain, transfers it into another technical domain, and keeps the semantics. Anything else is a crapshoot.”

Adriana Villela

  • “Another thing that I did have in mind as well in terms of my OTel Prince Charming would be a world where we don’t have people throwing shade at OTel. A world where, instead of throwing shade at OTel, let’s all come together and continue to make OTel even more awesome. […] Let’s work together and improve it. And that means making sure that vendors support OTel-native ingest because, especially if we want to fulfill on the promise of vendor neutrality, let’s ensure that the real differentiator isn’t going through a song and dance every time you’re trying to get data into a vendor, but really making sure that that vendor has something unique to do with your data, right?”
  • “Auto-instrumentation is lovely, but you can get so lost in the auto-instrumentation. It’s like, “What’s relevant? What’s not? Where do I start digging?” And I think that can be extremely challenging. […] And I think this is where vendors can really differentiate themselves, right – is, “Hey, I noticed that there’s an issue here. Perhaps you would like to turn your attention in this direction.” Perhaps this is where AI can assist us.”

Hanson Ho

  • “I think it’s important to actually have an understanding of what we’re dealing with. These are systems. People are systems. People use systems, and to communicate each structure, you need a language. OpenTelemetry offers letters. They don’t offer words. We put the words together in semantic conventions and giving meaning, create these words. And what I love about the future of o11y is actually po11y.”
  • [Answering Fling, Marry, Kill on logs, traces, and metrics] “So my take is a little bit outside the box, especially for mobile. Metrics capital M on mobile, not super useful. Metrics lowercase m derived from logs and spans, well, that’s where the power comes from. The pre-aggregation on the client side reduces the usefulness of the context that you eventually want to have to slice and dice. It makes sense when back pressure is a problem on servers and you have lots and lots and lots and lots of samples. So you got to make sure you reduce the noise. On mobile, the interesting things don’t happen that often. And using metrics to kind of pre-aggregate, I think unnecessarily erases a lot of the context. I would kill capital metrics, but I would get lower case metrics from everything else.”

Mic drop moment of the panel

Hazel Weakly: “Everybody’s just looking at all the potential of OpenTelemetry, all the potential of observability, and just going, yes, but this is only useful in this one tiny little box. And the thing that I want, my concern, is for people to stop caring about the tiny fucking box.

“Like break outside of the box. Be the spoon. Just, you need to understand that it’s about the humans, the human connections, getting people understanding something and working together. OpenTelemetry is just one of a giant list of methods of figuring out how to understand the system.

“And all these methods, all of them, compound on each other and are massively useful the more people are able to reference them, utilize them, and get something from them, and combine them. If you take all of your OpenTelemetry, all of your observability stuff, your monitoring, your learning, and then you hide it and squirrel it away so that only the SREs can use it… Then, you know, this one tiny little part of the organization sees the data.

“You’re paying what? 10-20% of your entire infrastructure budget for something that 5% of the organization is going to use, 10% of that 5% are going to be actually deeply competent in, and 10% of that 10% are going to be an expert in. That’s the worst possible way to spend 20% of your budget.”

Favorite exchange

Hanson: “None of our software is built by one person. It’s built by a bunch of people put together. And in order for the software to work well, for the observability to work well, it has to be po11y.”

Hazel: “I can confirm, as a polyamorous, queer, hedonist. It is the superior lifestyle.”

Favorite answer to Fling, Marry, Kill

Adriana: “Traces are like my ride or die. They are the backbone for me. Metrics would be my fling. And I think there’s a lot of people out there who, and I think a lot of people started with, I will say quote unquote, observability on metrics and put a lot of weight on metrics. And I feel like we kind of need to shift them towards traces and all the goodies that can come out of that. And also like you can derive metrics, certain metrics from traces. I would kill logs, but I will say this. I’m not saying that logs aren’t important, but in my perfect, you know, world of observability, what I would love to see is, you know, because OpenTelemetry supports span events, which are logs embedded in your traces, then you can have your cake and eat it, too.”

Favorite “love at first sight” moments with OpenTelemetry

Dan: “At Skyscanner, we used to be an OpenTracing shop. […] We had a moment that we said, ‘Okay, well, you know, it’s time to migrate to OTel.’ And the fact that we were able to do that, and I think we were like doing dozens of services that were just bumping a version of a library, an internal library that would basically configure the SDK, right? So what we did is, I bumped the version up and then in a matter of like 15 days, I think we got like more than 300 services onboarded onto OTel.

“So I think there is a talk that I did about this, it was like, seeing the graph of adoption of OTel going like dozens of services a day because they just needed to merge a dependency bump. That was when I said, ‘Okay, the design principles of OTel actually are paying off here, which is having that API decoupled from the SDK.’ So that was my, I guess, love at first sight with OTel.”

Hanson: “I was able to answer questions about how performance affects mobile apps and KPIs. Observability and context allowed us to basically quantify what matters. And when things get bad, we know why it got bad.

“And when things improved, we know what the long-term effects of it is. If you don’t measure it, you don’t actually know what’s going on. And observability and instrumentation in general allows us to answer questions about things that we may not have originally formulated. So that’s when I saw it and when I saw what it could do. It was amazing.”

Resources for learning about OpenTelemetry

Here are the learning resources our panelists mentioned:

Full transcript

Colin (00:00) All right, so we are here. Hello everyone, and welcome to today’s event, “In love with OTel and observability.” I’m Colin Contreary, I’m the head of content at Embrace, and I will be today’s moderator. We’ve got a wonderful panel of OTel and observability experts here. And they’re all deeply, deeply in love with the same thing, which normally would be a problem. There’d be competition, I don’t know, things would happen, but it’s okay because what they’re in love with is just making sense and understanding our software systems.

That’s a good thing for all of us to be in love with. So this discussion you’re here for today is all about sharing some love for OTel and observability. We’ll cover what we wish that the future of observability is, what parts we love and maybe have love-hate relationships with, and what made us fall in love with observability in the first place. We’ll also have some fun poll questions throughout, which, actually, let me start one of those right now.

So you can answer that question as I continue my intro. And we’d love to answer any questions you have from the audience as well. So ask your questions in the Q&A section. We will answer them either during the panel if we have a free minute or at the dedicated Q&A section at the end. And so with all of that out of the way, and as your vision adjusts to the sea of reds and pinks you see on your screen, I think it’s time to begin.

So as you check out the poll question, let’s go around panelists and introduce each other and maybe share a fun tidbit. Maybe your answer to the poll question or something you love besides OTel and observability, because I’m sure that there are many things there. So why don’t we start with Dan.

Dan (01:48) Hello, I’m Dan. I’m a principal engineer at Skyscanner where I lead observability and I also help teams adopt best practices in operational readiness and so on. And I’m also part of the OpenTelemetry Governance Committee since November 2023. And I also work with Adriana in the end user group, in the End User SIG. And something that I love almost as much as observability is drumming. Drumming, I mean, if you’ve seen, like, any photos of me drumming around, you probably wouldn’t be surprised. So yeah, so that’s my other passion, drumming.

Colin (02:30) Nice, awesome. All right, drumming. Why don’t we go to Adriana? Why don’t you go next?

Adriana (02:34) All right, hey, my name is Adriana Villela. As Dan said, we work together in the OTel End User SIG. We are co-maintainers along with Reese Lee. I am a principal developer advocate at Dynatrace. I do love all things OTel, but my other passion is rock climbing. Whenever I visit a different city, I will always find the local bouldering gym and go there.

Hazel (03:05) And very early in the morning too, I might add.

Adriana (03:08) Yes, that’s right. I go early in the morning, too, ‘cause that’s really the only good time to go with all conference madness. So yeah.

Colin (03:16) Interesting. That’s a very cool tidbit. Awesome. Hazel, would you like to go next?

Hazel (03:22) Sure. My name is Hazel Weakly. I have thoughts, lots of thoughts. They never stop thinking. I currently am a fellow at the Nivenly Foundation and I do programming stuff for money elsewhere. And so one of my favorite things is OpenTelemetry, of course, or OTel, but it’s really not that… it’s helping people understand their systems. It’s helping everybody tie things together. It’s that magic that lights up in someone’s eyes when the entire company can get that alignment and really understand what they need to do next to be impactful. In terms of other things I like and other things I love, my favorite hobby is swing dancing. I love swing dancing. I actually go to a whole bunch of swing dancing conventions, have a lot of fun doing that, and I lead and follow as a swing dancer.

Colin (04:12) Wow, we are three for three on very like, fun, active activities. I remember I did swing dancing once in college. It did not go well. That is awesome, Hazel. Marylia, would you like to go next?

Marylia (04:24) Yeah, sure. So my name is Marylia. I am a staff software engineer at Grafana, and I work on several different groups in OpenTelemetry. So I am a maintainer for the Contributor Experience group. I’m also an approver for both the JavaScript SDK and the Database Semantic Conventions and also an approver on the Portuguese localization for documentation.

So touching a little bit of everything, which is something I also love, getting this spread of things that I can do. So besides OTel, I was thinking like things that I could do, like things that I like, for example, I watch a lot of TV shows. I’m on the lazy side, watching TV shows or like playing video games. But since I saw this background chocolate, I cannot stop thinking that I want to eat chocolates. So I think I’m just gonna change and just say chocolate because that is all in my mind now.

Hazel (05:20) I actually have this box of Belgian chocolates that I got as a speaker gift. Last week I was at three different conferences and so I traveled to Europe, but I have this giant box of Belgian chocolates and I’m trying desperately to not eat all of them at once. They’re very good.

Marylia (05:24) I went to, yeah, we had a team offsite in Amsterdam and I spent like the weekend in Belgium and I bought a bunch of chocolates and was like, I’m also gonna buy some chocolates for, like, friends and family. That did not last. By the end of the trip, I ate all of them and I did not bring anything back for other people. So sorry.

Hazel (06:01) I might have promised my son some candy. It’s not happening. But don’t worry, I didn’t tell anybody else. So we’ll just keep it between us and the entire internet.

Adriana (06:14) As long as he doesn’t remember, then you’re good.

Hazel (06:17) Well, he’s three and a half. And so the odds are good, but he’s also my kid. So I could be wrong.

Adriana (06:22) So there you go. Could go either way.

Colin (06:26) Nice. That’s awesome. Hobby, chocolate. Hanson, why don’t you round us out?

Hanson (06:33) My name is Hanson Ho, I work at Embrace. My focus is mobile observability. I’m a little bit of a sicko for that kind of stuff, because it’s fun. What I love is overscheduling myself, but I’m not going to talk about that. And so I’ll talk about my love of my football club, Watford FC. Yeah, sometimes love has a lot of interesting twists and turns. And it’s not all smooth, but at the end of the day, when you love something, you love something. So what can you do?

Colin (07:03) Nice. Awesome.

Dan (07:05) I have to say that, you know, when I saw that the theme was like, you know, wearing red, I wasn’t expecting Hanson to wear anything else, but a Watford F.C. top.

Colin (07:16) That’s true. We made it difficult if your main love was sports and none of your team’s colors were red. So apologies there, but it looks like we lucked out. Let me end the poll really quick. Let’s just see which one won.

Looks like “echo ‘I love you’” won. So very interesting. I’m curious. Is that PHP? Is that just bash? I should have been clear about what language I meant, but “echo ‘I love you’” is the preferred way to say “I love you” to the terminal. Interesting. Thank you all so much for participating.

And now let’s get into the good stuff. So let’s start with our first big topic today. And what we’re talking about is what is everyone’s Observability Prince Charming? So that ideal thing that you want it to be, right? The perfect form of observability. So Dan, I love the way you phrased it as well earlier. If you woke up tomorrow and all your observability problems were completely solved, what would that look like? Let’s dive into it. Let’s start, Dan, would you like to kick us off?

Dan (08:15) Sure, yeah, I think, yeah, if I were to wake up tomorrow and all my problems were solved, I think what I’d like to see is a bit, well, a lot more of OTel-native instrumentation. So what that means is that, you know, we talked about recently in multiple sort of online threads of what’s better, compile time instrumentation or eBPF or, you know, any type of agent. I think what I would love to see is the sort of, like, less and less of that type of instrumentation that is added later by somebody else and more of instrumentation that’s added by library owners themselves. And I think that’s one of the things that I love about OTel is that it allows library owners to describe what they, you know, how they want to describe their library, how it works. What is a recommended, what is a recommended telemetry that they recommend their users of that library use for operating that particular library.

And not just the library, it could be a third-party system, it could be the runtime or the operating system, all basically being able to emit signals in a native way. So as an end-user, what I’d love to do is basically be able to go there and configure it with, well, as simple as it can be to configure the OpenTelemetry exporter, so the SDK side of things. And then let all that telemetry that comes out of the box just natively come out of the systems that we need to observe without having to add anything on top.
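
To make Dan’s point a bit more concrete, here is a minimal sketch of that split using the OpenTelemetry Python API and SDK. The library name, function, and attribute below are hypothetical; the idea is simply that a library depends only on the API (a no-op until someone wires up an SDK), while the application configures the SDK and exporter once.

```python
# Hypothetical library code: depends only on the opentelemetry-api package.
# The library owner describes their own operations; no SDK or exporter here.
from opentelemetry import trace

tracer = trace.get_tracer("acme.payments")  # instrumentation scope name

def charge(amount_cents: int) -> None:
    # Library-owned span, named and attributed by the people who know the code best.
    with tracer.start_as_current_span("payments.charge") as span:
        span.set_attribute("payments.amount_cents", amount_cents)
        ...  # real work happens here

# Application code: the end user wires up the SDK and an exporter once.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

charge(1999)  # the library's telemetry now flows to whatever backend was configured
```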

Colin (09:52) That does indeed sound great. What do you think we are, like, four months away from that? That’s like just around the corner, right?

Dan (09:57) Yeah, that’s just basically just change all the open source software that is out there to rely on the OTel API. It’s simple. Yeah. Q2.

Hazel (10:07) AI will help this, right?

Colin (10:11) Well, I know Hazel, I know you wanted to chat about AI, incorporating AI in the, I forget how you phrase it, the right way or avoiding the wrong way. Like, how will AI save us here, Hazel?

Hazel (10:21) Well, so AI will save us because everybody’s going to use it wrong. They’re going to implement everything. It’s going to break all the code because they’re just going to ship it. And then as the economy is crashing, as it’s– I’m kidding. Will AI save us? No.

Colin (10:40) I know a similar point, Marylia, that you had shared: the idea of making it very clear which types of components and things end users can use. Can you kind of share a little bit about that as well?

Marylia (10:53) Because what I was thinking, like, OK, what is that, as Dan mentioned, like, what is the goal, like, the end? Like, what is, like, everything is solved. So I saw, like, getting there, like, a few different steps. Like, for example, short term, there are things that people are currently working on right now. So for example, people working on the config file. Oh, these are solving something, like, more immediate. And then we have, like, OK, what is really my end goal? And a lot of times, people say like, okay, use OTel, okay, I’m gonna start with what? Do I just put the SDK? Okay, cool. Okay, now it’s more complex. Do I put the Collector? Okay, okay, put the Operator? Like, what am I supposed to operate? What? It’s like, it starts to get like so many components that it gets, like, complex.

So you have to, like, learn so much just to, like, start, and the more complex your system, which is likely what it is in the majority of cases, they’re gonna keep adding more and more. So just something like, you wanna add, this is the one thing that you need, or something that will, like, analyze your system, like, here, just do this thing and you’re done. You don’t have to install 20 different packages in different places. That, of course, I guess in four months, this one too, apparently. Yeah.

Colin (12:14) Yeah, exactly. I’m curious, Adriana, do you hear stuff like that feedback-wise about the difficulty of, like, stitching all the different components together?

Adriana (12:23) Yeah, definitely. I think the two main pieces of feedback around OTel, especially when I was on the other side of things, were, yeah, the getting started, having that seamless startup experience. I think once you get past that, it’s like, ooh, magic. But yeah, I think having something a little bit more prescriptive is very helpful.

And shoot, what was the other thing that I was thinking? Another thing that I did have in mind as well in terms of, you know, like my OTel Prince Charming, would be a world where like we don’t have people throwing shade at OTel. A world where we have like, you know, instead of throwing shade at OTel, let’s like all come together and continue to make OTel even more awesome. Because I think that, you know, I think it’s a good sign that people are starting to throw shade at OTel in the sense that it’s getting traction, right? It’s getting attention. But let’s like, in the spirit of, like, Valentine’s week, let’s come together and show OTel some mega love by like, okay, you got an issue with how things are done.

Let’s, you know, let’s work together and improve it. And that means like, making sure that vendors support like OTel-native ingest because especially if we want to fulfill on the promise of like vendor neutrality, let’s ensure that the real differentiator isn’t like going through a song and dance every time you’re trying to get data into a vendor, but like really making sure that that vendor has something unique to do with your data, right?

Yeah. Oh, and I remembered my other thing. It’s one of the challenges of OTel and I hope we can solve this. I think it’s along the lines of what’s been said before, which is, you know, auto-instrumentation is lovely, but you can get, like, so lost in the auto-instrumentation. It’s like, what’s relevant? What’s not? Where do I start digging? And I think that can be extremely challenging. And I think having the means of, like, pointing people – and I think this is like where vendors can, like, really differentiate themselves, right – is, like, hey, I noticed that there’s an issue here. Perhaps you would like to turn your attention in this direction. Perhaps this is where AI can assist us.

Marylia (15:01) Yeah, even for the auto-instrumentation, I have people saying like, I want to start with the very basics, just to learn. It’s like, okay, try the auto-instrumentation. And then they have to, like, just set up, like, environment variables or some basic things that they start adding. They’re like, okay, so I’m not using auto-instrumentation anymore because I’m having to set up the config manually. So it’s not auto, no, no, it’s still auto, it’s not like fully auto-instrumentation, you still need to do something. So even, like, the more basic one requires some work.

Dan (15:31) I think there is like a level of like, you know, the return on investment of that initial, you know, you put a little bit of effort in, but you then are future-proof in your instrumentation, right? I think that that little bit of like, you know, effort that you put in is returned like 10 times when like in the future you get like more and more things coming out of the box, more like, you know, more people contributing to OpenTelemetry as well. As Adriana was saying, like a lot of people just might, you know, throw shade on it, but then, well, without even raising an issue.

So it’s like, well, at least, you know, like contribute to it in some form. But, but yeah, I think that some people still have that sort of like point of view of like, I just want to drop something in here. And then it will magically, you know, it will magically know that I’m using this particular vendor. And it’s like, well, that’s where, you know, things are being done to make that easier, right? To ease that sort of thing, like, as, you know, Marylia was saying, the work on the OTel config, on the configuration file stuff. But yeah, but I think, you know, like being able to configure that in a way that’s easier is different from, you know, well, not having to do anything. So there is a little bit of, a little bit of effort that I think, then, you know, you get back 10 times.

Colin (16:51) Nice.

Dan (16:52) It’s like a relationship, right? You need to put a little bit of effort into your relationship and you get that back 10 times.

Hazel (17:00) I like the relationship thing, but also it brings to mind one of my gripes with OTel in that OTel feels a lot of the time like it was lovingly designed by a bunch of people who think that running a Plex server on a Linux desktop is user-friendly for printing out files.

And so I actually really like the printing analogy because it’s getting something out of your computer into the real world, right? Normal people. They go to the store, they buy a box, and that spits out ink on paper. They plug it into the wall. They walk over to the computer, they hit the giant print button, stuff comes out. If it takes any more steps than that, they don’t want it.

And then we have all these other people going, yes, for graphic design nerds. We are print design nerds. Have you ever thought about calibrating the chroma of your ink to like the design of your paper for this set? And you’re just like… button. Paper. What? And they’re like, “Oh, no, no, no, it’s fine. We have a configuration helper project for you. We made this so hard for you. We have a configuration helper project. You just wire that in together. And all you need to know is this 400 page textbook on, you know, a theory of paper. Don’t worry about it. You can ignore most of it, but section 27 is really relevant.” And you just keep going.

Hanson (18:37) Yeah, experts designing things for themselves is not the same as designing something for somebody who is just getting into the thing.

Hazel (18:47) And that is actually one of the biggest things that I saw as a disappointing factor of the observability 2.0 versus 1.0 sort of messaging, which is that observability 1.0 was sort of almost tried to be phrased as the stepping stone to 2.0. They’re both useful, but the thing that’s actually massively different is observability 1.0 tends to be zero configuration drop-in.

You more or less just install something, don’t have to do a whole lot. And then magic comes out. Is it useful magic? Yeah, but it’s basic. And then observability 2.0 is like, we’re so much better, we’re so much superior. All we need to do is read this 80 page manual, do these 15 different steps, download these 15 packages, and then go on the IRC, not the other one, this IRC server. No, no, everybody’s actually on IRC. That was not a typo. IRC. You’ve already lost me.

Dan (19:49) There is an aspect as well though on the, and then going back to the AI topic, where like people think that AI will even solve the problems of instrumenting, you know, custom code or like basically giving you anything that’s wrong with your system, right? So you put AI in and it will tell you what’s wrong. But I think that what we’re missing here is, well, you first need to produce data in a way that AI can use, right?

And that could be using like semantic conventions, or that could be using your own custom stuff that you care about, because that’s the most interesting type of telemetry, right? The one that tells you about your business, what your users care about. I was recently asked if I thought that there’ll be a point where AI will be able to tell you what SLIs you should care about, you know, and it was like, there’s no point that an SLI will tell you what to care about if the data is not there in the first place.

So if you’re like, I don’t know, if you’re doing like, I come from Skyscanner, we do travel searches. So if you’re not pushing data about your travel searches or your flight searches, how is any AI going to tell you that the most important thing about your system is flights? If you don’t even, it doesn’t even know that you do flights, right? So I think that aspect of custom telemetry is also quite important, but in a way that is standard, in a way that can be correlated with all the other stuff that comes out of the box.
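
As a rough illustration of Dan’s point about custom, business-level telemetry (the span and attribute names here are invented for the example, not real semantic conventions): auto-instrumentation can cover the generic HTTP and database spans, but only you can record the fact that this particular operation was a flight search.

```python
from opentelemetry import trace

tracer = trace.get_tracer("acme.search")

def search_flights(origin: str, destination: str) -> list:
    with tracer.start_as_current_span("flight_search") as span:
        # Business-specific attributes: the telemetry that no auto-instrumentation
        # (and no AI looking at your data later) can invent for you.
        span.set_attribute("search.origin", origin)
        span.set_attribute("search.destination", destination)
        results = []  # ...call the real search backend here...
        span.set_attribute("search.result_count", len(results))
        return results
```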

Hazel (21:15) And then one of the other things that really gets me about AI is a lot of people think that AI is a creator, like a tool of creation. It’s not a creation tool, it’s an amplification tool. It amplifies what you have. It works really well with what you have. And one thing that it does well is it takes a semantic representation of a concept from one technical domain, transfers it into another technical domain, and keeps the semantics.

Anything else is a crap shoot. But it means that, for example, AI does extremely well with RAG, which is literally just, “Give me the relevant information and then tell me what to do.” You can’t create it, you actually need to develop a context. So what is one of the most relevant pieces of context you can have in a code base when figuring out what to do in a single function? A call stack.

The call stack is super relevant. It’s not in the code directory tree. And the call stack is also not necessarily confined to one particular service. But it turns out that the call stack notion generalizes almost perfectly to trace. So if you can build a tool that can take the trace of a system, like any one arbitrary trace, dump all the source code associated with that trace into a thing, annotate the source code with the trace data, correlate the two, you have like a perfect “RAG in a box” generator that is only generic, once and everything, and then updates with their test suite.

Nobody’s built it yet because nobody knows what to do with this thing. But everybody’s just looking at all the potential of OpenTelemetry, all the potential of observability, and just going, yes, but this is only useful in this one tiny little box. And the thing that I want, my concern, is for people to stop caring about the tiny fucking box.

Like break outside of the box. Be the spoon. Just, you need to understand that it’s about the humans, the human connections, getting people understanding something and working together. OpenTelemetry is just one of a giant list of methods of figuring out how to understand the system.

And all these methods, all of them, compound on each other and are massively useful the more people are able to reference them, utilize them, and get something from them, and combine them. If you take all of your OpenTelemetry, all of your observability stuff, your monitoring, your learning, and then you hide it and squirrel it away so that only the SREs can use it… Then, you know, this one tiny little part of the organization sees the data. You’re paying what? 10-20% of your entire infrastructure budget for something that 5% of the organization is going to use, 10% of that 5% are going to be actually deeply competent in, and 10% of that 10% are going to be an expert in. That’s the worst possible way to spend 20% of your budget.
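
Nobody has built Hazel’s “RAG in a box” yet, but as a very loose sketch of the idea she describes above: if spans carry code-location attributes (OpenTelemetry’s semantic conventions define attributes like code.filepath and code.function for this), you could join one trace back to the source it executed and hand the annotated bundle to a model. The span shape below is a simplified assumption, not a real exporter format or an existing tool.

```python
from pathlib import Path

def build_rag_context(spans: list[dict]) -> str:
    """Join one trace's spans to the source files they executed, annotated with timings."""
    chunks = []
    for span in spans:
        filepath = span.get("code.filepath")   # code-location attribute on the span
        function = span.get("code.function")
        if not filepath or not Path(filepath).exists():
            continue  # skip spans with no local code location (e.g., external calls)
        source = Path(filepath).read_text()
        chunks.append(
            f"# span: {span['name']} ({span['duration_ms']} ms), function: {function}\n{source}"
        )
    return "\n\n".join(chunks)

# Hypothetical spans from a single trace, flattened to dicts.
trace_spans = [
    {"name": "GET /checkout", "duration_ms": 412,
     "code.filepath": "app/checkout.py", "code.function": "checkout"},
]
print(build_rag_context(trace_spans))
```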

Adriana (24:29) So you’re saying bring OpenTelemetry to the masses, which I mean, as it should be, right? Because it shouldn’t just be in that little corner. It’s an everybody concern. You’ve got to get them jazzed about it.

Dan (24:41) And also how you use that in like, you know, all the different parts of, and I guess that’s what Hazel, you were talking about. You can use it in all the different parts of the software development lifecycle, you know, from, this is not just about operating system, but understanding systems and even talking about like different groups of humans that need to agree on those dependencies between teams. Well, you know, if the systems that you operate, give you those dependencies, exactly what is in the critical path, what’s not. I think there is a lot to explore there for like, you know, not just the, “Hey, something broke in production, we’re going to fix it.” Yeah, absolutely.

Colin (25:24) I do want to make sure we get to our next question, our next topic. But Hanson, we haven’t heard from you in case you wanted to chime in a bit about your Observability Prince Charming really quick and then we just–

Hanson (25:35) Just to build on what everybody’s saying, I think it’s important to actually have an understanding of what we’re dealing with. These are systems, people are systems, people use systems and to communicate each structure, you need a language. OpenTelemetry offers letters. They don’t offer words. We put the words together in semantic conventions and giving meaning, create these words. And what I love about the future of o11y is actually po11y.

We want different things at the API level, SDK implementation level, correct abstractions, also correct specific opinions at the SDK level so that you have expectations going into a relationship with another system. What are you going to get from it? What are you providing to it? And having this understanding allows us to have different instrumentation APIs and vendors work together so that the system plays nicely with everyone. And we’re not going to be having a, you know, a relationship that is, you know, coercive and dictated from one side. I can’t leave it because, you know, you hard coded a bunch of stuff that only works with a specific vendor. It has to be po11y. That’s how it’s all going to work. ’Cause none of our software is built by one person. It’s built by a bunch of people put together. And in order for the software to work well, for the observability to work well, it has to be po11y.

Hazel (27:00) I can confirm, as a polyamorous, queer, hedonist. It is the superior lifestyle.

Hanson (27:10) You need maturity to kind of be in a complex distributed system or relationship. But once you get there, it is potentially really awesome. But you need the helpers. You need to wrap up. Back to what everybody was saying is you can’t just drop people in and say, figure it out. It ain’t going to work. It’s going to work really badly. And you’re gonna be like, “Oh no, OTel sucks. Observability sucks.” No, no, no. It’s just you just weren’t ready for it.

Hazel (27:39) It’s great when it works, but it does require so much extra work and effort into it. And you have to get used to rewriting some of your assumptions. Like, for instance, my calendar is eventually consistent, which is the worst kind of calendar. But also, what else is eventually consistent? Your logging. Nobody expects that to be eventually consistent until it is. And yeah.

Colin (28:04) Nice. Sorry, I missed my clean window when we made it all the way back to the love theme of polyamory. That was the time to move on. So we’ll see if it happens organically each time. But it is time to move on to our next topic. Let’s start with a nice palate cleanser. I’m going to launch a quick poll that we can chat about a little bit, just to have a nice little icebreaker. So panelists, this was shared with you ahead of time. You should have your answer ready for the game.

“F, Marry, Kill.” F stands for fling. That’s what it stands for. So audience, yeah, if you want to let us know, if you don’t know this game, there are the three, currently there are three primary telemetry signals in OpenTelemetry: logs, metrics, and traces. “F, Marry, Kill” is a game where you have to pick one that you can have a very short, brief, but passionate dalliance with. You can try it once in your code base.

Marry is, you’re gonna, that’s your ride or die. You’re gonna stay with them forever. That’s your number one signal.

And Kill is, well, you gently push that signal off the cliff, never to be seen again. So, panelists, “Fling, Marry, Kill” for the three OTel signals. What is it gonna be?

Hazel (29:18) As a polyamorous hedonist, I cannot possibly marry only one, but also I’m not really going to marry any of them. So I’m going to have a fling with all three. Because that would be ideal to me. Maybe not all at the same time, but only if they want. But when we’re talking about again, I’ll have a fling with all three. But if I have to, if there becomes like a jealous triad happening, I will regrettably have to kill metrics first.

I know, I know. The thing is, the thing is I have noticed in my career that metrics are the one thing that everybody really wants to have more than anything else. But it’s also the hardest thing to actually get right. And it’s the thing that almost always gets used as a weapon unintentionally, but as a weapon to inflict pain upon software developers. So I tend to, that’s the one that I kill first.

There’s actually a funny story. One time, I was at a company, and we had this monitoring and alerting setup. It was so bad. Any non-200 error threw a Severity-1 incident, like a single four, single five something, Severity-1 incident, just absolutely bonkers. And it was burning out all the developers. And so finally I had to go to the engineering leadership and I said, we have two options.

One, we invest enough time required to get this thing actually usable and workable.

Or two, we just turn off all the alerting. And the reason I gave that second option was that it turns out all of our alerting, all of our monitoring, all of our everything was so inefficient that, despite burning out all the engineers, the most reliable method by far, by several days, was the inbox that customers used to email us to say things weren’t working. That worked far better than the thing that was burning out all the engineers.

This is so common. This is so common. Your most reliable alerting method is actually wtf at company.com. That is one of your highest quality metrics, highest quality alerting sources. That’s one that actually matters. And so if that one is disagreeing with your system metrics, you can turn those off. And so we did. Engineers are happier and the system actually got more reliable because they were less sleep deprived.

Colin (31:51) Wow, all right. That was quite the case to throw rocks at metrics until it perishes. Everyone else, I want to hear what everyone else thinks. Dan.

Dan (32:01) I would agree that, you know, that metrics are probably, well, I would probably marry traces. I think mostly because they’re the backbone of, like, you know, basically all correlation, right? So you have traces, and then with that distributed tracing and that context being propagated with trace context, I guess I’d probably marry trace context more than spans themselves as, you know, isolated. If you were to take them as isolated signals, right?

Hazel (32:26) So you like a girl that’s well-traveled?

Dan (32:33) Yeah, exactly. Like that’s been around, you know, like distributed systems, yeah. Many different subdomains. Yeah. So I think, the, uh, I mean, I would definitely put that there. I think metrics are basically something that, I would agree, is sometimes difficult to get right. Now at scale, you tend to, I mean, my point of view is that at scale you tend to benefit from that pre-aggregation at a, at a specific, like, replica level of, like, some high-level signals.

But the problem with metrics is that, I’m not sure if that’s the same problem with flings, that they’re used in the wrong way. So like people sometimes relying on metrics for what they shouldn’t be relying on, which is, you know, debugging like specific transaction-level signals, right? What you should be using traces for. And I’m not sure I would kill logs either because if you think about logs as like log records, or like to describe events that are discrete events, then you still need them as well. So I’m not sure I would kill anybody.

Marylia (33:38) Mine is a little similar to that. I would also like to marry traces because I feel like that is the thing that has the most longevity. You have the best relationship there. You see the full picture. But I see the metrics as a fling just because it’s the first thing usually you see. They get your attention and you get excited about it because it’s just like, I think it’s like the easier one to get into. Usually people wouldn’t say like, monitoring, observability. They usually go to a metrics number kind of thing.

So I think that is the thing that calls the most attention. So I go with that for the fling, and the logs, I guess, will have to be the one to kill, but it would be more on the wrong usage of logs. Because a lot of the time people just send a bunch of stuff, and I see people also using like, “Every time this event happens, I just send this message on the log.” And then they do like a search for like how many times that message shows up instead of having a metric with a counter. So again, like the logs can also be wrongly used, and because there’s so much information there, it can be harder to read than a metric. So I’m sorry logs, you’re dead.

Hazel (34:58) It feels like logs are the messy, somewhat unstructured person who doesn’t quite have their life together. They’re super-energetic. You love them a lot, but they’re kind of chaotic. Whereas traces went to therapy.

Dan (35:19) Are logs like the toxic relationship that you need to get away from at some point? You need to move ahead.

Colin (35:27) I do want to hear from Adriana and Hanson really quick on this and then we’ll get into the next topic because we got to see, “Are logs universally going to be slain?” Or no, metrics was for Hazel, sorry.

Adriana (35:40) So I was gonna say, I agree, traces are like my ride or die. They are the backbone for me. Metrics would be my fling. And I think there’s a lot of people out there who, and I think a lot of people started with, I will say quote unquote, observability on metrics and put a lot of weight on metrics. And I feel like we kind of need to shift them towards traces and all the goodies that can come out of that.

And also like you can derive metrics, certain metrics from traces. I would kill logs, but I will say this. I’m not saying that logs aren’t important, but in my perfect, you know, world of observability, what I would love to see is, you know, because OpenTelemetry supports span events, which are logs embedded in your traces, then you can have your cake and eat it, too.

Colin (36:41) Yes, very much true. Hanson, your take.

Hanson (36:45) So my take is a little bit outside the box, especially for mobile. Metrics capital M on mobile, not super useful. Metrics lowercase m derived from logs and spans, well, that’s where the power comes from. The pre-aggregation on the client side reduces the usefulness of the context that you eventually want to have to slice and dice.

It makes sense when back pressure is a problem on servers and you have lots and lots and lots and lots of samples. So you got to make sure you reduce the noise. On mobile, the interesting things don’t happen that often. And using metrics to kind of pre-aggregate, I think, unnecessarily erases a lot of the context. I wouldn’t, I would kill capital metrics, but I would get lowercase metrics from everything else.

And what I would marry is actually context. Context is not part of the pillars, but that’s what ties everything together. Whether it’s a log, it’s a span, doesn’t matter what it is. If you have context and if the telemetry, the signals you’re sending, provides a specific signal, well, literally, you can start doing a lot in terms of slicing and dicing, in terms of replaying the entire session. That’s what I would marry. Not logs. I want logs and spans, but under the context umbrella.
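
A toy sketch of what Hanson’s “lowercase-m metrics” might look like, assuming raw spans leave the device with their context intact (here they are just plain dicts) and get aggregated later, instead of being pre-aggregated into counters on the client. The span and attribute names are made up for the example.

```python
from collections import defaultdict
from statistics import median

def cold_start_stats(spans: list[dict]) -> dict:
    """Derive per-device-tier launch stats from raw spans, after the fact.

    Because every span kept its context (device tier, app version, ...),
    you can still slice and dice -- which a counter pre-aggregated on the
    device could never give back.
    """
    by_tier = defaultdict(list)
    for span in spans:
        if span["name"] == "app.cold_start":
            by_tier[span["device_tier"]].append(span["duration_ms"])
    return {tier: {"count": len(d), "p50_ms": median(d)} for tier, d in by_tier.items()}

# Hypothetical raw spans from a handful of sessions.
spans = [
    {"name": "app.cold_start", "duration_ms": 1800, "device_tier": "low"},
    {"name": "app.cold_start", "duration_ms": 600,  "device_tier": "high"},
    {"name": "app.cold_start", "duration_ms": 2100, "device_tier": "low"},
]
print(cold_start_stats(spans))  # {'low': {'count': 2, 'p50_ms': 1950.0}, 'high': {...}}
```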

Colin (38:18) Sorry, I was trying to unmute and it was not working. Great, thank you, Hanson. I think I needed like 16, I need fling, marry, kill, date, weekend, tryst. Once we get into like span events, I didn’t know what I’d opened. Let me end the poll really quick. Let’s see what everyone else thinks. Oh, we have a tie. Okay, well, the audience universally wants to kill logs. So that much is clear.

Some would rather marry metrics, some would rather marry traces. Funny that those two came up because we have an audience question about that that I would love to answer now before we get into our next topic. Ashwin asks, “How would you all convince legacy apps which were written 10 or more years ago to adopt metrics and traces?” I would love to hear from our panel their thoughts on that.

Dan (39:09) I can go with that. I think one of the ways that I’ve seen that can basically work is if you drop in sort of like instrumentation from OpenTelemetry that can start to inject trace ID and span ID into logs. Then you’re able to tell people, “By the way, that’s correlated now to your traces.” So you get the traces out of the box. You don’t need to add anything custom there. So I tried to get that sort of like client span / service span and then basically start to get that context around and then show people that, “Hey, by the way, now you’ve got your logs in here, but they have a bit more context. You no longer need to do your type of correlation between logs.”

So to give you an example, like I’ve been in companies where we did that type of correlation manually. And I say manually, it was like custom tooling to inject headers, extract headers, have a correlation ID that will be passed around, that will be put into the NDC in Java, and then ends up in logs. And all that basically so you could go and look at specific logs for a transaction, right, for one single correlation ID. And then say to folks, “Oh, by the way, now you don’t need to do that. You’ve got a standard. You’ve got the W3C standard for trace context propagation. You can have that in your logs, but as well, what if you invert it and instead of looking at logs first, you now look at traces first. Your logs are still going to be there. You can go to them if you need them.”

So I think that is a nice way to segue people into that world of like everything happens with some context. Right. And then on the metric side, well, I guess that’s, you know, that’s basically the story of like, where do you start with metrics as an aggregated view? And then you see those correlations, or how a particular regression in a metric may correlate to individual traces that were captured at that point in time. So I think being able to convey that story of, you go from the more aggregated views to the higher-granularity views, that’s one way to convince people to get into the world of metrics and traces. Not sure if anybody else wants to jump in.
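
A hedged sketch of the first step Dan describes, using the OpenTelemetry Python API together with the standard logging module: attach the active trace and span IDs to every log record so existing logs become correlated with traces. It assumes an OpenTelemetry SDK is configured elsewhere; the logger name and log format are arbitrary choices for the example.

```python
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the active trace_id / span_id to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("legacy-app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Existing log lines now carry the same IDs your traces use, so "logs first"
# and "traces first" point at the same transactions.
logger.info("order submitted")
```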

Marylia (41:34) Yeah, I was also just gonna add like… because this way you just don’t tell people, like, you have to replace every single thing that you did, so we don’t scare people; like, you can also add things step by step and see what makes sense for you. This way you don’t feel like, “Oh no, everything I did for several years now, just throw it in the garbage.” Like no, no, you can keep, like, going step by step.

Colin (42:01) Nice. Let’s, sorry, go ahead, Hazel.

Hazel (42:03) I don’t actually want to answer this, but I want to answer it with a little bit of a story slash analogy. I’ll keep it short. I have, in my experience, found that there are three specific things that you need in order to convince a company to do some sort of major task. You need a magic jewel, a path through the woods, and you need a tourniquet.

So the magic jewel is different for every single person. But this magic jewel is something specifically that captures their attention and helps them go, “This is exciting.” So at one company, my magic jewel was, someone was describing a really convoluted path to fork and modify the code of something, blah, blah, blah, to fix a downstream dependency that we needed. And in the time that he described it verbally, I did the same using the tool that I was trying to get adopted.

As he described it, I completed the task and he was like, I think this might take three weeks. And then I handed him the finished thing, and I said, “Does this work?” That was the magic tool for that. Ever after, he was chasing the jewel. Leadership also has a magic tool. It’s going to be different. Some of your ICs are going to want a different tool, but if you can find the jewel, if you can find the gems, you’re good to go.

But then you need a path through the woods. You’re going to start off lost in the woods somewhere. You can’t get onto a golden path. You can’t use a green field. You’re somewhere in the woods. I need to have a path through it in order to get to the nice place, the happy, the everywhere. And so this path, most importantly, needs to be able to be taken each step at a time. We don’t have this ability to take each step at a time on a path and then stop and be fine.

You’re never going to make progress because nobody’s ever going to want to start. If they know that they can get on the path and then get off the path whenever, working on the path, stop halfway through and not break anything, you’re good to go.

When we say, “Let’s rip out everything, do a full rewrite of all of this stuff. And finally, once it’s all done, we’ll be able to use it–” That won’t work. People won’t get on the path because that’s just scaling a cliff. It needs to be step by step.

Then finally, the tourniquet. Because once you get the jewel – the motivation – once people have the path and they know sort of where to go and they’re confident they can take it step by step, you have a problem, which is that they’re bleeding. They’re spending a lot of energy staying right there. And so what you need to do is you need to figure out a way to stop the bleeding in one area so that any new code, any new development in the application is on the path.

Don’t actually worry too much about the old stuff. It sounds counterintuitive, but in an application, the power law distribution applies so deeply. And there’s actually research to back this up. The power law distribution is really saying, the thing that’s most likely to break is the thing that you just touched. The thing that’s most likely to be updated again is the thing that you just touched. The thing that’s most likely to need more documentation or need more support is the thing that you just touched. That one scary bit of code in the corner that was last tested in 1994, you don’t need to worry about it. Honestly, it’s either never going to be touched again, or you can deal with it when you get there. The new stuff that you wrote last week, instrument that with OpenTelemetry. Don’t worry too hard about selling people on this stuff from 1994.

Colin (45:51) Yep. It’s in the closet. It’s humming along. It’s happy. It’s somehow surviving on stuff. Let’s just let it be there. Great point, Hazel. I want to jump into our next discussion topic. So here we go. What do you absolutely love when it comes to OTel and observability?

And before we get into that, I realized that some of the things that we love might also not all be sunshine and rainbows. It might be somewhat of a love-hate relationship. So if that is how you want to address this topic, it’s not only acceptable, that would be fantastic. And I might find a bag of popcorn and eat it while you chat. So what do you love about OTel and observability? And maybe what’s a love-hate relationship? And let’s start with Marylia, if you’d like to kick us off.

Marylia (46:42) Sure. I think I can even go back a little to the complexity that I mentioned before. So the good thing is, you have a lot of options. So that is a good thing of what you can do. And at the same time, you have a lot of options, so which one should I be picking? So I think that is like, you are able to do so much with OpenTelemetry. And that is something that is great. And at the same time, that is the hate thing, like, “Okay, how do I actually do it with… Should I use OpenTelemetry for everything that I’m touching? Now, I don’t know, I’m monitoring the water on my plants. I’m gonna use OTel for that, too? Like everything in my life now is OTel.”

So I think like that is something that is really great. And having like all the community helping for the same thing and having like this, like there is always this gap. And people just, like, started working, and it went from this small group to being the second most contributed-to project in the CNCF. And at the same time, you’re like, okay, on my day to day, like, “This competitor is creating this thing. No, that competitor… Okay, let me just stop because I actually have a meeting with that competitor and we are working on the same thing and we are actually pair programming.” So I do, like, a bunch of pair programming with people that are technically competitors. So that aspect is also really nice from the community.

Colin (48:14) Nice. Adriana, a lot of head nods. I know you want to jump in on this.

Adriana (48:19) Yeah, yeah, I couldn’t agree more with Marylia on that, like, the community is amazing. And I think OTel is one of those really unique communities. A lot of the time, open source projects can be very much driven by one company or organization, and OTel is one of those rare occasions where it really is truly community driven, like Kubernetes. And so there isn’t anyone vying to be dominant. And if there is someone trying to be dominant, that’s squashed right away, because by design, the folks who lead OTel are very, very mindful of making sure that it’s not one company trying to get the spotlight. And I really appreciate that. And one thing early on in my OTel journey that really made an impression on me was before I started working for an observability vendor. I was on the other side of it: we were trying to bring OpenTelemetry into the organization where I was working, and I was running an observability practices team. And this was back in 2021, so OTel was still pretty new and getting conflated with OpenTracing.

And, you know, I was like, “Hey, we should use OTel. We should use OTel.” And the developers in the organization were like, “But it’s too new. I don’t know.” So I reached out to Ted Young and Liz Fong-Jones in the OTel community, and you know, they work for competitors, right? And I said, “Hey, can you both come and talk to this company in the spirit of community to really lift up OTel?”

And they didn’t even hesitate. I mean, this is what I love about this community is that they came together and we had a great discussion. I was even worried, you know, oh my God, what if the developers don’t have any questions to ask and now I will have done this for naught? And the questions just kept flowing and I’m, like, oh my God, this is wonderful. Like it really inspired some great conversations. And for me, this is like one of my most cherished OTel memories. And I think this is why it’s such a great community and continues to be like this even to this day.

Colin (50:50) Nice. Yeah, that’s awesome. I want to kick it to Hanson. Hanson, what do you most love about OTel and observability?

Hanson (50:58) Well, I’ll talk about what I love first, which is, I think it’s so flexible. The standard is easy to play with. You can experiment. You can have your little flings and see, not fully commit, just like, you know, play with it. “Hey, do I want it this way?” And if you do, cool. You know, maybe, you know, convince other people and you can make it a standard. If not, it doesn’t work out. No worries. The protocol itself and everything around it is built for you to try it out before you buy it.

And that is so huge. You don’t have to necessarily overcommit right away. What I don’t like is more, I wouldn’t say I don’t like it. I would say I think it could be improved in terms of the supported use cases. I think it was designed by experts to solve very specific use cases and then got expanded a little bit. And now when OTel is being brought to more people, the APIs and the SDKs start to be a bit ill-fitting for some situations.

And as the community grows and more people find OTel, the protocol, useful, I think, you know, naturally it’ll improve, is my hope. Not by itself, but with help from the community, which everybody spoke so highly of, and which I totally agree with as well. So, you know, it’s an opportunity. I don’t hate it. It’s an opportunity for improvement. And we’ll get there.

Marylia (52:29) If somebody is listening, that is your cue here. If you’re waiting for a sign, tell us your use cases. A lot of times we create our semantic conventions, we create prototypes, and we put it there waiting for feedback. Somebody, give me feedback. I want to know if this is good, if it’s what you want. So if you’re hearing, yeah, reach out.

Colin (52:51) Nice. Dan, I think we touched on a little bit earlier, but talking about that, like, configurability, I know file config is a big thing in your book. I don’t know if you wanted to touch on that a bit.

Dan (53:03) Yeah, I think it’s basically been said before, but the love-hate relationship is that it’s so extensible, and, you know, the OTel SDKs and the design of OpenTelemetry are complex for a reason, right?

And that’s the love-hate relationship there, which is, I love that you can do everything with it, but I also hate that this can be difficult for end-users to adopt. So yeah, we’re doing things in terms of how easy it is to configure, and how easy it is to share those configurations and templates and so on, to make it something you can just drop in somewhere, right? And it’s a lot easier. But yeah, it definitely is that love-hate relationship.
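
For context on what that configuration boilerplate looks like today, here’s a minimal sketch in Python (our illustration, not anything from the panel): this is the kind of per-service SDK wiring that declarative, shareable configuration aims to turn into a drop-in template. The service name and collector endpoint below are placeholders.

```python
# Illustrative only: typical programmatic OTel SDK setup that a shareable,
# declarative configuration would let you drop in instead of repeating in code.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Placeholder service name and collector endpoint.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
```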

I would say as well, another thing that I was thinking, and that I think we haven’t talked about a lot, is semantic conventions, right? It’s not a love-hate relationship there; I think it’s a challenging one, because on the one hand, you’ve got semantic conventions that are awesome. I think that is the thing that is driving the tooling from vendors who are adopting them on top of standard data, right? And so I feel we’re in a moment like…

I don’t know, it’s like before we had SI units, right? We had people measuring things with, I don’t know, the cubit instead of the meter, what’s it, the length of your arm? That was some unit of measurement before we had something standard. Anyway, I think we were at that moment, and now we’re getting into the SI units sort of world, and what that did for science was empower an age of scientific discovery, right? I think having those semantic conventions in the first place is going to empower so much advancement in the world of observability. But it’s challenging, because I think everyone here would agree that there’s nothing more difficult than putting a bunch of engineers in a room and telling them to agree on some naming. So it’s difficult. It’s difficult. And it’s not just the naming, by the way. I think people sometimes get a little bit confused by the name “semantic conventions,” but it’s about how we actually measure things. Is it a metric, or is it an event, or is it a span? All these discussions need to happen. So it’s challenging, but I think we’re getting there. There has been a lot of good work done, and it’s great to see.

Marylia (55:35) Yeah, even when it is defined, like you were saying. I was just remembering some of the meetings. We have a SIG that is just for database metrics. So we know it’s a metric, we know the things we want to calculate. So, okay, what are the attributes? And I go, “Easy, it’s that thing.” Okay, cool, because I’m thinking about SQL. And then, no, no, that doesn’t exist in NoSQL. Okay, what is the name that matches both things? Okay, now this is the name. But what about Redis? What about Elasticsearch? Elasticsearch is a database, should we consider it? And then we spend meetings, several hours, just on, is this the same thing as that thing? So it can take forever. And this might also take long to stabilize, because you want to make sure it’s going to fit several cases and people are going to give their feedback. And then when you start creating the prototype, you realize, oh wait, you cannot implement this. That is not possible in that particular language. Like, no, go back, go back. And yeah, it can be challenging.
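
To make the naming problem concrete, here’s a small, hedged illustration (our example, and the exact attribute keys vary by semantic-conventions version): the whole point of the conventions is that a SQL database and a NoSQL store end up described with the same keys, so any backend can build the same tooling on top of the data.

```python
from opentelemetry import trace

tracer = trace.get_tracer("example.db.instrumentation")

# The same convention keys describe very different databases,
# which is what lets vendor-neutral tooling work on top of the data.
with tracer.start_as_current_span("query orders") as span:
    span.set_attribute("db.system", "postgresql")  # SQL
    span.set_attribute("db.operation", "SELECT")

with tracer.start_as_current_span("read session") as span:
    span.set_attribute("db.system", "redis")       # NoSQL, same keys
    span.set_attribute("db.operation", "GET")
```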

Colin (56:37) Nice. I want to move towards our third discussion topic, but I did just want to mention again, I’m so glad that you brought up that point, Marylia, about… reach out. The OTel community is all about community. On feedback: we think there are about 25 SIGs, and who knows, every single day there might be four or five more. There are so many use cases for observability and for OpenTelemetry’s help there. So many hardworking people. Reach out to the people on this panel.

Join the SIGs. All the calls are open. The notes are open. So definitely get involved. The project is as great as the community makes it. I did want to, let’s launch our third poll really quick. And then we’ll get into our final discussion topic as everyone answers. This is more, I’m just curious. So the poll question is, what’s everyone’s secret crush?

I mean, the software space is huge. We can’t be experts in everything. What’s that secret crush that you really wish you had more time to learn and get good at, but you just don’t. And we did touch on AI in this chat. I’m curious if the audience will be AI. Maybe Hazel will chime in about AI. I don’t know. We’ll see. But while the audience gets into that, I’m just curious panelists, what’s your secret crush?

Hazel, you can go first. I didn’t mean to tee you up if AI is not your answer, but I’m curious. What’s your secret crush?

Hazel (58:00) My secret crush, let’s see. I actually have a decent amount of experience in basically all of these. But the thing that I would probably learn the most about, I would say. So one of my dirty secrets, if people go, “Hazel, you know everything.” I go, I actually keep a list of things that I don’t know, because I know it’s small, but it is there. And one of those top of mind is actually SQL.

I’m not that good at SQL. I know. I can do the basics, but the fancy stuff, I can’t do that. I could learn it whenever I wanted to; it’s just never actually come up. And so I would learn SQL things.

SQL is also one of the things that is, like, least suitable for AI, which I think is really fun. Um, AI, it would be interesting to learn more about how to integrate AI usefully into things. But one of the things that’s challenging is all of the interfaces around AI are really, really limiting for automation. And most of the ways we can actually integrate it are just not super useful. And for me personally, the thing that I love more than anything else is reproducibility.

Like it defines my entire software stack. It defines how I keep myself, you know, sane when I’m using a computer. If I hit the button, it better do the exact same thing that it did before, or I will rip apart that entire computer and fix it until it does. And so the fact that everybody’s just like, yeah, let’s take a stochastic parrot soup, sprinkle it everywhere and just go, “Yeah, absolutely.” Melts my brain.

AI would be the toxic person that everybody’s like, they’re kind of dangerous, but they’re really good in bed. And I’m like, “Hmm, maybe.”

Colin (59:59) Nice, interesting. All right, all right. Really quickly, because I do want to get into the last discussion question. Maybe people just very quickly just say their answer really quick without, unfortunately, all of the beautiful context. Dan, what’s your secret crush?

Dan (1:00:15) I think data observability is an area that is evolving a little bit behind the rest, distributed systems and all that, but it’s going to become more and more important. Data lineage, and all these things about how we observe the consumption patterns of data, will be more important with the rise of AI, and with regulations as well, in the EU for example, to actually be able to demonstrate that you’re on top of your data. Right. So, yeah. Looking forward to seeing more in that space.

Colin (1:00:47) All right, Adriana.

Adriana (1:00:50) For me, it’d be mobile because we rely on our mobile phone so much. I think it’d be cool to learn more about that someday when I have time.

Colin (1:01:00) Yeah. All right. I hope you have a lot of time. Marylia, how about you?

Marylia (1:01:06) I’m going to say desktop applications, just because it has to be a secret crush. Databases, mobile, those are not secrets for me. I’ve been touching them all the time, for several years, so at that point it’s just a crush, not a secret. So I guess for a secret one, it would be desktop, just because I never actually touch it.

Colin (1:01:24) Nice. Hanson, what about you?

Hanson (1:01:25) I have no secret crushes. They’re all out in the open. And I will say runtime environment, especially on mobile, is my very not-so-secret, but major crush. Runtime environment runs everything. So we need that.

Colin (1:01:40) Nice. Awesome. Interesting. Well, hey, machine learning and AI won. So, all right then. We knew it was coming. We knew it was coming, and now it’s here. All right, let’s kick off our final discussion topic. So, this is: what was your “love at first sight” moment when it comes to OTel and observability? What event happened that really piqued your curiosity about it, that made you want to invest in learning it, so that it slowly became this all-consuming desire that we all have for it? I’m very curious about this. Why don’t we start with Adriana? Would you like to start?

Adriana (1:02:19) Yeah, so what piqued my interest in observability? For me, honestly, it started, I guess, to a certain extent with Charity Majors ranting on Twitter, and I got addicted. And then I got a job where I was running an observability practices team. And at that point, in spite of consuming all the things that Charity wrote, I still didn’t quite get it. So I’m like, I’ve got to run this team. I better damn well know what it’s all about. So I educated myself, and as part of my self-education, wrote many a blog post about observability, which led me to OpenTelemetry. And that is how I fell in love.

Colin (1:03:10) Nice, awesome. Marylia, would you like to go next?

Marylia (1:03:14) Sure. So for me, I actually started my career really focused on UX, because my mind was always like, okay, you always have to use those systems, but why do they have to be so hard and ugly? So I focused a lot on that at the beginning of my career. And then with time, I started working on things related to observability, but I did not realize it was observability.

So I was creating internal systems for people to monitor and observe their systems, and I was like, this thing is kind of interesting. And then I moved jobs again, and my job was to be the person responsible for observability in the company. So I became the manager for observability at Cockroach Labs, basically all of observability there. And as soon as I was touching more and more of that, it really clicked: that is what was missing for me. That is the part that I really like, and I want to continue focusing just on that part. And that’s why I moved jobs again, to Grafana, so I can focus just on observability and not any other areas. But yeah, that is what it was. It came from wanting to help the user actually use their system and make their life easier.

Colin (1:04:39) Nice. That’s awesome. Hazel, what about you?

Hazel (1:04:44) For me, this one is a bit of an unusual answer because the way I find things is I don’t know anybody else who does this, but I just think about what the answer should be. I write it down and then I go find it. And that is pretty much always where I look for it.

So I’ve done this with so many research papers, it’s hilarious at this point. So when I first saw Charity Majors talking about observability and about everything related to how tracing works, how instrumentation works, how developers do this thing of finding information and understanding the system, I was like, oh yeah, I’ve been looking for this. There you go, there it is. There was never any lack of understanding. I just kind of came in.

With the whole, “I know what I’m looking for, but nobody seems to be talking about it.” And then she started talking about it, and I was like, “Oh, perfect. Finally, someone is doing this.”

The first magic moment with tooling was probably BubbleUp, Honeycomb’s tool, where you can just select over a chart and then see the difference, the delta, between all the information. I really liked that, because you could build it in about a week. But it is so well dialed in, so well polished, and even though it’s a simple concept, you probably never would have thought about it, and you have no way to really explain how much of a time saver it is or how magical it can feel. And those are my favorite types of things to build.

Colin (1:06:23) Nice. Yeah, definitely getting blown away by how someone visualizes data in a way that you never even thought just makes it so much easier. Yeah, that’s a great example. Hanson, what was your love at first sight moment?

Hanson (1:06:37) Well, the long version of this, you can hear my talk at KubeCon EU London in April, but the short version is I was able to answer questions about how performance affects mobile apps and KPIs. Observability and context allowed us to basically quantify what matters. And when things get bad, we know why it got bad.

And when things improved, we know what the long-term effects are. If you don’t measure it, you don’t actually know what’s going on. And observability and instrumentation in general allow us to answer questions that we may not have originally formulated. So that’s when I saw it and saw what it could do. It was amazing.

Colin (1:07:33) That’s awesome. Dan, we haven’t heard your answer yet.

Dan (1:07:37) I guess love at first sight was perhaps when there was a work stream I was leading at some point related to performance optimization on the client side, in the browser. And I think it was my first interaction with the Performance API, basically collecting our own events at that time, building a bit of a data pipeline for real user monitoring data, and seeing the impact on KPIs as well, as Hanson was saying, on something that directly affects the user, right? And being able to see, well, if we’re able to lower time to first byte here, or if we’re able to lower the largest contentful paint, some of the core web vitals, then we’ll be able to improve our conversion rates. And there was a correlation happening there just from two simple metrics, right?

And then, I guess we didn’t have this at the time, but I was thinking, I wish I had a way to connect what’s happening in the frontend, on the client side, to all the backend, right? Now we do have it. Because after that, I moved on to Kubernetes resource optimization, which is like the two ends of a spectrum, right? You go from the client side all the way down to optimizing autoscaling algorithms and so on.

So yeah, I think basically that sort of linkage between the client side and the backend, and how it all can affect the user, and understanding it all, was what really made me get into observability. And I think love at first sight with OTel as well, I guess it was, at Skyscanner we were already an OpenTracing shop. So all of our tracing was based on OpenTracing. We had a moment where we said, “Okay, well, you know, it’s time to migrate to OTel.” And the fact that we were able to do that, with dozens of services that were just bumping the version of a library, an internal library that would basically configure the SDK, right? So what we did is we bumped the version up, and then in a matter of like 15 days, I think we got more than 300 services onboarded onto OTel.

There is a talk that I did about this. Seeing the graph of adoption of OTel going up by, you know, dozens of services a day, because they just needed to merge a dependency bump. That was when I said, okay, the design principles of OTel are actually paying off here, which is having that API decoupled from the SDK. So that was, I guess, my love at first sight with OTel.
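
Here’s a rough sketch of why that migration could be just a version bump (our example, not Skyscanner’s code): instrumented library and application code depend only on the OpenTelemetry API, while a shared internal library owns the SDK setup, so changing the backend or the SDK wiring never touches the instrumented code. The tracer name and attribute below are hypothetical.

```python
from opentelemetry import trace  # API only; no SDK imports in instrumented code

tracer = trace.get_tracer("payments")

def charge(amount: float) -> None:
    # With no SDK configured, this is effectively a no-op. Once the shared
    # internal library wires up an SDK (or points at a new backend), spans
    # flow without changing this function.
    with tracer.start_as_current_span("charge") as span:
        span.set_attribute("payment.amount", amount)  # hypothetical attribute
        # ... business logic ...
```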

Colin (1:10:28) Wow, that’s awesome. 300 services. That is a lot of services. Now, I know we don’t have time for everyone to do this, but the original question was love at first sight. But in part of our prep, we realized, oh, there’s also those heartbreak moments that happen when you’re first falling for something. So I did want to get a little bit into that. So I know, Hazel, you have a heartbreak moment with OTel and observability. I’d love to hear that before we get into some audience questions.

Hazel (1:10:55) The heartbreak moment for observability. I think there are two big ones for me. One was when I was an engineering leader, I was the interim head of platform engineering in my company and an expert in OpenTelemetry. So I was like, perfect. This is a chance to really, really push this through. And one of the things that you learn is it’s so hard to actually get people over the initial hurdle.

Even if you make it a one-click install, even if you reduce the friction, the time required to fix all of your stuff, to a low enough bar that people can start working with OpenTelemetry, it is often actually more work than is worthwhile to even adopt the project in the first place. For example, if your dependency tree is kind of broken and you can’t really install any more dependencies, you can’t necessarily install a somewhat complicated dependency path.

And if that is a whole bunch of extra steps, and that might cause a whole bunch of code reviews, okay, well now I need to take the super fragile wraparound… too much work, too much work. And then repeatedly, as someone in leadership, the value proposition of OpenTelemetry, or really of anything in observability, is like the worst value proposition out there.

Because you start paying money immediately, and then you start paying a whole bunch more money for all the data that you’re sending over there, and you don’t know enough yet to know what you’re spending on and what you’re not. You don’t know enough to actually use it correctly. And so you have this massive gap of increasing spend over time until you gain enough expertise to start fighting down the volume. And when you get to that point, you might have an almost manageable bill, but now you still have the problem of, “How many people are actually looking at the data? How many people are actually learning from it?” Like in one company, we had about 300 engineers, of whom 70 had ever logged into the observability platform. Of those, only about 10 were using it semi-regularly. And we averaged about 200 page views a month in the UI. For a bill of about $300,000 a year.

That’s not uncommon. When I talked about the power law distribution earlier, that’s what I meant. And so, how do I justify to the rest of the leadership? Hey, yeah, we’re spending a quarter million on this, but don’t worry, Greg super loves this. Greg is all about it. Greg has used this thing 80 times last week. Nobody else. Really?

And the other heartbreak moment was, it’s hard to implement it, right?

But when you start sitting there and you start wanting to use it as an engineer, that’s where it can get frustrating because OpenTelemetry has a very fixed lifecycle model and a very sort of rigid… It creates a cliff between expectations and reality. So if you are on mobile, as Hanson knows, a lot of the assumptions of OpenTelemetry break. If you are a streaming platform versus a batch platform, all of the assumptions break.

If you are a deeply nested microservice tree rather than a tree of max depth two or three, most of the assumptions break and the tooling gets really hard. If you use auto-instrumentation, a lot of the tooling doesn’t work super well, because it really doesn’t want a tree depth of more than about four.

All the auto-instrumentation gives you a tree depth of three. So all the promises don’t really pan out super well. But the most frustrating one was that I can’t write library code and instrument it with OpenTelemetry, because the library code has to be completely refactored. If I’m using the same code in a batch process versus a streaming one, I’m not going to write my library twice. And so I could not actually instrument most of our library code in a way that was useful.

And I ended up turning off almost all the observability in our streaming services because they were useful in the classic backend. But that was not the solution I wanted. We lost most of our observability and all the distributed aspect of it because you could not actually make it work in the current observability model. And that sucked.

Colin (1:15:37) Wow. Yeah, that is rough. Well, you brought up an interesting point, Hazel, which I think ties perfectly into the Q&A. I’d love to transition. And I feel like time-wise, we might have time for everyone to chime in on one question. We’ll see if we can get to two. But you talked about the difficulty of instrumenting observability with OpenTelemetry. I know all of y’all started years earlier, when there were even fewer resources to learn this. And obviously each year more and more people learn about it, write about it, and create educational material. Hopefully it’s getting easier and easier. But we did have an audience question about OpenTelemetry, which was, “I’m still learning OpenTelemetry. What resources do you recommend in order to ramp up on it?”

So definitely, as people here who have gone through those rough patches of trying to learn and incorporate it, I’d love to know either what resources you found helpful when you were first learning, or if there are new resources that have come out, you know, this year or in the past six months, that you think are great starting points. Let’s go around and hear what resources we can share for people to learn OpenTelemetry better. Would anyone like to kick us off?

Dan (1:16:49) I would– [Gestures to Adriana] Well, okay, I’ll let you go.

Colin (1:16:50) Adriana, yeah.

Adriana (1:16:52) Can I do, like, shameless self promotion?

Colin (1:16:55) Yeah, of course.

Adriana (1:16:56) Okay, cool, cool. All right. I have an O’Reilly course. It’s a video course for you video lovers out there on observability with OpenTelemetry, and Hazel actually helped me review the course. So, big thanks to Hazel for that. Anyway, so if you, if you like video, if you have an O’Reilly subscription, you can check it out. I would also say, big plug to the OTel docs.

Folks, they are a small team, but they work really, really hard to really continue to improve the docs experience at opentelemetry.io. And even since, you know, three years ago, when I started contributing to OTel, I’ve seen, like, a huge improvement in the docs.

So that’s a great starting point, and I’m gonna do one more shameless plug. I have a blog on Medium, and I’ve written articles on OTel and observability as part of my own learning journey. And I refer back to those every so often. So anyway, if folks are interested and want a more visual, or a more textual approach to learning OTel, that could be a good option.

Dan (1:18:05) I think I would second as well the opentelemetry.io… Just go to opentelemetry.io, you’ll find the path to get started and the main concepts of OTel. Absolutely. And as well, you know, now there is multi-language support there as well with people like Marylia working on the Portuguese one. But yeah, so I think the docs have been improved dramatically and there’s a lot more work coming up soon.

Even like a getting started… documentation on getting started as well, which is great to see. And so yeah, absolutely opentelemetry.io. Yeah. I would say that I also do have a book around. It’s called “Practical OpenTelemetry.” Selfless plug. If anyone… that’s a bit more opinionated on the, on some of the–

Marylia (1:18:57) I want to show… [Holds up a copy of “Practical OpenTelemetry”]

[The book cover cannot be clearly seen, as it blends into the Zoom virtual background.]

Adriana (1:18:58) Oh, I think it’s–

Colin (1:19:01) [Laughs] We can’t see it.

Hazel (1:19:03) Hold it up in front of your shirt.

[Marylia moves the book in front of her shirt, and the panelists can now see it.]

Hazel (1:19:06) There you go.

Colin (1:19:07) Perfect.

Marylia (1:19:08) Yeah, I have a bunch of books on OpenTelemetry.

Colin (1:19:14) When we share this panel, like the on-demand version on social, maybe everyone can share their resources that they link to so that we get them all in one juicy place.

Marylia (1:19:23) I also started with the documentation, the official one. And I noticed, at the time when I started, there was no Portuguese version. So I started creating… I also have a blog, and every single post that I created has an English and a Portuguese version. So I was explaining, what is OTel? And I go through all the components, creating both versions, so people who don’t speak English can learn as well.

Colin (1:19:53) That’s awesome. Hazel, what are your recommendations?

Hazel (1:19:58) My recommendations don’t exist. Everybody else actually has recommendations. What I will recommend is, this is sort of a callout to the community: I’m going to describe what I would like, and then hopefully it happens. So I would love it if CNCF projects in general, but particularly OpenTelemetry, had a learn.opentelemetry.io URL.

There’s no reason that “how to get started” should be any more steps than one URL. There’s a URL, go to the URL. “How do I learn this programming language?” There’s one URL. If there’s any more steps you’re gonna lose them. And OpenTelemetry is almost like its own language. It just needs one single thing.

The second one would be, I would love an OpenTelemetry demo CLI. We have a whole bunch of documentation on getting started. We have an OpenTelemetry demo. We have a whole bunch of example applications. Nothing’s tying it together. Run the OpenTelemetry demo CLI, pass in the language, and you just get an example app on your computer. You build it. It works, it runs, you can see it. And then you can step through the non-instrumented version. You can instrument it, and it’ll actually diff the example app for you step-by-step, so you can see everything and interactively follow the tutorial.

And then it gets better. Then, with one click, wire all this nonsense up into the whole OpenTelemetry demo backend, running in kind or something like that. Just make it so easy to get a Hello World baked into something and then step-by-step see all the instrumentation aspects of it.

And then you can edit your Hello World app, and it’s all great. We’ve already done this with, like, Create React App, you know, Create Next App… the frontend has a whole bunch of things like this. There’s a whole bunch of templating solutions out there, but nobody’s actually built something like this for OpenTelemetry and we really should.

Colin (1:22:12) Yes, it’s like full circle, Hazel: what’s the ideal learning state? I love it. But fortunately the full circle is not complete, because we haven’t heard from Hanson yet. Hanson, how do people learn about this stuff? Come on, how should they?

Hanson (1:22:26) Hazel read my mind. For me, code is the best way to explore it. It allows you to do it the way you want to do it. I want a place where I can click a button, sync code, see what it is, have it actually run, modify something, and see it change. Different people want to learn different things. It’s really hard to guide people through all use cases. But if the code is there and it works and you can see it in your preferred language or platform, then for those of us who like to do that, rather than, you know, have something really structured, that would be amazing. And there are places like that. OpenTelemetry Android has a demo app. You can sync it, take a look at what it’s doing, and see: how do I do it? Well, you do it like that.

Dan (1:23:14) I mean, there is the OpenTelemetry demo as well, which I forgot to mention as a resource, that is super, you know, super helpful to spin up. You can run it with Docker Compose, or create a kind cluster and deploy it in Kubernetes, and play with it. And then you do have multiple vendors that have forked that repo, and they give you their little config: “Hey, this is the same demo,” just pointed at a different vendor, and you see it in your particular vendor of choice or open source backend. So that is something I forgot.

Adriana (1:23:51) And the OTel certification. I think it’s just the certification exam, but if you would like to be OTel certified, it just got launched around the time of the last KubeCon North America. So if certifications are your jam, that is in play.

Colin (1:24:13) Nice, awesome. So many resources. So you know what? We’re gonna do one better than trying to figure them out on social media. I’m gonna share all those in the follow-up email after this panel. So all y’all will get them in one sweet place. And we are coming up on time. So unfortunately, that’s all the time that we had for the panel. I feel like it flew by. It was such an engaging, wonderful, lovely discussion. I wanna give a big, big, big, big, big, big, big, big thank you to this awesome panel. I don’t know how many bigs that was, but it wasn’t enough.

This was so awesome. Thank you all for being here. Thank you all for sharing the love for OTel and observability. Like I said, you’ll be getting an email after this with those links. We’ll also have a survey in it and a chance to win a fun raffle prize. We would love to get your feedback about what you enjoyed the most about the panel, what could be improved so that we can make them better as we do more of these. So we would love your feedback if you have a chance to fill out the survey.

So with all that said, once again, thank you to this wonderful panel. Thank you for being here, for participating in our fun polls, submitting questions, and we will see you in the next one. So thank you, everyone.
