Building an ergonomic OpenTelemetry for JavaScript

OpenTelemetry’s model of traces and spans does not fit well with JavaScript’s event loop-driven design, so how can we improve support for the browser?

This article was originally published on The New Stack. Part 1 and Part 2.

OpenTelemetry has been, in my opinion, one of the most engaging developments in the software community over the past few years. It’s proven incredibly valuable for instrumenting distributed systems, microservices, and complex architectures. Because of it, teams are able to understand their systems with increasing efficacy and share that understanding across the organization. 

With its rapid adoption, OpenTelemetry is becoming increasingly prevalent on the frontend as well. However, we run into a problem: It feels awkward to use, particularly in the browser.

This isn’t necessarily anyone’s fault. It’s a natural consequence of having so many different languages using a single API; something is bound to feel off. The OpenTelemetry spec does state that APIs should feel idiomatic to a language, but the design awkwardness persists. I’m not sure why, but I suppose that when you put the needs of every community together along with the common denominator of language functionality, you inevitably end up with something that doesn’t feel quite natural in any given language. 

That said, there’s a tremendous opportunity to build on top of this foundation and provide something that frontend developers would find more ergonomic. Several languages have already done similar work: Ruby, Go, and Java have fairly ergonomic OpenTelemetry integrations, for example.

These ergonomic implementations share common factors: Language-specific functionality is used to create conveniences on top of the common API, and common control flow patterns fit naturally into the state machine that OpenTelemetry expects. 

Some languages, like Haskell and Ruby, don’t have particularly common control flow patterns, but both have the flexibility to shape control flow in ways that allow the instrumentation libraries to remain ergonomic despite that potential friction.

In fact, I’m going to state a bold claim: The heart of OpenTelemetry is context management, which is a concept that is intentionally separated from the rest of the spec specifically so that context can be implemented in the most sensible way for the runtime environment. Despite the intent, we don’t seem to achieve the benefits of that separation of concerns in reality. 

If we are to get those benefits and unlock truly ergonomic telemetry instrumentation, developing the ability to separate the control flow that OpenTelemetry expects from the control flow that makes sense in your program is essential. If there’s one thing I’d love for people to take away from this article, it’s that we would benefit massively from disaggregating context management, data instrumentation, and control flow in our systems.

There’s a trade-off here, and it can be tricky to navigate. If you take the state machine of OpenTelemetry’s desired control flow and push it into the libraries themselves, they can become extremely cumbersome to use. On the other hand, if you rely on propagating that control flow implicitly, you’ll run into problems when OpenTelemetry’s required control flow differs from your program’s natural control flow.

API friction in OpenTelemetry

When control flow is tied to the way you annotate and instrument your code, you have to change your code structure to match what OpenTelemetry expects. That’s simply not something JavaScript does well, particularly on the frontend.

JavaScript also has the unique constraint of needing to provide the “same” language in the browser as well as in Node.js. On the frontend, you have an event-driven browser runtime that’s designed to do the heavy lifting for you. Because of that, frontend JavaScript is fairly limited when it comes to asynchronous code, threading context, and managing low-level details. After all, the browser is supposed to handle all of that, and the browser APIs were originally designed in a world where frontend code was very simple.

Now that we have complex code on the frontend, you can run into mismatches between what you’d like to do and what the browser makes easy. On the backend, you have Node.js, which quickly deviated from the browser in order to add certain APIs that were necessary for running on an operating system, such as process handling and thread context; these deviations happen to make instrumentation significantly easier, but have no complement in the frontend (yet).

Even though Node.js might have better facilities for enabling ergonomic OpenTelemetry implementations, JavaScript is still deeply event loop driven by design. OpenTelemetry’s model of spans and traces really doesn’t fit well with that pattern. As a consequence, it’s difficult to set up OpenTelemetry effectively in JavaScript. 

The biggest improvements would probably require language changes. But if we step back and think about what we can do without changing the language, I like to frame it around two concepts: “annotation without structure” and “don’t make me think.”

One of the most natural APIs for OpenTelemetry is to start a span, execute work inside that span, and have the entire span wrapped up cleanly inside a parent function. If you have very clean, synchronous code, your life will be fairly nice. However, JavaScript was designed originally to be executed on a certain event, be invoked by the browser runtime, and then exit. Consequently, the most natural instrumentation API for JavaScript is `console.log`. Every time you stray further from the ergonomics of `console.log`, you make your life harder and fight against the language’s natural patterns.

Go, by contrast, has a `defer` keyword that allows you to create implicit scoping in a semi-explicit way without breaking the control flow of your language. It also provides a standard context object, so there is one conventional way to thread context through your application, even though you still pass it along explicitly. This is perfect for OpenTelemetry (and instrumentation in general). Java has support for thread-local state, annotations, and metaprogramming, which allows one to build an ergonomic API on top of the foundations of OpenTelemetry’s base API.

You can see a fairly stark difference between ergonomics with the following (somewhat pointedly chosen) examples:

// https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/examples/dice/instrumented/rolldice.go#L38-L40
var (
    tracer  = otel.Tracer(name)
)
func rolldice(w http.ResponseWriter, r *http.Request) {
    ctx, span := tracer.Start(r.Context(), "roll")
    defer span.End()
    // rest of function
}

VS.

// https://github.com/open-telemetry/opentelemetry-js/blob/main/examples/opentelemetry-web/examples/fetch/index.js#L60-L65
const webTracerWithZone = providerWithZone.getTracer('example-tracer-web');
 
const singleSpan = webTracerWithZone.startSpan('files-series-info');
context.with(trace.setSpan(context.active(), singleSpan), () => {
  getData(url).then((_data) => {
    trace.getSpan(context.active()).addEvent('fetching-single-span-completed');
    singleSpan.end();
  });
  // rest of function
});

While this example is chosen to show off the pain points, we can see what happens when friction occurs between a language’s feature-set and an API’s specification. Ideally, we’d like the code for the Go example and the JavaScript example to be nearly identical in ergonomics.
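To make that gap concrete, here is a minimal sketch of a helper that gets JavaScript a little closer to Go’s `defer span.End()`. The `withSpan` helper and the `createFakeTracer` stand-in are hypothetical names invented for illustration, not part of any OpenTelemetry package; the stand-in exists only so the sketch runs on its own. It handles span lifetime only, not context propagation, which in the browser still needs something like the zone-based provider used in the example above.

```javascript
// Stand-in tracer so the sketch runs on its own (an assumption for
// illustration); a real application would use a tracer from OpenTelemetry.
function createFakeTracer() {
  const finished = [];
  return {
    finished,
    startSpan(name) {
      const span = {
        name,
        events: [],
        ended: false,
        addEvent(eventName) {
          span.events.push(eventName);
        },
        end() {
          // Ending twice is a no-op, so the span is recorded only once.
          if (!span.ended) {
            span.ended = true;
            finished.push(span);
          }
        },
      };
      return span;
    },
  };
}

// Hypothetical helper: start a span, run `fn`, and always end the span,
// whether `fn` returns a value, throws, or returns a promise.
function withSpan(tracer, name, fn) {
  const span = tracer.startSpan(name);
  let result;
  try {
    result = fn(span);
  } catch (err) {
    span.end();
    throw err;
  }
  if (result && typeof result.then === 'function') {
    // Async work: end the span once the promise settles either way.
    return Promise.resolve(result).finally(() => span.end());
  }
  span.end();
  return result;
}

const tracer = createFakeTracer();
const total = withSpan(tracer, 'roll', (span) => {
  span.addEvent('rolled');
  return 4;
});
console.log(total, tracer.finished.map((s) => s.name));
```

The call site now reads roughly like the Go version, but the span still has to wrap the work, so the code structure is still dictated by the instrumentation.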

Annotation without structure

So how do we facilitate the idea that the easiest instrumentation in JavaScript should feel like `console.log`, particularly when you don’t have a nice way to thread context? Asynchronous context in JavaScript is somewhat lacking on the backend and entirely absent on the frontend. You also don’t have thread-local primitives or the ability to share state implicitly in the language. So, what can you do?

I think the key is to look at the underlying specifications and protocols of OpenTelemetry. It turns out that traces, spans, span events, and logs all build on top of an underlying primitive… That is, they’re all basically just events. In fact, almost everything in OpenTelemetry is events all the way down. 

  • Logs, for example, are events that are missing an EventName. (Pedantically speaking, in the spec, events are LogRecords with a non-empty EventName.)
  • Spans are events with certain types of metadata and semantics about how you should compose and build them.
  • Traces are a series of spans, which are, again, just events.
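As a rough illustration of the list above, the sketch below writes a log, a named event, and a span as plain objects. The field names and the choice to express a span as a start event plus an end event are assumptions made for this example; this is not the OTLP wire format or any SDK’s data model.

```javascript
// A log: an event that has no event name.
function logRecord(timestamp, body) {
  return { timestamp, body, attributes: {} };
}

// An event in the spec's sense: a log record with a non-empty event name.
function namedEvent(timestamp, eventName, attributes = {}) {
  return { timestamp, eventName, attributes };
}

// A span expressed as two events that share identifiers (an assumption of
// this sketch, chosen to show that a span needs nothing beyond events).
function spanAsEvents({ traceId, spanId, parentSpanId = null, name, start, end }) {
  const ids = { traceId, spanId, parentSpanId };
  return [
    namedEvent(start, 'span.start', { ...ids, name }),
    namedEvent(end, 'span.end', ids),
  ];
}

// A trace is then just the events of every span sharing a traceId, and a
// plain log can sit in the same stream.
const traceEvents = [
  ...spanAsEvents({ traceId: 't1', spanId: 'a', name: 'page-load', start: 0, end: 50 }),
  ...spanAsEvents({ traceId: 't1', spanId: 'b', parentSpanId: 'a', name: 'fetch', start: 10, end: 40 }),
  logRecord(20, 'cache miss'),
];
console.log(traceEvents.length);
```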

In other words, the semantics around how you have to write the events, what order to send them in, and what information to put in them are essentially the only things causing this friction in the OpenTelemetry API. If you remove some of the restrictions around how you structure your events and let the OpenTelemetry SDKs push some of the metadata burden onto the collector, you can remove a lot of the complexity by moving the state machine management out of your code’s control flow and into the SDK, or potentially even the language runtime itself. You could even put the burden of stitching spans and traces together onto a component designed to be stateful; while the OpenTelemetry Collector is currently stateless, it would be a natural place to handle that state.

My big idea here, which might sound controversial, is this: What if we throw away the idea that spans and traces have to have a certain begin-and-end structure that corresponds with code structure? Instead, what if we annotate everything in a way that allows the state machine of beginning and ending spans to be handled in the collector?
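A minimal sketch of what that could look like on the receiving side, assuming a hypothetical flat event shape with `type`, `traceId`, `spanId`, and `timestamp` fields: the `stitchSpans` function below is invented for illustration and is not an existing collector feature. It rebuilds a span from events that arrive in any order, so the emitting code never has to hold a span open.

```javascript
// Hypothetical collector-side step: rebuild spans from flat events,
// regardless of the order in which they arrive.
function stitchSpans(events) {
  const spans = new Map();
  for (const event of events) {
    if (!spans.has(event.spanId)) {
      spans.set(event.spanId, {
        spanId: event.spanId,
        traceId: event.traceId,
        name: null,
        parentSpanId: null,
        start: null,
        end: null,
        events: [],
      });
    }
    const span = spans.get(event.spanId);
    if (event.type === 'start') {
      span.name = event.name;
      span.parentSpanId = event.parentSpanId ?? null;
      span.start = event.timestamp;
    } else if (event.type === 'end') {
      span.end = event.timestamp;
    } else {
      // Anything else is treated as an annotation on the span.
      span.events.push({ name: event.name, timestamp: event.timestamp });
    }
  }
  return [...spans.values()];
}

// The end event arrives first here; the stitched result is the same.
const events = [
  { type: 'end', traceId: 't1', spanId: 's1', timestamp: 120 },
  { type: 'annotation', traceId: 't1', spanId: 's1', name: 'data-loaded', timestamp: 90 },
  { type: 'start', traceId: 't1', spanId: 's1', name: 'files-series-info', timestamp: 10 },
];
const [span] = stitchSpans(events);
console.log(span.name, span.end - span.start);
```

A real implementation would also need to decide how long to wait for a missing start or end event, which is exactly the kind of state a stateless collector cannot hold.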

Don't make me think

If we accept that OpenTelemetry is just events all the way down, then in theory, all you really have to do is write something like `trace.info()` and pass in an object. That’s it. You can have `trace.info`, `trace.error`, `trace.warn`, and so on. Given the ubiquity of logging, it shouldn’t be terribly surprising that logging-like APIs are quite a natural interface for instrumentation. This works perfectly for JavaScript and many other languages, including those that can’t provide metaprogramming-style abstractions, thread-local state, or other facilities that make OpenTelemetry’s API more ergonomic.
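Here is a minimal sketch of that kind of interface. `createTrace`, its `exporter` callback, and the event shape are all hypothetical and not part of any OpenTelemetry package; the point is only that the call site looks like `console.log` and carries no span lifecycle.

```javascript
// Hypothetical logging-style facade: every call emits one flat event.
// `exporter` is any function that accepts an event; `now` is injectable
// so the sketch is easy to test.
function createTrace(exporter, now = () => Date.now()) {
  const emit = (severity) => (attributes = {}) => {
    exporter({ severity, timestamp: now(), attributes });
  };
  return { info: emit('info'), warn: emit('warn'), error: emit('error') };
}

// Collect events in memory here; a real exporter would batch and send them.
const sent = [];
const trace = createTrace((event) => sent.push(event));

// An ordinary event handler: no wrapping callback, no span to end.
function onCheckoutClick(cartSize) {
  trace.info({ action: 'checkout-click', cartSize });
  if (cartSize === 0) {
    trace.warn({ action: 'checkout-empty-cart' });
  }
}

onCheckoutClick(0);
console.log(sent.map((event) => event.severity));
```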

While addressing the OpenTelemetry API’s design might make instrumentation more ergonomic, that alone isn’t sufficient. It’s a huge improvement! But certain additional functions would be really helpful, because it’s still challenging to design telemetry and propagate it in a way that’s maximally useful.

Taking inspiration from other types of instrumentation can give us some ideas of what might be useful here: What if there was a function to add metadata to the root span, regardless of where it is? Implementing this function would be tricky because span immutability is deeply central to the OpenTelemetry API and violating that would break other things. 

However, another way to approach that would be to hold durable references to sets of attributes, which could then be mutable. That could also massively cut down on network bandwidth, since events would carry a small reference rather than a full copy of the attributes.
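One way to picture this, as a sketch under my own assumptions: events stay immutable and carry only a reference, while the attribute set behind that reference can change until the events are resolved. `createAttributeRegistry` and the `attrsRef` field are hypothetical names, and where resolution would actually happen (SDK, collector, or backend) is left open.

```javascript
// Hypothetical registry of mutable attribute sets, addressed by reference.
function createAttributeRegistry() {
  const sets = new Map();
  let nextId = 1;

  function get(ref) {
    if (!sets.has(ref)) throw new Error(`unknown attribute reference: ${ref}`);
    return sets.get(ref);
  }

  return {
    // Register a set of shared attributes and get back a durable reference.
    create(initial = {}) {
      const ref = `attrs-${nextId++}`;
      sets.set(ref, { ...initial });
      return ref;
    },
    // Mutate the shared set; events that point at it are left untouched.
    update(ref, patch) {
      Object.assign(get(ref), patch);
    },
    // Expand an event's reference into concrete attributes.
    resolve(event) {
      const { attrsRef, ...rest } = event;
      return { ...rest, attributes: { ...get(attrsRef), ...event.attributes } };
    },
  };
}

const registry = createAttributeRegistry();
const pageAttrs = registry.create({ route: '/checkout' });

// Events are emitted, and never modified, with only the reference attached.
const emitted = [
  { name: 'page-load', attrsRef: pageAttrs, attributes: {} },
  { name: 'button-click', attrsRef: pageAttrs, attributes: { button: 'pay' } },
];

// Later, deep in the app, we learn something that belongs on every event.
registry.update(pageAttrs, { userTier: 'pro' });

const resolved = emitted.map((event) => registry.resolve(event));
console.log(resolved[0].attributes, resolved[1].attributes);
```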

Going further into the inspiration and ideas: What if there was a function to add a new span, but only if a parent span didn’t already exist, or otherwise attach things to the existing span? What if there was a function to take data and add it to every child span, maybe even recursively? What if there was a way to write instrumentation for a single function, but have that instrumentation remain useful when that function is called in a loop? And how could you write all of this without having to create custom processors or custom code to glue everything together in a way that makes sense for your use case?
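To show what the first and third of those questions might look like in practice, here is a small sketch. `createScope` and `spanOrAttach` are hypothetical names, and the tracker is deliberately synchronous only: without async context support it cannot follow work across promises or callbacks, which is the very limitation discussed earlier.

```javascript
// Hypothetical scope tracker (synchronous only, see the note above).
function createScope() {
  const finished = [];
  let current = null;

  // Start a span only when none is open; otherwise attach to the open one.
  function spanOrAttach(name, attributes, fn) {
    if (current) {
      current.events.push({ name, attributes });
      return fn();
    }
    current = { name, attributes, events: [] };
    try {
      return fn();
    } finally {
      finished.push(current);
      current = null;
    }
  }

  return { spanOrAttach, finished };
}

const scope = createScope();

// Instrumented once, with no knowledge of who calls it.
function loadItem(id) {
  return scope.spanOrAttach('load-item', { id }, () => `item-${id}`);
}

// Called on its own, loadItem produces its own span.
loadItem(1);

// Called in a loop, each call becomes an event on the single parent span.
scope.spanOrAttach('load-all', { count: 3 }, () => [2, 3, 4].map(loadItem));

console.log(scope.finished.map((s) => `${s.name}:${s.events.length}`));
```

The design choice here is that the instrumented function never decides whether it is a span or an event; the surrounding scope does.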

Imagining the future

This brings up a fascinating design space for me: If I were to look far into the future and imagine what telemetry could look like, what might be possible?

Let’s take a step into this hypothetical future and imagine… What if the instrumentation library didn’t really exist in the traditional sense, and the code you were writing was actually going to be generated on the fly and rewritten by the compiler? You could use this to make instrumentation code very lightweight and minimal, essentially being custom-built for your application at compile time. You could use the compiler’s information to insert code structure automatically, add lifetime annotations, control flow information, callstack data, and maybe even rewrite the telemetry to make more sense for your application’s needs.

This could facilitate very advanced instrumentation rewriting, minimization, and compression as well; imagine a source map-style construct where you send binary pointers to certain common sets of data. You could even imagine normalizing or denormalizing telemetry automatically. Or enabling code to be written for both streaming and batch use cases without code changes. Collectors could roll and unroll telemetry for you, collapse certain pieces of data or even completely rewrite the telemetry tree as needed.

Browsers and language runtimes could also improve existing limitations by ensuring proper thread-local storage support, context propagation, and support for context propagation inside async-like scopes. 

The ability to propagate information in a “magical” metadata object could also be a huge facilitator for building these types of structures; it could be thought of as similar to the durable attribute references I brought up earlier. If the language runtimes included explicit instrumentation support as well, then that instrumentation could be written in a heavily optimized manner, which would enable garbage-collected languages to benefit from low-to-zero overhead instrumentation. I find this exciting because it is an actual goal of the OpenTelemetry project, so the potential isn’t out of reach, but it’ll take a lot of coordination.

In addition, I’d love to see a world with extremely rich data for local debugging and the ability to naturally reduce that data for production deployment. Then you wouldn’t blow up everything with verbose debugging data when deploying to remote servers, but when debugging something locally, you could easily go all the way down to the system call level or even inspect the hardware to get extremely granular views of every interface, no matter how highly abstracted. 

Just because a language is high-level doesn’t mean you shouldn’t be able to examine the details when you need or want to. I think we could build languages in the future that allow this ergonomically, letting you instrument code for production while getting rich instrumentation for development, without having to instrument the system twice. Luckily, much of what is described here is an explicit outcome of the upcoming OpenTelemetry Profiling signal, so I hope to see a lot of progress in the next few years.

If instrumentation were more tightly embedded into languages, then one could also imagine the uniform integration of other metadata, such as: debugging information, performance profiling, feature flags, marketing data, security events, and experimentation data. Instead of each platform building its own SDKs and needing their own implementations, they could use instrumentation features built deeper into the language runtime itself. You could instrument once and feed that data into various platforms – from observability and monitoring to security and experimentation – all using the same code.

I like to imagine that future as being one where cross-functional collaboration is more accessible and where understanding the complex system being built becomes a truly companywide endeavor. I’d love to see that happen.

Getting to better

All of that starry-eyed and sparkly future musing is great, but we might not realistically see those types of changes for years or even decades. It’d also require a lot of coordination, and it’s not clear whether the communities involved even want this to happen. So let’s step back and return to reality. The future is fun to think about, but what can we start with today?

Here’s my thinking: Since events already exist in OpenTelemetry, and since almost everything is events under the hood, we could build support for “just sending events” as an OpenTelemetry specification. Think of it as an alternative representation of spans, traces, logs, and span events. This would give us a ton of freedom to write whatever SDK a language needs while retaining full compatibility with vendors. We’re pretty close to being able to do this today, as all it would require is a modified SDK implementation and a modified, stateful OTel Collector. If those two things happened, vendor compatibility would stay the same, and we’d get to experiment with what an event-based representation could look like.

Some observability vendors are already adopting this telemetry flexibility mindset in their products, particularly those involved in the mobile space. One great example of that is Embrace’s User Journeys functionality, where you can create custom user flows from existing telemetry you’re already collecting without having to restructure your code. If you’re interested in learning more, you can check out their on-demand webinar.

An event-based representation of telemetry data would also facilitate interoperating between traces, spans, logs, and span events. It would also allow us to more easily migrate from logs to traces using the same code. This doesn’t mean that I think we’re going to get rid of the current SDK, however. It’s very useful for the backend and it also represents a well-done “pay now” API where the client does some more work to handle the stateful nature of telemetry and, in exchange, the collector can be stateless. That means it’s very easy for vendors to be OpenTelemetry-compatible.

I can easily see a future in which people choose between the “pay now” model where state complexity lives on the client vs. a “pay later” model where state complexity is in the collector, depending on what makes sense for their environment. High-volume microservices likely benefit from a “pay now” model, and the frontend works best with a “pay later” model.

Putting the two together and being able to tie it all into a coherent context would unlock the next generation of understanding our systems. I can already see bits of this starting to happen and it makes me really excited for the future of OpenTelemetry in the browser. 

I’ve mentioned in other blog posts that the Cloud Native Computing Foundation (CNCF) has a new dedicated Browser Special Interest Group (SIG), which is actively working on improving browser support. I think that is going to create fascinating developments, and I truly look forward to seeing what it becomes one day. If you’d like to learn more about what the Browser SIG is working on, check out this on-demand webinar. As always, the magic of OpenTelemetry for me has been in its community, and especially in this community’s willingness to come together and build a better future for everyone. Come join the party!
