Tracing APIs: Combining Spring’s Sleuth, Zipkin & ELK


Tracing bugs, errors and the causes of poor performance can be cumbersome, especially when functionality is distributed over multiple microservices.

To keep track of these issues, using an ELK stack (or any similar system) is already a big step forward: it creates a clear overview of everything a service does and makes such issues easier to find. Bugs can often be traced far more easily with ELK than with a plain log file – if one is even available. Still, this approach can be refined; for example, you may want to see the trace logging for one specific event only.

This is where Sleuth comes in handy.
Inspired by systems such as Dapper, Zipkin (we will come to that one later) and HTrace, Sleuth ensures that every request that enters a Spring application starts a new trace. Every internal event (which may in turn trigger a call to another application) opens a new span and closes it once that event is finished. When the request itself is finished, the trace is closed, and a new one starts when the next request comes in.

Getting Sleuth to work in your Spring Boot application is as easy as putting the dependency in your Maven pom.xml, as described at https://cloud.spring.io/spring-cloud-sleuth/:

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-sleuth</artifactId>
            <version>2.0.0.M5</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-sleuth</artifactId>
    </dependency>
</dependencies>

<repositories>
    <repository>
        <id>spring-milestones</id>
        <name>Spring Milestones</name>
        <url>https://repo.spring.io/libs-milestone</url>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
</repositories>

Once you integrate Sleuth in all the (micro-)services, it will automatically pass the trace id on to the next service being called (if you are using RestTemplate; for other REST client implementations you may need to add your own configuration), so that the called service inherits the value. This ensures that all logs behind one originating event can be grouped under one trace. By itself this already comes in handy, as you can filter your logging by trace id and easily follow the flow of one event across the services.
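Note that Sleuth only instruments HTTP clients it knows about; for RestTemplate this means the instance has to be a Spring-managed bean rather than one you construct yourself inside a class. A minimal sketch (the class and bean names are just illustrative) could look like this:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestTemplateConfiguration {

  // Exposing RestTemplate as a bean allows Sleuth to instrument it,
  // so the X-B3-* trace headers are added to every outgoing call.
  @Bean
  public RestTemplate restTemplate() {
    return new RestTemplate();
  }
}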

All four services can be filtered by the same trace id, but each service has its own span ids. (Only service A is set to debug level; the other services are set to info.)
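The ids show up in the log lines because Sleuth adds them to the logging MDC and extends Spring Boot’s default log pattern with an [application,traceid,spanid,exportable] section. A log line therefore looks roughly like the following (the ids, logger and message are made up for illustration):

2018-01-07 18:16:05.123  INFO [service-a,2485ec27856c56f4,9aa2572f1cc682fa,true] 12345 --- [nio-8080-exec-1] com.example.ServiceAController : calling service b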

When it comes to investigating performance issues, we can already analyse the logs – with the trace ids in place – and look at the timing of the services. But this is where Zipkin comes in handy. Zipkin is, as its authors say, a distributed tracing system. It runs as a Spring Boot application and its default storage is in-memory, but you can also choose one of the available database options to store your data, with the help of their Spark job. You can start using Zipkin locally with the Docker image available at https://hub.docker.com/r/openzipkin/zipkin/. By default the server is available at http://localhost:9411/zipkin.
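If you have Docker installed, the command from that Docker Hub page is all you need to get a local instance up:

docker run -d -p 9411:9411 openzipkin/zipkin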

Implementing Zipkin in your (micro-)services is, again, a matter of adding a dependency to your Maven pom.xml:


<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-zipkin</artifactId>
</dependency>

Of course, you may want to change configuration such as the base URL; those properties are available under spring.zipkin.*.
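For example, pointing your services to a Zipkin server that does not run on localhost could look like this in application.properties (the host below is just a placeholder, and the sampler property name differs between Sleuth versions, so check the documentation of the version you use):

spring.zipkin.baseUrl=http://zipkin.example.com:9411
# Report every trace to Zipkin instead of sampling only a fraction (Sleuth 2.x property name)
spring.sleuth.sampler.probability=1.0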

As Sleuth already opens and closes spans around the events it instruments (such as requests made with RestTemplate), Zipkin can be integrated to visualize those spans correctly.

Each trace can be found separately, and you can also search per available service.
So, let’s look at an example call made to service a:

Searching in service a to find our trace.

When you have found the trace of interest, you can click on it and an overview of the underlying spans is shown:

View of the same call as above.

In this example, we see the trace with spans going from service a to b, on to d, and from a to c. We can also see the time it took to perform each call (or span).

When a trace contains an error, it is shown in red instead of blue.

A trace that has gone wrong. Inside this trace, you can still find the called services with their spans and find out where it could have gone wrong.

Spans of the errored trace.

You can click on a span and a popup with extra information will open. When you then click on “more info”, you will find the trace id & span id. This trace id can be used to look up the logs in ELK again and find the exception.

Extra information of a span.

Finding the error in the logs. Expanding the message will give the information needed to trace the exception that is thrown.

Of course, finding bugs or exceptions can be done perfectly well without Zipkin, and as https://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html suggests, it is important to consider the difference between logging and instrumentation tooling.

But when it comes to creating an overview of multiple called microservices over time and tracing timing issues, Zipkin really helps to gain insight into performance problems. So let’s look at an example:

A call to service a is taking more than 30 seconds.

There are complaints that a call to some endpoint of service a is taking around 30 seconds – which is unexpected.

As seen in the logs, we can already confirm the call runs from 18:16:05 to 18:16:35.
The cause cannot be fully confirmed by looking at the logs, although an educated guess can be made.
In Zipkin, all we need to do is select service a (the root service) and filter the traces by datetime and endpoint, and the culprit can be spotted at a glance:

The culprit. Here we can see the value that Zipkin, in combination with Sleuth, can add when working with microservices.

One last thing to add is something we need to customize ourselves: the possibility to group by conversation, where several follow-up calls belong together because they form a cascade of events. When a user starts such a conversation and wants to trace it as a whole, it is convenient to have an id one level above the trace id – which on its own would only return one call of the conversation.

With Sleuth we can add baggage items & tags:

Baggage items are values you can add to a trace; they are propagated as part of the trace context and thus passed on to the other services. They are, however, not sent to Zipkin.

Tags, on the other hand, are normally used to annotate spans and are sent to Zipkin. By adding a web filter (or any other interceptor you prefer), we can create our own id scheme that provides a grouping over multiple calls – a conversation:

@Configuration
public class SleuthConversationConfiguration {
 
  public static final String TAG_NAME = "X-B3-CONVID";
 
  @Bean
  @Order(TraceFilter.ORDER + 5)
  public GenericFilterBean customTraceFilter(final Tracer tracer) {
    return new GenericFilterBean() {
 
      @Override
      public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse,
          FilterChain filterChain) throws IOException, ServletException {
        if (servletRequest instanceof RequestFacade) {
          //Get header conversation id IF available -> overrules any other one
          String existingConversationId = ((RequestFacade) servletRequest)
              .getHeader(TAG_NAME);
          Span span = tracer.getCurrentSpan();
          if (existingConversationId != null && span.getBaggage().get(TAG_NAME) == null) {
            span.setBaggageItem(TAG_NAME, existingConversationId);
          //IF the conversation id is not available in header of request, we try to get it from baggage item
          } else if (existingConversationId == null) {
            existingConversationId = span.getBaggage().get(TAG_NAME);
            //If not available, generate one for us
            if (existingConversationId == null) {
              existingConversationId = UUID.randomUUID().toString();
              span.setBaggageItem(TAG_NAME, existingConversationId);
            }
          }
          //Add conversation id as tag in order to make it available in Zipkin
          tracer.addTag(TAG_NAME, existingConversationId);
          //Add to MDC in order to make it visible in logging
          MDC.put(TAG_NAME, existingConversationId);
          //Will continue span
          tracer.continueSpan(span);
          //Will ensure the conversation id is also available in the response
          HttpServletResponse response = (HttpServletResponse) servletResponse;
          if (response != null) {
            response.addHeader(TAG_NAME, existingConversationId);
            response.addHeader("Access-Control-Expose-Headers", TAG_NAME);
          }
        }
        filterChain.doFilter(servletRequest, servletResponse);
      }
    };
  }
}

The part with the headers is optional; for service a it is convenient, as the user can pass their own id to the service to ensure that that conversation id is used in all the calls. For services b, c and d, however, Sleuth already ensures that the id is passed along as a baggage item. Therefore, the part in which we try to get it from the header is unnecessary, and the code could look as follows:

@Configuration
public class SleuthConversationConfiguration {
  public static final String TAG_NAME = "X-B3-CONVID";
 
  @Bean
  @Order(TraceFilter.ORDER + 5)
  public GenericFilterBean customTraceFilter(final Tracer tracer) {
    return new GenericFilterBean() {
 
      @Override
      public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse,
          FilterChain filterChain) throws IOException, ServletException {
        Span span = tracer.getCurrentSpan();
        String existingConversationId = span.getBaggage().get(TAG_NAME.toLowerCase());
        if(existingConversationId == null){
          existingConversationId = UUID.randomUUID().toString();
          span.setBaggageItem(TAG_NAME, existingConversationId);
        }
        tracer.addTag(TAG_NAME, existingConversationId);
        MDC.put(TAG_NAME, existingConversationId);
        tracer.continueSpan(span);
        filterChain.doFilter(servletRequest, servletResponse);
      }
    };
  }
}

Furthermore, you need to add this extra field to the logback configuration (or whichever log configuration you use) of all the services involved, by using %X{X-B3-CONVID:-} in the log pattern; otherwise you will not see it in your logs!
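A minimal logback-spring.xml sketch that prints the conversation id next to the Sleuth ids could look like this (the pattern is only an example; merge it into your existing configuration):

<configuration>
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <!-- X-B3-TraceId and X-B3-SpanId are filled by Sleuth; X-B3-CONVID is our own MDC field -->
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level [%X{X-B3-TraceId:-},%X{X-B3-SpanId:-},%X{X-B3-CONVID:-}] %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="CONSOLE"/>
  </root>
</configuration>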

If you have configured everything correctly, you can make some calls with a header “X-B3-CONVID: OUR-UNIQUE-IDENTIFIER”.
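For instance, with curl (the port and endpoint are placeholders for whatever service a exposes):

curl -H "X-B3-CONVID: OUR-UNIQUE-IDENTIFIER" http://localhost:8080/some-endpoint

If everything is wired up, you may then see the following in the logs: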

Using our own unique identifier provides a way to get all the logging of the conversation – that is, of the multiple calls that make up the cascade.

In Zipkin, we can filter all the traces of a conversation by the tag we created.

This way, all related traces are shown and we can look into them if we want to gather more information about their performance and/or error handling.

In short, Sleuth & Zipkin are helpful tools to create a more concrete overview of how calls are distributed over (micro-)services, and they can enhance the tracing of bugs or performance issues. With some additional tweaking, we can even create overarching ids to trace conversations. They will not replace ELK or similar services, but they are a helpful addition for monitoring your applications.

Useful resources: