Event Tracking with Finatra and Spark

In this article I want to give a brief beginner's tutorial on how to set up an event tracking system for client-side events on multiple websites with Twitter's Finatra, and how to aggregate batches of events with the computation framework Apache Spark. Here we will cover the first part; I will write a follow-up article about aggregating the captured events with Spark.

TLDR: If you just want to run or see the full example code, visit this repository. The repository contains a Dockerfile, which is all you need to set up the Docker container including all the dependencies. Skip to the last paragraph for a detailed description of how to set up the container.

Event Tracking

Event tracking can be the first step in the ETL pipeline. After aggregation, important use cases could be descriptive statistics (e.g. page views, bounce rate, conversion rates etc), anomaly detection (e.g. abuse, crawling, hacking attempts) and A/B testing (e.g. different layouts).

Before I start with the requirements, I want to note that in this tutorial I will focus on the implementation with Finatra, Spark and Docker, without addressing the very important aspects of load balancing, high availability and fault tolerance. Treat this tutorial as a starting point for using these technologies in your own analytics stack. Also keep in mind that you should respect the privacy rights of your users. If they have enabled the 'Do Not Track' option, you shouldn't track them.

Requirements

Let us start with the requirements. In this example, we want to track client-side events on multiple websites and optionally multiple versions (for A/B testing). For sending client-side events to the tracking API we need to write client-side JavaScript code. The code needs to handle the user session, identify the user with a cookie, and send events to our tracking API. In this project, we use the same client-side tracking code as in the previous article "Event Tracking with Flask".

Tracking Pixel

Due to the same-origin policy, we cannot make cross-site AJAX calls to a different host (different from the host the site is served from), so we have to use a tracking pixel, also known as a web beacon. Basically, the tracking code inserts a tracking pixel into the DOM with the URL of the tracking server. The URL consists of the image location and the tracking data as GET parameters. Keep in mind that if you want to track events on an HTTPS website, the tracking API also has to be served over HTTPS, see mixed content protection. For free SSL certificates, check out the Let's Encrypt initiative. Further note: you can also transfer event data via cookies, see here.
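For illustration, this is how such a pixel URL can be assembled: each event field is URL-encoded and appended as a GET parameter to the image location. The endpoint path and field names below are assumptions for the sake of the example; the real client-side code lives in user.mustache:

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

object PixelUrl {
  // Build a tracking-pixel URL by URL-encoding each event field
  // and appending it to the base URL as a GET parameter.
  def build(base: String, params: Map[String, String]): String = {
    val query = params
      .map { case (k, v) =>
        val key   = URLEncoder.encode(k, StandardCharsets.UTF_8.name)
        val value = URLEncoder.encode(v, StandardCharsets.UTF_8.name)
        s"$key=$value"
      }
      .mkString("&")
    s"$base?$query"
  }
}
```

On the website, the tracking script would set this string as the `src` of a 1x1 image element, which makes the browser issue the GET request to the tracking server.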

It makes sense to fetch the tracking script from the tracking API itself. With this approach we do not have to maintain the tracking script on each website; we only have to maintain it in one location. So, on the server side we provide two API endpoints: a) a GET endpoint to fetch the tracking script and b) a GET endpoint to retrieve the tracking pixel with event data included as GET parameters.

Implementation

Finatra is built on top of Finagle and provides an easy-to-use HTTP endpoint specification inspired by the Ruby web framework Sinatra and its Python equivalent Flask. Finatra also has an admin interface and can integrate with Twitter's Zipkin. We will use SBT to automatically retrieve the dependencies and to build a fat JAR including all dependencies. The build.sbt file contains the project details as well as the required dependencies. The only code needed to spawn a Finatra server is the following:

object TrackingServerMain extends TrackingServer

class TrackingServer extends HttpServer {
  override protected def configureHttp(router: HttpRouter): Unit = {
    router.filter[LoggingMDCFilter[Request, Response]]
      .filter[TraceIdMDCFilter[Request, Response]]
      .filter[CommonFilters]
      .add[EventDataController]
      .add[ClientController]
  }
}
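For orientation, a minimal build.sbt for such a server could look like the following sketch. The artifact names are Finatra's real coordinates, but the version numbers here are assumptions; use the ones pinned in the repository:

```scala
name := "tracking-service"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Finatra HTTP server plus logging integration
  "com.twitter"    %% "finatra-http"    % "2.1.6",
  "com.twitter"    %% "finatra-slf4j"   % "2.1.6",
  "ch.qos.logback"  % "logback-classic" % "1.1.7"
)

// Name of the fat JAR; requires the sbt-assembly plugin in project/plugins.sbt
assemblyJarName in assembly := "tracking-service.jar"
```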

The ClientController serves the tracking code, see the file ClientController.scala. We define a GET endpoint to serve the tracking script, see the file user.mustache. To identify the site, we include the site id and version id inside the tracking script. We can use the Mustache templating system to insert both into the script. In the ClientController we need to define the following method:

...
get("/user.min.js") { request: Request =>
  val clientReq = deserialize(request)
  if (clientReq.site.isDefined && clientReq.version.isDefined) {
    ClientView(clientReq.site.get, clientReq.version.get)
  } else {
    response.badRequest()
  }
}
...
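The deserialize helper used in the endpoint above binds the GET parameters to a case class. Stripped of Finatra's request-binding machinery, the idea can be sketched with plain Scala; the case class and helper below are hypothetical stand-ins for illustration, not the repository's actual implementation:

```scala
// Hypothetical stand-in for the request-to-case-class binding step:
// pull the optional GET parameters out of a query-parameter map.
case class ClientRequest(site: Option[String], version: Option[String])

def deserialize(params: Map[String, String]): ClientRequest =
  ClientRequest(site = params.get("site"), version = params.get("version"))
```

Only when both site and version are present does the controller render the ClientView; otherwise it answers with 400 Bad Request.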

The ClientView case class is annotated with @Mustache, see the file ClientView.scala and the Mustache templating documentation. The deserialize function extracts the relevant GET parameters and creates a case class instance from them. The tracking script is loaded and evaluated on the client side, i.e. on the website, and fired events are pushed via the tracking pixel to the EventDataController. The EventDataController receives the events on the endpoint /h, see the corresponding file EventDataController.scala.

...
get("/h", name = "pixel_endpoint") { request: Request =>
  val eventDataRequest = deserialize(request)
  warehouseService(eventDataRequest) flatMap {
    case Success(result) =>
      response.ok.body("").toFuture
    case Failure(e: CouldNotWriteValues) =>
      warn("Could not write values; ask client for resending")
      response.internalServerError.toFuture
    case Failure(e) =>
      warn("Unexpected error while writing values")
      response.internalServerError.toFuture
  }
}
...

The WarehouseService writes the data to flat files or forwards it to other web services or message buses like RabbitMQ or Kafka, so other applications can consume the live data. In the current implementation it saves the event data to CSV flat files in day batches, see the file WarehouseService.scala and the folder specified in application.conf. I hope that these three code fragments give you a better understanding of how to define an event tracking service with Finatra. In the next section, I will explain how you can modify and run the code in the repository for your own event tracking stack.
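The day-batching itself boils down to appending each event as one CSV line to a file named after the current date. The following stand-alone sketch illustrates the idea; the directory layout and column order are assumptions, and the repository's WarehouseService.scala remains the authoritative version:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}
import java.time.LocalDate

class CsvWarehouse(baseDir: String) {
  // Append one event as a CSV row to the current day's batch file,
  // e.g. /var/data/tracking/2016-05-01.csv
  def write(fields: Seq[String]): Unit = {
    val file = Paths.get(baseDir, s"${LocalDate.now}.csv")
    Files.createDirectories(file.getParent)
    val line = fields.mkString(",") + "\n"
    Files.write(file, line.getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)
  }
}
```

Because the file name only changes once a day, downstream consumers such as a Spark job can simply pick up yesterday's file as a closed batch.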

Getting Started

In order to run the described code on your own server for event tracking, you need to change some variables before you build the Docker container. Make sure you have changed the tracking API URL in the tracking code, see the file user.mustache. Next, we need to build the Docker container. Assuming you are inside the project root, the following commands build the image and create and start the container.

# cd docker
# docker build -t trackingservice .
# docker run -d -v /var/data/tracking:/var/data/tracking \
   -p 8888:8888 trackingservice

If you want to use it in production, it is a good idea to use Nginx with a virtual host and an upstream definition pointing to the Docker container. See the example Nginx site configuration in docker/nginx-upstream. As I mentioned earlier, it is important to serve the API over HTTPS, otherwise we cannot track HTTPS sites. For this, you can request a free SSL certificate from the Let's Encrypt initiative. If you want to build and run it in the cloud: I built it successfully on a 4 core, 8 GB DigitalOcean droplet, and I am running the container on a 1 core, 512 MB droplet.

Conclusion

I have been using Scala for quite a while now, but I am still surprised how powerful yet easy the build tool SBT is compared to the XML-heavy Maven, which I also used years ago. It simplifies a lot, just check the Dockerfile for building the container. Setting up routes in Finatra is a piece of cake, and it comes with an informative admin interface for monitoring. In my production environment, I also push the event data as well as tracking API metrics to an InfluxDB database and use Grafana for visualization. I hope you enjoyed this article; I will try to write the Spark follow-up soon.
