
Jive Developers

2 Posts authored by: pmatern

Today we're announcing the Open Source release of Tasmo, a key part of Jive's cloud architecture. An early snapshot is available now on GitHub. We believe Tasmo is an important step forward in web-scale architectures, so we have made Tasmo available under the Apache 2.0 license and look forward to growing a vibrant Tasmo community.


As part of the continual evolution of Jive's core infrastructure, we've been heavily investing in new service-oriented technologies. The engine at the heart of this evolution is a service we call Tasmo. Think of it as a replacement for a relational database, and for many use cases it offers some significant advantages.

Jive has followed the same general evolution as much of the industry with respect to how we store and retrieve data in our applications:

Stage 1 --- Store and retrieve all data from a relational database.

Stage 2 --- Make more and more extensive use of sharding and distributed caching.

Stage 3 --- Use purpose-specific, non-relational storage technologies. In our case one of them is HBase, now running in several of our production systems.

Tasmo is at the center of stage 4 of this evolution. It allows developers to declare their logical data model and handles the details of the underlying storage in HBase tables. Equally important, Tasmo effectively performs at write time the selects and joins that would otherwise be done at read time in a traditional database-backed web application. This translates into far less work being done while loading a page and therefore vastly improves application response time. Essentially, Tasmo performs the same function as materialized views in a relational database -- just without the database.


Our motivation for Tasmo stemmed from the knowledge that our web application displays the same interconnected data in quite a few different ways, and that the ratio of reads to writes is at least 90/10 in typical installations. Before Tasmo, the many different views of the same data resulted in a lot of different joins performed at read time, and the high read-to-write ratio meant those joins were performed redundantly many times over. It seemed obvious that in a read-heavy system it would make sense to shift work to the write side and optimize for reading.




Tasmo-based applications declare a set of events and views, where the events are the form in which data will be written and the views are the multiple forms in which it will be read. A particular view constitutes the joining and filtering of event data, maintained at write time. Here are a couple of event declarations and a view declaration:
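(The original declaration snippet doesn't survive in this page. As a stand-in, here is a hypothetical way such declarations might look, expressed as plain Python data; the names and syntax are illustrative only, not Tasmo's actual declaration format.)

```python
# Hypothetical event and view declarations (illustrative only; not
# Tasmo's actual declaration syntax, which is not reproduced here).

events = {
    # An event declaration is a type name and a set of fields; fields
    # hold either literal values or references to other event types.
    "User":    {"userName": "literal"},
    "Content": {"title": "literal", "author": "ref(User)"},
}

views = {
    # A view roots on one event type, selects fields of the root, and
    # spans reference fields to fold in data from related events.
    "ContentView": {
        "root": "Content",
        "select": ["title"],
        "paths": [("author", ["userName"])],  # follow the ref, select fields
    }
}
```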




A given event declaration is just a type name and a set of fields.  These fields can either hold literal values or references to other events. A group of event declarations which point to each other via reference fields form a structural graph. A view declaration roots on one event type and selects whatever fields of the root event are needed. A view declaration also spans reference fields in order to fold in data from related events. In other words, declaring a view involves choosing paths through the structural graph and selecting fields of the nodes at the ends of those paths. This is all performed at application design time.


How It Works


At runtime, applications fire instances of events to Tasmo. An event instance conforms to some declared type and carries the identity of the object to modify and one or more populated fields which hold only the updated state. The Tasmo service itself ingests batches of these events and writes down their values in HBase tables - both the literal values and the references. It then consults the view declarations to see what view instances it needs to create or update to reflect the changes made by the event. In order to do this, Tasmo traverses the stored object graph and writes down the modified view information path by path into the HBase table which holds the final output of the write processing.
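The write path described above can be sketched with in-memory dictionaries standing in for the HBase tables. All names here are illustrative assumptions, not Tasmo's API: an event carries the identity of the object to modify plus only the changed fields, and ingesting it updates stored state and then refreshes every view instance the change affects.

```python
# Hypothetical sketch of write-time view maintenance (illustrative
# names; dicts stand in for the HBase tables Tasmo actually uses).

object_store = {}  # (event_type, object_id) -> stored fields (literals and refs)
view_store = {}    # (view_name, root_id) -> materialized view row

def ingest(event_type, object_id, fields):
    row = object_store.setdefault((event_type, object_id), {})
    row.update(fields)              # write down only the updated state
    refresh_views(event_type, object_id)

def refresh_views(event_type, object_id):
    if event_type == "Content":
        # Traverse the stored graph path by path and write the view row.
        content = object_store[("Content", object_id)]
        author = object_store.get(("User", content.get("author")), {})
        view_store[("ContentView", object_id)] = {
            "title": content.get("title"),
            "author.userName": author.get("userName"),
        }
    elif event_type == "User":
        # A User change may affect views rooted at any Content that refers to it.
        for (etype, cid), c in list(object_store.items()):
            if etype == "Content" and c.get("author") == object_id:
                refresh_views("Content", cid)

ingest("User", "u1", {"userName": "alice"})
ingest("Content", "c1", {"title": "Hi", "author": "u1"})
ingest("User", "u1", {"userName": "alicia"})   # rename flows into the view
print(view_store[("ContentView", "c1")])
# {'title': 'Hi', 'author.userName': 'alicia'}
```

The point of the sketch is the last line: by the time the third event has been processed, the view row already reflects the rename, so no join is needed when the view is read.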




When it comes time to read a view instance, an application just supplies the type of view to read and the identity of the root object to a library which knows how to read the view table. That library does a single column scan of a single row which holds the full tree of data that is the view instance. Under load, this is often orders of magnitude faster and more scalable than performing the equivalent database queries we used before. The resulting code path is also noticeably simpler, with no distributed caches sitting on top of the data and no relational-to-object mapping code.
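The read side is small enough to sketch in a few lines. Again the names are hypothetical and a dict stands in for the view table; the point is that the full tree of data for a view instance lives in one row keyed by view type and root object identity, so reading it is a single keyed lookup.

```python
# Hypothetical reader sketch (illustrative names): one row per view
# instance, keyed by (view type, root object id).

view_store = {
    ("ContentView", "c1"): {"title": "Hi", "author.userName": "alicia"},
}

def read_view(view_type, root_id):
    # Stands in for the single column scan of a single HBase row --
    # no joins, no distributed cache, no relational-to-object mapping.
    return view_store[(view_type, root_id)]

print(read_view("ContentView", "c1"))
# {'title': 'Hi', 'author.userName': 'alicia'}
```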


Check It Out


Tasmo is available now as an early snapshot. There are modules which bootstrap all the parts together for writing and reading, as well as a module which lets you run it all in one process. Feel free to check out the code in our repo, and the examples and explanations in our wiki. Over the next several weeks we will be building out documentation and examples, as well as continuing to push many bug fixes and improvements. There is still a long road to version 1.0, but we'd welcome feedback and contributions!


Also, check out the SlideShare presentation for more details:


Some of our current work here at Jive involves building systems composed of a bunch of logical services, each hosted in multiple processes per service. Knowing the state of the system involves using multiple methods to gather a gestalt sense of healthy or unhealthy. One simple method we use is to have a 'status' REST endpoint hosted by every service instance. In implementing the code backing this endpoint, we were faced with the questions of what exactly it means to be unhealthy, and what flavors of unhealthy we need to differentiate. It seemed simple and obvious until we actually had to think about it. Here's what we came up with for states that would fail this simple status-check-via-REST-call:


Before we even access the status endpoint of a service there are a couple failures we can trap:

1. Process down. That's the most obvious one, right? If it's dead, that's bad. But what if it's being restarted as part of a deploy? Our status check will return false if we can't connect to the endpoint, but to avoid false failure alerts we have to coordinate between deploys and status check timing.

2. Process is there, but the request for status itself times out. If the service can't serve an HTTP request, that's also bad. It may not be any fault of the code or the host it's running on; the service might just be overloaded, and the real problem is that we need to stand up more instances.


Now on to the states which the status call implementation reports:

1. Request processing latency mean is over some configured threshold. Some of our services process REST calls, others read a stream of events off of a bus. Each of these consumer types tracks a round-trip latency metric that we can use for this check.

2. Error-level messages have been logged in the past X seconds. The idea here is that if any code in the service knows there is a problem, it will log it, so the fact that an error-level log message went out means something somewhere thinks it is unhealthy. Using this check also has a couple of side effects. First, it eases the first step an ops or dev (or devops) person goes through when looking for trouble: figuring out whose logs to look at. When you have 50 logs to pick from, it's nice to know up front which have recent errors. Second, if a developer logs at error level when they shouldn't, those messages are going to trip alerts and someone will come ask them about it. It's a good incentive to be conscientious about what log levels you use.

3. Error-level messages have been logged while trying to interact with some other part of the system in the past X seconds. This is exactly the same as the previous check, but it lets a service indicate the difference between "I have problems" and "Something I rely on has problems." For example, if there is a problem connecting to HBase, or Kafka, or making a REST call to some other service, the resulting error-level log messages will be reported separately. This helps us follow a chain of failures to the source of the problem.

4. Inability to log, send metrics, or send service announcements to indicate presence. We rely on a service being able to do these basic things to participate in the system, so inability to do them results in a status check failure.

5. Percentage of time spent in GC is over some configured threshold. We often go to the GC logs to see what's going on with a slow process, and this check gives us an indication of whether there might be a reason to look there.
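The "recent error-level logs" idea from check 2 above can be sketched with a custom handler. This is a minimal illustration, not our production code: the handler records the timestamp of the last ERROR record, and the status check fails if one arrived within the configured window.

```python
import logging
import time

# Sketch of a "recent error-level logs" health check (illustrative,
# not production code): the handler only sees ERROR-and-above records
# and remembers when the last one arrived.

class ErrorWindowHandler(logging.Handler):
    def __init__(self, window_seconds=60):
        super().__init__(level=logging.ERROR)  # ignore anything below ERROR
        self.window = window_seconds
        self.last_error = None

    def emit(self, record):
        self.last_error = record.created       # timestamp of the log record

    def healthy(self, now=None):
        now = time.time() if now is None else now
        return self.last_error is None or now - self.last_error > self.window

handler = ErrorWindowHandler(window_seconds=60)
log = logging.getLogger("svc")
log.addHandler(handler)

log.warning("not counted")      # below ERROR, filtered out by the handler
assert handler.healthy()
log.error("something broke")    # trips the check for the next 60 seconds
assert not handler.healthy()
```

The per-dependency variant in check 3 is the same idea with one handler (or one counter) per downstream system, so HBase trouble and Kafka trouble show up as separate signals.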


In short, we've decided to track:

1. Dead

2. Unresponsive

3. Upset about itself

4. Upset about something else

5. Cut off from the rest of the system

6. Memory problems


Any service can add its own health checks which target some application specific state, but these are standard across all of them. They give us an indication not only that something is wrong, but just as important, they can point us in an initial direction when doing a root cause analysis.
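One way to support both the standard checks and service-specific ones is a small pluggable registry that the status endpoint aggregates. The sketch below is a hypothetical shape for this, with hard-coded stand-in metrics where real services would read their latency and GC counters:

```python
# Sketch of a pluggable status check (illustrative names): each check
# is a named function returning True/False, and the status endpoint
# reports the individual results plus an overall verdict.

checks = {}

def health_check(name):
    def register(fn):
        checks[name] = fn
        return fn
    return register

@health_check("latency")
def latency_ok():
    mean_ms = 40            # stand-in for a real rolling latency metric
    return mean_ms < 250    # configured threshold

@health_check("gc")
def gc_ok():
    gc_time_pct = 3         # stand-in for % of wall time spent in GC
    return gc_time_pct < 10

def status():
    # What a GET on the 'status' endpoint would serialize and return.
    results = {name: fn() for name, fn in checks.items()}
    return {"healthy": all(results.values()), "checks": results}

print(status())
# {'healthy': True, 'checks': {'latency': True, 'gc': True}}
```

Reporting the per-check results, not just the overall boolean, is what makes the endpoint useful as a first step in root cause analysis: the response itself says whether the service is upset about itself, something else, or its memory.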
