
Authors: bill.zhang, Senior Search Engineer at Jive Software, and matt.wheeler, Senior Software Engineer at Jive Software


Over the past year we have done some experimentation with different index structures to efficiently support multiple tenants (or customers) in a single search index.  We thought we'd share some of our findings.


Multi-tenant search with Lucene

There are many ways to achieve multi-tenancy with Lucene.  The usual approach is to create a tenant ID field and apply a filter on that field to every query.  In this post we describe another approach: prefixing each field name with a tenant ID, so that every tenant has its own set of fields in the index. We believe this approach leads to much better performance than the usual one.


Implementing multi-tenancy using a filter field

Today most enterprise search engines support "fielded search" (Apache Lucene - Query Parser Syntax): when you search for a keyword, you can specify the "field" of the keyword, which roughly means which part of the document the keyword must appear in. For example, when you search for the keyword "jive", the query that gets executed internally is a fielded query that might look like:

"subject:jive body:jive"

The actual query is more complex than that, but this is enough to illustrate the idea in this blog. The query string tells the internal index searcher: retrieve the documents where "jive" appears in either the "subject" or the "body". To support fielded queries, during indexing we split each document into multiple fields: subject, body, tags, author_id, and so on (Apache Lucene - Index File Formats). Additionally, search engines support the definition of "filters", i.e., objects used to restrict which documents may be returned during searching (Filtering a Lucene Search). Therefore, we can put the tenant ID of the content's owner in a "tenantId" field in every index document.  Then for each query we specify a filter on tenantId to restrict the results to that tenant:
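To make the filter-field approach concrete, here is a minimal Python sketch of the idea. This is a toy inverted index, not Lucene's actual API; all names here are illustrative:

```python
from collections import defaultdict

# Toy model of the filter-field approach: every document carries a
# tenantId field, and each query intersects the keyword's shared
# posting list with the tenant's posting list.
class FilterFieldIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # "field:text" -> set of doc ids

    def add(self, doc_id, tenant_id, fields):
        # Record the owning tenant as an ordinary indexed term.
        self.postings["tenantId:" + tenant_id].add(doc_id)
        for field, text in fields.items():
            for token in text.lower().split():
                self.postings[field + ":" + token].add(doc_id)

    def search(self, tenant_id, field, keyword):
        # The tenant filter: restrict the shared posting list to one tenant.
        hits = self.postings[field + ":" + keyword]
        allowed = self.postings["tenantId:" + tenant_id]
        return sorted(hits & allowed)

idx = FilterFieldIndex()
idx.add(1, "12345", {"subject": "jive search"})
idx.add(2, "67890", {"subject": "jive too"})
print(idx.search("12345", "subject", "jive"))  # -> [1]
```

Note that every query pays for the intersection: the shared posting list for "subject:jive" contains every tenant's documents, and the filter is applied on top of it.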



An alternative approach to multi-tenancy

We experimented with an alternative approach to multi-tenancy. Instead of using a filter to restrict the results we prefixed field names with the tenant ID.  Thus the query becomes:

"12345_subject:jive 12345_body:jive"

This may seem wasteful, as there would potentially be tens of thousands of fields in a single index; below we describe the performance of such an indexing scheme. Search engines usually define a <field,text> pair as a term (Term (Lucene 3.6.2 API)), and their most basic function is to retrieve a "posting list", i.e., a list of docs that contain a given term. Since "12345_subject:jive" can be parsed into a single term, the docs containing it can be retrieved quickly -- much more quickly than retrieving the far larger posting list of the term "subject:jive".
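The effect on posting-list length can be sketched in a few lines of Python (again a toy model, not the Lucene API):

```python
from collections import defaultdict

# Per-tenant field prefixing: the tenant id becomes part of the term
# itself, so each tenant gets its own, much shorter posting list.
postings = defaultdict(set)

def add(doc_id, tenant_id, field, text):
    for token in text.lower().split():
        postings[f"{tenant_id}_{field}:{token}"].add(doc_id)
        postings[f"{field}:{token}"].add(doc_id)  # shared term, for comparison

add(1, "12345", "subject", "jive search")
add(2, "67890", "subject", "jive too")

# The shared term's list covers every tenant; the prefixed term's
# list covers only one tenant and can be read directly, no filter needed.
print(len(postings["subject:jive"]))           # -> 2
print(sorted(postings["12345_subject:jive"]))  # -> [1]
```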


This approach has two advantages when compared with the field filter approach:

  1. Performance. Besides shorter posting lists, it avoids the need to inject a tenant filter into every query; and
  2. Relevance. Mixing multi-tenant data in a single field reduces the accuracy of search relevance, because of inaccurate document frequencies (Similarity (Lucene 3.6.2 API))
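The relevance point can be illustrated with a quick IDF calculation. The numbers below are made up, and Lucene's actual Similarity formula differs in its constants, but the direction of the effect is the same:

```python
import math

N = 1_000_000       # docs in the whole index (all tenants)
df_shared = 50_000  # docs containing "subject:jive" across all tenants
df_tenant = 40      # this tenant's docs containing "12345_subject:jive"

def idf(doc_freq):
    # Simplified inverse document frequency.
    return math.log(N / doc_freq)

# With a shared field, other tenants' docs inflate docFreq, which
# deflates the IDF (and thus the term weight) one tenant's query sees.
print(idf(df_shared) < idf(df_tenant))  # -> True
```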

To our knowledge this is a new approach that has not been published.  Here we describe our findings applying it to Lucene.


We started by identifying a number of aspects of this approach that would cause various Lucene memory and disk structures to blow up as the number of fields grew:

  1. Lucene allocates an in-memory array of normalization factors (norms) for each field.
  2. Lucene allocates a bit cache for filters per term/field combination.
  3. Lucene creates a field cache for each sortable field.
  4. Lucene's on-disk index size grows.
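A back-of-envelope calculation shows why the norm arrays in particular are worrying. It assumes, as in Lucene 3.x, one norm byte per document per indexed field, and uses the field counts from our experiments:

```python
max_doc = 100 * 100_000  # 100 tenants x 100,000 docs = 10M docs
logical_fields = 4       # subject, body, tags, ... (illustrative count)
tenants = 100

# Norm arrays are dense: one byte per doc per field, zeros included,
# so per-tenant fields multiply the cost by the number of tenants.
shared_norm_bytes = max_doc * logical_fields                # ~40 MB
per_tenant_norm_bytes = max_doc * logical_fields * tenants  # ~4 GB

print(shared_norm_bytes, per_tenant_norm_bytes)
```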


Basic Experiments

The experiment was to set up an index with 100 tenants (customers), each with 100,000 documents. The experiment then walked through a random dictionary of terms. In order to exercise the boolean logic, the queries were of the form <query> OR <last query>.


Basic Index Sizes (without any optimization)

In the table the columns represent the following Lucene data structures:

  • frq   Contains the list of docs which contain each term along with frequency
  • cfs   An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles
  • tim   Part of the term dictionary, stores term info
  • prx   Stores position information about where a term occurs in the index
| Index Scheme | index disk size | frq | cfs | tim | prx |
| --- | --- | --- | --- | --- | --- |
| Just simple fields (not per tenant) | 10 Gb | 3.2 Gb | 19 Mb | 15 Mb | 6.1 Gb |
| With per tenant fields: 100 tenants each with 100,000 docs | 16 Gb | 5.9 Gb | 1.8 Gb | 1.2 Gb | 6.1 Gb |
| With per tenant fields: 10,000 tenants each with 1000 docs | | 7.0 Gb | 19 Gb ** | 2.3 Gb | 6.8 Gb |

** This norm file can be compressed. See below.


Basic Query Results Timings

The table below shows the latency of random queries against two different index builds, with and without per-tenant fields. With per-tenant fields the queries are 3-4 times faster.


| Index Scheme | Total time (ms) of searching 10,000 times | Total time (ms) of searching 20,000 times |
| --- | --- | --- |
| Just simple fields (not per tenant) | 13,743 | 26,442 |
| With per tenant fields: 100 tenants each with 100,000 docs | 4,716 | 8,209 |



Specific Solutions & More Experiments

Norm compression

When we use per-tenant fields, the size of the norm file (.nrm) grows linearly with the number of docs times the number of tenants.  As we saw in the experiments above, when there is a large number of customers in a single index, the norm file becomes very large.


Norms are stored as a per-field array, with one entry per document.  Thus when a document does not have any terms for a field, the array will contain a zero value for that document.  This isn't a big deal when most of the documents in the index have a value for each field, but is a problem when each field has relatively few documents (a sparse index).


We developed a scheme to compress the norm file. The basic idea is that although a single index may contain data from multiple tenants, any single index document contains data from only one tenant. Therefore, for n tenants, instead of keeping n arrays of norms for the same logical field (e.g., title), we can combine all norms for that field into one array. In this way, the size of the norm file no longer depends on the number of customers -- just on the number of documents.
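The merge step can be sketched as follows. This is a simplified Python model of the idea, not our actual Lucene codec; the field names and doc count are made up:

```python
# Each document belongs to exactly one tenant, so the n per-tenant
# norm arrays for one logical field (mostly zeros) can be collapsed
# into a single dense array with one entry per document.
def compress_norms(per_tenant_norms, max_doc):
    # per_tenant_norms: {prefixed_field_name: [norm byte per doc]},
    # with zeros for docs that belong to other tenants.
    merged = [0] * max_doc
    for norms in per_tenant_norms.values():
        for doc_id, n in enumerate(norms):
            if n:
                merged[doc_id] = n  # at most one tenant sets each slot
    return merged

sparse = {
    "12345_title": [5, 0, 0, 0],
    "67890_title": [0, 7, 0, 0],
    "55555_title": [0, 0, 0, 9],
}
print(compress_norms(sparse, 4))  # -> [5, 7, 0, 9]
```

Here three tenants' arrays (12 entries total) collapse into one 4-entry array, which is why the savings scale with the number of tenants.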


We built a prototype based on this idea and experimented with it by indexing the same input file with different numbers of tenants and docs per tenant, with norm compression enabled:

| Number of docs per tenant | Number of customers | .doc | .pos | .tim | _nrm | .cfs |


  1. In rows 1, 2 and 3 the norm file size increases because of the increase in the total number of docs.
  2. Lucene changed file names in 4.1. For details see: org.apache.lucene.codecs.lucene41 (Lucene 4.1.0 API)


For comparison, we also tried indexing the same input file, without the norm compression:

| Number of docs per tenant | Number of customers | .doc | .pos | .tim | _nrm | .cfs |


By combining the 1000 per-field norms into a single norms field on-disk we made the norms file 1000 times smaller.


Sorting by Field Value

Sorting by field value, e.g., by title, is a common function of enterprise search engines. However, given how Lucene implements sorting (based on FieldCache), it does not scale as the number of fields used in sorting increases. To solve this problem, we experimented with a "hybrid" approach: use per-tenant fields for querying and storing content, but use simple fields (not per-tenant) for sorting and faceting.
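The hybrid scheme can be illustrated with a toy example (names and values are made up; in Lucene the shared sort column would be a single non-tokenized field backed by one FieldCache array):

```python
# Query on a per-tenant field, but sort using one shared "title_sort"
# column with a single value per doc, so the sort cache does not
# multiply with the number of tenants.
postings = {"12345_body:jive": [0, 3, 7]}         # per-tenant query field
title_sort = {0: "gamma", 3: "alpha", 7: "beta"}  # shared sort values

hits = postings["12345_body:jive"]
print(sorted(hits, key=title_sort.get))  # -> [3, 7, 0]
```

Since each doc has exactly one title, a single shared column serves every tenant's sorted queries, which is what keeps memory flat in the experiment below.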


We implemented a "sort by title" function with both the hybrid approach and the per-tenant approach and ran experiments with them. Below are the details of the experiment:



  • Hardware: Mac Pro, 16 GB memory, SSD hard drive, 2.3 GHz i7 CPU
  • Software: Mac OSX 10.8.2, Java 1.6
  • JVM: -Xmx1024m

Indexing setup:

  • 4000 tenants each with 1000 docs
  • the same input file as in previous experiments
  • add a new, non-tokenized field, to support sorting by title
  • the value of the new field is unique for each doc

Search setup:

  • 1000 tenants, each with 10,000 queries
  • The same group of queries was executed for "sort by relevance" and "sort by title"
  • 32.1% of queries return non-zero hits
| Approach | .doc | .pos | .tim | Query Latency, Sort by Relevance (microseconds) | Query Latency, Sort by Title (microseconds) |
| --- | --- | --- | --- | --- | --- |
| Hybrid (per-tenant fields for querying & simple fields for sorting) | 2.45G | 2.64G | 2.19G | 73.4 | 78.5 |
| Per-tenant fields, without sorting field | 2.44G | 2.64G | 1.87G | | Doesn't support sorting by title |
| Per-tenant fields, including sorting field | 2.45G | 2.64G | 2.20G | | OUT OF MEMORY! |

Filtering (by numeric range and/or by field value)

Filtering is another common function of enterprise search engines. In this group of experiments we implemented two filters: filtering by date and filtering by content type, in multiple ways, and tested their performance. The implementations differ in:

  1. Tenant ID handling: the hybrid approach (per-tenant fields for querying and simple fields for filtering) versus per-tenant fields for everything
  2. Filter strategy: how filtering is applied internally, a relatively new Lucene feature
  3. Caching:
    • No caching: filter chain would be [NumericRangeFilter, TermsFilter] to apply the date and content type filter.
    • Use field cache: [FieldCacheRangeFilter, FieldCacheTermsFilter]
    • Cache individual filters: [CachingWrapperFilter(NumericRangeFilter), CachingWrapperFilter(TermsFilter)]
    • Cache filter chain: CachingWrapperFilter([NumericRangeFilter, TermsFilter])
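As a rough illustration of the "cache the filter chain" strategy, here is a Python sketch in which sets of doc ids stand in for Lucene's DocIdSets, and memoization stands in for CachingWrapperFilter. The data and field values are made up:

```python
from functools import lru_cache

# One toy "column" per filterable field: a date and a type per doc.
DOC_DATES = (2010, 2011, 2012, 2011)
DOC_TYPES = (1, 2, 1, 1)

@lru_cache(maxsize=None)
def cached_filter_chain(lo, hi, ctype):
    # Compute the AND of the date-range and content-type bitsets once
    # per (range, type) combination; later queries reuse the result
    # instead of re-evaluating each filter.
    date_bits = {i for i, d in enumerate(DOC_DATES) if lo <= d < hi}
    type_bits = {i for i, t in enumerate(DOC_TYPES) if t == ctype}
    return frozenset(date_bits & type_bits)

print(sorted(cached_filter_chain(2011, 2012, 1)))  # -> [3]
```

Caching the whole chain (rather than each filter separately) saves the per-query intersection as well, which matches the table below, where the cached chain even beats running with no filter at all.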


  • Hardware: Mac Pro, 16 GB memory, SSD hard drive, 2.3 GHz i7 CPU
  • Software: Mac OSX 10.8.2, Java 1.6
  • JVM: -Xmx1024m

Indexing setup:

  • 500 tenants each with 10,000 docs
  • the same input file as in previous experiments
  • added a new field to support filtering by date, and its value is a random date from 2009 to 2013 (both inclusive)
  • added a new field to support filtering by type, and its value is a random int (converted to string) from 0 to 9 (both inclusive)

Search setup:

  • 500 tenants each with 2000 queries
  • Date filter is applied on every query, and the date range is from 2011/01/01 (inclusive) to 2012/01/01 (exclusive)
  • Content type filter is applied on every query, and the content type is randomly chosen from all possible types
  • Without filtering: 69.3% of queries return non-zero hits
  • With filtering: about 56% of queries return non-zero hits
| Per-tenant Fields? | Filter Strategy | Filter Caching | Query Latency (microseconds) | Comment |
| --- | --- | --- | --- | --- |
| No filtering | N/A | N/A | 94.5 | |
| Hybrid | Default | No caching | 20-30 milliseconds | |
| Hybrid | Default | Field cache | 80-100 milliseconds | Slower than not caching at all! |
| Hybrid | Default | Caching individual filters | 10-20 milliseconds | |
| Hybrid | Default | Caching filter chain | 89.2 | Faster than no filtering! |
| Hybrid | Query first | Caching filter chain | 61.5 | Best of all |
| Hybrid | Leap frog, filter first | Caching filter chain | 62.9 | |
| Hybrid | Leap frog, query first | Caching filter chain | 62.2 | |
| Per-customer fields | Default | No caching | 774.5 | Not too bad! |
| Per-customer fields | Default | Caching filter chain | OOM | |

Let's get to the announcement part first, then we'll deal with the minor update...


The Forms App, Now Pre-Installed for All Jive Cloud and Hosted Instances!


What this means is that all Jive customers can have access to the Forms App, and all its functionality, without requiring users to go to the Apps Market and install the app.  The app will be readily and immediately available to all users of the site.


This has come about thanks to the great customer adoption of, and feedback on, the app, which over many iterations and updates has made it what it is today.  For example, see the following update announcements, all of which came from ideas from customers like you:



Thanks again for all of the great feedback and support of the app, and as always, continue to provide any feedback for the app, or any questions you may have, here in the [ARCHIVE] Forms App group or in the [ARCHIVED] Apps by Jive! group.


Latest Update


There has been a minor update to the Forms App, which has just been posted to the Apps Market.  This only affects Jive customers who are using the app but are not running the Jive web application in the root context (your Jive URL includes an extra path segment after the domain, rather than serving Jive directly from the root).  Now the page redirect, where you are taken after you create a piece of Jive content, will correctly forward to the right page even if you are not running in the root context.  Note that this update requires the Jive v3 API and only works with Jive 6.0+, which includes all Jive Cloud versions.

Day two at Portland's 5th Annual Open Source Bridge, and it has been a blast so far! I wanted to report in before I go into full panic mode in preparation for my talk, PostgreSQL Replication: The Most Exciting Technology on Earth. I have spoken before, but I really work myself up beforehand. Speaking in front of smart people is thrilling and terrifying, but it'll be so much better for me not to dwell on that and instead talk about this awesome conference.


This is a high-tech conference, though the tracks have unusual names like Cooking, Chemistry, and Culture. They cover a huge range of topics too, from specific technical deep dives to career and life development; Open Source Bridge has something for everyone. There is a giant table of Legos in the middle of the hacker lounge, and groups of smarties talking shop, soldering, and socializing all over the place; really a lot of fun. The PyLadies, MongoDB User Group, R User Group and PDX-OSGEO - OSGeo Groups are all representing, as are geeks from all sorts of different cool companies and industries.


Conferences are a great way to meet people and spend some time away to think about new ideas. As a speaker, I find that both the preparation to make a good presentation and the follow-up questions and conversations ultimately make me better at being a DBA. Everyone has a different, unique problem they've faced, an important lesson they've learned, and an innovative solution they've applied, and sharing that information is what makes us all better. As an attendee, I find presentations to be a great introduction to a concept or technology (you can only go so deep in 45 minutes), and a way to think about the not-too-distant future instead of what you are working on today.


If you can't make it this year, no problem! Everything is videotaped and available after the conference. But if you can make it in person, you should come and work in the Hacker Lounge; the geek energy is strong and overwhelmingly positive. I believe anyone can attend the evening BoF sessions and hacking lounge sessions; you just need to register. Check out the Open Source Bridge Blog for more information.


The Hacker Lounge Entrance


Registration and Courtyard



I'll aim to report again after I finish my presentation. If you're here, don't hesitate to say hello!

With just over a week to go before I speak on Postgres at Open Source Bridge, it is a great time to start freaking out and all at once think of all of the things I need to get ready for in the next couple of months.


At Open Source Bridge, I'll be presenting PostgreSQL Replication: The Most Exciting Technology on Earth. If you are familiar with replication, and you stayed awake through the title, then you know exactly what kind of thrill ride to expect. If not, and you plan on going, grab a coffee beforehand and strap on your seat belt. I'm going to talk about the types of replication, what different options are available for the Postgres engine, and what you need to consider when choosing a replication technology. I'll also cover some of the goodies in the 9.3 Beta with respect to replication. Still awake? You should go! It is Thursday, 6/20 at 3:45 PM PT.


PostgreSQL Replication - The Most Exciting Technology on Earth / Open Source Bridge: The conference for open source citi…


All kidding aside though, I am looking forward to my first time speaking at Open Source Bridge. I love meeting new techies; there is no shortage of great conversation and smart people with great ideas. I look forward to meeting you if you are there next week. I know for sure one session I won't miss is my colleague jesse.hallett's way more exciting Mod Your Android. I am sooo glad he is going before me, though only a few sessions before me, so I should be in full "I'm speaking shortly, time to freak out" mode. Especially since I'm planning to do a live demo.


Mod your Android / Open Source Bridge: The conference for open source citizens / June 18-21, 2013 / Portland, OR


As soon as I finish, I can focus on the next week, with my first O'Reilly Webcast! Intro to Raspberry Pi will be a presentation that covers the basics of the super-hot Raspberry Pi computer, why you want one, where to get one, what to do with it, and why your kid needs at least one. This will be for the beginner, afterwards I hope you'll be inspired to order your own and make something amazing. Tons of people, young and old, have come up with just amazing ideas with their Raspberry Pis. Check out the RPi Google Group if you want to get a daily dose of innovation. The webcast is Tuesday, June 25 at 10AM PT, but will be available online afterwards. I hope you can make it!


Webcast: Intro to Raspberry Pi


So far, July is shaping up to be 3D printer month. On July 10th, I will be doing another O'Reilly webcast, 3D Printing for Everyone. A couple of years ago I built an open source 3D printer, and ever since then I have been telling everyone how this is going to change our future. The technology has incredible potential: it can help us reduce waste, and it gives everyone the ability to create and be innovative. Just as boring 2D printing went from horrible, constantly jamming, barely legible dot-matrix printers to inexpensive, fast, perfect-quality laser printers, 3D printing is still a little wonky now, but the potential for Star Trek replicators is truly on the horizon. The webcast aims to be a comprehensive introduction to 3D printing, where it is now, and where it is going. You probably can't have a 3D printing talk without going over the controversial stuff either, and I will be happy to offer my take.


At the end of the month, at OSCON 2013, I will be going a little deeper into 3D printing, and I will have my printer with me and (hopefully) printing. I'll cover the open source models and the RepRap concept of self-replicating printers (yes, exactly like the Terminator). I would recommend this session to anyone who is ready to get a 3D printer and wants to learn the pros and cons of building one from a kit vs. buying a completed one. If you are attending OSCON, I'd love to meet you, just let me know!


Webcast: 3D Printing for Everyone

My RepRap Printer - The eMaker Huxley: OSCON 2013 - O'Reilly Conferences, July 22 - 26, 2013, Portland, OR

One of our recent internal IT initiatives at Lexmark was to deploy the Jive platform for use across the company. And while our business areas absolutely love Jive (12k+ users), they didn't love having to go to multiple places or use multiple tools to search for content. The business need for a unified search solution was apparent shortly after launch.


So our Jive development team decided to embed our own enterprise search product (Perceptive Search) into Jive to access content across the following information sources (using out-of-the-box and custom connectors):


  • Lotus Notes
  • SharePoint
  • File shares
  • Google Drive
  • Databases (Oracle)
  • Engineering/R&D wikis (mediawiki, doku, twiki, etc.)
  • LDAP
  • CMS systems
  • Lexmark public websites
  • Legal patent image database
  • Federated results from social media sites (LinkedIn, Facebook and Twitter)


Once complete, our Lexmark users will have the functionality they were asking for without having to leave our Jive site. Here’s an example of what it looks like when our users search in Jive (note: to ensure I'm not exposing sensitive information, I used JavaScript in the console to replace the real results with fake data before taking screenshots):




This is what it looks like when our users search in our Perceptive Enterprise Search site:




Here is a screenshot of our Jive app, combining Jive search results with Perceptive Enterprise Search results:


We wanted to keep the look and feel the same from a user standpoint, so we reverse engineered the CSS and added a few Perceptive Search touches (like the timeline and the stars).


For the next steps, we are adding !App functionality (thanks to kenny.tucker and randy.lubin for the recommendation) and adding a “share to Jive” icon to our internal search results page.


If you are interested in trying this out, please send me a message in the community and we'll get back to you.
