
Jive Release Blog

4 Posts authored by: ajohnson1200

I've heard this question a number of times here on Jivespace and I think even a couple times on our intranet (which is called Brewspace) but I've yet to see it documented anywhere, mostly because no one has ever taken the time to explain it to our awesome docs guy Steve (by the way, have you seen the new docs? They're amazing!). I doubt any of you are losing sleep about it, but just in case, here's my simplified, albeit technical, take on things. If you've got a source build handy, I'd encourage you to follow along in your favorite IDE.


PopularityDeterminationTask is the class that handles the majority of the work. It runs every 15 minutes (as configured in spring-taskContext.xml, which means you can change it via a plugin in 2.5 if you want) and works closely with ActivityManager. Every time it runs, it does the following:

  • iterates over the set of containers that exist in Clearspace (communities, projects, social groups and blogs)

  • calculates the popularity of each item in the in-memory copy of activities that ActivityManager maintains for the current day. The score (or popularity) of an item is calculated by adding together the score of each activity on that item; the scoring is as follows:

    • each comment (blogs and documents) or reply (thread) is given 5 points

    • each modification (documents only) is given 5 points

    • each view is given 1 point

  • calculates the popularity of items in each container as stored in the database for the last seven days (every night at midnight, a popularity score is computed for each item based on that day's activity and stored in the jivePopularity table). Each day's score is then adjusted for how many days old it is. We call this decay: the computation is done in SQL, but if it were done in Java it would look something like this:

// Given a recorded score and the number of days ago it was recorded,
// compute the decayed score. Note the integer division: the result truncates.
Calendar c = Calendar.getInstance();
c.add(Calendar.DATE, -daysAgo);
Date now = new Date();
Date nDaysAgo = c.getTime();
long dateDiff = now.getTime() - nDaysAgo.getTime();
long dateAdjustment = dateDiff / 86400000; // 86,400,000 ms in a day
long decayedScore = score / (1 + dateAdjustment);
  • add the results of step 2 and step 3 together. For example, let's say I published a blog post two months ago and, through some sort of magic, it got 3 comments every day for the last two months (and it has already received 2 comments today that haven't been archived to the jivePopularity table). We only take into account the popularity score for the last seven days, so the total score is going to be the sum of the following scores:



    Date                Number of Comments    Recorded Score    Score With Decay
    6/5/2008 (today)    2 (in memory)         10                10
    6/4/2008            3                     15                7
    6/3/2008            3                     15                5
    6/2/2008            3                     15                3
    6/1/2008            3                     15                3
    5/31/2008           3                     15                2
    5/30/2008           3                     15                2
    5/29/2008           3                     15                1
    Total Score                                                 33

    (5 points per comment; each recorded day's score is divided by 1 + its age in days, truncating)



  • Finally, each item is added to a SortedSet and then copied back to ActivityManager, where it lives until the next time popularity is calculated fifteen minutes later.
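To make the arithmetic concrete, here's a small self-contained sketch (not the actual Clearspace source; it assumes the 5-points-per-comment rule above and integer truncation in the decay division) that computes the total for the blog-post example:

```java
public class PopularityExample {
    // Same decay math as the snippet above: integer division by (1 + days old).
    static long decay(long score, long daysAgo) {
        return score / (1 + daysAgo);
    }

    // Total score for the worked example: 2 comments today (in memory, no decay)
    // plus 3 comments x 5 points recorded for each of the previous seven days.
    static long totalScore() {
        long total = 2 * 5;                 // today's in-memory activity
        for (long daysAgo = 1; daysAgo <= 7; daysAgo++) {
            total += decay(3 * 5, daysAgo); // 15 -> 7, 5, 3, 3, 2, 2, 1
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalScore());   // 33
    }
}
```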


So when you see a widget that shows the most popular items for a given community / space, project, social group or blog, you're getting the n items that have the highest combined score: a combination of comments, replies, edits and views, decayed over the last seven days. ¿Comprende?

During the last couple weeks of the 2.0 development cycle we pushed some really helpful search improvements (some of them bug fixes) into Clearspace. There are a number of posts scattered around our intranet (which is called Brewspace) where the actual improvements / bug fixes are discussed, but I don't believe those improvements made it into the documentation (there are a number of improvements listed in the changelog, but no description of what the improvements are). Hence this blog post.


First, in 2.0, the default search operator was changed to 'AND' from 'OR', the end result being that if you did a search like this:

clearspace openid

Clearspace would look for all the blog posts, discussions and documents that contained the term "clearspace" AND the term "openid". Way back in Clearspace 1.0 the thought was that we should deviate from what Google does (they AND the terms you input) because we're not searching the entire web; our thinking was that most of our installations would only have a couple thousand documents, blog posts and threads, and we didn't ever want a search for 'clearspace openid ldap' to return nothing if there was a document that discussed two of the three. The reality is that when the search operator was 'OR', the number of results from a search query in Clearspace was almost always greater than 500 (the maximum number of results we would return): in fact, the more words you used in your query, the more likely you'd end up with a large number of results, which in theory is great (we found a bunch of stuff for you!) but in practice doesn't make for a great user experience (thirty-four pages of search results? come on!). One of the articles (discussed below) had this to say about lots of search results:

Users sometimes measure the success of a query primarily by the number of results it returns. If they feel the number is too large, they add more terms in an effort to bring back a more manageable set.

So not only did "OR" result in more results per query; if you decided to refine your query by adding a term, the number of results would actually grow, not shrink, which is the opposite of what you'd expect.
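A toy illustration of the difference (a hypothetical four-item mini-index, not Clearspace code): treat each document as a bag of terms and compare how many results AND and OR return as the query grows:

```java
import java.util.*;

public class DefaultOperatorDemo {
    // Hypothetical mini-index; each "document" is just its set of terms.
    static final List<Set<String>> DOCS = List.of(
            Set.of("clearspace", "openid"),
            Set.of("clearspace", "ldap"),
            Set.of("openid"),
            Set.of("clearspace", "openid", "ldap"));

    static long matches(List<String> terms, boolean requireAll) {
        return DOCS.stream()
                .filter(d -> requireAll ? d.containsAll(terms)
                                        : terms.stream().anyMatch(d::contains))
                .count();
    }

    public static void main(String[] args) {
        List<String> two = List.of("clearspace", "openid");
        List<String> three = List.of("clearspace", "openid", "ldap");
        System.out.println(matches(two, true));    // AND: 2 results
        System.out.println(matches(two, false));   // OR:  4 results
        // Adding a term narrows an AND query but never shrinks an OR query:
        System.out.println(matches(three, true));  // AND: 1 result
        System.out.println(matches(three, false)); // OR:  4 results
    }
}
```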


The funny thing about changing this default is that when we turned it on for brewspace (our intranet, no other changes had been made at that point), a number of people noticed right away and were amazed at the 'improvement'. It's really crazy how something as simple as "AND" versus "OR" could make such a big difference in user experience.


Before I move on, here are a couple interesting articles I found that talk about search as it relates to user experience:

  • Greg Linden, an ex-Amazon guy, wrote a great blog post a couple months ago that summarized a talk Marissa Mayer gave about Google's results page and the number of results per page, and added some notes from his own experience at Amazon. The bottom line from the presentation? Speed kills. The faster we can return search results, the happier Clearspace users will be (although the comments on that post tell a potentially different story, so don't miss 'em).

  • A BBC News article from 2006 had this to say about search results:

At most, people will go through three pages of results before giving up, found the survey by Jupiter Research and marketing firm iProspect. It also found that a third of users linked companies in the first page of results with top brands. It also found 62% of those surveyed clicked on a result on the first page, up from 48% in 2002. Some 90% of consumers clicked on a link in these pages, up from 81% in 2002. And 41% of consumers changed engines or their search term if they did not find what they were searching for on the first page.

Takeaway? Relevant results are more important than many results.

The essential problem of search — too many irrelevant results — has not gone away.

More and more, our ongoing research is telling us that Search has to be perfect. Users expect it to "just work" the first time, every time.


One thing that was easy to add which has also come up a couple times was search by author. I'm happy to report that in 2.1 we added the ability to search for content authored by a specific user. So just like you can click 'more options' on the search results page today and choose what types of content you want to search for, you'll be able to select a user whose content you want to find and filter the results using that selection.  Side note: that functionality has always been in our API and if you're a URL hacker like I am, you can actually perform Clearspace searches using a pretty URL like this:

or if you want to search for any content written by the user 'aaron' that contains the word 'openID', you'd use this URL:


Another thing that I believe has worked in the past but that we haven't talked about is a simple syntax for search. Much like the operators you can use in Google (e.g. the site: operator, which restricts a query to a single domain), we now support the following operators: 'subject:', 'body:', 'tags:' and 'attachmentstext:'. While I admit they're not the most user-friendly things to type, they do give advanced users a little bit more flexibility. For example, you can now ignore tags by doing a search like this: 'subject:lucene OR body:lucene'. The search syntax operators are slated to be covered in the search tips documentation that sits right alongside the search box. Again, this is for 2.1.


Those were the improvements. Now the bug fixes (which just so happen to also really improve your searching experience). 

  • Search stemming doesn't seem to be working (CS-3645): Not sure how long this wasn't working, but the existence of this bug meant that if you put the word "error" in a document and then did a search for "errors", our search engine wouldn't find your document. Read more about stemming if you're curious about that sort of thing. If you're seeing this bug, make sure that a) you upgrade to at least Clearspace 2.x and b) you're using a stemming indexer; the default analyzer does not stem. You can change the indexer by going to the admin console --> system --> settings --> search --> search settings tab --> indexer type.

  • Group search results by thread setting in admin console doesn't change search behavior (CS-3656): I'm not sure how long this feature has been around, but there is a search setting in the admin console that gives you the ability to group all messages in a thread into one result in a search results page so that the messages in a single thread don't overwhelm the search results (since messages share the subject and tags from the thread, that actually happened quite a bit). Fixed in 2.0, I highly recommend turning it on in your instance if you haven't already.

  • Search updates to better balance queries across content types (CS-3638): Some improvements were made in 2.0.0 toward this issue, but it's 100% fixed in 2.0.3 and 2.1. There were two really big but really really hard to see problems with the way that we were executing our search queries. First, a quick background on how a search query is performed in Clearspace against all content types. We have a single Lucene index for all the content in Clearspace (there is a separate index for user data, but that's a different story) so when a search for 'bananas' is executed, we did something like this (don't read too much into the language I'm using, I'm just trying to illustrate how it works at a 30,000 foot level):

  1. get blog posts that match query

    • find all the blog posts where the subject matches 'bananas' OR the body matches 'bananas' OR the tags match 'bananas' OR the attachments match 'bananas' OR the blogID matches 'bananas'

  2. get discussions that match query

    • find all the messages where the subject matches 'bananas' OR the body matches 'bananas' OR the tags match 'bananas' OR the attachments match 'bananas' OR the threadID matches 'bananas'

  3. get documents that match query

    • find all the documents where the subject matches 'bananas' OR the body matches 'bananas' OR the summary matches 'bananas' OR the fieldsText matches 'bananas' OR the tags match 'bananas' OR the attachments match 'bananas' OR the documentID matches 'bananas'

  4. merge results from steps 1-3 using the relevance score from each item in the result set as the comparator

  5. display results
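The per-type query and merge described in steps 1-5 can be sketched like this (a toy illustration, not the actual Clearspace code; the titles and relevance numbers are made up):

```java
import java.util.*;

public class MergeResultsDemo {
    record Hit(String contentType, String title, double relevance) {}

    // Step 4: merge the per-content-type result lists into one list,
    // ordered by each hit's relevance score (highest first).
    static List<Hit> merge(List<List<Hit>> perTypeResults) {
        List<Hit> all = new ArrayList<>();
        perTypeResults.forEach(all::addAll);
        all.sort(Comparator.comparingDouble(Hit::relevance).reversed());
        return all;
    }

    public static void main(String[] args) {
        List<Hit> blogHits   = List.of(new Hit("blogPost", "Banana bread", 0.9));
        List<Hit> threadHits = List.of(new Hit("thread", "Re: bananas", 0.4));
        List<Hit> docHits    = List.of(new Hit("document", "Banana FAQ", 0.7));
        // Step 5: display (here, just print) the merged results.
        merge(List.of(blogHits, threadHits, docHits))
                .forEach(h -> System.out.println(h.contentType() + ": " + h.title()));
    }
}
```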

The assumption we made when writing this code was that the scores Lucene returns for items of each content type would be relatively similar. More concretely, if I had a document and a blog post which for some reason had identical content, I'd expect them both to have the exact same Lucene relevance score if they came up in the results of a search. But that assumption turned out to be wrong, not once but twice.


First, as you can see from glancing at the sample queries I pasted above, we searched a different number of fields per content type. Who cares, right? Why would the number of fields you search on influence anything? Turns out that Lucene cares: the way scoring works in Lucene, if you search on ten fields and only get a hit on two of them in document 'X', the resulting relevance score will be less than for blog post 'Y', where you search on five fields and get a hit on two of them. It makes perfect sense when you think about it: it's just like the tests you had in school. Getting 4 out of 5 on a test works out to 80%, about a B. If you got 4 out of 10 on a test, that's 40%: you failed. You probably called them grades... maybe sometimes even a score, which just so happens to be exactly how Lucene refers to the relevance that a particular document has to a given query (if you're curious about how Lucene does scoring / relevance, you should check out the JavaDoc for the Similarity class and also read this document on scoring). Anyway, this behavior was fixed in 2.0: now when we execute a search on mixed content we search the exact same number of fields for each content type: subject, body, tags, attachmentsText.
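The grade analogy maps directly onto Lucene's coordination factor, whose default implementation (in the Similarity class) is just the fraction of query clauses a document matched. A quick sketch of the arithmetic:

```java
public class CoordDemo {
    // Default coord(overlap, maxOverlap) in classic Lucene scoring:
    // the fraction of the query's clauses that actually matched.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        System.out.println(coord(2, 10)); // document: 2 hits out of 10 fields -> 0.2
        System.out.println(coord(2, 5));  // blog post: 2 hits out of 5 fields -> 0.4
    }
}
```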


The second assumption that turned out to be wrong was just as nebulous. I illustrated above how we search in Clearspace: we do searches for each content type and then we merge the results of those searches into a single result set. In order to do a query for just blog posts that match the word 'token', we do a query that in Lucene looks like this:

+objectType:38 +(subject:token body:token attachmentsText:token tags:token)

It kind of looks like a SQL subselect: get me all the things where one of subject, body, tags or attachmentsText matches and then, from those results, only return the results where the objectType is 38 (which is the int that JiveConstants.BLOGPOST refers to). The thing that killed us here was the outer statement:



a) when Lucene executes a query, it computes a query weight and field weight for each statement in your query and multiplies those two values together to get the total weight for that statement and

b) the query weight for a term is basically an inverse measure of how often that term (in this case 'objectType:38') appears in the index, so the rarer the term, the higher the weight, which means that

c) content objects that you have less of (in our case: blog posts) will tend to have a score much higher than content objects that you have a lot of (in our case: documents). Again, this makes sense: items that appears in the index a relatively small number of times are in some sense rare and so they should get a relatively higher weight. Regardless, it turns out there's a really easy fix for this problem as well: you can boost specific fields in your query like this:
subject:token^4


and you can effectively neuter a field by boosting it to zero:
subject:token^0


which means that Lucene will look for all the items in the index whose subject is 'token' but the weight, which usually influences the score that it assigns to the field 'subject' will not influence the score that the resulting item receives. 
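To see why rarer content types were scoring higher, here's a sketch of the classic Lucene idf formula, 1 + ln(numDocs / (docFreq + 1)); the index size and document counts below are made up for illustration:

```java
public class FieldWeightDemo {
    // Classic Lucene inverse document frequency: rarer terms weigh more.
    static double idf(int docFreq, int numDocs) {
        return 1 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 10_000;                        // hypothetical index size
        double blogPostWeight = idf(50, numDocs);    // few items match objectType:38
        double documentWeight = idf(5_000, numDocs); // many items match the document type
        System.out.println(blogPostWeight > documentWeight); // true
        // A zero boost wipes out a clause's contribution to the score:
        System.out.println(blogPostWeight * 0.0);            // 0.0
    }
}
```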


We're continually looking to improve the search tools in Clearspace. If you're seeing something you don't expect or if there's something cool you'd like us to add, please pipe up in our Support and Feature Discussion spaces here on Jivespace.



New Widgets

Posted by ajohnson1200 Jan 10, 2008

In a future release of Clearspace we're talking about adding a personalized homepage feature, much like the personalized start pages you find elsewhere on the web, where users can add Clearspace widgets like 'Recent Content' or 'Top Members' to their homepage. A lot of the widgets that we debuted in the 'customize a community' feature (have you seen the video?) will be available for users on the homepage, but we'd like to hear from you: assuming you had a customizable homepage, what widgets (Clearspace related or unrelated) would make your day?

Jon Udell recently blogged about how the user interface for creating links in rich text editors hasn't changed in years:

The idiom goes like this:


1. Select the text to which you want to attach your link.

2. Click the Link button.

3. Type (or paste) the URL.


I’ve watched novices struggle with this for years, and it’s no wonder that they do.

We've spent a number of cycles making the rich text editor linking process easier in Clearspace. In case you've never explored the little link button in the editor, clicking on it pops up a window that looks like this:



The pop up window defaults to showing you a search interface: do a quick search and you can link to anything in Clearspace.  Conveniently there are two other options: the history option:




which shows a list of your most recently viewed blog posts, forum threads and wiki documents and second: the web address option:




where you can type in or paste a URL and then a link title. We've gone through a number of rounds of design on this pop up window and I'm sure there's still room for improvement (there is no perfect design). We'd love to hear your feedback though: what do you like about linking in the editor? Which of the three tabs do you find you use the most? Did you even know it was there? And for any developers out there, has anyone tried to or wanted to customize the editor? If so, what did you do to it?
