Multi-Level Security in Solr

On June 15, I presented at the Basis Technologies Lucene and Solr in the Government conference on the topic of multi-level security (MLS) in Solr using ManifoldCF. This post covers the notes from my presentation.

My presentation covers document security as it pertains to search across multiple repositories, by users of varying clearance, against documents of varying classification.

Overall this is known as Multi-Level Security, or MLS.

MLS isn’t the only way to implement document security. Another approach, known as “system-high” security, is to permit only equally trusted parties access to the system itself. The information flow into and out of the system is screened by a separate process, so the system itself doesn’t have to screen it at all. This type of system often fits into a hierarchical network where users may view information at an equal or lower clearance level, and any information created is only available to an equal or higher clearance level. An example of system-high security is where separate indices are created for each clearance level. Users can only see the indexed data from an approved clearance level.
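As a rough illustration of the separate-indices idea, here is a minimal sketch in Java. The clearance levels, the `docs_` index naming scheme, and every class and method name are assumptions made up for this example; they do not come from Solr or ManifoldCF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical clearance levels, ordered from lowest to highest.
enum ClearanceLevel { UNCLASSIFIED, CONFIDENTIAL, SECRET, TOP_SECRET }

final class SystemHighIndices {
    // Hypothetical naming scheme: one index per level, e.g. "docs_secret".
    static String indexFor(ClearanceLevel level) {
        return "docs_" + level.name().toLowerCase(Locale.ROOT);
    }

    // A user may read the index at their own level and every level below it.
    static List<String> readableIndices(ClearanceLevel userLevel) {
        List<String> indices = new ArrayList<>();
        for (ClearanceLevel level : ClearanceLevel.values()) {
            if (level.compareTo(userLevel) <= 0) {
                indices.add(indexFor(level));
            }
        }
        return indices;
    }
}
```

The search engine never inspects individual documents here. All of the screening happens when deciding which index a document is written to and which indices a user may query.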

Multi-Level Security, or MLS, is the capability to recognize differing access permissions between otherwise equally trusted parties. It encompasses the notion of “compartmentalization”, where I am given access to the information I need to do my job, and you are given access to different information because you do a different job. This is possible within the same system because our differences are recognized and used to grant appropriate access.

This idea of compartmentalization was famously popularized during the Manhattan Project. The people who were enriching the uranium knew all about the state of the art of enriching uranium, but knew nothing about bomb designs.

MLS is hard to do right, and I’ll explain why a little later. In fact, most of the systems I’ve worked on started with system-high security, and when the project gained enough traction or funding, MLS was added.

The difference between MLS and “system-high” is that in the case of system-high, the system itself cannot be relied on to appropriately keep track of and enforce access restrictions beyond a basic level. Additional processes must be introduced to filter information. This introduces a time and resource cost that may reduce the value of the information itself. Moreover, the value of the system is reduced because fewer people are allowed to use it.

Next, we’ll discuss how document security relates to search. The problem of appropriate access to information is particularly important to search applications because the search engine needs access to all of the information in a repository in order to index it, yet it must only present search results filtered by the user’s clearance. In order to do this filtering in a timely manner, the security information of the documents needs to be indexed and applied to the search algorithm.

If security is “bolted on” as an afterthought, the order that documents are listed and the estimation of total relevant documents will also have to be recalculated, and the overall feature set of the search engine may have to be limited or re-implemented to prevent “leakage”. By “leakage” I mean in general any information the search engine inadvertently provides that is outside the user’s clearance.

An example of “bolted-on” security: if our search engine gave us an unfiltered result set, we could post-process each page of results before sending it to the user. But then, in order to avoid blank or short pages, we would need to re-page the results.
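To show why bolted-on filtering forces re-paging, here is a minimal sketch in Java. The `Hit` and `PostFilterPager` types are hypothetical and exist only for this illustration, and the sketch assumes each hit carries a single classification token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical search hit: an id plus the single token needed to see it.
final class Hit {
    final String id;
    final String classification;

    Hit(String id, String classification) {
        this.id = id;
        this.classification = classification;
    }
}

final class PostFilterPager {
    /**
     * Returns page `page` (0-based) of the hits the user is cleared to see.
     * Because the ranked list is unfiltered, we walk it from the top and
     * count only visible hits; the engine's own page offsets are useless.
     */
    static List<Hit> visiblePage(List<Hit> rankedUnfiltered, Set<String> clearances,
                                 int page, int pageSize) {
        List<Hit> out = new ArrayList<>();
        int toSkip = page * pageSize;
        for (Hit hit : rankedUnfiltered) {
            if (!clearances.contains(hit.classification)) {
                continue; // hidden from this user
            }
            if (toSkip > 0) {
                toSkip--; // visible, but belongs to an earlier page
                continue;
            }
            out.add(hit);
            if (out.size() == pageSize) {
                break;
            }
        }
        return out;
    }
}
```

Every request for a later page re-scans the earlier results, and the engine's reported hit count still includes documents the user cannot see.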

An example of leakage would be, if I typed in “Scott Stults ATF raid” and got back 15 hits, I might want to tone down my Fourth of July plans!

Another well-known technique for teasing out information from a search engine involves exploiting the little document excerpts you get below the document titles in search results pages.

Also, even if highly relevant and sophisticated search capabilities can be applied to one document repository, that information is much more useful when combined with information in other repositories. This is called Federated Search. Those repositories may have their own set of classifications which need to be honored by any system providing access to them.

For example, until recently (and it may still be the case with some systems) one Federal Agency and Another had separate clearance labels, and a clearance with one meant nothing to the other. If you had clearance with both and were performing a federated search across both sets of repositories, your clearance with the first would have to be applied to the first set of repositories, and your clearance with the second would have to be applied to the second set. And then you’d need to blend the two result sets together so that you’re looking at the most relevant results from either.
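The two steps described here (filter each repository's hits with the clearance that repository recognizes, then blend) can be sketched in Java as follows. All of the type and method names are hypothetical, and the sketch assumes that relevance scores from different repositories are directly comparable, which is often not true in practice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical hit from one of several federated repositories.
final class FederatedHit {
    final String repository;
    final String id;
    final String classification;
    final double score;

    FederatedHit(String repository, String id, String classification, double score) {
        this.repository = repository;
        this.id = id;
        this.classification = classification;
        this.score = score;
    }
}

final class FederatedBlender {
    /**
     * Keeps only the hits the user is cleared for in the hit's OWN repository,
     * then orders the survivors by descending score.
     */
    static List<FederatedHit> blend(List<List<FederatedHit>> resultsPerRepository,
                                    Map<String, Set<String>> clearancesByRepository) {
        List<FederatedHit> merged = new ArrayList<>();
        for (List<FederatedHit> results : resultsPerRepository) {
            for (FederatedHit hit : results) {
                // A clearance issued for one repository means nothing to another.
                Set<String> clearances = clearancesByRepository
                        .getOrDefault(hit.repository, Collections.<String>emptySet());
                if (clearances.contains(hit.classification)) {
                    merged.add(hit);
                }
            }
        }
        merged.sort(Comparator.comparingDouble((FederatedHit h) -> h.score).reversed());
        return merged;
    }
}
```

Note that a label such as "SECRET" held for the first repository does not unlock a document labeled "SECRET" in the second, because the lookup is keyed by repository.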

That’s a difficult problem, to be sure. However, if you are able to technically implement that, you will gain a much more complete picture of all of the information available regarding your query.

This is the challenge addressed by the Manifold Connector Framework, known as ManifoldCF. ManifoldCF is an Apache Software Foundation “Incubator” project. It is composed of a set of connectors to repositories, authorities, and search engines. These connectors are joined together during indexing so that the security tokens (the classification of the documents) are indexed. Then when a user performs a search, the user’s access tokens (or clearances) are applied as search criteria.

The repository connectors that MCF includes right now are the file system, RSS feed, web, Windows Share/DFS, database, Documentum, and SharePoint. By far the ones that get the most attention on the MCF mailing lists are SharePoint and Documentum. (Personally, I will probably see more use out of the generic database connector since that’s where the documents are stored in the systems I usually work with.)

The purpose of a repository connector is to “crawl” each document in the repository and send it to the output connector. It’s also responsible for doing incremental updated crawls over new or changed documents.
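One common way to decide what an incremental crawl needs to revisit is to compare a per-document version string against the one recorded at the last crawl. The sketch below assumes that approach; the class and method names are hypothetical and this is not ManifoldCF's connector API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IncrementalCrawl {
    /**
     * Returns the ids of documents that are new or whose version string differs
     * from the one recorded at the last crawl, in sorted id order.
     * If the version string also encodes the document's security tokens,
     * a reclassification alone is enough to trigger re-indexing.
     */
    static List<String> changedSince(Map<String, String> lastIndexedVersions,
                                     Map<String, String> currentVersions) {
        List<String> changed = new ArrayList<>();
        // TreeMap gives a deterministic iteration order for the example.
        for (Map.Entry<String, String> entry : new TreeMap<>(currentVersions).entrySet()) {
            String previous = lastIndexedVersions.get(entry.getKey());
            if (!entry.getValue().equals(previous)) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }
}
```

Documents that disappeared from the repository would need separate handling (deleting them from the index), which this sketch leaves out.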

Active Directory and Documentum are some of the authority connectors. There aren’t a lot right now. They are responsible for supplying the authorization tokens (or clearances) for each person when they perform a search.

And lastly, there are the search engines, or output connectors, that MCF currently knows about: Solr and MetaCarta GTS. The output connector knows how to feed the documents it receives from the repository connector to its target search engine. (And of course Solr is everyone’s favorite!)

Before we dig into how security is enforced by ManifoldCF, let me first give you a quick overview of what MLS looks like logically. Formally, MLS fits a Boolean logic construct called “implication”: x → y = ¬(x ∧ ¬y)

The thing on the left reads as “If x then y”, and the thing on the right-hand side of the equation is how that gets translated into Boolean logic. When that logic is interpreted as a security algorithm it enforces the notion that any token in x must also be present in y. Any “extra” tokens present in y do not affect how access is granted. In English it reads “do not allow any document whose tokens are in x but not in y.”
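Applied token by token, the implication becomes a containment check. Here is a minimal sketch in Java, assuming x is "the document carries this token" and y is "the user holds this token"; the class and method names are made up for the example.

```java
import java.util.HashSet;
import java.util.Set;

final class Implication {
    // "If x then y", written as NOT (x AND NOT y).
    static boolean implies(boolean x, boolean y) {
        return !(x && !y);
    }

    /**
     * Grants access only if the implication holds for every token:
     * each token on the document must also be held by the user.
     * Extra tokens held by the user never cause a denial.
     */
    static boolean mayRead(Set<String> documentTokens, Set<String> userTokens) {
        Set<String> universe = new HashSet<>(documentTokens);
        universe.addAll(userTokens);
        for (String token : universe) {
            if (!implies(documentTokens.contains(token), userTokens.contains(token))) {
                return false;
            }
        }
        return true;
    }
}
```

The loop is equivalent to `userTokens.containsAll(documentTokens)`; it is spelled out only to make the link to the Boolean formula visible.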

This logic is hard for the Boolean logic normally implemented in search engines because unary NOTs (the dog-leg symbol right after the equals sign) are hard to evaluate. (Imagine asking Google for all documents that DON’T contain the word “security”.) In fact, Lucene does not allow unary NOTs, but Solr does.

Furthermore, sometimes it’s more efficient to specify who isn’t allowed access, rather than constructing an inverted set of every single security token. In fact, if you do construct such an inverted set so that you can index it, you will need to reconstruct it and re-index your whole repository whenever a new token is added.

By design, these two notions of authorization grants and denials together fit the way our federal government applies security markings to documents.

So now we turn our attention back to ManifoldCF and Solr. I described the logical underpinnings of MLS, and this is exactly how Manifold interacts with Solr to provide document security.

    // bf, orFilter and subUnprotectedClause are declared elsewhere in the file (not shown in this excerpt).
    // Unprotected documents: no allow tokens and no deny tokens.
    subUnprotectedClause.add(new FilterClause(new WildcardFilter(new Term(allowField,"*")),BooleanClause.Occur.MUST_NOT));
    subUnprotectedClause.add(new FilterClause(new WildcardFilter(new Term(denyField,"*")),BooleanClause.Occur.MUST_NOT));
    orFilter.add(new FilterClause(subUnprotectedClause,BooleanClause.Occur.SHOULD));
    int i = 0;
    while (i < userAccessTokens.size())
    {
      String accessToken = userAccessTokens.get(i++);
      // Grant: the document lists this token in its allow field.
      TermsFilter tf = new TermsFilter();
      tf.addTerm(new Term(allowField,accessToken));
      orFilter.add(new FilterClause(tf,BooleanClause.Occur.SHOULD));
      // Denial: the document lists this token in its deny field.
      tf = new TermsFilter();
      tf.addTerm(new Term(denyField,accessToken));
      bf.add(new FilterClause(tf,BooleanClause.Occur.MUST_NOT));
    }
    // The document must be unprotected or match at least one grant.
    bf.add(new FilterClause(orFilter,BooleanClause.Occur.MUST));
    return bf;

This is an excerpt from ManifoldCFSecurityFilter.java. (Please don’t strain your eyes!) It’s how you would write the logical implication I just showed you in Java. To use this in Solr, you simply put the compiled class where Solr can find it and include this filter (or “Search Component” in Solr parlance) in the Solr configuration file.

Elsewhere in this file is a call to ManifoldCF to retrieve the security tokens for a particular authenticated user, and we configure that within ManifoldCF when we set up the indexing job. So for any particular “hit” within Solr’s index, Manifold has mapped that document back to the right authority, the correct authorization tokens are used in the filter, and the correct filter is applied to the result set.
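Stripped of the Lucene classes, the rule that the filter excerpt encodes can be restated as a plain predicate over one document. This is a minimal sketch, assuming the excerpt's semantics are "unprotected or at least one grant matches, and no denial matches"; the class and method names are hypothetical and are not part of ManifoldCF.

```java
import java.util.List;
import java.util.Set;

final class AccessPredicate {
    /**
     * @param allowTokens      tokens indexed in the document's allow field
     * @param denyTokens       tokens indexed in the document's deny field
     * @param userAccessTokens tokens the authority returned for the user
     */
    static boolean visible(Set<String> allowTokens, Set<String> denyTokens,
                           List<String> userAccessTokens) {
        // Mirrors subUnprotectedClause: neither field has any value.
        boolean unprotected = allowTokens.isEmpty() && denyTokens.isEmpty();
        boolean granted = false;
        for (String token : userAccessTokens) {
            if (denyTokens.contains(token)) {
                return false; // mirrors the MUST_NOT clause on the deny field
            }
            if (allowTokens.contains(token)) {
                granted = true; // mirrors a SHOULD clause on the allow field
            }
        }
        // Mirrors the final MUST on orFilter.
        return unprotected || granted;
    }
}
```

A user with no tokens at all therefore sees only unprotected documents, and a single matching deny token overrides any number of matching allow tokens.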

Next, let’s briefly cover some of the other things you should consider when implementing MLS with search.

The process of filtering results based on the appropriate user authorization tokens is simple enough to be very efficient at query time. However, that process quickly becomes unwieldy when multiple authorities have to be consulted. The filter has to use each document’s provenance to determine the appropriate authority, and each authority will have a set of authorization tokens to grant the user.
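One way to keep tokens from different authorities apart is to qualify each token with the name of the authority that issued it, both at indexing time and at query time. The sketch below assumes that scheme; the `authority:token` format and all names are illustrative assumptions rather than a description of what ManifoldCF does internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AuthorityTokens {
    // Hypothetical format: the same label from two authorities yields two distinct tokens.
    static String qualify(String authority, String token) {
        return authority + ":" + token;
    }

    /**
     * Flattens the tokens granted by each authority into one list of
     * authority-qualified tokens, suitable for use as filter criteria.
     */
    static List<String> userTokens(Map<String, List<String>> tokensByAuthority) {
        List<String> qualified = new ArrayList<>();
        // TreeMap gives a deterministic authority order for the example.
        for (Map.Entry<String, List<String>> entry : new TreeMap<>(tokensByAuthority).entrySet()) {
            for (String token : entry.getValue()) {
                qualified.add(qualify(entry.getKey(), token));
            }
        }
        return qualified;
    }
}
```

With qualified tokens, a "SECRET" granted by one agency can never match a document marked "SECRET" by another, so a single filter can serve documents from every authority.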

To be sure, having to incorporate multiple user authorities is a problem only the largest organizations may face, but it isn’t a fictional problem. It’s something joint task forces face all the time. So in order for documents to have their full value there must be full agreement on what tokens are used and what they mean.

Another consideration to bear in mind is that everything I’ve described to this point falls under the category of “early binding.” The access restriction that is enforced only applies to how the documents were classified when they were last indexed. This isn’t much of a problem with older documents that have “settled” into their classification, but for new information pertaining to current events, early binding may not offer enough assurance.

For example, not much is happening in Panama these days but the US has conducted military operations there so there is probably some classified material pertaining to that, and those classifications aren’t likely to change. On the other hand, documents pertaining to missions last week may get an immediate bump in classification if they suddenly pertain to missions happening right now.

Whenever a document’s classification changes it must be removed from the index and then re-indexed with the new classification. Another way to mitigate the problem of frequently changing classification is to implement “late binding”, that is, to re-request the document’s classification immediately before presenting it to the user. But this additional assurance comes at the cost of speed. In reality, a combination of all of the above would help balance the need for quick access against a high level of assurance.
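A late-binding check can be sketched as a second pass over the hits that the early-bound filter let through. In this minimal Java sketch the lookup function stands in for a call to the source repository; the names are hypothetical and each document is assumed to have a single classification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class LateBinding {
    /**
     * Re-checks each candidate hit against its CURRENT classification.
     *
     * @param candidateIds          hits that passed the index-time (early-bound) filter
     * @param currentClassification hypothetical live lookup; returns null if the
     *                              document no longer exists
     * @param userClearances        tokens the user holds right now
     */
    static List<String> recheck(List<String> candidateIds,
                                Function<String, String> currentClassification,
                                Set<String> userClearances) {
        List<String> stillVisible = new ArrayList<>();
        for (String id : candidateIds) {
            String classification = currentClassification.apply(id);
            if (classification != null && userClearances.contains(classification)) {
                stillVisible.add(id);
            }
        }
        return stillVisible;
    }
}
```

The cost is one repository round trip per candidate, which is why it makes sense to apply this only to the page of results about to be shown rather than to the whole result set.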

Solr by itself is a wonderfully flexible platform, and as I pointed out before, its query filters allow for MLS to be implemented in a fairly simple environment. Manifold, however, greatly simplifies the process of indexing, even across multiple repositories. And in the single-repository scenario, the standardized way in which Manifold specifies and schedules indexing jobs will definitely be beneficial.

MLS is an important capability necessary to realize the full value of information stored in a document repository while also maintaining appropriate access control. When used in conjunction with Federated Search, that value is amplified by the information stored in neighboring repositories. Even if your repository isn’t federated now, Manifold and Solr significantly lower the barrier to achieving federation later at very little incremental implementation cost.

And, at the very least, ManifoldCF promises to be a consistent and reliable tool for regularly indexing documents.

Links:

Multi-Level Security Using Solr

ManifoldCF