umbraco and Lucene.net
As you’re all probably aware, umbraco uses Lucene.net quite heavily to add search capabilities to umbraco itself. For instance, finding a document whose name you don’t exactly remember is as easy as clicking in the ‘Search’ box to the right of the ‘Create’ button in the umbraco admin interface and entering a search term. At that point, umbraco will auto-suggest a number of documents that match your search term. Clicking on such an item will take you to the document. Another search dialog can be opened by entering a ‘space’ character in the search box, which will pop up a new search dialog.
Recently, a client asked me to search for documents of a specific document type. At first, this seemed like a very simple task, but I soon realized that the search wasn’t behaving exactly as I expected it to. So I started digging into the umbraco code to find out how the search works internally. Once I had grasped how umbraco does it, I could easily adapt it to suit my needs.
Basically, each document that gets saved or published in umbraco is subject to indexing, meaning that some properties of the document are used to build an index file which can be queried at lightning speed (well, almost…). To index those properties, umbraco uses Lucene.net. I won’t go into much detail on how Lucene.net works internally, but you can think of it as an optimized index that can be queried very fast.
By the way, if you’ve ever wondered where those files reside, they’re in the /data/_systemUmbracoIndexDontDelete folder. You’ll find at least one .cfs file, a deletable file and a segments file. If one of those is missing, then your index is broken! I’ll explain further down how to re-create a broken index.
First question that popped up: what properties are being indexed? I had to go through the umbraco core code to find out that:
- Only documents are indexed
- No custom properties are indexed when a new document is created. Custom properties are only indexed when documents are saved.
- Document type is NOT indexed.
I was quite happy with the first two, but the document type not being indexed was a problem for me, since it makes a perfect candidate for indexing.
Nothing really to worry about, as we CAN of course write our own indexing mechanism. But as I didn’t want to touch the umbraco internals too much, I decided to go with an event handler that indexes extra info based on the document type whenever a document of that specific document type gets saved or published. This is quite easy using the new event model.
using System;
using System.Collections;
using System.Configuration;
using System.Threading;
using umbraco.BusinessLogic;               // ApplicationBase, Log, LogTypes
using umbraco.cms.businesslogic;           // AddToIndexEventArgs
using umbraco.cms.businesslogic.index;     // Indexer
using umbraco.cms.businesslogic.property;  // Property
using umbraco.cms.businesslogic.web;       // Document

public class IndexingEventHandler : ApplicationBase {

    public IndexingEventHandler() {
        Document.BeforeAddToIndex += new Document.IndexEventHandler(Document_BeforeAddToIndex);
    }

    void Document_BeforeAddToIndex(Document sender, AddToIndexEventArgs e) {
        // Cancel umbraco's default indexing only when this handler indexes the node itself
        e.Cancel = Index(sender);
    }

    private bool Index(Document document) {
        // Semicolon-separated list of document type aliases; a missing key means nothing is handled here
        string docTypes = ConfigurationManager.AppSettings["DocTypes"] ?? string.Empty;
        string[] contentTypeAliases = docTypes.Split(new string[] { ";" }, StringSplitOptions.RemoveEmptyEntries);
        bool useOwnIndexingMethod = false;

        // If the content type alias of the current document is in the list, index the node
        // ourselves so that nodeTypeAlias is included in the field list of the index item
        foreach (string contentTypeAlias in contentTypeAliases) {
            if (string.Compare(document.ContentType.Alias, contentTypeAlias, true) == 0) {
                useOwnIndexingMethod = true;
                break;
            }
        }

        if (!useOwnIndexingMethod)
            return false;

        // Index on a thread pool thread so that saving/publishing is not held up
        ThreadPool.QueueUserWorkItem(
            delegate {
                try {
                    Hashtable fields = new Hashtable();
                    foreach (Property property in document.getProperties) {
                        // Property values may be null; index those as empty strings
                        fields[property.PropertyType.Alias] = property.Value == null ? string.Empty : property.Value.ToString();
                    }
                    // Extra field used to filter search results by document type
                    fields["nodeTypeAlias"] = document.ContentType.Alias;
                    // Remove any existing entry first to avoid duplicates
                    Indexer.RemoveNode(document.Id);
                    Indexer.IndexNode(document.nodeObjectType, document.Id, document.Text, document.User.Name, document.CreateDateTime, fields, true);
                } catch (Exception ee) {
                    Log.Add(LogTypes.Error, document.User, document.Id, string.Format("Error indexing document from Eventhandler class: {0}", ee));
                }
            });
        return true;
    }
}
I’ve created my very own function to index a document. First, I check whether the document is of one of the configured document types (the aliases come from the ‘DocTypes’ app setting). If so, I include an extra field (nodeTypeAlias) in the set of fields to index and hand the node over to umbraco’s own Indexer. The code also removes the node from the index first (if it’s there already) to avoid duplicate entries. If the document is indexed by this event handler, I cancel the event to prevent umbraco from indexing the node again (as this event handler has done it already). If the event handler doesn’t have to index the node, the event isn’t cancelled and umbraco takes care of the indexing.
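The handler reads its list of document type aliases from an app setting named DocTypes and splits it on semicolons. A minimal web.config fragment could look like the one below; the three aliases are made-up placeholders, so substitute the aliases of your own document types.

```xml
<appSettings>
  <!-- Hypothetical aliases: replace with your own document type aliases, separated by semicolons -->
  <add key="DocTypes" value="NewsItem;Event;BlogPost" />
</appSettings>
```

The comparison in the handler is case-insensitive, so the casing of the aliases in this setting should not matter.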
How do I know my event handler has indexed the node correctly, including the nodeTypeAlias field? I can’t, unless I build a search form OR start using a nifty tool called Luke. Luke can be found at http://www.getopt.org/luke/ and it’s super easy to use. Just use the Java WebStart version (assuming you’ve got the requirements for Java in place; details are on the site).
Point it to the path of the Lucene index files (the /data/_systemUmbracoIndexDontDelete folder), tick ‘Open in Read-Only mode’ and hit ‘OK’.
The second tab, ‘Documents’, lets you browse all documents that are indexed.
If you’d like to search for a specific term, go to the third tab, ‘Search’, enter a search term and hit the ‘Search’ button.
Ok… now we want to perform the search ourselves using the Lucene.net search API. This turned out to be quite easy, because I could just copy umbraco’s internal search code and modify it slightly so that it only returns documents of a specific document type.
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using umbraco.cms.businesslogic.index;

public class Searcher {

    public List<SearchResult> Search(string[] nodeTypeAliases, string keyword, int maxResults, string[] searchFields) {
        List<SearchResult> searchResults = new List<SearchResult>();
        IndexSearcher searcher = new IndexSearcher(Indexer.IndexDirectory);

        try {
            QueryParser parser = new QueryParser("Content", new StandardAnalyzer());
            // The trailing wildcard turns the keyword into a prefix search
            Query query = parser.Parse(keyword + "*");

            SortField[] sortFields = { new SortField("SortText") };
            Hits hits = searcher.Search(query, new Sort(sortFields));
            int numberOfHits = hits.Length();

            for (int i = 0; i < numberOfHits; i++) {
                // Honor maxResults when a positive limit is given
                if (maxResults > 0 && searchResults.Count >= maxResults)
                    break;

                bool includeInSearch = false;

                try {
                    Lucene.Net.Documents.Document doc = hits.Doc(i);

                    try {
                        // Only nodes indexed by the event handler carry this field
                        string nodeTypeAliasField = doc.Get("field_nodeTypeAlias");

                        foreach (string nodeTypeAlias in nodeTypeAliases) {
                            if (string.Compare(nodeTypeAlias, nodeTypeAliasField, true) == 0)
                                includeInSearch = true;
                        }
                    } catch { } // A hit without a usable field is simply skipped

                    if (includeInSearch) {
                        SearchResult result = new SearchResult(int.Parse(doc.Get("Id")), doc.Get("Text"), doc.Get("CreateDate"));
                        foreach (string searchField in searchFields)
                            result.Fields.Add(searchField, doc.Get("field_" + searchField));
                        result.Fields.Add("nodeTypeAlias", doc.Get("field_nodeTypeAlias"));
                        searchResults.Add(result);
                    }

                } catch (Exception e) {
                    throw new Exception("Error in search", e);
                }
            }
        } finally {
            // Always release the index, whether or not the search succeeded
            searcher.Close();
        }

        return searchResults;
    }
}
The most important part is the try/catch block that reads the ‘field_nodeTypeAlias’ field of each ‘Hit’ returned from the search. (Judging by the field names used in the search code, custom fields end up in the index with a ‘field_’ prefix, which is why the ‘nodeTypeAlias’ field added by the event handler is read back as ‘field_nodeTypeAlias’.) If the field exists and its value matches one of the requested aliases, we know we’re dealing with a document of a document type we’d like to include in the search results. The only thing left is to build a small datagrid/datalist that consumes the results and displays them to the user.
During development, you’ll probably hit the ‘index broken’ wall at some point! It shouldn’t happen too often in normal use, but it does happen quite a lot during development. If it does, delete all files from the /data/_systemUmbracoIndexDontDelete folder. Once done, save (or save and publish) the documents again so the index gets rebuilt. Alternatively, you can use the umbraco reindexing page at /umbraco/reindex.aspx. (For the latter, make sure you’ve closed the Luke utility first.)
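The Searcher class relies on a SearchResult type that isn’t shown in this post. As a rough sketch only, and assuming nothing more than what the search code needs (a three-argument constructor and a Fields collection with string keys and values), it could look something like this. The property names and the case-insensitive dictionary are my own assumptions, not the original class.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of the SearchResult class used by Searcher;
// the original class is not part of this post.
public class SearchResult {
    public int Id { get; private set; }
    public string Text { get; private set; }
    public string CreateDate { get; private set; }

    // Extra indexed fields, keyed by property alias (without the "field_" prefix).
    // Values may be null when a hit does not carry the requested field.
    public Dictionary<string, string> Fields { get; private set; }

    public SearchResult(int id, string text, string createDate) {
        Id = id;
        Text = text;
        CreateDate = createDate;
        Fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    }
}

// Possible call site ("NewsItem" and "bodyText" are made-up aliases):
//   List<SearchResult> results = new Searcher().Search(
//       new string[] { "NewsItem" }, "umbraco", 20, new string[] { "bodyText" });
```

With a dictionary like this, keep ‘nodeTypeAlias’ out of the searchFields array: the search code adds that key itself, and adding the same key twice throws an ArgumentException.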
Happy coding!