If you’ve read my post about the latest rewrite of this site, you know that the search engine working in the background here is Solr, an open-source search server originally developed at CNET. During that migration I kind of learnt the basics of working with it on its own (without some of the fancy wrappers that do the whole configuration for you by looking at your model layer), but the schema.xml was still kind of scary to me. So when Packt gave me, once again, the chance to review a book about something that I wanted to learn, this time “Solr 1.4 Enterprise Search Server” by David Smiley and Eric Pugh, I couldn’t resist :-)
Disclaimer: This is based on a review copy generously provided by Packt Publishing.
First of all, the title is in my opinion a bit misleading. Yes, Solr is a so-called “Enterprise Search Server”, but this shouldn’t stop you from using something like it even for small projects. Users usually expect some kind of search facility on sites of any size, and integrating Solr with, for instance, Django is really easy. So just do it :-)
The book consists of nine chapters. It starts off with an introduction to working with Solr and then steadily works through topics like how you should design your schema and how you can index various kinds of content, up to advanced topics like scaling your search setup in various dimensions. So let’s go through the chapters:
Chapter 1: Quickstart
This chapter answers the “What is Solr anyway?” question and shows where it could fit into the typical technology stack of your web project.
It also describes the basic installation process and where to get the software in the first place. The authors also don’t shy away from showing you XML dumps right in the first chapter with your first search query.
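To give an idea of what such a first query and its XML dump look like from the client side, here is a minimal Python sketch. It is my own illustration, not an example from the book: the base URL assumes a Solr instance running locally on the default port, and the sample response is hand-written in the shape of Solr's XML response format.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumption: a local Solr instance; adjust host, port and path to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(query, rows=10):
    """Return the URL for a simple search request."""
    return SOLR_SELECT_URL + "?" + urlencode({"q": query, "rows": rows})


# Hand-written sample (not real output) in the shape of a Solr XML response.
SAMPLE_RESPONSE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">post-1</str>
      <str name="title">Hello Solr</str>
    </doc>
  </result>
</response>"""


def parse_docs(xml_text):
    """Turn every <doc> element into a dict of field name -> text value."""
    root = ET.fromstring(xml_text)
    return [
        {field.get("name"): field.text for field in doc}
        for doc in root.iter("doc")
    ]


print(build_select_url("solr"))
print(parse_docs(SAMPLE_RESPONSE))
```

The field names `id` and `title` are hypothetical; what actually comes back depends on the fields defined in your schema.xml.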
Chapter 2: Schema and Text Analysis
One of the great aspects of working with Solr is that you can use it not only for your usual “user types in search term and gets result” kind of functionality but also for many other content listing aspects. But depending on what you want to use your search store for, you have to structure the schema accordingly.
The second chapter explains exactly this step in detail: how to denormalize data from your primary data store in order to make all your planned search queries possible. Here you learn about field types, fields and many more things that make up your Solr instance’s schema.xml. The basic structure of your schema.xml is actually pretty simple, but in combination with analyzers and filters you can do tons of things with these fields.
The analyzer chains (one for query-time, one for index-time) are what allow features like stemming and working with synonyms, stop-words, etc. And contrary to what the authors wrote on page 50, “filters” is IMO not a bad name for the analysis steps that are executed after the initial tokenizing. Data goes through them and ideally gets transformed :-)
But already during tokenizing you can do quite a bit. For example, I wasn’t aware that there were already special tokenizers for working with HTML content. After reading that I immediately changed my schema.xml here and re-indexed all posts :-P
The authors also explain a bit about which filters you’d normally apply at index-time and which at query-time, although the synonym filter IMO was not such a good choice for an index-time filter. At least to me as a newbie, it looks cheaper and more practical to primarily do the synonym conversion at query-time than to rebuild the whole index once you learn of a new synonym.
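To illustrate the idea of two analyzer chains, here is a toy model in Python. It is only a sketch of the concept and has nothing to do with how Solr actually implements analysis; the stop-word list, the synonym table and all function names are made up for this example. The synonym step sits only in the query-time chain, which is the variant I argued for above.

```python
import re

STOP_WORDS = {"the", "a", "of"}          # hypothetical stop-word list
SYNONYMS = {"tv": ["tv", "television"]}  # hypothetical synonym table


def tokenize(text):
    """Split the text into word tokens (the 'tokenizer' step)."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def expand_synonyms(tokens):
    """Replace each token by itself plus its known synonyms."""
    expanded = []
    for token in tokens:
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded


def analyze(text, filters):
    """Tokenize once, then pass the tokens through each filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


INDEX_CHAIN = [lowercase, remove_stop_words]
QUERY_CHAIN = [lowercase, remove_stop_words, expand_synonyms]

print(analyze("The History of TV", INDEX_CHAIN))
print(analyze("The History of TV", QUERY_CHAIN))
```

In this toy setup, adding an entry to `SYNONYMS` only changes what the query chain produces, so nothing that went through the index chain would have to be processed again.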
As data source for this and all the following chapters the authors have chosen the MusicBrainz database, which includes data on thousands of songs and artists. It could perhaps be argued that this might not be the best foundation for a supposedly full-text search service, but since Solr can also be used as some kind of denormalized frontend to an RDBMS it makes sense :-)
Chapter 3: Indexing Data
So now that you know what you want to have in your index, you have to get the data into it. And the book goes into quite a lot of detail regarding all the options you have there. They range from using the usual HTTP interface to post the documents, to letting Solr itself fetch the data from a passed URL or file path, or directly from some database using JDBC. Every method is described with an example configuration, which is really great. It never felt like just a listing of features.
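For the plain HTTP route, the documents are wrapped in an XML update message consisting of an `add` element with one `doc` per document. The following Python sketch only builds such a message; it is my own illustration rather than an example from the book, and the field names are hypothetical. Sending it (a POST to Solr's update handler, followed by a commit) is left out on purpose.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Build an <add> update message from a list of field dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; the field names must match your own schema.xml.
message = build_add_message([{"id": "artist-1", "name": "Example Artist"}])
print(message)
```

Using ElementTree instead of string formatting means that special characters in field values get escaped for you.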
This chapter also describes Solr Cell which is a tool for indexing PDFs, Word documents and much more.
Chapter 4: Basic Searching
The fourth chapter then goes into how you can get data out of the index again with “basic” search queries. And I was really surprised how many options you have here. Solr also offers a way to debug your search result in order to understand why a certain result-set was generated. And the authors seem to absolutely love this feature, using it quite a lot in this and the following chapters :-) And it definitely comes in handy when explaining the results after some boosting, a term that should perhaps have been described in the first chapter already, but anyway.
Ah, and if you’re looking for a simple explanation of the query syntax (for the standard query parser), you can find it here too :-)
Chapter 5: Enhanced Searching
If you want to dynamically influence the score documents get on certain queries, this chapter presents function queries, which offer exactly that. These seem to be interesting for situations where you, for example, need the age of a document to influence its score without completely overwhelming other criteria.
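To make the “age influences the score but doesn't overwhelm it” idea a bit more concrete, here is a small Python sketch of the arithmetic behind a reciprocal function of the form a / (m * x + b), which is how I understand Solr's `recip` function to be defined. The constants and the `age_boost` helper are my own choices for illustration and are not taken from the book.

```python
def recip(x, m, a, b):
    """Reciprocal function: a / (m * x + b)."""
    return a / (m * x + b)


MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000

# Assumption: scale m so that a one-year-old document gets half the boost
# of a brand-new one (with a = b = 1).
M = 1 / MS_PER_YEAR


def age_boost(age_ms):
    """Boost factor for a document that is age_ms milliseconds old."""
    return recip(age_ms, M, 1.0, 1.0)


for years in (0, 1, 5, 10):
    print(years, round(age_boost(years * MS_PER_YEAR), 3))
```

The boost falls off quickly at first and then flattens out, and it never reaches zero, so very old documents still keep a small positive factor instead of dropping out of the ranking.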
Another nice feature is the dismax query parser, which offers a simplified search syntax; something that you might want to expose directly to the user front-end. What kind of confused me here was that the authors said that the dismax request handler was deprecated. After reading it again and looking it up on the wiki it was clear that this only applies to the request handler and not to the query parser, but IMO this information should have been put into some info box here.
Also part of this chapter is faceting, which probably should have been given its own chapter. For me personally it’s just one of those features that make Solr (and other solutions offering this feature) so interesting compared to a basic full-text search service, and another reason why I wanted to read this book: to learn more about how to use faceting and how to use it better. The authors also give a really nice example in combination with a pattern tokenizer and synonym filter to provide facets for ranges of characters (like A-D, E-H, etc.). Very nice :-)
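What such character-range facets boil down to can be sketched in a few lines of Python. To be clear, this is not the book's approach (which does the mapping inside Solr with a pattern tokenizer and a synonym filter) but just a client-side analogy; the ranges and the sample names are made up.

```python
from collections import Counter

# Hypothetical ranges; anything not starting with a letter ends up in "other".
RANGES = ["A-D", "E-H", "I-L", "M-P", "Q-T", "U-Z"]


def bucket_for(name):
    """Map a name to the character range its first letter falls into."""
    first = name.strip()[:1].upper()
    for char_range in RANGES:
        low, high = char_range.split("-")
        if low <= first <= high:
            return char_range
    return "other"


def facet_counts(names):
    """Count how many names fall into each range, like a facet would."""
    return Counter(bucket_for(name) for name in names)


sample = ["Alpha Band", "Delta Trio", "Echo Choir", "Zulu Quartet", "42 Strings"]
print(facet_counts(sample))
```

The nice thing about doing this inside Solr instead is that the counts come back together with the search result, so you don't have to fetch all names first.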
Chapter 6: Search Components
Search components are Solr’s extension mechanism, which includes the already mentioned facet search component. In this chapter the authors present a handful of other core and 3rd-party components like highlighting and geo-spatial searching that can be attached to a query handler.
What I really liked about this chapter was the inclusion of those 3rd-party components. I so want to use the LocalSolr (geo-spatial) component right now somewhere ;-)
Part of this chapter is also the “More like this” component, which is great for listing related content like related blog posts etc. The spell-checking component looks great, too, and, once again, I really have to mention the amount of detail the authors provide with each example.
Chapter 7: Deployment
Chapter 7 contains information about how to deploy Solr either using your favorite servlet container or the bundled Jetty. A great tip here is to give all your search interfaces their own handler for abstraction.
My personal highlight of this chapter was the part about Solr cores. Having a staging core that is active during indexing and then swapping it with the current live-core really looks like a nice approach to provide consistent and fast search results even during high-load updates.
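As far as I understand it, the swap itself is just a single request to Solr's core admin handler with `action=SWAP`. The following Python sketch only builds that URL; the host, the admin path and the core names `live` and `staging` are assumptions for illustration and depend entirely on how your own setup is configured.

```python
from urllib.parse import urlencode

# Assumption: local Solr with the core admin handler mounted at /admin/cores.
CORE_ADMIN_URL = "http://localhost:8983/solr/admin/cores"


def build_swap_url(core, other):
    """Return the core admin URL that swaps the two named cores."""
    params = {"action": "SWAP", "core": core, "other": other}
    return CORE_ADMIN_URL + "?" + urlencode(params)


# Hypothetical core names: index into "staging", then swap it with "live".
print(build_swap_url("live", "staging"))
```

After the swap, requests addressed to the live core's name are answered by the freshly built index, and the old index is available under the staging name for the next round of indexing.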
The last part of this chapter is all about security and how you can secure Solr to some extent using the security facilities of your container or front-end server.
Chapter 8: Integrating Solr
Chapter 9: Scaling Solr
The last chapter is all about scaling Solr wide, high and deep. The authors even went so far as to provide a disk image for Amazon EC2 in order to let you try Solr’s wide-scaling features easily for yourself. Simply a great last chapter :-)
I really love this book. It showed me so many features of Solr that I didn’t know before that I still can’t even decide on the order in which I want to integrate some of them into my projects :D
The authors didn’t just list features and give some short explanation for all the configuration options but also always gave examples (using the MusicBrainz database as original data store) of what to use those features for. What’s also great is that the book isn’t limited to what is bundled with the Solr distribution but also mentions quite a lot of components that are provided by 3rd parties. With the MusicBrainz database and the work on it there is a common thread throughout the whole book, which makes it really pleasant to read.
If you’re looking for a book about how to work with Solr, this is definitely something for you. For me personally reading this book was just a joy :D