SOLR Presentation at Montreal On Rails


I’m back from a very relaxing week in Mexico.  I strongly recommend this resort: Valentin Imperial Maya in Riviera Maya.  Great place, great food!

Alright.  Here are the show notes of my presentation on SOLR at Montreal on Rails on August 19th.

First, the slides are on SlideShare.

SOLR is a Java-based search server built on Lucene; the acts_as_solr plugin connects it to Rails.  Other full-text search options are Ferret, Ultrasphinx and Xapian.

You basically install the acts_as_solr plugin, configure it and start the server using a rake task: rake solr:start. You also have to create temporary folders in the plugin folder.



script/plugin install git://github.com/railsfreaks/acts_as_solr.git
mkdir vendor/plugins/acts_as_solr/solr/logs
mkdir vendor/plugins/acts_as_solr/solr/tmp

Now look at the file config/solr.yml that was created by the plugin.  You can customize it if you want.  Then, generate documentation (very handy) and start SOLR:



rake doc:plugins
rake solr:start

Then, you can test that SOLR is in fact running by going to: http://localhost:8982/solr/. This is a very handy tool to test the searches and verify that model instances have been properly indexed.
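If you would rather check from a script than from the browser, here is a minimal sketch using Ruby's standard Net::HTTP.  The helper names solr_admin_uri and solr_running? are mine, not part of the plugin, and the host and port are simply the ones from the URL above.

```ruby
require "net/http"
require "uri"

# Assumed defaults: host and port match the URL used in this post.
def solr_admin_uri(host = "localhost", port = 8982)
  URI("http://#{host}:#{port}/solr/")
end

# Returns true when the server answers with a 2xx or 3xx status,
# false when nothing is listening or the request times out.
def solr_running?(uri = solr_admin_uri)
  response = Net::HTTP.start(uri.host, uri.port, open_timeout: 2, read_timeout: 2) do |http|
    http.get(uri.path)
  end
  response.is_a?(Net::HTTPSuccess) || response.is_a?(Net::HTTPRedirection)
rescue SystemCallError, SocketError, Timeout::Error
  false
end
```

Calling solr_running? right after rake solr:start may return false for a few seconds while the Java process boots.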

In your model, you simply have to add "acts_as_solr" and the model will be fulltext indexed. In my example, my model is named Tip. SOLR will index model instances when they are saved. To reindex existing instances, you can simply go through each of them and call save() or you can call rebuild_solr_index from the script/console:



script/console
> reload!
> Tip.rebuild_solr_index

To do a search, it's very easy: Tip.search "something".

Scores

Pass :scores => true to find_by_solr and each result will have a solr_score attribute.

Tip.find_by_solr('foo', :scores => true)
number_to_percentage( tip.solr_score*100, :precision => 0 )

Additional fields

By default, SOLR indexes all model attributes.  If you want to index a virtual attribute, give the option :additional_fields to acts_as_solr:

acts_as_solr :additional_fields => [:searchable_tags]
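To make "virtual attribute" concrete: it is just a method on the model that returns text, not a database column.  The sketch below uses plain Ruby stand-ins (a Tag struct and a stripped-down Tip class) instead of ActiveRecord, and the tags collection is an assumption of mine about what searchable_tags could be built from.

```ruby
# Stand-in for an ActiveRecord tag; only the name matters here.
Tag = Struct.new(:name)

# Stand-in for the Tip model, reduced to what the example needs.
class Tip
  attr_reader :title, :tags

  def initialize(title, tags)
    @title = title
    @tags = tags
  end

  # Virtual attribute: not a database column, just a method whose
  # return value would be handed to the indexer as text.
  def searchable_tags
    tags.map(&:name).join(" ")
  end
end

tip = Tip.new("Pure Ruby", [Tag.new("ruby"), Tag.new("search")])
puts tip.searchable_tags
```

In a real model the same method would sit next to the acts_as_solr declaration.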

Specific fields

If you don't want all the attributes to be indexed, use the :fields option to specify the attributes you want to have indexed (you can include virtual attributes):

acts_as_solr :fields => [:title, :body, :searchable_tags]

Boost

By default, all attributes have the same weight in the search.  You can boost models/attributes by using the :boost option:

acts_as_solr :fields => [:body, {:title => {:boost => 100.0 }}, :featured, :searchable_tags], :boost => 10.0
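To get a feel for what a field boost does to ranking, here is a toy model in plain Ruby.  It is NOT Lucene's scoring formula, just a weighted term count, and it ignores the model-level :boost; FIELD_BOOSTS and toy_score are names I made up for the illustration.

```ruby
# Field weights mirroring the title boost in the declaration above;
# a field without an explicit boost counts as 1.0.
FIELD_BOOSTS = { title: 100.0, body: 1.0 }.freeze

# Toy score: occurrences of the term in each field times that field's boost.
def toy_score(doc, term)
  FIELD_BOOSTS.sum do |field, boost|
    doc.fetch(field, "").downcase.scan(term.downcase).size * boost
  end
end

docs = [
  { title: "Deploying Rails", body: "Use solr for search; solr is fast." },
  { title: "Solr basics",     body: "An introduction." }
]

# Highest score first.
ranked = docs.sort_by { |doc| -toy_score(doc, "solr") }
puts ranked.map { |doc| doc[:title] }
```

The second document mentions the term only once, but because the match is in the boosted title it ranks ahead of the first.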

Range

You can tell SOLR to treat an attribute as an integer or float range.  This will allow you to search for intervals:

acts_as_solr :additional_fields => [ {:seconds => :range_integer} ]

Then, you can search for an interval:

Tip.find_by_solr('seconds:[0 TO 30]')
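If the bounds come from user input, a small helper keeps the query string tidy.  This is only a sketch: solr_range is a hypothetical helper of mine, not something the plugin provides.

```ruby
# Builds an inclusive Solr range clause such as "seconds:[0 TO 30]".
# Passing nil for a bound produces "*", Solr's open-ended marker.
def solr_range(field, from, to)
  "#{field}:[#{from.nil? ? '*' : from} TO #{to.nil? ? '*' : to}]"
end

puts solr_range(:seconds, 0, 30)
puts solr_range(:seconds, 60, nil)
```

The resulting string would then be passed to find_by_solr like the literal query above.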

Pagination

The find_by_solr method accepts pagination and sorting options: :limit, :offset, :order.
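Most views think in page numbers rather than offsets, so here is a minimal sketch of the conversion.  solr_page_options is a hypothetical helper of mine and the default of 10 per page is arbitrary.

```ruby
# Converts a 1-based page number into the :limit and :offset options.
# Page numbers below 1 are treated as page 1.
def solr_page_options(page, per_page = 10)
  page = [page.to_i, 1].max
  { :limit => per_page, :offset => (page - 1) * per_page }
end

options = solr_page_options(3, 20)
puts options.inspect

# With the plugin loaded, the hash would be passed along, e.g.
# Tip.find_by_solr("foo", options)
```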

Multi-model search

You can search in multiple models by giving :models to the multi_solr_search method.  You have to invoke the method on one model and list the other ones:

Tip.multi_solr_search( "pure", :models => [Category, Comment] )

Return IDs only

Sometimes you only want the instance IDs instead of all their attributes.  You might want to do that in order to perform a SQL query after the full-text search and limit that query to the IDs SOLR returned.

Tip.find_id_by_solr('pure').docs
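Here is a sketch of that follow-up step, in plain Ruby so it runs without Rails.  The ids array stands in for what the IDs-only search returns, the published column is a made-up example, and sort_like is a hypothetical helper; rows are hashes here, whereas with ActiveRecord you would use record.id.

```ruby
# Stand-in for the IDs returned by the IDs-only search, in relevance
# order. Sample values only.
ids = [12, 7, 31]

# Rails 2-style conditions array restricting a SQL query to those IDs.
# The `published` column is a hypothetical example.
conditions = ["id IN (?) AND published = ?", ids, true]

# With ActiveRecord loaded this would be used as:
# Tip.find(:all, :conditions => conditions)

# SQL does not keep the relevance order of the IDs, so restore it afterwards.
def sort_like(records, ids)
  position = ids.each_with_index.to_h
  records.sort_by { |record| position.fetch(record[:id]) }
end

# Rows as a database might return them, ordered by primary key.
rows = [{ :id => 7 }, { :id => 12 }, { :id => 31 }]
puts sort_like(rows, ids).map { |row| row[:id] }.inspect
```

The re-sorting step matters only if you want to keep SOLR's ranking after the SQL filter.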

Facets

Faceting allows you to have statistics on result groups.  For example, you could have the number of results per Tip category.  This is an "advanced" topic and I encourage you to read the faceting article that you will find in my resources list below.
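To show what those statistics look like, here is the same idea computed by hand in plain Ruby.  This is not the plugin's faceting API (SOLR does the counting for you on the server); the Result struct and the sample data are assumptions for the illustration.

```ruby
# Stand-in for search results; only title and category are needed.
Result = Struct.new(:title, :category)

results = [
  Result.new("Pure CSS menus", "css"),
  Result.new("Pure Ruby parser", "ruby"),
  Result.new("Pure functions", "ruby")
]

# Count how many results fall into each category.
facet_counts = results.each_with_object(Hash.new(0)) do |result, counts|
  counts[result.category] += 1
end

facet_counts.each { |category, count| puts "#{category} (#{count})" }
```

This is the kind of "category (count)" list you typically show next to search results.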

French accents

Now, what about French accents in a field? Boom... out of the box, this SOLR plugin treats accented characters as whitespace. So if you have "crédit" in a model, you will not be able to find it with "credit".  Look at the SOLR analyzer and you will see how it handles indexing and searching: http://localhost:8982/solr/admin/analysis.jsp?highlight=on

There is a way to fix this. You basically have to modify the filtering sequence in the SOLR schema (configuration). This is in the schema.xml file under vendor/plugins/acts_as_solr/solr/solr/conf. Modify the file with the following lines:



<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

The ISOLatin1AccentFilterFactory filter replaces accented characters with their unaccented equivalents, so "crédit" and "credit" end up as the same term.  The SnowballPorterFilterFactory with language="French" stems French words, so singular and plural forms of many words match.  You can add additional filters (such as one that strips HTML markup).  Have a look at this page, it lists them all.
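If you want to see the effect of accent folding without touching SOLR, here is a rough plain-Ruby approximation.  It is not the filter's actual implementation: fold_accents is a helper name of mine, and it relies on String#unicode_normalize (Ruby 2.2 or later).

```ruby
# Decompose accented characters (NFD), then drop the combining marks,
# which leaves the base letters behind.
def fold_accents(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts fold_accents("crédit")
puts fold_accents("Montréal")
```

Applying the same folding at index time and at query time is what lets "credit" find "crédit".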

Caveat: those filters will apply to all the attributes.  This works well if you are integrating SOLR in a French-only site, but it will not work so well on a bilingual site.  This is where I want to eventually spend some time creating new field types based on language (e.g. text_fr, text_en).  This would allow having different sets of filters by field type.  I'll write a blog entry when I get this done.

Resources

Look at the following links for additional information:

Acts as Solr Plugin
acts_as_solr : search and faceting
Advanced acts_as_solr
Solr: Indexing XML with Lucene and REST
acts_as_solr on GitHub

And read recipe 11 "Faceted Search with SOLR" in the Advanced Rails Recipes book.

4 thoughts on “SOLR Presentation at Montreal On Rails”

  1. Pingback: Montreal on Rails » Blog Archive » 10 was a success!

  2. Do you use Capistrano? I remember having problems putting SOLR in background (capistrano would not stop).

    If so, here's the command I have in my recipe to start SOLR in the background:

    run "cd #{current_path} && nohup rake solr:start RAILS_ENV=#{rails_env} > log/solr.log 2> log/solr.err.log"

    I also changed the SOLR rake file (plugins/acts_as_solr/lib/tasks/solr.rake) to not send STDOUT and STDERR to log files (just remove the redirections in the exec statement). My Capistrano command takes care of redirecting those outputs to the log files.

    Maybe this will help.
