ElasticSearch as a primary data store

These are interesting times to be working with data, particularly because of the new possibilities that NoSQL (non-relational) databases offer for data modelling and manipulation. In addition to the new document databases, key-value stores and graph databases, there is an emerging option that has been hiding in plain sight: using your search engine as the database.

A conventional application begins with data stored in a relational database. If search and aggregation needs exceed what it can provide, an additional system is added: the search engine. The former remains the authoritative data store and the latter is a tacked-on hack that stays downstream but earns its keep by finding results quickly.

As ElasticSearch has developed to address the complications of previous search engines and adapt to modern NoSQL and hosting environments, it has come to resemble a data persistence platform in its own right. It can store and retrieve arbitrary JSON data, so it is useful for richly structured data as well as free-form text. It can be updated in near real-time, so there need not be any appreciable delay between the creation of new data and its availability for retrieval.
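As a rough illustration of what storing and retrieving JSON by ID looks like over ElasticSearch's HTTP interface, here is a minimal sketch that only builds the requests and does not send them. The index name "notes", the type name "note", the helper names and the local node on port 9200 are all assumptions made for this example. The /index/type/id URL layout matches older ElasticSearch releases; newer releases have dropped types, so check the documentation for your version.

```ruby
require "json"
require "net/http"

# Build (but do not send) a request that stores a JSON document under an ID.
# The "/notes/note/" prefix is a hypothetical index and type.
def build_store_request(id, document)
  request = Net::HTTP::Put.new("/notes/note/#{id}")
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(document)
  request
end

# Build (but do not send) a request that fetches the same document by ID.
def build_fetch_request(id)
  Net::HTTP::Get.new("/notes/note/#{id}")
end

# Arbitrary nested JSON is fine: strings, arrays, nested objects.
note = {
  "title" => "Shopping",
  "tags"  => ["food", "errands"],
  "body"  => "Buy milk"
}

store = build_store_request(1, note)
fetch = build_fetch_request(1)

# To send them to a node assumed to be running locally:
#   Net::HTTP.start("localhost", 9200) { |http| http.request(store) }
```

Keeping request construction separate from sending makes it easy to inspect what would go over the wire before pointing the code at a real cluster.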

Extra tools incur extra overheads. If ElasticSearch already has all your data in a form that is persistent, updatable, and accessible by ID lookup or search queries, why go to the effort of replicating a subset of that functionality in a separate “primary” data store?

Some applications - such as those involving full scans that process whole records and produce large results - are not suited to the retrieval and aggregation options provided by search engines. Others require sophisticated transaction support during writes. Many more are perfectly suited, but there are still a couple of reasons for you to tread carefully.

ElasticSearch provides redundancy that can protect against hardware failure, and recent versions appear free of data corruption issues, but durability and facilities for taking recoverable backups have not yet had as much attention as in other data stores. From an application programming point of view, you should also keep in mind that object-document mappers (ODMs) targeting ElasticSearch are not yet as mature and full-featured as those for document stores such as MongoDB.

Despite these considerations, ElasticSearch is certainly ready for experimental use as a data store, and depending on the requirements of your application, it may even be the only one you need.

A demonstration in Rails

To show you how simple it is, I’ve whipped up a basic note-taking application that uses ElasticSearch as its primary and sole data store. Much of the credit for this should go to Karel Minařík, whose ElasticSearch API, Tire, made it a cinch. You can find the demo application on GitHub and run it locally by following the instructions in the README file.

Read on here for tips on building your own.

Disabling ActiveRecord

After creating a fresh Rails application, the first step is to disable ActiveRecord. In config/application.rb find the line with require 'rails/all' and replace it with:

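require "rails"

# Load each framework's railtie individually, leaving ActiveRecord out.
[
  # 'active_record',
  'action_controller',
  'action_mailer',
  'active_resource',
  'rails/test_unit',
].each do |framework|
  begin
    require "#{framework}/railtie"
  rescue LoadError
    # Skip any framework that is not installed.
  end
end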

Then you can delete config/database.yml, remove gem 'sqlite3' from your Gemfile and rerun bundle install.

ElasticSearch-backed models and near real-time updates

With the database stuff out of the way you are free to create models that persist to ElasticSearch. Check out the Tire README, paying particular attention to the last part, which discusses its persistence features. You’ll be up and running in no time.

There is one small trick you should know. After adding a record to an ES-backed model, you may find that you need to reload your index page before the new record appears. This is because ElasticSearch performs updates in “near real-time”.

Near real-time means that it may take up to one second before the creation or update of a record is reflected in search results. This is different from fetching a record by its ID with a GET request, which in recent versions of ES works immediately. The one-second period can be configured to be shorter, but to guarantee that a recently indexed record shows up in search results straight away you should make an explicit refresh call.

For any application that is not write-heavy this should pose no performance problems and it is easily implemented. In my Note model I do it in these 3 lines:

refresh = lambda { Tire::Index.new(ES_INDEX_NAME).refresh }
after_save &refresh
after_destroy &refresh
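If the difference between ID lookups and search visibility is hard to picture, the following toy model may help. It is a plain in-memory simulation and not the ElasticSearch or Tire API: the ToyIndex class and all of its methods are hypothetical, and it only mimics the behaviour described above, where a write is readable by ID at once but appears in searches only after a refresh.

```ruby
# Toy in-memory model of near real-time visibility.
# ToyIndex is hypothetical; it is not part of ElasticSearch or Tire.
class ToyIndex
  def initialize
    @documents  = {} # every write lands here and is readable by ID
    @searchable = {} # snapshot that searches see, replaced on refresh
  end

  # Store a document under an ID.
  def index(id, doc)
    @documents[id] = doc
  end

  # Lookup by ID always reads the latest write.
  def get(id)
    @documents[id]
  end

  # Search only sees documents that were present at the last refresh.
  def search(term)
    @searchable.values.select { |doc| doc[:body].include?(term) }
  end

  # Make all writes so far visible to search.
  def refresh
    @searchable = @documents.dup
  end
end

index = ToyIndex.new
index.index(1, { body: "buy milk" })

found_by_id    = index.get(1)              # readable immediately
before_refresh = index.search("milk").size # 0: not yet searchable
index.refresh
after_refresh  = index.search("milk").size # 1: visible after refresh
```

In the real system the refresh also happens on its own about once a second; the explicit call in the callbacks above simply forces it right after each save or destroy.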

Do you need your database?

You might like to read through the related discussions on the mailing list. I hope you’ll give this question some thought, and have fun with ElasticSearch either way!