
Recovering a 6.5GB Elasticsearch Index from Amazon S3

Update: This was run in 2013 on Elasticsearch 0.9 (I think). Elasticsearch has changed considerably in the last few years and is now on version 1.7.1, so download the most recent version of Elasticsearch rather than the one used for this test.

Summary: A 6.5GB Elasticsearch index takes roughly 10 minutes to recover to a fresh Amazon EC2 M1.Small Instance that was preconfigured to start its own cluster and pull the index from S3 upon launch.  The index was actually searchable ( although incomplete ) at around 8:30.  I did not expect the index to recover this quickly.  I'm constantly impressed by Elasticsearch.

System Configuration

I'm running an Elasticsearch index on a single Amazon EC2 instance.  It's configured to run with five shards and one replica ( default ).  This means it will work as a single node on a single machine or as a cluster with many nodes on many machines.  Elasticsearch is configured to back up to Amazon S3 using the Elasticsearch AWS Cloud Plugin.
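To make the "single node or many nodes" point concrete, here is a small Python sketch of how five primary shards with one replica each can be allocated. The function name `shard_allocation` is made up for illustration. It encodes only one rule: a replica is never placed on the same node as its primary.

```python
def shard_allocation(nodes, primaries=5, replicas=1):
    """Count shard copies that can be assigned on a cluster of `nodes` nodes.

    Each shard has one primary plus `replicas` replica copies, and two copies
    of the same shard cannot share a node.
    """
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    copies_per_shard = min(nodes, 1 + replicas)
    total = primaries * (1 + replicas)
    assigned = primaries * copies_per_shard
    unassigned = total - assigned
    # All primaries are assigned here, so the cluster is green or yellow.
    status = "yellow" if unassigned else "green"
    return {"assigned": assigned, "unassigned": unassigned, "status": status}


for n in (1, 2, 3):
    print(n, shard_allocation(n))
```

With the default settings, a single node holds the five primaries and leaves the five replicas unassigned (yellow health), which is why the index still works on one machine. A second node gives the replicas somewhere to go.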

I've created an Amazon Machine Image ( AMI ) that allows me to boot a fresh instance that will automatically launch Elasticsearch and join a running cluster.  If no cluster is found, it will pull the index from Amazon S3 onto its 50GB ephemeral drive and into memory, creating a cluster and becoming the master without me touching the instance.

Index Configuration

I've generated a large Elasticsearch index containing real data used for a client's project.  The index contains two document types, Company and File.  The Company documents contain profile data about a company: name, url, description, addresses, keywords, etc.  Each Company document can include anywhere from a few to dozens of nested objects ( addresses, industries served, industry certifications, etc ), which let us perform very specific searches that would not be possible if these were simple objects.  The Company documents are parents to the File documents.  File documents contain a few fields like title, description, keywords, etc., but mostly consist of an attached document in a specific format: doc, pdf, html, etc.
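As a rough illustration of that structure, the Python sketch below builds a mapping body in the style of Elasticsearch versions of that era. It is not the project's actual mapping. The sub-fields under `addresses`, the field name `attachment`, and the use of the `attachment` type (which needs the separate mapper-attachments plugin) are all assumptions. Only the `company` type name and the general field list come from the post.

```python
import json

# Hypothetical mapping: field names below are illustrative, not the real schema.
mappings = {
    "company": {
        "properties": {
            "name": {"type": "string"},
            "url": {"type": "string"},
            "description": {"type": "string"},
            "keywords": {"type": "string"},
            # "nested" keeps each address as its own hidden document, so a query
            # can require city AND state to match within the same address.
            "addresses": {
                "type": "nested",
                "properties": {
                    "city": {"type": "string"},
                    "state": {"type": "string"},
                },
            },
        }
    },
    "file": {
        # Parent/child link: every File document belongs to one Company.
        "_parent": {"type": "company"},
        "properties": {
            "title": {"type": "string"},
            "description": {"type": "string"},
            "keywords": {"type": "string"},
            # Assumes the mapper-attachments plugin for doc/pdf/html content.
            "attachment": {"type": "attachment"},
        },
    },
}

body = json.dumps({"mappings": mappings}, indent=2)
print(body)
```

If `addresses` were a plain object instead, the city of one address and the state of another could satisfy the same query, which is the kind of imprecise match nested objects avoid.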

Index Stats

Stats are pulled from the Elasticsearch Head Plugin ( a web-based interface to look at your index ).

  • There are 1,475,892 File documents and 454,252 Company documents.
  • The total number of documents in the index, including nested objects in the Company type, is 4,954,814.
  • The total index size is 6.5GB.
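The counts above imply how many nested objects the index holds. This short Python calculation assumes that the total covers the File documents, the Company documents, and the nested objects, and nothing else.

```python
file_docs = 1_475_892
company_docs = 454_252
total_docs = 4_954_814  # includes nested objects in the Company type

top_level_docs = file_docs + company_docs
nested_objects = total_docs - top_level_docs
avg_nested_per_company = nested_objects / company_docs

print(f"top-level documents: {top_level_docs:,}")
print(f"nested objects:      {nested_objects:,}")
print(f"average nested objects per company: {avg_nested_per_company:.2f}")
```

That works out to 1,930,144 top-level documents and 3,024,670 nested objects, or an average of about 6.7 nested objects per company.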

Launching a New Cluster

To launch a new cluster, I stopped Elasticsearch on the system that I built the index with so that my new EC2 instance would not try to join that cluster.  The new instance started its own cluster and pulled the index from S3.  Below are notes about the times it took to recover the index when launching our AMI on an M1.Small EC2 Instance with a 50GB ephemeral drive ( M1.Small is the smallest instance size that Elasticsearch will run on. Micro instances don't have enough memory available and Elasticsearch fails during launch ).

  • 0:00 minutes - Launched new AMI from Amazon EC2 Dashboard
  • ~2:30 - Elasticsearch was available via its API ( curl -XGET 'http://AMAZON_HOST:9200/' )
  • 3:45 - Index size reported by Head Plugin: 1.1gb
  • 4:35 - Index size reported by Head Plugin: 1.2gb
  • 5:35 - Index size reported by Head Plugin: 1.7gb
  • 7:00 - EC2 Instance available via SSH
  • 7:36 - Disk usage reported by df: 6.2gb
  • ~8:30 - The index was searchable even though it was not fully loaded yet.  Until this time, Head showed some shards as not yet properly allocated.  The exact time may vary between trials, so the index could become searchable sooner, though still incomplete.  ( API Request: curl -XGET 'http://AMAZON_HOST:9200/companies/company/_search?q=_all:adhesive&pretty=true&size=3' )
  • 8:50 - Index size reported by Head Plugin: 5.8gb
  • 9:10 - Disk usage reported by df: 8.1gb ( index almost fully loaded to disk )
  • ~10:00 - Index size reported by Head Plugin: 6.5gb ( index fully recovered )
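The Head Plugin readings in the timeline can be turned into rough recovery rates. The Python sketch below treats the approximate 10:00 mark as exact and assumes 1 GB = 1024 MB. Because the reported sizes are rounded to one decimal place, the results are ballpark figures only.

```python
# Index sizes reported by the Head Plugin: (elapsed "m:ss", size in GB).
samples = [("3:45", 1.1), ("4:35", 1.2), ("5:35", 1.7), ("8:50", 5.8), ("10:00", 6.5)]


def to_seconds(mark):
    """Convert an "m:ss" timeline mark to seconds since launch."""
    minutes, seconds = mark.split(":")
    return int(minutes) * 60 + int(seconds)


def rates_mb_per_s(samples, mb_per_gb=1024):
    """Average recovery rate between consecutive samples, in MB/s."""
    rates = []
    for (t0, gb0), (t1, gb1) in zip(samples, samples[1:]):
        elapsed = to_seconds(t1) - to_seconds(t0)
        rates.append((gb1 - gb0) * mb_per_gb / elapsed)
    return rates


rates = rates_mb_per_s(samples)
# Overall rate measured from instance launch, boot time included.
overall = samples[-1][1] * 1024 / to_seconds(samples[-1][0])

for (start, _), (end, _), rate in zip(samples, samples[1:], rates):
    print(f"{start} -> {end}: {rate:.1f} MB/s")
print(f"overall from launch: {overall:.1f} MB/s")
```

On these numbers the recovery starts slowly (about 2 MB/s), peaks at roughly 21 MB/s between 5:35 and 8:50, and averages about 11 MB/s over the full ten minutes.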

Notes

  • The results returned at 8:30, when the index was first searchable, were different from the results returned once it was fully loaded at ~10:00.  The early results were still correct matches, but not all Company documents were available yet, so the result set was incomplete. ( API Request: curl -XGET 'http://AMAZON_HOST:9200/companies/company/_search?q=_all:adhesive&pretty=true&size=3' )
  • The first few searches on the index are very slow, but it warms up quickly.  I'm not exactly sure what is happening here.  I have not dug deep enough yet.  As far as I know, this index is too large to load entirely into memory on an M1.Small EC2 Instance, but search performance seems to be comparable to larger instances that have enough memory to store the entire index.  I have not quantified this.
  • The index is available and searchable in about 30 seconds after stopping and restarting Elasticsearch on a machine where the index is already loaded to disk ( an M1.Small EC2 Instance ).
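Timings like "searchable at 8:30" or "about 30 seconds after a restart" could be measured with a small poller rather than by hand. The Python sketch below is one way to do it, not how the numbers above were collected. The helper names `wait_until_searchable` and `search_total` are hypothetical, `AMAZON_HOST` is the same placeholder used in the curl examples, and the code assumes the search response carries a numeric `hits.total`, as it did in Elasticsearch versions of this era.

```python
import json
import time
import urllib.request


def search_total(host="AMAZON_HOST"):
    """Run the sample query from this post and return the total hit count."""
    url = f"http://{host}:9200/companies/company/_search?q=_all:adhesive&size=3"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["hits"]["total"]


def wait_until_searchable(run_search, timeout=900.0, interval=5.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll `run_search` until it succeeds; return (elapsed_seconds, hits).

    `run_search` should return a hit count, or raise while the node is down
    or the index cannot be searched yet.
    """
    start = clock()
    while True:
        try:
            hits = run_search()
            return clock() - start, hits
        except (OSError, KeyError, ValueError):
            # Connection refused, HTTP error, or an unexpected response body.
            if clock() - start >= timeout:
                raise TimeoutError("index did not become searchable in time")
            sleep(interval)
```

Calling `wait_until_searchable(search_total)` right after launching the instance would report when the first search succeeds. Running it again later and comparing hit counts would show the difference between the incomplete and fully recovered index. The `clock` and `sleep` parameters exist only so the loop can be exercised without a live cluster.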