Elasticsearch is a search engine ideal for full text search.

Here is an overview of key aspects.

Many companies use it to power some or all of the search and autocomplete features on their platforms.

To get an idea of which companies use it and how, visit StackShare.

Sections

Setting up Elasticsearch
Terminology
curl commands to index, update, search

Setup

Remote Hosting

If you plan to use a hosted offering, take a look at

Amazon’s Elasticsearch Service
Google Cloud Launcher
Hosting from Elastic (elastic.co), the maintainers of Elasticsearch

Using Docker

docker pull docker.elastic.co/elasticsearch/elasticsearch:5.6.0
# folder where the Elasticsearch data will be saved locally
mkdir esdata
docker run -p 9200:9200 -e "http.host=0.0.0.0" -e "transport.host=127.0.0.1" -v "$(pwd)/esdata:/usr/share/elasticsearch/data" docker.elastic.co/elasticsearch/elasticsearch:5.6.0

Without Docker

# Linux/Mac
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.0.tar.gz

# Windows
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.0.zip

Extract the archive and run

bin/elasticsearch

To verify everything is working, navigate to http://localhost:9200/. You should see something like this:

{
    "name": "kl9yfPi",
    "cluster_name": "elasticsearch",
    "cluster_uuid": "j3jdeZJuQciCEaPibQPtzg",
    "version": {
        "number": "5.6.0",
        "build_hash": "781a835",
        "build_date": "2017-09-07T03:09:58.087Z",
        "build_snapshot": false,
        "lucene_version": "6.6.0"
    },
    "tagline": "You Know, for Search"
}
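
If you would rather script this check than open a browser, something along these lines should work. It is only a sketch: it assumes the node is reachable at http://localhost:9200/ as above, and `summarize` and `fetch_root` are names made up for this example, not part of any Elasticsearch client.

```python
import json
from urllib.request import urlopen


def summarize(body: str) -> str:
    """Pull the cluster name and version number out of the root response."""
    info = json.loads(body)
    return f'{info["cluster_name"]} running Elasticsearch {info["version"]["number"]}'


def fetch_root(base_url: str = "http://localhost:9200/") -> str:
    """Fetch the root endpoint; raises URLError if the node is unreachable."""
    with urlopen(base_url, timeout=5) as response:
        return response.read().decode("utf-8")
```

With the node running, `print(summarize(fetch_root()))` prints the cluster name and version taken from the response.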

Terminology

Document

Any item that needs to be stored in Elasticsearch.
Examples of a document include artist info, song info and lyrics, user profiles, news articles, and tweets.

Any item that can be represented as JSON can be a document. A document is composed of multiple fields that may be indexed.

Here is a sample document for a movie.

{
   "movie_id":1,
   "title":"Hidden Figures",
   "director":"Theodore Melfi",
   "actors":[
      "Taraji P. Henson",
      "Octavia Spencer",
      "Janelle Monáe"
   ],
   "duration":127,
   "lang":"en",
   "genres":[
      "Biography",
      "drama",
      "history"
   ],
   "description":"",
   "release date":"2017/6/1"
}

Once a document is indexed, you can query on any of its fields. You can ask questions like

All biography movies released this year
All movies with a certain actor
All movies that have certain words in the description

Type
Defines the schema and mapping shared by a group of documents.

You can specify how you want each field to be indexed.

An index can have multiple types.
If you are indexing a movie site, there would be types for person (actor, director), movie, and review.

Indices

An index is the data structure Elasticsearch uses to store information for fast retrieval. Elasticsearch stores information in an “inverted index”. Here is an example of an inverted index.

Movie 1: Aladdin is a 1992 American animated musical fantasy film produced by Disney.

Movie 2: Mulan is a 1998 American animated musical action comedy-drama film produced by Disney

Term	(Doc id, pos)
Aladdin	(1,1)
Mulan	(2,1)
American	(1,5), (2,5)
Disney	(1,12), (2,13)
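
To see where those (doc id, position) pairs come from, here is a small sketch that builds the same table from the two sentences. It is a simplification: it only splits on whitespace and strips trailing punctuation, whereas Elasticsearch runs text through an analyzer (which, among other things, normally lowercases terms). `build_inverted_index` is just a name chosen for this example.

```python
from collections import defaultdict

docs = {
    1: "Aladdin is a 1992 American animated musical fantasy film produced by Disney.",
    2: "Mulan is a 1998 American animated musical action comedy-drama film produced by Disney",
}


def build_inverted_index(documents):
    """Map each term to a list of (doc_id, position) pairs, positions starting at 1."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, token in enumerate(text.split(), start=1):
            term = token.strip(".,")  # drop trailing punctuation such as "Disney."
            index[term].append((doc_id, position))
    return index


index = build_inverted_index(docs)
for term in ("Aladdin", "Mulan", "American", "Disney"):
    print(term, index[term])
```

A lookup for a term is then a single dictionary access instead of a scan over every document, which is what makes retrieval fast.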

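When searching for matches in the index, Elasticsearch ranks results with a TF-IDF-style relevance score (since version 5.0 the default similarity is BM25, a refinement of TF-IDF). If there are a lot of documents to index, Elasticsearch can break the index into shards that are stored on different machines.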

TF-IDF

A scoring algorithm that measures how relevant a document is to a search term. The score is the product of “Term Frequency” and “Inverse Document Frequency”, so a term counts for more the more often it appears in a document and for less the more documents it appears in.

Term Frequency is how often a term appears in a document. Document Frequency is how many documents in the index contain the term.
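
To make the two definitions concrete, here is a minimal sketch of the textbook formula, tf × log(N / df), applied to the two movie sentences from the inverted index example (lowercased by hand). This is an illustration only; the scoring Elasticsearch and Lucene actually use adds smoothing and length normalization on top of this idea.

```python
import math

docs = [
    "aladdin is a 1992 american animated musical fantasy film produced by disney",
    "mulan is a 1998 american animated musical action comedy-drama film produced by disney",
]
tokenized = [doc.split() for doc in docs]


def tf(term, tokens):
    """Term frequency: how many times the term occurs in one document."""
    return tokens.count(term)


def idf(term, corpus):
    """Inverse document frequency: log(total docs / docs containing the term)."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(term, tokens, corpus):
    """Score of one term for one document."""
    return tf(term, tokens) * idf(term, corpus)


# "disney" is in every document, so it scores zero everywhere;
# "aladdin" is in only one document, so it scores above zero there.
print(tf_idf("disney", tokenized[0], tokenized))
print(tf_idf("aladdin", tokenized[0], tokenized))
```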

The intuition for this formula is that

words that appear in a lot of documents might not be very useful
words that appear many times in one document and rarely in other documents are more relevant to the query

Shard

In order to scale, Elasticsearch can split the index into smaller shards. Each shard holds a portion of the index's documents. Every document is assigned to exactly one shard.
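
How a document ends up on a particular shard can be sketched as follows. The Elasticsearch documentation gives the rule as shard_num = hash(_routing) % num_primary_shards, where the routing value defaults to the document id. In this sketch `zlib.crc32` is only a stand-in hash for illustration (Elasticsearch uses Murmur3 internally), and `pick_shard` is a hypothetical helper name.

```python
import zlib


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Return the shard number for a routing value (by default the document id)."""
    # Stand-in hash for illustration; Elasticsearch itself uses Murmur3.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


for doc_id in ("1", "2", "3"):
    print(doc_id, "->", pick_shard(doc_id, num_primary_shards=5))
```

Because the shard number depends on the number of primary shards, the same id would map to a different shard if that number changed, which is why it is fixed when the index is created.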

Commands

For the rest of the tutorial, we are going to use a movie dataset from MovieLens. The data is spread across multiple CSV files.

For simplicity's sake, I have processed the three files using this script and stored the result in this file.

Let's look at some sample data in this file.

[Sample Data]

The core info we are storing is movieId, title, genres, tags, imdbId, tmdbId, numRatings. Tags and genres are lists.

To interact with your Elasticsearch cluster, we are going to use HTTP requests. If you want a UI for making the requests, you can use Postman or the Elasticsearch Head Chrome plugin.

If you decide to use the Elasticsearch Head plugin, navigate to the “Any Request” tab. Here is the Elasticsearch Head UI.

[Elastic Search Head]

Create a schema/type

Before we can store the data, we need to create a schema. Send a PUT request to the URL below with the following payload.

localhost:9200/movies

{
  "mappings": {
    "movie": {
      "_all": {
        "enabled": false
      },
      "properties": {
        "movieId": {
          "type": "text"
        },
        "title": {
          "type": "text"
        },
        "genres": {
          "type": "text"
        },
        "tags": {
          "type": "text"
        },
        "imdbId": {
          "type": "text",
          "index": "no"
        },
        "tmdbId": {
          "type": "text",
          "index": "no"
        },
        "numRatings": {
          "type": "integer",
          "index": "no"
        }
      }
    }
  }
}

Here is the ESHead UI containing the mapping.

[Create Mapping]

We are defining a mapping called “movie”. By default, Elasticsearch also copies the values of every field into a catch-all _all field. To disable that, we set enabled to false.

The properties field contains all the fields we want to index. For every property, we can specify a type such as text. We can also keep a field in the stored document without making it searchable by setting index to "no" (the 5.x documentation spells this "index": false; "no" is the older spelling).
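
If you prefer a script over a UI, the same request can be put together in Python. This is a sketch under a few assumptions: the node is at http://localhost:9200, `create_index_request` is a helper name invented for this example, and the mapping uses the boolean spelling `"index": false`, which is the form the 5.x documentation describes. The request is only built here, not sent.

```python
import json
from urllib.request import Request

mapping = {
    "mappings": {
        "movie": {
            "_all": {"enabled": False},
            "properties": {
                "movieId": {"type": "text"},
                "title": {"type": "text"},
                "genres": {"type": "text"},
                "tags": {"type": "text"},
                # Kept in the stored document but not searchable.
                "imdbId": {"type": "text", "index": False},
                "tmdbId": {"type": "text", "index": False},
                "numRatings": {"type": "integer", "index": False},
            },
        }
    }
}


def create_index_request(index: str, body: dict, host: str = "http://localhost:9200") -> Request:
    """Build (but do not send) the PUT request that creates the index."""
    return Request(
        url=f"{host}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = create_index_request("movies", mapping)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` sends it to the node.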

Insert one document

To add a document, we need to send a PUT request to

localhost:9200/{index}/{schema}/{doc_id}

In our case, it would be

localhost:9200/movies/movie/1

The payload would be

{
  "movieId": "1",
  "title": "Toy Story (1995)",
  "genres": [
    "Adventure",
    "Animation",
    "Children",
    "Comedy",
    "Fantasy"
  ],
  "year": 1995,
  "imdbId": "0114709",
  "tmdbId": "862",
  "numRatings": 247,
  "tags": [
    "Pixar"
  ]
}

Here is the ESHead UI with the payload. [Insert One Doc]
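
The same insert can be sketched in Python. As before, this assumes the node is at http://localhost:9200; `index_doc_request` is a hypothetical helper name, and the request is built but not sent.

```python
import json
from urllib.request import Request

movie = {
    "movieId": "1",
    "title": "Toy Story (1995)",
    "genres": ["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
    "year": 1995,
    "imdbId": "0114709",
    "tmdbId": "862",
    "numRatings": 247,
    "tags": ["Pixar"],
}


def index_doc_request(index, doc_type, doc_id, doc, host="http://localhost:9200"):
    """Build (but do not send) the PUT request that stores one document."""
    return Request(
        url=f"{host}/{index}/{doc_type}/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = index_doc_request("movies", "movie", movie["movieId"], movie)
print(request.get_method(), request.full_url)
```

Sending the same request again with a changed body replaces the stored document, because the id in the URL identifies it.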

Delete Doc

To delete a document, we need to send a DELETE request to

localhost:9200/{index}/{schema}/{doc_id}

So, if we want to delete the document we inserted, send a request to

localhost:9200/movies/movie/1

Here is the ESHead UI with the request. [Delete Doc]
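
For completeness, here is the delete as a Python sketch. It makes the same assumptions as the other examples (node at http://localhost:9200, a made-up helper name `delete_doc_request`), and again only builds the request.

```python
from urllib.request import Request


def delete_doc_request(index, doc_type, doc_id, host="http://localhost:9200"):
    """Build (but do not send) the DELETE request for one document."""
    return Request(url=f"{host}/{index}/{doc_type}/{doc_id}", method="DELETE")


request = delete_doc_request("movies", "movie", "1")
print(request.get_method(), request.full_url)
```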

Insert multiple documents

If you want to insert multiple documents at once, you need to create a payload like this:

[Bulk Payload]

Here is the command I used to convert the movies.json file to the bulk format.

cat movies.json | jq -c '{"index": {"_index": "movies", "_type": "movie", "_id": .movieId}}, .' > movies_bulk.json

This command requires the jq tool.

For simplicity's sake, here is the output file after the command.

Once you download the file, run the command below to bulk insert the documents.

curl -XPOST localhost:9200/_bulk --data-binary  @./movies_bulk.json
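
If jq is not available, the bulk file can also be generated with a few lines of Python. This sketch mirrors what the jq command produces: one action line followed by the document itself, for every document. The two records are sample data for illustration, and `to_bulk_lines` is a name chosen for this example.

```python
import json


def to_bulk_lines(movies, index="movies", doc_type="movie"):
    """Return the newline-delimited body expected by the _bulk endpoint."""
    lines = []
    for movie in movies:
        action = {"index": {"_index": index, "_type": doc_type, "_id": movie["movieId"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(movie))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


sample = [
    {"movieId": "1", "title": "Toy Story (1995)"},
    {"movieId": "2", "title": "Jumanji (1995)"},
]
print(to_bulk_lines(sample), end="")
```

Writing the returned string to movies_bulk.json gives a file that the curl command above can send as is.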

Search

Creating an Elasticsearch query can be a bit difficult. To create the query, let's use the “Structured Query” tab in Elasticsearch Head.

Here is an example query to find all movies that

have “harry” in the title
were released after 2000
have the genre Fantasy

[Search Query]

If you want to send the query using curl, click the checkbox “Show query source”.

[Raw Query]

If you want to see the raw payload, change the value of “Output Results” to json.

[Search Results]
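
For reference, here is one way the three conditions could be written by hand in the query DSL. This is a sketch, not necessarily the exact body the Structured Query tab generates. It assumes the documents carry a numeric `year` field, as in the Toy Story payload (it is not in our mapping, so Elasticsearch would have to add it dynamically), and `search_request` is a hypothetical helper that builds the request without sending it.

```python
import json
from urllib.request import Request

query = {
    "query": {
        "bool": {
            # Conditions that contribute to the relevance score.
            "must": [
                {"match": {"title": "harry"}},
                {"match": {"genres": "Fantasy"}},
            ],
            # Yes/no condition that does not affect the score.
            "filter": [
                {"range": {"year": {"gt": 2000}}},
            ],
        }
    }
}


def search_request(index, body, host="http://localhost:9200"):
    """Build (but do not send) the POST request for a search."""
    return Request(
        url=f"{host}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = search_request("movies", query)
print(request.get_method(), request.full_url)
print(json.dumps(query, indent=2))
```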

For testing, using docker
