Elasticsearch is a search engine ideal for full text search.
Here is an overview of key aspects.
Many companies use it to fully or partially power the search and autocomplete features on their platforms.
To get an idea of which companies are using it and how, visit StackShare.
Sections
- setting up Elasticsearch
- terminology
- curl commands to index, update, search
Setup
Remote Hosting
If you plan to use a hosted offering, take a look at
- Amazon's Elasticsearch Service
- Google Cloud Launcher
- Hosting from Elastic (elastic.co), the maintainers of Elasticsearch
Using Docker
docker pull docker.elastic.co/elasticsearch/elasticsearch:5.6.0
# local folder where the Elasticsearch data will be stored
mkdir esdata
docker run -p 9200:9200 -e "http.host=0.0.0.0" -e "transport.host=127.0.0.1" -v "$(pwd)/esdata":/usr/share/elasticsearch/data docker.elastic.co/elasticsearch/elasticsearch:5.6.0
Without Docker
# Linux/Mac
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.0.tar.gz
# Windows
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.0.zip
Unzip the archive and run:
bin/elasticsearch
To verify everything is working, navigate to http://localhost:9200/. You should see something like:
{
"name": "kl9yfPi",
"cluster_name": "elasticsearch",
"cluster_uuid": "j3jdeZJuQciCEaPibQPtzg",
"version": {
"number": "5.6.0",
"build_hash": "781a835",
"build_date": "2017-09-07T03:09:58.087Z",
"build_snapshot": false,
"lucene_version": "6.6.0"
},
"tagline": "You Know, for Search"
}
Terminology
Document
Any item that needs to be stored in Elasticsearch.
Examples of documents include artist info, song info and lyrics, user profiles, news articles, and tweets.
Any item that can be represented as JSON can be a document. A document is composed of multiple fields, each of which may be indexed.
Here is a sample document for a movie:
{
"movie_id":1,
"title":"Hidden Figures",
"director":"Theodore Melfi",
"actors":[
"Taraji P. Henson",
"Octavia Spencer",
"Janelle Monáe"
],
"duration":127,
"lang":"en",
"genres":[
"Biography",
"drama",
"history"
],
"description":"",
"release date":"2017/6/1"
}
Once a document is indexed, you can query on any of its fields. You can ask questions like:
- All biography movies released this year
- All movies with a certain actor
- All movies that have certain words in the description
Type
Defines the schema and mapping shared by a group of documents.
You can specify how you want each field to be indexed.
An index can have multiple types.
If you are indexing a movie site, there could be a type for person (actor, director), one for movie, and one for review.
Indices
The data structure Elasticsearch uses to store information for fast retrieval. Elasticsearch stores information in an “inverted index”, which maps each term to the documents (and positions) where it appears. Here is an example of an inverted index built from two documents.
Movie 1: Aladdin is a 1992 American animated musical fantasy film produced by Disney.
Movie 2: Mulan is a 1998 American animated musical action comedy-drama film produced by Disney.
Term | (Doc id, pos) |
---|---|
Aladdin | (1,1) |
Mulan | (2,1) |
American | (1,5), (2,5) |
Disney | (1,12), (2,13) |
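To make the table concrete, here is a minimal Python sketch that builds the same kind of index from the two movie descriptions. It is only an illustration: the function name `build_inverted_index` is made up for this example, and the whitespace tokenizer is an assumption that is far simpler than the analyzers Elasticsearch actually applies.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Positions are 1-based word offsets, matching the table above.
        for pos, token in enumerate(text.split(), start=1):
            # Naive normalization: drop trailing punctuation, lowercase.
            term = token.strip(".,").lower()
            index[term].append((doc_id, pos))
    return dict(index)


movies = {
    1: "Aladdin is a 1992 American animated musical fantasy film produced by Disney.",
    2: "Mulan is a 1998 American animated musical action comedy-drama film produced by Disney",
}

index = build_inverted_index(movies)
print(index["american"])  # [(1, 5), (2, 5)]
print(index["disney"])    # [(1, 12), (2, 13)]
```

Looking up a term is then a single dictionary access, which is why an inverted index makes term searches fast.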
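When searching for matches in the index, Elasticsearch ranks documents with a TF-IDF style relevance score (since version 5.0 the default similarity is BM25, a refinement of TF-IDF). If there are a lot of documents to index, Elasticsearch can break the index into shards that are stored on different machines.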
TF-IDF
A scoring algorithm that measures how relevant a document is to a query term. The score is the product of “Term Frequency” and “Inverse Document Frequency”.
Term Frequency is how often a term appears in a document. Document Frequency is the number of documents in which the term appears; Inverse Document Frequency shrinks as that number grows.
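As a rough illustration of how these quantities combine, the sketch below computes a textbook TF-IDF score in Python. The exact weighting here (relative term frequency times the natural log of N / document frequency) is an assumption chosen for simplicity; it is not the precise formula Lucene or Elasticsearch uses, and the helper names are made up for this example.

```python
import math


def tokenize(text):
    """Naive tokenizer: split on whitespace, strip punctuation, lowercase."""
    return [t.strip(".,").lower() for t in text.split()]


def tf_idf(term, doc, corpus):
    """Textbook TF-IDF of `term` in `doc`, relative to all docs in `corpus`."""
    tokens = tokenize(doc)
    tf = tokens.count(term) / len(tokens)  # term frequency in this document
    df = sum(1 for d in corpus if term in tokenize(d))  # docs containing term
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)  # rarer terms get a larger weight
    return tf * idf


corpus = [
    "Aladdin is a 1992 American animated musical fantasy film produced by Disney.",
    "Mulan is a 1998 American animated musical action comedy-drama film produced by Disney",
]

# "disney" appears in every document, so it carries no weight.
print(tf_idf("disney", corpus[0], corpus))
# "aladdin" appears in only one document, so it scores above zero there.
print(tf_idf("aladdin", corpus[0], corpus))
```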
The intuition for this formula is that
- words that appear in a lot of documents might not be very useful.
- words that appear many times in one document and rarely in other documents are more relevant to the query.
Shard
In order to scale, Elasticsearch can split the index into smaller shards. Each shard holds its own portion of the index, and every document is stored in exactly one primary shard.
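The idea of assigning each document to one shard can be sketched in a few lines of Python. The Elasticsearch documentation describes routing as hashing a routing value (the document id by default) modulo the number of primary shards. The sketch below follows that shape, but uses `zlib.crc32` purely as a stand-in hash (Elasticsearch uses a different hash function), and `pick_shard` is a hypothetical name.

```python
import zlib


def pick_shard(routing_value, num_primary_shards):
    """Choose a shard as hash(routing_value) % num_primary_shards.

    crc32 is only a stand-in; it is NOT the hash Elasticsearch uses.
    """
    digest = zlib.crc32(str(routing_value).encode("utf-8"))
    return digest % num_primary_shards


# Spread ten movie ids over three shards and show where each one lands.
placement = {}
for movie_id in range(1, 11):
    placement.setdefault(pick_shard(movie_id, 3), []).append(movie_id)
print(placement)
```

Because the shard is derived from the id and the shard count, the same document always maps to the same shard, which is also why the number of primary shards cannot be changed casually after an index is created.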
Commands
For the rest of the tutorial, we are going to use a movie dataset from MovieLens. The data is spread across multiple CSV files.
For simplicity's sake, I have processed the three files using this script and stored the result in this file.
Let's look at some sample data in this file:
IMAGE
The core info we are storing is movieId, title, genres, tags, imdbId, tmdbId, numRatings. Tags and genres are lists.
To interact with your Elasticsearch cluster, we are going to use HTTP requests. If you want a UI to make the requests, you can use Postman or the elasticsearch-head Chrome plugin.
If you decide to use the elasticsearch-head plugin, navigate to the “Any Request” tab. Here is the elasticsearch-head UI.
[Elastic Search Head]
Create a schema/type
Before we can store the data, we need to create a schema. Send a PUT request to the URL below with the following payload:
localhost:9200/movies
{
"mappings": {
"movie": {
"_all": {
"enabled": false
},
"properties": {
"movieId": {
"type": "text"
},
"title": {
"type": "text"
},
"genres": {
"type": "text"
},
"tags": {
"type": "text"
},
"imdbId": {
"type": "text",
"index": "no"
},
"tmdbId": {
"type": "text",
"index": "no"
},
"numRatings": {
"type": "integer",
"index": "no"
}
}
}
}
}
Here is the ESHead UI containing the mapping.
[Create Mapping]
We are defining a mapping called “movie”. By default, Elasticsearch also copies the values of all fields into a catch-all _all field. To disable that, we set enabled to false.
The properties field contains all the fields we want to index. For every property, we can specify the type, such as text. We can also keep a field in the document without indexing it by setting index to "no". Note that the Elasticsearch 5.x mapping reference documents the values of index as the booleans true and false.
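If you would rather generate the mapping payload than type it by hand, a small Python sketch along these lines may help. The helper names `field` and `build_movie_mapping` are made up for this example, and it writes the not-indexed flag as the boolean `false` (the form the 5.x mapping reference documents) rather than the "no" string used in the payload above; treat that substitution as an assumption to verify against your version.

```python
import json


def field(field_type, indexed=True):
    """Return the mapping entry for one property."""
    spec = {"type": field_type}
    if not indexed:
        # Serialized as JSON false; the article's payload spells this "no".
        spec["index"] = False
    return spec


def build_movie_mapping():
    """Build the body for PUT localhost:9200/movies."""
    return {
        "mappings": {
            "movie": {
                "_all": {"enabled": False},
                "properties": {
                    "movieId": field("text"),
                    "title": field("text"),
                    "genres": field("text"),
                    "tags": field("text"),
                    "imdbId": field("text", indexed=False),
                    "tmdbId": field("text", indexed=False),
                    "numRatings": field("integer", indexed=False),
                },
            }
        }
    }


payload = json.dumps(build_movie_mapping(), indent=2)
print(payload)
```

The printed JSON can be pasted into the “Any Request” tab or sent with any HTTP client.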
Insert one document
To add a document, we need to send a PUT request to
localhost:9200/{index}/{schema}/{doc_id}
In our case, it would be
localhost:9200/movies/movie/1
The payload would be
{
"movieId": "1",
"title": "Toy Story (1995)",
"genres": [
"Adventure",
"Animation",
"Children",
"Comedy",
"Fantasy"
],
"year": 1995,
"imdbId": "0114709",
"tmdbId": "862",
"numRatings": 247,
"tags": [
"Pixar"
]
}
Here is the ESHead UI with the payload.
[Insert One Doc]
Delete Doc
To delete a document, we need to send a DELETE request to
localhost:9200/{index}/{schema}/{doc_id}
So, if we want to delete the document we inserted, we send the request to
localhost:9200/movies/movie/1
Here is the ESHead UI with the request.
[Delete Doc]
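The insert and delete calls differ only in the HTTP method and the presence of a body. As a hedged sketch, here is how both requests could be prepared with Python's standard library. The function names are hypothetical, `BASE_URL` assumes the local setup described earlier, and nothing is sent until you pass a request to `urllib.request.urlopen`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumes the local cluster from the Setup section


def doc_url(index, doc_type, doc_id):
    """Build the {index}/{schema}/{doc_id} URL used by both calls."""
    return f"{BASE_URL}/{index}/{doc_type}/{doc_id}"


def index_request(index, doc_type, doc_id, document):
    """Prepare (but do not send) a PUT request that stores one document."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        doc_url(index, doc_type, doc_id),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def delete_request(index, doc_type, doc_id):
    """Prepare (but do not send) a DELETE request for one document."""
    return urllib.request.Request(doc_url(index, doc_type, doc_id), method="DELETE")


put_req = index_request("movies", "movie", 1, {"movieId": "1", "title": "Toy Story (1995)"})
del_req = delete_request("movies", "movie", 1)
print(put_req.get_method(), put_req.full_url)  # PUT http://localhost:9200/movies/movie/1
print(del_req.get_method(), del_req.full_url)  # DELETE http://localhost:9200/movies/movie/1

# To send a request against a running cluster:
# with urllib.request.urlopen(put_req) as resp:
#     print(resp.read().decode("utf-8"))
```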
Insert multiple documents
If you want to insert multiple documents at once, you need to create a payload like
[Bulk Payload]
Here is the command I used to convert the movies.json file into the bulk format.
cat movies.json | jq -c ' {"index": {"_index": "movies", "_type": "movie", "_id": .movieId}}, .' > movies_bulk.json
This requires the jq tool.
For simplicity's sake, here is the output file produced by the command.
Once you have downloaded the file, run the command below to bulk insert the documents.
curl -XPOST localhost:9200/_bulk --data-binary @./movies_bulk.json
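For readers who want to see the shape of the bulk file, here is a small Python sketch that produces the same kind of output as the jq command: one action line followed by one document line per movie, newline-delimited, with a trailing newline. The function name `to_bulk_payload` is made up, and the two sample documents are shortened for illustration.

```python
import json


def to_bulk_payload(docs, index="movies", doc_type="movie"):
    """Return a newline-delimited bulk body: an action line, then the document."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc["movieId"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Shortened sample documents, for illustration only.
sample_docs = [
    {"movieId": "1", "title": "Toy Story (1995)"},
    {"movieId": "2", "title": "Jumanji (1995)"},
]

bulk_body = to_bulk_payload(sample_docs)
print(bulk_body, end="")
```

Writing `bulk_body` to a file gives you something you can pass to the curl command above with `--data-binary`.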
Search
Creating an Elasticsearch query can be a bit difficult. To create the query, let's use the “Structured Query” tab in the elasticsearch-head plugin.
Here is an example query to find all movies that
- have “harry” in the title
- were released after 2000
- have the genre Fantasy
[Search Query]
If you want to send the query using curl, click the checkbox “Show query source”.
[Raw Query]
If you want to see the raw payload, change the value of “Output Results” to json.
[Search Results]
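Since the screenshots are not reproduced here, the following Python sketch shows one way the three conditions could be written by hand with a bool query. This is an assumption about an equivalent query, not necessarily the exact query the plugin generates. It relies on the `year` field from the sample document, and `build_movie_query` is a hypothetical helper name.

```python
import json


def build_movie_query(title_word, genre, released_after):
    """Combine two full-text matches with a range filter on the year."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"title": title_word}},
                    {"match": {"genres": genre}},
                ],
                # Filters restrict the results without affecting the score.
                "filter": [
                    {"range": {"year": {"gt": released_after}}},
                ],
            }
        }
    }


query = build_movie_query("harry", "Fantasy", 2000)
print(json.dumps(query, indent=2))
```

The printed JSON could then be sent as the body of a POST request to localhost:9200/movies/movie/_search.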
For testing, using docker