Elasticsearch provides powerful search capabilities with support for sharding and replication of the data. So we would want to index data available in our DB into Elasticsearch.
There are multiple ways to index data into Elasticsearch:
- Use Logstash to setup source as DB and sink as Elasticsearch and use a filter if required to build JSON object.
- Use an external library elasticsearch-jdbc which runs, in its own process, external to Elasticsearch instance. It makes use of the transport client and its bulk APIs to index data into Elasticsearch.
In this article, we will look at the approach 2 i.e using an external library running as a separate process.
We will use MySQL as the DB and use the sample database world, which comes with MySQL, with the following tables:
Elasticsearch-jdbc library supports up to Elasticsearch 2.3.4 with the compatibility matrix available on their GitHub page. First, let us index the data without pre-defining the index structure in Elasticsearch (ES). If we don’t provide the index structure, ES infers and builds one based on the data indexed. But this will not be an ideal structure always.
Create an Empty Index
Make sure ES is running. If you are on windows, then navigate to the bin directory of ES and run elasticsearch.bat. If you are on Linux, then follow the instructions here.
To create an empty index you have to issue a HTTP POST to <es_url>:<port>/<index_name>. For example, using cURL:
curl -X POST localhost:9200/world
You can even use REST clients of your choice like Postman to invoke the ES REST APIs
Setting up the JDBC Importer
Download the elasticsearch-jdbc binary from here and extract it into a folder, let’s call it ES_IMPORTER. There are already scripts available in ES_IMPORTER/bin for working with MySQL and other DBs and we will place all the scripts related to the importing in the same folder.
Note: If you are using Oracle DB, then you need to place the JDBC driver in the ES_IMPORTER/lib folder. For other DBs like MySQL, PostgreSql JDBC drivers are already available.
Another Note: elasticsearch-jdbc requires JDK 8
We will create the following files (code available at the links) to run the importer:
- world-importer.bat/world-importer.sh – used for launching the importer
- world-importer-config.json – used for configuring the importer
- world.sql – contains the SQL query to be executed to get the data for indexing.
The JDBC connection string, the DB username and the password can be updated in the world-importer-config.json. Also, the ES port, the ES host name, the ES index name, the ES index type name can be updated in the same file.
Running the JDBC Importer
Before running the importer, open the URL: http://localhost:9200/world
The above shows that the index world has no structure defined
Open the URL: http://localhost:9200/world/_search
The above shows that there is no data in the index world.
Navigate to the directory: ES_IMPORTER/bin and run world-importer.bat OR world-importer.sh.
Once the importer completes, which should be almost immediately, open the URL: http://localhost:9200/world/. You will now see that the index has a structure, which is also called mappings, defined within the mappings key. And when you open the URL: http://localhost:9200/world/_search you will find the indexed data which would be around 239 entries as reported by the hits field.
Advantages of JDBC Importer
- Easy to import data, a lot of configuration options available. And can be scheduled to run repeatedly.
- Supports creating nested objects and arrays of nested objects (In our example capital is a nested object and cities and languages are arrays of nested objects)
- SQL queries can be parameterized
- Only the data changed after the last run can be re-indexed.
Shortcomings of JDBC Importer
- No support for ES version 5 and above
- There is a possibility of duplicate objects in the array of nested objects. But de-duplication can be handled at the application layer.
- There can be a possibility of delay in support for latest ES versions.
In the next part of this article we will:
- Create the mappings for the index
- Understand the SQL used for indexing