Sunday, March 30, 2014

Intro to Elasticsearch and New Features in version 1.0.x (Nested Documents)

These are the notes from my presentation: http://www.meetup.com/Elasticsearch-Philadelphia/events/164506092/

http://youtu.be/6Idi13rQx1o?t=57m5s

A big feature of Elasticsearch is the ability to roll up multiple Lucene documents into a single document while still being able to search both the sub-documents and the rolled-up document.

First we set up our data model. To do this we added the nested type to each object field we wanted to be able to search as a sub-document (here, tree).

$ curl -XPUT 'http://localhost:9200/park_maintenance' -d '
{
    "settings": {
        "number_of_shards": 5,
        "analysis": {
            "analyzer": {
                "key": {
                    "tokenizer": "keyword",
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    },
    "mappings": {
        "park": {
            "properties": {
                "park": {
                    "properties": {
                        "name": {
                            "type": "string"
                        },
                        "cost": {
                            "type": "long"
                        },
                        "tree": {
                            "type": "nested",
                            "properties": {
                                "height": {
                                    "type": "long"
                                },
                                "last_trimming": {
                                    "type": "date"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
'


{"acknowledged":true}

Now we add an alias for the nested index and load the data into it.
$ curl -X PUT 'http://localhost:9200/park_maintenance/_alias/pm'

{"acknowledged":true}


$ curl -s -XPOST localhost:9200/_bulk -d '
{ "index" : { "_index" : "park_maintenance", "_type" : "park", "_id":"central"} }
{"park":{"name":"central","cost":"5000","tree":[{"height":100,"last_trimming":1375717333},{"height":70,"last_trimming":1385717333},{"height":10,"last_trimming":1395717333}]}}
{ "index" : { "_index" : "park_maintenance", "_type" : "park", "_id":"yellowstone"} }
{"park":{"name":"yellowstone","cost":"50000","tree":[{"height":99,"last_trimming":1395717333},{"height":70,"last_trimming":1385717333},{"height":10,"last_trimming":1375717333}]}}
'



{"took":12,"errors":false,"items":[{"index":{"_index":"park_maintenance","_type":"park","_id":"central","_version":1,"status":201}},{"index":{"_index":"park_maintenance","_type":"park","_id":"yellowstone","_version":1,"status":201}}]}


To compare this with the default behavior, we set up a second index without any explicit mappings.
$ curl -s -XPOST localhost:9200/_bulk -d '
{ "index" : { "_index" : "park_maintenance_plain", "_type" : "park", "_id":"central"} }
{"park":{"name":"central","cost":"5000","tree":[{"height":100,"last_trimming":1375717333},{"height":70,"last_trimming":1385717333},{"height":10,"last_trimming":1395717333}]}}
{ "index" : { "_index" : "park_maintenance_plain", "_type" : "park", "_id":"yellowstone"} }
{"park":{"name":"yellowstone","cost":"50000","tree":[{"height":99,"last_trimming":1395717333},{"height":70,"last_trimming":1385717333},{"height":10,"last_trimming":1375717333}]}}
'


{"took":26,"errors":false,"items":[{"index":{"_index":"park_maintenance_plain","_type":"park","_id":"central","_version":2,"status":200}},{"index":{"_index":"park_maintenance_plain","_type":"park","_id":"yellowstone","_version":2,"status":200}}]}

$ curl -X PUT 'http://localhost:9200/park_maintenance_plain/_alias/pmp'


{"acknowledged":true}

First, let's query for trees at least 50 units tall that were last trimmed before 1384717333.

$ curl -X POST 'http://localhost:9200/park_maintenance_plain/park/_search?pretty=true' -d '{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "height": {
              "gte": 50
            }
          }
        },
        {
          "range": {
            "last_trimming": {
              "lt": 1384717333
            }
          }
        }
      ]
    }
  }
}'



{
  "took" : 66,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.4142135,
    "hits" : [ {
      "_index" : "park_maintenance_plain",
      "_type" : "park",
      "_id" : "central",
      "_score" : 1.4142135, "_source" : {"park":{"name":"central","cost":"5000","tree":[{"height":100,"last_trimming":1375717333},{"height":70,"last_trimming":1385717333},{"height":10,"last_trimming":1395717333}]}}
    }, {
      "_index" : "park_maintenance_plain",
      "_type" : "park",
      "_id" : "yellowstone",
      "_score" : 1.4142135, "_source" : {"park":{"name":"yellowstone","cost":"50000","tree":[{"height":99,"last_trimming":1395717333},{"height":70,"last_trimming":1385717333},{"height":10,"last_trimming":1375717333}]}}
    } ]
  }
}

We get too many parks back in this example. This is because, without a nested mapping, all the trees in a park are grouped together. In yellowstone there is a tree at least 50 units tall and there is a tree that has not been trimmed since 1384717333, but they are not the same tree.
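The grouping effect can be modelled outside Elasticsearch. The following is a minimal Python sketch, not Elasticsearch code: it assumes that a plain (non-nested) array of objects behaves as if each field were collapsed into one multi-valued field per park. The names PARKS, flatten and plain_match are made up for illustration.

```python
# Hypothetical model of a plain (non-nested) object array: every field of the
# tree objects is collapsed into one multi-valued field per park.
PARKS = {
    "central": [
        {"height": 100, "last_trimming": 1375717333},
        {"height": 70, "last_trimming": 1385717333},
        {"height": 10, "last_trimming": 1395717333},
    ],
    "yellowstone": [
        {"height": 99, "last_trimming": 1395717333},
        {"height": 70, "last_trimming": 1385717333},
        {"height": 10, "last_trimming": 1375717333},
    ],
}


def flatten(trees):
    """Collapse a list of tree objects into field -> list of values."""
    flat = {}
    for tree in trees:
        for field, value in tree.items():
            flat.setdefault(field, []).append(value)
    return flat


def plain_match(trees, min_height=50, trimmed_before=1384717333):
    """Each range clause is satisfied by ANY value, whichever tree it came from."""
    flat = flatten(trees)
    tall = any(h >= min_height for h in flat["height"])
    overdue = any(t < trimmed_before for t in flat["last_trimming"])
    return tall and overdue


matches = [name for name, trees in PARKS.items() if plain_match(trees)]
print(matches)  # ['central', 'yellowstone']
```

In this model yellowstone matches because its 99-unit tree satisfies the height clause and its 10-unit tree satisfies the trimming clause, even though no single tree satisfies both.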

To solve this problem we query our nested index. Its mapping marks tree as nested, which means that each tree is queried on its own.
$ curl -X POST 'http://localhost:9200/park_maintenance/park/_search?pretty=true' -d '{
"query": {
    "nested": {
      "path": "park.tree",
      "score_mode": "total",
      "query": {
        "bool": {
          "must": [
            {
              "range": {
                "height": {
                  "gte": 50
                }
              }
            },
            {
              "range": {
                "last_trimming": {
                  "lt": 1384717333
                }
              }
            }
          ]
        }
      }
    }
  }
}'


{
  "took" : 69,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.4142135,
    "hits" : [ {
      "_index" : "park_maintenance",
      "_type" : "park",
      "_id" : "central",
      "_score" : 1.4142135, "_source" : {"park":{"name":"central","cost":"5000","tree":[{"height":100,"last_trimming":1375717333},{"height":70,"last_trimming":1385717333},{"height":10,"last_trimming":1395717333}]}}
    } ]
  }
}

This time we only get the one park we should get.
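The per-tree matching can be sketched in Python as well. This is only a rough model under the assumption that a nested query evaluates its whole inner bool against one tree at a time; nested_match and the (height, last_trimming) tuples are illustrative names and shapes, not part of any Elasticsearch API.

```python
# Hypothetical model of a single nested query: both range conditions must hold
# for the SAME tree. Trees are (height, last_trimming) pairs from the bulk data.
PARKS = {
    "central": [(100, 1375717333), (70, 1385717333), (10, 1395717333)],
    "yellowstone": [(99, 1395717333), (70, 1385717333), (10, 1375717333)],
}


def nested_match(trees, min_height=50, trimmed_before=1384717333):
    """True if at least one individual tree satisfies both conditions."""
    return any(
        height >= min_height and last_trimming < trimmed_before
        for height, last_trimming in trees
    )


matches = [name for name, trees in PARKS.items() if nested_match(trees)]
print(matches)  # ['central']
```

Only central has a single tree (height 100, last trimmed 1375717333) that meets both conditions, which agrees with the one hit returned above.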

This does not mean you cannot shoot yourself in the foot with nested. The following is a way to write a bad query using nested documents. The key is to remember that conditions which must hold for the same tree belong inside a single nested clause, so never repeat the same path more than once in your query. Here there are two nested clauses with the path "park.tree".
$ curl -X POST 'http://localhost:9200/park_maintenance/park/_search?pretty=true' -d '{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "park.tree",
            "score_mode": "total",
            "query": {
              "range": {
                "height": {
                  "gte": 50
                }
              }
            }
          }
        },
        {
          "nested": {
            "path": "park.tree",
            "score_mode": "total",
            "query": {
              "range": {
                "last_trimming": {
                  "lt": 1384717333
                }
              }
            }
          }
        }
      ]
    }
  }
}'


{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 2.1213202,
    "hits" : [ {
      "_index" : "park_maintenance",
      "_type" : "park",
      "_id" : "central",
      "_score" : 2.1213202, "_source" : {"park":{"name":"central","cost":"5000","tree":[{"height":100,"last_trimming":1375717333},{"height":70,"last_trimming":1385717333},{"height":10,"last_trimming":1395717333}]}}
    }, {
      "_index" : "park_maintenance",
      "_type" : "park",
      "_id" : "yellowstone",
      "_score" : 2.1213202, "_source" : {"park":{"name":"yellowstone","cost":"50000","tree":[{"height":99,"last_trimming":1395717333},{"height":70,"last_trimming":1385717333},{"height":10,"last_trimming":1375717333}]}}
    } ]
  }
}

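As you can see, this gave the same result as the query without nested. Each nested clause is matched independently, so the two conditions can be satisfied by different trees.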
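A short Python sketch makes the difference between the two query shapes explicit. It assumes, as a simplification, that every nested clause independently asks "does any tree in this park match?"; the function names are hypothetical.

```python
# Hypothetical comparison: one nested clause holding both conditions versus
# two separate nested clauses on the same path. Trees are (height, last_trimming).
PARKS = {
    "central": [(100, 1375717333), (70, 1385717333), (10, 1395717333)],
    "yellowstone": [(99, 1395717333), (70, 1385717333), (10, 1375717333)],
}
MIN_HEIGHT = 50
TRIMMED_BEFORE = 1384717333


def one_nested_clause(trees):
    """Both conditions are checked against the same tree."""
    return any(h >= MIN_HEIGHT and t < TRIMMED_BEFORE for h, t in trees)


def two_nested_clauses(trees):
    """Each clause picks its own tree, so the conditions are no longer tied together."""
    tall = any(h >= MIN_HEIGHT for h, _ in trees)
    overdue = any(t < TRIMMED_BEFORE for _, t in trees)
    return tall and overdue


good = [name for name, trees in PARKS.items() if one_nested_clause(trees)]
bad = [name for name, trees in PARKS.items() if two_nested_clauses(trees)]
print(good)  # ['central']
print(bad)   # ['central', 'yellowstone']
```

The second function has the same logic as matching against the plain index, which is why repeating the path throws away the benefit of the nested mapping.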
