Overview

The PMC data set contains scientific papers from PubMed Central® (PMC). We use it to evaluate the performance of Elasticsearch on full-text content. We run the following variations (which we call "challenges" in Rally):

  • Append: Indexes the whole document corpus using Elasticsearch default settings. We only adjust the number of replicas because we benchmark a single-node cluster, where replica shards cannot be allocated, and Rally only starts the benchmark once the cluster status is green. Document ids are unique, so all index operations are append-only. After indexing, a couple of queries are run in parallel by multiple clients.
  • Append Fast: Indexes the whole document corpus using a setup that yields a higher indexing throughput than the default settings. Document ids are unique, so all index operations are append-only.
  • Id Conflicts: Indexes the whole document corpus using a setup that yields a higher indexing throughput than the default settings. Rally produces duplicate ids for 25% of all documents (not configurable), which simulates a workload that mostly appends with some updates in between.
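
To illustrate what the Id Conflicts challenge means for the indexing workload, here is a minimal sketch of an id generator that reuses an already issued id for roughly 25% of all documents. It is not Rally's actual implementation; the names generate_ids and conflict_ratio, the seed, and the uniform choice among earlier ids are assumptions made for this example only.

```python
import random


def generate_ids(count, conflict_probability=0.25, seed=42):
    """Yield `count` document ids (hypothetical helper, not Rally's code).

    With probability `conflict_probability` an already issued id is reused,
    which Elasticsearch would treat as an update instead of an append.
    """
    rng = random.Random(seed)
    issued = []   # ids handed out so far
    next_id = 0
    for _ in range(count):
        if issued and rng.random() < conflict_probability:
            # Duplicate id: indexing this document overwrites an earlier one.
            yield rng.choice(issued)
        else:
            # Fresh id: a plain append.
            issued.append(next_id)
            yield next_id
            next_id += 1


def conflict_ratio(ids):
    """Return the fraction of ids that repeat an id seen earlier in the stream."""
    seen = set()
    conflicts = 0
    total = 0
    for doc_id in ids:
        total += 1
        if doc_id in seen:
            conflicts += 1
        seen.add(doc_id)
    return conflicts / total if total else 0.0


if __name__ == "__main__":
    ids = list(generate_ids(10_000))
    print(f"share of conflicting ids: {conflict_ratio(ids):.3f}")
```

With a conflict probability of 0, every id is fresh, which corresponds to the append-only behaviour of the Append and Append Fast challenges.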

The benchmarks are run either with an out-of-the-box configuration of Elasticsearch ("default settings") or with a larger heap of 4 GB ("4g heap"). For more details, please refer to the PMC track specification and have a look at our benchmarking methodology.