Overview
The PMC data set contains scientific papers from PubMed Central® (PMC).
We use it to evaluate the performance of Elasticsearch for full-text content. We run the following variations (which we call
"challenges" in Rally):
- Append: Indexes the whole document corpus using Elasticsearch default settings. We only adjust the
number of replicas, because we benchmark a single-node cluster and Rally only starts the benchmark once the cluster
turns green. Document ids are unique, so all index operations are append-only. Afterwards, a few queries are run in
parallel by multiple clients.
- Append Fast: Indexes the whole document corpus using a configuration that yields higher indexing
throughput than the default settings. Document ids are unique, so all index operations are append-only.
- Id Conflicts: Indexes the whole document corpus using a configuration that yields higher indexing
throughput than the default settings. Rally produces duplicate ids for 25% of all documents (this ratio is not
configurable), so we can simulate a scenario that consists mostly of appends with some updates in between.
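To make the Id Conflicts scenario more concrete, here is a minimal sketch of how a stream of document ids with roughly 25% duplicates could be generated. It is an illustration only, not Rally's actual implementation: the function name `generate_ids`, the constant `CONFLICT_PROBABILITY`, the fixed seed, and the zero-padded id format are all assumptions made for this example.

```python
import random

# Assumed ratio, taken from the description above (25% of documents reuse an id).
CONFLICT_PROBABILITY = 0.25


def generate_ids(num_docs, seed=42):
    """Yield (doc_id, is_conflict) pairs for num_docs documents.

    Hypothetical helper: with probability CONFLICT_PROBABILITY a document
    reuses an id that was already issued (an update), otherwise it gets a
    fresh id (an append).
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    issued = []                # ids handed out so far
    next_id = 0
    for _ in range(num_docs):
        # The very first document can never conflict: nothing was issued yet.
        if issued and rng.random() < CONFLICT_PROBABILITY:
            yield rng.choice(issued), True
        else:
            doc_id = f"{next_id:010d}"
            issued.append(doc_id)
            next_id += 1
            yield doc_id, False


if __name__ == "__main__":
    pairs = list(generate_ids(10_000))
    conflicts = sum(1 for _, is_conflict in pairs if is_conflict)
    print(f"{conflicts} of {len(pairs)} documents reuse an existing id")
```

A document whose id has already been indexed replaces the earlier document, so in this sketch the number of distinct documents left in the index equals the number of fresh ids, not the number of index operations.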
The benchmarks are run either with an out-of-the-box configuration of Elasticsearch ("default settings") or with a larger heap
of 4 GB ("4g heap"). For more details, please refer to the PMC track specification and have a
look at our benchmarking methodology.