Skroutz.gr is a Greek price comparison website that enables its users to filter through a variety of more than 7.5 million products in almost 1.5 thousand shops. It has been running for almost 10 years and in the process has spinned off a sister site in Turkey, Alve.com and will expand in the UK in the near future with Scrooge.co.uk.
Let’s analyse user behaviour
Being one of the top-3 sites in Greece, and with a monthly traffic of more than 12 million visits from more than 4 million unique visitors, it is only natural that we would like to learn something about the way our visitors use our platform. Which product categories or product category families (e.g. “Technology”) are the most popular? How does time of the season affect their popularity? Which are our users’ most favourite products? Which are the top products of any category right now or a year ago? The most popular shops? Can we drill-down these results to find the most popular product of a particular shop last summer?
From another point of view, maybe we would like to gain some insights about our search engine. For example: which are the most popular searches in our platform and which are the most popular searches in any specific category? Or we could also try to identify problematic searches by examining those that return zero results: which zero-result query is used most often by our users? Is our search engine performing well? Do our users find what they look for in the first result page, or do they paginate endlessly? How much do they use our search filters?
We could obviously go on, but at this point it should be painfully obvious that insights of this kind are of great value and can play a crucial role in the development of our product, both in the short and in the long term.
The old approach
Certainly, all of the above didn’t come to us as some kind of epiphany. Lots and lots of work had been done in previous attempts to make sense of the mountains of logs that we keep amassing. With the danger of oversimplifying, we can describe the previous approach as a set of task-specific Ruby scripts that would run in regular intervals, process the most recent logs and persist the results inside time-stamped MongoDB documents. The results of these computations were presented to users in custom dashboards using libraries such as highcharts.
Much thought has also been put in creating the logs that we use as input. These are text files with lines in JSON format that record the subset of the website activity that is of interest to us. Each line corresponds to a specific user action and contains information such as the type of that action (e.g. “product view”, “search”, or “browsing”), a timestamp, the exact uri and also action-specific fields such as category or product id, search query etc.
Unfortunately, our approach had its limitations. Creating a new dashboard required much effort because many layers of the system would have to be touched: usually the MongoDB collections required a new schema so that query performance would be acceptable, which in turn meant altering existing or creating new scripts to convert the input to the required state. This process translated into time spent on designing, implementing and testing those scripts and also beefing up our MongoDB servers so they could handle the increased workload.
Years of repeating the above process whenever a need for a new dashboard arose had left us with a system that was unmaintainable and of debatable value. Sure, it did the number crunching correctly, but the actual presentation of the data was sub par, with the information hidden inside huge tables and dense plots. Any requested change could only be performed by the very few developers that were familiar with its inner workings, and even they found the task daunting.
All in all, we used to spend more time developing the platform than analysing the data it presented. Things like on-demand dashboards, interactive filtering and ad-hoc queries with acceptable response time looked like a far-fetched dream. The old way had served us well until then, but it was obvious that we had to approach our internal analytics platform differently.
ELK stack to the rescue
We were already happy Elasticsearch users here at Skroutz. We depend on it for our search engine functionality and have invested time into creating plugins for NLP tasks like stemmers for the Turkish language, but we save that for a future article.
We have followed the evolution of Elasticsearch from a search-specific platform to one whose power can also be leveraged for analytics. The success stories of companies using Elasticsearch for analytics (such as the Guardian or Thomson Reuters) also played a part in making us want to investigate and experiment with it.
What we found was that the aggregations functionality added in version 1.0.0 was a huge boost over facets, and support for scripted aggregations in version 1.4 also meant that we could use more complex calculations inside our queries. However, there are two features of Elasticsearch that we consider the most important: the fact that every field is indexed which means we can perform ad-hoc queries without performance loss, and the improvements on doc value performance in version 1.4 which means that our computations will not be heavily memory-dependent.
Another attractive aspect was that Elasticsearch can work seamlessly with two other tools, Logstash (for log monitoring and parsing) and Kibana (for data visualization) - all three comprising what is known as the ELK stack. This offers a complete solution that can take our log files, crunch them, and enable us to create beautiful, interactive and real-time dashboards with minimal fuss. The fact that we could just throw at it a couple of our log files and quickly experiment with it and judge whether it suits our needs also played a huge part in us making up our minds.
We decided to spend a few days experimenting with it. Setting up a simple ELK installation on a virtual machine was a breeze using the provided Debian packages. We then took a few days’ worth of logs, pointed Logstash to their location and let it import their contents into Elasticsearch. After that it was only a matter of waiting for the import to finish and opening Kibana inside our browsers to validate that our intuition about ELK was correct: we could visualize, aggregate and analyze our logs on demand, going into as much detail as we wanted.
After deciding on using the ELK stack for our internal analytics platform we spent a couple of weeks writing down in detail all of its requirements, creating user stories for the dashboards and planning our developing strategy. During that process we realised we would need to log some new fields so we altered our application’s code accordingly.
For our infrastructure we decided that at first we would use two nodes each having 8 cores, 32GB of RAM one 50GB SSD and two 600GB spinning disks, to handle our daily stream of about four million events. We would experiment with importing only new logs and after being satisfied with the cluster’s stability we would import a year’s worth of backlogs. We would also set up Curator to move older indexes into spinning disks so newer ones would stay on the SSDs.
We already had UDP multicast emission of our logs set up in our network, so we wrote a simple program that would join that multicast group and write to a file that would be monitored by Logstash. This approach worked well, apart for a few hiccups that were related to UDP and caused a 0.4% data loss. Those were mitigated and will be the subject of a future article.
We also needed to create custom mappings for our Elasticsearch documents. We took care to specifically set all the fields’ type correctly, their format to “doc_values”, and to set all string fields’ (apart from that holding the search queries) index to “not_analyzed”.
The next step was configuring Logstash. The most important part was using a filter to point to the timestamp field in our logs so it could be parsed and used by Elasticsearch. Besides that, we created some regular expressions to extract information such as pagination, ordering and filtering from the uri. Those were either added as tags or as new fields in the resulting document, depending on the use case.
We also created a pair of tools that we dubbed “scratch mapper” and “scratch reducer”. Scratch mapper serves the purpose of injecting the extra fields we needed into old log files that were created before our changes were made. The purpose of scratch reducer is a bit more complex. Logstash by default inserts events into daily indexes, determining the correct index from their timestamp and always assuming UTC. We also store our log files into daily chunks, however these correspond to Greek time (UTC +2 or +3). Scratch reducer abstracts away the case where we want to reindex a full Elasticsearch day index, getting a date as input, processing our log files and fetching the relevant lines.
Finally, we created a series of Kibana dashboards, extracted them into JSON files and put them under version control.
Reception & future work
We consider our project both a big success and also a great start. The simple and intuitive Kibana interface has democratized access to data, making it possible for non-technical people to easily navigate their way through a Kibana dashboard and even create custom panels that suit their specific needs. Everything is indexed and no code needs to be written when a need for different statistics arises. We now have a very fast feedback loop in which the impact of a change in the website can be monitored in real-time, opening many possibilities.
Of course this is only the beginning. We intend to continue working on this at least for the next few months:
- We intend to migrate to Kibana 4 as soon as it is stable. Aggregation support should be a huge benefit, letting us create panels showing bounce rate, conversion rates, etc.
- We intend to create custom panels for functionality not currently provided by Kibana out of the box. This will probably take advantage of the modularity of the Kibana 4 codebase.
- Finally, we intend to provide an API for other services in our stack to consume, thus providing a single source of internal analytics data. This will probably be written in Clojure and interface directly with Elasticsearch.