<h1 id="growing-the-documentation-of-our-android-project">Growing the documentation of our Android project</h1>
<p>Leonidas Partsas, Skroutz Engineering, 2024-02-29</p>
<p>In the last few years the Android team has grown significantly and, with that, so has our codebase.
We have reached a point where the lack of documentation has become an issue, but not for the reason you might think.
Documenting how a class works is not as essential as making that class easy to discover!</p>
<p>A couple of our problems:</p>
<ul>
<li>The biggest issue we have is the inability to reason about what we support.
For example, we have a concept called <code class="language-plaintext highlighter-rouge">Section</code>. Each section has its own type, and based on that type we
render it with a different layout. Being able to see, at a glance, which types we render has become
nearly impossible, since every relevant component might reside in a different package or even module.</li>
<li><code class="language-plaintext highlighter-rouge">Do we have anything for [place need here]?</code> This question is asked a bit too often, and its answer
depends either on the memory of the rest of the team or on the efficiency of the IDE’s search, provided the
name of the function/class etc. is descriptive enough.</li>
</ul>
<h2 id="our-goal">Our goal</h2>
<p>It is clear that we need some kind of documentation that allows us to easily <strong>discover</strong> what can help us.
Documentation that, apart from listing all classes, functions etc., can include custom lists like the one with all of our sections.</p>
<p>So, based on that we decided that we need to:</p>
<ol>
<li>Have a way to group code, from different files/packages, together.</li>
<li>Be able to add a visual hint such as an image (a picture is worth a thousand words).</li>
<li>Have docs that contain <strong>only</strong> the code that has comments. Everything else is just a distraction.</li>
</ol>
<h2 id="dokka">Dokka</h2>
<p>We decided to use <a href="https://github.com/Kotlin/dokka">Dokka</a> to achieve our goal.
It is a tool written and maintained by JetBrains, and it can be extended through a plugin system, allowing each team
to add the functionality it needs.</p>
<h4 id="dokkas-flow">Dokka’s flow</h4>
<p>In a very abstract and simplified way, we can describe Dokka’s flow like this:
<img src="https://engineering.skroutz.gr/images/growing-documentation/dokka-in-high-level.png" alt="img" /></p>
<ul>
<li>First, you provide it with anything that can be represented by modules, classes, functions etc. This is the <code class="language-plaintext highlighter-rouge">Input</code>.</li>
<li>That input is translated into a list of <code class="language-plaintext highlighter-rouge">Documentables</code>, where each documentable is one of
the aforementioned concepts.</li>
<li>The documentables are then transformed into a tree of <code class="language-plaintext highlighter-rouge">Pages</code> (one page per documentable), where each page is a collection of information
represented by structures such as titles, texts, links etc.</li>
<li>Finally, these pages are rendered to a desired format such as an HTML or Markdown page. This is the <code class="language-plaintext highlighter-rouge">Output</code>.</li>
</ul>
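<p>The flow above can be sketched as a simple pipeline. The following is a toy model, not Dokka’s actual API; all type and function names here are illustrative:</p>

```kotlin
// Toy model of Dokka's flow: Input -> Documentables -> Pages -> Output.
// These types are illustrative stand-ins; Dokka's real API is far richer.
data class Documentable(val name: String, val comment: String?)
data class Page(val title: String, val content: String)

// 1. Translate the input (here: name/comment pairs) into documentables.
fun translate(input: List<Pair<String, String?>>): List<Documentable> =
    input.map { (name, comment) -> Documentable(name, comment) }

// 2. Transform the documentables into pages, one page per documentable.
fun toPages(documentables: List<Documentable>): List<Page> =
    documentables.map { Page(it.name, it.comment.orEmpty()) }

// 3. Render the pages to the desired output format (here: Markdown-like text).
fun render(pages: List<Page>): String =
    pages.joinToString("\n\n") { "# ${it.title}\n${it.content}" }
```

<p>Chaining the three steps (<code class="language-plaintext highlighter-rouge">render(toPages(translate(input)))</code>) mirrors the Input → Documentables → Pages → Output arrows in the diagram.</p>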
<h4 id="entry-points-for-plugins">Entry points for plugins</h4>
<p>You might be wondering: where do we write our plugin’s code? For that we need to see the above flow in more detail:
<img src="https://engineering.skroutz.gr/images/growing-documentation/dokka-in-low-level.png" alt="img" /></p>
<p>Here, every arrow is an extension point:</p>
<ol>
<li>By default, Dokka provides a way to translate Java/Kotlin code into documentables, but it also allows us to add our own translations.
The resulting documentables are organized in modules. These are not necessarily the modules we have in our project,
even though that is the case in an Android project.</li>
<li>At this point Dokka provides us with a list of modules and allows us to transform them however we need.
We can add, remove or change all kinds of documentables, including the list of provided modules.</li>
<li>Here is where all modules are merged into one. Dokka expects a single merger and
provides a default implementation for it. Anything we provide must override the default one.</li>
<li>Yet another transformation point, like in step 2, only this time we have a single module with all documentables in it.</li>
<li>Moving from documentables to pages, Dokka expects a single translator. Again,
it provides a default implementation, and anything we provide must override it.</li>
<li>At this point Dokka provides us with a tree of pages and the ability to add one or more transformations
for that tree. We can modify the tree by adding, removing or updating a page.</li>
<li>The final entry point is where Dokka allows us to provide our own renderer. By default it uses one of
its own implementations, which renders the tree of pages into HTML pages.</li>
</ol>
<h4 id="documentation-node">Documentation node</h4>
<p>Creating documentation relies on two things: the code and, of course, the comments.</p>
<p>If a piece of code has a doc-comment, its corresponding documentable will have a documentation node
which is nothing more than a list of <code class="language-plaintext highlighter-rouge">TagWrapper</code>s.
A <code class="language-plaintext highlighter-rouge">TagWrapper</code> is used to represent anything that KDoc supports (the description, both summary and detailed,
the author, the since tag etc.) plus any custom tag that is used to extend KDoc. Such a custom tag
is represented in code by <code class="language-plaintext highlighter-rouge">CustomTagWrapper</code>.</p>
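<p>As a rough sketch of that structure (simplified stand-ins whose names mirror Dokka’s <code class="language-plaintext highlighter-rouge">TagWrapper</code>/<code class="language-plaintext highlighter-rouge">CustomTagWrapper</code>, not the actual classes):</p>

```kotlin
// Simplified stand-ins for Dokka's tag wrappers and documentation node.
sealed interface TagWrapper { val text: String }
data class Description(override val text: String) : TagWrapper
data class Author(override val text: String) : TagWrapper
// A custom tag also carries the tag's name, e.g. "tags" for our @tags block-tag.
data class CustomTagWrapper(val name: String, override val text: String) : TagWrapper

// A documentation node is nothing more than a list of tag wrappers.
data class DocumentationNode(val children: List<TagWrapper>)

// Convenience lookup for a custom tag by name.
fun DocumentationNode.customTag(name: String): CustomTagWrapper? =
    children.filterIsInstance<CustomTagWrapper>().firstOrNull { it.name == name }
```

<p>A doc-comment with a description and a <code class="language-plaintext highlighter-rouge">@tags</code> line would map to a node with one <code class="language-plaintext highlighter-rouge">Description</code> and one <code class="language-plaintext highlighter-rouge">CustomTagWrapper</code>.</p>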
<h2 id="skroutz-dokka-plugin">Skroutz Dokka Plugin</h2>
<h4 id="first-steps">First steps</h4>
<p>We decided to have the plugin as part of our repository.</p>
<p>For that we:</p>
<ol>
<li>created a Java/Kotlin library module and made it depend on <code class="language-plaintext highlighter-rouge">org.jetbrains.dokka:dokka-core</code> and <code class="language-plaintext highlighter-rouge">org.jetbrains.dokka:dokka-base</code>.</li>
<li>created a class that extends <code class="language-plaintext highlighter-rouge">DokkaPlugin</code> and</li>
<li>added a file named <em>org.jetbrains.dokka.plugability.DokkaPlugin</em> in the module’s resource folder (src/main/resources) under the path <code class="language-plaintext highlighter-rouge">META-INF/services</code>.
The file points to the class we created: <code class="language-plaintext highlighter-rouge">gr.skroutz.dokka.plugin.SkzDokkaPlugin</code>.</li>
</ol>
<p>Now, every time we run one of Dokka’s Gradle tasks (e.g. dokkaHtmlMultiModule), our plugin’s code is loaded
and executed for every module that is configured to create documentation.</p>
<p>Configuring a module:</p>
<ol>
<li>Dokka must be added in the <code class="language-plaintext highlighter-rouge">plugins { }</code> section and</li>
<li>Our plugin must be given as a dependency: <code class="language-plaintext highlighter-rouge">dokkaPlugin(project(":skroutz-dokka-plugin"))</code></li>
</ol>
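<p>Put together, the build script of a documented module might look roughly like this. This is a Gradle Kotlin DSL sketch; the plugin id and module name come from the steps above, everything else is an assumption about your setup:</p>

```kotlin
// build.gradle.kts of a module that should produce documentation (sketch)
plugins {
    id("org.jetbrains.dokka")
}

dependencies {
    // Attach our plugin to this module's Dokka tasks only
    dokkaPlugin(project(":skroutz-dokka-plugin"))
}
```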
<h4 id="have-docs-that-contain-only-the-code-that-has-comments">Have docs that contain only the code that has comments</h4>
<p>Even though it was not first on our list, this was the place to start, since we did not want the
clutter of many documentables that offer nothing because they have no comments.</p>
<p>By default, Dokka creates a page for every documentable. We didn’t want that. If our documentation has
a page, it is because there is a comment behind it.</p>
<p>For that we chose to go with entry point #2 and wrote a new <code class="language-plaintext highlighter-rouge">PreMergeDocumentableTransformer</code>.</p>
<p>Its job is to filter the provided list of modules and keep only those that have at least one package which, in turn,
has at least one documentable with a comment.</p>
<p>Implementation notes:</p>
<ul>
<li>We used <code class="language-plaintext highlighter-rouge">SuppressedByConditionDocumentableFilterTransformer</code>, which is designed for exactly that: deciding whether a documentable should be suppressed or not:</li>
</ul>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">private</span> <span class="kd">class</span> <span class="nc">KeepOnlyDocumentablesWithComments</span><span class="p">(</span>
<span class="n">context</span><span class="p">:</span> <span class="nc">DokkaContext</span>
<span class="p">):</span> <span class="nc">SuppressedByConditionDocumentableFilterTransformer</span><span class="p">(</span><span class="n">context</span><span class="p">)</span> <span class="p">{</span>
<span class="k">override</span> <span class="k">fun</span> <span class="nf">shouldBeSuppressed</span><span class="p">(</span><span class="n">d</span><span class="p">:</span> <span class="nc">Documentable</span><span class="p">):</span> <span class="nc">Boolean</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">d</span> <span class="p">!</span><span class="k">is</span> <span class="nc">DPackage</span> <span class="p">&&</span> <span class="p">!</span><span class="n">d</span><span class="p">.</span><span class="nf">hasDocumentation</span><span class="p">()</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<ul>
<li>We used an extension function for checking if a documentable has comments:</li>
</ul>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">internal</span> <span class="k">fun</span> <span class="nc">Documentable</span><span class="p">.</span><span class="nf">hasDocumentation</span><span class="p">():</span> <span class="nc">Boolean</span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">hasDocumentation</span> <span class="p">=</span> <span class="n">documentation</span><span class="p">.</span><span class="n">values</span><span class="p">.</span><span class="nf">flatMap</span> <span class="p">{</span> <span class="n">it</span><span class="p">.</span><span class="n">children</span> <span class="p">}.</span><span class="nf">isNotEmpty</span><span class="p">()</span>
<span class="k">if</span> <span class="p">(</span><span class="n">hasDocumentation</span><span class="p">)</span> <span class="k">return</span> <span class="k">true</span>
<span class="k">return</span> <span class="n">children</span><span class="p">.</span><span class="nf">any</span> <span class="p">{</span> <span class="n">it</span><span class="p">.</span><span class="nf">hasDocumentation</span><span class="p">()</span> <span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>The key part here is the recursion. It supports cases like a class
that, on its own, does not have a comment but one of its properties/methods does.</p>
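<p>The same recursive idea, stripped down to a self-contained toy (plain data classes instead of Dokka’s <code class="language-plaintext highlighter-rouge">Documentable</code>; the names are illustrative):</p>

```kotlin
// Toy documentable: an optional comment plus child nodes (properties, methods, ...).
data class Node(val comment: String? = null, val children: List<Node> = emptyList())

// A node "has documentation" if it, or any of its descendants, carries a comment.
fun Node.hasDocumentation(): Boolean =
    comment != null || children.any { it.hasDocumentation() }
```

<p>An uncommented class whose only property is documented is still kept, because the property’s comment bubbles up through the recursion.</p>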
<h4 id="have-a-way-to-group-code-from-different-filespackages-together">Have a way to group code, from different files/packages, together.</h4>
<p>The combination of Dokka and KDoc allows the usage of custom block-tags, so we decided to leverage that
to create groups of code. Each time we want a certain class/function etc. to be part of a group,
we <em>tag it</em> by using <code class="language-plaintext highlighter-rouge">@tags name-of-group</code> in its doc-comment:</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="cm">/**
* Renders a SKU in the list layout.
*
* @tags section item, rendered sku
*/</span></code></pre></figure>
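<p>Since the tag’s value is a comma-separated list of group names, extracting them is a simple split-and-trim. This is a sketch with a hypothetical helper name; in the plugin the names come out of the <code class="language-plaintext highlighter-rouge">CustomTagWrapper</code>, not a raw string:</p>

```kotlin
// Extract the group names from a raw "@tags" value such as "section item, rendered sku".
fun parseTagNames(raw: String): List<String> =
    raw.split(',').map { it.trim() }.filter { it.isNotEmpty() }
```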
<p>For that we had to implement yet another <code class="language-plaintext highlighter-rouge">PreMergeDocumentableTransformer</code>.</p>
<p>Its job is to</p>
<ol>
<li>collect, from all modules, all the documentables whose comment includes our custom block tag</li>
<li>group them by the tag’s name</li>
<li>create a package for every group (tag)</li>
<li>create a module that has all these new packages</li>
</ol>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">internal</span> <span class="kd">class</span> <span class="nc">CreateTagsModule</span> <span class="p">:</span> <span class="nc">PreMergeDocumentableTransformer</span> <span class="p">{</span>
<span class="k">override</span> <span class="k">fun</span> <span class="nf">invoke</span><span class="p">(</span><span class="n">modules</span><span class="p">:</span> <span class="nc">List</span><span class="p"><</span><span class="nc">DModule</span><span class="p">>):</span> <span class="nc">List</span><span class="p"><</span><span class="nc">DModule</span><span class="p">></span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">allDocumentables</span> <span class="p">=</span> <span class="n">modules</span>
<span class="p">.</span><span class="nf">flatMap</span> <span class="p">{</span> <span class="n">module</span> <span class="p">-></span> <span class="n">module</span><span class="p">.</span><span class="n">packages</span> <span class="p">}</span>
<span class="p">.</span><span class="nf">flatMap</span> <span class="p">{</span> <span class="n">pckg</span> <span class="p">-></span> <span class="n">pckg</span><span class="p">.</span><span class="nf">allDocumentables</span><span class="p">()</span> <span class="p">}</span>
<span class="kd">val</span> <span class="py">hasTags</span> <span class="p">=</span> <span class="n">allDocumentables</span>
<span class="p">.</span><span class="nf">any</span> <span class="p">{</span> <span class="n">documentable</span> <span class="p">-></span> <span class="n">documentable</span><span class="p">.</span><span class="nf">hasTags</span><span class="p">()</span> <span class="p">}</span>
<span class="kd">val</span> <span class="py">sourceSets</span> <span class="p">=</span> <span class="n">modules</span><span class="p">.</span><span class="nf">first</span><span class="p">().</span><span class="n">sourceSets</span>
<span class="k">return</span> <span class="k">if</span> <span class="p">(</span><span class="n">hasTags</span><span class="p">)</span> <span class="n">modules</span> <span class="p">+</span> <span class="nf">createTagsModule</span><span class="p">(</span><span class="n">allDocumentables</span><span class="p">,</span> <span class="n">sourceSets</span><span class="p">)</span> <span class="k">else</span> <span class="n">modules</span>
<span class="p">}</span>
<span class="k">private</span> <span class="k">fun</span> <span class="nf">createTagsModule</span><span class="p">(</span><span class="n">allDocumentables</span><span class="p">:</span> <span class="nc">List</span><span class="p"><</span><span class="nc">Documentable</span><span class="p">>,</span> <span class="n">sourceSets</span><span class="p">:</span> <span class="nc">Set</span><span class="p"><</span><span class="nc">DokkaConfiguration</span><span class="p">.</span><span class="nc">DokkaSourceSet</span><span class="p">>):</span> <span class="nc">DModule</span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">tagPackages</span> <span class="p">=</span> <span class="n">allDocumentables</span>
<span class="p">.</span><span class="nf">filter</span> <span class="p">{</span> <span class="n">documentable</span> <span class="p">-></span> <span class="n">documentable</span><span class="p">.</span><span class="nf">hasTags</span><span class="p">()</span> <span class="p">}</span>
<span class="p">.</span><span class="nf">flatMap</span> <span class="p">{</span> <span class="n">documentable</span> <span class="p">-></span> <span class="n">documentable</span><span class="p">.</span><span class="nf">allTags</span><span class="p">().</span><span class="nf">map</span> <span class="p">{</span> <span class="n">tag</span> <span class="p">-></span> <span class="n">tag</span> <span class="n">to</span> <span class="n">documentable</span> <span class="p">}</span> <span class="p">}</span>
<span class="p">.</span><span class="nf">groupBy</span> <span class="p">{</span> <span class="n">entry</span> <span class="p">-></span> <span class="n">entry</span><span class="p">.</span><span class="n">first</span> <span class="p">}</span>
<span class="p">.</span><span class="nf">mapValues</span> <span class="p">{</span> <span class="n">entry</span> <span class="p">-></span> <span class="n">entry</span><span class="p">.</span><span class="n">value</span><span class="p">.</span><span class="nf">map</span> <span class="p">{</span> <span class="n">it</span><span class="p">.</span><span class="n">second</span> <span class="p">}</span> <span class="p">}</span>
<span class="p">.</span><span class="nf">map</span> <span class="p">{</span> <span class="n">entry</span> <span class="p">-></span> <span class="nf">createTagPackage</span><span class="p">(</span><span class="n">entry</span><span class="p">.</span><span class="n">key</span><span class="p">,</span> <span class="n">entry</span><span class="p">.</span><span class="n">value</span><span class="p">,</span> <span class="n">sourceSets</span><span class="p">)</span> <span class="p">}</span>
<span class="k">return</span> <span class="nc">DModule</span><span class="p">(</span>
<span class="n">name</span> <span class="p">=</span> <span class="nc">TAGS</span><span class="p">,</span>
<span class="n">packages</span> <span class="p">=</span> <span class="n">tagPackages</span><span class="p">,</span>
<span class="n">documentation</span> <span class="p">=</span> <span class="nf">emptyMap</span><span class="p">(),</span>
<span class="n">sourceSets</span> <span class="p">=</span> <span class="nf">emptySet</span><span class="p">()</span>
<span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>Implementation notes:</p>
<ul>
<li>Dokka does not allow a documentable to be part of more than one page. This means that
simply creating a new package with the tagged documentables would cause a failure. That is why,
for every new package, we made copies of the necessary documentables and added those to it.</li>
</ul>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">private</span> <span class="k">fun</span> <span class="nc">Documentable</span><span class="p">.</span><span class="nf">makeCopyForTag</span><span class="p">(</span><span class="n">tag</span><span class="p">:</span> <span class="nc">String</span><span class="p">):</span> <span class="nc">Documentable</span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">newDri</span> <span class="p">=</span> <span class="n">dri</span><span class="p">.</span><span class="nf">copy</span><span class="p">(</span><span class="n">extra</span> <span class="p">=</span> <span class="n">tag</span><span class="p">)</span>
<span class="k">return</span> <span class="k">when</span> <span class="p">(</span><span class="k">this</span><span class="p">)</span> <span class="p">{</span>
<span class="k">is</span> <span class="nc">DFunction</span> <span class="p">-></span> <span class="nf">copy</span><span class="p">(</span><span class="n">dri</span> <span class="p">=</span> <span class="n">newDri</span><span class="p">,</span> <span class="n">extra</span> <span class="p">=</span> <span class="nc">PropertyContainer</span><span class="p">.</span><span class="nf">withAll</span><span class="p">(</span><span class="nc">IsCopy</span><span class="p">))</span>
<span class="k">is</span> <span class="nc">DProperty</span> <span class="p">-></span> <span class="nf">copy</span><span class="p">(</span><span class="n">dri</span> <span class="p">=</span> <span class="n">newDri</span><span class="p">,</span> <span class="n">extra</span> <span class="p">=</span> <span class="nc">PropertyContainer</span><span class="p">.</span><span class="nf">withAll</span><span class="p">(</span><span class="nc">IsCopy</span><span class="p">))</span>
<span class="k">is</span> <span class="nc">DTypeAlias</span> <span class="p">-></span> <span class="nf">copy</span><span class="p">(</span><span class="n">dri</span> <span class="p">=</span> <span class="n">newDri</span><span class="p">,</span> <span class="n">extra</span> <span class="p">=</span> <span class="nc">PropertyContainer</span><span class="p">.</span><span class="nf">withAll</span><span class="p">(</span><span class="nc">IsCopy</span><span class="p">))</span>
<span class="k">is</span> <span class="nc">DClasslike</span> <span class="p">-></span> <span class="nf">makeCopy</span><span class="p">(</span><span class="n">newDri</span><span class="p">)</span>
<span class="k">is</span> <span class="nc">DParameter</span> <span class="p">-></span> <span class="nf">copy</span><span class="p">(</span><span class="n">dri</span> <span class="p">=</span> <span class="n">newDri</span><span class="p">,</span> <span class="n">extra</span> <span class="p">=</span> <span class="nc">PropertyContainer</span><span class="p">.</span><span class="nf">withAll</span><span class="p">(</span><span class="nc">IsCopy</span><span class="p">))</span>
<span class="k">else</span> <span class="p">-></span> <span class="k">throw</span> <span class="nc">IllegalStateException</span><span class="p">(</span><span class="s">"I don't know what to do with $this"</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<ul>
<li>This transformer is set to run after the one that filters out all documentables with no comments:</li>
</ul>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">internal</span> <span class="kd">val</span> <span class="py">createTagsModule</span> <span class="k">by</span> <span class="nf">extending</span> <span class="p">{</span>
<span class="n">dokkaBase</span><span class="p">.</span><span class="n">preMergeDocumentableTransformer</span> <span class="n">with</span> <span class="nc">CreateTagsModule</span><span class="p">()</span> <span class="nf">order</span> <span class="p">{</span> <span class="nf">after</span><span class="p">(</span><span class="n">suppressDocumentablesWithNoDocumentation</span><span class="p">)</span> <span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<h6 id="showing-tags-in-the-documentables-page">Showing tags in the documentable’s page</h6>
<p>One thing we wanted was to have our custom tags render in a page just like <code class="language-plaintext highlighter-rouge">@since</code> or <code class="language-plaintext highlighter-rouge">@author</code> do.</p>
<p>For that, Dokka provides an abstraction (<code class="language-plaintext highlighter-rouge">CustomTagContentProvider</code>) that you can implement to define the way you want your
custom tag to be structured.</p>
<p>For our <code class="language-plaintext highlighter-rouge">@tags</code> tag we chose to go with a title and the tags underneath it:</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">override</span> <span class="k">fun</span> <span class="nc">PageContentBuilder</span><span class="p">.</span><span class="nc">DocumentableContentBuilder</span><span class="p">.</span><span class="nf">contentForDescription</span><span class="p">(</span>
<span class="n">sourceSet</span><span class="p">:</span> <span class="nc">DokkaConfiguration</span><span class="p">.</span><span class="nc">DokkaSourceSet</span><span class="p">,</span>
<span class="n">customTag</span><span class="p">:</span> <span class="nc">CustomTagWrapper</span>
<span class="p">)</span> <span class="p">{</span>
<span class="nf">group</span><span class="p">(</span><span class="n">sourceSets</span> <span class="p">=</span> <span class="nf">setOf</span><span class="p">(</span><span class="n">sourceSet</span><span class="p">),</span> <span class="n">styles</span> <span class="p">=</span> <span class="nf">emptySet</span><span class="p">())</span> <span class="p">{</span>
<span class="nf">header</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="nc">TAGS</span><span class="p">)</span>
<span class="nf">comment</span><span class="p">(</span><span class="n">customTag</span><span class="p">.</span><span class="n">root</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<h6 id="making-tags-searchable">Making tags searchable</h6>
<p>One of the <code class="language-plaintext highlighter-rouge">PageTransformer</code>s (entry point #6) that Dokka offers out of the box is <code class="language-plaintext highlighter-rouge">SearchbarDataInstaller</code>.
Its job is to create the file that populates the search functionality.</p>
<p>We decided to add a descendant of <code class="language-plaintext highlighter-rouge">SearchbarDataInstaller</code> that creates a search record for every
tag we come across. For that we made sure that, when a package-related page gets processed, we check
whether it contains a tag-package documentable and, if it does, we create a search record for that tag:</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">override</span> <span class="k">fun</span> <span class="nf">processPage</span><span class="p">(</span><span class="n">page</span><span class="p">:</span> <span class="nc">PageNode</span><span class="p">):</span> <span class="nc">List</span><span class="p"><</span><span class="nc">SignatureWithId</span><span class="p">></span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">page</span><span class="p">.</span><span class="nf">isCopy</span><span class="p">())</span> <span class="k">return</span> <span class="nf">emptyList</span><span class="p">()</span>
<span class="k">if</span> <span class="p">(</span><span class="n">page</span> <span class="p">!</span><span class="k">is</span> <span class="nc">PackagePageNode</span><span class="p">)</span> <span class="k">return</span> <span class="k">super</span><span class="p">.</span><span class="nf">processPage</span><span class="p">(</span><span class="n">page</span><span class="p">)</span>
<span class="kd">val</span> <span class="py">tagPackage</span> <span class="p">=</span> <span class="n">page</span><span class="p">.</span><span class="n">documentables</span><span class="p">.</span><span class="nf">firstOrNull</span> <span class="p">{</span> <span class="n">it</span> <span class="k">is</span> <span class="nc">DPackage</span> <span class="p">&&</span> <span class="n">it</span><span class="p">.</span><span class="n">extra</span><span class="p">[</span><span class="nc">IsTagPackage</span><span class="p">]</span> <span class="p">!=</span> <span class="k">null</span> <span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">tagPackage</span> <span class="p">!=</span> <span class="k">null</span><span class="p">)</span> <span class="p">{</span>
<span class="n">tagPackageNames</span><span class="p">.</span><span class="nf">add</span><span class="p">(</span><span class="n">page</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
<span class="k">return</span> <span class="n">page</span><span class="p">.</span><span class="n">dri</span><span class="p">.</span><span class="nf">map</span> <span class="p">{</span> <span class="nc">SignatureWithId</span><span class="p">(</span><span class="n">it</span><span class="p">,</span> <span class="n">page</span><span class="p">)</span> <span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="k">super</span><span class="p">.</span><span class="nf">processPage</span><span class="p">(</span><span class="n">page</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">override</span> <span class="k">fun</span> <span class="nf">createSearchRecord</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="nc">String</span><span class="p">,</span> <span class="n">description</span><span class="p">:</span> <span class="nc">String</span><span class="p">?,</span> <span class="n">location</span><span class="p">:</span> <span class="nc">String</span><span class="p">,</span> <span class="n">searchKeys</span><span class="p">:</span> <span class="nc">List</span><span class="p"><</span><span class="nc">String</span><span class="p">>):</span> <span class="nc">SearchRecord</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">name</span> <span class="p">!</span><span class="k">in</span> <span class="n">tagPackageNames</span><span class="p">)</span> <span class="k">return</span> <span class="k">super</span><span class="p">.</span><span class="nf">createSearchRecord</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">description</span><span class="p">,</span> <span class="n">location</span><span class="p">,</span> <span class="n">searchKeys</span><span class="p">)</span>
<span class="kd">val</span> <span class="py">tag</span> <span class="p">=</span> <span class="n">name</span><span class="p">.</span><span class="nf">removePrefix</span><span class="p">(</span><span class="nc">TAG_PACKAGE_PREFIX</span><span class="p">)</span>
<span class="k">return</span> <span class="nc">SearchRecord</span><span class="p">(</span>
<span class="n">name</span><span class="p">,</span>
<span class="n">tag</span><span class="p">,</span>
<span class="n">location</span><span class="p">,</span>
<span class="nf">listOf</span><span class="p">(</span><span class="n">tag</span><span class="p">)</span>
<span class="p">)</span>
<span class="p">}</span></code></pre></figure>
<p>Implementation notes:</p>
<ul>
<li>In order to have our transformer executed, we had to override the default one:</li>
</ul>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">internal</span> <span class="kd">val</span> <span class="py">makeTagsSearchable</span> <span class="k">by</span> <span class="nf">extending</span> <span class="p">{</span>
<span class="n">dokkaBase</span><span class="p">.</span><span class="n">htmlPreprocessors</span> <span class="n">providing</span> <span class="o">::</span><span class="nc">MakeTagsSearchable</span> <span class="k">override</span> <span class="n">dokkaBase</span><span class="p">.</span><span class="n">baseSearchbarDataInstaller</span>
<span class="p">}</span></code></pre></figure>
<ul>
<li>Every documentable provides a container where you can add custom properties.
We used that to mark every copied documentable with the property <code class="language-plaintext highlighter-rouge">IsCopy</code> and
every tag-package with <code class="language-plaintext highlighter-rouge">IsTagPackage</code> during the documentables’ transformations.</li>
</ul>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">internal</span> <span class="n">data</span> <span class="kd">object</span> <span class="nc">IsCopy</span> <span class="p">:</span> <span class="nc">ExtraProperty</span><span class="p"><</span><span class="nc">Documentable</span><span class="p">>,</span> <span class="nc">ExtraProperty</span><span class="p">.</span><span class="nc">Key</span><span class="p"><</span><span class="nc">Documentable</span><span class="p">,</span> <span class="nc">IsCopy</span><span class="p">></span> <span class="p">{</span>
<span class="k">override</span> <span class="kd">val</span> <span class="py">key</span><span class="p">:</span> <span class="nc">ExtraProperty</span><span class="p">.</span><span class="nc">Key</span><span class="p"><</span><span class="nc">Documentable</span><span class="p">,</span> <span class="err">*</span><span class="p">></span> <span class="p">=</span> <span class="nc">IsCopy</span>
<span class="p">}</span>
<span class="k">internal</span> <span class="kd">class</span> <span class="nc">IsTagPackage</span><span class="p">(</span><span class="kd">val</span> <span class="py">tag</span><span class="p">:</span> <span class="nc">String</span><span class="p">)</span> <span class="p">:</span> <span class="nc">ExtraProperty</span><span class="p"><</span><span class="nc">DPackage</span><span class="p">></span> <span class="p">{</span>
<span class="k">override</span> <span class="kd">val</span> <span class="py">key</span><span class="p">:</span> <span class="nc">ExtraProperty</span><span class="p">.</span><span class="nc">Key</span><span class="p"><</span><span class="nc">DPackage</span><span class="p">,</span> <span class="err">*</span><span class="p">></span> <span class="k">get</span><span class="p">()</span> <span class="p">=</span> <span class="nc">IsTagPackage</span>
<span class="k">internal</span> <span class="k">companion</span> <span class="k">object</span> <span class="p">:</span> <span class="nc">ExtraProperty</span><span class="p">.</span><span class="nc">Key</span><span class="p"><</span><span class="nc">DPackage</span><span class="p">,</span> <span class="nc">IsTagPackage</span><span class="p">></span>
<span class="p">}</span></code></pre></figure>
<p>This way we were able to keep only the pages that contained our tags.</p>
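<p>To make the pattern concrete, here is a self-contained sketch of the marker-property idea. The types below are simplified stand-ins for Dokka’s <code class="language-plaintext highlighter-rouge">ExtraProperty</code> and <code class="language-plaintext highlighter-rouge">PropertyContainer</code> machinery, and the names <code class="language-plaintext highlighter-rouge">Doc</code> and <code class="language-plaintext highlighter-rouge">keepTagPackages</code> are ours, not Dokka’s: each documentable carries a bag of marker properties, and a later pass filters on them.</p>

```kotlin
// Simplified stand-ins for Dokka's ExtraProperty/PropertyContainer types;
// names and shapes here are illustrative, not the actual Dokka API.
interface ExtraProperty

object IsCopy : ExtraProperty
data class IsTagPackage(val tag: String) : ExtraProperty

data class Doc(val name: String, val extras: Set<ExtraProperty> = emptySet())

// A later pass keeps only the synthetic tag-packages, dropping everything else.
fun keepTagPackages(all: List<Doc>): List<Doc> =
    all.filter { doc -> doc.extras.any { it is IsTagPackage } }

fun main() {
    val docs = listOf(
        Doc("PillsSection", setOf(IsCopy, IsTagPackage("section"))),
        Doc("NetworkClient")
    )
    println(keepTagPackages(docs).map { it.name }) // [PillsSection]
}
```

<p>The real transformer performs the equivalent check against each documentable’s <code class="language-plaintext highlighter-rouge">extra</code> container.</p>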
<h4 id="be-able-to-add-a-visual-hint-such-as-an-image">Be able to add a visual hint such as an image</h4>
<p>Grouping code is very helpful. There are cases though, like the one with sections, where it wasn’t enough.
We wanted every group item to have a preview of how it looks so that we can easily pick and choose
what fits our needs.</p>
<p>To support that we had to break it into two parts:</p>
<ul>
<li>First we needed to add support for one more block-tag. One that will be used to provide the name of an image.</li>
<li>Then we had to make sure that the image is rendered in the resulting page.</li>
</ul>
<h6 id="the-block-tag">The block-tag</h6>
<p>We wanted to make it as easy as possible for the commenter:</p>
<ol>
<li>Take a screenshot</li>
<li>Give it the name you want (ex: <code class="language-plaintext highlighter-rouge">image-name.png</code>)</li>
<li>Move it to a specific folder (ex: <code class="language-plaintext highlighter-rouge">images/previews</code>)</li>
<li>Add the block tag <code class="language-plaintext highlighter-rouge">@preview image-name.png</code> to the comment</li>
</ol>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="cm">/**
* Renders a list of pills horizontally.
*
* @tags section
* @preview section-pills.png
*/</span> </code></pre></figure>
<p>Then, another implementation of <code class="language-plaintext highlighter-rouge">CustomTagContentProvider</code> makes sure that the block-tag is
structured as an image:</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">private</span> <span class="k">fun</span> <span class="nc">PageContentBuilder</span><span class="p">.</span><span class="nc">DocumentableContentBuilder</span><span class="p">.</span><span class="nf">previewComment</span><span class="p">(</span><span class="n">customTag</span><span class="p">:</span> <span class="nc">CustomTagWrapper</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">text</span> <span class="p">=</span> <span class="p">(</span><span class="n">customTag</span><span class="p">.</span><span class="n">root</span><span class="p">.</span><span class="n">children</span><span class="p">.</span><span class="nf">first</span><span class="p">().</span><span class="n">children</span><span class="p">.</span><span class="nf">first</span><span class="p">()</span> <span class="k">as</span> <span class="nc">Text</span><span class="p">)</span>
<span class="kd">val</span> <span class="py">customDocTag</span> <span class="p">=</span> <span class="nc">CustomDocTag</span><span class="p">(</span>
<span class="n">children</span> <span class="p">=</span> <span class="nf">listOf</span><span class="p">(</span>
<span class="nc">Img</span><span class="p">(</span>
<span class="n">params</span> <span class="p">=</span> <span class="nf">mapOf</span><span class="p">(</span>
<span class="s">"href"</span> <span class="n">to</span> <span class="s">"images/previews/${text.body}"</span><span class="p">,</span>
<span class="s">"alt"</span> <span class="n">to</span> <span class="nc">ALT_SKZ</span>
<span class="p">)</span>
<span class="p">)</span>
<span class="p">),</span>
<span class="n">name</span> <span class="p">=</span> <span class="n">customTag</span><span class="p">.</span><span class="n">name</span>
<span class="p">)</span>
<span class="nf">comment</span><span class="p">(</span><span class="n">customDocTag</span><span class="p">)</span>
<span class="p">}</span></code></pre></figure>
<h6 id="rendering-the-image">Rendering the image</h6>
<p>The content provider sets the image’s structure but, at this stage, it does not know anything about the page
that will use it. So the image’s path is not correct and the page will not be able to find it.</p>
<p>To fix it we wrote a <code class="language-plaintext highlighter-rouge">PageTransformer</code> that changes the image’s path after taking into consideration
the page’s position in the tree of pages:</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">override</span> <span class="k">fun</span> <span class="nf">invoke</span><span class="p">(</span><span class="n">input</span><span class="p">:</span> <span class="nc">RootPageNode</span><span class="p">):</span> <span class="nc">RootPageNode</span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">locationProvider</span> <span class="p">=</span> <span class="n">locationProviderFactory</span><span class="p">.</span><span class="nf">getLocationProvider</span><span class="p">(</span><span class="n">input</span><span class="p">)</span>
<span class="k">return</span> <span class="n">input</span><span class="p">.</span><span class="nf">transformContentPagesTree</span> <span class="p">{</span> <span class="n">contentPage</span> <span class="p">-></span>
<span class="kd">val</span> <span class="py">hasPreviewImage</span> <span class="p">=</span> <span class="n">contentPage</span><span class="p">.</span><span class="n">content</span><span class="p">.</span><span class="nf">allContentNodes</span><span class="p">().</span><span class="nf">any</span> <span class="p">{</span> <span class="n">it</span> <span class="k">is</span> <span class="nc">ContentEmbeddedResource</span> <span class="p">&&</span> <span class="n">it</span><span class="p">.</span><span class="n">altText</span> <span class="p">==</span> <span class="nc">ALT_SKZ</span> <span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">hasPreviewImage</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">count</span> <span class="p">=</span> <span class="n">locationProvider</span><span class="p">.</span><span class="nf">ancestors</span><span class="p">(</span><span class="n">contentPage</span><span class="p">).</span><span class="nf">count</span><span class="p">()</span>
<span class="k">return</span><span class="nd">@transformContentPagesTree</span> <span class="n">contentPage</span><span class="p">.</span><span class="nf">modified</span><span class="p">(</span>
<span class="n">content</span> <span class="p">=</span> <span class="n">contentPage</span><span class="p">.</span><span class="n">content</span><span class="p">.</span><span class="n">mapTransform</span><span class="p"><</span><span class="nc">ContentEmbeddedResource</span><span class="p">,</span> <span class="nc">ContentNode</span><span class="p">></span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">prefix</span> <span class="p">=</span> <span class="s">"../"</span> <span class="p">*</span> <span class="n">count</span>
<span class="n">it</span><span class="p">.</span><span class="nf">copy</span><span class="p">(</span><span class="n">address</span> <span class="p">=</span> <span class="n">prefix</span> <span class="p">+</span> <span class="n">it</span><span class="p">.</span><span class="n">address</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="p">}</span>
<span class="n">contentPage</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
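<p>One small note on the snippet above: <code class="language-plaintext highlighter-rouge">"../" * count</code> is not standard Kotlin, as <code class="language-plaintext highlighter-rouge">String</code> has no <code class="language-plaintext highlighter-rouge">times</code> operator in the stdlib, so it presumably relies on a tiny operator extension. Here is a hedged, self-contained sketch of the prefix computation; the extension and the <code class="language-plaintext highlighter-rouge">relativize</code> helper are our own illustration, not the plugin’s actual code:</p>

```kotlin
// "../" * count is not in the Kotlin stdlib; the transformer presumably
// defines a small operator extension like this one:
operator fun String.times(n: Int): String = repeat(n)

// Prefix a root-relative asset address with one "../" per ancestor page,
// so the image resolves correctly from a nested page's directory.
// The name relativize is hypothetical, for illustration only.
fun relativize(address: String, ancestorCount: Int): String =
    "../" * ancestorCount + address

fun main() {
    println(relativize("images/previews/section-pills.png", 2))
    // ../../images/previews/section-pills.png
}
```

<p>With one <code class="language-plaintext highlighter-rouge">../</code> per ancestor, the address resolves correctly regardless of how deep the page sits in the generated tree.</p>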
<h2 id="final-result">Final result</h2>
<p>As we already said, an image is worth a thousand words, so this is how our docs are starting to look:
<img src="https://engineering.skroutz.gr/images/growing-documentation/list-with-previews.png" alt="img" />
<em>this is the page for the tag <code class="language-plaintext highlighter-rouge">section</code></em></p>
<h2 id="links">Links:</h2>
<ol>
<li><a href="https://kotlin.github.io/dokka/1.9.10/developer_guide/introduction/">Developer’s guide for writing a Dokka plugin</a></li>
<li><a href="https://kotlinlang.org/docs/dokka-introduction.html">Documentation for using Dokka</a></li>
<li><a href="https://github.com/Kotlin/dokka/blob/master/dokka-subprojects/README.md">Code of default plugins that come with Dokka</a></li>
<li><a href="https://kotlinlang.org/docs/kotlin-doc.html">KDoc</a></li>
</ol>
<p><a href="https://engineering.skroutz.gr/blog/growing-the-documentation-of-our-android-project/">Growing the documentation of our android project using Dokka</a> was originally published by Leonidas Partsas at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on February 29, 2024.</p>https://engineering.skroutz.gr/blog/the-importance-of-having-a-healthy-chapter2022-12-06T22:00:00+00:002022-12-06T22:00:00+00:00Leonidas Partsashttps://engineering.skroutz.gr<p>At Skroutz, every product engineer belongs both to a product team and a chapter. A product team contains people from all crafts and
is responsible for delivering new features to our users. A chapter, on the other hand, contains only engineers of a certain craft and
is responsible for all technical aspects of a project.</p>
<p>The mobile team has two chapters. One for Android engineers and one for iOS. Both chapters have weekly meetings where we inform each other on what we are doing and
discuss ways to move our codebase and project forward.</p>
<p>As you can probably guess, for a big project like this, a meeting once a week is not enough to keep it scalable, maintainable and up to date.
This is a constant effort which requires organization and, most of all, good communication. This is where a healthy chapter shines. This is what our Android chapter is!</p>
<h2 id="the-example">The example</h2>
<p>In a recent PR I noticed that we keep using the following convention:</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="kd">class</span> <span class="nc">CampaignTracking</span><span class="p">(</span>
<span class="kd">val</span> <span class="py">tag</span><span class="p">:</span> <span class="nc">String</span><span class="p">,</span>
<span class="kd">val</span> <span class="py">trackableActions</span><span class="p">:</span> <span class="nc">List</span><span class="p"><</span><span class="nc">ActionType</span><span class="p">></span>
<span class="p">)</span> <span class="p">:</span> <span class="nc">RootObject</span> <span class="p">{</span>
<span class="k">fun</span> <span class="nf">shouldTrackClick</span><span class="p">():</span> <span class="nc">Boolean</span> <span class="p">=</span> <span class="n">trackableActions</span><span class="p">.</span><span class="nf">contains</span><span class="p">(</span><span class="nc">ActionType</span><span class="p">.</span><span class="nc">CLICK</span><span class="p">)</span>
<span class="k">fun</span> <span class="nf">shouldTrackImpression</span><span class="p">():</span> <span class="nc">Boolean</span> <span class="p">=</span> <span class="n">trackableActions</span><span class="p">.</span><span class="nf">contains</span><span class="p">(</span><span class="nc">ActionType</span><span class="p">.</span><span class="nc">IMPRESSION</span><span class="p">)</span>
<span class="p">}</span></code></pre></figure>
<p>where we add a helper method for each supported enum value.</p>
<p>This, in my opinion, violates the <a href="https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle">open-closed principle</a> since every change in <code class="language-plaintext highlighter-rouge">ActionType</code> will require a change in <code class="language-plaintext highlighter-rouge">CampaignTracking</code> too.
Because I feel comfortable with the team I am in, I didn’t just keep the thought to myself; I shared it in our Slack channel.
The main argument for having the convention was readability, so I even argued that, in Kotlin, something like this</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">if</span> <span class="p">(</span><span class="nc">TrackableActionType</span><span class="p">.</span><span class="nc">IMPRESSION</span> <span class="k">in</span> <span class="n">campaign</span><span class="p">.</span><span class="n">trackableActions</span><span class="p">)</span> <span class="p">{</span>
<span class="o">..</span><span class="p">.</span>
<span class="p">}</span></code></pre></figure>
<p>is readable too!</p>
<p>Soon after my comment a discussion started where a colleague suggested a simple and elegant solution:</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="kd">class</span> <span class="nc">CampaignTracking</span><span class="p">(</span>
<span class="kd">val</span> <span class="py">tag</span><span class="p">:</span> <span class="nc">String</span><span class="p">,</span>
<span class="kd">val</span> <span class="py">trackableActions</span><span class="p">:</span> <span class="nc">List</span><span class="p"><</span><span class="nc">ActionType</span><span class="p">></span>
<span class="p">)</span> <span class="p">:</span> <span class="nc">RootObject</span> <span class="p">{</span>
<span class="k">fun</span> <span class="nf">isActionTracked</span><span class="p">(</span><span class="n">type</span><span class="p">:</span> <span class="nc">ActionType</span><span class="p">):</span> <span class="nc">Boolean</span> <span class="p">=</span> <span class="n">trackableActions</span><span class="p">.</span><span class="nf">contains</span><span class="p">(</span><span class="n">type</span><span class="p">)</span>
<span class="p">}</span></code></pre></figure>
<p>which fixes both the initial problem and the one I introduced, by removing the per-enum methods altogether.</p>
<p>You see, by having a method like <code class="language-plaintext highlighter-rouge">isActionTracked</code> we hide implementation details like the fact that we use a list for trackable actions.
Exposing something like this makes the code hard to change/scale.</p>
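<p>To see why hiding the collection matters, consider a hypothetical follow-up change: swapping the backing <code class="language-plaintext highlighter-rouge">List</code> for a <code class="language-plaintext highlighter-rouge">Set</code> to deduplicate actions. With <code class="language-plaintext highlighter-rouge">isActionTracked</code> as the single entry point, no caller has to change. The class below is a simplified sketch for illustration, not our production code:</p>

```kotlin
enum class ActionType { CLICK, IMPRESSION }

// Simplified sketch: the backing collection changed from List to Set,
// but callers of isActionTracked are completely unaffected.
class CampaignTracking(
    val tag: String,
    trackableActions: List<ActionType>
) {
    private val trackable: Set<ActionType> = trackableActions.toSet()

    fun isActionTracked(type: ActionType): Boolean = type in trackable
}

fun main() {
    val campaign = CampaignTracking("sale", listOf(ActionType.CLICK, ActionType.CLICK))
    println(campaign.isActionTracked(ActionType.CLICK))      // true
    println(campaign.isActionTracked(ActionType.IMPRESSION)) // false
}
```

<p>The same would hold for any other representation change, as long as the method’s contract stays intact.</p>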
<p>This example might seem trivial and the solution simple but imagine having lots of these small changes every day. The project will self heal in no time!
And all that because we, as a chapter, are not afraid of suggesting things.</p>
<h2 id="a-healthy-chapter">A healthy chapter</h2>
<p>In a healthy chapter every member is trusted, is not afraid to ask questions, can express an opinion and, above all, listens to the other team members.
In such an environment ego comes last and knowledge/information flows through the team ending up in having all decisions shaped and accepted by everyone.</p>
<p>The fact that we are such a team has helped in applying certain practices that allow the project, and us, to grow both on a day-to-day basis and in the long term:</p>
<h5 id="day-to-day">Day to day</h5>
<p>Having a group of talented and capable engineers is not enough if they don’t communicate.</p>
<p>This is why we have adopted two rules in our chapter:</p>
<ol>
<li>
<p><strong>Don’t remain stuck for more than a couple of hours, ask!</strong>
The project is big and chances are that the problem you are facing has already been solved, so ask!
Someone will either point you to the proper file or will search / pair with you and help you solve it.
At the end of the day, the team can be an excellent rubber duck. Try forming the question and an answer might pop up on its own!</p>
</li>
<li>
<p><strong>If you feel that you want to challenge a decision, do it!</strong>
As we saw from the example above both the project and the team will benefit from it.</p>
</li>
</ol>
<h5 id="long-term">Long term</h5>
<p>Our project is old and the codebase big, so to keep it up to date we need a plan and small steps over a long period of time.
This is why we have a board where we add, discuss and monitor our long running tasks.</p>
<p>Tasks that aim to help the project move forward but cannot be resolved
by one person or in one “sprint”. Tasks like the migration from Java to Kotlin, the migration from callbacks to coroutines, moving from the deprecated <code class="language-plaintext highlighter-rouge">onActivityResult</code>
to something that suits our needs (spoiler: we ended up creating a tiny library for that) and many more.</p>
<p>The process has four steps:</p>
<ul>
<li>Every new idea and suggestion is added to an inbox. Nothing detailed. Just a short description like “Usage of Hilt” or “Introduction of Jetpack Compose”.</li>
<li>If someone wants to investigate the proposed task, she assigns it to herself and delivers a proof of concept to the chapter.</li>
<li>With the POC in hand, the chapter discusses whether it’s worth moving forward or not.</li>
<li>If the suggestion gets accepted, we refine the proposed change and extract a detailed action plan.</li>
</ul>
<p>A simple process for sure, but it can only work effectively with a healthy team:</p>
<ul>
<li>Not fearing criticism results in these ideas and suggestions being voiced</li>
<li>Feeling trusted results in people stepping up and taking the initiative to investigate and propose a solution</li>
<li>Expressing our opinions freely results in solid and structured plans</li>
</ul>
<p>The mobile apps have changed a lot during the last couple of years and have managed to close the gap with the mobile web, offering our users a great and complete experience.
All that, while still maintaining a proper codebase, wouldn’t have been possible if the chapter didn’t have such professional engineers who love their craft.</p>
<p>Stay tuned, more to come!</p>
<p>feature image: <a href="https://unsplash.com/photos/tFTYlAc9pyw">unsplash</a></p>
<p><a href="https://engineering.skroutz.gr/blog/the-importance-of-having-a-healthy-chapter/">The Importance of Having a Healthy Chapter</a> was originally published by Leonidas Partsas at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on December 06, 2022.</p>https://engineering.skroutz.gr/blog/handling-inertial-scroll-in-combination-with-scroll-snapping2022-05-15T22:00:00+00:002022-05-15T22:00:00+00:00Angelos Chalarishttps://engineering.skroutz.gr<p>At Skroutz, we aspire to provide the most intuitive and hassle-free user experience. As a result, we constantly iterate over interface elements, redesigning, polishing and tailoring them to users’ needs. One such iteration was the recent redesign of the image gallery on fashion pages.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/FivPWyhhlgY" frameborder="0"> </iframe>
<p>The aim of the redesign was to provide a more premium user experience on fashion categories. We opted to increase the main image size, as such categories are mainly image-driven, while also making image browsing easier and faster via scroll and thumbnail interactions. Additionally, we redesigned the image preview modal on desktop to better cater to user needs.</p>
<h2 id="implementation-details">Implementation details</h2>
<p>Before we go any further, it’s worth explaining how the component works from a technical standpoint. Without getting into too much detail, here’s a quick overview of the scrollable gallery area implementation:</p>
<ul>
<li>The outer container layer, <code class="language-plaintext highlighter-rouge">.slides-container</code>, has a 3:4 aspect ratio which locks it into a fixed size. This is done to ensure fashion images which are always cropped to this ratio are displayed correctly.</li>
<li>The inner container layer, <code class="language-plaintext highlighter-rouge">.slides</code>, fits the outer container and uses <code class="language-plaintext highlighter-rouge">overflow-y: auto</code> to be vertically scrollable. It also uses <code class="language-plaintext highlighter-rouge">scroll-snap-type: y mandatory</code> to create a snapping behavior on scroll.</li>
<li>Inside the inner container, there are multiple <code class="language-plaintext highlighter-rouge">.slide</code> elements. Each one is sized to fill the area and has a <code class="language-plaintext highlighter-rouge">scroll-snap-align: start</code> property to ensure that it snaps to the top of the container.</li>
</ul>
<p>There are also various other implementation details that come into play, such as JavaScript event handling, updating component state, highlighting the current slide thumbnail and so on.</p>
<h2 id="the-problem">The problem</h2>
<p>After implementing and deploying the new design, we received internal reports about the gallery not responding correctly to certain user interactions. Specifically, some users reported that touchpad scrolling would lock the page to the gallery after reaching the end of the gallery slides. This would effectively prevent users from scrolling down the rest of the page until they scrolled up again. Here’s what this looked like in action:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/LdVKtYSoQP0" frameborder="0"> </iframe>
<h2 id="an-in-depth-investigation">An in-depth investigation</h2>
<p>Bug reports aren’t always clear or easy to reproduce. In this case, we had some difficulty tracking down the issue. We finally managed to pinpoint it to touchpads and, more specifically, to their inertial scrolling behavior. Due to the nature of this behavior, OS and browser made a huge difference in reproducing it. This only made it harder to track down and understand the inner workings of the problem. From what we know now, macOS touchpad inertia was the main culprit.</p>
<p>After realizing the behavioral cause, we had to understand the technical one, too. After some investigation, it seemed like <code class="language-plaintext highlighter-rouge">scroll-snap-type: y mandatory</code> was to blame. There are various conflicting reports of bugs with this property on macOS related to inertia on different browsers and OS versions. The bottom line is that the <code class="language-plaintext highlighter-rouge">mandatory</code> part can cause certain problems under the right circumstances.</p>
<p>Oddly enough, using plain <code class="language-plaintext highlighter-rouge">scroll-snap-type: y</code> worked correctly and didn’t cause any bugs, but the behavior wasn’t the desired one. As expected, the scroll position would only snap at certain parts of the image instead of always. At this point, we thought we could use the <code class="language-plaintext highlighter-rouge">:hover</code> pseudo-selector to make snapping mandatory only when the mouse was inside the gallery container. While this CSS-only approach made sense on paper, it started to cause some very unexpected issues.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/KbTDUovBcIE" frameborder="0"> </iframe>
<p>Clearly, this approach didn’t work as well as we’d hoped. However, it pushed us closer to a solution. After all, using <code class="language-plaintext highlighter-rouge">:hover</code> was a straightforward way to detect whether the user intended to scroll the gallery or the entire page. Thus, we could disable the vertical scroll (<code class="language-plaintext highlighter-rouge">overflow-y: hidden</code>) when the gallery wasn’t hovered. This was far more stable, but would cause gallery slides to get stuck halfway through being scrolled if the cursor exited the gallery container.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/q1mG_qVk1Mw" frameborder="0"> </iframe>
<p>The next step towards a solution was to add some JavaScript. A simple 300ms interval that checks whether the container is hovered and snaps the slide into position should solve the problem, we thought. And it worked, for the most part. However, the user experience didn’t feel great.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/a55TkXvK1uY" frameborder="0"> </iframe>
<p>There was a little bit of a visual stutter involved, which we weren’t pleased with. After all, great care was put into making the gallery scroll experience feel smooth and premium. So, we had to deal with this stutter by using some sort of transition.</p>
<p>Unfortunately, <code class="language-plaintext highlighter-rouge">overflow</code> is a discrete, non-animatable CSS property and, much like <code class="language-plaintext highlighter-rouge">display</code>, cannot be transitioned. The CSS engine has no clue what such a transition would look like. Fortunately, CSS animations can be leveraged for this kind of thing. By creating an animation with a <code class="language-plaintext highlighter-rouge">from { overflow: auto; }</code> keyframe, we can make it so that the stutter is less pronounced.</p>
<p>By now, the average reader wouldn’t expect this to work without a hitch. And, like clockwork, it did not. While the animation worked, it required about 600ms to feel smooth. This would lock the page scroll for a little too long and the user would feel like the page was unresponsive.</p>
<p>Luckily, the animation timing highlighted a potential solution. By slowing down the start of the animation and speeding it up towards the end, we could simulate an inertial snap. After some tinkering, we ended up with a <code class="language-plaintext highlighter-rouge">cubic-bezier(.35, -.7, 1, 1)</code> timing function.</p>
<p><img src="https://engineering.skroutz.gr/images/2022-gallery-snap-bug/gallery-bezier-curve.png" alt="Inertial snap animation timing function" /></p>
<p>This timing function enabled us to shorten the animation duration back to 300ms, matching the snap interval. This was the last piece in this puzzle. While the inertial snap isn’t perfect, it’s far less noticeable and the page doesn’t lock anymore when the user reaches the last slide.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/KdLg_G9Stxo" frameborder="0"> </iframe>
<p>Putting it all together, we had to make the following changes to the gallery component:</p>
<ol>
<li>Set an interval that runs every 300ms from the gallery component. Whenever it’s run it checks if the <code class="language-plaintext highlighter-rouge">.slides</code> element is still hovered. If it isn’t, it snaps it to the correct gallery slide.</li>
<li>Use the <code class="language-plaintext highlighter-rouge">:hover</code> CSS pseudo-selector to change <code class="language-plaintext highlighter-rouge">overflow-y</code> behavior in the <code class="language-plaintext highlighter-rouge">.slides</code> element, effectively preventing scroll events from occurring in the gallery when the mouse is not over it. This prevents the scroll from getting locked when the user reaches the end of the gallery with inertial scroll.</li>
<li>Define a CSS animation for the <code class="language-plaintext highlighter-rouge">overflow</code> property that animates the transition from hovered to not hovered on the <code class="language-plaintext highlighter-rouge">.slides</code> element. An appropriate timing function effectively produces an inertia-like transition while the JavaScript-based slide snapping happens.</li>
</ol>
<p>Here’s a <a href="https://codepen.io/chalarangelo/pen/ExQWqdR?editors=0110">CodePen with the final gallery implementation</a>. Note that internal implementation details have been omitted, as they’re unrelated to this example.</p>
<h2 id="impact-on-user-experience">Impact on user experience</h2>
<p>After fixing the bug, we took a look at the numbers to see the potential impact on user experience. On the surface, this was a localized issue that would only affect certain users under very specific conditions. As it turns out, that wasn’t exactly the case. While the affected sessions are only a small fraction of the total (roughly <strong>2%</strong>), about <strong>500,000 monthly Skroutz users</strong> are on the appropriate OS and browser combination to experience this bug. This means that, even though the percentage is small, the absolute number of users that could end up on an almost unusable page was still pretty high. This goes to show that even small, localized bugs can spiral into a lot of user frustration if left unaddressed.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1737820">CSS Scroll Snap momentum flaky on macOS Monterey</a></li>
<li><a href="https://stackoverflow.com/a/41221543/1650200">Can I apply a CSS transition to the overflow property?</a></li>
</ul>
<p><a href="https://engineering.skroutz.gr/blog/handling-inertial-scroll-in-combination-with-scroll-snapping/">Handling inertial scroll in combination with scroll snapping</a> was originally published by Angelos Chalaris at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on May 15, 2022.</p>https://engineering.skroutz.gr/blog/core-web-vitals-at-skroutz-gr2022-02-27T21:00:00+00:002022-02-27T21:00:00+00:00Skroutz Engineering Teamhttps://engineering.skroutz.gr<p>At Skroutz, we believe that for a modern web experience, it’s important to <strong>get fast and stay fast</strong>.</p>
<p>For this, speed has always been a critical component for our Engineering and SEO Teams and we were monitoring speed KPIs early on.</p>
<p>Image 1: SpeedIndex graphs for Skroutz.gr back in 2015.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-1.png" alt="SpeedIndex graphs for Skroutz.gr back in 2015" /></p>
<p>As <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> shifted from a price comparison site to a fully operational Marketplace, we made some serious changes to our core product. At the same time, our Engineering teams grew rapidly and on top of this, we architecturally moved our front-end stack toward heavier Javascript rendering, from a static to a more reactive fashion.</p>
<p>Occasionally, rendering performance would get worse, and until recently we ran ad-hoc sprints to improve Skroutz.gr’s speed (<a href="https://engineering.skroutz.gr/blog/speed-the-journey-to-delivering-a-faster-experience-at-skroutz-gr/" target="_blank">read a post about one such sprint here</a>).</p>
<p>Although we achieved better performance after each sprint -and hopefully a better user experience for our visitors-, we knew that this was not an ideal, sustainable process.</p>
<p>To solve this, we <strong>established an additional continuous monitoring and alerting</strong> system of <strong>Core Web Vitals</strong> using field data (real users) with a new set of tools and methodologies that we apply, in order to have these new metrics under our daily radars.</p>
<p>This continuous monitoring helps to not only be proactive from an SEO perspective, but also allows engineering teams to be in touch with rendering and speed issues and to organically <strong>establish a “fast speed mentality”</strong>.</p>
<p>In this article, we describe what we did, some real life cases we’ve dealt with, and some takeaways from our experience during the symbiosis with the Core Web Vitals real-time monitoring.</p>
<hr />
<h1 id="core-web-vitals-continuous-real-time-monitoring">Core Web Vitals Continuous Real-Time Monitoring</h1>
<h2 id="lab-data-is-not-enough">Lab data is not enough</h2>
<p>While lab tools are invaluable, the data they provide isn’t always predictive of how a website performs for real users.</p>
<p>For example, <a href="https://developers.google.com/web/tools/lighthouse" target="_blank">Lighthouse</a> runs tests with simulated throttling in a simulated desktop or mobile environment. While such simulations of slower network and device conditions often help surface user experience problems better than native network and device conditions, they’re just a single slice of the large variety in network conditions and device capabilities across a website’s entire user base [<a href="https://web.dev/vitals-tools/" target="_blank">web.dev/vitals-tools</a>].</p>
<p>On the other hand, there is the <a href="https://developers.google.com/web/tools/chrome-user-experience-report" target="_blank">Chrome User Experience Report</a> (CrUX), a BigQuery dataset of field data gathered from a segment of real Google Chrome users, which presents Core Web Vitals with sufficient traffic, but only at the origin level. CrUX is still useful since one could compare it with field or lab data to see how they align.</p>
<p><a href="https://support.google.com/webmasters/answer/9205520?hl=en" target="_blank">Search Console’s Core Web Vitals</a> section assesses groups of similar pages (for example, our Product pages) and also includes a Core Web Vitals report based on field data from CrUX, offering useful insights into how performance improvements impact entire sections of the site and different page templates.</p>
<p>All these tools are extremely <strong>useful, but they alert us about issues long after they have occurred</strong>, arguably a bit too late, as organic performance has already been affected at scale.</p>
<h2 id="how-we-measure-core-web-vitals">How we measure Core Web Vitals</h2>
<p>Since the Core Web Vitals metrics represent the user’s experience when interacting with a web page and they were confirmed ranking factors in Google Search as of May 2021 (along with mobile-friendliness, HTTPS-security, and intrusive interstitial guidelines), the importance of incorporating Web Vitals into our site hygiene monitoring practice was greater than ever.</p>
<p>We decided to collect field data from Skroutz.gr’s thousands of daily visitors in real time, process it and apply some alerting heuristics. We used the <a href="https://github.com/GoogleChrome/web-vitals" target="_blank">web-vitals library</a>, a tiny (~1K), modular library for measuring all the Web Vitals metrics on real users, in a way that accurately matches how they’re measured by Chrome and reported to other Google tools (e.g. Chrome User Experience Report, Page Speed Insights, Search Console’s Speed Report).</p>
<p>In mid-July 2021, we launched live monitoring for Core Web Vitals. Using this library, we essentially render the web-vitals JavaScript bundles and invoke the measurement functions for the 3 Core Web Vitals on Skroutz.gr.</p>
<p>We send a portion of the traffic (1% of random anonymized sessions, that is more than 100k pageviews &amp; data points daily) to <a href="https://grafana.com/" target="_blank">Grafana</a>, an open-source visualisation and analytics software providing tools to turn time-series data into graphs and visualisations.</p>
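<p>As a rough sketch of the client-side part of this pipeline (the function names, the sampling helper and the <code class="language-plaintext highlighter-rouge">/vitals</code> endpoint are illustrative assumptions, not our actual code), this could look like:</p>

```javascript
// Decide once per session whether it belongs to the ~1% sample.
// The second argument exists only to make the function deterministic in tests.
function inSample(rate, roll = Math.random()) {
  return roll < rate;
}

// Shape a web-vitals metric object into a flat data point that a
// time-series backend behind Grafana could ingest.
function toDataPoint(metric, pageType, deviceType) {
  return {
    name: metric.name,   // "LCP", "CLS" or "FID"
    value: metric.value,
    page: pageType,      // e.g. "plp" or "pdp"
    device: deviceType,  // "mobile" or "desktop"
    ts: Date.now(),
  };
}

// Browser-only wiring with the web-vitals library (v2-era API) might be:
//
//   import { getCLS, getFID, getLCP } from 'web-vitals';
//
//   if (inSample(0.01)) {
//     const report = (metric) => navigator.sendBeacon(
//       '/vitals', JSON.stringify(toDataPoint(metric, pageType, deviceType)));
//     getCLS(report); getFID(report); getLCP(report);
//   }
```

<code class="language-plaintext highlighter-rouge">navigator.sendBeacon</code> is a good fit here because metrics are often finalized as the page unloads.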
<p>We have created dedicated dashboards for our most important site sections and we furthermore distinguish them into mobile and desktop traffic. More specifically, we are monitoring and visualising the scores of the 3 Core Web Vital Metrics (LCP, CLS, FID) per page type (Product Listing Pages (PLPs) and Product Detail Pages (PDPs)) and device type (mobile, desktop).</p>
<p>Image 2: Core Web Vitals (LCP) Real-Time Continuous Monitoring dashboard for Skroutz.gr.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-2.png" alt="Core Web Vitals (LCP) Real-Time Continuous Monitoring dashboard for Skroutz.gr" /></p>
<h2 id="how-we-get-alerted-for-core-web-vitals-issues">How we get alerted for Core Web Vitals issues</h2>
<p>When a Core Web Vital metric falls outside the “Good Performance” range, an alert is fired in a dedicated channel on Slack, our main communication tool. This way we are informed instantly when one of the Web Vital metrics drops to the “Medium Performance - Needs Improvement” state, and we also learn the exact section of the site that was affected.</p>
<p>Image 3: Web Vitals alert notifications in Growth Team’s slack channel.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-3.png" alt="Web Vitals alert notifications in Growth Team’s slack channel" /></p>
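<p>The alert condition described above boils down to classifying each field value against the published Core Web Vitals thresholds. A minimal sketch (the thresholds are the official ones; the function names are our own illustration, not our production code):</p>

```javascript
// Official Core Web Vitals thresholds: at or below "good" is green,
// above "poor" is red, anything in between needs improvement.
const THRESHOLDS = {
  LCP: { good: 2500, poor: 4000 }, // milliseconds
  FID: { good: 100,  poor: 300 },  // milliseconds
  CLS: { good: 0.1,  poor: 0.25 }, // unitless layout-shift score
};

function classify(name, value) {
  const { good, poor } = THRESHOLDS[name];
  if (value <= good) return 'good';
  if (value <= poor) return 'needs-improvement';
  return 'poor';
}

// Fire an alert as soon as a metric leaves the "good" range.
function shouldAlert(name, value) {
  return classify(name, value) !== 'good';
}
```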
<p>Thus, we get alerted as soon as an issue appears, oftentimes even before Google is able to spot the affected area, and we can take immediate action to remedy the situation.</p>
<p>Image 4: CLS of Product Pages on Desktop exceeded the 0.10 threshold and an alert was fired.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-4.png" alt="CLS of Product Pages on Desktop exceeded the 0.10 threshold and an alert was fired" /></p>
<p>For each Web Vital metric we monitor two time series, one for the current time and one from a week earlier, making it easier to compare them and decide whether performance has significantly declined or not.</p>
<p>Image 5: Core Web Vitals (LCP) Real-Time Continuous Monitoring dashboard at Skroutz.gr.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-5.png" alt="Core Web Vitals (LCP) Real-Time Continuous Monitoring dashboard at Skroutz.gr" /></p>
<p>There is also a toggle option to see all the deployments. The exact time of each deployment, as well as other details linking to the GitHub page, are easily accessible. This can prove very useful when an alert pops up, as it can direct the team straight to the source of the issue.</p>
<p>Image 6: Deployments annotation in the Core Web Vitals dashboard.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-6.png" alt="Deployments annotation in the Core Web Vitals dashboard" /></p>
<p>With the help of all these advanced monitoring systems and procedures, we keep <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> fast and steady, we find and fix any rendering issues promptly, and we optimise user experience, which in turn leads to increased user engagement, more conversions, and -hopefully- higher user satisfaction.</p>
<p>Incorporating Core Web Vitals monitoring has led Skroutz.gr to an impressive <strong>98.5% of ~26 million pages seen as providing a “good page experience”</strong>!</p>
<p>Image 7: Page Experience Score of Skroutz.gr at Search Console.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-7.png" alt="Page Experience Score of Skroutz.gr at Search Console" /></p>
<p>Image 8: Core Web Vitals of Skroutz.gr at Search Console.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-8.png" alt="Core Web Vitals of Skroutz.gr at Search Console" /></p>
<hr />
<h1 id="examples-of-how-core-web-vitals-helped-us">Examples Of How Core Web Vitals Helped Us</h1>
<p>Let us show you 3 examples of how Core Web Vitals real-time monitoring has helped us resolve issues that we might not have detected otherwise.</p>
<h2 id="1-server-side-rendering-gone-wrong">1. Server-side rendering gone wrong</h2>
<p>The first example is from September 2021, when we saw an abnormal increase, almost 2x, in the pages’ rendering stability score, CLS (Cumulative Layout Shift), specifically on Product pages (PLPs) on both mobile and desktop.</p>
<p>This was very strange, because mobile and desktop views are styled by different code (the CSS applied to the DOM), so it is not really possible for independent styling changes to cause such (relatively) huge layout shifts on both simultaneously.</p>
<p>Up until then, we had only seen cases where a major page change caused significant layout shift in either the desktop or the mobile view (usually the desktop one, where the larger viewport leaves more room for the layout to shift).</p>
<p>Image 9: CLS for Product Pages almost doubled in September 2021.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-9.png" alt="CLS for Product Pages almost doubled in September 2021" /></p>
<p>We deep-dived, but we couldn’t find any reason for layout shifts caused by CSS changes - everything seemed okay.</p>
<p>However, a more careful examination showed that we had introduced a <strong>critical bug in the rendering process</strong>: we normally send a fully rendered page to the client from the server (server-side rendering) at the initial load; then the client’s JavaScript bundle takes over and manipulates the DOM depending on the user’s interactions. This approach was chosen as the more SEO-friendly one. What we saw, in this case, was that during a major refurb of the Product page, we had accidentally disabled server-side rendering and the page was being rendered in the browser.</p>
<p>Since our pages are often heavy and rich in content, browsers struggled to composite and paint, resulting in more layout shifts compared with server-side rendering.</p>
<p>Had we not caught this error early, our SEO and organic performance would probably have been severely impacted. Product prices, reviews, info, etc. change very frequently and, especially in the ecommerce industry, content freshness is very important.</p>
<h2 id="2-new-fashion-categories-layout-shifts">2. New fashion categories layout shifts</h2>
<p>The second incident began in December 2021, when a number of alerts started popping up regarding the CLS score of our Product Listing Pages (namely Categories) in desktop views. These alerts informed us of an increase of the CLS score up to 0.37, while a score of more than 0.25 is considered poor performance.</p>
<p>Image 10: CLS on Product Listing Pages exceeded alert thresholds.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-10.png" alt="CLS on Product Listing Pages exceeded alert thresholds" /></p>
<p>After examining the deployments that happened in that exact period, one stood out. All image-driven PLPs (mainly Fashion, see an example <a href="https://www.skroutz.gr/c/1009/andrika-mpoufan.html" target="_blank">here</a>) had been switched to a <strong>new layout</strong>, going from the usual 4-tile layout to a wider 3-tile one. The new layout didn’t render in a solid and stable way, so users saw content pushed further and further down while the page was loading.</p>
<p>Image 11: New Fashion layout at Skroutz.gr.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-11.png" alt="New Fashion layout at Skroutz.gr" /></p>
<p>Images in this layout have a fixed ratio, which is very helpful since we only set their width to fill their container and their height is auto-calculated. We already knew that we had one unknown variable, the image height. However, the width of the images was also unknown, since it depends on the viewport, the grid, the grid gaps and the resulting columns. This meant we had practically no control over either the width or the height of our images.</p>
<p>Setting a height or width on our images was therefore impossible, since we could not calculate either correctly. Using aspect-ratio was not a safe resort back then either, since it was a fairly new property.</p>
<p>So, we used an old CSS trick, originally meant for creating responsive squares, whose logic applies to rectangles as well: <strong>the % vertical padding of an element is always relative to its width</strong> and not its height, as one might expect. To avoid CLS issues while using fixed-ratio images, we reserve an empty area of fixed ratio, based on the available width, which the image then fills when it loads, without shifting the content of the whole page. Finally, we had to absolutely position the images and the gallery so that they land in the correct place.</p>
<p>We had a stabler layout.</p>
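<p>In CSS terms, the trick described above looks roughly like the following (class names and the 3:4 ratio are illustrative; our actual styles differ):</p>

```css
/* Reserve space before the image loads: percentage padding is computed
   from the element's WIDTH, so padding-top: 133.33% always yields a 3:4
   (width:height) box, whatever the grid column width turns out to be. */
.tile-media {
  position: relative;
  width: 100%;
  padding-top: 133.33%; /* height = width * 4 / 3 */
}

/* The image is absolutely positioned to fill the reserved area,
   so loading it never pushes the rest of the page down. */
.tile-media img {
  position: absolute;
  top: 0;
  left: 0;
  width: 100%;
  height: 100%;
  object-fit: cover;
}
```

On modern browsers the same effect can now be achieved with the aspect-ratio property, which, as noted above, was not yet a safe choice at the time.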
<h2 id="3-css-grid-module-issues">3. CSS Grid module issues</h2>
<p>The third example is again about CLS issues, yet again for Product listing pages in desktop view.</p>
<p>Product Listing Pages had a marginally good performance score (&lt;0.1) for a long time; however, this was okay for us.</p>
<p>Unfortunately, on January 10, a huge layout shift triggered alerts in our Slack channel. Something really bad had happened. The increase was observed only in desktop views, while at the same time the mobile view showed a small decrease.</p>
<p>Image 12: CLS for Listing Page Desktop almost tripled in January 2022.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-12.png" alt="CLS for Listing Page Desktop almost tripled in January 2022" /></p>
<p>When something like this happens, we usually search the latest deployments, where the bug is most likely to be found. In this case, however, we didn’t find anything that had changed on the Listing pages, front-end wise. Moreover, the increase started after working hours, in a strange and unusual way.</p>
<p>Image 13: CLS for Listing Pages Desktop didn’t seem to correlate with a deployment.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-13.png" alt="CLS for Listing Pages Desktop didn’t seem to correlate with a deployment" /></p>
<p>When we investigated carefully, we saw that this was a multi-factor event. First, Listing pages had not been optimal in terms of stability for a long time. Second, a Chrome update (97.0.4692) had rolled out at that time, and the new Chrome seemed to evaluate this suboptimal layout in a more rigorous manner.</p>
<p>Normally, the Product Listing page has a left sidebar with the filters and a right -main- section with all the products.</p>
<p>Image 14: Normal Listing Page rendering on Desktop.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-14.png" alt="Normal Listing Page rendering on Desktop" /></p>
<p>After we ran some tests we figured out that the layout shifts were caused by the main section of the page. What was happening?</p>
<p>Image 15: The main section contributed mostly to the problem.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-15.png" alt="The main section contributed mostly to the problem" /></p>
<p>Playing with network throttling and CPU slowdown, we caught the bug: on desktop, the order of the elements (main, sidebar) in the page source is reversed at the markup level, so we were using CSS Grid to reorder them. Until then, we had specified only the grid position of the sidebar (which comes after the main content in the DOM), while the position of the main section was left unspecified. Since in some specific cases the sidebar was delayed, the main content would take its place in the grid template.</p>
<p>Naturally, this caused a minor yet noticeable issue for the user, and subsequently hurt the Page Experience and the CLS score.</p>
<p>Image 16: A middle state of Listing Page rendering on Desktop: content is pushed to the left due to lack of content in the sidebar.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-16.png" alt="A middle state of Listing Page rendering on Desktop: content is pushed to the left due to lack of content in the sidebar" /></p>
<p>The fix to this issue proved to be a very quick tweak to our CSS. The main change was <strong>explicitly</strong> specifying the grid column in which the main section should sit.</p>
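<p>A sketch of the fix (selectors and column sizes are illustrative, not our actual stylesheet): pinning both children to explicit grid columns means a late-loading sidebar can no longer cause the main section to be auto-placed into the wrong column:</p>

```css
.listing {
  display: grid;
  grid-template-columns: 280px 1fr; /* sidebar | main */
}

/* Before: only the sidebar was placed explicitly, so while the sidebar
   content was delayed the main section auto-flowed into column 1
   and shifted right once the sidebar arrived. */
.listing .sidebar { grid-column: 1; }

/* The fix: also pin the main section to its column explicitly. */
.listing .main { grid-column: 2; }
```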
<p>After the fix, product listing pages improved and they are now much more stable than before.</p>
<p>Image 17: It is pretty amazing how 2 lines of CSS can make or break a page. Pay attention to your CSS grid module and make sure you specify all elements’ position to avoid any unexpected layout shifts.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-17.png" alt="It is pretty amazing how 2 lines of CSS can make or break a page. Pay attention to your CSS grid module and make sure you specify all elements’ position to avoid any unexpected layout shifts" /></p>
<p>We have also spotted changes in the other Core Web Vital metrics, Largest Contentful Paint (LCP) and First Input Delay (FID); however, the metric most sensitive to changes has so far proved to be Cumulative Layout Shift (CLS).</p>
<hr />
<h1 id="conclusion">Conclusion</h1>
<p>Having the ability to measure and report on real-world rendering performance is critical for diagnosing issues promptly and improving performance over time. Without field data, it’s impossible to know whether certain changes are actually producing the desired results.</p>
<p>Core Web Vitals helped Skroutz.gr provide a faster, stabler, and more responsive experience. Web Vitals real-time monitoring proved to be essential to delivering a great user experience, in terms of loading time, interactivity, and visual stability.</p>
<p>Image 18: Core Web Vitals Phone State for Skroutz.gr - January 2022.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-18.png" alt="Core Web Vitals Phone State for Skroutz.gr - January 2022" /></p>
<p>Core Web Vitals represent the best available signals we have today to measure the quality of experience across the web. However, these signals and the available free tools are far from perfect and we expect future improvements or additions. This fact creates a crucial need for an engineering team that caters for all aspects of performance, while a good relationship between SEO and engineering is invaluable for a successful site.</p>
<p>Speed, stability and responsiveness are foundational parts of a good user experience. Since we are committed to offering better user experiences, striving for great site performance is a never-ending journey.</p>
<p><strong>SEO Team</strong>.</p>
<p>💡 Feel free to connect and follow our fresh <a href="https://twitter.com/SkroutzSEO" target="_blank">Skroutz SEO Team Twitter account</a> for more SEO insights and news, or follow <a href="https://twitter.com/skroutzdevs" target="_blank">Skroutz Engineering at Twitter</a>.</p>
<hr />
<p>Hero image source: <a href="https://unsplash.com/photos/w7ZyuGYNpRQ" target="_blank">Unsplash</a>.</p>
<style type="text/css">
.entry-content p > img {
padding-top: 5px;
}
</style>
<p><a href="https://engineering.skroutz.gr/blog/core-web-vitals-at-skroutz-gr/">Core Web Vitals Real-time Monitoring at Skroutz.gr</a> was originally published by Skroutz Engineering Team at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on February 27, 2022.</p>https://engineering.skroutz.gr/blog/contributions_to_hotwire_upstream2021-11-01T22:00:00+00:002021-11-01T22:00:00+00:00John Kapantzakishttps://engineering.skroutz.gr<p>As we mentioned in a <a href="https://engineering.skroutz.gr/blog/using_hotwire_to_lazy_load_data/">previous post</a>, we have started to investigate Hotwire and its techniques, which claim to bring the speed of a single-page web application without writing any JavaScript. It seems that Hotwire, and especially Turbo, keeps its promise by providing useful tools that make your application more dynamic with almost no custom JavaScript.</p>
<p>From our experience with Turbo so far, we have found Turbo-Frames to be very handy and easy to use out of the box. But, as Hotwire is a relatively new tool, we often come across situations where something seems to be missing or doesn’t work as it is supposed to. Skroutz’s engineers always look for opportunities to contribute to open source projects, and this seemed like a perfect one, so we proceeded to open some pull requests against the Hotwire repos.</p>
<p>Now, let’s take a look at the pull requests that have already been merged and see what problem each one of them tries to solve.</p>
<h4 id="including-url-in-turbobefore-fetch-request-event">Including url in <code class="language-plaintext highlighter-rouge">turbo:before-fetch-request</code> event</h4>
<p><a href="https://github.com/hotwired/turbo/pull/289">Pull request #289</a> by <a href="https://github.com/ctrochalakis">Christos Trochalakis</a></p>
<p>Turbo fires the <code class="language-plaintext highlighter-rouge">turbo:before-fetch-request</code> event before it issues a network request. Let’s say that we have multiple Turbo-Frame elements in the page and each one of them uses a different endpoint to update its contents. Let’s also say that we have the following event listener attached to the document:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="nb">document</span><span class="p">.</span><span class="nf">addEventListener</span><span class="p">(</span><span class="dl">'</span><span class="s1">turbo:before-fetch-request</span><span class="dl">'</span><span class="p">,</span> <span class="nx">handleBeforeFetchRequest</span><span class="p">);</span></code></pre></figure>
<p>Before <a href="https://github.com/hotwired/turbo/pull/289">#289</a> got merged, we didn’t have a way to distinguish between those events; we just knew that some Turbo element had issued a network request. By making the <code class="language-plaintext highlighter-rouge">url</code> to which the network request is issued available on the respective event, we can add custom logic that handles the different urls.</p>
<p>For example, we can do this:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">const</span> <span class="nx">handleBeforeFetchRequest</span> <span class="o">=</span> <span class="p">({</span> <span class="na">detail</span><span class="p">:</span> <span class="p">{</span> <span class="nx">url</span> <span class="p">}</span> <span class="p">})</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">switch </span><span class="p">(</span><span class="nx">url</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// handle different urls</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<h4 id="adding-the-target-element-to-turbobefore-fetch-requestresponseevents">Adding the target element to <code class="language-plaintext highlighter-rouge">turbo:before-fetch-(request|response)</code> events</h4>
<p><a href="https://github.com/hotwired/turbo/pull/367">Pull request #367</a> by <a href="https://github.com/kapantzak">John Kapantzakis</a></p>
<p><a href="https://github.com/hotwired/turbo-site/pull/68">Docs update regarding #367</a></p>
<p>Similarly to <code class="language-plaintext highlighter-rouge">turbo:before-fetch-request</code>, <code class="language-plaintext highlighter-rouge">turbo:before-fetch-response</code> fires after the network request completes. Both events used to be fired on the document, so from an event listener attached to the document we couldn’t identify the element that caused the network request/response.</p>
<p>This PR adds the target element to the <code class="language-plaintext highlighter-rouge">turbo:before-fetch-request</code> and <code class="language-plaintext highlighter-rouge">turbo:before-fetch-response</code> events, so that we can listen for those events coming from specific elements, like this:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="nx">myTurboFrame</span><span class="p">.</span><span class="nf">addEventListener</span><span class="p">(</span><span class="dl">'</span><span class="s1">turbo:before-fetch-request</span><span class="dl">'</span><span class="p">,</span> <span class="nx">handleFetchRequest</span><span class="p">);</span></code></pre></figure>
<h4 id="introducing-turboframe-render-and-turboframe-load-events">Introducing <code class="language-plaintext highlighter-rouge">turbo:frame-render</code> and <code class="language-plaintext highlighter-rouge">turbo:frame-load</code> events</h4>
<p><a href="https://github.com/hotwired/turbo/pull/327">Pull request #327</a> by <a href="https://github.com/kapantzak">John Kapantzakis</a></p>
<p><em><code class="language-plaintext highlighter-rouge">turbo:frame-load</code> cherry-picked from</em> <a href="https://github.com/hotwired/turbo/pull/59">#59</a></p>
<p><a href="https://github.com/hotwired/turbo-site/pull/64">Docs update regarding #327</a></p>
<p>Lifecycle events were missing from Turbo-Frames until <code class="language-plaintext highlighter-rouge">turbo:frame-render</code> and <code class="language-plaintext highlighter-rouge">turbo:frame-load</code> were introduced, and gave us the opportunity to hook various handlers on those events.</p>
<p>These get fired as soon as the Turbo-Frame element has rendered its contents and when it has finished loading, respectively. Furthermore, these events get fired on the respective Turbo-Frame element, rather than on the document, making it easier to target specific elements.</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="nx">myTurboFrame</span><span class="p">.</span><span class="nf">addEventListener</span><span class="p">(</span><span class="dl">'</span><span class="s1">turbo:frame-render</span><span class="dl">'</span><span class="p">,</span> <span class="nx">handleMyTurboFrameRender</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">handleMyTurboFrameRender</span> <span class="o">=</span> <span class="p">({</span> <span class="nx">target</span> <span class="p">})</span> <span class="o">=></span> <span class="p">{</span>
<span class="nx">target</span><span class="p">.</span><span class="nf">querySelectorAll</span><span class="p">(</span><span class="dl">'</span><span class="s1">.elements-inside-frame</span><span class="dl">'</span><span class="p">).</span><span class="nf">forEach</span><span class="p">((</span><span class="nx">elem</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span> <span class="p">...</span> <span class="p">})</span>
<span class="p">}</span></code></pre></figure>
<h4 id="introducing-test-runner-options">Introducing test runner options</h4>
<p><a href="https://github.com/hotwired/turbo/pull/311">Pull request #311</a> by <a href="https://github.com/ctrochalakis">Christos Trochalakis</a></p>
<p>This PR doesn’t directly affect the tools that Turbo provides, but it makes the development of Turbo’s features a lot easier by adding some options to the testing process. Specifically, it adds the <code class="language-plaintext highlighter-rouge">--grep</code> and <code class="language-plaintext highlighter-rouge">--environment</code> options.</p>
<p>You can use the <code class="language-plaintext highlighter-rouge">--grep</code> option when you want to target a specific test case.</p>
<figure class="highlight"><pre><code class="language-terminal" data-lang="terminal"><span class="gp">$</span><span class="w"> </span>yarn <span class="nb">test</span> <span class="nt">--grep</span> <span class="s1">'triggers before-render and render events'</span></code></pre></figure>
<p>You can use the <code class="language-plaintext highlighter-rouge">--environment</code> option when you want to set the environment on which you want to perform the tests.</p>
<figure class="highlight"><pre><code class="language-terminal" data-lang="terminal"><span class="gp">$</span><span class="w"> </span>yarn <span class="nb">test</span> <span class="nt">--environment</span> <span class="s1">'Firefox'</span></code></pre></figure>
<h4 id="avoiding-race-condition-between-visit-tests">Avoiding race condition between visit tests</h4>
<p><a href="https://github.com/hotwired/turbo/pull/310">Pull request #310</a> by <a href="https://github.com/ctrochalakis">Christos Trochalakis</a></p>
<p>This is another PR that improves the development experience of Turbo features, by fixing a race condition that occurred when the page location was changed asynchronously and an event log array was getting out of sync. You can inspect the PR for more details on the relevant changes.</p>
<h1 id="summary">Summary</h1>
<p>Summing it up, here’s a list of the commits sent upstream so far:</p>
<ul>
<li><a href="https://github.com/hotwired/turbo/commit/4d42a38658d892e5617144362a4a96863c6c860e">Introduce test runner options</a></li>
<li><a href="https://github.com/hotwired/turbo/commit/3b70866f1a8f92c313a90aba305fb208428d175d">Include url in turbo:before-fetch-request event</a></li>
<li><a href="https://github.com/hotwired/turbo/commit/9dfca8ffa0e8f7ef613c02db03e5a4a93630c484">Avoid race between visit tests</a></li>
<li><a href="https://github.com/hotwired/turbo/commit/c9c1c11610f12442a6342200e396c12a30ed957d">Add the target element to turbo:before-fetch-request and turbo:before-fetch-response events</a></li>
<li><a href="https://github.com/hotwired/turbo/commit/84b0a89902d48ac455b08e70a975abad3e1b14b9">Fire turbo:frame-render event after turbo frame renders the view</a></li>
</ul>
<p><a href="https://engineering.skroutz.gr/blog/contributions_to_hotwire_upstream/">Skroutz contributes to Hotwire's upstream</a> was originally published by John Kapantzakis at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on November 01, 2021.</p>https://engineering.skroutz.gr/blog/monolith-diaries-upgrading-rails2021-10-22T13:30:00+00:002021-10-22T13:30:00+00:00Lazarus Lazaridishttps://engineering.skroutz.gr<p>We recently upgraded our monolith application from Rails 6.0 to Rails 6.1.
Drawing on our prior experience with Rails upgrades, we have streamlined the process and want to share it with you.</p>
<p>In this post we are going to give some insights on our workflow, from organizing such a milestone to actually delivering it without blocking an engineering team of more than 160 developers, building an application that peaks at more than 100k requests per minute.</p>
<h2 id="introduction">Introduction</h2>
<p>The core application of Skroutz is a large Rails monolith heavily utilizing MariaDB, MongoDB, Elasticsearch, Kafka, Redis and Memcached.
We also use Jenkins for our CI and various tools like Sentry, NewRelic and Grafana for monitoring.</p>
<p>Even though we were upgrading to a minor version, Rails 6.1 introduced a <a href="https://guides.rubyonrails.org/6_1_release_notes.html">notable amount of changes</a> affecting many parts of our codebase and the aforementioned components.</p>
<p>We will describe the process we followed including some key points that allowed us to have a smooth release (such as our <a href="#deprecations">deprecation handling mechanism</a>, how we approached <a href="#working-for-the-upgrade">backportable and non-backportable changes</a>, <a href="#canary-release">canary deployment</a> and more).</p>
<h2 id="organizing-the-upgrade">Organizing the upgrade</h2>
<p>Spending time and resources to properly organize such a milestone is crucial for a successful delivery so we started with brainstorming and discussions on the following three questions: <strong>who</strong>, <strong>how</strong> and <strong>when</strong>.</p>
<h3 id="who">Who</h3>
<p>We have a core team named Kernel that, among other things, is responsible for keeping the application healthy, modern and productive.</p>
<p>Although the whole upgrade was driven by this team, all of Skroutz’s teams were involved in much of the work to be done. Why?</p>
<ul>
<li>
<p><strong>Share the knowledge</strong></p>
<p>With every upgrade, new things become available, some things start working in a different way than before and some others are no longer there.</p>
<p>Having engineers work directly on these changes familiarizes them with the new version much more effectively than just reading its changelog. Additionally, the knowledge they gain is communicated much more easily and directly to the other members of their team.</p>
</li>
<li>
<p><strong>Cross team work is beneficial in many ways</strong></p>
<p>This is a very good opportunity for engineers to</p>
<ul>
<li>familiarize themselves with sections of the codebase that don’t belong to their domain</li>
<li>meet and work with engineers outside of their team</li>
<li>exchange knowledge, share tips, hacks and cat photos :P</li>
</ul>
</li>
<li>
<p><strong>Speed up the process</strong></p>
<p>It’s much easier and more productive to investigate problems and make changes in specific code sections when the work is done by the team that owns them.</p>
</li>
</ul>
<hr />
<p>At Skroutz we have organized the engineering team under product groups with each group consisting of a handful of teams.</p>
<p>For the upgrade process, each product group assigned the role of <strong>Contact Person</strong> to one of its members, with the following responsibilities:</p>
<ul>
<li>
<p><strong>Single point of reference</strong></p>
<p>Address any requests for help or information coming from the Core team.</p>
</li>
<li>
<p><strong>Delegation</strong></p>
<p>Work directly to address a group’s issue regarding the upgrade or pass it on to the proper member of the group.</p>
</li>
<li>
<p><strong>Sync</strong></p>
<p>Stay up to date with the status of the upgrade, communicate developments affecting the group’s pipeline, raise the flag and request help in case of delays or blocking items.</p>
</li>
</ul>
<h3 id="how">How</h3>
<p>For a milestone of this size, effective communication and task breakdown is critical.</p>
<h4 id="tracking">Tracking</h4>
<p>Since upgrading Rails is a recurring task, we use a dedicated project in our tracking system and create a milestone for each individual upgrade.</p>
<p>The workboard contains columns categorizing the tasks based on their nature, so we can easily have a good overview of the state of the upgrade process, what’s left to be done, what’s blocked etc.</p>
<figure>
<a href="../../../images/2021-upgrading-rails/phabricator.png" class="image-popup">
<img src="../../../images/2021-upgrading-rails/phabricator.png" alt="Workboard" />
</a>
</figure>
<p>The nature of the tasks varies for each application but the following categories should be pretty common for everyone.</p>
<ul>
<li>
<p><strong>Preparations</strong></p>
<p>Tasks for preparing the upgrade process before the actual work starts - find more in the <a href="#preparation">Preparation</a> section below</p>
</li>
<li>
<p><strong>Investigations</strong></p>
<p>Tasks for items that need investigation - for example, checking whether a specific gem has a version compatible with the target Rails version, or whether the CI needs modifications to play well with the new version</p>
</li>
<li>
<p><strong>Deprecations</strong></p>
<p>Tasks for complying with suggestions deriving from Rails active support deprecations for the target version - find more in the <a href="#deprecations">Deprecations</a> section below</p>
</li>
<li>
<p><strong>Gem updates</strong></p>
<p>Tasks for updating internal or external gems to their new Rails compatible version</p>
</li>
<li>
<p><strong>Changes & Fixes</strong></p>
<p>This category contains all the tasks that actually make the codebase compatible with the new Rails version. Most commonly, these tasks involve fixing bugs due to changes that were not resolved by the deprecations or modifying code to use a newly introduced Rails feature.</p>
</li>
<li>
<p><strong>Pre-release tasks</strong></p>
<p>Tasks for actions that need to be done after everything seems to be in place and before the actual release (such as running smoke tests, creating the deploy plan etc.)</p>
</li>
<li>
<p><strong>Post-release</strong></p>
<p>Tasks for actions that need to be taken after the new version is released - these could be cleanups, performance monitoring etc.</p>
</li>
</ul>
<h4 id="communication">Communication</h4>
<p>We created a Slack channel joined by the Core Team, the Group Contact Persons and any other engineer interested in the upgrade and we set up our tracking system to publish notifications of the Rails upgrade milestone to it.</p>
<p>Having a dedicated place for communication had many benefits:</p>
<ul>
<li>Anything related to the upgrade was shared in the channel - the information was not scattered across emails, private conversations or other communication channels. We didn’t have to remember what was discussed and where; everything was available and discoverable in a single place, and we could revisit the channel at any time in the future and find what we were looking for.</li>
<li>Every member was constantly in sync with the upgrade developments - any accomplishments, resolutions, blocking factors or discussions were communicated to the channel - even if someone got involved at a later phase of the milestone, the information was there.</li>
<li>Something that possibly affected a specific group’s code area was visible to any member of the channel - everyone could contribute and familiarize themselves with almost all introduced changes of the upgrade.</li>
</ul>
<hr />
<p>Given the above, the “How” could be summed up to:</p>
<ul>
<li>The upgrade has to be <strong>well broken down</strong> in tasks on the milestone <strong>workboard</strong> in the tracking system - when everything is resolved, we’re ready for the release.</li>
<li>Whatever we need - <strong>help, raise a flag, share a finding - use the Slack channel</strong> and let the discussion begin.</li>
</ul>
<h3 id="when">When</h3>
<p>Even though planning a Rails upgrade is hard and can easily go off track, there are one or two things that can help us accomplish it in a safer manner.</p>
<h4 id="cross-team-work">Cross team work</h4>
<p>The upgrade should be a cross-team effort.</p>
<p>Our Core team already had this task in its pipeline, but involving other teams, each with its own planning, at the last minute would not work.
To avoid this, we had to evaluate the required effort and how it was distributed across the other teams’ components <strong>early in the process</strong> - three months before the date we wanted the release to take place.</p>
<h4 id="take-baby-steps---dont-jump-at-once">Take baby steps - don’t jump at once</h4>
<p>Having a well-tested application with a green CI build doesn’t mean that everything will be fine once we go live. There are many things that could go wrong - from degraded performance to bugs showing up only in production - and the sooner we learn about them, the better.</p>
<p>Upgrading an application to a newer Rails version usually means:</p>
<ul>
<li>updating the gems to a compatible version</li>
<li>modifying the codebase to conform to the new conventions</li>
<li>replacing previously deprecated mechanisms with the suggested ones (for example, the <code class="language-plaintext highlighter-rouge">dalli_store</code>, which doesn’t <a href="https://github.com/petergoldstein/dalli/issues/771">play well</a> with Rails 6.1, can be replaced by the <code class="language-plaintext highlighter-rouge">mem_cache_store</code> implementation, as suggested by both Rails and the Dalli gem)</li>
</ul>
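<p>The cache-store switch mentioned above, for instance, is a one-line configuration change. A sketch of what it could look like in <code class="language-plaintext highlighter-rouge">config/environments/production.rb</code> (the hostname below is a placeholder, not our actual setup):</p>

```ruby
# Before: the dalli_store adapter, which doesn't play well with Rails 6.1
# config.cache_store = :dalli_store, 'cache-1.internal:11211'

# After: the mem_cache_store implementation suggested by both Rails and Dalli
config.cache_store = :mem_cache_store, 'cache-1.internal:11211'
```

<p>Since <code class="language-plaintext highlighter-rouge">mem_cache_store</code> also works on the current Rails version, a change like this can ship well before the upgrade itself.</p>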
<p>Instead of packing all of the above in a single deployment, we isolated any backwards compatible changes and shipped them as soon as possible in the current Rails version.</p>
<h2 id="preparation">Preparation</h2>
<p>We have decided how to organize the upgrade. Time to start preparing for it - we couldn’t just shout on Slack, “Hey everybody, start upgrading the application”.</p>
<p>As previously mentioned, we wanted to measure the effort and break it down efficiently in tasks.
How do we do this though?</p>
<h3 id="changelogs">Changelogs</h3>
<p>Obviously, the first step was to read the changelogs to get an idea of what is changing in the new version.
Besides learning about new features that our application could use, this step is also crucial for understanding and resolving more easily any failures that show up later in the upgrade process.</p>
<p>But there will be a lot of changelog entries for which it’s not obvious how they affect our application.</p>
<p>This one for example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Fix complicated has_many :through with nested where condition
</code></pre></div></div>
<p>Does this mean that we are already affected by this bug in our current version? If yes, are we already using a workaround?</p>
<p>Since we are talking about a monolith built by a multi-member engineering team, no single person can know every bit of it. They can’t answer the questions above unless they actually coded something that revealed this specific bug. But even in that case, what about the rest of the changelog entries?</p>
<p>So, after this step, what will help us get a better idea of what is going on is to take a look at our CI. How many failures do we have in the new version?
But there’s a prerequisite for that step. Updating our gems…</p>
<p>At this point, we should create a branch (we named ours <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code>) in which we will start adding the commits that will be merged to our <code class="language-plaintext highlighter-rouge">main</code> branch when we will be ready to ship the upgrade.</p>
<h3 id="gems-update">Gems update</h3>
<p>It would be great if we could just change the Rails gem version in the Gemfile, run <code class="language-plaintext highlighter-rouge">bundle</code> and get the green message.</p>
<p>But that’s pretty uncommon. A monolith usually comes with a Gemfile full of dependencies and it’s almost certain that you’ll have to upgrade some or many of them to a version compatible with the target Rails version.</p>
<p>So, after changing the Rails version in the Gemfile, we run <code class="language-plaintext highlighter-rouge">bundle update rails</code> and start resolving any failures that arise due to other gem incompatibilities.</p>
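<p>As a sketch, the only change we make by hand is the Rails constraint in the Gemfile; <code class="language-plaintext highlighter-rouge">bundle update rails</code> then surfaces every dependency that cannot be resolved against it:</p>

```ruby
# Gemfile (excerpt) - bump only the rails constraint and let Bundler
# report any gems that are incompatible with the target version.
gem 'rails', '~> 6.1.0'
```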
<p>This usually means that we have to</p>
<ul>
<li>visit the gem’s homepage to locate the appropriate version</li>
<li>read the changelogs and check if the changes affect the gem’s usages in our codebase</li>
</ul>
<p>We use <a href="https://github.com/thoughtbot/appraisal"><code class="language-plaintext highlighter-rouge">Appraisal</code></a> in all of our internal gems, so testing their compatibility with the new Rails version was as simple as creating a new appraisal definition and making sure that the tests were green. In most cases, all we had to do was extend their Rails dependency to include the new version.</p>
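<p>A minimal sketch of what such an <code class="language-plaintext highlighter-rouge">Appraisals</code> file could look like for a hypothetical internal gem (one definition per supported Rails version):</p>

```ruby
# Appraisals file (hypothetical internal gem): `appraisal install`
# generates a dedicated gemfile for each definition below.
appraise 'rails-6.0' do
  gem 'rails', '~> 6.0.0'
end

appraise 'rails-6.1' do
  gem 'rails', '~> 6.1.0'
end
```

<p>Running the suite against the new definition (for example, <code class="language-plaintext highlighter-rouge">appraisal rails-6.1 rspec</code>) verifies the compatibility.</p>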
<p>A very good practice here is to check whether the new gem version is also compatible with the current Rails version. <strong>If it is, then this version bump should be brought to the <code class="language-plaintext highlighter-rouge">main</code> branch and deployed early</strong>. This allows us to identify and deal with gem issues incrementally, gem by gem, instead of dealing with all of them upon the upgrade release. So, instead of pushing the gem version bumps to the <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> branch, we push them to the <code class="language-plaintext highlighter-rouge">main</code> branch and ship them one by one, or as we see fit.</p>
<p>Ideally, from this step, the <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> branch should contain only the commit that bumps the <code class="language-plaintext highlighter-rouge">rails</code> gem version to the target one.</p>
<h3 id="rake-appupdate"><code class="language-plaintext highlighter-rouge">rake app:update</code></h3>
<p>So, we have a branch whose <code class="language-plaintext highlighter-rouge">rails</code> version is the target one and we can <code class="language-plaintext highlighter-rouge">bundle</code> successfully.</p>
<p>At this point, we need to execute the <code class="language-plaintext highlighter-rouge">rake app:update</code> task as noted <a href="https://guides.rubyonrails.org/upgrading_ruby_on_rails.html#the-update-task">here</a> and
also <a href="https://guides.rubyonrails.org/upgrading_ruby_on_rails.html#configure-framework-defaults">configure the framework defaults</a>.</p>
<blockquote>
<p>The new Rails version might have different configuration defaults than the previous version. However, after following the steps described above, your application would still run with configuration defaults from the previous Rails version. That’s because the value for config.load_defaults in config/application.rb has not been changed yet.</p>
</blockquote>
<p>We follow the interactive session and proceed based on our application’s setup. At the end, we should carefully review the changes, especially those related to the new version’s defaults, and commit them to the <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> branch.</p>
<h3 id="test-suite">Test Suite</h3>
<p>We have successfully bundled, we adapted to the new version’s configuration and we want to run the test suite to see what’s going on.</p>
<p>Extensively testing our application makes milestones like the Rails upgrade much safer and gives us more confidence that everything will be fine.</p>
<p>In our application, we have ~75k RSpec examples, and we have set up our CI to distribute them across a group of servers, decreasing the duration from hours for a sequential run to just 15 minutes.</p>
<p>Our first execution finished with more than 1.5k failures. Even though this seemed kind of disappointing, we already knew the root cause, along with the fix, for the majority of them. Deprecations :)</p>
<h3 id="deprecations">Deprecations</h3>
<p>Rails comes with a deprecation API, <code class="language-plaintext highlighter-rouge">ActiveSupport::Deprecation</code>, and every framework component, like ActiveRecord, uses it to warn about usages that are deprecated and subject to removal, replacement or change in an upcoming release (in most cases the warnings include a suggestion on how to deal with them).</p>
<p>At Skroutz, we have set up this deprecation mechanism to work along with Rails’ instrumentation API, <a href="https://api.rubyonrails.org/classes/ActiveSupport/Notifications.html"><code class="language-plaintext highlighter-rouge">ActiveSupport::Notifications</code></a>.</p>
<p>Instead of raising an error or just logging a deprecation, we configured all of our environments to <code class="language-plaintext highlighter-rouge">notify</code> in case of a deprecation</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">config</span><span class="p">.</span><span class="nf">active_support</span><span class="p">.</span><span class="nf">deprecation</span> <span class="o">=</span> <span class="ss">:notify</span>
</code></pre></div></div>
<p>and in an initializer we subscribed to the related event in order to implement our deprecation handling.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">ActiveSupport</span><span class="o">::</span><span class="no">Notifications</span><span class="p">.</span><span class="nf">subscribe</span><span class="p">(</span><span class="s1">'deprecation.rails'</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">payload</span><span class="o">|</span>
<span class="c1"># Deprecation handling goes here</span>
<span class="k">end</span>
</code></pre></div></div>
<p>We define an allowed list of deprecation messages - deprecations that we don’t have to deal with at the moment and should be ignored.</p>
<p>The following table shows how our handling works:</p>
<table>
<thead>
<tr>
<th>Environment</th>
<th>Allowed deprecation</th>
<th>Action</th>
</tr>
</thead>
<tbody>
<tr>
<td>Production</td>
<td>Yes</td>
<td>Nothing</td>
</tr>
<tr>
<td>Production</td>
<td>No</td>
<td>Send event to Sentry</td>
</tr>
<tr>
<td>All other environments</td>
<td>Yes</td>
<td>Log the deprecation</td>
</tr>
<tr>
<td>All other environments</td>
<td>No</td>
<td>Raise it as an error</td>
</tr>
</tbody>
</table>
<p>With some simplifications, the code would look like this:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DeprecationHandler</span>
<span class="no">ALLOWED_LIST</span> <span class="o">=</span> <span class="p">[</span>
<span class="sr">/You should not do this/</span>
<span class="p">]</span>
<span class="k">def</span> <span class="nc">self</span><span class="o">.</span><span class="nf">handle</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span>
<span class="n">allowed</span> <span class="o">=</span> <span class="no">ALLOWED_LIST</span><span class="p">.</span><span class="nf">any?</span> <span class="p">{</span> <span class="o">|</span><span class="n">pattern</span><span class="o">|</span> <span class="n">pattern</span><span class="p">.</span><span class="nf">match?</span><span class="p">(</span><span class="n">payload</span><span class="p">[</span><span class="ss">:message</span><span class="p">])</span> <span class="p">}</span>
<span class="k">if</span> <span class="no">Rails</span><span class="p">.</span><span class="nf">env</span><span class="p">.</span><span class="nf">production?</span>
<span class="k">return</span> <span class="k">if</span> <span class="n">allowed</span>
<span class="n">report_to_sentry</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span>
<span class="k">else</span>
<span class="k">if</span> <span class="n">allowed</span>
<span class="no">Rails</span><span class="p">.</span><span class="nf">logger</span><span class="p">.</span><span class="nf">tagged</span><span class="p">(</span><span class="s1">'active_support'</span><span class="p">,</span> <span class="s1">'deprecation'</span><span class="p">)</span> <span class="k">do</span>
<span class="no">Rails</span><span class="p">.</span><span class="nf">logger</span><span class="p">.</span><span class="nf">warn</span><span class="p">(</span><span class="n">payload</span><span class="p">[</span><span class="ss">:message</span><span class="p">])</span>
<span class="k">end</span>
<span class="k">else</span>
<span class="k">raise</span> <span class="no">ActiveSupport</span><span class="o">::</span><span class="no">DeprecationException</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">payload</span><span class="p">[</span><span class="ss">:message</span><span class="p">])</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="no">ActiveSupport</span><span class="o">::</span><span class="no">Notifications</span><span class="p">.</span><span class="nf">subscribe</span><span class="p">(</span><span class="s1">'deprecation.rails'</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">payload</span><span class="o">|</span>
<span class="no">DeprecationHandler</span><span class="p">.</span><span class="nf">handle</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span>
<span class="k">end</span>
</code></pre></div></div>
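<p>The allowed-list check at the heart of the handler is just a regexp scan over the deprecation message. A standalone demo, with a toy pattern:</p>

```ruby
# Demo of the allowed-list matching used by the DeprecationHandler above.
ALLOWED_LIST = [
  /update_attributes(!)? is deprecated/
].freeze

def allowed?(message)
  ALLOWED_LIST.any? { |pattern| pattern.match?(message) }
end

puts allowed?('update_attributes is deprecated and will be removed from Rails 6.1')
# => true
puts allowed?('Some brand new deprecation we have not triaged yet')
# => false
```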
<hr />
<p>Given the above, most of the messages in our allowed list before we started the upgrade were deprecations generated in Rails 6.0 for behaviour that would change in our target version, 6.1.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">...</span>
<span class="sr">/Initialization autoloaded the constants/</span><span class="p">,</span>
<span class="sr">/Class level methods will no longer inherit scoping from `/</span><span class="p">,</span>
<span class="sr">/update_attributes(!)? is deprecated and will be removed from Rails 6.1 \(please, use update(!)? instead\)/</span><span class="p">,</span>
<span class="sr">/ActionMailer::Base\.receive is deprecated and will be removed in Rails 6\.1\. Use Action Mailbox to process inbound email\./</span><span class="p">,</span>
<span class="sr">/ActionView::Base instances should be constructed with a lookup context, assignments, and a controller/</span><span class="p">,</span>
<span class="sr">/ActionView::Base instances must implement `compiled_method_container` or use the class method `with_empty_template_cache` for constructing an ActionView::Base instance that has an empty cache/</span><span class="p">,</span>
<span class="sr">/Rails 6\.1 will return Content-Type header without modification/</span><span class="p">,</span>
<span class="sr">/render file: should be given the absolute path to a file/</span><span class="p">,</span>
<span class="sr">/NOT conditions will no longer behave as NOR/</span><span class="p">,</span>
<span class="o">...</span>
</code></pre></div></div>
<p>So, before starting to check each one of the 1.5k failing specs mentioned in the previous section, we first worked on dealing with these deprecations. How?</p>
<p>For <strong>each deprecation</strong>:</p>
<ol>
<li>we created a branch from our <code class="language-plaintext highlighter-rouge">main</code> branch</li>
<li>we removed the deprecation from the allowed list</li>
<li>we ran the test suite on the branch and we located the parts that were generating the deprecations - remember that our deprecation handling raises errors for non-allowed messages</li>
<li>engineers from each group prepared commits to the branch fixing the deprecations relevant to their team</li>
<li>when the suite got green, we shipped it in production, and</li>
<li>we checked our production monitoring system for deprecation events a.k.a. deprecations that occurred from code that was not fully tested</li>
</ol>
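<p>Most of these fixes are mechanical. For the <code class="language-plaintext highlighter-rouge">update_attributes</code> entry from the allowed list, for example, the fix is a straight rename to <code class="language-plaintext highlighter-rouge">update</code>. A self-contained illustration with a hypothetical stand-in model (plain Ruby, not ActiveRecord):</p>

```ruby
# Stand-in model illustrating the rename; in a real ActiveRecord model,
# `update_attributes` warns in Rails 6.0 and is removed in 6.1.
class User
  attr_accessor :name

  def update(attrs)
    attrs.each { |key, value| public_send("#{key}=", value) }
    true
  end
end

user = User.new
# was: user.update_attributes(name: 'Alice')
user.update(name: 'Alice')
puts user.name
# => Alice
```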
<p>Note here that the changes were <strong>backwards compatible</strong> - fixes were merged in the <code class="language-plaintext highlighter-rouge">main</code> branch and not deferred to <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> for the final upgrade.</p>
<h2 id="working-for-the-upgrade">Working for the upgrade</h2>
<p>After fixing all the deprecations for Rails 6.1, the test suite ended up failing with only 50 or so errors. Good news, right?</p>
<p>Well, this is the trickiest part of the upgrade process. For each failure we have to investigate and try to find out which changelog entry caused it, in order to get a good understanding of what changed and how to fix it.</p>
<p>As previously noted, we can’t know exactly how a changelog entry actually affects the codebase and in many cases we will have to check the Rails PRs that have been merged to the new version in order to gather more information.</p>
<p>Also, note that some failures might actually happen due to a framework’s bug introduced in the new version, such as <a href="https://github.com/rails/rails/issues/42525">this one</a> that we located in one of our specs and for which we <a href="https://github.com/rails/rails/pull/43100">opened a Rails PR upstream</a>.</p>
<hr />
<p>For each of the failing specs in our suite, we created a task in the tracking system and we assigned it to the proper contact person to either work on it or delegate it to one or more team members.</p>
<p>Normally, any work that has to be done from now on would be committed to the <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> branch.
The whole process might take weeks or even months to complete, and this branch should be rebased onto <code class="language-plaintext highlighter-rouge">main</code> on a weekly basis, if not more frequently.</p>
<p>There are some things we can do, though, to reduce the conflict-resolution effort of those rebases.</p>
<h2 id="backportable-changes">Backportable changes</h2>
<p>There will be changes that work in both the current and the target Rails version - these could, and should, be committed directly to the <code class="language-plaintext highlighter-rouge">main</code> branch.</p>
<p>For example, in one of our specs we made use of the <code class="language-plaintext highlighter-rouge">last_migration</code> method of <code class="language-plaintext highlighter-rouge">ActiveRecord::MigrationContext</code>, which was <a href="https://github.com/rails/rails/commit/4705ba82dbf303b5eb84c46d1c7112a75d3273e5#diff-c7b2018646f254d00541db2d6cdb3b02b64ac8cc7a7dc2fb0f1b67e9c8cb7ff8L1101-L1103">removed in Rails 6.1</a>, so we now had to calculate it ourselves. Since the calculation also works with the Rails version of our <code class="language-plaintext highlighter-rouge">main</code> branch, we pushed the fix there instead of to the <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> branch.</p>
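<p>One way to compute the last migration is simply to pick the one with the highest version number. A sketch with stand-in structs instead of real migration objects (the names below are invented for illustration):</p>

```ruby
# Stand-ins for ActiveRecord migrations: computing the "last" migration
# ourselves, now that MigrationContext#last_migration is gone.
Migration = Struct.new(:version, :name)

migrations = [
  Migration.new(2021_01_01, 'create_users'),
  Migration.new(2021_03_01, 'add_index_to_orders'),
  Migration.new(2021_02_01, 'create_orders')
]

last_migration = migrations.max_by(&:version)
puts last_migration.name
# => add_index_to_orders
```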
<h2 id="non-backportable-changes">Non-backportable changes</h2>
<p>For the rest of them, if a change is relatively small and contained, we can use a condition and alter the implementation based on it.</p>
<p>In a base module of the application we added the following helper methods:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">module</span> <span class="nn">Skroutz</span>
<span class="k">def</span> <span class="nc">self</span><span class="o">.</span><span class="nf">rails_next_version</span>
<span class="no">Gem</span><span class="o">::</span><span class="no">Version</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="s1">'7.0'</span><span class="p">)</span> <span class="c1"># Your target version here</span>
<span class="k">end</span>
<span class="k">def</span> <span class="nc">self</span><span class="o">.</span><span class="nf">rails_next?</span>
<span class="no">Rails</span><span class="p">.</span><span class="nf">gem_version</span> <span class="o">>=</span> <span class="n">rails_next_version</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Then, when introducing a small change like the following, we use the helper above to differentiate the behaviour.</p>
<p>Assume that there is a Rails framework method <code class="language-plaintext highlighter-rouge">rails_method</code> that returns a number in the current Rails version, and that we use it in a file that changes frequently:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ClassWithManyChanges</span>
<span class="k">def</span> <span class="nf">a_method</span>
<span class="k">if</span> <span class="no">Rails</span><span class="p">.</span><span class="nf">rails_method</span><span class="p">(</span><span class="n">params</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
<span class="n">logger</span><span class="p">.</span><span class="nf">info</span> <span class="s1">'All good'</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>but in the next Rails version it returns a boolean instead of a number.</p>
<p>Instead of changing the condition to use <code class="language-plaintext highlighter-rouge">true</code> instead of <code class="language-plaintext highlighter-rouge">1</code> (a change that would work only in the <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> branch)</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ClassWithManyChanges</span>
<span class="k">def</span> <span class="nf">a_method</span>
<span class="k">if</span> <span class="no">Rails</span><span class="p">.</span><span class="nf">rails_method</span><span class="p">(</span><span class="n">params</span><span class="p">)</span> <span class="o">==</span> <span class="kp">true</span>
<span class="n">logger</span><span class="p">.</span><span class="nf">info</span> <span class="s1">'All good'</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>we can instead do the following and push it to the <code class="language-plaintext highlighter-rouge">main</code> branch.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ClassWithManyChanges</span>
<span class="k">def</span> <span class="nf">a_method</span>
<span class="c1"># TODO(rails6.1): Cleanup after upgrade</span>
<span class="n">against_value</span> <span class="o">=</span> <span class="no">Skroutz</span><span class="p">.</span><span class="nf">rails_next?</span> <span class="p">?</span> <span class="kp">true</span> <span class="p">:</span> <span class="mi">1</span>
<span class="k">if</span> <span class="no">Rails</span><span class="p">.</span><span class="nf">rails_method</span><span class="p">(</span><span class="n">params</span><span class="p">)</span> <span class="o">==</span> <span class="n">against_value</span>
<span class="n">logger</span><span class="p">.</span><span class="nf">info</span> <span class="s1">'All good'</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>This might seem a bit weird, but besides saving time on conflict resolution upon rebase, it also acts as a warning for engineers when they attempt to change a part of the <code class="language-plaintext highlighter-rouge">main</code> branch that behaves differently in the next Rails version.</p>
<h2 id="delivering-the-upgrade">Delivering the upgrade</h2>
<p>At this point the suite is green and the most important milestone of the upgrade has been completed successfully.
Now the target is to deliver it safely and without surprises. Well, at least with as few surprises as possible :)</p>
<h3 id="sanity-testing">Sanity testing</h3>
<p>It is very common in our field to test something in the development environment and see it working, write specs for it and get them green, but once it goes live, users see the 500 page instead of our new feature.</p>
<p>To eliminate such cases, having a staging environment that is very close to production is a lifesaver.</p>
<h4 id="core-testing">Core testing</h4>
<p>This is a list of items to test against the new framework version:</p>
<ul>
<li>
<p><strong>Migrations:</strong> we run at least one ActiveRecord migration to make sure that everything works as expected, and we review the generated changes to the schema.</p>
</li>
<li>
<p><strong>Caching:</strong> when upgrading to a new version, it is very common to get failures when deserializing an object that was cached by the previous one. We must try to identify such cases and note them down, so that we are prepared to clear the affected keys from the cache upon releasing the upgrade - unless, as in our case, we can afford a full cache clear.</p>
</li>
<li>
<p><strong>Encryption:</strong> if we use Rails’ encryption (ex. encrypted cookies), we have to make sure that the decryption succeeds in the new version (and vice versa in case of a rollback).</p>
</li>
<li><strong>Integrations:</strong> the following checks should also be done (depending on the setup):
<ul>
<li><strong>rake:</strong> make sure that the application loads and the execution completes successfully for the most important tasks. In addition, if we are using libraries like <a href="https://github.com/javan/whenever"><code class="language-plaintext highlighter-rouge">whenever</code></a> for <strong>cron</strong> tasks, we should check that the generation of the crontab list succeeds and the result is identical to the previous version’s one.</li>
<li><strong>Background jobs:</strong> in our setup, we use <a href="https://github.com/resque/resque">resque</a> and <a href="https://kafka.apache.org/">kafka</a> for background processing - we queued jobs to both and made sure that their execution completed with the desired results.</li>
<li><strong>Benchmarking:</strong> at this point, we have to monitor the performance of the application. We used <a href="https://github.com/tmm1/stackprof">StackProf</a> along with flamegraphs and Ruby’s <a href="https://ruby-doc.org/stdlib-2.5.5/libdoc/benchmark/rdoc/Benchmark.html">Benchmark</a> module and compared the performance (memory usage and timings) of our most critical flows.</li>
<li><strong>Elasticsearch:</strong> index documents to ensure that changes to ActiveRecord models in the new version haven’t affected the generated JSON that gets indexed on the server.</li>
</ul>
</li>
<li><strong>Traffic replay</strong>: we replay a large sample of production requests against the old <em>and</em> the new implementation and verify that the results are identical.</li>
</ul>
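<p>The caching check in particular can be rehearsed ahead of time. Rails cache stores serialize entries with <code class="language-plaintext highlighter-rouge">Marshal</code> under the hood, so a dump-and-load round trip of a representative payload can surface values whose serialized layout changed between versions. A minimal sketch (the payload is hypothetical; in a real check the dump would be produced under the old Rails version and the load performed under the new one):</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code>require 'time'

# A value as the OLD version would have cached it. Rails cache stores
# serialize entries with Marshal, so dumping and re-loading approximates
# a new-version read of an old cache entry.
# (The payload below is a hypothetical example.)
old_entry = Marshal.dump(
  { product_id: 42, cached_at: Time.parse('2021-10-01 10:00:00 UTC') }
)

# What the NEW version does when it hits the cache key.
restored = Marshal.load(old_entry)

raise 'cache round-trip failed' unless restored[:product_id] == 42
puts 'cache entry deserialized cleanly'
</code></pre></div></div>
<p>Running the load side against dumps produced in the previous version’s console is what actually catches the deserialization failures described above.</p>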
<h4 id="application-testing">Application testing</h4>
<p>We deployed the <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> branch to our staging environment and asked all product groups to perform manual tests at least for the most important flows of their domain. In our case, this step led to a couple of important bug fixes that would otherwise have reached production.</p>
<h3 id="spread-the-news">Spread the news</h3>
<p>We checked everything. We’re ready to move on.</p>
<p>Given that the deployment of the upgrade will require some time and that our engineering team has more than 160 members, it is important to <strong>inform everyone about the release date a few days beforehand</strong>:</p>
<ul>
<li>
<p><strong>Product engineers</strong>: Teams with a tight schedule to release an important feature should not be informed about the upgrade at the last minute. We have to make sure that we will not block any important operations and we might even end up postponing the release for a few days if another milestone has a higher priority.</p>
</li>
<li>
<p><strong>Platform engineers</strong>: Our platform team, which is responsible for the infrastructure and site reliability, also has to be informed early enough to reserve the appropriate time to help us with the upgrade and its monitoring afterwards.</p>
</li>
<li>
<p><strong>Contact persons</strong>: The Core team is the one to deploy the upgrade, though the Contact persons have to be available throughout the process to help, investigate and hotfix if something related to their domain comes up.</p>
</li>
</ul>
<h3 id="deployment">Deployment</h3>
<p>We found the date. What are we actually going to do on that day?</p>
<h4 id="canary-release">Canary Release</h4>
<p>Our setup consists of many servers grouped by their purpose:</p>
<ul>
<li>application servers: serving the application to our end users</li>
<li>workers: executing the background jobs</li>
<li>internal tools: serving parts of the monolith to internal users (ex. content editing, reporting…)</li>
<li>etc</li>
</ul>
<p>Instead of deploying the Rails upgrade to all of the servers at once, we follow the <a href="https://martinfowler.com/bliki/CanaryRelease.html">Canary Release</a> technique.
In a nutshell, with this approach the changes are deployed to a subset of the servers and in an order that will reduce the impact in case of failure.</p>
<p>In our case, it was obvious that we should start with the servers dedicated to our internal tools. This would help us get immediate feedback from our internal users and our monitoring system and also avoid causing unnecessary frustration to our end users. So we deployed the upgrade to one of the group’s servers, everything went well, and we moved on to deploying to the rest of them.</p>
<p>Even though the workers group seemed a good next candidate, we decided to deploy to it last because, in case of failure, on top of resolving the error we would have a large amount of operational work to do for the failed jobs.</p>
<h4 id="create-a-detailed-plan">Create a detailed plan</h4>
<p>As we described above, the deployment is a multistep process.
It is extremely helpful to have a document with all the steps that we will need to follow on the release date.</p>
<p>We created a task in our tracking system in the milestone’s board in which we documented each specific deploy action along with notes, commands and resources (ex. monitoring links).</p>
<p>Here’s a sample:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Lock deploy
2. Merge `rails-upgrade-main` to `main`
3. Ensure successful build in Jenkins
4. Deploy to internals-1
Command: $ TARGET=internals-1 bundle exec deploy
Monitoring: https://monitoring/events?host=internals-1
5. Deploy to all internals
Command: $ TARGET=internals bundle exec deploy
Monitoring: https://monitoring/events?host=internals
...
10. Clear Rails.cache
Command: $ Rails.cache.clear
...
</code></pre></div></div>
<p>During the deployment, we might end up executing different commands, adding new steps, etc. Updating the task with these changes is valuable, since we will revisit it to create the next upgrade’s release plan.</p>
<h4 id="monitoring">Monitoring</h4>
<p>We use Sentry for reporting exceptions in production and Grafana with a great amount of dashboards with configured alerts on most of them. Both tools send notifications to one or more Slack channels.</p>
<p>During the release, of course, we were not just waiting for notifications to appear in Slack - we had the critical dashboards open in our browsers and checked their state constantly until we felt confident that everything was fine.</p>
<p>After cross-checking with the Platform team that things looked good on their side as well, we considered the release successful!</p>
<p>Well, not quite yet, but close. We might have tasks scheduled to run during the night or on specific days of the week, so we have to remember to check the monitoring tools occasionally for errors triggered by them, until all of them have completed successfully at least once after the upgrade.</p>
<h2 id="next-steps">Next steps</h2>
<p>Rails 7 is around the corner and there are a couple of things we can do to be better prepared for the next upgrade.</p>
<p>– <strong>Gem updates</strong>: we can schedule more frequent updates of our gems (especially those required by or depending on Rails), saving time in the next upgrade milestone</p>
<p>– <strong>Deprecations</strong>: new deprecations appeared in the current version and we can already start working on them, moving our codebase to a more compatible state for the next version</p>
<p>– <strong>Release information</strong>: we need to keep our eyes open for any major changes, new features, etc. in the new release</p>
<p>Now we’re done :)</p>
<hr />
<p>If you like providing a top-notch development environment or you get intrigued by working with Ruby & Rails, make sure to check our <a href="https://www.skroutz.gr/careers/162">Core team’s open position</a>, or our other <a href="https://www.skroutz.gr/careers#Engineering">job openings</a>.</p>
<p>Thank you for reading!</p>
<hr />
<p>PS: We almost forgot to acknowledge the upgrade’s coordinator.</p>
<figure>
<a href="../../../images/2021-upgrading-rails/engineering-cat.jpg" class="image-popup">
<img src="../../../images/2021-upgrading-rails/engineering-cat.jpg" alt="Engineering cat" />
</a>
</figure>
<p><a href="https://engineering.skroutz.gr/blog/monolith-diaries-upgrading-rails/">Monolith Diaries: Upgrading Rails</a> was originally published by Lazarus Lazaridis at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on October 22, 2021.</p>https://engineering.skroutz.gr/blog/using_hotwire_to_lazy_load_data2021-07-11T22:00:00+00:002021-07-11T22:00:00+00:00John Kapantzakishttps://engineering.skroutz.gr<p>At Skroutz we constantly try to find ways to make our website faster and consequently optimize our users’ experience. In this context, Hotwire couldn’t escape our attention as it aroused the interest of the developers community from the moment it was <a href="https://twitter.com/dhh/status/1341420143239450624?lang=en">announced</a> by its creators.</p>
<h1 id="what-is-hotwire">What is Hotwire?</h1>
<p>Hotwire, as described in the <a href="https://hotwire.dev/">official website</a>, is</p>
<blockquote>
<p>an alternative approach to building modern web applications without using much JavaScript by sending HTML instead of JSON over the wire</p>
</blockquote>
<p>In other words, Hotwire creates HTML markup, instead of JSON objects, and sends it as a response to the client’s request. This way, we avoid manipulating the response data with Javascript.</p>
<p>Furthermore, Hotwire can automatically inject the received HTML into the right place in the DOM, using <a href="https://turbo.hotwire.dev/">Turbo</a>, a set of techniques that eliminates the need to write custom Javascript in order to handle form submissions, partial DOM updates, history changes and many more.</p>
<p>As stated in their documentation, Turbo is able to handle at least 80% of the cases by itself on the client side, without the need for any Javascript to be written by you. For the remaining 20% of the cases, Hotwire provides <a href="https://stimulus.hotwire.dev/">Stimulus</a>, a lightweight Javascript framework that works well with Turbo. Stimulus can be used to create reusable components that can be bound to any HTML element and enhance it with custom behaviour.</p>
<h1 id="the-order-show-page">The order show page</h1>
<p>Let’s get started by setting the context of our example. At Skroutz we have developed a portal, known as <a href="https://merchants.skroutz.gr/merchants">Skroutz Merchants</a>, that provides useful tools to our partners to facilitate the operation of their stores. In one of its views, we show the order’s details alongside a list of any tickets related to this order.</p>
<p>In order to reduce the initial rendering time, we chose to load the tickets list asynchronously, as soon as the initial render has finished.</p>
<p>The following image illustrates a simplified wireframe of the order show page. The parts that are loaded on the initial render, such as the sidebar, the top bar and the order itself, are colored green. The tickets list section is colored orange, indicating that it gets loaded asynchronously, after the initial render.</p>
<figure>
<a href="../../../images/hotwire_lazy_load_tickets/order_show.png" class="image-popup">
<img src="../../../images/hotwire_lazy_load_tickets/order_show.png" alt="image" />
</a>
<figcaption>
<a href="../../images/hotwire_lazy_load_tickets/order_show.png">
Image 1: Merchants panel: Order show
</a>
</figcaption>
</figure>
<h1 id="lazy-load-with-vanilla-javascript">Lazy load with vanilla Javascript</h1>
<p>The process is simple: as soon as the page loads, a javascript function makes a request to <code class="language-plaintext highlighter-rouge">/merchants/orders/:code/tickets</code> path in order to fetch the tickets, if any.</p>
<p>As shown in the following block, <code class="language-plaintext highlighter-rouge">order_tickets</code> queries the database, checks to see if there are any tickets and creates the HTML from the respective partial template.</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># orders_controller.rb</span>
<span class="c1"># GET /merchants/orders/:code/tickets</span>
<span class="k">def</span> <span class="nf">order_tickets</span>
<span class="n">tickets</span> <span class="o">=</span> <span class="c1"># db query</span>
<span class="n">options</span> <span class="o">=</span> <span class="p">{</span> <span class="ss">layout: </span><span class="kp">false</span><span class="p">,</span> <span class="ss">formats: :html</span> <span class="p">}</span>
<span class="n">view</span> <span class="o">=</span> <span class="k">if</span> <span class="n">tickets</span><span class="p">.</span><span class="nf">present?</span>
<span class="n">options</span><span class="p">.</span><span class="nf">merge!</span><span class="p">(</span><span class="ss">partial: </span><span class="s1">'merchants/tickets/ticket'</span><span class="p">,</span>
<span class="ss">collection: </span><span class="n">tickets</span><span class="p">,</span>
<span class="ss">as: :ticket</span><span class="p">)</span>
<span class="k">else</span>
<span class="n">options</span><span class="p">.</span><span class="nf">merge!</span><span class="p">(</span><span class="ss">partial: </span><span class="s1">'merchants/tickets/no_tickets_message'</span><span class="p">)</span>
<span class="k">end</span>
<span class="n">respond_to</span> <span class="k">do</span> <span class="o">|</span><span class="nb">format</span><span class="o">|</span>
<span class="nb">format</span><span class="p">.</span><span class="nf">json</span> <span class="k">do</span>
<span class="n">render</span> <span class="ss">json: </span><span class="p">{</span> <span class="ss">html: </span><span class="n">render_to_string</span><span class="p">(</span><span class="n">view</span><span class="p">).</span><span class="nf">squish</span> <span class="p">},</span> <span class="ss">status: :ok</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span></code></pre></figure>
<p>Now, let’s see the frontend part. <code class="language-plaintext highlighter-rouge">OrderTicketsView</code> is the class that is responsible for fetching the tickets data and injecting the received markup into the DOM. More specifically, <code class="language-plaintext highlighter-rouge">_getOrderTicketsData</code> performs the asynchronous request, finds the <code class="language-plaintext highlighter-rouge">#js-tickets-wrapper</code> element and replaces it with the received markup.</p>
<figure class="highlight"><pre><code class="language-erb" data-lang="erb"><span class="c"><%# show.html.erb %></span>
...
<span class="nt"><div</span> <span class="na">id=</span><span class="s">"js-tickets-wrapper"</span> <span class="na">data-order-code=</span><span class="s">"</span><span class="cp"><%=</span> <span class="n">order</span><span class="p">.</span><span class="nf">code</span> <span class="cp">%></span><span class="s">"</span><span class="nt">></span>
<span class="nt"><div</span> <span class="na">class=</span><span class="s">"loading-tickets flex-row"</span><span class="nt">></span>
<span class="cp"><%=</span> <span class="n">render</span> <span class="s1">'merchants/shared/spinner'</span> <span class="cp">%></span>
<span class="nt"></div></span>
<span class="nt"></div></span>
...</code></pre></figure>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="c1">// order_tickets_view.js</span>
<span class="k">export</span> <span class="k">default</span> <span class="kd">class</span> <span class="nc">OrderTicketsView</span> <span class="p">{</span>
<span class="nf">constructor</span><span class="p">()</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nf">_cacheElements</span><span class="p">();</span>
<span class="k">this</span><span class="p">.</span><span class="nf">_getOrderTicketsData</span><span class="p">();</span>
<span class="p">}</span>
<span class="nf">_cacheElements</span><span class="p">()</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nx">_ticketsWrapper</span> <span class="o">=</span> <span class="nb">document</span><span class="p">.</span><span class="nf">getElementById</span><span class="p">(</span><span class="dl">'</span><span class="s1">js-tickets-wrapper</span><span class="dl">'</span><span class="p">);</span>
<span class="k">if </span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">_ticketsWrapper</span><span class="p">)</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nx">_orderCode</span> <span class="o">=</span> <span class="k">this</span><span class="p">.</span><span class="nx">_ticketsWrapper</span><span class="p">.</span><span class="nx">dataset</span><span class="p">.</span><span class="nx">orderCode</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="nf">_getOrderTicketsData</span><span class="p">()</span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">_orderCode</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">orderTicketsUrl</span> <span class="o">=</span> <span class="s2">`</span><span class="p">${</span><span class="k">this</span><span class="p">.</span><span class="nx">_orderCode</span><span class="p">}</span><span class="s2">/tickets`</span><span class="p">;</span>
<span class="nx">axios</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="nx">orderTicketsUrl</span><span class="p">)</span>
<span class="p">.</span><span class="nf">then</span><span class="p">(({</span> <span class="nx">data</span> <span class="p">})</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nf">_appendTicketsGrid</span><span class="p">(</span><span class="nx">data</span><span class="p">.</span><span class="nx">html</span><span class="p">);</span>
<span class="p">})</span>
<span class="p">.</span><span class="k">catch</span><span class="p">(()</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nf">_showErrorMessage</span><span class="p">();</span>
<span class="p">});</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="nf">_appendTicketsGrid</span><span class="p">(</span><span class="nx">tickets</span><span class="p">)</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nx">_ticketsWrapper</span><span class="p">.</span><span class="nx">parentElement</span><span class="p">.</span><span class="nx">innerHTML</span> <span class="o">=</span> <span class="nx">tickets</span><span class="p">;</span>
<span class="p">}</span>
<span class="nf">_showErrorMessage</span><span class="p">()</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nx">_ticketsWrapper</span><span class="p">.</span><span class="nx">innerHTML</span> <span class="o">=</span> <span class="s2">`<div class="box-alert error"></span><span class="p">${</span><span class="nf">__</span><span class="p">(</span>
<span class="dl">'</span><span class="s1">Failed loading tickets</span><span class="dl">'</span>
<span class="p">)}</span><span class="s2"></div>`</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>As we can see from the <code class="language-plaintext highlighter-rouge">order_tickets_view.js</code> file, we have to write a fair amount of custom javascript code to achieve lazy loading behaviour. Wouldn’t it be nice if we had a way to apply this lazy loading feature without the boilerplate javascript code?</p>
<h1 id="introducing-turbo-frames">Introducing Turbo Frames</h1>
<p>Fortunately, Turbo provides <a href="https://turbo.hotwire.dev/handbook/frames">Turbo Frames</a>, a set of techniques that help us decompose a page into independent parts that get updated individually.</p>
<p>A turbo frame is nothing more than a custom HTML element with the <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag. Every turbo frame element must have a unique id that is used by Turbo in order to update its contents. Anything that is wrapped within a <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag belongs to a separate context that gets updated independently of the rest of the page.</p>
<p><a href="https://turbo.hotwire.dev/handbook/frames#lazily-loading-frames">Lazily loading frames</a> are a special case of turbo frames that fits our case perfectly. In order to create a lazily loading frame, we just have to provide a <code class="language-plaintext highlighter-rouge">src</code> attribute on the <code class="language-plaintext highlighter-rouge"><turbo-frame></code> element with a url as its value. As soon as the <code class="language-plaintext highlighter-rouge"><turbo-frame></code> element gets rendered, Turbo will make a request to the provided url and try to update the frame’s contents with the received HTML (as we said earlier, Hotwire responds with HTML instead of JSON). This update happens automatically, and we don’t have to write any custom javascript to handle the response.</p>
<h1 id="applying-lazily-loading-frames">Applying lazily loading frames</h1>
<p>Introducing turbo frames to an existing codebase is quite simple. Just wrap the desired part of the page with a <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag and you have created a frame.</p>
<p>In this way, in the <code class="language-plaintext highlighter-rouge">show.html.erb</code> view, we replace the <code class="language-plaintext highlighter-rouge">#js-tickets-wrapper</code> div with a <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag. The new turbo frame element must have a unique id, so we assign it the <code class="language-plaintext highlighter-rouge">order_tickets</code> id, along with a url as the value of the <code class="language-plaintext highlighter-rouge">src</code> attribute. Finally, we add the <code class="language-plaintext highlighter-rouge">loading: 'lazy'</code> attribute so that the request to the provided url happens only when the turbo frame element becomes visible in the viewport. More details about the available HTML attributes can be found <a href="https://turbo.hotwire.dev/reference/frames#html-attributes">here</a>.</p>
<figure class="highlight"><pre><code class="language-erb" data-lang="erb"><span class="c"><%# show.html.erb %></span>
...
<span class="c"><%# <div id="js-tickets-wrapper" data-order-code="<%= order.code %></span>"> %>
<span class="cp"><%=</span> <span class="n">turbo_frame_tag</span> <span class="ss">:order_tickets</span><span class="p">,</span>
<span class="ss">src: </span><span class="n">tickets_merchants_order_path</span><span class="p">(</span><span class="ss">code: </span><span class="vi">@order</span><span class="p">.</span><span class="nf">code</span><span class="p">),</span>
<span class="ss">loading: </span><span class="s1">'lazy'</span> <span class="k">do</span> <span class="cp">%></span>
<span class="nt"><div</span> <span class="na">class=</span><span class="s">"loading-tickets flex-row"</span><span class="nt">></span>
<span class="cp"><%=</span> <span class="n">render</span> <span class="s1">'merchants/shared/spinner'</span> <span class="cp">%></span>
<span class="nt"></div></span>
<span class="cp"><%</span> <span class="k">end</span> <span class="cp">%></span>
<span class="c"><%# </div> %></span></code></pre></figure>
<p>Then, we have to adjust the response of the action that gets called when the turbo frame element requests the provided url. The turbo frame expects a response that contains HTML markup, so we alter the contents of the <code class="language-plaintext highlighter-rouge">respond_to</code> block in order to return the respective partial view. Furthermore, we no longer need the <code class="language-plaintext highlighter-rouge">options</code> and <code class="language-plaintext highlighter-rouge">view</code> objects, because we don’t build the HTML manually as we did before with <code class="language-plaintext highlighter-rouge">render_to_string</code>.</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># orders_controller.rb</span>
<span class="c1"># GET /merchants/orders/:code/tickets</span>
<span class="k">def</span> <span class="nf">order_tickets</span>
<span class="n">tickets</span> <span class="o">=</span> <span class="c1"># db query</span>
<span class="c1"># options = { layout: false, formats: :html }</span>
<span class="c1"># view = if tickets.present?</span>
<span class="c1"># options.merge!(partial: 'merchants/tickets/ticket',</span>
<span class="c1"># collection: tickets,</span>
<span class="c1"># as: :ticket)</span>
<span class="c1"># else</span>
<span class="c1"># options.merge!(partial: 'merchants/tickets/no_tickets_message')</span>
<span class="c1"># end</span>
<span class="n">respond_to</span> <span class="k">do</span> <span class="o">|</span><span class="nb">format</span><span class="o">|</span>
<span class="c1"># format.json do</span>
<span class="c1"># render json: { html: render_to_string(view).squish }, status: :ok</span>
<span class="c1"># end</span>
<span class="nb">format</span><span class="p">.</span><span class="nf">html</span> <span class="k">do</span>
<span class="k">if</span> <span class="n">tickets</span><span class="p">.</span><span class="nf">present?</span>
<span class="n">render</span> <span class="ss">partial: </span><span class="s1">'merchants/tickets/tickets'</span><span class="p">,</span> <span class="ss">locals: </span><span class="p">{</span> <span class="ss">tickets: </span><span class="n">tickets</span> <span class="p">}</span>
<span class="k">else</span>
<span class="n">render</span> <span class="ss">partial: </span><span class="s1">'merchants/tickets/no_tickets_message'</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span></code></pre></figure>
<p>There is one more thing we need to do: adjust the <code class="language-plaintext highlighter-rouge">merchants/tickets/no_tickets_message</code> partial so that it responds with the expected markup. <code class="language-plaintext highlighter-rouge">merchants/tickets/tickets</code> was created from the start to wrap the collection of tickets in a way that Turbo can handle.</p>
<p>Turbo has to match the content it receives from the request to the provided url with the part of the page that it needs to update. As we said earlier, we gave the <code class="language-plaintext highlighter-rouge">order_tickets</code> id to the turbo frame element. Turbo will try to find a <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag with the same id inside the response body and, if it finds one, it takes its contents and replaces the contents of the page’s <code class="language-plaintext highlighter-rouge">#order_tickets</code> turbo frame element with them.</p>
<p>So, nothing scary, just wrap the contents with a <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag with the appropriate id as shown in the following blocks.</p>
<figure class="highlight"><pre><code class="language-erb" data-lang="erb"><span class="c"><%# _tickets.html.erb %></span>
<span class="cp"><%=</span> <span class="n">turbo_frame_tag</span> <span class="ss">:order_tickets</span> <span class="k">do</span> <span class="cp">%></span>
<span class="cp"><%=</span> <span class="n">render</span> <span class="ss">partial: </span><span class="s1">'merchants/tickets/ticket'</span><span class="p">,</span>
<span class="ss">collection: </span><span class="n">tickets</span><span class="p">,</span>
<span class="ss">as: :ticket</span> <span class="cp">%></span>
<span class="cp"><%</span> <span class="k">end</span> <span class="cp">%></span></code></pre></figure>
<figure class="highlight"><pre><code class="language-erb" data-lang="erb"><span class="c"><%# _no_tickets_message.html.erb %></span>
<span class="c"><%# Previously %></span>
<span class="c"><%# <div class="box-alert warning"> %></span>
<span class="c"><%# <%= _('No tickets found') %></span>
<span class="c"><%# </div> %></span>
<span class="c"><%# Add turbo_frame_tag %></span>
<span class="cp"><%=</span> <span class="n">turbo_frame_tag</span> <span class="ss">:order_tickets</span> <span class="k">do</span> <span class="cp">%></span>
<span class="nt"><div</span> <span class="na">class=</span><span class="s">"box-alert warning"</span><span class="nt">></span>
<span class="cp"><%=</span> <span class="n">_</span><span class="p">(</span><span class="s1">'No tickets found'</span><span class="p">)</span> <span class="cp">%></span>
<span class="nt"></div></span>
<span class="cp"><%</span> <span class="k">end</span> <span class="cp">%></span></code></pre></figure>
<p>Oh, and don’t forget, we no longer need the custom javascript code from <code class="language-plaintext highlighter-rouge">order_tickets_view.js</code>, so, we can safely delete it!</p>
<p>And that’s it! In three simple steps we have introduced Turbo Frames to our codebase in order to achieve the same lazy loading behaviour, without the use of custom javascript.</p>
<h1 id="summary">Summary</h1>
<p>In this post, we tried to demonstrate the ease with which we can use Turbo Frames. We have completed the refactoring in three simple steps:</p>
<ul>
<li>Wrap the desired part of the page with a <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag and give it a unique id and a url as the value of the <code class="language-plaintext highlighter-rouge">src</code> attribute</li>
<li>Refactor the controller’s response as needed</li>
<li>Add the <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag with the appropriate id to the partials that get rendered from the controller</li>
</ul>
<p>Apart from the simplicity of this refactoring, we have achieved a small reduction of our codebase (as shown in the following image from Github), as a result of removing the custom javascript code that handled these updates, which are now handled automatically by Turbo.</p>
<figure>
<a href="../../../images/hotwire_lazy_load_tickets/github_lines.png" class="image-popup">
<img src="../../../images/hotwire_lazy_load_tickets/github_lines.png" alt="image" />
</a>
<figcaption>
<a href="../../images/hotwire_lazy_load_tickets/github_lines.png">
Image 2: Github: Lines removed and added
</a>
</figcaption>
</figure>
<h1 id="next-steps">Next steps</h1>
<p>Turbo comes with many more techniques, apart from Turbo Frames. <a href="https://turbo.hotwire.dev/reference/streams">Turbo Streams</a> is another powerful feature that can improve the dynamic nature of any app. We can use streams to broadcast changes to our models from the server to the client. This is done over a <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSocket">WebSocket</a> connection that Turbo automatically establishes and handles for us.</p>
<p>In our case, we can take advantage of the power of Turbo Streams and push any updates of a specific order’s tickets to the client, so users will be able to see live updates (the insertion of a new ticket, a deletion or an edit) on their screen, without having to constantly refresh the page to fetch the latest state.</p>
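<p>As a rough sketch of what this could look like with the turbo-rails gem (the model callback below is an illustration based on its <code class="language-plaintext highlighter-rouge">Turbo::Broadcastable</code> API, not our actual implementation), a newly created ticket could broadcast itself to its order’s stream:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code># app/models/ticket.rb (illustrative sketch)
class Ticket < ApplicationRecord
  belongs_to :order

  # After the ticket is committed, prepend its partial to the
  # "order_tickets" target on every client subscribed to this
  # order's stream - no page refresh needed.
  after_create_commit do
    broadcast_prepend_to order,
                         target: 'order_tickets',
                         partial: 'merchants/tickets/ticket',
                         locals: { ticket: self }
  end
end
</code></pre></div></div>
<p>The view would then subscribe to the stream with <code class="language-plaintext highlighter-rouge">turbo_stream_from @order</code> next to the existing turbo frame, and Turbo would establish the WebSocket connection and apply the prepend on its own.</p>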
<p><a href="https://engineering.skroutz.gr/blog/using_hotwire_to_lazy_load_data/">Hotwire @ Skroutz: Lazy load data with minimum effort</a> was originally published by John Kapantzakis at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on July 11, 2021.</p>https://engineering.skroutz.gr/blog/seo-in-skroutz-our-top-5-principles-and-values2021-06-17T21:00:00+00:002021-06-17T21:00:00+00:00Vasilis Giannakourishttps://engineering.skroutz.gr<blockquote>
<p><strong>Table of Contents</strong></p>
<p><a href="#serve-the-human-not-the-machine">Serve the human, not the machine</a> <br />
› <a href="#great-user-experience-should-be-your-top-priority">Great User Experience should be your top priority</a> <br />
› <a href="#you-cant-fool-google-in-the-long-run">You can’t fool Google in the long run</a> <br /></p>
<p><a href="#user-intent-is-your-guiding-principle-for-great-content">User intent is your guiding principle for great content</a> <br />
› <a href="#know-your-audience">Know your audience</a> <br />
› <a href="#have-a-quality-page-for-every-important-to-your-company-query">Have a quality page for every important (to your company) query</a> <br /></p>
<p><a href="#an-excellent-site-performance--usability-should-be-a-company-objective-not-just-a-task">Excellent site performance & usability should be a company objective, not just a task</a> <br /></p>
<p><a href="#understand-how-google-sees-your-property-must-be-a-top-priority">Understanding how Google sees your property must be a top priority</a> <br />
› <a href="#google-doesnt-have-to-know-everything-you-can-help">Google doesn’t have to know everything; you can help!</a> <br />
› <a href="#are-you-confident-that-googlebot-can-always-parse-all-your-content">Are you confident that GoogleBot can always parse all your content?</a> <br /></p>
<p><a href="#seo-is-a-team-sport">SEO is a team sport</a> <br />
› <a href="#seo-should-be-in-the-dna-of-the-company-not-just-an-extra-task">SEO should be in the DNA of the company, not just an extra task</a> <br />
› <a href="#seo-unveils-helpful-actionable-data-and-creates-tools-that-help-the-other-teams-objectives">SEO unveils helpful, actionable data and creates tools that help the other teams’ objectives</a> <br /></p>
<p><a href="#final-words">Final Words</a> <br /></p>
</blockquote>
<p>With almost 10,000 stores and more than 10 million products on its platform, <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> is currently the <a href="https://www.similarweb.com/top-websites/greece/" target="_blank">fourth most visited site</a> (after Google, Facebook, and YouTube) and the leading Marketplace in Greece. <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> has on average 35M visits per month, with the vast majority of the traffic coming from Organic and Direct channels; we have never used paid ads (Adwords, etc.) for driving traffic to categories and products.</p>
<p>From the early days of Skroutz, back in 2005, we focused on <strong>quality content and experience</strong> in order to drive organic traffic. Although we didn’t always have a dedicated SEO team, the SEO mentality was present throughout the company. This mentality is what gave us an extremely good performance in the Greek SERPs and a steady year-over-year organic growth.</p>
<p>In this article, we share the <strong>most important values and principles</strong> that we have followed all these years. Although <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> is a marketplace and many of the principles focus on e-commerce SEO aspects, we believe that any member of the SEO community could find some useful information that can be applied to their websites.</p>
<h1 id="serve-the-human-not-the-machine">Serve the human, not the machine</h1>
<h3 id="great-user-experience-should-be-your-top-priority">Great User Experience should be your top priority.</h3>
<p>This is basically derived from our <a href="https://www.skroutz.gr/careers#journey" target="_blank">core values</a>, as a company. We <strong>focus on the user journey</strong> and seek to give the customer the best experience in every step.</p>
<p>Although we always strive to get our content and structure accessible and optimized for search engines, we never build things <strong>exclusively for SEO reasons</strong>. So, be it a new feature or a page redesign, our primary focus is always an <strong>excellent user experience</strong>.</p>
<p>After all, according to Google’s <a href="https://web.dev/vitals/" target="_blank">Web Vitals</a>: <em>“Optimizing for quality of user experience is key to the long-term success of any site on the web.”</em></p>
<h3 id="you-cant-fool-google-in-the-long-run">You can’t fool Google in the long run.</h3>
<p>We are not going to talk about this extensively, but there are many grey and black hat SEO techniques that can lead to some good short-term results and aren’t endorsed, of course, by the <a href="https://developers.google.com/search/docs/advanced/guidelines/overview#quality" target="_blank">Google Quality Guidelines</a>.</p>
<p>Well, we think that if you want to build a site on solid “SEO” ground, get all the tremendous benefits of organic traffic in the long term, and not lose sleep over every Google Core Update, you should stay away from any shady techniques. Google, despite its flaws, has evolved a lot over the years, and sooner or later you are likely to get caught and penalized.</p>
<p>After all, who wants to spend a lot of time on something that won’t pay off in the future when they could work on things that <strong>create value</strong> for their visitors?</p>
<p>User experience, content, and more technical stuff like performance, crawlability & indexability, and website architecture are some of the things you might invest your time in!</p>
<h1 id="user-intent-is-your-guiding-principle-for-great-content">User intent is your guiding principle for great content</h1>
<h3 id="know-your-audience">Know your audience</h3>
<p>If you can deeply understand your audience, you have made the first step toward structuring your pages to serve the user’s intent; that’s something that Google rewards in the long term.</p>
<p>By “deeply understand”, we don’t mean only <strong>how</strong> they search (Search Intent) but also:</p>
<ul>
<li><strong>What type</strong> of information is likely to help them most.</li>
<li><strong>How</strong> should you serve that content to help the user.</li>
<li><strong>Which</strong> piece of content can remove any doubts from the user to continue their journey.</li>
</ul>
<p>At Skroutz, we use many techniques to learn about our users. We start with the <strong>Search Intent</strong> (how the users search on Google), and then we try to unveil valuable insights about their behavior <strong>after landing</strong> on our site.</p>
<blockquote>
<p><strong>Skroutz Info:</strong> Apart from quantitative research, in order to deeply understand what information we need on our <strong>Product</strong> or <strong>Category</strong> pages, our User Research Team runs comprehensive qualitative & UX research.<br /><br />
For example, they use Live Chats, User Surveys and Live Usability Tests with scenarios like “I want to get a Refrigerator for my family”. They gather all the pain points of the User Journey and use them to enhance our products.</p>
</blockquote>
<h3 id="have-a-quality-page-for-every-important-to-your-company-query">Have a quality page for every important (to your company) query</h3>
<p>This is one of the <strong>fundamental principles</strong> of SEO, yet many sites neglect it for fear of duplicate content or <a href="https://engineering.skroutz.gr/blog/SEO-Crawl-Budget-Optimization-2019/" target="_blank">crawl budget issues</a>. Especially for E-commerce Sites, there are a lot of “traditional” rules that many (or their CMS) blindly follow:</p>
<p><em>“You should always no-follow & no-index category facets (filters).”</em></p>
<p><em>“Product variations are duplicate pages and should always be blocked from Google.”</em></p>
<p><em>“Out-of-Stock products have no value and should be removed from the Google Index as soon as possible.”</em></p>
<p>At Skroutz, we think that everything should be decided based on the user <strong>search intent</strong> and the company’s <strong>objectives</strong>. If one page has value for the company and can drive high-quality organic traffic, there is <strong>no reason why this page shouldn’t be indexed</strong>.</p>
<p>If we want to be more specific:</p>
<ul>
<li>Many facet combinations (Category Filters) can rank for many short and long-tail searches. If something has value for the visitors, index it; if not, save your crawl budget for another quality page.</li>
</ul>
<blockquote>
<p><strong>Skroutz Info:</strong> We use a sophisticated & automated way of indexing filter combinations (Faceted Navigation) and following their links, mainly based on traffic and internal searches.</p>
</blockquote>
<ul>
<li>In some cases, product variations may have a substantial difference regarding search intent. For example, some color variations in fashion products have a decent search volume for many different colors. This means that the user wants to see a specific variation of one product. Hence, a dedicated page for each color might be more relevant and helpful (e.g. recommending a suited color-complementary product) and cumulatively drive more traffic than a single page that contains every variation.</li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-principles-and-values-2021/keywords-skroutz-seo.png" alt="" /></p>
<ul>
<li>A large number of out-of-stock products can drive a lot of traffic, even if they have been discontinued for months. In some cases (e.g., a newer model came out), it’s beneficial to test if you could add value to a visitor by promoting more recent/ related products in the out-of-stock product pages.</li>
</ul>
<blockquote>
<p><strong>Skroutz Info:</strong> We keep our out-of-stock products until there is no genuine search interest for them. Some of them drive quality traffic to relevant (linked) products for many months after the day of being out-of-stock.</p>
</blockquote>
<h1 id="an-excellent-site-performance--usability-should-be-a-company-objective-not-just-a-task">Excellent site performance & usability should be a company objective, not just a task</h1>
<p>There has been a lot of chatter in the SEO community lately about the <a href="https://web.dev/vitals/" target="_blank">Web Vitals</a> and Google’s <a href="https://developers.google.com/search/blog/2021/04/more-details-page-experience" target="_blank">page experience update</a> that is taking place. Some are rushing now to fix those metrics to increase or preserve their organic performance after the update.</p>
<p>At Skroutz, we believe that delivering a great user experience on the web is <strong>heavily impacted</strong> by <strong>site performance and usability</strong>. That’s why <a href="https://engineering.skroutz.gr/blog/speed-the-journey-to-delivering-a-faster-experience-at-skroutz-gr/" target="_blank">speed was always a critical factor for Skroutz.gr</a>.</p>
<p>In order to preserve an excellent performance and usability:</p>
<ul>
<li>We are actively monitoring all the SEO-specific metrics like <a href="https://web.dev/lcp/" target="_blank">LCP</a>, <a href="https://web.dev/fid/" target="_blank">FID</a>, <a href="https://web.dev/cls/" target="_blank">CLS</a>.</li>
<li>We have set up a “speed mentality” for our Front-End engineers, especially for the latest and greatest things on rendering performance.</li>
<li>Our Systems Team is actively monitoring all requests, response volumes, and timings to ensure a stable and fast performance of our servers.</li>
</ul>
<h1 id="understand-how-google-sees-your-property-must-be-a-top-priority">Understanding how Google sees your property must be a top priority</h1>
<h3 id="google-doesnt-have-to-know-everything-you-can-help">Google doesn’t have to know everything; you can help!</h3>
<p>Google has improved its crawling capabilities over the years and, in most cases, GoogleBot can crawl a site efficiently, regardless of the technology used in the backend or the site’s size.</p>
<p>However, crawl efficiency is not always guaranteed for large sites (1 million+ unique pages) or sites with daily updated content. In those cases, <strong>prioritizing</strong> what to crawl is a vital aspect and should be considered in your SEO strategy.</p>
<p>How can you help Google?</p>
<ul>
<li><strong>Sitemaps</strong>: Help Google understand what YOU think should be prioritized.</li>
<li><strong>Content Pruning</strong>: Remove pages that are of little value to your audience and save crawl budget.</li>
<li><strong>Site Architecture</strong>: Help Google find & crawl your site easily, and understand the importance of every page.</li>
<li><strong>Internal Linking</strong>: Help Google understand how each page is related to each other and boost crawl rate for your important pages.</li>
</ul>
<blockquote>
<p><strong>Skroutz Info:</strong> 2 years ago, we optimized our crawl budget by removing 72% of Skroutz indexed URLs. If you are curious about how we did it, you can read the <a href="https://engineering.skroutz.gr/blog/SEO-Crawl-Budget-Optimization-2019/" target="_blank">detailed case study</a>.</p>
</blockquote>
<h3 id="are-you-confident-that-googlebot-can-always-parse-all-your-content">Are you confident that GoogleBot can always parse all your content?</h3>
<p>The web is changing, and so is SEO. The need for better website design and user experience has accelerated the adoption of new technologies and frameworks, like ReactJS and VueJS, that can change the content of a web page dynamically. This can create problems for SEO teams.</p>
<p>If your site makes heavy use of JavaScript, you also have to know:</p>
<ul>
<li>If Google can crawl and parse your content.</li>
<li>If your most important information like meta robots, titles & descriptions are always served correctly to GoogleBot.</li>
<li>In client-side rendering, you should be aware of the time needed for all your content to be indexed, especially if there are frequent updates; in such cases, GoogleBot will crawl and index the HTML first and come back later to render the JavaScript when its resources become available.</li>
</ul>
<h1 id="seo-is-a-team-sport">SEO is a team sport</h1>
<h3 id="seo-should-be-in-the-dna-of-the-company-not-just-an-extra-task">SEO should be in the DNA of the company, not just an extra task</h3>
<p>SEO, especially in enterprise-level websites with millions of pages, shouldn’t be one team’s job; it has to be embedded in the company’s DNA. Imagine how much easier it would be for SEO teams if non-SEO teams had a clear knowledge of:</p>
<ul>
<li>what SEO is,</li>
<li>why their job affects SEO,</li>
<li>how they can help the SEO Team and vice versa,</li>
<li>when they should proactively get in touch with the SEO team.</li>
</ul>
<p>At Skroutz, we are trying to embed SEO throughout the company as a mindset for every individual, from Product & Design to Content and Engineering. We use training, workshops and meetings with individuals/ teams so that everyone is involved.</p>
<blockquote>
<p><strong>Skroutz Info:</strong> Content teams are actively involved in many “SEO”-type tasks, like Keyword Research for Category & Product Titles.</p>
</blockquote>
<h3 id="seo-unveils-helpful-actionable-data-and-creates-tools-that-help-the-other-teams-objectives">SEO unveils helpful, actionable data and creates tools that help the other teams’ objectives</h3>
<p>Knowing how a user searches to find a specific piece of information in Google is an invaluable asset for site owners. In addition, this knowledge is something that the SEO team specializes in and can use to create value for many other teams and the company.</p>
<p>Some examples of different cases where the SEO team can really offer value are the following:</p>
<ul>
<li>Help customer support teams by sharing information about how customers search for the information they need from the site. For example, if many people search for “how to return a product in site X” or “cost of the X service”, the SEO team can propose some changes or a new section/ landing page, thus decreasing the number of phone calls/ emails.</li>
<li>Help Merchandising & Marketing Teams with prioritizing their promotional efforts (Site Banners, Social Media Posts, etc) especially for Seasonal products/ services, by providing them with weekly or monthly organic trends for some keywords or landing pages.</li>
</ul>
<blockquote>
<p><strong>Skroutz Info:</strong> We have created a Data Studio Dashboard with Organic Trends for Categories, Landing Pages, and Keyphrases using Search Console data. This Dashboard is used by many members of the Merchandising and Marketing teams.</p>
</blockquote>
<p><img src="https://engineering.skroutz.gr/images/seo-principles-and-values-2021/google-trends-skroutz-seo.png" alt="" /></p>
<ul>
<li>Educate Content teams about SEO and create tools that help their everyday job, like creating new Product or Category pages. For example, Search Console can be used to create a tool (via API or Data Studio) where members of Content Teams can find popular keyphrases and use them in titles or main content.</li>
</ul>
<h1 id="final-words">Final Words</h1>
<p>Having good organic performance is a long, difficult journey, especially for large and complex websites. However, if you stay focused on providing the best user experience, you will be rewarded with great results in the long term.</p>
<p>We hope that you found this article useful as a source of inspiration for your SEO adventure!</p>
<p>What are the values and principles that you follow, regarding SEO? Let us know, in a comment below (we’ll reply to all questions).</p>
<p>On Behalf of <a href="https://www.skroutz.gr/careers#Growth" target="_blank">Growth Team</a>,<br />
Vasilis.</p>
<hr />
<style type="text/css">
.entry-content h3 {
line-height: 1.2;
}
.entry-content img {
margin: 20px 0;
}
.entry-content td {
background: #fafafa;
font-size: 12px;
}
.entry-content blockquote {
background: #f6f6f6;
padding: 20px 25px;
border: 0;
margin: 30px 0;
transition: none;
}
.entry-content blockquote {
font-style: normal;
}
.entry-content blockquote ~ blockquote p {
border: 0;
}
.entry-content blockquote p {
border-bottom: 1px dotted #ccc;
padding-bottom: 5px;
}
.entry-content blockquote > p > a {
color: #1d1db8;
}
.entry-content blockquote p,
.entry-content blockquote li {
font-size: .9rem;
}
.entry-content p:last-child {
margin-bottom: 0;
}
.entry-content blockquote {
font-style: normal;
}
.entry-content blockquote p,
.entry-content blockquote li {
font-size: .9rem;
}
@media screen and (min-width: 48em) {
.entry-content blockquote p,
.entry-content blockquote li {
font-size: 1rem;
}
}
.entry-content a,
.entry-content code {
white-space: normal;
word-break: break-word;
}
</style>
<p><a href="https://engineering.skroutz.gr/blog/seo-in-skroutz-our-top-5-principles-and-values/">SEO at Skroutz.gr: Our Top 5 Principles & Values</a> was originally published by Vasilis Giannakouris at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on June 17, 2021.</p>https://engineering.skroutz.gr/blog/refactor-react-app-to-progressively-load-its-data2021-04-07T22:00:00+00:002021-04-07T22:00:00+00:00John Kapantzakishttps://engineering.skroutz.gr<p>Apart from its main product, Skroutz provides various internal tools to its people. These tools are developed in-house and are highly customized for our specific needs. One of these tools provides its users with statistics related to orders. Let’s refer to this page as the <strong>statistics page</strong> from now on.</p>
<h1 id="the-problem">The problem</h1>
<p>The statistics page is rendered using React and is responsible for fetching specific data from the backend and displaying them through various charts to the end user. The problem is twofold: on the one hand, all required data come from a single database query which is quite heavy and takes some time to finish (depending on the requested time period and the number of shops); on the other hand, there is the way React handles this waiting.</p>
<p>The following image shows what the user sees while waiting for the page to finish loading.</p>
<figure>
<a href="../../../images/react_progressive_load/before.png" class="image-popup">
<img src="../../../images/react_progressive_load/before.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/before.png">
Image 1: Before the refactoring
</a>
</figcaption>
</figure>
<p>This is not the best user experience because we have to <strong>wait for all data to be available</strong> before React starts to render the child components that are going to display the desired data. Furthermore, there is too much <strong>blank space</strong> on the screen while the page loads.</p>
<p>The following illustration depicts the way that the current implementation is organized. Each solid-lined rectangle represents a React component and each arrow represents data that flow from one component to another.</p>
<p>There is a wrapper component (the outer rectangle) that is responsible for fetching the data from the backend. Inside the wrapper component there is another component (the intermediate rectangle) that holds various child components (the colored rectangles) which are going to display the respective data.</p>
<p>When the data are available, the wrapper component updates its internal state and all child components get re-rendered, because we pass the data as props to each one of them through the intermediate component.</p>
<figure>
<a href="../../../images/react_progressive_load/single_source_of_data.png" class="image-popup">
<img src="../../../images/react_progressive_load/single_source_of_data.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/single_source_of_data.png">
Image 2: Single source of data
</a>
</figcaption>
</figure>
<p>The components are colored this way on purpose. Different color means different data. We can see that some components request different data from each other, but others, like A and B, or C and E, request the same data, only they display them in a slightly different way.</p>
<h1 id="proposed-solution">Proposed solution</h1>
<p>As mentioned before, the main problem with the initial implementation is that React has to wait for a heavy query to finish before it gets all the required data.</p>
<p>What if each component requested its own data independently from the backend? We could split the one heavy query into smaller ones that would be called by the respective components. We may end up with multiple network requests instead of one, but we can render each component as soon as its data are available.</p>
<figure>
<a href="../../../images/react_progressive_load/independent_fetch.png" class="image-popup">
<img src="../../../images/react_progressive_load/independent_fetch.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/independent_fetch.png">
Image 3: Independent data fetch
</a>
</figcaption>
</figure>
<p>What do we want to achieve with these changes?</p>
<ul>
<li><strong>Better user experience</strong>, because the user will see various components in a loading state (instead of a full page loader) and, gradually, each one of them will render the respective chart as soon as it gets its data. This is a valid benefit here because the nature of the specific page is to provide various, <strong>independent metrics</strong> that can be consumed individually by the user and provide valuable insights. In other words, the user doesn’t have to view all the data that the page will eventually render in order to draw a conclusion.</li>
<li><strong>Avoiding a single point of failure</strong>: by executing multiple requests, we avoid the case where a single failed request prevents all the components from being rendered, leaving a blank page with an error message. With multiple, independent components we can render the ones that have data, while for the ones where an error has occurred, we can render an error message with a retry button.</li>
</ul>
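<p>As a rough sketch of how this per-component state could be modeled (the names here are illustrative, not our actual code), each metric can carry its own lifecycle status, so a failed request only affects its own panel:</p>

```javascript
// Illustrative reducer: each metric key tracks its own loading/success/error
// state, so one failing request only marks its own panel as errored.
function metricsReducer(state, action) {
  switch (action.type) {
    case "fetch_start":
      return { ...state, [action.metric]: { status: "loading" } };
    case "fetch_success":
      return { ...state, [action.metric]: { status: "success", data: action.data } };
    case "fetch_error":
      return { ...state, [action.metric]: { status: "error", error: action.error } };
    default:
      return state;
  }
}
```

<p>Plugged into something like React’s <code class="language-plaintext highlighter-rouge">useReducer</code>, each child component would dispatch actions for its own metric and decide between a loader, a chart, or a retry button based solely on its own entry in the state.</p>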
<h1 id="implementation">Implementation</h1>
<p>Now let’s move on to the implementation. We are not going to dive into much detail or provide the full code, as that is not the goal of this article; instead, we will highlight the most interesting parts of the current implementation and briefly explain the changes we made to achieve the final result.</p>
<blockquote>
<p>Names of classes, methods and components may have changed. Many parts of code have been omitted for reasons of simplicity.</p>
</blockquote>
<h4 id="create-the-api">Create the API</h4>
<p>First of all, we have to provide the API that the React components are going to use in order to fetch their data. Until now, we had an endpoint that, when called, executed a heavy query against the database in order to return a hash containing all the required data.</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># GET /path/to/stats_data</span>
<span class="k">def</span> <span class="nf">stats_data</span>
<span class="n">render</span> <span class="ss">json: </span><span class="no">DataClass</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span>
<span class="ss">from: </span><span class="n">params</span><span class="p">[</span><span class="ss">:from</span><span class="p">],</span>
<span class="ss">to: </span><span class="n">params</span><span class="p">[</span><span class="ss">:to</span><span class="p">]</span>
<span class="p">).</span><span class="nf">stats</span>
<span class="k">end</span></code></pre></figure>
<p>After adding the appropriate methods to <code class="language-plaintext highlighter-rouge">DataClass</code> in order to return the respective portion of data, we make the <code class="language-plaintext highlighter-rouge">stats_data</code> action accept the <code class="language-plaintext highlighter-rouge">metric</code> param so that it can call the respective <code class="language-plaintext highlighter-rouge">DataClass</code> method.</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># GET /path/to/stats_data</span>
<span class="k">def</span> <span class="nf">stats_data</span>
<span class="n">data_summary</span> <span class="o">=</span> <span class="no">DataClass</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span>
<span class="ss">from: </span><span class="n">params</span><span class="p">[</span><span class="ss">:from</span><span class="p">],</span>
<span class="ss">to: </span><span class="n">params</span><span class="p">[</span><span class="ss">:to</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">data</span><span class="p">[</span><span class="n">params</span><span class="p">[</span><span class="ss">:metric</span><span class="p">]]</span> <span class="o">=</span> <span class="n">data_summary</span><span class="p">.</span><span class="nf">public_send</span><span class="p">(</span><span class="n">params</span><span class="p">[</span><span class="ss">:metric</span><span class="p">])</span>
<span class="n">render</span> <span class="ss">json: </span><span class="n">data</span>
<span class="k">rescue</span> <span class="o">=></span> <span class="n">e</span>
<span class="n">respond_error</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="ss">:unprocessable_entity</span><span class="p">)</span>
<span class="k">end</span></code></pre></figure>
<p>Now, each component will be able to call <code class="language-plaintext highlighter-rouge">stats_data</code>, providing its own <code class="language-plaintext highlighter-rouge">metric</code> param to get the desired data.</p>
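<p>On the client side, a minimal helper for calling this endpoint could look like the following sketch (the function names are hypothetical; the actual call sites in our app differ):</p>

```javascript
// Hypothetical helper: append the metric param to the stats URI,
// reusing an existing query string when present.
function buildMetricUrl(searchUri, metric) {
  const separator = searchUri.includes("?") ? "&" : "?";
  return `${searchUri}${separator}metric=${encodeURIComponent(metric)}`;
}

// Fetch a single metric's data; rejecting on a non-2xx response lets
// the caller render a per-panel error state instead of a blank page.
function getStatsMetric(searchUri, metric) {
  return fetch(buildMetricUrl(searchUri, metric)).then((response) => {
    if (!response.ok) throw new Error(`Request failed: ${response.status}`);
    return response.json();
  });
}
```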
<h4 id="how-it-used-to-work-initially">How it used to work initially</h4>
<p>Let’s take a look at the initial state. There are two wrapper components, <code class="language-plaintext highlighter-rouge">Stats</code> and <code class="language-plaintext highlighter-rouge">StatsMetrics</code> as we saw in image 2. <code class="language-plaintext highlighter-rouge">Stats</code> component fetches the data and passes them to <code class="language-plaintext highlighter-rouge">StatsMetrics</code>, as we can see in the following snippet.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">Stats</span><span class="p">({</span> <span class="nx">options</span> <span class="p">})</span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">searchUri</span><span class="p">,</span> <span class="nx">setSearchUri</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="kc">null</span><span class="p">);</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">statsData</span><span class="p">,</span> <span class="nx">setStatsData</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="kc">null</span><span class="p">);</span>
<span class="p">...</span>
<span class="nf">useEffect</span><span class="p">(()</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="o">!</span><span class="nx">shouldFetchStats</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
<span class="nf">getStats</span><span class="p">(</span><span class="nx">searchUri</span><span class="p">)</span>
<span class="p">.</span><span class="nf">then</span><span class="p">((</span><span class="nx">data</span><span class="p">)</span> <span class="o">=></span> <span class="nf">setStatsData</span><span class="p">(</span><span class="nx">data</span><span class="p">))</span>
<span class="p">.</span><span class="k">catch</span><span class="p">((</span><span class="nx">error</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span> <span class="p">...</span> <span class="p">})</span>
<span class="p">},</span> <span class="p">[</span><span class="nx">shouldFetchStats</span><span class="p">,</span> <span class="nx">searchUri</span><span class="p">]);</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><></span>
...
<span class="p"><</span><span class="nt">div</span><span class="p">></span><span class="si">{</span><span class="nx">statsData</span> <span class="o">&&</span> <span class="p"><</span><span class="nc">StatsMetrics</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">statsData</span><span class="si">}</span> <span class="p">/></span><span class="si">}</span><span class="p"></</span><span class="nt">div</span><span class="p">></span>
<span class="p"></></span>
<span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">StatsMetrics</code>, in turn, gets the data from its parent and renders the child components, passing the respective data to each one of them. You can see a comment after each component that indicates the respective rectangle from image 2 (and 3).</p>
<p>As we explained earlier, some components require the same data as other components do, like component A, which requires <code class="language-plaintext highlighter-rouge">data.order.all</code>, just like component B does. The same goes for components C and E which require the <code class="language-plaintext highlighter-rouge">data.order.billed</code> part.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">StatsMetrics</span><span class="p">({</span> <span class="nx">data</span> <span class="p">})</span> <span class="p">{</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><></span>
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">orders</span><span class="p">.</span><span class="nx">all</span><span class="si">}</span> <span class="p">/></span> /* A */
<span class="p"><</span><span class="nc">StatsOrderCountLine</span> <span class="na">order</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">orders</span><span class="p">.</span><span class="nx">all</span><span class="si">}</span> <span class="p">/></span> /* B */
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">orders</span><span class="p">.</span><span class="nx">billed</span><span class="si">}</span> <span class="p">/></span> /* C */
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">orders</span><span class="p">.</span><span class="nx">pending_billing</span><span class="si">}</span> <span class="p">/></span> /* D */
<span class="p"><</span><span class="nc">StatsAverageGroup</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">orders</span><span class="p">.</span><span class="nx">billed</span><span class="si">}</span> <span class="p">/></span> /* E */
<span class="p"><</span><span class="nc">StatsRatiosGroup</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">ratios</span><span class="si">}</span> <span class="p">/></span> /* F */
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">orders</span><span class="p">.</span><span class="nx">cancelled</span><span class="si">}</span> <span class="p">/></span> /* G */
<span class="p"><</span><span class="nc">CancellationGroup</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">cancellation_per_reason</span><span class="si">}</span> <span class="p">/></span> /* H */
<span class="p"></></span>
<span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>Taking a look at one of the child components, say <code class="language-plaintext highlighter-rouge">StatsOrderGroup</code>, we can see that it takes the <code class="language-plaintext highlighter-rouge">data</code> prop and displays parts of the data object via helper components.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">StatsOrderGroup</span><span class="p">({</span> <span class="nx">data</span> <span class="p">})</span> <span class="p">{</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><></span>
<span class="p"><</span><span class="nc">StatsQuantityMetric</span> <span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">count</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsCurrencyMetric</span> <span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">revenue</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsCurrencyMetric</span> <span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">commission</span><span class="si">}</span> <span class="p">/></span>
<span class="p"></></span>
<span class="p">)</span>
<span class="p">}</span></code></pre></figure>
<h4 id="move-responsibility-of-data-fetching-to-children-components">Move responsibility of data fetching to children components</h4>
<p>As we explained in the previous section, the plan is to assign the responsibility of data fetching to each one of the child components. So, the first step is to remove the <code class="language-plaintext highlighter-rouge">useEffect</code> hook from the <code class="language-plaintext highlighter-rouge">Stats</code> component and pass the <code class="language-plaintext highlighter-rouge">searchUri</code> prop to the <code class="language-plaintext highlighter-rouge">StatsMetrics</code> component, instead of <code class="language-plaintext highlighter-rouge">statsData</code>.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">Stats</span><span class="p">({</span> <span class="nx">options</span> <span class="p">})</span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">searchUri</span><span class="p">,</span> <span class="nx">setSearchUri</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="kc">null</span><span class="p">);</span>
<span class="c1">// const [statsData, setStatsData] = useState(null);</span>
<span class="p">...</span>
<span class="c1">// useEffect(() => {</span>
<span class="c1">// if (!shouldFetchStats) return;</span>
<span class="c1">// getStats(searchUri)</span>
<span class="c1">// .then((data) => setStatsData(data))</span>
<span class="c1">// .catch((error) => { ... })</span>
<span class="c1">// }, [shouldFetchStats, searchUri]);</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><></span>
...
<span class="si">{</span><span class="cm">/* <div>{statsData && <StatsMetrics data={statsData} />}</div> */</span><span class="si">}</span>
<span class="p"><</span><span class="nt">div</span><span class="p">></span>
<span class="si">{</span><span class="nx">searchUri</span> <span class="o">!==</span> <span class="kc">null</span> <span class="o">&&</span> <span class="p"><</span><span class="nc">StatsMetrics</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span><span class="si">}</span>
<span class="p"></</span><span class="nt">div</span><span class="p">></span>
<span class="p"></></span>
<span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>Then, we change the <code class="language-plaintext highlighter-rouge">StatsMetrics</code> component to receive the <code class="language-plaintext highlighter-rouge">searchUri</code> prop and pass it down to the child components.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">StatsMetrics</span><span class="p">({</span> <span class="nx">searchUri</span> <span class="p">})</span> <span class="p">{</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><></span>
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsOrderCountLine</span> <span class="na">order</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsAverageGroup</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsRatiosGroup</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">CancellationGroup</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"></></span>
<span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<h4 id="we-need-a-loading-state-and-error-reporting">We need a loading state and error reporting</h4>
<p>Until now, a child component was rendered if and only if its data was available: the parent passed the data in, so the child was immediately ready to render its markup.</p>
<p>Now the situation is different. Each child component is rendered immediately on page load and waits until its data arrives from the backend before it can display anything to the user.</p>
<p>So, we need a <strong>loading state</strong> for each component and an <strong>error reporting state</strong> in case of an error response from the API call. For this reason, we introduced some wrapper components that are responsible for the following:</p>
<ul>
<li>Fetch data from the backend</li>
<li>Render a loading state while waiting for data to be available</li>
<li>Display an error message in case of error, and a retry button (for on-demand data fetching)</li>
</ul>
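<p>The decision each wrapper makes can be summed up in a tiny, framework-free helper. This is a sketch for illustration only (the function name <code class="language-plaintext highlighter-rouge">selectView</code> is hypothetical); the real wrappers render React components instead of returning strings.</p>

```javascript
// Decide which UI a wrapper should render, given its state:
// an error (with a retry button) wins over everything, then the
// loading state, and finally the fetched data itself.
function selectView({ isLoading, showError }) {
  if (showError) return 'error';   // error message + retry button
  if (isLoading) return 'loading'; // loading placeholder
  return 'data';                   // the actual metrics
}

console.log(selectView({ isLoading: true,  showError: false })); // 'loading'
console.log(selectView({ isLoading: false, showError: true  })); // 'error'
console.log(selectView({ isLoading: false, showError: false })); // 'data'
```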
<h4 id="introducing-children-components-wrappers">Introducing children components wrappers</h4>
<p>We are going to need a wrapper for each component that needs to fetch its data, so we create three wrappers:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">StatsOrderGroupWrapper</code> for <code class="language-plaintext highlighter-rouge">StatsOrderGroup</code></li>
<li><code class="language-plaintext highlighter-rouge">StatsRatiosGroupWrapper</code> for <code class="language-plaintext highlighter-rouge">StatsRatiosGroup</code></li>
<li><code class="language-plaintext highlighter-rouge">CancellationGroupWrapper</code> for <code class="language-plaintext highlighter-rouge">CancellationGroup</code></li>
</ul>
<p>We did not create wrappers for <code class="language-plaintext highlighter-rouge">StatsOrderCountLine</code> and <code class="language-plaintext highlighter-rouge">StatsAverageGroup</code>, because these components get their data indirectly, from other components (pairs A - B and C - E).</p>
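<p>Conceptually, this indirect data sharing works like a tiny publish/subscribe store: the component that fetches publishes its result, and the sibling that needs the same data subscribes to it. The snippet below is a simplified, framework-free sketch of that idea (<code class="language-plaintext highlighter-rouge">createStatsStore</code> is hypothetical); the actual implementation uses React context and a reducer.</p>

```javascript
// Minimal stand-in for the shared context: component C publishes the
// 'billed' stats it fetched, and component E just reads them.
function createStatsStore() {
  const state = { all: null, billed: null };
  const listeners = [];
  return {
    dispatch({ type, payload }) {
      if (type === 'SET_ALL') state.all = payload;
      if (type === 'SET_BILLED') state.billed = payload;
      listeners.forEach((fn) => fn(state)); // notify subscribers
    },
    subscribe(fn) { listeners.push(fn); },
    getState() { return state; }
  };
}

const store = createStatsStore();
let seenByE = null;
store.subscribe((s) => { seenByE = s.billed; }); // E waits for C's data
store.dispatch({ type: 'SET_BILLED', payload: { count: 95 } }); // C fetched
// seenByE is now { count: 95 } without E fetching anything itself
```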
<p>The following snippet shows the <code class="language-plaintext highlighter-rouge">StatsOrderGroupWrapper</code> component in its final form. The wrapper fetches the data it needs, renders a loading state while waiting, displays the data as soon as it becomes available, or shows an error message if the fetch fails.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">import</span> <span class="p">{</span> <span class="nx">getStats</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">path/to/api</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="p">{</span> <span class="nx">useGetCpsOrderStats</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">path/to/useGetCpsOrderStats</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="p">{</span> <span class="nx">useCpsOrdersStats</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">path/to/cpsOrdersStatsContext</span><span class="dl">'</span><span class="p">;</span>
<span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">StatsOrderGroupWrapper</span><span class="p">({</span> <span class="nx">searchUri</span><span class="p">,</span> <span class="nx">metric</span><span class="p">,</span> <span class="nx">updateRequestsState</span> <span class="p">})</span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">initialStats</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">count</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="na">revenue</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="na">commission</span><span class="p">:</span> <span class="mi">0</span>
<span class="p">};</span>
<span class="kd">const</span> <span class="p">{</span> <span class="nx">dispatch</span> <span class="p">}</span> <span class="o">=</span> <span class="nf">useCpsOrdersStats</span><span class="p">();</span>
<span class="kd">const</span> <span class="p">{</span> <span class="nx">stats</span><span class="p">,</span> <span class="nx">isLoading</span><span class="p">,</span> <span class="nx">showError</span><span class="p">,</span> <span class="nx">getData</span> <span class="p">}</span> <span class="o">=</span> <span class="nf">useGetCpsOrderStats</span><span class="p">({</span>
<span class="nx">getStats</span><span class="p">,</span>
<span class="nx">searchUri</span><span class="p">,</span>
<span class="nx">metric</span><span class="p">,</span>
<span class="nx">initialStats</span><span class="p">,</span>
<span class="nx">dispatch</span>
<span class="p">});</span>
<span class="nf">useEffect</span><span class="p">(()</span> <span class="o">=></span> <span class="p">{</span>
<span class="nf">updateRequestsState</span><span class="p">(</span><span class="nx">metric</span><span class="p">,</span> <span class="nx">isLoading</span><span class="p">);</span>
<span class="p">},</span> <span class="p">[</span><span class="nx">updateRequestsState</span><span class="p">,</span> <span class="nx">metric</span><span class="p">,</span> <span class="nx">isLoading</span><span class="p">]);</span>
<span class="k">return</span> <span class="nx">showError</span> <span class="p">?</span> <span class="p">(</span>
<span class="p"><</span><span class="nc">StatsErrorSection</span> <span class="na">errorMessage</span><span class="p">=</span><span class="si">{</span><span class="nx">stats</span><span class="p">.</span><span class="nx">error</span><span class="si">}</span> <span class="na">retryFunc</span><span class="p">=</span><span class="si">{</span><span class="nx">getData</span><span class="si">}</span> <span class="p">/></span>
<span class="p">)</span> <span class="p">:</span> <span class="p">(</span>
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">isLoading</span><span class="p">=</span><span class="si">{</span><span class="nx">isLoading</span><span class="si">}</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">stats</span><span class="si">}</span> <span class="p">/></span>
<span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>As you can see, all the functionality is delegated to two custom hooks, <code class="language-plaintext highlighter-rouge">useGetCpsOrderStats</code> and <code class="language-plaintext highlighter-rouge">useCpsOrdersStats</code>. The other two wrappers use the same hooks and are similarly organized.</p>
<p>Let’s see what’s going on inside <code class="language-plaintext highlighter-rouge">cpsOrdersStatsContext</code>, which exposes the <code class="language-plaintext highlighter-rouge">useCpsOrdersStats</code> hook along with two other context components that we are going to see in action later:</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="kd">const</span> <span class="nx">initialStats</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">all</span><span class="p">:</span> <span class="kc">null</span><span class="p">,</span>
<span class="na">billed</span><span class="p">:</span> <span class="kc">null</span>
<span class="p">};</span>
<span class="kd">const</span> <span class="nx">CpsOrdersStatsContext</span> <span class="o">=</span> <span class="nx">React</span><span class="p">.</span><span class="nf">createContext</span><span class="p">(</span><span class="nx">initialStats</span><span class="p">);</span>
<span class="kd">function</span> <span class="nf">cpsOrdersStatsReducer</span><span class="p">(</span><span class="nx">state</span><span class="p">,</span> <span class="p">{</span> <span class="nx">type</span><span class="p">,</span> <span class="nx">payload</span> <span class="p">})</span> <span class="p">{</span>
<span class="k">switch </span><span class="p">(</span><span class="nx">type</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="dl">'</span><span class="s1">SET_ALL</span><span class="dl">'</span><span class="p">:</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">{</span>
<span class="p">...</span><span class="nx">state</span><span class="p">,</span>
<span class="na">all</span><span class="p">:</span> <span class="nx">payload</span>
<span class="p">};</span>
<span class="p">}</span>
<span class="k">case</span> <span class="dl">'</span><span class="s1">SET_BILLED</span><span class="dl">'</span><span class="p">:</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">{</span>
<span class="p">...</span><span class="nx">state</span><span class="p">,</span>
<span class="na">billed</span><span class="p">:</span> <span class="nx">payload</span>
<span class="p">};</span>
<span class="p">}</span>
<span class="k">case</span> <span class="dl">'</span><span class="s1">RESET_ALL</span><span class="dl">'</span><span class="p">:</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">{</span>
<span class="p">...</span><span class="nx">state</span><span class="p">,</span>
<span class="na">all</span><span class="p">:</span> <span class="kc">null</span>
<span class="p">};</span>
<span class="p">}</span>
<span class="k">case</span> <span class="dl">'</span><span class="s1">RESET_BILLED</span><span class="dl">'</span><span class="p">:</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">{</span>
<span class="p">...</span><span class="nx">state</span><span class="p">,</span>
<span class="na">billed</span><span class="p">:</span> <span class="kc">null</span>
<span class="p">};</span>
<span class="p">}</span>
<span class="nl">default</span><span class="p">:</span> <span class="p">{</span>
<span class="k">return</span> <span class="nx">state</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nf">CpsOrdersStatsProvider</span><span class="p">({</span> <span class="nx">children</span> <span class="p">})</span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">stats</span><span class="p">,</span> <span class="nx">dispatch</span><span class="p">]</span> <span class="o">=</span> <span class="nx">React</span><span class="p">.</span><span class="nf">useReducer</span><span class="p">(</span><span class="nx">cpsOrdersStatsReducer</span><span class="p">,</span> <span class="nx">initialStats</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">value</span> <span class="o">=</span> <span class="nx">React</span><span class="p">.</span><span class="nf">useMemo</span><span class="p">(</span>
<span class="p">()</span> <span class="o">=></span> <span class="p">({</span>
<span class="nx">stats</span><span class="p">,</span>
<span class="nx">dispatch</span>
<span class="p">}),</span>
<span class="p">[</span><span class="nx">stats</span><span class="p">]</span>
<span class="p">);</span>
<span class="k">return</span> <span class="p"><</span><span class="nc">CpsOrdersStatsContext</span><span class="p">.</span><span class="nc">Provider</span> <span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">value</span><span class="si">}</span><span class="p">></span><span class="si">{</span><span class="nx">children</span><span class="si">}</span><span class="p"></</span><span class="nc">CpsOrdersStatsContext</span><span class="p">.</span><span class="nc">Provider</span><span class="p">>;</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nf">CpsOrdersStatsConsumer</span><span class="p">({</span> <span class="nx">children</span> <span class="p">})</span> <span class="p">{</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><</span><span class="nc">CpsOrdersStatsContext</span><span class="p">.</span><span class="nc">Consumer</span><span class="p">></span>
<span class="si">{</span><span class="p">(</span><span class="nx">context</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="nx">context</span> <span class="o">===</span> <span class="kc">undefined</span><span class="p">)</span> <span class="p">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nc">Error</span><span class="p">(</span>
<span class="dl">'</span><span class="s1">CpsOrdersStatsConsumer should be used inside a CpsOrdersStatsProvider</span><span class="dl">'</span>
<span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nf">children</span><span class="p">(</span><span class="nx">context</span><span class="p">);</span>
<span class="p">}</span><span class="si">}</span>
<span class="p"></</span><span class="nc">CpsOrdersStatsContext</span><span class="p">.</span><span class="nc">Consumer</span><span class="p">></span>
<span class="p">);</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nf">useCpsOrdersStats</span><span class="p">()</span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">context</span> <span class="o">=</span> <span class="nx">React</span><span class="p">.</span><span class="nf">useContext</span><span class="p">(</span><span class="nx">CpsOrdersStatsContext</span><span class="p">);</span>
<span class="k">if </span><span class="p">(</span><span class="nx">context</span> <span class="o">===</span> <span class="kc">undefined</span><span class="p">)</span> <span class="p">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nc">Error</span><span class="p">(</span><span class="dl">'</span><span class="s1">useCpsOrdersStats should be used inside a CpsOrdersStatsProvider</span><span class="dl">'</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nx">context</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">export</span> <span class="p">{</span> <span class="nx">CpsOrdersStatsProvider</span><span class="p">,</span> <span class="nx">CpsOrdersStatsConsumer</span><span class="p">,</span> <span class="nx">useCpsOrdersStats</span> <span class="p">};</span></code></pre></figure>
<p>First of all we create a context object and store it in the <code class="language-plaintext highlighter-rouge">CpsOrdersStatsContext</code> constant.</p>
<p>After that, we declare the <code class="language-plaintext highlighter-rouge">cpsOrdersStatsReducer</code> function that we are going to use as the reducer inside the <code class="language-plaintext highlighter-rouge">CpsOrdersStatsProvider</code> component, which we create right after.</p>
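<p>To see what the reducer does outside of React, here is the same switch logic exercised directly in plain JavaScript (illustration only; the payload value is made up):</p>

```javascript
// Same transitions as cpsOrdersStatsReducer above: SET_* stores a
// payload, RESET_* clears it, anything else leaves the state as is.
function cpsOrdersStatsReducer(state, { type, payload }) {
  switch (type) {
    case 'SET_ALL':      return { ...state, all: payload };
    case 'SET_BILLED':   return { ...state, billed: payload };
    case 'RESET_ALL':    return { ...state, all: null };
    case 'RESET_BILLED': return { ...state, billed: null };
    default:             return state;
  }
}

const initial = { all: null, billed: null };
const afterSet = cpsOrdersStatsReducer(initial, {
  type: 'SET_ALL',
  payload: { count: 120 } // hypothetical stats payload
});
// afterSet.all is { count: 120 }, afterSet.billed is still null
const afterReset = cpsOrdersStatsReducer(afterSet, { type: 'RESET_ALL' });
// afterReset.all is back to null
```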
<p><code class="language-plaintext highlighter-rouge">CpsOrdersStatsProvider</code> provides a value to its children, notifying them about changes and exposing the <code class="language-plaintext highlighter-rouge">dispatch</code> function so that they can update the context state.</p>
<p>But in order for <code class="language-plaintext highlighter-rouge">CpsOrdersStatsProvider</code>’s children to be informed about changes in our context (the stats data), they need to be wrapped inside a context consumer component. For this reason, we create the <code class="language-plaintext highlighter-rouge">CpsOrdersStatsConsumer</code> component, which does exactly that.</p>
<p>Finally, we create the <code class="language-plaintext highlighter-rouge">useCpsOrdersStats</code> custom hook, to be used by our wrappers (like <code class="language-plaintext highlighter-rouge">StatsOrderGroupWrapper</code>) in order to access the context and, in particular, the <code class="language-plaintext highlighter-rouge">dispatch</code> function. <code class="language-plaintext highlighter-rouge">StatsOrderGroupWrapper</code> calls <code class="language-plaintext highlighter-rouge">dispatch</code> (through the <code class="language-plaintext highlighter-rouge">useGetCpsOrderStats</code> hook) every time it needs to inform its sibling components that it has the data they need.</p>
<p>Now let’s take a look at the contents of <code class="language-plaintext highlighter-rouge">useGetCpsOrderStats</code>:</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">import</span> <span class="p">{</span> <span class="nx">useState</span><span class="p">,</span> <span class="nx">useCallback</span><span class="p">,</span> <span class="nx">useEffect</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">react</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="nx">camelCase</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">lodash/camelCase</span><span class="dl">'</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">useGetCpsOrderStats</span> <span class="o">=</span> <span class="p">({</span> <span class="nx">getStats</span><span class="p">,</span> <span class="nx">searchUri</span><span class="p">,</span> <span class="nx">metric</span><span class="p">,</span> <span class="nx">initialStats</span><span class="p">,</span> <span class="nx">dispatch</span> <span class="o">=</span> <span class="kc">null</span> <span class="p">})</span> <span class="o">=></span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">stats</span><span class="p">,</span> <span class="nx">setStats</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="nx">initialStats</span><span class="p">);</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">isLoading</span><span class="p">,</span> <span class="nx">setIsLoading</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="kc">false</span><span class="p">);</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">count</span><span class="p">,</span> <span class="nx">setCount</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">dispatchCallback</span> <span class="o">=</span> <span class="nf">useCallback</span><span class="p">(</span>
<span class="p">(</span><span class="nx">payload</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="nx">dispatch</span> <span class="o">&&</span> <span class="p">(</span><span class="nx">metric</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">all</span><span class="dl">'</span> <span class="o">||</span> <span class="nx">metric</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">billed</span><span class="dl">'</span><span class="p">))</span> <span class="p">{</span>
<span class="nf">dispatch</span><span class="p">({</span> <span class="na">type</span><span class="p">:</span> <span class="s2">`SET_</span><span class="p">${</span><span class="nx">metric</span><span class="p">.</span><span class="nf">toUpperCase</span><span class="p">()}</span><span class="s2">`</span><span class="p">,</span> <span class="nx">payload</span> <span class="p">});</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="p">[</span><span class="nx">dispatch</span><span class="p">,</span> <span class="nx">metric</span><span class="p">]</span>
<span class="p">);</span>
<span class="kd">const</span> <span class="nx">getCpsOrderStats</span> <span class="o">=</span> <span class="nf">useCallback</span><span class="p">(</span>
<span class="p">(</span><span class="nx">isMounted</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="nx">metric</span> <span class="o">&&</span> <span class="nx">searchUri</span> <span class="o">!==</span> <span class="kc">null</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="nx">dispatch</span><span class="p">)</span> <span class="p">{</span>
<span class="nf">dispatch</span><span class="p">({</span> <span class="na">type</span><span class="p">:</span> <span class="s2">`RESET_</span><span class="p">${</span><span class="nx">metric</span><span class="p">.</span><span class="nf">toUpperCase</span><span class="p">()}</span><span class="s2">`</span> <span class="p">});</span>
<span class="p">}</span>
<span class="nf">setIsLoading</span><span class="p">(</span><span class="kc">true</span><span class="p">);</span>
<span class="nf">getStats</span><span class="p">(</span><span class="nx">searchUri</span><span class="p">,</span> <span class="nx">metric</span><span class="p">)</span>
<span class="p">.</span><span class="nf">then</span><span class="p">((</span><span class="nx">data</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="nx">isMounted</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">thisData</span> <span class="o">=</span> <span class="nx">data</span><span class="p">[</span><span class="nf">camelCase</span><span class="p">(</span><span class="nx">metric</span><span class="p">)];</span>
<span class="nf">setStats</span><span class="p">(</span><span class="nx">thisData</span><span class="p">);</span>
<span class="nf">dispatchCallback</span><span class="p">(</span><span class="nx">thisData</span><span class="p">);</span>
<span class="nf">setIsLoading</span><span class="p">(</span><span class="kc">false</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">})</span>
<span class="p">.</span><span class="k">catch</span><span class="p">((</span><span class="nx">e</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="nx">isMounted</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">errorMessage</span> <span class="o">=</span> <span class="p">(</span><span class="nx">e</span><span class="p">.</span><span class="nx">response</span> <span class="o">||</span> <span class="p">{}).</span><span class="nx">statusText</span> <span class="o">||</span> <span class="dl">'</span><span class="s1">An error occurred</span><span class="dl">'</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">errorData</span> <span class="o">=</span> <span class="p">{</span> <span class="na">error</span><span class="p">:</span> <span class="nx">errorMessage</span> <span class="p">};</span>
<span class="nf">setIsLoading</span><span class="p">(</span><span class="kc">false</span><span class="p">);</span>
<span class="nf">setStats</span><span class="p">(</span><span class="nx">errorData</span><span class="p">);</span>
<span class="nf">dispatchCallback</span><span class="p">(</span><span class="nx">errorData</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">});</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="p">[</span><span class="nx">getStats</span><span class="p">,</span> <span class="nx">dispatch</span><span class="p">,</span> <span class="nx">metric</span><span class="p">,</span> <span class="nx">searchUri</span><span class="p">,</span> <span class="nx">dispatchCallback</span><span class="p">]</span>
<span class="p">);</span>
<span class="kd">const</span> <span class="nx">getData</span> <span class="o">=</span> <span class="p">()</span> <span class="o">=></span> <span class="p">{</span>
<span class="nf">setCount</span><span class="p">(</span><span class="nx">count</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">};</span>
<span class="nf">useEffect</span><span class="p">(()</span> <span class="o">=></span> <span class="p">{</span>
<span class="kd">let</span> <span class="nx">mounted</span> <span class="o">=</span> <span class="kc">true</span><span class="p">;</span>
<span class="nf">getCpsOrderStats</span><span class="p">(</span><span class="nx">mounted</span><span class="p">);</span>
<span class="k">return </span><span class="p">()</span> <span class="o">=></span> <span class="p">{</span>
<span class="nx">mounted</span> <span class="o">=</span> <span class="kc">false</span><span class="p">;</span>
<span class="p">};</span>
<span class="p">},</span> <span class="p">[</span><span class="nx">getCpsOrderStats</span><span class="p">,</span> <span class="nx">count</span><span class="p">,</span> <span class="nx">searchUri</span><span class="p">]);</span>
<span class="kd">const</span> <span class="nx">showError</span> <span class="o">=</span> <span class="o">!</span><span class="nx">isLoading</span> <span class="o">&&</span> <span class="p">(</span><span class="o">!</span><span class="nx">stats</span> <span class="o">||</span> <span class="nx">stats</span><span class="p">.</span><span class="nx">error</span> <span class="o">!==</span> <span class="kc">undefined</span><span class="p">);</span>
<span class="k">return</span> <span class="p">{</span> <span class="nx">stats</span><span class="p">,</span> <span class="nx">isLoading</span><span class="p">,</span> <span class="nx">showError</span><span class="p">,</span> <span class="nx">getData</span> <span class="p">};</span>
<span class="p">};</span>
<span class="k">export</span> <span class="p">{</span> <span class="nx">useGetCpsOrderStats</span> <span class="p">};</span></code></pre></figure>
<p>This custom hook encapsulates the logic of fetching the desired data (<code class="language-plaintext highlighter-rouge">metric</code>), updates the stats context by calling the <code class="language-plaintext highlighter-rouge">dispatch</code> function for the specified metric, and derives the loading state as well as whether an error message should be shown.</p>
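<p>For example, for the metric <code class="language-plaintext highlighter-rouge">billed</code>, the reset action dispatched by the hook boils down to:</p>

```javascript
// The action type is derived from the metric name, exactly as in the hook above.
const metric = 'billed';
const resetAction = { type: `RESET_${metric.toUpperCase()}` };
// resetAction.type === 'RESET_BILLED'
```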
<h4 id="prevent-new-requests-until-all-components-have-finished-loading">Prevent new requests until all components have finished loading</h4>
<p>There are two buttons under the stats page filters: <code class="language-plaintext highlighter-rouge">Search</code> and <code class="language-plaintext highlighter-rouge">Clear</code>. Every time we click either of them, all components should request their data again. We need to prevent the user from clicking either button until all components have finished loading their data. Otherwise, we might end up with multiple asynchronous requests racing each other, with no guarantee about which one finishes first.</p>
<p>For this reason we introduce the <code class="language-plaintext highlighter-rouge">LoadingInspectionContext</code>, which is responsible for keeping the loading state of the whole page; in other words, it checks whether at least one component on the page is still waiting for its request to finish.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="kd">const</span> <span class="nx">LoadingInspectionContext</span> <span class="o">=</span> <span class="nx">React</span><span class="p">.</span><span class="nf">createContext</span><span class="p">({});</span>
<span class="kd">function</span> <span class="nf">atLeastOneIsPending</span><span class="p">(</span><span class="nx">collection</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="nb">Object</span><span class="p">.</span><span class="nf">entries</span><span class="p">(</span><span class="nx">collection</span><span class="p">).</span><span class="nf">some</span><span class="p">((</span><span class="nx">x</span><span class="p">)</span> <span class="o">=></span> <span class="nx">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">===</span> <span class="kc">true</span><span class="p">);</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nf">LoadingInspectionProvider</span><span class="p">({</span> <span class="nx">children</span> <span class="p">})</span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">requests</span><span class="p">,</span> <span class="nx">setRequests</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">({});</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">isLoading</span><span class="p">,</span> <span class="nx">setIsLoading</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="kc">false</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">updateRequest</span> <span class="o">=</span> <span class="nf">useCallback</span><span class="p">((</span><span class="nx">metric</span><span class="p">,</span> <span class="nx">metricIsLoading</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="nf">setRequests</span><span class="p">((</span><span class="nx">r</span><span class="p">)</span> <span class="o">=></span> <span class="p">({</span> <span class="p">...</span><span class="nx">r</span><span class="p">,</span> <span class="p">[</span><span class="nx">metric</span><span class="p">]:</span> <span class="nx">metricIsLoading</span> <span class="p">}));</span>
<span class="p">},</span> <span class="p">[]);</span>
<span class="kd">const</span> <span class="nx">value</span> <span class="o">=</span> <span class="nx">React</span><span class="p">.</span><span class="nf">useMemo</span><span class="p">(</span>
<span class="p">()</span> <span class="o">=></span> <span class="p">({</span>
<span class="nx">isLoading</span><span class="p">,</span>
<span class="nx">requests</span><span class="p">,</span>
<span class="na">updateRequestsState</span><span class="p">:</span> <span class="nx">updateRequest</span>
<span class="p">}),</span>
<span class="p">[</span><span class="nx">updateRequest</span><span class="p">,</span> <span class="nx">requests</span><span class="p">,</span> <span class="nx">isLoading</span><span class="p">]</span>
<span class="p">);</span>
<span class="nf">useEffect</span><span class="p">(()</span> <span class="o">=></span> <span class="p">{</span>
<span class="nf">setIsLoading</span><span class="p">(</span><span class="nf">atLeastOneIsPending</span><span class="p">(</span><span class="nx">requests</span><span class="p">));</span>
<span class="p">},</span> <span class="p">[</span><span class="nx">requests</span><span class="p">]);</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><</span><span class="nc">LoadingInspectionContext</span><span class="p">.</span><span class="nc">Provider</span> <span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">value</span><span class="si">}</span><span class="p">></span><span class="si">{</span><span class="nx">children</span><span class="si">}</span><span class="p"></</span><span class="nc">LoadingInspectionContext</span><span class="p">.</span><span class="nc">Provider</span><span class="p">></span>
<span class="p">);</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nf">LoadingInspectionConsumer</span><span class="p">({</span> <span class="nx">children</span> <span class="p">})</span> <span class="p">{</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><</span><span class="nc">LoadingInspectionContext</span><span class="p">.</span><span class="nc">Consumer</span><span class="p">></span>
<span class="si">{</span><span class="p">(</span><span class="nx">context</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="nx">context</span> <span class="o">===</span> <span class="kc">undefined</span><span class="p">)</span> <span class="p">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nc">Error</span><span class="p">(</span>
<span class="dl">'</span><span class="s1">LoadingInspectionConsumer should be used inside a LoadingInspectionProvider</span><span class="dl">'</span>
<span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nf">children</span><span class="p">(</span><span class="nx">context</span><span class="p">);</span>
<span class="p">}</span><span class="si">}</span>
<span class="p"></</span><span class="nc">LoadingInspectionContext</span><span class="p">.</span><span class="nc">Consumer</span><span class="p">></span>
<span class="p">);</span>
<span class="p">}</span>
<span class="k">export</span> <span class="p">{</span> <span class="nx">LoadingInspectionProvider</span><span class="p">,</span> <span class="nx">LoadingInspectionConsumer</span> <span class="p">};</span></code></pre></figure>
<p>As we can see, the above file exposes the <code class="language-plaintext highlighter-rouge">LoadingInspectionProvider</code> and <code class="language-plaintext highlighter-rouge">LoadingInspectionConsumer</code> components. We use them as follows: we wrap the filters and the main page (which contains the stats components) with a <code class="language-plaintext highlighter-rouge">LoadingInspectionProvider</code>. Then, we wrap each of the wrapper components (the components that fetch the data) with a <code class="language-plaintext highlighter-rouge">LoadingInspectionConsumer</code>. When a stats component’s data are available, we call the <code class="language-plaintext highlighter-rouge">updateRequestsState</code> function provided by the context object of the <code class="language-plaintext highlighter-rouge">LoadingInspectionConsumer</code> components, in order to update the page’s loading state.</p>
<h4 id="combining-them-all-together">Combining them all together</h4>
<p>Now, let’s see how all the above work together.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">Stats</span><span class="p">({</span> <span class="nx">options</span> <span class="p">})</span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">searchUri</span><span class="p">,</span> <span class="nx">setSearchUri</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="kc">null</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">searchCallback</span> <span class="o">=</span> <span class="nf">useCallback</span><span class="p">(({</span> <span class="nx">queryString</span> <span class="p">})</span> <span class="o">=></span> <span class="p">{</span>
<span class="nf">setSearchUri</span><span class="p">(</span><span class="nx">queryString</span><span class="p">);</span>
<span class="p">},</span> <span class="p">[]);</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><</span><span class="nc">LoadingInspectionProvider</span><span class="p">></span>
<span class="p"><</span><span class="nc">StatsFilters</span> <span class="na">OnSearchCallback</span><span class="p">=</span><span class="si">{</span><span class="nx">searchCallback</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nt">div</span><span class="p">></span>
<span class="si">{</span><span class="nx">searchUri</span> <span class="o">!==</span> <span class="kc">null</span> <span class="o">&&</span> <span class="p"><</span><span class="nc">StatsMetrics</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span><span class="si">}</span>
<span class="p"></</span><span class="nt">div</span><span class="p">></span>
<span class="p"></</span><span class="nc">LoadingInspectionProvider</span><span class="p">></span>
<span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>As we said earlier, <code class="language-plaintext highlighter-rouge">LoadingInspectionProvider</code> wraps the filters and the <code class="language-plaintext highlighter-rouge">StatsMetrics</code> components. If we take a look inside the <code class="language-plaintext highlighter-rouge">StatsMetrics</code> component, we can see how the <code class="language-plaintext highlighter-rouge">LoadingInspectionConsumer</code> is used together with the <code class="language-plaintext highlighter-rouge">CpsOrdersStatsProvider</code> and <code class="language-plaintext highlighter-rouge">CpsOrdersStatsConsumer</code> components.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">import</span> <span class="p">{</span> <span class="nx">CpsOrdersStatsProvider</span><span class="p">,</span> <span class="nx">CpsOrdersStatsConsumer</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">path/to/cpsOrdersStatsContext</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="p">{</span> <span class="nx">LoadingInspectionConsumer</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">path/to/loadingInspectionContext</span><span class="dl">'</span><span class="p">;</span>
<span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">StatsMetrics</span><span class="p">({</span> <span class="nx">searchUri</span> <span class="p">})</span> <span class="p">{</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><</span><span class="nc">CpsOrdersStatsProvider</span><span class="p">></span>
/* A */
<span class="p"><</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">({</span> <span class="nx">updateRequestsState</span> <span class="p">})</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nc">StatsOrderGroupWrapper</span>
<span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span>
<span class="na">metric</span><span class="p">=</span><span class="s">"all"</span>
<span class="na">updateRequestsState</span><span class="p">=</span><span class="si">{</span><span class="nx">updateRequestsState</span><span class="si">}</span>
<span class="p">/></span>
<span class="p">)</span><span class="si">}</span>
<span class="p"></</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
/* B */
<span class="p"><</span><span class="nc">CpsOrdersStatsConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">(</span><span class="nx">context</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">{</span>
<span class="na">stats</span><span class="p">:</span> <span class="p">{</span> <span class="nx">all</span> <span class="p">}</span>
<span class="p">}</span> <span class="o">=</span> <span class="nx">context</span> <span class="o">||</span> <span class="p">{</span> <span class="na">stats</span><span class="p">:</span> <span class="p">{}</span> <span class="p">};</span>
<span class="k">return</span> <span class="p"><</span><span class="nc">StatsOrderCountLine</span> <span class="na">metric</span><span class="p">=</span><span class="si">{</span><span class="nx">all</span><span class="si">}</span> <span class="p">/>;</span>
<span class="p">}</span><span class="si">}</span>
<span class="p"></</span><span class="nc">CpsOrdersStatsConsumer</span><span class="p">></span>
/* C */
<span class="p"><</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">({</span> <span class="nx">updateRequestsState</span> <span class="p">})</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nc">StatsOrderGroupWrapper</span>
<span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span>
<span class="na">metric</span><span class="p">=</span><span class="s">"billed"</span>
<span class="na">updateRequestsState</span><span class="p">=</span><span class="si">{</span><span class="nx">updateRequestsState</span><span class="si">}</span>
<span class="p">/></span>
<span class="p">)</span><span class="si">}</span>
<span class="p"></</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
/* D */
<span class="p"><</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">({</span> <span class="nx">updateRequestsState</span> <span class="p">})</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nc">StatsOrderGroupWrapper</span>
<span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span>
<span class="na">metric</span><span class="p">=</span><span class="s">"pending_billing"</span>
<span class="na">updateRequestsState</span><span class="p">=</span><span class="si">{</span><span class="nx">updateRequestsState</span><span class="si">}</span>
<span class="p">/></span>
<span class="p">)</span><span class="si">}</span>
<span class="p"></</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
/* E */
<span class="p"><</span><span class="nc">CpsOrdersStatsConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">(</span><span class="nx">context</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">{</span>
<span class="na">stats</span><span class="p">:</span> <span class="p">{</span> <span class="nx">billed</span> <span class="p">}</span>
<span class="p">}</span> <span class="o">=</span> <span class="nx">context</span> <span class="o">||</span> <span class="p">{</span> <span class="na">stats</span><span class="p">:</span> <span class="p">{}</span> <span class="p">};</span>
<span class="k">return</span> <span class="p"><</span><span class="nc">StatsAverageGroup</span> <span class="na">billed</span><span class="p">=</span><span class="si">{</span><span class="nx">billed</span><span class="si">}</span> <span class="p">/>;</span>
<span class="p">}</span><span class="si">}</span>
<span class="p"></</span><span class="nc">CpsOrdersStatsConsumer</span><span class="p">></span>
/* F */
<span class="p"><</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">({</span> <span class="nx">updateRequestsState</span> <span class="p">})</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nc">StatsRatiosGroupWrapper</span>
<span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span>
<span class="na">metric</span><span class="p">=</span><span class="s">"ratios"</span>
<span class="na">updateRequestsState</span><span class="p">=</span><span class="si">{</span><span class="nx">updateRequestsState</span><span class="si">}</span>
<span class="p">/></span>
<span class="p">)</span><span class="si">}</span>
<span class="p"></</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
/* G */
<span class="p"><</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">({</span> <span class="nx">updateRequestsState</span> <span class="p">})</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nc">StatsOrderGroupWrapper</span>
<span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span>
<span class="na">metric</span><span class="p">=</span><span class="s">"cancelled"</span>
<span class="na">updateRequestsState</span><span class="p">=</span><span class="si">{</span><span class="nx">updateRequestsState</span><span class="si">}</span>
<span class="p">/></span>
<span class="p">)</span><span class="si">}</span>
<span class="p"></</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
/* H */
<span class="p"><</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">({</span> <span class="nx">updateRequestsState</span> <span class="p">})</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nc">CancellationGroupWrapper</span>
<span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span>
<span class="na">metric</span><span class="p">=</span><span class="s">"cancellation_reasons"</span>
<span class="na">updateRequestsState</span><span class="p">=</span><span class="si">{</span><span class="nx">updateRequestsState</span><span class="si">}</span>
<span class="p">/></span>
<span class="p">)</span><span class="si">}</span>
<span class="p"></</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
<span class="p"></</span><span class="nc">CpsOrdersStatsProvider</span><span class="p">></span>
<span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>To sum up, we can say that the flow goes something like this:</p>
<ol>
<li>Wrapper components (like <code class="language-plaintext highlighter-rouge">StatsOrderGroupWrapper</code>) render with a loading state enabled</li>
<li><code class="language-plaintext highlighter-rouge">useGetCpsOrderStats</code> requests the data</li>
<li>When the data are available, the <code class="language-plaintext highlighter-rouge">dispatch</code> method is called in order to update the context</li>
<li>The component’s loading state becomes <code class="language-plaintext highlighter-rouge">false</code></li>
<li>The wrapper notifies the <code class="language-plaintext highlighter-rouge">LoadingInspectionProvider</code> context component that it has finished loading its data</li>
<li>The stats context has been updated, so the <code class="language-plaintext highlighter-rouge">CpsOrdersStatsConsumer</code>s notify their children to render the desired data</li>
</ol>
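<p>Stripped of its React specifics, the bookkeeping performed by <code class="language-plaintext highlighter-rouge">LoadingInspectionProvider</code> in this flow can be sketched in plain JavaScript (a simplification of the code above, for illustration only):</p>

```javascript
// A map of metric -> isLoading, plus the derived page-wide loading flag.
function atLeastOneIsPending(requests) {
  return Object.values(requests).some((isLoading) => isLoading === true);
}

const requests = {};
function updateRequest(metric, metricIsLoading) {
  requests[metric] = metricIsLoading; // what setRequests does immutably in React
}

updateRequest('all', true);     // a wrapper starts fetching
updateRequest('billed', true);  // another wrapper starts fetching
updateRequest('all', false);    // the first one finished
const stillLoading = atLeastOneIsPending(requests); // true: "billed" is pending
updateRequest('billed', false);
const pageReady = !atLeastOneIsPending(requests);   // true: new searches allowed
```

<p>The <code class="language-plaintext highlighter-rouge">Search</code> and <code class="language-plaintext highlighter-rouge">Clear</code> buttons stay disabled while this derived flag is <code class="language-plaintext highlighter-rouge">true</code>.</p>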
<h1 id="final-result">Final result</h1>
<p>It is time to take a look at our final result. As soon as the document loads, each component sends a request to the backend and renders a placeholder, indicating that it is waiting for its data to become available. The user now has a much better idea of what this page is going to render.</p>
<figure>
<a href="../../../images/react_progressive_load/after.png" class="image-popup">
<img src="../../../images/react_progressive_load/after.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/after.png">
Image 4: The final result
</a>
</figcaption>
</figure>
<p>If an unexpected error occurs in a specific component, an error message is rendered alongside a retry button that gives the user the opportunity to request the data for that component again. The components that managed to retrieve their data successfully can still visualize them via their respective charts.</p>
<figure>
<a href="../../../images/react_progressive_load/component_error.png" class="image-popup">
<img src="../../../images/react_progressive_load/component_error.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/component_error.png">
Image 5: Unexpected error to one or more components
</a>
</figcaption>
</figure>
<p>Now let’s take a look at the network tab to figure out what has changed. The next two images illustrate the time it took for each request to be completed.</p>
<blockquote>
<p>All measurements were made in a development environment</p>
</blockquote>
<p>In the first image we can see that, in the initial case, it took about 34 seconds for the one (and only) request to complete.</p>
<figure>
<a href="../../../images/react_progressive_load/network_one.png" class="image-popup">
<img src="../../../images/react_progressive_load/network_one.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/network_one.png">
Image 6: One network request
</a>
</figcaption>
</figure>
<p>In the second image we see that it takes about 42 seconds for all requests to complete. Moreover, instead of 1 request, we now have 6 requests running concurrently. At first glance this does not seem efficient; on the contrary, it seems to have made things worse.</p>
<p>But if we take a second look, we can see that the first request completes in less than 12 seconds. This means that 12 seconds after the initial load, the first component renders its data. About 5 seconds later, a second component renders its own data, and so on. In other words, <strong>the page loads progressively</strong>!</p>
<figure>
<a href="../../../images/react_progressive_load/network_multiple.png" class="image-popup">
<img src="../../../images/react_progressive_load/network_multiple.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/network_multiple.png">
Image 7: Multiple network requests
</a>
</figcaption>
</figure>
<p>Comparing the two cases (in the first, all components render their data after 34 seconds; in the second, each component renders its data as soon as it is available), we see that the second case provides a better user experience, even though the last component gets rendered after 42 seconds (vs the 34 seconds of the first case).</p>
<p>The fact that the page finishes loading the first batch of information in 12 seconds (instead of 34) reduces the <a href="https://web.dev/tti/">TTI</a>. As we can see in the following Lighthouse report, the page becomes interactive at 1.8 seconds.</p>
<figure>
<a href="../../../images/react_progressive_load/lighthouse.png" class="image-popup">
<img src="../../../images/react_progressive_load/lighthouse.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/lighthouse.png">
Image 8: Lighthouse report
</a>
</figcaption>
</figure>
<h1 id="summary-and-next-steps">Summary and next steps</h1>
<p>In this post we examined the implementation of a progressively loading React app and provided some technical details. We saw how React’s <code class="language-plaintext highlighter-rouge">context</code> object helped us achieve this functionality. Finally, we presented some performance metrics and saw that the final solution is a bit slower overall than the initial one, but the user experience is clearly better.</p>
<p>But there is always room for improvement. In our case, we could implement a mechanism that lets the user change the applied filters without having to wait for all data to be fetched. Since we use <code class="language-plaintext highlighter-rouge">axios</code> to fetch our data, we can use the <code class="language-plaintext highlighter-rouge">CancelToken</code> object provided by the library.</p>
<p>Below we can see the <code class="language-plaintext highlighter-rouge">getStats</code> function, the function that we call to fetch the stats data for a specific metric. <code class="language-plaintext highlighter-rouge">getStats</code> uses the <code class="language-plaintext highlighter-rouge">get</code> function, which is a custom wrapper around axios that we call <code class="language-plaintext highlighter-rouge">SkroutzAxios</code>.</p>
<p>It’s pretty straightforward to implement the cancellation functionality here: we just have to pass the <code class="language-plaintext highlighter-rouge">cancelToken</code> property to the options object.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="kd">const</span> <span class="nx">CancelToken</span> <span class="o">=</span> <span class="nx">axios</span><span class="p">.</span><span class="nx">CancelToken</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">source</span> <span class="o">=</span> <span class="nx">CancelToken</span><span class="p">.</span><span class="nf">source</span><span class="p">();</span>
<span class="kd">const</span> <span class="nx">httpRequest</span> <span class="o">=</span> <span class="nx">SkroutzAxios</span><span class="p">;</span>
<span class="kd">function</span> <span class="nf">get</span><span class="p">(</span><span class="nx">url</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="nf">httpRequest</span><span class="p">(</span><span class="nx">url</span><span class="p">,</span> <span class="p">{</span>
<span class="na">method</span><span class="p">:</span> <span class="dl">'</span><span class="s1">GET</span><span class="dl">'</span><span class="p">,</span>
<span class="na">credentials</span><span class="p">:</span> <span class="dl">'</span><span class="s1">same-origin</span><span class="dl">'</span><span class="p">,</span>
<span class="na">mode</span><span class="p">:</span> <span class="dl">'</span><span class="s1">cors</span><span class="dl">'</span><span class="p">,</span>
<span class="na">cache</span><span class="p">:</span> <span class="dl">'</span><span class="s1">default</span><span class="dl">'</span><span class="p">,</span>
<span class="na">cancelToken</span><span class="p">:</span> <span class="nx">source</span><span class="p">.</span><span class="nx">token</span> <span class="c1">// Provide a cancellation token</span>
<span class="p">}).</span><span class="nf">then</span><span class="p">(</span><span class="nx">checkStatus</span><span class="p">);</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nf">getStats</span><span class="p">(</span><span class="nx">searchUri</span> <span class="o">=</span> <span class="dl">''</span><span class="p">,</span> <span class="nx">metric</span> <span class="o">=</span> <span class="dl">''</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">let</span> <span class="nx">endpoint</span> <span class="o">=</span> <span class="s2">`</span><span class="p">${</span><span class="nx">STATS_ENDPOINT</span><span class="p">}${</span><span class="nx">searchUri</span><span class="p">}</span><span class="s2">`</span><span class="p">;</span>
<span class="k">if </span><span class="p">(</span><span class="nx">metric</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">symbol</span> <span class="o">=</span> <span class="nx">searchUri</span> <span class="p">?</span> <span class="dl">'</span><span class="s1">&</span><span class="dl">'</span> <span class="p">:</span> <span class="dl">'</span><span class="s1">?</span><span class="dl">'</span><span class="p">;</span>
<span class="nx">endpoint</span> <span class="o">+=</span> <span class="s2">`</span><span class="p">${</span><span class="nx">symbol</span><span class="p">}</span><span class="s2">metric=</span><span class="p">${</span><span class="nx">metric</span><span class="p">}</span><span class="s2">`</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nf">get</span><span class="p">(</span><span class="nx">endpoint</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>Then, we can just call <code class="language-plaintext highlighter-rouge">source.cancel();</code> when we want to cancel the pending requests. For more details about the usage of <code class="language-plaintext highlighter-rouge">CancelToken</code> visit the axios <a href="https://github.com/axios/axios#cancellation">docs</a>.</p>
<p>Finally, as we can see from the previous performance report, we can take some actions in order to improve the overall score:</p>
<ul>
<li>Reduce the initial server response time, because React waits for the document to be served before it starts fetching data</li>
<li>Remove potentially unused JavaScript, as it adds unnecessary network activity</li>
<li>Eliminate render-blocking resources and deliver non-critical assets asynchronously</li>
</ul>
<p><a href="https://engineering.skroutz.gr/blog/refactor-react-app-to-progressively-load-its-data/">Refactoring a React app to progressively load its data</a> was originally published by John Kapantzakis at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on April 07, 2021.</p>https://engineering.skroutz.gr/blog/how-we-classify-products2021-03-05T22:00:00+00:002021-03-05T22:00:00+00:00George Hadjigeorgiouhttps://engineering.skroutz.gr<p>Skroutz is a marketplace that hosts more than 8,500 merchants and keeps
adding 500 new merchants per month. This translates to more than 80,000
new offers per day with peaks as high as 200,000 on certain occasions.</p>
<p>Our product content team is one of the largest in the organization (140
people as of March 2021), but to handle high product loads we had to
implement a number of automation tools for product classification.</p>
<p>Merchants have two ways of uploading products to Skroutz:</p>
<ol>
<li>Via an XML file that always reflects the merchant’s up-to-date offers,
including new ones</li>
<li>Through our merchants CMS (used by merchants without a platform of
their own)</li>
</ol>
<p>When a new offer (or a product, in Skroutz jargon) is detected, it is identified
and, if possible, placed in the appropriate category and merged into the corresponding SKU.</p>
<p>Before we move on to describe how the classification is achieved it is
important to describe some of the basics:</p>
<ul>
<li>An <a href="https://en.wikipedia.org/wiki/Stock_keeping_unit">SKU</a> is a brand’s unique product that is uniquely described by a part number or an <a href="https://en.wikipedia.org/wiki/International_Article_Number">EAN</a>.</li>
<li>An offer is unique to a merchant but generally describes
an SKU. The offer should in reality carry all the necessary attributes of the
SKU so that it can be correctly identified but, unfortunately, that’s rarely the
case.</li>
<li>A category represents a class of SKUs (e.g. smartphones, sneakers).
Merchants and Skroutz almost always have different categorization
hierarchies. Our category tree has many levels, but only leaves contain
SKUs; top-level categories are there to help consumers navigate.</li>
<li>SKUs have specifications in a structured format that can be used by
consumers to filter results. E.g. a smartphone will have a screen size
specification whereas a dress will have a color specification. These
specifications are defined on a category level.</li>
<li>An SKU belongs to a brand or a manufacturer (e.g. Samsung)</li>
</ul>
<p>The above relations are better depicted in the diagram below:</p>
<p><img src="https://engineering.skroutz.gr/images/2021-classify/model.png" alt="" /></p>
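<p>For illustration, the entities above can be sketched as a handful of Python data classes (names and fields here are hypothetical, not Skroutz’s actual schema):</p>

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    # Only leaf categories contain SKUs; upper levels aid navigation.
    name: str
    parent: "Category | None" = None
    specifications: "list[str]" = field(default_factory=list)

@dataclass
class Sku:
    # A brand's unique product, identified by a part number or an EAN.
    part_number: str
    brand: str
    category: Category
    specs: "dict[str, str]" = field(default_factory=dict)

@dataclass
class Offer:
    # Unique to a merchant; should carry enough attributes
    # to identify the SKU it describes, but rarely does.
    merchant: str
    title: str
    sku: "Sku | None" = None  # None until classified

leaf = Category("Smartphones", parent=Category("Technology"),
                specifications=["screen size", "storage"])
sku = Sku("M2007J20CG", "Xiaomi", leaf, {"screen size": '6.67"'})
offer = Offer("some-shop", 'Xiaomi Poco X3 6.67" 6GB/128GB', sku)
```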
<p>On average, 60% of all incoming products/offers belong to an existing
SKU. Ideally, the SKU part numbers or EAN should be enough for SKU
classification but in reality those attributes are often either missing or are
just plain wrong.</p>
<h2 id="classification">Classification</h2>
<p>Our classification tool goes by the name of Tron with two major
subtools:</p>
<ul>
<li><strong>Megatron</strong>: classifies incoming products to categories using
machine learning</li>
<li><strong>Ngntron (or new generation tron)</strong>: classifies products into SKUs using feature extraction</li>
</ul>
<p>The purpose of this post is to describe Ngntron and how feature analysis
has helped us build a myriad of satellite tools beyond classification
itself.</p>
<h3 id="sku-classification">SKU classification</h3>
<p>Incoming products are plain-text representations of their attributes,
ideally including everything necessary, such as brand name, color and
size.</p>
<p>Below are examples of such products from various categories:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Xiaomi Poco X3 Dual Sim 6.67" 6GB/128GB 4G NFC Γκρι M2007J20CG
LG Ψυγειοκαταψύκτης GBP62DSNFN (384lt, A+++) Total No Frost
Γυναικεία Παπούτσια Vans | Old Skool Platform Black | Womens Shoes Black VN0A3B3UY281
</code></pre></div></div>
<p>As evident from the examples above, product descriptions follow no standard pattern
and in some cases include marketing information not relevant to the
product such as special discounts. Below are the most common problems found in
product descriptions:</p>
<ul>
<li>Part numbers or EANs can refer to product families (e.g. Apple iPhone 12)
and not specific variants (e.g. Apple iPhone 12 64GB Black)</li>
<li>Random strings instead of part numbers</li>
<li>Missing or partial information</li>
<li>Country or region specific part numbers / EANs</li>
<li>Multiple part numbers</li>
<li>Redundant or irrelevant information</li>
</ul>
<p>Our first approach, using plain TF-IDF, yielded poor results. After
all, our purpose was not to rank products by relevance, but to find
that one true perfect match, or to determine that this is a new product
matching none of the existing ones.</p>
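<p>A toy bag-of-words cosine similarity (a crude stand-in for TF-IDF, purely illustrative) shows the problem: near-miss variants score almost as high as the exact match, so no relevance threshold can safely separate “perfect match” from “different product”:</p>

```python
import math
from collections import Counter

def cosine(a, b):
    # Plain bag-of-words cosine similarity over lowercased tokens.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values())) *
            math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm

offer = "Apple iPhone 12 64GB Black"
skus = ["Apple iPhone 12 64GB Black",
        "Apple iPhone 12 128GB Black",
        "Apple iPhone 12 Mini 64GB Black"]
scores = {s: round(cosine(offer, s), 2) for s in skus}
# The 'wrong' variants score nearly as high as the exact match,
# so ranking by relevance cannot decide match vs. new product.
```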
<h3 id="feature-extraction">Feature extraction</h3>
<p>The process of feature extraction aims to identify specific attributes
in the text representation of the product and tag them or even better
link them to known models.</p>
<p>For example, the first product in the list above yields the following results:</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">product_name</span> <span class="o">=</span> <span class="s1">'Xiaomi Poco X3 Dual Sim 6.67" 6GB/128GB 4G NFC Γκρι M2007J20CG'</span>
<span class="n">analyzer</span> <span class="o">=</span> <span class="no">Ngntron</span><span class="o">::</span><span class="no">Analyzers</span><span class="o">::</span><span class="no">ProductAnalyzer</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">product_name</span><span class="p">)</span>
<span class="nb">puts</span> <span class="n">analyzer</span><span class="p">.</span><span class="nf">phrase</span></code></pre></figure>
<p>The above snippet yields the following results:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[manufacturer] 0,0 => Xiaomi
[model] 1,2 => Poco X3
[filter] 3,4 => Dual Sim
[feature] 3,3 => Dual
[feature] 5,5 => 6.67"
[] 6,6 => 6GB/128GB
[feature] 7,7 => 4G
[filter, feature] 8,8 => NFC
[filter, feature, color] 9,9 => Γκρι
[] 10,10 => M2007J20CG
[pn] 11,11 => 30371
</code></pre></div></div>
<p>Each identified word in the original phrase has been tagged with one or
more tags that correspond to a specific attribute. Similarly:</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">product_name</span> <span class="o">=</span> <span class="s1">'Γυναικεία Παπούτσια Vans | Old Skool Platform Black | Womens Shoes Black VN0A3B3UY281'</span>
<span class="n">analyzer</span> <span class="o">=</span> <span class="no">Ngntron</span><span class="o">::</span><span class="no">Analyzers</span><span class="o">::</span><span class="no">ProductAnalyzer</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">product_name</span><span class="p">)</span>
<span class="nb">puts</span> <span class="n">analyzer</span><span class="p">.</span><span class="nf">phrase</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[filter, feature] 0,0 => Γυναικεία
[category] 1,1 => Παπούτσια
[manufacturer] 2,2 => Vans
[filter, feature] 3,4 => Old Skool
[model] 5,5 => Platform
[filter, feature, color] 6,6 (9, 9) => Black
[] 7,7 => Womens
[category] 8,8 => Shoes
[pn] 10,10 => VN0A3B3UY281
</code></pre></div></div>
<p>Each tag contains relevant information as to how it was identified and
what model it is referencing. For example the tag <code class="language-plaintext highlighter-rouge">manufacturer</code> would
have the manufacturer <code class="language-plaintext highlighter-rouge">id</code> that it matched.</p>
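<p>A rough sketch of what such dictionary-driven span tagging could look like (all names, tags and ids here are hypothetical; the real analyzer is far more involved):</p>

```python
# Toy span tagger: look up known attribute values in a phrase, longest
# match first, and emit (tags, start, end, text, model_id) tuples
# resembling the listings above.
KNOWN = {
    ("manufacturer",): {"xiaomi": 1, "vans": 2},   # value -> model id
    ("model",): {"poco x3": 10, "platform": 11},
    ("filter", "feature", "color"): {"black": 20},
}

def tag_phrase(phrase):
    words = phrase.lower().split()
    results, i = [], 0
    while i < len(words):
        matched = False
        for length in (2, 1):              # prefer two-word matches
            if i + length > len(words):
                continue
            chunk = " ".join(words[i:i + length])
            for tags, values in KNOWN.items():
                if chunk in values:
                    results.append((list(tags), i, i + length - 1,
                                    chunk, values[chunk]))
                    i += length
                    matched = True
                    break
            if matched:
                break
        if not matched:                    # unknown word: empty tag list
            results.append(([], i, i, words[i], None))
            i += 1
    return results

tags = tag_phrase("Xiaomi Poco X3 Black")
```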
<p>The analyzer employs various heuristics and tricks to make sure that
all tags are identified, such as aliases (Western Digital vs WD, Call of
Duty vs COD), years (2008 vs 08), numbers (IV vs 4), and the list goes
on.</p>
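<p>A naive version of such normalization heuristics might look like the following sketch (the data and rules are illustrative; the real analyzer applies them with far more context, e.g. to avoid turning a model number like “12” into a year):</p>

```python
# Toy normalization pass: map known aliases, roman numerals and
# two-digit years to canonical forms before matching.
ALIASES = {"wd": "western digital", "cod": "call of duty"}
ROMAN = {"ii": "2", "iii": "3", "iv": "4"}

def normalize(token):
    t = token.lower()
    if t in ALIASES:
        return ALIASES[t]
    if t in ROMAN:
        return ROMAN[t]
    if len(t) == 2 and t.isdigit():   # '08' -> '2008' (deliberately naive)
        return str(2000 + int(t))
    return t

def normalize_phrase(phrase):
    return " ".join(normalize(t) for t in phrase.split())
```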
<h3 id="feature-comparison">Feature comparison</h3>
<p>When a new product arrives, its analysis is stored in a serialized
format and updated every time the product is changed.</p>
<p>After the category classification has ended the SKU classification takes place
by retrieving the product’s analysis and comparing it with existing
SKUs.</p>
<p>Based on some predefined strategies, such as <code class="language-plaintext highlighter-rouge">absolute high entropy PN match</code>, the
comparison phase will yield a match with a certain confidence level. We
have 3 confidence levels:</p>
<ul>
<li><strong>Auto</strong>: the product is classified with no human intervention</li>
<li><strong>Semi-Auto</strong>: the product is classified but a human must confirm at some
point</li>
<li><strong>Manual</strong>: a human will approve this classification but until then the
new product is not classified</li>
</ul>
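<p>The dispatch on confidence levels could be sketched as follows (the strategy names and the mapping are hypothetical):</p>

```python
from enum import Enum

class Confidence(Enum):
    AUTO = "auto"          # classified, no human needed
    SEMI_AUTO = "semi"     # classified now, confirmed later
    MANUAL = "manual"      # held until a human approves

# Hypothetical mapping of matching strategies to confidence levels.
STRATEGY_CONFIDENCE = {
    "absolute_high_entropy_pn_match": Confidence.AUTO,
    "model_plus_features_match": Confidence.SEMI_AUTO,
    "fuzzy_title_match": Confidence.MANUAL,
}

def classify(strategy):
    """Return (classified_immediately, needs_human)."""
    conf = STRATEGY_CONFIDENCE[strategy]
    if conf is Confidence.AUTO:
        return True, False
    if conf is Confidence.SEMI_AUTO:
        return True, True    # live now, confirmed at some point
    return False, True       # waits for approval before going live
```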
<h3 id="stats">Stats</h3>
<p>As of today, more than 45% of incoming products that belong to existing
SKUs are classified with no human intervention, another 40% is classified
but requires approval, and 10% is classified after a human approves the
match. Only 5% of new products escape Ngntron and require a human to
look for a match.</p>
<p>With the help of Ngntron, merchants with thousands of products can
go live with more than 90% of their product catalog listed on Skroutz in
just an hour.</p>
<h3 id="other-uses">Other uses</h3>
<p>We use Ngntron’s feature extraction capabilities not just for
classification but for other cases as well. Our internal project QuLA
will use the same pipeline to determine in advance whether an XML feed
is suitable for Skroutz and advise the account team accordingly.</p>
<p>We also use extracted features to guide the content team when
retroactively adding specifications to a category.</p>
<h3 id="scaling">Scaling</h3>
<p>Since we expect to reach 20,000 merchants and 250,000 products per day in the
near future, classification automation is one of the most important and
high impact processes in Skroutz.</p>
<p>We have already tweaked the algorithm to learn from past classifications
and adapt its category based confidence levels. In some categories, more
than 90% of products are auto classified, greatly reducing the load of
the content team and thus enabling us to scale our merchant base.</p>
<p>Of course, even 5% of manual classification at such a scale is a huge
load, which is why the content engineering team is already optimizing Ngntron
to further reduce the amount of manual work.</p>
<p>Oh, and by the way, <a href="https://www.skroutz.gr/careers/52">they are hiring</a>!</p>
<hr />
<p>If you enjoyed reading this post and are curious to learn how Ngntron
and other tools in Skroutz work, <a href="https://www.skroutz.gr/careers#job-openings">checkout our open positions</a></p>
<p><a href="https://engineering.skroutz.gr/blog/how-we-classify-products/">How we classify products at Skroutz</a> was originally published by George Hadjigeorgiou at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on March 05, 2021.</p>https://engineering.skroutz.gr/blog/uncovering-a-24-year-old-bug-in-the-linux-kernel2021-02-10T22:00:00+00:002021-02-10T22:00:00+00:00Apollon Oikonomopouloshttps://engineering.skroutz.gr<p>As part of our standard toolkit, we provide each developer at Skroutz
with a writable database snapshot against which she can develop. These
snapshots are updated daily through a pipeline that involves taking an
LVM snapshot of production data, anonymizing the dataset by stripping
all personal data, and transferring it via rsync to the development
database servers. The development servers in turn use ZFS snapshots to
expose a copy-on-write snapshot to each developer, with
self-service tools allowing rollback or upgrades to newer snapshots.</p>
<p>We use the same pipeline to expose MariaDB and MongoDB data, with a
full dataset size of 600GB and 200GB respectively, and a slightly
modified pipeline for Elasticsearch. While on-disk data changes
significantly for all data sources, rsync still saves significant time
by transferring roughly 1/3 of the full data set every night. This
setup has worked rather well for the better part of a decade and has
managed to scale from 15 developers to 150. However, as with most
systems, it has had its fair share of maintenance and has given us
some interesting moments.</p>
<p>One of the most interesting issues we encountered led to the discovery
of a fairly old bug in the Linux kernel TCP implementation: every now
and then, an rsync transfer from a source server would hang
indefinitely for no apparent reason, as — apart from the stuck transfer —
everything else seemed to be in order. What’s more, for reasons that became
apparent later, the issue could not be reproduced at will, although
some actions (e.g. adding an rsync-level rate limit) seemed to make
the issue less frequent, with frequency ranging from once or twice per
week to once every three months.</p>
<p>As is not unusual in these cases, we had more urgent systems and issues to
attend to, so we labeled this a “race condition in rsync” that we
should definitely look into at some point, and worked around it by
throttling the rsync transfers.</p>
<p>Until it started biting us every single day.</p>
<h2 id="rsync-as-a-pipeline">rsync as a pipeline</h2>
<p>While not strictly necessary, knowing how rsync works internally will help
understand the analysis that follows. The rsync site contains <a href="https://rsync.samba.org/how-rsync-works.html">a thorough
description</a> of rsync’s internal architecture, so I’ll try to
summarize the most important points here:</p>
<ol>
<li>
<p>rsync starts off as a single process on the client and a single
process on the server, communicating via a socket pair. When using
the rsync daemon, as in our case, communication is done over a
plain TCP connection.</p>
</li>
<li>
<p>Based on the direction of sync, after the initial handshake is
over, each end assumes a <em>role</em>, either that of the <em>sender</em>, or
that of the <em>receiver</em>. In our case the client is the receiver,
and the server is the sender.</p>
</li>
<li>
<p>The receiver forks an additional process called the <em>generator</em>,
sharing the socket with the <em>receiver</em> process. The <em>generator</em>
figures out what it needs to ask from the <em>sender</em>, and the
<em>sender</em> subsequently sends the data to the <em>receiver</em>. What we
essentially have after this step is a pipeline, <em>generator</em> →
<em>sender</em> → <em>receiver</em>, where the arrows are the two directions of
<em>the same</em> TCP connection. While there is some signaling involved,
the pipeline operates in a <em>blocking</em> fashion and relies on OS
buffers and TCP receive windows to apply backpressure.</p>
</li>
</ol>
<h2 id="a-ghost-in-the-network">A ghost in the network?</h2>
<p>Our first reaction when we encountered the issue was to suspect the
network for errors, which was a <em>reasonable</em> thing to do since we had
recently upgraded our servers and switches. After eliminating the
usual suspects (NIC firmware bugs involving TSO/GSO/GRO/VLAN
offloading, excessive packet drops or CRC errors at the switches etc),
we came to the conclusion that everything was normal and something
else had to be going on.</p>
<p>Attaching to the hung processes with strace and gdb told us little: the
generator was hung on <code class="language-plaintext highlighter-rouge">send()</code> and the sender and receiver were hung
on <code class="language-plaintext highlighter-rouge">recv()</code>, yet no data was moving. However, turning to the kernel on
both systems revealed something interesting! On the client the rsync
socket shared between the <em>generator</em> and the <em>receiver</em> processes was
in the following state:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>ss <span class="nt">-mito</span> dst :873
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 392827 ❶ 2001:db8:2a::3:38022 2001:db8:2a::18:rsync timer:<span class="o">(</span>persist,1min56sec,0<span class="o">)</span>
skmem:<span class="o">(</span>r0,rb4194304,t0,tb530944,f3733,w401771,o0,bl0,d757<span class="o">)</span> ts sack cubic wscale:7,7 rto:204 backoff:15 rtt:2.06/0.541 ato:40 mss:1428 cwnd:10 ssthresh:46 bytes_acked:22924107 bytes_received:100439119971 segs_out:7191833 segs_in:70503044 data_segs_out:16161 data_segs_in:70502223 send 55.5Mbps lastsnd:16281856 lastrcv:14261988 lastack:3164 pacing_rate 133.1Mbps retrans:0/11 rcv_rtt:20 rcv_space:2107888 notsent:392827 minrtt:0.189</code></pre></figure>
<p>while on the server, the socket state was the following:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>ss <span class="nt">-mito</span> src :873
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 2001:db8:2a::18:rsync 2001:db8:2a::3:38022 timer:<span class="o">(</span>keepalive,3min7sec,0<span class="o">)</span>
skmem:<span class="o">(</span>r0,rb3540548,t0,tb4194304,f0,w0,o0,bl0,d292<span class="o">)</span> ts sack cubic wscale:7,7 rto:204 rtt:1.234/1.809 ato:40 mss:1428 cwnd:1453 ssthresh:1431 bytes_acked:100439119971 bytes_received:22924106 segs_out:70503089 segs_in:7191833 data_segs_out:70502269 data_segs_in:16161 send 13451.4Mbps lastsnd:14277708 lastrcv:16297572 lastack:7012576 pacing_rate 16140.1Mbps retrans:0/794 rcv_rtt:7.5 rcv_space:589824 minrtt:0.026</code></pre></figure>
<p>The interesting thing here is that there are roughly 390KB of data on the
client, queued to be sent (❶ in the first output) by the
<em>generator</em> to the server; however, while the server has an empty <code class="language-plaintext highlighter-rouge">Recv-Q</code>
and can accept data, nothing seems to be moving forward. If <code class="language-plaintext highlighter-rouge">Recv-Q</code>
in the second output were non-zero, we would be looking at rsync on the
server being stuck and not reading from the network; here, however, it
is obvious that rsync has consumed all incoming data and is not to
blame.</p>
<p>So why is data queued up on one end of the connection, while the other end is
obviously able to accept it? The answer is conveniently hidden in the <code class="language-plaintext highlighter-rouge">timer</code>
fields of both <code class="language-plaintext highlighter-rouge">ss</code> outputs, especially in
<code class="language-plaintext highlighter-rouge">timer:(persist,1min56sec,0)</code>. Quoting <code class="language-plaintext highlighter-rouge">ss(8)</code>:</p>
<figure class="highlight"><pre><code class="language-man" data-lang="man"> -o, --options
Show timer information. For TCP protocol, the output format is:
timer:(<timer_name>,<expire_time>,<retrans>)
<timer_name>
the name of the timer, there are five kind of timer names:
on : means one of these timers: TCP retrans timer, TCP
early retrans timer and tail loss probe timer
keepalive: tcp keep alive timer
timewait: timewait stage timer
persist: zero window probe timer
unknown: none of the above timers</code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">persist</code> means that the connection has received a zero window
advertisement and is waiting for the peer to advertise a non-zero
window.</p>
<h2 id="tcp-zero-windows-and-zero-window-probes">TCP Zero Windows and Zero Window Probes</h2>
<p>TCP implements flow control by limiting the data in flight using a sliding
window called the <em>receive window</em>. Wikipedia has a <a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Flow_control">good description</a>, but in short each end of a TCP connection advertises how much
data it is willing to buffer for the connection, i.e. how much data the other
end may send before waiting for an acknowledgment.</p>
<p>When one side’s receive buffer (<code class="language-plaintext highlighter-rouge">Recv-Q</code>) fills up (in this case
because the rsync process is doing disk I/O at a speed slower than the
network’s), it will send out a zero window advertisement, which will
put that direction of the connection on hold. When buffer space
eventually frees up, the kernel will send an unsolicited window update
with a non-zero window size, and the data transfer continues. To be
safe, just in case this unsolicited window update is lost, the other
end will regularly poll the connection state using the so-called Zero
Window Probes (the <code class="language-plaintext highlighter-rouge">persist</code> mode we are seeing here).</p>
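<p>The buffer accounting behind zero window advertisements can be illustrated with a toy model (pure illustration, not kernel code):</p>

```python
class ToyReceiver:
    """Toy model of TCP receive-window accounting."""
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.queued = 0   # bytes received but not yet read by the app

    def advertised_window(self):
        # The window shrinks as unread data accumulates in Recv-Q.
        return self.buffer_size - self.queued

    def receive(self, nbytes):
        assert nbytes <= self.advertised_window(), "sender overran the window"
        self.queued += nbytes

    def app_read(self, nbytes):
        # Freeing buffer space re-opens the window; the kernel then
        # sends an unsolicited window update to the peer.
        self.queued -= min(nbytes, self.queued)

rx = ToyReceiver(buffer_size=65536)
rx.receive(65536)                      # slow reader: buffer fills up...
zero_window = rx.advertised_window()   # ...and a zero window is advertised
rx.app_read(16384)                     # application drains some data
reopened = rx.advertised_window()      # window update advertises 16384
```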
<h2 id="the-window-is-stuck-closed">The window is stuck closed</h2>
<p>It’s now time to dive a couple of layers deeper and use <code class="language-plaintext highlighter-rouge">tcpdump</code> to
see what’s going on at the network level:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">[…]
09:34:34.165148 0c:c4:7a:f9:68:e4 > 0c:c4:7a:f9:69:78, ethertype IPv6 (0x86dd), length 86: (flowlabel 0xcbf6f, hlim 64, next-header TCP (6) payload length: 32) 2001:db8:2a::3.38022 > 2001:db8:2a::18.873: Flags [.], cksum 0x711b (incorrect -> 0x4d39), seq 4212361595, ack 1253278418, win 16384, options [nop,nop,TS val 2864739840 ecr 2885730760], length 0
09:34:34.165354 0c:c4:7a:f9:69:78 > 0c:c4:7a:f9:68:e4, ethertype IPv6 (0x86dd), length 86: (flowlabel 0x25712, hlim 64, next-header TCP (6) payload length: 32) 2001:db8:2a::18.873 > 2001:db8:2a::3.38022: Flags [.], cksum 0x1914 (correct), seq 1253278418, ack 4212361596, win 13831, options [nop,nop,TS val 2885760967 ecr 2863021624], length 0
[… repeats every 2 mins]</code></pre></figure>
<p>The first packet is the rsync client’s zero window probe, the second
packet is the server’s response. Surprisingly enough, the server is
advertising a non-zero window size of 13831 bytes¹, which the client
apparently ignores.</p>
<p>¹ actually multiplied by 128 because of a <a href="https://en.wikipedia.org/wiki/TCP_window_scale_option">window scaling</a> factor
of 7</p>
<p>We are finally making some progress and have a case to work on! At
some point the client encountered a zero window advertisement from the
server as part of regular TCP flow control, but then the window failed
to re-open for some reason. The client seems to be still ignoring
the new window advertised by the server and this is why the transfer
is stuck.</p>
<h2 id="linux-tcp-input-processing">Linux TCP input processing</h2>
<p>By now it’s obvious that the TCP connection is in a weird state on the
rsync client. Since TCP flow control happens at the kernel level, to
get to the root of this we need to look at how the Linux kernel
handles incoming TCP acknowledgments and try to figure out why it
ignores the incoming window advertisement.</p>
<p>Incoming TCP packet processing happens in
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_input.c"><code class="language-plaintext highlighter-rouge">net/ipv4/tcp_input.c</code></a>. Despite
the <code class="language-plaintext highlighter-rouge">ipv4</code> component in the path, this is mostly shared code between
IPv4 and IPv6.</p>
<p>Digging a bit through the code we find out that incoming window
updates are handled in
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_input.c?id=c3df39ac9b0e3747bf8233ea9ce4ed5ceb3199d3#n3552"><code class="language-plaintext highlighter-rouge">tcp_ack_update_window</code></a>
and actually updating the window is guarded by the following function:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cm">/* Check that window update is acceptable.
* The function assumes that snd_una<=ack<=snd_next.
*/</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="n">bool</span> <span class="nf">tcp_may_update_window</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">tcp_sock</span> <span class="o">*</span><span class="n">tp</span><span class="p">,</span>
<span class="k">const</span> <span class="n">u32</span> <span class="n">ack</span><span class="p">,</span> <span class="k">const</span> <span class="n">u32</span> <span class="n">ack_seq</span><span class="p">,</span>
<span class="k">const</span> <span class="n">u32</span> <span class="n">nwin</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">after</span><span class="p">(</span><span class="n">ack</span><span class="p">,</span> <span class="n">tp</span><span class="o">-></span><span class="n">snd_una</span><span class="p">)</span> <span class="o">||</span> <span class="err">❶</span>
<span class="n">after</span><span class="p">(</span><span class="n">ack_seq</span><span class="p">,</span> <span class="n">tp</span><span class="o">-></span><span class="n">snd_wl1</span><span class="p">)</span> <span class="o">||</span> <span class="err">❷</span>
<span class="p">(</span><span class="n">ack_seq</span> <span class="o">==</span> <span class="n">tp</span><span class="o">-></span><span class="n">snd_wl1</span> <span class="o">&&</span> <span class="n">nwin</span> <span class="o">></span> <span class="n">tp</span><span class="o">-></span><span class="n">snd_wnd</span><span class="p">);</span> <span class="err">❸</span>
<span class="p">}</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">ack</code>, <code class="language-plaintext highlighter-rouge">ack_seq</code>, <code class="language-plaintext highlighter-rouge">snd_wl1</code> and <code class="language-plaintext highlighter-rouge">snd_una</code> variables hold TCP
sequence numbers that are used in TCP’s sliding window to keep track
of the data exchanged over the wire. These sequence numbers are 32-bit
unsigned integers (<code class="language-plaintext highlighter-rouge">u32</code>) and are incremented by 1 for each byte that
is exchanged, beginning from an arbitrary initial value (<em>initial
sequence number</em>). In particular:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">ack_seq</code> is the sequence number of the incoming segment</li>
<li><code class="language-plaintext highlighter-rouge">ack</code> is the <em>acknowledgment number</em> contained in the incoming
segment, i.e. it acknowledges the sequence number of the last
segment the peer received from us.</li>
<li><code class="language-plaintext highlighter-rouge">snd_wl1</code> is the sequence number of the incoming segment that last
updated the peer’s receive window.</li>
<li><code class="language-plaintext highlighter-rouge">snd_una</code> is the sequence number of the first <em>unacknowledged</em>
segment, i.e. a segment we have sent but has not been yet
acknowledged by the peer.</li>
</ul>
<p>Being fixed-size integers, the sequence numbers will eventually wrap
around, so the <code class="language-plaintext highlighter-rouge">after()</code> macro takes care of comparing two sequence
numbers <a href="https://en.wikipedia.org/wiki/Serial_number_arithmetic">in the face of wraparounds</a>.</p>
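<p>The same wraparound-safe comparison can be expressed in Python, mirroring the kernel’s macro, which casts the difference to a signed 32-bit integer:</p>

```python
MASK = 0xFFFFFFFF  # TCP sequence numbers are 32-bit unsigned

def after(seq2, seq1):
    """True if seq2 comes after seq1 in sequence space (kernel: after())."""
    # Equivalent to the kernel's (s32)(seq1 - seq2) < 0: the difference,
    # viewed as a signed 32-bit value, is negative.
    return ((seq1 - seq2) & MASK) >= 0x80000000

# A plain '>' breaks across the 2**32 wraparound; after() does not:
naive = 100 > 0xFFFFFFFA          # False, although 100 is 'later'
wrapped = after(100, 0xFFFFFFFA)  # True: the sequence space wrapped
```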
<p>For the record, the <code class="language-plaintext highlighter-rouge">snd_una</code> and <code class="language-plaintext highlighter-rouge">snd_wl1</code> names come directly from
the <a href="https://tools.ietf.org/html/rfc793#section-3.2">original TCP specification in RFC 793</a>, back in 1981!</p>
<p>Translating the rather cryptic check into plain English, we are
willing to accept a window update from a peer if:</p>
<dl>
<dt>❶</dt>
<dd>our peer acknowledges the receipt of data we previously sent; <em>or</em></dd>
<dt>❷</dt>
<dd>our peer is sending new data since the previous window update; <em>or</em></dd>
<dt>❸</dt>
<dd>our peer isn’t sending us new data since the previous window update,
but is advertising a larger window</dd>
</dl>
<p>Note that the comparison of <code class="language-plaintext highlighter-rouge">ack_seq</code> with <code class="language-plaintext highlighter-rouge">snd_wl1</code> is done to make
sure that the window is not accidentally updated by a
(retransmission of a) segment that was seen earlier.</p>
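<p>For experimentation, the check transliterates almost directly to Python (with the wraparound-safe <code class="language-plaintext highlighter-rouge">after()</code> inlined as a helper):</p>

```python
MASK32 = 0xFFFFFFFF

def after(seq2, seq1):
    # Wraparound-safe 'seq2 comes after seq1' (the kernel's after() macro).
    return ((seq1 - seq2) & MASK32) >= 0x80000000

def tcp_may_update_window(snd_una, snd_wl1, snd_wnd, ack, ack_seq, nwin):
    return (after(ack, snd_una) or                     # (1) acks new data
            after(ack_seq, snd_wl1) or                 # (2) peer sent new data
            (ack_seq == snd_wl1 and nwin > snd_wnd))   # (3) same seq, larger window

# Condition (3) in isolation: nothing newly acked, no new data, yet the
# advertised window grew from 0, so the update should be accepted.
accepts = tcp_may_update_window(snd_una=1000, snd_wl1=500, snd_wnd=0,
                                ack=1000, ack_seq=500, nwin=13831)
```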
<p>In our case, at least condition ❸ should be able to re-open the window, but
apparently it doesn’t and we need access to these variables to figure out what
is happening. Unfortunately, these variables are part of the internal kernel
state and are not directly exposed to userspace, so it’s time to get a bit
creative.</p>
<h2 id="accessing-the-internal-kernel-state">Accessing the internal kernel state</h2>
<p>To get access to the kernel state, we somehow need to run code inside
the kernel. One way would be to patch the kernel with a few <code class="language-plaintext highlighter-rouge">printk()</code>
calls here and there, but that would require rebooting the machine and
waiting for rsync to hang again. Rather, we opted to live-patch the
kernel using <a href="https://sourceware.org/systemtap/">systemtap</a> with the following script:</p>
<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nv">probe</span> <span class="nv">kernel</span><span class="o">.</span><span class="nv">statement</span><span class="p">("</span><span class="s2">tcp_ack@./net/ipv4/tcp_input.c:3751</span><span class="p">")</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="nv">$sk</span><span class="o">-></span><span class="nv">sk_send_head</span> <span class="o">!=</span> <span class="nv">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="nv">ack_seq</span> <span class="o">=</span> <span class="nv">@cast</span><span class="p">(</span><span class="nv">&$skb</span><span class="o">-></span><span class="nv">cb</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">"</span><span class="s2">tcp_skb_cb</span><span class="p">",</span> <span class="p">"</span><span class="s2">kernel<net/tcp.h></span><span class="p">")</span><span class="o">-></span><span class="nv">seq</span>
<span class="nb">printf</span><span class="p">("</span><span class="s2">ack: %d, ack_seq: %d, prior_snd_una: %d</span><span class="se">\n</span><span class="p">",</span> <span class="nv">$ack</span><span class="p">,</span> <span class="nv">ack_seq</span><span class="p">,</span> <span class="nv">$prior_snd_una</span><span class="p">)</span>
<span class="nv">seq</span> <span class="o">=</span> <span class="nv">@cast</span><span class="p">(</span><span class="nv">&$sk</span><span class="o">-></span><span class="nv">sk_send_head</span><span class="o">-></span><span class="nv">cb</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">"</span><span class="s2">tcp_skb_cb</span><span class="p">",</span> <span class="p">"</span><span class="s2">kernel<net/tcp.h></span><span class="p">")</span><span class="o">-></span><span class="nv">seq</span>
<span class="nv">end_seq</span> <span class="o">=</span> <span class="nv">@cast</span><span class="p">(</span><span class="nv">&$sk</span><span class="o">-></span><span class="nv">sk_send_head</span><span class="o">-></span><span class="nv">cb</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">"</span><span class="s2">tcp_skb_cb</span><span class="p">",</span> <span class="p">"</span><span class="s2">kernel<net/tcp.h></span><span class="p">")</span><span class="o">-></span><span class="nv">end_seq</span>
<span class="nb">printf</span><span class="p">("</span><span class="s2">sk_send_head seq:%d, end_seq: %d</span><span class="se">\n</span><span class="p">",</span> <span class="nv">seq</span><span class="p">,</span> <span class="nv">end_seq</span><span class="p">)</span>
<span class="nv">snd_wnd</span> <span class="o">=</span> <span class="nv">@cast</span><span class="p">(</span><span class="nv">$sk</span><span class="p">,</span> <span class="p">"</span><span class="s2">tcp_sock</span><span class="p">",</span> <span class="p">"</span><span class="s2">kernel<linux/tcp.h></span><span class="p">")</span><span class="o">-></span><span class="nv">snd_wnd</span>
<span class="nv">snd_wl1</span> <span class="o">=</span> <span class="nv">@cast</span><span class="p">(</span><span class="nv">$sk</span><span class="p">,</span> <span class="p">"</span><span class="s2">tcp_sock</span><span class="p">",</span> <span class="p">"</span><span class="s2">kernel<linux/tcp.h></span><span class="p">")</span><span class="o">-></span><span class="nv">snd_wl1</span>
<span class="nv">ts_recent</span> <span class="o">=</span> <span class="nv">@cast</span><span class="p">(</span><span class="nv">$sk</span><span class="p">,</span> <span class="p">"</span><span class="s2">tcp_sock</span><span class="p">",</span> <span class="p">"</span><span class="s2">kernel<linux/tcp.h></span><span class="p">")</span><span class="o">-></span><span class="nv">rx_opt</span><span class="o">-></span><span class="nv">ts_recent</span>
<span class="nv">rcv_tsval</span> <span class="o">=</span> <span class="nv">@cast</span><span class="p">(</span><span class="nv">$sk</span><span class="p">,</span> <span class="p">"</span><span class="s2">tcp_sock</span><span class="p">",</span> <span class="p">"</span><span class="s2">kernel<linux/tcp.h></span><span class="p">")</span><span class="o">-></span><span class="nv">rx_opt</span><span class="o">-></span><span class="nv">rcv_tsval</span>
<span class="nb">printf</span><span class="p">("</span><span class="s2">snd_wnd: %d, tcp_wnd_end: %d, snd_wl1: %d</span><span class="se">\n</span><span class="p">",</span> <span class="nv">snd_wnd</span><span class="p">,</span> <span class="nv">$prior_snd_una</span> <span class="o">+</span> <span class="nv">snd_wnd</span><span class="p">,</span> <span class="nv">snd_wl1</span><span class="p">)</span>
<span class="nb">printf</span><span class="p">("</span><span class="s2">flag: %x, may update window: %d</span><span class="se">\n</span><span class="p">",</span> <span class="nv">$flag</span><span class="p">,</span> <span class="nv">$flag</span> <span class="o">&</span> <span class="mh">0x02</span><span class="p">)</span>
<span class="nb">printf</span><span class="p">("</span><span class="s2">rcv_tsval: %d, ts_recent: %d</span><span class="se">\n</span><span class="p">",</span> <span class="nv">rcv_tsval</span><span class="p">,</span> <span class="nv">ts_recent</span><span class="p">)</span>
<span class="k">print</span><span class="p">("</span><span class="se">\n</span><span class="p">")</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>Systemtap works by converting systemtap scripts into C and building a
kernel module that hot-patches the kernel and overrides specific
instructions. Here we overrode <code class="language-plaintext highlighter-rouge">tcp_ack()</code>, hooked at its end and
dumped the internal TCP connection state. The <code class="language-plaintext highlighter-rouge">$sk->sk_send_head !=
NULL</code> check is a quick way to only match connections that have a
non-empty <code class="language-plaintext highlighter-rouge">Send-Q</code>.</p>
<p>Loading the resulting module into the kernel gave us the following:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">ack: 4212361596, ack_seq: 1253278418, prior_snd_una: 4212361596
sk_send_head seq:4212361596, end_seq: 4212425472
snd_wnd: 0, tcp_wnd_end: 4212361596, snd_wl1: 1708927328
flag: 4100, may update window: 0
rcv_tsval: 2950255047, ts_recent: 2950255047</code></pre></figure>
<p>The two things of interest here are <code class="language-plaintext highlighter-rouge">snd_wl1: 1708927328</code> and
<code class="language-plaintext highlighter-rouge">ack_seq: 1253278418</code>. Not only are they not identical as we would
expect, but actually <code class="language-plaintext highlighter-rouge">ack_seq</code> is <em>smaller</em> than <code class="language-plaintext highlighter-rouge">snd_wl1</code>, indicating
that <code class="language-plaintext highlighter-rouge">ack_seq</code> wrapped around at some point and <code class="language-plaintext highlighter-rouge">snd_wl1</code> has not been
updated for a while. Using the <a href="https://en.wikipedia.org/wiki/Serial_number_arithmetic">serial number arithmetic</a> rules, we can figure out that this end has
received (at least) 3.8 GB since the last update of <code class="language-plaintext highlighter-rouge">snd_wl1</code>.</p>
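<p>That serial-arithmetic estimate is easy to reproduce. A small sketch using the values from the systemtap dump above (the <code>&gt;&gt;&gt; 0</code> forces an unsigned 32-bit wrap-around, mirroring TCP's modulo-2^32 sequence space):</p>

```javascript
// Distance travelled in TCP sequence space from one sequence number to
// another, computed modulo 2^32 (sequence numbers wrap around).
function seqDistance(from, to) {
  return (to - from) >>> 0; // unsigned 32-bit subtraction
}

// Values from the systemtap dump: snd_wl1 and the (wrapped) ack_seq.
const bytesSinceUpdate = seqDistance(1708927328, 1253278418);
console.log((bytesSinceUpdate / 1e9).toFixed(2) + ' GB'); // prints "3.84 GB"
```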
<p>We already saw that <code class="language-plaintext highlighter-rouge">snd_wl1</code> contains the last sequence number used
to update the peer’s receive window (and thus our send window), with
the ultimate purpose of guarding against window updates from old
segments. It should be okay if <code class="language-plaintext highlighter-rouge">snd_wl1</code> is not updated for a while,
but it should not lag too far behind <code class="language-plaintext highlighter-rouge">ack_seq</code>, or else we risk
rejecting valid window updates, as in this case. So it looks like the
Linux kernel fails to update <code class="language-plaintext highlighter-rouge">snd_wl1</code> under some circumstances, which
leads to an inability to recover from a zero-window condition.</p>
<p>Having tangible proof that something was going on in the kernel, it
was time to get people familiar with the networking code in the loop.</p>
<h2 id="taking-things-upstream">Taking things upstream</h2>
<p>After sleeping on this, we wrote a good summary of what we knew so far
and what we supposed was happening, and reached out to <a href="https://lore.kernel.org/netdev/87eelz4abk.fsf@marvin.dmesg.gr/T/#u">the Linux
networking maintainers</a>. Confirmation came less than two
hours later, <a href="https://lore.kernel.org/netdev/87eelz4abk.fsf@marvin.dmesg.gr/T/#mf568052a4f9d76d847ae192d3632b8e87083d75a">together with a patch by Neal
Cardwell</a>.</p>
<p>Apparently, the bug was in the <em>bulk receiver fast-path</em>, a code path
that skips most of the expensive, strict TCP processing to optimize
for the common case of bulk data reception. This is a significant
optimization, outlined 28 years ago² by Van Jacobson in his <a href="https://www.pdl.cmu.edu/mailinglists/ips/mail/msg00133.html">“TCP
receive in 30 instructions” email</a>. It turned out that
the Linux implementation did not update <code class="language-plaintext highlighter-rouge">snd_wl1</code> while in the
receiver fast path. If a connection uses the fast path for too long,
<code class="language-plaintext highlighter-rouge">snd_wl1</code> will fall so far behind that <code class="language-plaintext highlighter-rouge">ack_seq</code> will wrap around with
respect to it. And if this happens while the receive window is zero,
there is no way to re-open the window, as demonstrated above. What’s
more, this bug had been present in Linux <a href="https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/net/ipv4/tcp_input.c?h=2.1.8&id=0f9cac5b27076f801b29a0867868e1bce7310e00&ignorews=1">since v2.1.8</a>, dating
back to 1996!</p>
<p>² This optimization is still relevant today: a relatively recent
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=45f119bf936b1f9f546a0b139c5b56f9bb2bdc78">attempt</a> to remove the header prediction code and associated fast
paths to simplify the code was <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=31770e34e43d6c8dee129bfee77e56c34e61f0e5">reverted</a> on performance
regression grounds.</p>
<p>As soon as we got the patch, we applied it, rebuilt the kernel,
deployed it on the affected machines and waited to see if the issue
was fixed. A couple of days later we were certain that the fix was indeed
correct and did not cause any ill side-effects.</p>
<p>After a bit of discussion, the <a href="https://patchwork.ozlabs.org/project/netdev/patch/20201022143331.1887495-1-ncardwell.kernel@gmail.com/">final commit</a> landed in
<code class="language-plaintext highlighter-rouge">linux-net</code>, and from there it was merged into Linux mainline for 5.10-rc1.
Eventually it found its way to the stable 4.9 and 4.19 kernel series that we
use on our Debian systems, in 4.9.241 and 4.19.153 respectively.</p>
<h2 id="aftermath">Aftermath</h2>
<p>With the fix in place, we still had a couple of questions to answer,
namely:</p>
<ul>
<li>
<p>How is it possible for a TCP bug that leads to stuck connections to
go unnoticed for 24 years?</p>
</li>
<li>
<p>Out of an infrastructure with more than 600 systems running all kinds of
software, how come we only witnessed this bug when using rsync?</p>
</li>
</ul>
<p>It’s hard to give a definitive answer to these questions, but we can
reason about it this way:</p>
<ol>
<li>
<p>This bug will not be triggered by most L7 protocols. In
“synchronous” request-response protocols such as HTTP, usually
each side will consume all available data before sending. In this
case, even if <code class="language-plaintext highlighter-rouge">snd_wl1</code> wraps around, the bulk receiver will be
left with a non-zero window and will still be able to send out
data, causing the next acknowledgment to update the window and
adjust <code class="language-plaintext highlighter-rouge">snd_wl1</code> through check ❶ in <code class="language-plaintext highlighter-rouge">tcp_may_update_window</code>. <code class="language-plaintext highlighter-rouge">rsync</code> on the
other hand uses a pretty aggressive pipeline where the server might send
out multi-GB responses without consuming incoming data in the process.
Even in <code class="language-plaintext highlighter-rouge">rsync</code>’s case, using <code class="language-plaintext highlighter-rouge">rsync</code> over SSH (a rather common
combination) rather than the plain TCP transport would not expose this bug,
as SSH framing/signaling would most likely not allow data to queue up on
the server this way.</p>
</li>
<li>
<p>Regardless of the application protocol, the receiver must remain in the
fast path with a zero send window long enough (receiving at least 2 GB)
to cause a wrap-around, but not so long that <code class="language-plaintext highlighter-rouge">ack_seq</code>
overtakes <code class="language-plaintext highlighter-rouge">snd_wl1</code> again. For this to happen, there must be no
packet loss or other conditions that would cause the fast path’s header
prediction to fail. This is very unlikely to happen in practice as TCP
itself determines the network capacity by actually causing packets to be
lost.</p>
</li>
<li>
<p>Most applications care about network timeouts and will either fail or
reconnect, making the problem appear as a “random network glitch” and
leaving behind no trace to debug.</p>
</li>
</ol>
<p>Finally, even if none of the above happens and you end up with a stuck
TCP connection, it takes a lot of annoyance to decide to deal with it
and drill deep in kernel code. And when you do, you are rewarded with
a nice adventure, where you get to learn about internet protocol
history, have a glimpse at kernel internals, and witness open source
work in motion!</p>
<hr />
<p>If you enjoyed reading this post and you like hunting weird bugs and
looking at kernel code, you might want to drop us a line
— we are always looking for talented <a href="https://apply.workable.com/skroutz/j/485671FB1F/">SREs</a> and <a href="https://apply.workable.com/skroutz/j/9D8A0589DE/">DevOps
Engineers</a>!</p>
<p><a href="https://engineering.skroutz.gr/blog/uncovering-a-24-year-old-bug-in-the-linux-kernel/">Uncovering a 24-year-old bug in the Linux Kernel</a> was originally published by Apollon Oikonomopoulos at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on February 10, 2021.</p>https://engineering.skroutz.gr/blog/speed-the-journey-to-delivering-a-faster-experience-at-skroutz-gr2020-10-22T21:00:00+00:002020-10-22T21:00:00+00:00Skroutz Engineering Teamhttps://engineering.skroutz.gr<h1 id="tldr">TL;DR</h1>
<p>We’ve always placed the user experience first, here at Skroutz. Since a performant application is essential for a seamless journey, speed has always been at our core.</p>
<p>Our rapidly evolving environment (the growing number of development teams, the adoption of new technologies, the addition of new features, etc.) gradually slowed us down.</p>
<p>We knew we had to take action.</p>
<p>For this, we formed a non-typical task-force team to speed us up. We identified the problems, chose our measurement tools and methods and took the plunge.</p>
<p>Measuring performance is not an easy task. It involves both user perception and strictly defined metrics and thresholds.</p>
<p>In order to improve the speed, we tried various solutions. Some worked. Some didn’t. Below you can read in short the key takeaways.</p>
<p><strong>Assets</strong>. Our main goal was to optimize the number and timing of requests. By initially loading only the necessary above-the-fold images and fine-tuning our lazy-loading mechanisms, we saw significant gains in initial requests (almost half in our Product page and up to 30 fewer in our Listing) and therefore a worthwhile improvement in Speed Index metrics (in some cases up to ~4.5%).</p>
<p><strong>HTML</strong>. Excessive DOM size was one of our most critical performance bottlenecks. Our Product pages (our most important section) could reach up to ~8k nodes in some cases, far from Google’s recommendation of 1.5k.<br />
We tried various solutions, including windowing (rejected), asynchronously loading product cards’ content, and showing fewer user reviews (at the risk of losing valuable user-generated content).<br />
What did make a huge difference was timing: loading the information only when it actually needed to exist. We achieved this by implementing a mechanism that notifies each card when it is about to appear in the viewport; the only element needed beforehand is a single-node placeholder. In some cases the DOM nodes were reduced by 45%, which resulted in an increase of ~10 points in our overall Lighthouse score!</p>
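<p>A viewport-notification mechanism like the one described can be sketched with the <code>IntersectionObserver</code> API. This is an illustrative outline, not our production code; the selector, <code>renderCard</code> callback and <code>rootMargin</code> value are all hypothetical:</p>

```javascript
// Hydrate each product-card placeholder just before it scrolls into view.
// `renderCard` is a hypothetical function that replaces the single-node
// placeholder with the card's full DOM subtree.
function observeCards(renderCard, rootMargin = '200px') {
  if (typeof IntersectionObserver === 'undefined') return null; // no API: caller should render eagerly
  const observer = new IntersectionObserver((entries) => {
    for (const entry of entries) {
      if (entry.isIntersecting) {
        renderCard(entry.target);
        observer.unobserve(entry.target); // hydrate each card only once
      }
    }
  }, { rootMargin }); // start rendering shortly before the card enters the viewport
  document.querySelectorAll('.card-placeholder').forEach((el) => observer.observe(el));
  return observer;
}
```

<p>The <code>rootMargin</code> acts as a pre-fetch buffer, so the card's DOM is usually in place by the time the user actually scrolls to it.</p>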
<p><strong>CSS</strong>. Although our styling architecture was in pretty good shape, we thought it might be worth trying critical CSS. The concept was to initially load only the styles necessary to render everything above the fold. This would improve metrics such as First Contentful Paint &amp; Largest Contentful Paint while making the loading feel faster. It turned out that the improvement in these metrics was too slight to justify the effort needed to add this step to our pipeline. In short, this didn’t work for us.</p>
<p><strong>Javascript</strong>. Moving gradually from static to interactive pages caused code bloat, especially on the JavaScript side. Our main JS file included lots of libraries that were not used on every page. This is a problem, especially on mobile devices, because JS runs on the main thread.<br />
Our actions, aimed at reducing our webpack bundle size to free up main-thread work during the initial load, and at iterating on our Redux architecture to improve speed after user interaction, led to slightly better performance.</p>
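<p>One common way to shrink a main bundle is code splitting with dynamic <code>import()</code>, so rarely-used libraries are fetched only when first needed. A hedged sketch; the module path and <code>Gallery</code> class are illustrative, and the loader is injectable here purely to keep the logic testable outside webpack:</p>

```javascript
// Load a heavy library on demand instead of shipping it in the main bundle.
// With webpack, the dynamic import() below is emitted as a separate chunk.
async function openGallery(container, loadGallery = () => import('./gallery')) {
  const { default: Gallery } = await loadGallery(); // network fetch happens only now
  return new Gallery(container);
}
```

<p>Until the user triggers the feature, none of the library's bytes are parsed or executed on the main thread.</p>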
<p>During this journey, we also started addressing some issues on new <strong>Web Vitals</strong> user-centric metrics. We mainly focused on visual stability, by eliminating any layout shifts.</p>
<p>After a year’s work, we <strong>made Skroutz.gr faster</strong>. And more stable.</p>
<p>If you are interested in more details, and you’re ready for a deeper technical dive, make yourself a coffee and keep on reading (it will take ~30 minutes to read).</p>
<hr />
<blockquote>
<p><strong>Table of Contents</strong></p>
<p><a href="#a-brief-history">A Brief History</a> <br /></p>
<p><a href="#speed-not-a-metric-but-a-users-issue">Speed: not a Metric, but a Users’ Issue</a> <br /></p>
<p><a href="#evolution-of-performance-metrics-from-speed-index-to-core-web-vitals">Evolution of Performance Metrics: from Speed Index to Core Web Vitals</a> <br />
› <a href="#pagespeed-insights-psi">Pagespeed Insights (PSI)</a> <br />
› <a href="#core-web-vitals">Core Web Vitals</a> <br /></p>
<p><a href="#the-problems-of-skroutzgr">The Problems of Skroutz.gr</a> <br />
› <a href="#html">HTML</a> <br />
› <a href="#css">CSS</a> <br />
› <a href="#javascript">Javascript</a> <br />
› <a href="#assets">Assets</a> <br /></p>
<p><a href="#the-journey-what-worked-and-what-didnt">The Journey: What Worked and What Didn’t</a> <br />
› <a href="#assets-networking">Assets</a> <br />
› <a href="#html-1">HTML</a> <br />
› <a href="#css-1">CSS</a> <br />
› <a href="#javascript-1">Javascript</a> <br />
› <a href="#core-web-vitals-cumulative-layout-shifts-cls-issues">Core Web Vitals: Cumulative Layout Shift (CLS)</a> <br /></p>
<p><a href="#onwards---closing">Onwards - Closing</a> <br /></p>
</blockquote>
<h1 id="a-brief-history">A Brief History</h1>
<p><a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> has always been a fast and sophisticated web application.</p>
<p>Speed has always been a critical component for <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> since we believe
that for a modern web experience, it’s important to get fast and stay fast.</p>
<p>Historically, the biggest problem we faced regarding speed (and our biggest blessing at the same time)
was the huge amount of content (DOM) in some of our most popular pages, which contain a lot of shops and user-generated content, like reviews, questions, etc.
This problem grows as we add extra information for Products and Categories or extra services
(we have developed a <a href="https://www.skroutz.gr/ecommerce/landing">Marketplace functionality</a> where users can buy directly from Skroutz.gr).</p>
<p>Back in 2016, the huge DOM of some pages was causing crashes due to memory restrictions on some devices (e.g. the iPad),
while at the same time rendering and painting performance was poor.
<a href="https://engineering.skroutz.gr/blog/Skroutz-redesign-how-we-designed-and-implemented-our-own-Design-System/#html" target="_blank">To solve these issues at that time</a>,
we started requesting and rendering elements asynchronously.</p>
<p>However, since <a href="https://engineering.skroutz.gr/blog/Skroutz-redesign-how-we-designed-and-implemented-our-own-Design-System/" target="_blank">our last major redesign in 2016</a>,
lots of things have changed.</p>
<p>Facts like the rapidly growing number of development teams, the adoption of new technologies (e.g. React, CSS Grid),
and the addition of more and more features to our pages led to worse rendering performance, despite the fact
that our applications now run on better and more powerful devices.</p>
<p>Rendering speed took a backseat.</p>
<p>On the other hand, one of the main questions we’re regularly asking ourselves here at Skroutz, is whether our website responds to our users’ expectations and what we can do in order to help them with their buying decisions. When it comes to user experience, speed matters.</p>
<p>Today, consumers are more demanding than they’ve ever been. When they weigh up the experience on a site, they aren’t just
comparing it with its competitors, they’re rating it against the best-in-class services they use every day.</p>
<p>Being of “Moderate Speed” was not acceptable for us, so we decided to take action in order to resolve the issues.</p>
<p>We formed a non-typical task-force team, consisting of engineers, SEO specialists and product owners,
and started working on improving our speed.</p>
<p>In what follows, we describe what we did, how we measured our actions, what worked for us, what didn’t, and some
takeaways from our experience during the journey.</p>
<hr />
<h1 id="speed-not-a-metric-but-a-users-issue">Speed: not a Metric, but a Users’ Issue</h1>
<p>Imagine you’re walking through an unfamiliar city to get to an important appointment. <br />
You walk through various streets and city centers on your way. But here and there, there are slow automatic doors
you have to wait for to open and unexpected construction detours lead you astray. All of these events interrupt
your progress, increase stress and distract you from reaching your destination.</p>
<p>People using the web are also on a journey, with each of their actions constituting one step in what would ideally be a continuous flow.
And just like in the real world, they can be interrupted by delays, distracted from their tasks and led to make errors. <br />
These events, in turn, can lead to reduced satisfaction and abandonment of a site or the whole journey.</p>
<p>In both cases, removing interruptions and obstacles is the key to a smooth journey and a satisfied user
[<a href="https://blog.chromium.org/2020/05/the-science-behind-web-vitals.html" target="_blank">chromium blog</a>].</p>
<p>When it comes to user experience, speed matters. A
<a href="https://www.ericsson.com/en/press-releases/2016/2/streaming-delays-mentally-taxing-for-smartphone-users-ericsson-mobility-report" target="_blank">consumer study</a>
shows that the <strong>stress response to delays in mobile speed is similar to that of watching a horror movie or solving
a mathematical problem</strong>, and greater than waiting in a checkout line at a retail store [<a href="https://web.dev/why-speed-matters/" target="_blank">ref</a>]. <br /></p>
<p>Website performance is crucial to a web application’s success. <br /></p>
<p>Amazon found that each additional 1/10th of a second of load time corresponded with a 1% reduction in sales.
Walmart found that for every second they improved their page load times they added an additional 2% to their conversion rate
[<a href="https://www.alphabetcreative.com/speed-matters-website-performance-and-perception/" target="_blank">ref</a>].
eBay saw a 0.5% increase in “Add to Cart” count for every 100 milliseconds improvement in search page loading time
[<a href="https://web.dev/shopping-for-speed-on-ebay/" target="_blank">ref</a>].</p>
<p>Besides conversion rates, you may know that <a href="https://webmasters.googleblog.com/2018/01/using-page-speed-in-mobile-search.html" target="_blank">Google uses the performance of a website as a ranking factor</a> in search results as well!</p>
<p>In his book <a href="https://www.nngroup.com/books/usability-engineering/" target="_blank">Usability Engineering (1993), Jakob Nielsen</a>*
identifies three main response time limits.</p>
<ul>
<li><strong>0.1 second</strong> — Operations that complete in 100ms or fewer feel instantaneous to the user.
This is the gold standard to aim for when optimising a website.</li>
<li><strong>1 second</strong> — Operations that take 1 second to finish are generally OK, but the user will feel the pause.
If all operations take 1 second to complete, a website may feel a little sluggish.</li>
<li><strong>10 seconds</strong> — If an operation takes 10 seconds or more to complete, the user may switch over to a new tab,
or give up on the website completely (this depends on what operation is being completed.
For example, users are more likely to stick around if they’ve just submitted their card details in the checkout
than if they’re waiting to load a product page).</li>
</ul>
<p>* <em>Since these limits were published back in 1993, internet speeds have increased and we now browse the web
at a lightning pace, so there is speculation that the upper limit is much smaller today, closer to 5 seconds or even lower.</em></p>
<p><strong>Takeaway: Performance is important</strong>! It can mean the difference between making a sale, or losing a customer to the competition.</p>
<hr />
<h1 id="evolution-of-performance-metrics-from-speed-index-to-core-web-vitals">Evolution of Performance Metrics: from Speed Index to Core Web Vitals</h1>
<p>Performance is a foundational aspect of good user experiences.</p>
<p><strong>But what exactly is Performance?</strong></p>
<p>And how do we put a page in the fast or in the slow bucket?</p>
<p>Users of the web expect the pages they visit to render quickly, be interactive and feel smooth.
Pages should not only load quickly, but also run well; scrolling should be stick-to-finger fast, and animations and interactions should be silky smooth.</p>
<p>Performance is more about user perception and less about the actual, objective duration.
How fast a website feels like it’s loading and rendering has a greater impact on user experience than how fast the website actually loads and renders.</p>
<p>How fast or slow something feels depends a lot on whether the user is actively or passively waiting for it to happen. Waits have an active and a passive phase. When the user is active - moving the mouse,
thinking, being entertained - they are in the active phase. <br />
The passive phase occurs when the user is passively waiting, like staring at a monochrome screen. If the passive and active wait times were objectively equal, users would estimate that the passive wait was longer than the active one. If a load, render, or response time cannot be objectively minimized any further, turning a passive wait into an
active wait can make it feel faster.</p>
<p>Besides perception, as the web evolves over time, the metrics and the thresholds evolve too.</p>
<p>How we measure and assort a page today regarding their rendering speed, may be completely irrelevant tomorrow.</p>
<p>While a lot of things constantly change, there is something that remains the same: <strong>human perceptual abilities</strong>, which are critical in evaluating an experience.</p>
<p>But how do we practically evaluate whether a page is fast or not in Skroutz all these years?</p>
<p>There are 2 main phases regarding this.</p>
<p>We used to focus on low level timings, like the Time to First Byte (server response, networking), the <a href="https://developer.mozilla.org/en-US/docs/Glossary/Speed_index" target="_blank">Speed Index</a> (visual display),
the First Paint etc. <br />
Now, we try to incorporate more quality user metrics.</p>
<p>Let’s see the most important ones… starting from Google.</p>
<p>According to Google too, speed matters. For this, Google encourages developers to think broadly about how performance affects a user’s experience of their page and to consider a variety of user experience metrics.</p>
<p>At the time of writing, the following are some resources we use at Skroutz to evaluate a page’s performance:</p>
<ul>
<li><a href="https://developers.google.com/web/tools/lighthouse/" target="_blank">Lighthouse</a>, an automated tool and a part of
Chrome Developer Tools for auditing the quality (performance, accessibility, and more) of web pages. <br /></li>
<li><a href="https://developers.google.com/speed/pagespeed/insights/" target="_blank">PageSpeed Insights</a>, a tool that
indicates how well a page performs on the Chrome UX Report and suggests performance optimizations.</li>
<li><a href="https://web.dev/vitals/" target="_blank">Web Vitals</a> is the latest initiative by Google, to provide unified
guidance for quality signals that are essential to delivering a great user experience on the web.</li>
<li><a href="https://developers.google.com/web/tools/chrome-user-experience-report/" target="_blank">Chrome User Experience Report</a>,
a public dataset of key user experience metrics for popular destinations on the web, as experienced by <strong>Chrome users under real-world conditions</strong>.</li>
</ul>
<p>Google has long used page speed as a signal for rankings, and the new (and different) approach in this signal uses
data measured directly by Chrome on users’ desktop and mobile devices. As a result, Google <a href="https://webmasters.googleblog.com/2020/05/evaluating-page-experience.html" target="_blank">announced</a>
that in 2021 the Core Web Vitals metrics will join other user experience (UX) signals to become <strong>a ranking signal</strong>.</p>
<h2 id="pagespeed-insights-psi">PageSpeed Insights (PSI)</h2>
<p><a href="https://developers.google.com/speed/docs/insights/v5/about" target="_blank">Google’s PageSpeed Insights (PSI)</a> reports on the performance of a page on both mobile and desktop devices, and provides suggestions on how that
page may be improved.</p>
<p>PSI provides both <strong>lab and field data</strong> about a page. Lab data is useful for debugging performance issues, as it is
collected in a controlled environment. However, it may not capture real-world bottlenecks. Field data is useful for
capturing true, real-world user experience - but has a more limited set of metrics. See <a href="https://developers.google.com/web/fundamentals/performance/speed-tools" target="_blank">How To Think About Speed Tools</a> for more information on the 2 types of data.</p>
<p>At the top of the report, PSI provides a score which summarizes the page’s performance. This score is determined by running Lighthouse to collect and analyze lab data about the page. A score of 90 or above is considered good, 50 to 89 needs improvement, and below 50 is considered poor.</p>
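<p>These thresholds are easy to encode. A minimal helper for bucketing a score (the function name is ours, not part of the PSI API):</p>

```javascript
// Map a Lighthouse/PSI performance score (0-100) to its rating bucket.
function psiRating(score) {
  if (score >= 90) return 'good';
  if (score >= 50) return 'needs improvement';
  return 'poor';
}
```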
<h2 id="core-web-vitals">Core Web Vitals</h2>
<p><a href="https://web.dev/vitals/#core-web-vitals" target="_blank">Core Web Vitals</a> are the subset of Web Vitals that apply to all web pages, should be measured by all site owners, and will be surfaced across all Google tools.</p>
<p>Each of the Core Web Vitals represents a distinct facet of the user experience, is measurable in the field, and reflects the real-world experience of a critical user-centric outcome.</p>
<p>Although the metrics that make up Core Web Vitals will evolve over time, the current set for 2020 focuses on three aspects of the user experience: <strong>loading</strong>, <strong>interactivity</strong>, and <strong>visual stability</strong>:</p>
<ul>
<li>Largest Contentful Paint (LCP): measures loading performance. To provide a good user experience, LCP should occur within 2.5 seconds of when the page first starts loading.</li>
<li>First Input Delay (FID): measures interactivity. To provide a good user experience, pages should have a FID of less than 100 milliseconds.</li>
<li>Cumulative Layout Shift (CLS): measures visual stability. To provide a good user experience, pages should maintain a CLS of less than 0.1.</li>
</ul>
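<p>The thresholds above can be captured in a small helper. The sketch below is our own illustration (the function and constant names are not part of any Google library); the “poor” boundaries (4000ms LCP, 300ms FID, 0.25 CLS) follow Google’s published guidance:</p>

```javascript
// Hypothetical helper: rate a Core Web Vitals measurement against the
// 2020 thresholds listed above.
const THRESHOLDS = {
  LCP: [2500, 4000], // milliseconds
  FID: [100, 300],   // milliseconds
  CLS: [0.1, 0.25],  // unitless layout-shift score
};

function rateVital(metric, value) {
  const [good, poor] = THRESHOLDS[metric];
  if (value <= good) return 'good';
  if (value <= poor) return 'needs improvement';
  return 'poor';
}

// e.g. rateVital('LCP', 2100) → 'good', rateVital('CLS', 0.3) → 'poor'
```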
<hr />
<h1 id="the-problems-of-skroutzgr">The Problems of Skroutz.gr</h1>
<p>Generally, when a user types a URL in her browser, the browser makes a GET request to a remote server; the server responds with some resources which, when they arrive at the browser, are combined to render the page.</p>
<p>For this procedure to complete, besides networking timings and delays, one of the most critical factors is the <strong>weight of the requested resources</strong>.</p>
<p>These resources are usually the HTML (from which the DOM gets built), the CSS (from which the CSSOM gets built), probably one or more JS scripts, and images and fonts (assets). Let’s break down each one.</p>
<h2 id="html">HTML</h2>
<p>A large DOM tree can slow down page performance in multiple ways.</p>
<p>First of all, a large DOM tree often includes many nodes that aren’t visible when the user first loads the page, which unnecessarily increases data costs for the users and slows down load time. Furthermore, as users and scripts interact with the page, the browser must constantly recompute the position and styling of nodes, causing rendering lags. Last but not least, targeting elements (through CSS or JS) applies to a large number of nodes,
which can overwhelm the memory capabilities of devices.</p>
<p>Skroutz’s main issue at the time was the excessive DOM size, especially on Product pages.</p>
<p>Unfortunately, our Product pages are the most important sections of our application and have a lot of content, user generated or not. Even worse, Product pages with a lot of content (and excessive DOM) are the most popular ones, since the content regards many shops, a lot of product information, multimedia, many user reviews etc.</p>
<p>Although many sections were already coming in asynchronously, they were still too heavy. At that time, our heaviest Product pages had <strong>~8K nodes</strong>. This was far from Google’s Lighthouse recommendation of <strong>1.5K nodes total maximum</strong>.</p>
<p>Below is a graph of our 3,000 most visited Product pages, showing their shop and review counts.</p>
<p><img src="https://engineering.skroutz.gr/images/speed_at_skroutz/shops_and_reviews.png" alt="'shops and reviews'" /></p>
<p class="caption">Shops & reviews of top 3.000 products</p>
<p>With the most popular pages having more than 30 shop cards and at least 30 user reviews each, it was clear that we had to find ways to lighten the weight without running the risk of getting hit by an SEO issue (rankings).</p>
<p>That was a quite difficult exercise to solve.</p>
<h2 id="css">CSS</h2>
<p>CSS is, besides HTML, the most critical component for a browser.</p>
<p>The browser can only paint the page once it has downloaded the CSS and built the CSS object model. For this reason, CSS is render blocking.</p>
<p>Browsers follow a specific rendering path: paint only occurs after layout, which occurs after the render tree is created, which in turn requires both the DOM and the CSSOM trees.</p>
<p>Our styling architecture was in pretty good shape (<a href="https://engineering.skroutz.gr/blog/Skroutz-redesign-how-we-designed-and-implemented-our-own-Design-System/#css" target="_blank">you can read our approach in detail here, which is close to the current state</a>).</p>
<p>We bundle our CSS files depending on the viewport (mobile-first approach) and we further separate them in a few major sections in order for them to be easily handled from the browser (i.e. Books section, logged section etc).</p>
<p>As it was all the hype during this period, we thought we could try critical CSS, especially on mobile viewports, to test whether it could speed up the rendering process.</p>
<h2 id="javascript">Javascript</h2>
<p>When a browser runs many events, it’s going to do it on the same thread that handles user input (called the main thread).</p>
<p>By default, the main thread of the renderer process typically handles most code: it parses the HTML and builds the DOM, parses the CSS and applies the specified styles, and parses, evaluates, and executes Javascript.</p>
<p>The main thread also processes user events. So, any time the main thread is busy doing something else, a web page may not respond to user interactions, leading to a bad experience.</p>
<p>Loading too much Javascript into the main thread (via <code class="language-plaintext highlighter-rouge"><script></code>, etc.) was the main issue for us, especially for mobile devices.</p>
<p>The size of our JS bundle (named skr_load.js) was 312KB after compression (1.2MB uncompressed)!</p>
<p>The main issues regarding Javascript were the following:</p>
<ul>
<li>Lack of Tree shaking, many unused components and dead code</li>
<li>Lots of application and library code were in the same big fat JS bundle</li>
<li>Lots of libraries like <strong>lodash</strong> were fully imported instead of partially</li>
<li>Heavy dependencies included in the abovementioned JS bundle even though they were not needed on every page</li>
</ul>
<h2 id="assets">Assets</h2>
<p>According to HTTP Archive, as of November 2018, images make up on average 21% of a total webpage’s weight.</p>
<p>So when it comes to optimizing a website, after video content, images are by far the first place one should start!</p>
<p>Optimizing images is more important than scripts and fonts.</p>
<p>And ironically, a good image optimization workflow is one of the easiest things to implement, yet a lot of website owners overlook this.</p>
<p>This was true for us too.</p>
<p>We found many images in different sections that got requested initially, although they weren’t rendered unless the users scrolled down a lot.</p>
<hr />
<h1 id="the-journey-what-worked-and-what-didnt">The Journey: What Worked and What Didn’t</h1>
<p>Having written down the total set of performance bottlenecks, it was time for action.</p>
<p>Although for most web pages it’s pretty straightforward what’s necessary for a better rendering performance, this was not true for us.</p>
<p>Because there is one magic word, regarding speed: <strong>diet</strong>!</p>
<p>In general, page speed could be improved by reducing the payload across all resources. By simply loading less code. Trimming all the unused and unnecessary bytes of JavaScript, CSS, HTML, and JSON responses served to users.</p>
<p>However, <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> is a popular web application with more than <strong>30 million sessions per month</strong>.</p>
<p>We had to be very careful in terms of user experience, since even a small change could add up to a huge drop in sales.</p>
<p>Furthermore, the majority of our visitors come from organic searches, so we had to deploy that diet without running the risk to negatively impact our SEO performance.</p>
<p>Here is how we did it.</p>
<h2 id="assets-networking">Assets (networking)</h2>
<p>While, according to our initial analysis, the main bottlenecks were DOM size (HTML) and JS scripting, we opted for the low-hanging fruit first.</p>
<p><strong>Assets loading</strong> was the first and most obvious place to look for unnecessary initial calls that could easily be made async.</p>
<p><strong>Images’ optimization</strong> was our best shot regarding assets, since we don’t have any non-safe webfonts or any other assets.</p>
<p>For the most part, images were loading on scroll and were adequately lightweight and optimized. But we had room for improvement.</p>
<h3 id="product-page">Product page</h3>
<p>In our Product pages, however, there were a few exceptions, mostly due to our -somewhat outdated- image lightbox.</p>
<p>Although UI-wise there are only 5 visible thumbnails on large screens (image below) and none on mobile, all images’ thumbs were loaded beforehand, along with the first high-res image of the carousel. The more the images, the bigger the problem. Note that our most popular Product pages, like mobile phones, could have anywhere from 20 to 30 images each.</p>
<p><img src="https://engineering.skroutz.gr/images/speed_at_skroutz/thumbnail_gallery.png" alt="'listing page speed index'" /></p>
<p class="caption">Product page's gallery thumbnails</p>
<p>The lightbox was indeed outdated, but so was the structure of the list holding the thumbs. A brief refactor not only saved the redundant image requests, it also saved 3 DOM nodes per lightbox image (minus the 5 visible thumbs on desktop).</p>
<p>Most notably, we removed the <code class="language-plaintext highlighter-rouge"><img></code> tags, which also held the data-attributes used to populate the lightbox. We moved the data-attributes to the parent <code class="language-plaintext highlighter-rouge"><li></code> and used anchor tags only for the 5 visible thumbs, placing the images as background-image directly on them.</p>
<p>Background-image, unlike regular <code class="language-plaintext highlighter-rouge"><img></code>, does not load unless visible*, thus saving the extra requests from mobile viewports without the need to have a different markup structure.</p>
<p>Taking into account some additional minor cuts (e.g. async load the 3 images of product suggestions, load first high-res lightbox image only after opening), image requests were reduced to almost half.</p>
<p>In numbers, one of our most popular phones with 25 images now makes 20 image requests instead of 39, all of them necessary above-the-fold images.</p>
<p>After deploying, Speed Index showed a decrease of ~4.5% (see image below, red line for the Product score).</p>
<p><img src="https://engineering.skroutz.gr/images/speed_at_skroutz/thumbnail_diet.png" alt="'Thumbnail diet results graph'" /></p>
<p class="caption">Thumbnails "diet" results graph</p>
<p>Apart from the top (above the fold) section, we already used lazy loading on the product cards in the Product pages, but there was some room for improvement and this involved the reviews section at the bottom of the page.</p>
<p>Down there, we noticed that the user thumbnails were loaded immediately even though they were far down below the fold. After some code inspection, we realized that there was a lazy loading mechanism (using an external library) but it didn’t work properly.</p>
<p>This was caused by a CSS rule that was setting the user thumbnail as a background image on the appropriate element. Thumbnails were loaded immediately* and the lazy loading library didn’t have to do any work at all.</p>
<p>We fixed this by removing the specific CSS rule and replacing the old lazy loading mechanism with a newer one (using Intersection Observer).</p>
<p>The results on pages with 30 reviews were:</p>
<ul>
<li>30 fewer HTTP requests <br /></li>
<li>30 - 100KB less data on page load</li>
</ul>
<p>Test results (table below) showed a small improvement, though this may just be score fluctuation in Pagespeed.
In any case, it was an easy fix that reduced HTTP calls and network traffic.</p>
<table>
<tr>
<th>User thumbnails load initially</th>
<th>User thumbnails load asynchronously</th>
<th>Difference</th>
</tr>
<tr>
<td>60.4</td>
<td>63.1</td>
<td>4.5%</td>
</tr>
</table>
<p class="caption">Pagespeed scores for user thumbnails</p>
<p>* <small>Images in stylesheets will trigger an HTTP request only after the render tree has been calculated and the corresponding elements are about to be rendered. However there are <a href="https://csswizardry.com/2018/06/image-inconsistencies-how-and-when-browsers-download-images/" target="_blank">inconsistencies among browsers</a>.</small></p>
<h3 id="listing-page">Listing page</h3>
<p>In <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> we have 2 types of Listing layouts: normal & tile.</p>
<ul>
<li><strong>Normal (list) layout</strong>: every row has one product which translates to one image per row.</li>
<li><strong>Tile layout</strong>: every row has more than one product, which means more images per row (4 in desktop, 2 in mobile viewports).</li>
</ul>
<p>In normal layout, we had an average Pagespeed performance score range from 80 to 90+ and in tile layout from 40+ to 50+.</p>
<p>Truth be told, tile rows are taller than list rows, so the ratio is not exactly 4:1, but generally speaking tile lists load more images/products than normal lists.</p>
<p>In tile layout lists, we had more than 60 HTTP requests for images, totalling about 800KB of data.</p>
<p>That’s a lot of requests and data we could shave off!</p>
<p>We tried solving this with the native HTML attribute “loading”.</p>
<p>This posed 2 problems: <br />
First, browser coverage is somewhat low (~70%), mainly because Safari does not support the feature (as of 07/2020). <br />
Second, browsers implement native lazy load differently. The biggest difference is between Chrome and Firefox.
Chrome is playing it safe, loading a lot of images before being scrolled into view (they’re trying to find the sweet spot).
On the other hand, Firefox is really aggressive with lazy loading, only loading images that are 50% or more inside the viewport.</p>
<p>As we couldn’t rely on HTML for this, JS came to the rescue.</p>
<p>We created a React Higher Order Component (<a href="https://reactjs.org/docs/higher-order-components.html" target="_blank">HOC</a>) that utilises <a href="https://developer.mozilla.org/en-US/docs/Web/API/Intersection_Observer_API" target="_blank">IntersectionObserver</a> capabilities. <br /></p>
<p>Using this HOC, we implemented lazy loading in Listing images that works in the same way in every browser that supports Intersection Observer API (almost 90% including Safari).</p>
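<p>Our actual implementation is a React HOC, but the core pattern can be sketched framework-free. The snippet below is illustrative, not our production code: images start with a <code>data-src</code> attribute instead of <code>src</code>, so no request fires until the observer copies the attribute over.</p>

```javascript
// Sketch of IntersectionObserver-based lazy loading (not the actual
// Skroutz HOC). The browser makes no image request until we assign src.
function lazyLoadImages(images, rootMargin = '200px') {
  const observer = new IntersectionObserver((entries, obs) => {
    for (const entry of entries) {
      if (!entry.isIntersecting) continue;
      const img = entry.target;
      img.src = img.dataset.src; // triggers the actual HTTP request
      obs.unobserve(img);        // each image only needs loading once
    }
  }, { rootMargin });            // start fetching a bit before visibility
  images.forEach((img) => observer.observe(img));
  return observer;
}
```

The 200px <code>rootMargin</code> here is an illustrative threshold, not our production value.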
<p>We now have control over the loading threshold and we don’t rely on every different native implementation of every browser.</p>
<p>Running tests with Pagespeed Insights on quite heavy Listing pages (like <a href="https://www.skroutz.gr/c/1009/andrika-mpoufan.html" target="_blank">men’s jackets</a>) yielded some really good results (~10 points improvement).</p>
<p>Below is the graph of a heavy Listing’s page Speed Index and the average pagespeed improvement.</p>
<p><img src="https://engineering.skroutz.gr/images/speed_at_skroutz/lazy_load_listing.png" alt="'listing page speed index'" /></p>
<p class="caption">Speed index improvement graph</p>
<table>
<tr>
<th>Before lazy load</th>
<th>After lazy load</th>
<th>Difference</th>
</tr>
<tr>
<td>48.5</td>
<td>59.5</td>
<td>22.6%</td>
</tr>
</table>
<p class="caption">Pagespeed score for tile layout list</p>
<h2 id="html-1">HTML</h2>
<p>As already mentioned above, the excessive number of DOM nodes was one of our most critical performance bottlenecks in our pages.</p>
<h3 id="product-page-1">Product page</h3>
<p>Product pages render the shops that sell the product. In some popular ones, due to the large number of shops, the DOM nodes exceeded 8,000!</p>
<p>Undoubtedly, for our Product pages this was the most challenging part, since clicking through to shops is the most critical step in a buyer’s journey.</p>
<p>Google’s Lighthouse suggests that in order to optimize large lists, one should use a library called <strong>react-window</strong>. With this library, only the list items inside the viewport are rendered. <br />
In other words, while a user is scrolling through the shops’ list, the items actually rendered are the ones currently in the viewport, along with a few items before and after those already displayed.</p>
<p>This eventually did not work for us; the main obstacle was that the product cards did not have a fixed height.
Although the library provides a solution for dynamic list items, our shop cards contain a lot of information that has to be rendered, and the result wasn’t the expected one: many shop cards failed to render at the right time, mostly on “faster” scrolls, and the overall experience felt broken.</p>
<p>The solution was in another direction.</p>
<p>We had to load the information at the right time, when it was actually needed. It was crucial that the cards maintain their fixed height while loading, in order to avoid layout shifts.</p>
<p>In order to achieve this we had to separate the primary information, which defines the card’s height, from the secondary. We considered product links primary, because they designate the card’s height, and price, shop location, ratings etc. secondary.</p>
<p>The solution was to <strong>render a single node as a placeholder instead of a bunch of nodes</strong> that represent secondary information on the initial page load.</p>
<p>The next step was to implement a mechanism that would notify each card when it appeared in the viewport, and IntersectionObserver suited this perfectly!</p>
<p>Last and final step, for each card displayed in viewport, we replaced the placeholder with the actual information.</p>
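<p>The mechanism can be sketched as follows. This is a simplified, framework-free illustration with hypothetical names; our real version lives inside React components:</p>

```javascript
// Sketch: each shop card initially contains one fixed-height placeholder
// node. When the card nears the viewport, we swap in the secondary
// information (price, shop location, ratings) produced by renderSecondary.
function hydrateCardsOnVisible(placeholders, renderSecondary) {
  const observer = new IntersectionObserver((entries, obs) => {
    for (const entry of entries) {
      if (!entry.isIntersecting) continue;
      const node = entry.target;
      // The placeholder keeps the card's height fixed, so replacing its
      // contents causes no layout shift.
      node.innerHTML = renderSecondary(node.dataset.cardId);
      obs.unobserve(node);
    }
  }, { rootMargin: '300px' }); // hydrate slightly before the card is visible
  placeholders.forEach((node) => observer.observe(node));
  return observer;
}
```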
<p>By completing all the aforementioned steps, the number of the DOM nodes was
reduced dramatically.</p>
<p>In some cases the <strong>DOM nodes were reduced by 45%</strong>, which results in an increase of <strong>~10 points</strong> in our overall page score.</p>
<p>In addition to the abovementioned, we kept an eye on the <strong>users’ reviews section</strong>.</p>
<p>The reviews reduction experiment was part of our effort to reduce DOM elements in the Product page, without running the risk of dropping in organic results, from an SEO perspective.</p>
<p>User reviews are the most typical form of user-generated content (UGC). User reviews about a product is one of the most critical things that can impact purchasing decisions. Consumers are actively looking for content that is unique, relevant, and trustworthy. In fact, according to BrightLocal, 88 percent of consumers trust online reviews written by other consumers as much as they trust suggestions from their personal network
[<a href="https://www.brightlocal.com/research/local-consumer-review-survey-2014/" target="_blank">ref.</a>].</p>
<p>Yet what many don’t know is that UGC is also good for SEO. Search engines such as Google and Bing rank websites based on whether the sites’ content is relevant and useful. Over 25% of the search results for the 20 largest brands in the world are linked to user-generated content [<a href="https://www.pixlee.com/blog/seo-ideas-how-to-improve-seo-with-user-generated-content/" target="_blank">ref.</a>].</p>
<p>In order to reduce reviews’ number at initial load, we had to carefully implement and deploy an experiment first to see if the SEO can be impacted.</p>
<p>We currently render the first 30 reviews, with a “load more” button for loading the rest. Every review has roughly 30 DOM elements, which translates to about 900 elements on every Product page.</p>
<p>For the experiment, we divided Product pages into two groups, one with twelve (12) initial reviews and the other with seven (7).</p>
<p>First of all, we wanted to see how the reduced reviews impact rendering performance. <br />
Second, we kept an eye on conversion rates and the users’ onsite flow. <br />
Third, we monitored SEO performance, comparing the 2 groups with reduced review counts against a control group (no change). <br /></p>
<p>After running a number of Pagespeed index tests for every before and after state, we got the following results.</p>
<table>
<tr>
<th>30 reviews (group 1)</th>
<th>7 reviews (group 1)</th>
<th>Difference</th>
</tr>
<tr>
<td>60.6</td>
<td>69.2</td>
<td>14.2%</td>
</tr>
</table>
<table>
<tr>
<th>30 reviews (group 2)</th>
<th>12 reviews (group 2)</th>
<th>Difference</th>
</tr>
<tr>
<td>64.7</td>
<td>73.3</td>
<td>13.3%</td>
</tr>
</table>
<p class="caption">Review reduction experiment results on 2 groups of products</p>
<p>We had an improvement of almost 9 points for both groups which leads us to believe that:</p>
<ul>
<li>We probably reached the biggest improvement we can get from DOM elements reduction.</li>
<li>There is no reason to reduce our initial reviews number to 7 since 12 yields the same improved scores.</li>
</ul>
<p>Also, regarding the users’ flow, conversion rates and sales, we didn’t record any unusual fluctuations.</p>
<p>Last but not least, we didn’t notice statistically significant SEO performance changes, that would discourage us from exposing the change across the site.</p>
<h2 id="css-1">CSS</h2>
<p><a href="https://web.dev/extract-critical-css/" target="_blank">Critical CSS</a> was a really weird concept the first time we came around it.</p>
<p>The general idea is: Take all the CSS rules you need for rendering above-the-fold elements and put them in your HTML file.</p>
<p>The pros of this trick are that the browser will instantly read this “Critical CSS” and start rendering the above-the-fold elements with their applied rules instead of waiting for a CSS file to download and then do the rendering.</p>
<p>The rest of the CSS is downloaded when the onload event fires, thus not blocking the browser from rendering.</p>
<p>Critical CSS affects metrics like <strong>First Contentful Paint</strong> & <strong>Largest Contentful Paint</strong>.</p>
<p>After some research for possible implementation methods and an experiment that ran in selected Product pages, we reached the following conclusions:</p>
<ul>
<li>The change in scores was minuscule (1-2 points) and probably was caused by fluctuations in Pagespeed Index results.</li>
<li>The implementation of critical CSS for production needed a lot of effort. We would probably have to set up an automated job, generating all the critical CSS rules every time a change in our styles was pushed into master.</li>
</ul>
<p>The combination of high effort & low gains made us stop focusing on this idea and pursue other ways to improve performance and lower rendering times.</p>
<p><strong>Takeaway</strong>: Critical CSS didn’t work for Skroutz.gr!</p>
<h2 id="javascript-1">Javascript</h2>
<p>In order to optimize our JS performance, we worked on reducing the main bundle file that was overloading the main thread (initial request), and on our Redux architecture for faster response to user input. <br />
We finally came up with the following solutions:</p>
<h3 id="ways-to-reduce-our-webpack-bundle-size">Ways to reduce our webpack bundle size</h3>
<p>After some analysis, we started by avoiding libraries’ global imports and enforcing this rule with ESLint.
For example, requiring only the specific lodash functions we needed resulted in a <strong>9% bundle reduction</strong>.
Enforcing the rule with ESLint made sure we wouldn’t come across this issue again.</p>
<p>Then we tried code splitting. With webpack you can split your bundle up into many smaller ones and only load the bundles each page needs. We tried to split our code and ship it in different bundles, but unfortunately this
didn’t work for us, because of the many heavy dependencies shared between our main pages.</p>
<p><strong>It did not reduce overall bundle size (it even slightly increased it), so we decided not to proceed with it</strong>.</p>
<h3 id="redesign-the-state-of-one-main-page-of-our-react-redux-application-into-a-normalised-shape">Redesign the state of one main page of our React Redux application into a normalised shape</h3>
<p>This initiative was about improving the performance (response) after a user’s action on a page (i.e. filtering the results of a Listing), not for the initial request.</p>
<p>Keeping state normalised plays a key role in improving performance and avoiding unnecessary re-renders of the React components.</p>
<p>In a normalised state, each type of data gets its own “table”. Each “data table” stores the individual items in an object, with the items’ IDs as keys and the items themselves as values; any reference to an individual item is done by storing the item’s ID, and ordering is indicated by arrays of IDs.</p>
<p>With this normalised shape, no changes in multiple places are required when an item is updated, the reducer logic doesn’t have to deal with deep levels of nesting and the logic for retrieving or updating a given item is now fairly simple and consistent [<a href="https://redux.js.org/recipes/structuring-reducers/normalizing-state-shape" target="_blank">read more on this</a>].</p>
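<p>As a toy illustration (the entity and field names here are hypothetical, not our actual state), a normalised state and an update that touches only one flat “table” might look like this:</p>

```javascript
// Each entity type gets its own flat "table"; ordering lives in ID arrays.
const state = {
  products: {
    byId: {
      p1: { id: 'p1', name: 'Phone X', reviewIds: ['r1', 'r2'] },
      p2: { id: 'p2', name: 'Phone Y', reviewIds: [] },
    },
    allIds: ['p1', 'p2'],
  },
  reviews: {
    byId: {
      r1: { id: 'r1', rating: 5 },
      r2: { id: 'r2', rating: 3 },
    },
    allIds: ['r1', 'r2'],
  },
};

// Updating one review touches a single slice - no deep nesting, and
// unrelated slices keep their references, avoiding needless re-renders.
function updateReview(state, id, changes) {
  return {
    ...state,
    reviews: {
      ...state.reviews,
      byId: {
        ...state.reviews.byId,
        [id]: { ...state.reviews.byId[id], ...changes },
      },
    },
  };
}
```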
<h3 id="react-hydration-takes-long">React hydration takes long</h3>
<p>Another problem we found was the hydration on the client.</p>
<p>Hydration is the process by which React attaches event listeners to the existing markup on the client side. It is also important because it validates that the markup generated on the server and the markup on the client are the same, proof that SSR works as expected.</p>
<p>Hydration is a time-consuming process that increases load time and delays TTI. The usual solution to this problem is progressive hydration; unfortunately, due to our SSR implementation, we couldn’t adopt it.</p>
<p>However, we can implement lazy hydration as a stopgap, and React is already considering including progressive hydration in its core soon.</p>
<h2 id="core-web-vitals-cumulative-layout-shifts-cls-issues">Core Web Vitals: Cumulative Layout Shifts (CLS) issues</h2>
<p>In late May 2020, while we had already progressed in our making-Skroutz-faster journey, Google announced they’ll be “<a href="https://webmasters.googleblog.com/2020/05/evaluating-page-experience.html" target="_blank">Evaluating page experience for a better web</a>”.</p>
<p>What this meant for us, is that we had to focus on enhancing page experience, according to Google’s <a href="https://web.dev/vitals/#core-web-vitals" target="_blank">Core Web Vitals</a> metrics.</p>
<p>As Google announced, the above metrics will evolve over time, so it’s likely that we would be chasing a moving target here. Still, we focused on <a href="https://web.dev/cls/" target="_blank">CLS, a user-centric metric for measuring visual stability</a>, which was our main issue at the time, according to <a href="https://search.google.com/search-console/about" target="_blank">Google’s Search Console</a>.</p>
<p>There were 2 main areas that induced layout shifts: <strong>Image loads & user interactions</strong>.</p>
<h3 id="image-loads">Image loads</h3>
<p>Although it is quite common for image loading to cause layout shifts (LS), it is also easy to solve by defining image dimensions.</p>
<p>Our most affected page was the Product page with color variations (on desktop), which had two types of images causing LS when loading: main image and color variation thumbs.</p>
<p>The latter was easier to solve, by adding a fixed height placeholder on the container element.</p>
<p>Fixing the main image LS was trickier, because of its orientation-dependent, variable height.
Predefining its height was not an option, at least not for all products: while a predefined height seemed to solve the problem for portrait images, this wasn’t the case for landscape ones.</p>
<p>We then tried preloading the main image. If the network is fast enough to fetch the image before page rendering starts, no LS is caused.</p>
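<p>Preloading can be done either with a <code>&lt;link rel="preload"&gt;</code> tag in the document head or programmatically. A minimal sketch of the latter (the function name and URL are ours, for illustration):</p>

```javascript
// Sketch: ask the browser to fetch the main product image early, before
// rendering starts, by inserting a preload hint into the document head.
function preloadImage(url, doc = document) {
  const link = doc.createElement('link');
  link.rel = 'preload'; // high-priority early fetch
  link.as = 'image';    // lets the browser pick the right priority rules
  link.href = url;
  doc.head.appendChild(link);
  return link;
}
```

Usage: <code>preloadImage('/images/main-product.jpg')</code> as early as possible in page load.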
<p>The above fixes eliminated LS that occurred on Product page initial load, which essentially zeroed out lab data CLS.</p>
<p>Although the initial CLS score caused by image loads was not that significant (~0.03), any gain that keeps our pages’ score below 0.1 (marked as fast by Google) is important.</p>
<h3 id="user-interactions">User interactions</h3>
<p>Google Search Console marked a large number of our URLs as poor, the issue being CLS.
The marked issues concerned both Product & Listing pages on mobile viewports.</p>
<p>After some investigation, the cause was found.</p>
<p>CLS was caused by our <strong>sticky header</strong>.</p>
<p>The header becomes sticky after users scroll past a certain point, at which fixed positioning is applied. Apart from the header itself, the issue involved the sticky navigation on the Product page and the sticky filters on the Listing page.</p>
<p>While the issue was a bit more complex (e.g. paddings were added to other elements to keep everything in place) simply put, adding or removing these sticky elements from the static flow of the document caused a Layout Shift.</p>
<p>Even more, this LS kept adding up each time our header got stuck or unstuck, resulting in significant CLS scores.</p>
<p>A simplified description of the solution is that we explicitly declared the heights of the sticky element containers.
The containers then functioned as placeholders, maintaining the sticky element heights, even when they got out of the static flow.</p>
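<p>In code terms, the idea reduces to something like this sketch (the function names and the <code>is-sticky</code> class are hypothetical):</p>

```javascript
// Sketch: before taking an element out of the static flow with
// position: fixed, give its container an explicit height equal to the
// element's own. The container then acts as a placeholder, so nothing
// around it shifts when the element sticks or unsticks.
function makeSticky(container, element) {
  container.style.height = `${element.offsetHeight}px`;
  element.classList.add('is-sticky'); // class that applies position: fixed
}

function unstick(container, element) {
  element.classList.remove('is-sticky');
  container.style.height = ''; // let the element size the container again
}
```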
<p>A similar problem occurred in our <strong>product cards</strong>, where the shop’s rating and location were displayed.
This information is fetched asynchronously which means that in the initial render the content of that section is empty.
Once the data is fetched and the section populated, the container’s height changes, causing the next cards to be pushed down.</p>
<p>The solution was simple in that case too, we just had to specify the height of the placeholder’s container.</p>
<p>After the abovementioned fixes, our pages improved and are now marked as “good URLs” instead of “URLs that need improvement”, as the images below show.</p>
<p><strong>Yeah</strong>!</p>
<p><img src="https://engineering.skroutz.gr/images/speed_at_skroutz/crux_report.png" alt="'Chrome User Experience CLS Report'" /></p>
<p class="caption">Skroutz's good CLS improved by 45%! (based on Chrome User Experience report)</p>
<p><img src="https://engineering.skroutz.gr/images/speed_at_skroutz/skroutz_mobile_web_vitals.png" alt="'Web vitals for mobile in Google's Search Console'" /></p>
<p class="caption">Good URLs rising after our CLS fixes (Google Search Console)</p>
<h1 id="onwards---closing">Onwards - Closing</h1>
<p>After more than a year of hard and fun work, ranging from low-effort fixes to a few advanced ones, we’ve done it.</p>
<p><strong>We’ve made Skroutz.gr faster.</strong></p>
<p>Performance is a feature at Skroutz. But it is also a competitive advantage. Optimized experiences lead to higher user engagement, conversions, and ROI.</p>
<p>Striving for speed is a never-ending journey. Although we achieved a better performance during the last year -and hopefully a better user experience for our visitors-, this is not the end of the story.</p>
<p>We are now in training mode: we are <strong>setting up a “speed mentality”</strong> for our Front-End engineers, especially regarding the latest and greatest in rendering performance (Core Web Vitals). This post is part of the training!</p>
<p>We are also <strong>establishing an additional continuous monitoring system</strong>: a set of tools and methodologies that we will apply on top of the existing ones, in order to keep the new performance metrics on our daily radar.</p>
<p>We strive for <strong>fast pages and fast development</strong>. At the same time.</p>
<p>We have lots to do more! :)</p>
<p>Congratulations if you made it to the bottom of this huge post.</p>
<p>We hope you got some valuable points from our speed journey.</p>
<p>Have you tried optimizing your speed before? <br />
Yes? No? Kinda? <br />
Let us know by sharing your experience and findings in a comment below.</p>
<p><strong>Best, <br />
Skroutz Devs.</strong></p>
<hr />
<p><em>top image source: <a href="https://www.google.com/url?q=https://unsplash.com/photos/0ZBRKEG_5no&sa=D&ust=1603379132565000&usg=AOvVaw0Y5xRp43UuO_ubGksLeIvx" target="_blank">unsplash</a></em></p>
<style type="text/css">
#tldr {
font-size: 20px;
}
.entry-content .caption {
text-align: center;
font-size: 14px;
font-style: italic;
margin-bottom: 3rem;
}
.entry-content > h1 {
margin-top: 3rem;
}
.entry-content > h2 {
font-size: 1.5rem;
}
.entry-content > h3 {
font-size: 1.2rem;
text-decoration: underline;
}
.entry-content blockquote {
background: #f6f6f6;
padding: 20px 25px;
border: 0;
margin: 30px 0;
transition: none;
}
.entry-content blockquote {
font-style: normal;
}
.entry-content blockquote p {
border-bottom: 1px dotted #ccc;
padding-bottom: 5px;
}
.entry-content blockquote > p > a {
color: #1d1db8;
}
.entry-content blockquote p,
.entry-content blockquote li {
font-size: .9rem;
}
.entry-content td {
background: #fdfdfd;
}
.entry-content small a {
color: #549B70;
}
@media screen and (min-width: 48em) {
.entry-content blockquote p,
.entry-content blockquote li {
font-size: 1rem;
}
}
</style>
<p><a href="https://engineering.skroutz.gr/blog/speed-the-journey-to-delivering-a-faster-experience-at-skroutz-gr/">Speed: The Journey to Delivering a Faster Experience at Skroutz.gr</a> was originally published by Skroutz Engineering Team at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on October 22, 2020.</p>https://engineering.skroutz.gr/blog/process-optimization2020-06-07T08:23:09+00:002020-06-07T08:23:09+00:00George Hadjigeorgiouhttps://engineering.skroutz.gr<p>In every stage of a business, there will be some processes as part of everyday operations. Some examples include the onboarding of a new customer, replying to customer support tickets, or interviewing candidates for a new position.</p>
<p><strong>All processes,</strong> whether well defined and documented or just common knowledge, <strong>start small and simple</strong> and almost certainly <strong>end up huge and complicated</strong>. Drawing a parallel with the world of physics, processes obey the rule of inverted entropy: <strong>every process wants to transition from small and simple (low energy) to big and complex (high energy)</strong>.</p>
<p>This transition will not happen overnight but with small distinctive steps that will eventually slow down the team’s performance. In most of those cases, the slowdown will not be attributed to the increasing complexity of one or more processes but to the high load of the team which will result in more hires and further performance degradation.</p>
<p>Apart from the obvious problems a complex process has (taking more time to complete and requiring more resources), there is one more, hidden in the background, that poses an even bigger threat than a slowdown: teams with complex processes don’t scale. Onboarding a new member takes a huge amount of time, and the more people you add, the more managers are required to control the complexity.</p>
<p>So how does someone optimize a process? As a rule of thumb, expect a significant efficiency boost in any process optimization effort. In <a href="https://www.goodreads.com/book/show/324750.High_Output_Management">High output management</a>, Andy Grove says:</p>
<blockquote>
<p>This is called work simplification. To get leverage this way, you first need to create a flow chart of the production process as it exists. Every single step must be shown on it; no step should be omitted in order to pretty things up on paper. Second, count the number of steps in the flow chart so that you know how many you started with. Third, set a rough target for reduction of the number of steps. In the first round of work simplification, our experience shows that you can reasonably expect a 30 to 50 percent reduction.</p>
</blockquote>
<p>Andy refers to production steps in a factory, but work simplification can be applied to any process. To <strong>keep</strong> your processes <strong>small and simple</strong> you need to do two things: <strong>don’t allow them to get complex, and inspect them frequently to reduce the number of steps</strong>.</p>
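Grove’s counting exercise is simple enough to sketch in a few lines of Python. The process steps and the 30% default target below are purely hypothetical, for illustration only:

```python
# Toy illustration of Grove's work-simplification exercise:
# chart every step, count them, then set a rough reduction target.
def simplification_target(steps, reduction=0.30):
    """Return the step count to aim for after cutting roughly `reduction` of the steps."""
    return len(steps) - round(len(steps) * reduction)

onboarding = [  # hypothetical process; every step listed, none omitted
    "collect documents", "verify identity", "manager sign-off",
    "create account", "send welcome email", "schedule training",
]

print(len(onboarding))                    # steps we start with: 6
print(simplification_target(onboarding))  # rough target after a ~30% cut: 4
```

The point is not the arithmetic but the discipline: you cannot set a reduction target for a process whose steps you have never written down.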
<p>Both actions require asking the same questions either while introducing a new step or when inspecting a process to optimize it.</p>
<h2 id="optimizing">Optimizing</h2>
<p>Process optimization is mostly about inspecting a process and eliminating all unnecessary steps or optimizing them if elimination is not possible. To identify those steps we need to take a look at the most common causes of process complexity which are described below.</p>
<h3 id="better-safe-than-sorry">Better safe than sorry</h3>
<p>Some processes have a number of steps to ensure that nothing ever goes wrong. Say, for example, your support team has a process for handling customer requests of a certain type that is working fairly well. At some point, an angry customer reports a not-so-common case where your process failed in a way that created a lot of frustration or actual damage. One or a few members of the team took a lot of heat and the customer eventually churned.</p>
<p>To avoid this happening again the manager of the team will add an extra step with additional checks to make sure the process is fail-safe. With time, a few of those not so common cases will translate to a number of additional steps being added to the process.</p>
<h3 id="legacy">Legacy</h3>
<p>This is a classic especially for processes that have been around for a long time. A step was introduced to gather some extra information required by law but that law no longer exists. Because of the distance between those handling the process and those designing it the step will sit there for quite some time even though it’s no longer required.</p>
<h3 id="scope-creep">Scope creep</h3>
<p>It’s not so uncommon to find processes with steps that apply only to a specific part of the business but are not scoped accordingly. An e-commerce platform, for example, may require specific handling of a certain category that will be introduced as a new step. That step, however, isn’t critical to all other categories and could easily be scoped to affect only a small share of the team’s resources.</p>
<h3 id="premature-steps">Premature steps</h3>
<p>Every process that has more than a few steps will probably have dependencies between them. Signing up a new customer, for example, will require an up-front setup fee and probably a few other steps. That setup-fee step should be placed at the top, and all other steps should be blocked until the customer has made the initial payment. In many cases, processes execute all steps independently of each other, resulting in extra work that is wasted if that critical step is never completed.</p>
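To avoid that waste, the critical step can act as a gate: nothing downstream runs until it succeeds. A minimal sketch, where the customer fields, step names, and `setup_fee_paid` check are all hypothetical:

```python
# Hypothetical signup pipeline: the setup-fee step gates everything after it,
# so no work is wasted on customers who never complete the initial payment.
def run_signup(customer, steps, gate):
    if not gate(customer):                 # critical step goes first
        return []                          # stop early: later steps never run
    return [step(customer) for step in steps]

def setup_fee_paid(customer):
    return customer.get("fee_paid", False)

later_steps = [
    lambda c: f"provisioned account for {c['name']}",
    lambda c: f"sent welcome email to {c['name']}",
]

print(run_signup({"name": "Alice", "fee_paid": True}, later_steps, setup_fee_paid))
print(run_signup({"name": "Bob"}, later_steps, setup_fee_paid))  # []: blocked at the gate
```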
<h3 id="requiring-various-levels-of-authority">Requiring various levels of authority</h3>
<p>Processes may have steps that can’t be executed by the same individual and require a different, usually higher, level of authority. People with authority are usually far fewer than the people executing tasks, which results in a bottleneck.</p>
<h3 id="no-automation">No automation</h3>
<p>This is probably the easiest one to address. Some steps gradually degrade into something that can be automated, or the technology required for automation simply wasn’t there when the process was designed (e.g. checking the creditworthiness of an individual). Note that automating parts of a step, rather than the whole step, will still increase efficiency.</p>
<p>Designing a process is equally important as maintaining it after it has been deployed. The most common cause of process inefficiency is lack of maintenance; it’s not uncommon to find processes that haven’t been inspected for years. Heavy-load processes should be inspected every 3-6 months, while less frequently used ones can be inspected at longer intervals.</p>
<p><a href="https://engineering.skroutz.gr/blog/process-optimization/">Process Optimization</a> was originally published by George Hadjigeorgiou at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on June 07, 2020.</p>https://engineering.skroutz.gr/blog/hiring-engineers-while-working-from-home2020-05-31T21:00:00+00:002020-05-31T21:00:00+00:00Nikos Fertakishttps://engineering.skroutz.gr<h4 id="or-how-we-learned-to-stop-worrying-and-love-the-engineering-interview-process">Or: How We Learned to Stop Worrying and Love the Engineering Interview Process</h4>
<p>Skroutz is hiring! We write this phrase on our social media and blog posts, and
we discuss it internally within our hiring teams. We are growing rapidly and
hiring people is one of our top priorities.</p>
<p>Even when the pandemic hit and we started working from home, not only did we not
stop our hiring efforts, but we doubled down on them and quickly adapted our
interview process to the new realities. To put it into perspective, of the 29
engineers we have hired in 2020, 25 were hired after March! In this post we will
discuss how we think about interviewing engineers, how our approach has evolved
over time, and how with a few tweaks the same approach worked well when it
became fully remote.</p>
<p>When this big hiring initiative started, our engineering hiring process was in
need of rethinking. It wasn’t really broken - we have hired many good people -
but with more colleagues joining our hiring team and more positions open than
ever, it had started to show its age. So we wrote down the things that concerned
us about it and spent a couple of weeks thinking, discussing, and reading
relevant articles and book chapters. In the end, we came up with a process that,
while not radically different, seemed to iron a few kinks out.</p>
<p>Of course, a hiring process that looks good on paper might not withstand
colliding with the real world. So we decided to try this process out on one of
our openings first and then share the results. Meanwhile, other divisions within
the company are experimenting with different variations of the process, which
means we get to meet and exchange experiences after!</p>
<p>Before moving forward, we must note that hiring is an inherently flawed
process. <em>Judging individuals for their skills is messy, hard, and at times
unfair</em>. Candidates are called to compete in something that barely resembles
their everyday work. In practice, the whole hiring process boils down to
minimizing “false negatives” in the early stages - candidates that should have
moved forward but didn’t - and “false positives” in the hiring decision -
candidates that were hired but weren’t a good fit after all. This is tricky and
you’re bound to make mistakes. Andy Grove wrote that <em>“careful interviewing
doesn’t guarantee you anything, it merely increases your odds of getting lucky”</em>.
In this post we share our experience and our current understanding which might
change in the future, so please take everything we say with a grain of salt.</p>
<h2 id="search-team-job-opening">Search Team Job Opening</h2>
<p>The search team was looking to hire two engineers, and we chose that job opening
to try our process. We received the first resume on 7 February and the last (we
removed the job listing) on 12 March. 83 candidates applied in total, and we
eventually hired 3 of them - two joined the search team and one joined the
content engineering team. Overall we were quite happy with the way it turned
out: we are confident that we made the right decisions and are excited to start
working together with our new colleagues.</p>
<p>What didn’t go that well was our response times: on average we needed 10 days
from the day we received a resume to the day we conducted the first screening
with a candidate. The average time from resume to job offer was 44 days. We can
attribute this to three main reasons: the first is the fact that we had many
open positions simultaneously and our HR department was, at the time,
understaffed for the candidate load. The second reason is the fact that only
four people were involved in the interviews, which put a cap on the total
interviews we could arrange per day. The third reason is of course the elephant
in the room: the COVID-19 pandemic and the resulting lockdown which made us
switch the whole recruitment process online.</p>
<h2 id="interview-process-revamped">Interview Process Revamped</h2>
<p>Our old process consisted of three parts: first filter through the incoming
resumes. Then, do a screening call with those that we think might fit the role.
Finally, do an onsite interview with the most promising candidates. The onsite
interview consisted of two parts: a coding exercise and some database-related
questions.</p>
<p>Before the lockdown happened, we were thinking of adapting that process in order
to address a few issues we had identified. First, the screening call would be
more structured, with specific things to check for.</p>
<p>Furthermore, we decided to introduce a second screening call that would include
a simple coding exercise. The reasoning behind this was that onsite interviews
are very “expensive” both for the candidate that would have to come over to our
offices and spend a few hours there, as well as for the interviewers who would
end up spending a large portion of their day preparing for and conducting the
interview. It made sense then to only call the most promising candidates for
onsite interviews, and we had identified that the coding exercise would help us
do that.</p>
<p>Finally, the onsite interview would be split into three distinct parts: first
another coding exercise, a bit harder this time. Then, a question about system
design. Finally, a chat around the candidate’s past experience.</p>
<h2 id="interview-process-turned-remote">Interview Process Turned Remote</h2>
<p>…then one day we stopped going to the office altogether!</p>
<p>Fortunately, the process above was easy to adapt for a remote setting. First of
all, there would only be one screening call. Onsite didn’t make sense anymore
since there was no “site” to go to and the whole process was just a series of
video calls on Google Meet. So we decided to split that into two calls: first
the coding challenge, and then the system design and past experience
discussions.</p>
<p>This is an overview of the full hiring process, assuming the candidate always
reaches the next step:</p>
<ol>
<li>Screen resume</li>
<li>Do a screening call</li>
<li>Do a coding exercise call</li>
<li>Do a system design/past experience call</li>
<li>Make an offer they can’t refuse</li>
</ol>
<p>Note that all interviews are conducted by engineers who are members of the
recruiting team and there are always at least two interviewers in each call to
reduce bias. The whole process is facilitated by the HR department, members of
which also join the call on the last step to ask a few questions themselves.</p>
<h2 id="screening-call">Screening Call</h2>
<p>The goal of a screening call is to get a first impression of the candidate and
try to establish whether they might be a good fit for the position. We spoke
with 17 people (20% of the total applicants), and 10 of them went forward to the
next step.</p>
<p>We try to keep the length of screening calls at around 45 minutes, and having a
predetermined structure helps a lot. First, we ask the candidate to talk a bit
about their experience, urging them to be concise. This serves as a nice
icebreaker and also becomes the starting point for follow up questions so we can
understand a bit more about their background.</p>
<p>We then ask them a bit about their motivation, the reason they applied for the
specific position, and also what they want to primarily focus on for the next
couple of years. Of course there is no “right” answer here, but this is a nice
way to move the conversation forward and learn more about the candidate and
their expectations.</p>
<p>Finally, and depending on the position the candidate is interviewing for, we ask
a couple of “knowledge” questions. These can be of two types: the first is checking
whether the candidate knows about something that is considered a requirement for
the role. For example, we might ask about Ruby symbols if the position requires
Ruby background. The second type is something based on the candidate’s
experience. For example, a candidate might talk about (or mention on their
resume) their experience with a distributed systems project, and we could follow
up by asking about consistency or availability issues. This way we can get an
impression of the depth of their knowledge, but also of their skills in
communicating technical topics.</p>
<h2 id="coding-exercise">Coding Exercise</h2>
<p>Of the 17 people we did a screening call with, we moved forward with the coding
exercise call with 10 (59%).</p>
<p>What’s nice about having the coding exercise as a separate step is that you can
be less strict in the screening call, because screening calls can get pretty messy;
<em>there were times that we were left uncertain whether we actually learned something
useful about the candidate</em>. In such cases, moving forward with the coding
exercise was an easy decision, since it would give an opportunity to the
candidate to do well, while being easier for us to judge. An example would be
giving a chance to people with little interviewing experience that were visibly
stressed during the screening call.</p>
<p><em>The goal of this step is to determine the problem-solving, coding, and
communication skills of the candidate</em>. First, we let them know that while
reaching a solution is important, we also care about communicating their
thinking out loud. We also note that the quality of the code is important and
that we want to simulate a scenario where we work as a team to solve a problem
but the candidate takes the lead and we just follow along. Finally, we let the
candidate know that there is a time limit of 45 minutes and we actually try to
conclude the call within that range.</p>
<p>In practical terms, we use <a href="https://coderpad.io/">Coderpad</a> for this step.
What’s nice about it is it allows us to watch the candidate’s progress in real
time, and it offers an environment on which we can actually run and test the
code. In preparation for the call we will have created a “pad” with the exercise
description, and some boilerplate code in the language the candidate is most
experienced with.</p>
<p>As for the exercise itself, we try to pick problems that are not trivial but
also that do not require a certain “aha moment” to figure out the solution.
Rather, we prefer problems that are amenable to the sort of incremental problem
solving that is common in day to day work. During the call, we encourage the
candidate to take some time to think about the problem, even use pencil and
paper if they want to. We also try to give them hints if they are stuck, and try
to steer them towards a solution, sometimes by giving them specific examples to
work with.</p>
<p>In order to gain confidence about the exercises we picked, we did a couple of
simulations with Skroutz engineers: we logged on to Coderpad, gave
them the exercise, and watched them trying to solve it within the predetermined
45 minute limit. While of course this isn’t realistic as there is no stress
involved, it gave us a nice (albeit optimistic) baseline and reduced the doubt
about the quality of the exercise significantly. What’s more, we have decided to
adopt this approach before approving any exercise to be used in Skroutz
interviews.</p>
<p>Finally, we gave the same exercise to every candidate. While this might sound a
bit risky (it might leak, or they might already be familiar with it), in
practice it was very helpful in that we were able to judge the candidates
compared to each other, rather than in absolute terms. This helped increase our
confidence about the people we chose to move to the next step of the process.</p>
<h2 id="system-design--past-experience">System Design & Past Experience</h2>
<p>Of the 10 people that did the coding exercise, 5 went forward to the next step.
We should note here that all five candidates were good engineers, able to
communicate their thinking, and we would probably be happy working with any of
them.</p>
<p>The final session has two parts. The first and most technical is a 45 minute
discussion around a system design topic. The second is a 30 minute discussion
based on the candidate’s past experience.</p>
<p>In the system design, we ask the candidate to assume that we are an engineering
team that gets assigned to create a new system/service/website, etc. We want
them to take the lead and tell us how they would approach this task. These are
intentionally open ended: for example “create a twitter clone”, “design a system
that enables users to ‘like’ posts”, and “design a bit.ly clone”, are all
potential topics.</p>
<p>Note that there is not a single right answer and that what we are looking for
can be adapted based on the candidate’s experience and seniority. For a junior
candidate we could focus more on database schema design, API endpoints, and
queries. On the other hand, we would expect a more senior candidate to do
requirement gathering and trade-off discussion before diving into a design
proposal.</p>
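For instance, a first pass at the “bit.ly clone” prompt might pair a <code class="language-plaintext highlighter-rouge">links(id, url)</code> table with base-62 slug encoding. The sketch below is purely illustrative (an in-memory dict stands in for the database), not a model answer we expect from candidates:

```python
import string

ALPHABET = string.ascii_letters + string.digits  # 62 characters for short slugs

def encode(n):
    """Encode a numeric row id as a base-62 slug."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def decode(slug):
    """Turn a base-62 slug back into the numeric row id."""
    n = 0
    for ch in slug:
        n = n * 62 + ALPHABET.index(ch)
    return n

links = {}  # in-memory stand-in for a links(id, url) table

def shorten(url):
    link_id = len(links) + 1  # a real system would use an auto-increment id
    links[link_id] = url
    return encode(link_id)

def resolve(slug):
    return links.get(decode(slug))

slug = shorten("https://www.skroutz.gr")
print(slug, "->", resolve(slug))
```

From a starting point like this, the follow-up questions practically write themselves: what happens under a traffic spike, how do slugs survive a restart, is a relational database the right store for redirects?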
<p>What’s nice about this type of question is that we can dive as deep as we want:
we can explore potential issues with the proposed systems, e.g. “what happens if
there’s a spike in traffic?”, we can ask how the proposed systems could be
extended, e.g. “how can we support lists in our twitter clone?”, or even
alternative technologies, e.g. “is a relational database ideal for this
scenario?” It goes without saying that this type of question requires
preparation from the interviewers as well.</p>
<p>We conclude this call with a 30 minute discussion based on the candidate’s past
experience. This is mostly informal and includes questions on technical topics,
for example we could ask “what is a project you worked on you are especially
proud of?”, “what is the weirdest bug you have encountered?”, but also possibly
touch on teamwork-related topics, for example “what was for you a good
experience of a well-functioning team?”, or “how did you resolve disagreements
with your lead?”, etc.</p>
<p>Touching on such topics can let us determine the seniority of a candidate in
various areas, not strictly technical. This can play a role in determining the
team and the manager we assign them to, should they join us. Moreover, many of
the answers we get are truly interesting, informative and sometimes even
surprising.</p>
<h2 id="conclusion-and-lessons-learned">Conclusion and Lessons Learned</h2>
<p>We hired 3 of the 5 people that made it to the last interview stage, about 4% of the
total applicants. All in all we are quite happy with the process and we felt we
learned a lot along the way.</p>
<p>Other teams have started adopting some key parts of this process:</p>
<ul>
<li>The coding & design step is permanently split into two calls instead of a very
long (~3 hours) one. We believe this helps with scheduling and can reduce
fatigue for both the candidates and the interviewers.</li>
<li>The interviewers are always suitably prepared and are expected to take notes
during the interview and submit their feedback on the candidate within a
couple of days.</li>
<li>For each open position we try to determine “knowledge requirements”
beforehand. That is, things a candidate must know in order to be considered
for the position.</li>
<li>When researching coding exercises we take care that they are not trivial and
do not require a single aha moment to solve.</li>
<li>We have a common pool of coding exercises, so we can experience how different
people try to solve them and judge their performance compared to each other,
rather than in absolute terms.</li>
<li>Before a coding exercise enters the pool, we do a simulation where our
colleagues try their hand at solving them!</li>
<li>Similarly, we have prepared a pool of system design questions.</li>
</ul>
<p>Of course the process is continuously evolving and we try to get better in time.
Asking the candidates for feedback is very helpful in that regard. We want to
treat the whole process as we would a product proposal: first develop some
assumptions on what we can improve, then try the changes out, and finally adapt
the process or the assumptions accordingly, based on the outcome.</p>
<p><em>We believe that people are what matters most in an organization</em>. A proper
hiring process then is critical for growing the organization successfully -
maintaining the core values intact and an excellent level of technical aptitude.
It also shapes the candidate’s initial impression of the organisation and its
people. Thus we believe that we should keep working on it, and that sharing our
experience is important.</p>
<p><a href="https://engineering.skroutz.gr/blog/hiring-engineers-while-working-from-home/">Hiring engineers while working from home</a> was originally published by Nikos Fertakis at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on May 31, 2020.</p>https://engineering.skroutz.gr/blog/performance-management-skroutz2020-02-13T22:00:00+00:002020-02-13T22:00:00+00:00Roza Tapinihttps://engineering.skroutz.gr<h1 id="introduction">Introduction</h1>
<p>This is a post on how we are managing performance at Skroutz and how
we transitioned from informal semi-annual feedback meetings to a
structured continuous performance management framework. It is about
how we translated our value <a href="https://www.skroutz.gr/careers#goals">Set big goals. Take small steps</a> to an
actual set of events, processes, and tools that drive our performance
daily and support our career development.</p>
<h1 id="history">History</h1>
<p>At Skroutz, we have always cared about our people’s personal and
professional development.</p>
<p>In our early days, George H., Vassilis and George A. were having
meetings with all team members giving them feedback and helping them
grow. As our team grew bigger, this task was assigned to people
managers who continued meeting with their team members talking about
strengths and improvement points as well as setting developmental goals.</p>
<p>In both instances, it was an informal discussion where both sides
shared feedback based mainly on recent events and was ending with a
few actionables vaguely stated. It was a process that could serve its
purpose if we were to stay a small-medium sized company.</p>
<p>However, our vision is greater than this, and therefore we needed to
set up a process in which <strong>our people’s happiness and professional
development would remain the focal point</strong>.</p>
<p>Following the example of companies like Adobe and Google, we
introduced a new approach called Continuous Performance Management (CPM).</p>
<p>The differences between CPM and the traditional performance appraisals
are summarized in the table below.</p>
<table>
<thead>
<tr>
<th>Continuous Performance Management</th>
<th>Performance Appraisals</th>
</tr>
</thead>
<tbody>
<tr>
<td>Continuous feedback</td>
<td>Annual or semi-annual feedback</td>
</tr>
<tr>
<td>Coaching</td>
<td>Directing</td>
</tr>
<tr>
<td>Democratic</td>
<td>Autocratic</td>
</tr>
<tr>
<td>Process focused</td>
<td>Outcome focused</td>
</tr>
<tr>
<td>Strength-based</td>
<td>Weakness-based</td>
</tr>
<tr>
<td>Fact driven</td>
<td>Prone to bias</td>
</tr>
</tbody>
</table>
<h1 id="cpm-in-action">CPM in action</h1>
<p>Continuous performance management (CPM) is a framework where managers
and team members collaborate to create short-term developmental goals
and meet on a more regular basis to promote growth, recognition and
happiness. The idea is that everyone can rise to the top and be
successful with their current set of skills.</p>
<p>CPM consists of various components, each serving a different
purpose, and all of them are complementary to each other.</p>
<h1 id="one-on-one-conversations">One-on-One conversations</h1>
<p>Collaboration is a core element of work-life at Skroutz and 1-1s
ensure that a manager-team member relationship has this
characteristic. In a nutshell, one-on-one conversations promote an
ongoing forward-looking dialogue between us and our manager.
So, every two weeks we meet with our manager for 30 to 45 minutes.
Topics of discussion vary; we share updates on work progress, ask for
guidance and support, ask questions regarding tasks, team, and company
OKRs, talk about personal matters, follow up on developmental goals,
and the list goes on. This time is about us!!</p>
<p>In a recent internal survey regarding CPM, there was a unanimous
response that having regular 1-1 conversations was one of the best
practices we have ever rolled out. During the last 8 months, we all
have experienced genuine communication with our manager and we have
received the support and guidance we needed to achieve our tasks and goals.</p>
<p>On the downside, finding an available meeting room at the Skroutz Awesome
Factory resembles a treasure hunt. :)</p>
<h1 id="performance--career-development-discussions">Performance & Career development discussions</h1>
<p>Another component of CPM is the performance discussion, which takes
place quarterly and it serves as a feedback and development mechanism.</p>
<p>This event gives us the opportunity to look back on our 1-1
conversations, on feedback that was shared over the previous 3 months
and have a future-forward talk with our manager about our career development.</p>
<p>At the beginning of every quarter, we complete a self-assessment, which we
send to our manager. S/he then prepares a performance review doc and
sends it to us prior to our discussion, so that we are all
prepared. During our talk, we recognize superpowers and
accomplishments, but most importantly we discuss our future: our
career aspirations and the skills we need to develop in order to fulfill
them. We set priorities and agree on action items for us and our manager.</p>
<p>Goals and action items set in this discussion will be a recurring
topic in our 1-1s for the following quarter.</p>
<h1 id="peer-and-manager-feedback">Peer and Manager feedback</h1>
<p>Skroutz grew on receiving feedback. Getting feedback on features,
services and processes is part of who we are. We believe that feedback
can help us become better at what we do. This mindset applies to all of us, as well.</p>
<p>In the context of the CPM framework, we run peer-review surveys as
well as manager-review surveys. This way, we have the opportunity to
give and receive actionable feedback from our peers and from our team
members, in the case of people managers. The purpose of a feedback
survey is to assist each one of us to better understand our strengths
and weaknesses and to get an insight into aspects of our work needing
professional development.</p>
<p>We ran our first peer and manager reviews in June. Each one of us got
feedback that rang true and that, deep inside, we already knew. Yes,
we got a bit defensive when we first read our report, but then we
distilled from it action items that fueled some of our 1-1 talks.</p>
<h1 id="the-benefits">The benefits</h1>
<p>We transitioned to the CPM framework less than a year ago, and the
positive impact on our daily work-life was obvious from the beginning.</p>
<p>To start with, we are now more aware of where we stand. At any time,
we know what we did well, what we need to work on and we have the
support and guidance we need to achieve our goals. Feedback on
performance is given with specific actionables and in a timely manner.
Good efforts and accomplishments are given the appropriate recognition.</p>
<p>Performance discussions are currently more fruitful since recency bias
has been eliminated; they are more focused on the future and we are
constantly examining opportunities for development.</p>
<p>Our relationships with our managers have improved significantly and
our interaction is more meaningful. Our people managers act as coaches
and mentors and focus their attention on how they could help each one
of us to grow and work towards our goals.</p>
<h1 id="to-sum-up">To sum up</h1>
<p>Continuous performance management has helped us reinforce our culture
of continuous growth, feedback, and recognition. It has contributed to
making our values come alive.</p>
<p>We still have some fine-tuning to do, but we are all certain that
this framework will keep our people at the centre of our attention
and efforts, no matter how big Skroutz becomes.</p>
<p><a href="https://engineering.skroutz.gr/blog/performance-management-skroutz/">Performance Management @ Skroutz</a> was originally published by Roza Tapini at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on February 13, 2020.</p>https://engineering.skroutz.gr/blog/SEO-Crawl-Budget-Optimization-20192019-10-30T21:00:00+00:002019-10-30T21:00:00+00:00Vasilis Giannakourishttps://engineering.skroutz.gr<h1 id="introduction">Introduction</h1>
<p>This is a story about the technical side of SEO on a large e-commerce website like <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a>, with nearly 1 million sessions daily and how we dealt with some significant technical issues we found a year and a half ago.</p>
<p>Let’s give you a sneak peek at the milestones of our efforts, which are covered in this case study. Over the last year and a half, we managed to:</p>
<ol>
<li>Decrease our index size by <strong>18 million</strong> URLs while <strong>improving</strong> our Impressions, Clicks and Average Position.</li>
<li>Create a <strong>real-time</strong> crawl analyzer tool that can handle millions of URLs.</li>
<li>Implement a custom <strong>alert mechanism</strong> for important SEO index and crawl issues.</li>
<li>Automate the technical SEO process of merging or splitting e-commerce categories.</li>
</ol>
<p>If you are interested to see why and how we did all the above, grab a seat!</p>
<blockquote>
<p><strong>Table of Contents</strong></p>
<p><a href="#part-1-seo-analysis-february-2018">Part 1: SEO Analysis (February 2018)</a> <br />
› <a href="#what-issues-initiated-our-analysis">What issues initiated our analysis</a> <br />
› <a href="#how-we-did-the-analysis">How we did the Analysis</a></p>
<p><a href="#part-2-action-plan-and-execution-feb-2018---june-2019">Part 2: Action Plan and Execution (Feb 2018 - June 2019)</a> <br />
› <a href="#action-plan">Action Plan</a> <br />
› <a href="#execution">Execution</a></p>
<p><a href="#part-3-results">Part 3: Results</a></p>
<p><a href="#what-we-learned">What We Learned</a></p>
</blockquote>
<p>But before we take off, let us introduce ourselves.</p>
<p><a href="https://www.skroutz.gr/">Skroutz.gr</a> is the leading price comparison search engine and marketplace of Greece and a top-1000 ranked website globally by <a href="https://www.similarweb.com/website/skroutz.gr">Similar Web</a>. <a href="https://www.skroutz.gr/">Skroutz.gr</a> helped its merchants generate a Gross Merchandise Volume (GMV) of €535 Million in 2018 (≈20% of Greece’s total Retail Ecommerce GMV).</p>
<p>Besides the main B2C price comparison service, <a href="https://www.skroutz.gr/">Skroutz.gr</a> also provides <a href="https://www.skroutz.gr/c/2978/epaggelmatikos-exoplismos-b2b.html">a B2B price comparison service</a> and a new <a href="https://www.skroutz.gr/food">food online delivery service</a> for the Greek market, namely SkroutzFood. Finally, Skroutz.gr operates its own <a href="https://www.skroutz.gr/ecommerce/landing">marketplace</a> of 500+ merchants.</p>
<h3 id="seo-challenges-in-large-sites-the-case-of-skroutzgr">SEO Challenges in Large Sites: The case of Skroutz.gr</h3>
<p>So, what challenges does a Site with millions of pages encounter?</p>
<p>First of all, imagine the difficulties of optimizing rankings for an average-sized site: keyword research and monitoring, on-page SEO, and so on. Now think about doing the same on a website with millions of pages; you have to deal with a vast amount of data and automate things in a way that does not compromise quality.</p>
<p>Besides this, SEO is not just rankings…</p>
<p>Indeed, large website SEOs have another big headache: Crawling and Indexing. These essential steps take place even before Google ranks your content and can be extremely complicated on huge sites.</p>
<blockquote>
<p><strong>Note</strong>: In this case study we focus on the Google Search Engine and Googlebot. However, all search engines operate similarly.</p>
</blockquote>
<p>Most of the problems we encountered relate to <a href="https://webmasters.googleblog.com/2017/01/what-crawl-budget-means-for-googlebot.html">Crawl Budget</a> and <a href="https://support.google.com/webmasters/answer/66359?hl=en">Duplicate Content</a>. More specifically:</p>
<ul>
<li><strong>Crawl Budget</strong>: Google has a crawl rate limit for every website. If the website has fewer than a few thousand URLs, it will usually be crawled just fine. However, if you have a site with a million or more pages, you need to enhance your structure so that crawlers have a far easier time accessing and crawling your most important pages.</li>
<li><strong>Duplicate Content</strong>: If the same content appears at more than one web address, you’ve got duplicate content. While there is no duplicate content penalty, duplicate content can sometimes lead to ranking and traffic drops.
As <a href="https://moz.com/learn/seo/duplicate-content">Moz</a> puts it, this happens because Googlebot doesn’t know whether to direct the link metrics (trust, authority, anchor text, link equity, etc.) to one page or keep them separated between multiple versions and, secondly, it doesn’t know which version(s) to rank for query results.</li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-1.png" alt="" /></p>
<p>For example, Skroutz.gr has more than 3,000,000 products in 3,000 categories. It also uses a faceted navigation with more than 13,000 filters (which can be combined - up to 3 filters), three sorting options and an internal search function. Most of these options produce <strong>a unique page</strong> (URL).</p>
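<p>To get a feel for how quickly faceted navigation multiplies URLs, here is a rough back-of-the-envelope sketch. The 30-filter category is purely illustrative, and the bound assumes any filters can be combined, which real categories restrict:</p>

```python
from math import comb

def filter_page_upper_bound(num_filters: int, max_combo: int = 3) -> int:
    """Upper bound on distinct filter-combination pages for one category,
    counting every combination of 1 up to max_combo filters."""
    return sum(comb(num_filters, k) for k in range(1, max_combo + 1))

# A hypothetical category with just 30 applicable filters:
pages = filter_page_upper_bound(30)  # 30 + 435 + 4060 = 4525 potential URLs
```

<p>Even this modest hypothetical category yields thousands of potential URLs, before sorting options and internal search queries multiply them further.</p>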
<p>Thus, large sites’ administrators have to:</p>
<ul>
<li>automate the monitoring of the site’s SEO performance</li>
<li>watch out for thin or duplicate content issues so that they don’t confuse Google about which pages are essential</li>
<li>control which pages are crawled and indexed</li>
</ul>
<p><br /></p>
<h1 id="part-1-seo-analysis-february-2018">Part 1: SEO Analysis (February 2018)</h1>
<h3 id="what-issues-initiated-our-analysis">What issues initiated our analysis</h3>
<p>If everything goes like clockwork, with most of your rankings in the top 3 positions and stable organic traffic growth, it is hard to suspect that something might not be going so well SEO-wise. That was the case with Skroutz.gr back in 2018.</p>
<p>If you look at the graph of GA sessions over the past five years below, it’s evident that our traffic is increasing every year with a 15-20% YoY organic growth, even surpassing 30Μ monthly sessions (80% of that traffic is organic).</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-2.png" alt="" /></p>
<p>So what was the first sign that something was not going as expected?
Three issues raised flags, especially when we realized the correlation between them.</p>
<h4 id="1-index-size">1. Index Size</h4>
<p>The first one was the index size we saw on Search Console (nearly 25 million URLs), compared with the “real” number of pages we thought we had.</p>
<p>
<img src="/images/seo-crawl-budget-2019/seo-2019-3.png" style="width:auto" />
</p>
<h4 id="2-increased-time-for-new-pages-to-get-indexed-and-rank-high">2. Increased time for new pages to get indexed and rank high</h4>
<p>Delays in ranking recovery were more evident in cases where we had to break a broad category into 2-3 subcategories. This kind of splitting produces many new URLs, as well as many 301 redirects from the old URLs to the new ones (e.g., old filter URLs).</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-4.png" alt="Ahrefs History Chart for &quot;foam roller&quot; keyword" />
<small style="display: block; margin: 7px 0;">Ahrefs History Chart for “foam roller” keyword</small></p>
<p>For example, look at the data above from <a href="https://help.ahrefs.com/en/articles/580856-can-i-see-the-ranking-history-of-a-given-keyword">Ahrefs History Charts</a> for the keyphrase “foam roller” (click <a href="https://www.google.gr/search?q=foam%20roller&glp=1&adtest=on&tci=g:2300&uule=w+CAIQICIGR3JlZWNl&safe=images&safe=high">here</a> to see Greek SERPs). Foam Roller products used to be in a broader category called Gym Balance Equipment (<span style="color:green;">green line</span>). On 03/03/2018, the content team decided to create a new category named Foam Rollers (<span style="color:blue;">blue line</span>) and moved the relevant products there.</p>
<p>As you can see, historically, we ranked on 1st place for “foam rollers” with the internal search page <code class="language-plaintext highlighter-rouge">skroutz.gr/c/1338/balance_gym.html?keyphrase=foam+roller</code>. On 03/03/2018, we created <code class="language-plaintext highlighter-rouge">skroutz.gr/c/2900/foam-rollers.html</code> category and we redirected the first URL, plus a few hundred relevant URLs (e.g., <code class="language-plaintext highlighter-rouge">skroutz.gr/c/1338/balance_gym.html?keyphrase=foam+rollers</code>), to the latter.</p>
<p>Based on previous years’ stats, a new URL needed just a few days to a couple of weeks to recover its rankings, after the consolidation of the signals. Yet, in this case, it took almost three months (!) to rank in first place. Besides this, old redirected URLs remained indexed for months instead of being removed after a few days. That indicated that our crawling efficiency had decreased over the years.</p>
<h4 id="3-increased-time-for-metadata-to-refresh-in-google-index">3. Increased time for metadata to refresh in Google Index</h4>
<p>Titles and Meta Descriptions weren’t updated in Google’s index as fast as in the previous years, especially for pages with low traffic.</p>
<p>As a result, fresh content and schema markups (availability, reviews, and others) weren’t reflected in Google SERPs within a reasonable time.</p>
<h3 id="how-we-did-the-analysis">How we did the Analysis</h3>
<h4 id="step-1---a-first-look-at-the-problem">Step 1 - A first look at the problem</h4>
<p>At first, we wanted to validate our concerns about the index bloat of 25M pages. So, we tried to figure out how many of the 25M pages were actually supposed to be in the SERPs.</p>
<p>We drilled down into the different types of landing pages, estimating:</p>
<ul>
<li>the number of currently indexed pages per type, using <a href="https://moz.com/learn/seo/search-operators">Google Search Operators</a></li>
<li>their share of the total traffic, using Internal Analytics tools</li>
<li>the number of indexed pages we should have, based on some criteria like current or potential traffic</li>
</ul>
<table>
<thead>
<tr>
<th>Type Of Page</th>
<th style="text-align: center">Current Estimated Indexed Pages</th>
<th style="text-align: center">Share of Total Organic Traffic</th>
<th style="text-align: center">Indexed Pages we should have</th>
</tr>
</thead>
<tbody>
<tr>
<td>Homepage</td>
<td style="text-align: center">1</td>
<td style="text-align: center">24%</td>
<td style="text-align: center">1</td>
</tr>
<tr>
<td>Product Pages [like <a href="https://www.skroutz.gr/s/15809760/Apple-iPhone-XR-64GB.html">Apple iPhone XR (64GB)</a>]</td>
<td style="text-align: center">4.5M</td>
<td style="text-align: center">20%</td>
<td style="text-align: center">3M</td>
</tr>
<tr>
<td>Clean Category Pages [like <a href="https://www.skroutz.gr/c/3363/sneakers.html">Sneakers</a>]</td>
<td style="text-align: center">2500</td>
<td style="text-align: center">25%</td>
<td style="text-align: center">2450</td>
</tr>
<tr>
<td>Category Filter Pages [like <a href="https://www.skroutz.gr/c/3363/sneakers/f/936064/Stan-Smith.html">Stan Smith Sneakers</a> or <a href="https://www.skroutz.gr/c/3363/sneakers/m/1464/Nike.html">Nike Sneakers</a>]</td>
<td style="text-align: center">2.5M</td>
<td style="text-align: center">10%</td>
<td style="text-align: center">1.5M</td>
</tr>
<tr>
<td>Internal Search Pages [like <a href="https://www.skroutz.gr/c/108/game_consoles/m/2/Sony.html?keyphrase=ps4">ps4</a>]</td>
<td style="text-align: center">14M</td>
<td style="text-align: center">20%</td>
<td style="text-align: center">1M</td>
</tr>
<tr>
<td>Other Pages (Blog, Guides, <a href="https://www.skroutz.gr/comparelists/40?compare=17437356,19023344">Compare Lists</a>, Pagination, Parameters)</td>
<td style="text-align: center">4M</td>
<td style="text-align: center">1%</td>
<td style="text-align: center">1M</td>
</tr>
<tr>
<td>Total</td>
<td style="text-align: center">25M</td>
<td style="text-align: center">100%</td>
<td style="text-align: center">6.5M</td>
</tr>
</tbody>
</table>
<p>The results were stunning.</p>
<ul>
<li>Actually indexed pages versus our estimated pages differed by nearly <strong>19 million</strong> URLs</li>
<li>The <strong>Internal Search pages</strong> index bloat seemed the most crucial issue; we probably had tons of indexed pages with negligible traffic</li>
<li>Product and Filter Pages had a reasonable amount of low-quality pages</li>
<li>Pagination Pages were the top suspect for the 4M pages of the “Other Pages” type. <a href="https://searchengineland.com/google-no-longer-supports-relnext-prev-314319">This</a> announcement might explain why :-)</li>
</ul>
<p>To tackle these issues, we all agreed to face the problem starting with a quick sprint and following up with more sophisticated solutions down the road.</p>
<h4 id="step-2---setting-up-the-team-and-the-tools">Step 2 - Setting up the team and the tools</h4>
<p>After the above analysis, we formed a wider vertical “SEO purpose” team, which included SEO Analysts, Developers and System Engineers. This team would analyze the problem deeper, create an action plan and implement the proposals.</p>
<p>In our first meeting, we decided that an extensive crawl analysis was needed to fully understand the magnitude of the problem. We chose to set up an in-house real-time crawl monitoring tool instead of a paid solution, for the following reasons:</p>
<ul>
<li><strong>Scalability</strong>: analyze more than 25 million pages and see the changes in behavior every time we needed.</li>
<li><strong>Real-Time Data</strong>: see the impact on the behavior of the crawler, right after a significant change</li>
<li><strong>Customization</strong>: customize the tool and add whatever function we wanted for every different situation</li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-5.png" alt="" /></p>
<blockquote>
<p><strong>Note</strong>: Depending on your needs, you can use other paid tools such as <a href="https://www.deepcrawl.com/">Deepcrawl</a> or <a href="https://www.botify.com">Botify</a>, which have some handy ready-to-use features.</p>
</blockquote>
<p>As we already had some experience with the <a href="https://www.elastic.co/what-is/elk-stack">ELK Stack</a> (this is one of our primary analytics tools), we decided to set up an internal crawl monitoring tool using <a href="https://www.elastic.co/products/kibana">Kibana</a>.</p>
<p>Kibana is a powerful tool and helped us find a lot of significant crawl issues. If we had to choose just one thing that expanded our capabilities on crawl monitoring, that would be the annotation of pageviews with rich meta tags. With the use of rich meta tags, URLs carry additional structured information which provides a way to query a specific subset.</p>
<p>For example, let’s say that we have the URL: <a href="https://www.skroutz.gr/c/3363/sneakers/m/1464/Nike/f/935450_935460/Flats-43.html?order_by=popularity">skroutz.gr/c/3363/sneakers/m/1464/Nike/f/935450_935460/Flats-43.html?order_by=popularity</a></p>
<p>Some of the information that we inject on that URL is the following:</p>
<ul>
<li><strong>Page Type</strong>: Filter Page (other option could be Internal Search Page for example)</li>
<li><strong>Category ID</strong>: 3363</li>
<li><strong>Number of Filters Applied</strong>: 3</li>
<li><strong>Type of Filters</strong>: Normal Filter (Flats), Brand (Nike), Size (43)</li>
<li><strong>HTTP Status</strong>: 200</li>
<li><strong>URL Parameters</strong>: ?order_by=popularity</li>
</ul>
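<p>The annotation step can be approximated by parsing the URL itself. The sketch below is a hypothetical re-implementation, not our production code; the URL patterns follow the example above:</p>

```python
import re
from urllib.parse import urlparse, parse_qs

def annotate(url: str) -> dict:
    """Derive structured meta information (page type, category ID,
    number of filters, URL parameters) from a category URL."""
    parsed = urlparse(url)
    path, query = parsed.path, parse_qs(parsed.query)
    meta = {"page_type": None, "category_id": None,
            "num_filters": 0, "url_params": sorted(query)}
    cat = re.search(r"/c/(\d+)/", path)
    if cat:
        meta["category_id"] = int(cat.group(1))
    if re.search(r"/m/(\d+)/", path):
        meta["num_filters"] += 1            # a brand counts as one filter
    flt = re.search(r"/f/([\d_]+)/", path)
    if flt:
        meta["num_filters"] += len(flt.group(1).split("_"))
    if "keyphrase" in query:
        meta["page_type"] = "internal_search"
    elif meta["num_filters"]:
        meta["page_type"] = "filter"
    elif cat:
        meta["page_type"] = "category"
    return meta
```

<p>Run against the Nike sneakers URL above, this recovers the category ID 3363, the three applied filters and the <code class="language-plaintext highlighter-rouge">order_by</code> parameter.</p>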
<p>With that kind of information we are able to answer questions like:</p>
<ol>
<li>Does GoogleBot crawl pages with more than 2 filters enabled?</li>
<li>How much does the Googlebot Crawl a specific popular category?</li>
<li>Which are the top URL parameters that GoogleBot crawls?</li>
<li>Does GoogleBot crawl pages with filters like “Size” which are nofollow by default?</li>
</ol>
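<p>Once every Googlebot hit carries this metadata, such questions reduce to simple filters and aggregations. A toy illustration over a hand-made log (field names mirror the list above; the records are made up):</p>

```python
from collections import Counter

# Hypothetical annotated Googlebot hits, shaped like the meta fields above.
crawl_log = [
    {"page_type": "filter", "category_id": 3363, "num_filters": 3,
     "url_params": ["order_by"]},
    {"page_type": "internal_search", "category_id": 25, "num_filters": 0,
     "url_params": ["keyphrase"]},
    {"page_type": "filter", "category_id": 3363, "num_filters": 1,
     "url_params": []},
]

# Q1: does Googlebot crawl pages with more than 2 filters enabled?
deep_filter_hits = [h for h in crawl_log if h["num_filters"] > 2]

# Q3: which are the top URL parameters that Googlebot crawls?
param_counts = Counter(p for h in crawl_log for p in h["url_params"])
```
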
<blockquote>
<p><strong>Tip</strong>: You can use this information, not only for SEO purposes, but also for debugging.</p>
<p>For example, we use page load speed information to monitor the page speed per Page Type (Product Page, Category Page etc.) instead of monitoring just the average site speed.</p>
</blockquote>
<p>Imagine how much you can drill down to find Googlebot’s crawl patterns using simple <a href="https://www.elastic.co/guide/en/beats/packetbeat/current/kibana-queries-filters.html">Kibana Queries</a>.</p>
<blockquote>
<p><strong>How we inject the URL information</strong></p>
<p>We use custom HTTP headers. These headers flow through our application stack, and any component, like our Realtime SEO Analyser, can extract and process the information it needs. At the end, before the response is returned to the client, we strip the meta headers off.</p>
</blockquote>
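<p>One way to implement the stripping step is middleware at the edge of the stack. A minimal WSGI-style sketch, assuming a hypothetical <code class="language-plaintext highlighter-rouge">X-Meta-*</code> header prefix (our actual header names differ):</p>

```python
class StripMetaHeaders:
    """WSGI middleware sketch: internal X-Meta-* headers flow through the
    stack for analytics, then get stripped before the response reaches
    the client. The header prefix is illustrative."""

    def __init__(self, app, prefix="x-meta-"):
        self.app = app
        self.prefix = prefix

    def __call__(self, environ, start_response):
        def filtered_start(status, headers, exc_info=None):
            # keep only headers that do not carry internal metadata
            kept = [(k, v) for k, v in headers
                    if not k.lower().startswith(self.prefix)]
            return start_response(status, kept, exc_info)
        return self.app(environ, filtered_start)
```

<p>Upstream components can still read the internal headers; the client only ever sees the filtered set.</p>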
<p>To sum things up, Kibana gave us the ability to do three critical things:</p>
<ol>
<li>
<p>See every single Google Bot crawl hit on a <strong>real-time</strong> basis
<img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-6.png" alt="" /></p>
</li>
<li>
<p><strong>Narrow results</strong> with filters such as Product Category, URL Type (Product, Internal Search, etc.) and many more
<img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-7.png" alt="" /></p>
</li>
<li>
<p>Create <strong>Visualizations or Tables</strong> to monitor the crawl behavior thoroughly
<img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-8.png" alt="" />
<small style="display: block; margin: 7px 0;">The line chart shows monthly crawls for Category Filters, Internal Search and Product Pages.</small></p>
</li>
</ol>
<h4 id="step-3---conclusions-of-the-analysis">Step 3 - Conclusions of the Analysis</h4>
<p>After much digging through the crawl reports combined with traffic stats, and at least one month of continuous monitoring of both real-time data and previous months’ log data (at least ten months), we at last had our first findings. The most important are summed up below:</p>
<p><strong>The Good</strong></p>
<ol>
<li>GoogleBot crawled our most popular product pages (200k out of 4.5M) almost every day. These pages had high authority and many backlinks, so it was kind of expected</li>
</ol>
<p><strong>The Bad</strong></p>
<ol>
<li>With an average daily crawl budget of 1M and our index of 25M, Googlebot could only crawl 4% of our total pages every day</li>
<li>More than 50% of our daily crawl budget was spent on internal search pages, most of which had no traffic at all</li>
<li>
<p>In addition to the above, we saw a weird pattern: a significant volume of internal search URLs shared the same generic keyphrase. For example, the “v2” keyphrase appeared on thousands of URLs:</p>
<ul>
<li><a href="https://www.skroutz.gr/c/3363/sneakers.html?keyphrase=v2">skroutz.gr/c/3363/sneakers.html?keyphrase=v2</a></li>
<li><a href="https://www.skroutz.gr/c/663/Gamepads.html?keyphrase=v2">skroutz.gr/c/663/Gamepads.html?keyphrase=v2</a></li>
<li><a href="https://www.skroutz.gr/c/1850/Gaming_Headsets.html?keyphrase=v2">skroutz.gr/c/1850/Gaming_Headsets.html?keyphrase=v2</a></li>
</ul>
</li>
</ol>
<p>We never thought that the combinations of internal searches with category pages would be crawled and indexed at such a high rate.</p>
<p><br /></p>
<h1 id="part-2-action-plan-and-execution-feb-2018---june-2019">Part 2: Action Plan and Execution (Feb 2018 - June 2019)</h1>
<h3 id="action-plan">Action Plan</h3>
<p>After the analysis, our team decided on the next actions. Based on the findings, the most crucial problem was the crawl and index bloat of URLs with Internal Search Queries. We suspected the index bloat to be the main cause of the issues mentioned in Part 1.</p>
<p>We devised an action plan for the upcoming months consisting of two different projects:</p>
<h4 id="a-primary-crawl-budget-optimization-cbo-project">A. Primary Crawl Budget Optimization (CBO) Project:</h4>
<ol>
<li>Find and fix crawling loopholes which create more and more indexable internal search pages</li>
<li>Decrease the index size of internal search pages by removing or consolidating those pages accordingly</li>
</ol>
<h4 id="b-secondary-crawl-budget-optimization-cbo-project">B. Secondary Crawl Budget Optimization (CBO) Project:</h4>
<ol>
<li>Enhance the crawling and indexing of new URLs when we create a new category or merge two or more categories into one. We saw that rankings recovered very slowly in such cases</li>
<li>Create an alert mechanism for important crawl issues</li>
</ol>
<h3 id="execution">Execution</h3>
<h4 id="a-primary-crawl-budget-optimization-cbo-project-1">A. Primary Crawl Budget Optimization (CBO) Project:</h4>
<h5 id="1-find-crawling-loopholes">1. Find crawling loopholes</h5>
<p>At first, we wanted to see if any loopholes in our link structure allowed Googlebot to find new crappy internal search pages.</p>
<p>Before we get to the execution, it helps to understand how our search engine works and how internal search pages are created.</p>
<blockquote>
<p><strong>Search Function on Skroutz.gr</strong></p>
<p>As we said earlier, Skroutz.gr has always had search at the forefront, meaning that the vast majority of our users search for a product instead of just browsing. In fact, we have more than 600,000 searches per day!</p>
<p>That’s why we have a dedicated Search Team of 5 engineers who strive to enhance the user’s experience after they type a query into the search box. The Search Team has created dozens of mechanisms to make our search engine return, in most cases, high-quality and relevant results to the user. That’s why our bounce rate on those pages is very low (under 30%), near the site’s average.</p>
<p><strong>Internal Search Pages: How are they created?</strong></p>
<p>Firstly, we should point out that all internal search pages of Skroutz.gr have the parameter “?keyphrase=” on the URL.</p>
<p>There are two types of internal search pages; let’s see what they are.</p>
<p>After a user inputs a query into the search box, our search mechanism will try to find the most relevant results from all the categories and return</p>
<ul>
<li>a mixed category search page. Example: <a href="https://www.skroutz.gr/search?keyphrase=shoes">skroutz.gr/search?keyphrase=shoes</a></li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-9.png" alt="" /></p>
<ul>
<li>a dedicated category search page. Example: <a href="https://www.skroutz.gr/c/3363/sneakers.html?from=catspan&keyphrase=shoes">skroutz.gr/c/3363/sneakers.html?keyphrase=shoes</a></li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-10.png" alt="" /></p>
<p><strong>Important note:</strong> Every link on a mixed category search page points to a dedicated category search page. In our example, if a user is on <a href="https://www.skroutz.gr/search?keyphrase=shoes">skroutz.gr/search?keyphrase=shoes</a> and clicks on “Sneakers”, they will be taken to <a href="https://www.skroutz.gr/c/3363/sneakers.html?from=catspan&keyphrase=shoes">skroutz.gr/c/3363/sneakers.html?keyphrase=shoes</a>.</p>
<p>That’s how an internal search page is created. 95% of indexed internal search pages are dedicated category pages.</p>
</blockquote>
<p>It is now apparent what the loophole was… The few mixed category search pages had dozens of follow links to dedicated category search pages with the same query. With this loophole, every different search query could create hundreds of category internal search pages.</p>
<p>That’s why <a href="https://www.skroutz.gr/search?keyphrase=V2">skroutz.gr/search?keyphrase=v2</a> was creating tons of new dedicated category internal search pages like</p>
<ul>
<li><a href="https://www.skroutz.gr/c/3363/sneakers.html?keyphrase=v2">skroutz.gr/c/3363/sneakers.html?keyphrase=v2</a></li>
<li><a href="https://www.skroutz.gr/c/663/Gamepads.html?keyphrase=v2">skroutz.gr/c/663/gamepads.html?keyphrase=v2</a></li>
<li><a href="https://www.skroutz.gr/c/1850/Gaming_Headsets.html?keyphrase=v2">skroutz.gr/c/1850/Gaming_Headsets.html?keyphrase=v2</a></li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-11.png" alt="" />
<small style="display: block; margin: 7px 0;">Googlebot can follow all red links. Thus every time it had access to a mixed category page, it could follow and crawl hundreds of new internal search pages for every different category that matched with user query.</small></p>
<p>We fixed this issue by</p>
<ul>
<li>making all those links nofollow, except for some valuable, valid keyphrases (we will explain what a valid keyphrase is shortly)</li>
<li>checking the browsing and UX stats of the 70,000 most popular internal searches and redirecting 20,000 of them directly to a specific category filter or internal category search. As a result, both Googlebot and users won’t see the mixed category pages when there is no reason to.</li>
</ul>
<p>For example, we saw that more than 95% of the users who searched for <code class="language-plaintext highlighter-rouge">iphone</code> wanted to see the mobile phone and not any accessories. So, instead of showing a mixed category page:</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-12.png" alt="" /></p>
<p>We redirect the user directly to a dedicated category search, based on their search intent:</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-13.png" alt="" /></p>
<h4 id="2-decrease-index-size">2. Decrease index size</h4>
<p>After the fix of the issue with the mixed category pages, like <a href="https://www.skroutz.gr/search?keyphrase=V2">skroutz.gr/search?keyphrase=v2</a>, which created more and more new dedicated category search pages, it was time to deal with the latter.</p>
<p>Dedicated category search pages, like <a href="https://www.skroutz.gr/c/3363/sneakers.html?keyphrase=v2">skroutz.gr/c/3363/sneakers.html?keyphrase=v2</a>, made up an index of enormous size. So, we had to see how many of those pages were crawled by Googlebot and which of them had the quality to be indexed.</p>
<p>This task took us more than one year to finish (February 2018 till June 2019). It was massive and expensive in terms of hours and workforce, but it was worth it.</p>
<p>For this task, we decided to create a mechanism so that the SEO team could consolidate our no-index pages without the involvement of a developer.</p>
<p>But how and where could we consolidate the internal search pages?</p>
<p>That was pretty easy! We found out that most of the internal search URLs were near-duplicates of existing category filters. Example:</p>
<ul>
<li><a href="http://skroutz.gr/c/25/laptop.html?keyphrase=ultrabook">skroutz.gr/c/25/laptop.html?keyphrase=ultrabook</a> (Internal Search Page)</li>
<li><a href="https://www.skroutz.gr/c/25/laptop/f/343297/Ultrabook.html">skroutz.gr/c/25/laptop/f/343297/Ultrabook.html</a> (Filter Page)</li>
</ul>
<p>So, what have we done?</p>
<p>At first, we created a dashboard with all internal search keyphrases for every category, combined with traffic and number of crawls (we called it Keyphrase Curation Dashboard). As we said earlier, every keyphrase may be present on more than one internal search URL.</p>
<p>Then, we added quick action buttons, so the SEO team could take the following actions without the help of a developer:</p>
<ul>
<li>redirect (Consolidate)</li>
<li>noindex or</li>
<li>mark the keyphrase as a valid, valuable internal search URL</li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-14.png" alt="" /></p>
<p>When someone chooses the Redirect action, they are presented with a pop-up so they can choose the redirect targets (maximum 2 Filters + 1 Manufacturer Filter).</p>
<blockquote>
<p>Why did we group by keyphrase and not just URLs?</p>
<p>Because the same keyphrase is present in many URL combinations (Filter + Keyphrase), every action taken for one keyphrase could affect dozens of similar URLs and save us time.</p>
<p>For example, let’s say that we have these two internal search URLs with the keyphrase <code class="language-plaintext highlighter-rouge">ultrabook</code> in the Laptop category:</p>
<ol>
<li><a href="http://skroutz.gr/c/25/laptop.html?keyphrase=ultrabook">skroutz.gr/c/25/laptop.html?keyphrase=ultrabook</a> (Keyphrase)</li>
<li><a href="http://skroutz.gr/c/25/laptop/m/355/Asus.html?keyphrase=ultrabook">skroutz.gr/c/25/laptop/m/355/Asus.html?keyphrase=ultrabook</a> (Filter + Keyphrase)</li>
</ol>
<p>For both URLs, the dashboard would show us <code class="language-plaintext highlighter-rouge">ultrabook</code> as the keyphrase, but we know that the Laptop category has a filter for Ultrabooks.</p>
<p>We could select the Redirect action and choose the Ultrabook filter as the redirect target. The mechanism would then redirect the above URLs to the following URLs respectively:</p>
<ol>
<li><a href="https://www.skroutz.gr/c/25/laptop/f/343297/Ultrabook.html">skroutz.gr/c/25/laptop/f/343297/Ultrabook.html</a></li>
<li><a href="http://skroutz.gr/c/25/laptop/m/355/Asus/f/343297/Ultrabook.html">skroutz.gr/c/25/laptop/m/355/Asus/f/343297/Ultrabook.html</a></li>
</ol>
</blockquote>
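<p>The redirect action itself is mostly mechanical URL rewriting. A simplified sketch of that consolidation step (real slugs and filter IDs come from the curation dashboard; error handling omitted):</p>

```python
import re
from urllib.parse import urlsplit

def redirect_target(url: str, filter_id: int, filter_slug: str) -> str:
    """Rewrite an internal-search URL to the equivalent filter URL,
    preserving any manufacturer segment already in the path."""
    parts = urlsplit(url)
    base = re.sub(r"\.html$", "", parts.path)   # drop the trailing ".html"
    # rebuild without the query string, so "?keyphrase=..." disappears
    return f"{parts.scheme}://{parts.netloc}{base}/f/{filter_id}/{filter_slug}.html"
```

<p>Applied to the two ultrabook URLs above, this yields the plain and the Asus-filtered Ultrabook URLs respectively.</p>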
<p>The mechanism gathered an immense number of keyphrases, <strong>reaching 2.7 million in total!</strong> These 2.7M keyphrases were part of an estimated 14M indexed URLs.</p>
<p>After that, our team began manually curating these keyphrases, starting from the most popular in terms of traffic and crawl hits. Our dev team also helped with some handy automations, like grouping keyphrases with the same product results and handling them all together with one action.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-15.png" alt="" />
<small style="display: block; margin: 7px 0;">All the above internal search keyphrases had the same number of product results in the Laptops category. As you can see, they are all about Dell laptops, so they could all be redirected at once to the Dell filter.</small></p>
<p>This step helped to curate around 5% of the total keyphrases. The index size decreased in July 2018, from 25M to 21M, but it wasn’t enough.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-16.png" alt="" /></p>
<p>Along with our manual efforts, we created some automated scripts and mechanisms for redirecting and mostly no-indexing internal search pages. Some of the most important were the following:</p>
<ol>
<li><strong>No-index Scripts</strong>: We no-indexed all dedicated category search pages:
<ul>
<li>with zero organic sessions in the last two months or</li>
<li>nearly zero organic sessions and up to 3 crawls over the previous 6 months</li>
</ul>
</li>
<li><strong>Redirect Scripts</strong>: We Redirected a dedicated category search page:
<ul>
<li>to the clean category URL, if the search (keyphrase) was returning all the products of the category.
For example, the “sneaker” keyphrase was returning 100% of the products in the “Sneakers” category, so <a href="http://skroutz.gr/c/3363/sneakers.html?keyphrase=sneakers">skroutz.gr/c/3363/sneakers.html?keyphrase=sneakers</a> is redirected to the clean category listing</li>
<li>to a specific filter URL, using a script that could linguistically identify combinations of category names with filters or manufacturers just from the keyphrase.
For example, the “stan smith black” query matches two different filters: “Stan Smith” and “Black”. So, if a user searches for “<a href="http://skroutz.gr/c/3363/sneakers.html?keyphrase=stan+smith+black">stan smith black</a>”, they will be redirected to the category page with the two filters enabled.</li>
</ul>
</li>
</ol>
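<p>The linguistic matching in the second rule can be sketched roughly as follows. This is a hedged, minimal illustration: the filter names, made-up ids and greedy longest-first matching strategy are our assumptions, not the actual Skroutz script:</p>

```python
# Hypothetical filter table: lowercased filter name -> filter id.
# The ids (111, 222) are invented for illustration only.
FILTERS = {
    "stan smith": 111,   # filter "Stan Smith"
    "black": 222,        # filter "Black"
}

def match_filters(keyphrase):
    """Greedily match known filter names inside a keyphrase,
    preferring two-word names over single-word ones."""
    tokens = keyphrase.lower().split()
    matched, i = [], 0
    while i < len(tokens):
        two_words = " ".join(tokens[i:i + 2])
        if two_words in FILTERS:        # two-word filter name first
            matched.append(FILTERS[two_words])
            i += 2
        elif tokens[i] in FILTERS:      # then single-word filter names
            matched.append(FILTERS[tokens[i]])
            i += 1
        else:                           # unmatched token: skip it
            i += 1
    return matched
```

A query matching at least one filter would then be redirected to the corresponding filter (or filter-combination) URL; anything unmatched would stay in the manual curation queue.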
<blockquote>
<p><strong>Note</strong>: In the last few months, we have been running a more sophisticated mechanism that uses some intelligence from the above linguistic identifier script combined with other factors. The mechanism can decompose every search query, match its keyphrases to existing filters and redirect the internal search URL to the specific filter (or filter-combination) URL.</p>
<p>This mechanism handles a significant number of the daily internal search queries: 120,000 (18%).</p>
</blockquote>
<p>When everything that could be done by the SEO team, manually or automatically, was done, we took one final big step to curate the long tail of the internal search pages. We created a basic SEO training course, with workshops, 1-1 hands-on sessions and wiki guides, for many of the Content Team members. These members could, in turn, help us with the procedure. The SEO team, of course, always kept an eye on this ongoing process.</p>
<p>The sharing of this knowledge has greatly benefited us in many ways. For example, because of human curation for crawl budget optimization, our content teams gained a better view of the things our visitors are searching for, which helped them to create more useful category filters.</p>
<p>In conclusion, after nearly one year of manual and automated curation, we finally curated 2,700,000 keyphrases, which correspond to approximately 14,000,000 URLs!</p>
<p>Specifically, of the 2,700,000 internal search keyphrases:</p>
<ul>
<li><strong>2,200,000 were no-indexed</strong> (don’t expect these to be removed from the Google Index immediately; we saw delays ranging from a few days to a few months)</li>
<li><strong>300,000 were redirected</strong> to a filter page URL or a clean category URL</li>
<li><strong>200,000</strong> were marked as <strong>valid keyphrases</strong></li>
</ul>
<p>And that was our primary project.</p>
<p>Before we look at the results of our efforts, let’s briefly see what else we have done.</p>
<h4 id="secondary-seo-project-optimize-mergesplit-categories-and-alert-mechanism">Secondary SEO Project (Optimize Merge/Split Categories and Alert Mechanism)</h4>
<h5 id="1-optimize-seo-when-mergingsplitting-categories">1. Optimize SEO when merging/splitting categories</h5>
<p>While working on our primary project (crawl budget), we also allocated time for some secondary tasks.</p>
<p>The first was to optimize and automate our procedure for merging or splitting categories, so that we wouldn’t lose any SEO value and could provide a better user experience.</p>
<ul>
<li>By <strong>merging</strong> categories, we mean merging two different categories, like “Baby Shampoo” and “Kids Shampoo”, into one</li>
<li>By <strong>splitting</strong> categories, we mean dividing one category into two or more categories. For example, the “Jackets” category can be divided into “Women’s Jackets” and “Men’s Jackets”.</li>
</ul>
<p>All of the above result in lots of redirects, so the SEO juice must be “transferred” from the old URLs to the new ones. To optimize the whole procedure, we created an easy-to-use Merge/Split Tool, so that the content team (which is responsible for the products) can easily map the old URLs to the new ones.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-17.png" alt="" />
<small style="display: block; margin: 7px 0;">The merge tool shows all the filters from both categories, so the content team can map them or copy them. The mechanism will then use this information to make the redirects automatically.</small></p>
<h5 id="2-create-alert-mechanism">2. Create alert mechanism</h5>
<p>Alongside keyphrase curation, we built a mechanism that sends notifications to the SEO team when a critical crawling or indexing issue arises.</p>
<p>How does this mechanism work?</p>
<p>Depending on the alert type, the mechanism sends an alert when a numeric metric:</p>
<ul>
<li>exceeds a specific threshold (for example 20,000 Not Found Pages)</li>
<li>differs significantly from the normal statistical fluctuation of the last 30 days</li>
</ul>
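<p>Both rules can be sketched in a few lines; the 3-sigma cutoff below is an illustrative assumption (the real rules are configured in Grafana, not in code like this):</p>

```python
import statistics

def should_alert(value, history, threshold=None, sigma=3.0):
    """Fire when a metric exceeds a fixed threshold, or deviates more than
    `sigma` standard deviations from its recent (e.g. 30-day) history."""
    # rule 1: fixed threshold (e.g. 20,000 Not Found pages)
    if threshold is not None and value > threshold:
        return True
    # rule 2: significant deviation from normal statistical fluctuation
    if len(history) >= 2:
        mean = statistics.fmean(history)
        spread = statistics.stdev(history)
        if spread and abs(value - mean) > sigma * spread:
            return True
    return False
```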
<p>As for the tools we use, we have set up alert rules in <a href="https://grafana.com/docs/alerting/rules/">Grafana Alerting Engine</a> that get delivered to a Slack channel.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-18.png" alt="" /></p>
<p>After a notification is received, we use the Kibana monitoring tool to analyze the root of the problem in more depth.</p>
<p>Some examples of the alerts we have set:</p>
<ul>
<li><strong>Sitemap Differences</strong>: Before the daily update of our sitemap files, the mechanism compares each generated file with the already submitted one. If they differ significantly, the alert mechanism informs us and instantly blocks the sitemap submission until we validate the data</li>
<li><strong>Noindex Crawls</strong>: If crawls of Noindex Pages fall outside of a specified safe range</li>
<li><strong>Not Found Crawls</strong>: If crawls of 404 Pages fall outside of a specified safe range</li>
<li><strong>Redirect Counts</strong>: If crawls of pages with redirects enabled fall outside of a specified safe range</li>
</ul>
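<p>The sitemap-difference guard, for instance, could look roughly like this; the 10% cutoff and URL-set comparison are assumed values for illustration, not the actual implementation:</p>

```python
def safe_to_submit(old_urls, new_urls, max_change_ratio=0.10):
    """Allow sitemap submission only when the newly generated URL set does
    not differ too much from the already-submitted one."""
    if not old_urls:
        return True                     # first submission: nothing to compare
    changed = len(old_urls ^ new_urls)  # symmetric difference: added + removed
    return changed / len(old_urls) <= max_change_ratio
```

When this returns <code>False</code>, the submission would be blocked and an alert sent, exactly as the “Sitemap Differences” rule above describes.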
<p><br /></p>
<h1 id="part-3-results">Part 3: Results</h1>
<h3 id="1-decreased-index-size">1. Decreased Index Size</h3>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-19.png" alt="" /></p>
<p>The above graph shows our index size after seven months of hard work and 90% of keyphrases being curated.</p>
<p>As for today?</p>
<p>Even better!</p>
<p>We have now dramatically closed the gap between the actual and expected indexed pages, meaning we reduced the size from 25M to only 7.6M.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-20.png" alt="" /></p>
<p>One interesting thing that we observed is this:</p>
<p>GoogleBot doesn’t stop crawling a URL immediately, even if you mark it as no-index. So, if you think that a no-index tag will save your crawl budget instantly, you are wrong.</p>
<p>Notably, we saw that in some cases, GoogleBot returned after 2 or 3 months to crawl a no-index page. We created some metrics for these, and we saw that:</p>
<ul>
<li><strong>Only half of our no-index</strong> internal search URLs haven’t been crawled for at least <strong>three months</strong></li>
<li>Only <strong>38.28% of our no-index</strong> internal search URLs haven’t been crawled for at least <strong>six months</strong></li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-21.png" alt="" /></p>
<h3 id="2-increased-filter-crawl-rate">2. Increased Filter Crawl Rate</h3>
<p>If we take, for example, the fluctuation of Internal Search Pages Crawls (<span style="color:blue;">blue</span>) versus Filter Pages Crawls (<span style="color:green;">green</span>) during the last year, it’s clear that we forced Googlebot to crawl the Filter Pages more frequently than Internal Search Pages.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-22.png" alt="" /></p>
<h3 id="3-decreased-time-for-new-urls-to-be-indexed-and-ranked">3. Decreased time for new URLs to be indexed and ranked</h3>
<p>Instead of taking 2-3 months to index and rank unique URLs, as we saw in a previous example, the indexing and ranking phases now take only a few days.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-23.png" alt="" />
<small style="display: block; margin: 7px 0;">On 29/03/2019, we redirected an internal search page skroutz.gr/c/1487/Soutien.html?keyphrase=bralette to a Category page skroutz.gr/c/3361/Bralettes.html.</small></p>
<h3 id="4-filter-urls-have-increasing-visibility">4. Filter URLs have increasing visibility</h3>
<p>Take a look at the Data Studio chart below, with Search Console data (clicks) from June 2018 to May 2019. You can see how the organic traffic of Filter URLs is increasing, while the traffic of Internal Search Keyphrase URLs is slightly decreasing.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-24.png" alt="" /></p>
<h3 id="5-average-position-improved-pushing-up-impressions-and-clicks">5. Average Position Improved, pushing up impressions and clicks</h3>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-25.png" alt="" />
<small style="display: block; margin: 7px 0;">The table from Google Search Console compares the summer of 2018 (exactly when we started the SEO Project) versus the summer of 2019.</small></p>
<p><br /></p>
<h1 id="what-we-learned">What We Learned</h1>
<p>Over the past two years, we’ve learned a lot during this technical SEO project, and we want to share some things which could eventually help the community.</p>
<p>So here it goes; these are the five most important things we learned:</p>
<h5 id="takeaway-1">Takeaway 1:</h5>
<p>Crawl monitoring is a must for large sites. You can find insights you would never have guessed at. By monitoring, we don’t necessarily mean real-time monitoring like ours. You can also run a website crawler like <a href="https://www.screamingfrog.co.uk/seo-spider/">Screaming Frog</a> or <a href="https://sitebulb.com/">Sitebulb</a> every month, or after a critical change to your site. You would be amazed by the value you can gain by doing this.</p>
<p>An interesting example of the insights you can get from crawl monitoring comes from some critical issues we found when we switched our category listing product pages to React. Without getting into details, after the React deployment GoogleBot started crawling, at a very high rate, no-indexed pages that shouldn’t be crawled at all, despite their being nofollowed from every other link. With crawl monitoring, we were able to see immediately which types of pages had that issue.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-26-27.png" alt="" /></p>
<p>We saw that most of the crawls on no-indexed pages were combinations of size and manufacturer filters on the category with ID 1764.</p>
<p>In the end, we found out that GoogleBot executed an inline <code class="language-plaintext highlighter-rouge"><script /></code> and interpreted some relative URL paths in it as regular URLs, which it then crawled at a high rate. We validated this assumption by adding a dummy URL to the script, which we later saw GoogleBot crawl.</p>
<hr />
<h5 id="takeaway-2">Takeaway 2:</h5>
<p>Googlebot doesn’t stop crawling immediately after you change a page to no-index. It can take some time. We saw no-indexed URLs being crawled for months before they were removed from the Google Index.</p>
<hr />
<h5 id="takeaway-3">Takeaway 3:</h5>
<p>Consolidating URLs can easily backfire if not done right. Every URL that is redirected to another must be highly related (nearly a duplicate) to its target. We have seen that redirects to irrelevant pages had the opposite effect.</p>
<hr />
<h5 id="takeaway-4">Takeaway 4:</h5>
<p>Always pay attention when merging or splitting categories. We saw that even if you keep your rankings stable, there might be a delay of up to a few months during which you can lose many clicks. Mapping old URLs to new ones with 301 redirects can really help.</p>
<hr />
<h5 id="takeaway-5">Takeaway 5:</h5>
<p>SEO is not a one-person show, or even a one-team show. Sharing SEO knowledge and cooperating with other teams can empower the entire organization in many ways. For example, the Search Team of Skroutz.gr did magnificent work by setting up most of the technical infrastructure for the tools and mechanisms we used in our SEO project.</p>
<p>Finally, you can’t imagine how many SEO issues we found thanks to feedback from other departments, such as the Content Teams and Marketing. Even the <a href="https://www.linkedin.com/in/bandito/">CEO of Skroutz.gr</a> himself helped a lot with technical issues we had (scripts etc.).</p>
<hr />
<p>That’s all folks.</p>
<p>Congratulations on getting to the very end of this quite large :-) case study!</p>
<p>Have you ever used any insights from the crawling behavior of GoogleBot to solve issues on your site? How did you deal with them? Let us know in the comments section below!</p>
<p><a href="https://www.linkedin.com/in/vgiannakouris">Vasilis Giannakouris</a>,<br />
on behalf of <a href="mailto:growth@teams.skroutz.gr">Skroutz SEO Team</a></p>
<style type="text/css">
.entry-content h3 {
line-height: 1.2;
}
.entry-content img {
margin: 20px 0;
}
.entry-content td {
background: #fafafa;
font-size: 12px;
}
.entry-content blockquote {
background: #f5f5f5;
padding: 20px 25px;
border-radius: 3px;
border: 1px solid #b5b5b5;
margin: 30px 0;
}
.entry-content p:last-child {
margin-bottom: 0;
}
.entry-content blockquote {
font-style: normal;
}
.entry-content blockquote p,
.entry-content blockquote li {
font-size: .9rem;
}
@media screen and (min-width: 48em) {
.entry-content blockquote p,
.entry-content blockquote li {
font-size: 1rem;
}
}
.entry-content a,
.entry-content code {
white-space: normal;
word-break: break-word;
}
</style>
<p><a href="https://engineering.skroutz.gr/blog/SEO-Crawl-Budget-Optimization-2019/">[Case Study] How we optimized our Crawl Budget</a> was originally published by Vasilis Giannakouris at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on October 30, 2019.</p>https://engineering.skroutz.gr/blog/agile-summit-athens-20192019-10-03T21:00:00+00:002019-10-03T21:00:00+00:00John Makridishttps://engineering.skroutz.gr<p>On September 19th and 20th we* attended <a href="https://agilesummit.gr/">Agile Summit in Athens</a>. Agile Summit is an international conference gathering world-class speakers, agile experts & practitioners from around the world. <a href="https://www.skroutz.gr/">Skroutz</a> supports Agile Summit and last year’s conference was quite inspiring, so we decided to attend it again this year. Here are our notes.</p>
<h1 id="applying-the-heart-of-agile">Applying the Heart of Agile</h1>
<p class="note">
From Alistair Cockburn, creator of Heart of Agile
</p>
<p><a href="https://agilesummit.gr/alistair-cockburn/">Alistair Cockburn</a> is one of the authors of the Agile Manifesto (2001) and shared with us the principles of the <a href="http://heartofagile.com/">Heart of Agile</a>. His point of view is that the whole idea of the Agile Manifesto is simple, but since 2001 Agile has become more and more complicated; more and more things have been piled onto it, and Agile became a complete industry. Heart of Agile says that we should go back to the essence, which is four words:</p>
<ul>
<li>Collaborate</li>
<li>Deliver</li>
<li>Reflect</li>
<li>Improve</li>
</ul>
<figure>
<a href="../../../images/agile-summit-2019/heart-of-agile.png" class="image-popup">
<img src="../../../images/agile-summit-2019/heart-of-agile.png" alt="image" />
</a>
</figure>
<p>He didn’t go extensively through the framework, and prompted the audience to see his full presentation from a <a href="https://heartofagile.com/video-of-the-latest-talk-on-heart-of-agile-by-alistair-cockburn-denmark-october-2018/">conference in Denmark</a>. He described a framework for learning and mastering skills, called Shu Ha Ri and Kokoro, which is also explained in the video, so it’s highly recommended.</p>
<h1 id="innovation-at-scale">Innovation at scale</h1>
<p class="note">
From Yariv Adan, Product Manager @Google
</p>
<p><a href="https://agilesummit.gr/yariv-adan/">Yariv</a> shared a few insights on how Google enables innovative products. His main focus was the <a href="https://www.youtube.com/watch?v=QMW8ZsXxOKw">20% time</a> projects, which are responsible for multiple products with more than 1B users, like Gmail, Google Translate or Google News. Achieving and maintaining that in these highly competitive markets requires constant product & technology innovation. He shared the observations and principles from his 10+ years of experience at Google: how ideas are generated and shared through TGIF meetings, Google’s <a href="https://rework.withgoogle.com/guides/managers-identify-what-makes-a-great-manager/steps/learn-about-googles-manager-research/">“what makes a great manager” research</a>, and continuous iteration. Some key points:</p>
<ul>
<li>Always focus on the User</li>
<li>Launch & iterate, rather than perfection</li>
<li>Ideas come from everywhere -> Share everything</li>
<li>Empower people -> Data, not opinions</li>
<li>Let people pursue their dreams</li>
</ul>
<figure>
<a href="../../../images/agile-summit-2019/innovation-at-scale-1.png" class="image-popup">
<img src="../../../images/agile-summit-2019/innovation-at-scale-1.png" alt="image" />
</a>
</figure>
<figure>
<a href="../../../images/agile-summit-2019/innovation-at-scale-2.png" class="image-popup">
<img src="../../../images/agile-summit-2019/innovation-at-scale-2.png" alt="image" />
</a>
</figure>
<figure>
<a href="../../../images/agile-summit-2019/innovation-at-scale-3.png" class="image-popup">
<img src="../../../images/agile-summit-2019/innovation-at-scale-3.png" alt="image" />
</a>
</figure>
<h1 id="lessons-from-an-ex-project-manager-turned-product-manager">Lessons from an ex-Project Manager turned Product Manager</h1>
<p class="note">
From Emma Sephton, Account Manager @ProdPad
</p>
<p><a href="https://agilesummit.gr/emma-sephton/">Emma Sephton</a> talked about the principles and best practices she learned and applied to help her with the transition from Project Manager to her new role as Product Manager. Key points:</p>
<figure>
<a href="../../../images/agile-summit-2019/lesson-from-1.jpg" class="image-popup">
<img src="../../../images/agile-summit-2019/lesson-from-1.jpg" alt="image" />
</a>
</figure>
<ul>
<li><strong>The customer should be at the centre</strong>. Typically, stakeholders and clients ask for a feature, not a solution. Listen to them to deeply understand their problems.</li>
<li><strong>Define the strategy and focus on it</strong>. Make it the first priority and <strong>learn to say no</strong> to irrelevant requests. The ‘<a href="https://www.mindtools.com/pages/article/newTMC_5W.htm">Five whys</a>’ technique will help you focus on the whys and not on the hows.</li>
<li>Represent the plan in a way that can be understood by everyone in the business. A <strong>roadmap</strong> (now-next-later) will help in that direction, while a <strong>time-based</strong> project plan, like a Gantt chart, might be more challenging since it may need to be redone many times.</li>
</ul>
<figure>
<a href="../../../images/agile-summit-2019/lesson-from-2.png" class="image-popup">
<img src="../../../images/agile-summit-2019/lesson-from-2.png" alt="image" />
</a>
</figure>
<ul>
<li><strong>Find a balance between</strong> focusing on <strong>strategy</strong> and day-to-day <strong>development</strong> involvement (which may be time-consuming).</li>
<li><strong>Use</strong> the <a href="https://www.eisenhower.me/eisenhower-matrix/">Eisenhower matrix</a> <strong>technique</strong> to define what is important and needs immediate action, and what can be delegated or even eliminated.</li>
</ul>
<figure>
<a href="../../../images/agile-summit-2019/lesson-from-3.png" class="image-popup">
<img src="../../../images/agile-summit-2019/lesson-from-3.png" alt="image" />
</a>
</figure>
<ul>
<li><strong>Focus on outcomes and not on outputs</strong>. An output is just an implemented feature, while an outcome is about meeting the objectives; it is a learning experience.</li>
</ul>
<figure>
<a href="../../../images/agile-summit-2019/lesson-from-4.jpg" class="image-popup">
<img src="../../../images/agile-summit-2019/lesson-from-4.jpg" alt="image" />
</a>
</figure>
<h1 id="empathy-is-a-technical-skill">Empathy is a technical skill</h1>
<p class="note">
From Andrea Goulet, CEO of Corgibytes
</p>
<p>What is empathy? Is it a feeling? Is it something technical people can’t access? Is it just a high-level, touchy-feely fad? Nope. <a href="https://agilesummit.gr/empathy-is-a-technical-skill/">Andrea</a> demonstrated how empathy is a crucial skill for developing software and focused on giving us practical, immediately actionable advice for <strong>making empathy a central focus of our daily development practice</strong>.</p>
<p><a href="https://agilesummit.gr/empathy-is-a-technical-skill/">Andrea</a> described the differences between cognitive and mirrored empathy, and how to train yourself to build stronger empathy, for example:</p>
<ul>
<li>Start with a broad topic</li>
<li>Use the fewest number of words</li>
<li>Avoid introducing words the speaker may not have heard</li>
<li>Try not to say “I”</li>
<li>Be supportive and present</li>
<li>Resist the urge to demonstrate how smart you are</li>
<li>Neutralize your reactions</li>
</ul>
<figure>
<a href="../../../images/agile-summit-2019/empathy-1.png" class="image-popup">
<img src="../../../images/agile-summit-2019/empathy-1.png" alt="image" />
</a>
</figure>
<figure>
<a href="../../../images/agile-summit-2019/empathy-2.png" class="image-popup">
<img src="../../../images/agile-summit-2019/empathy-2.png" alt="image" />
</a>
</figure>
<h1 id="conclusion">Conclusion</h1>
<p>Wrapping up, <strong>Agile Summit</strong>, as one of the biggest agile conferences in Southern Europe, was quite inspiring once again. The organisation of the conference was great, as were the talks. With so many interesting people to interact with, learn from and exchange experiences with, it clearly met our expectations. See you there next year! You are more than welcome to leave a comment.</p>
<p>Written By:</p>
<p>*
<em><a href="https://www.linkedin.com/in/stavroula-vasilopoulou-b7248730/">Stavroula Vasilopoulou</a>, <a href="https://twitter.com/giorgostsiftsis">Giorgos Tsiftsis</a>, <a href="https://www.linkedin.com/in/ioannismakridis" rel="nofollow">John Makridis</a>, <a href="https://gr.linkedin.com/in/dimitris-promponas-06131761">Dimitris Promponas</a>, <a href="http://vangeltzo.com/">Vagelis Tzortzis</a></em></p>
<p><a href="https://engineering.skroutz.gr/blog/agile-summit-athens-2019/">Agile Summit Athens 2019</a> was originally published by John Makridis at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on October 03, 2019.</p>https://engineering.skroutz.gr/blog/entropy-changes-in-debian2019-09-09T00:00:00+00:002019-09-09T00:00:00+00:00Alexandros Afentoulis, Nikos Kormpakishttps://engineering.skroutz.gr<h2 id="intro">Intro</h2>
<p>At Skroutz we operate a wide variety of services comprising the ecosystem
behind <a href="https://www.skroutz.gr">Skroutz.gr</a>, a comparison shopping engine which
evolved to an e-commerce marketplace. We run these services on our own
infrastructure, bare metal servers and virtual machines. All hosts are running
Debian GNU/Linux, which on July 6th 2019 had its latest stable release, called
Buster. Buster came with lots of changes in included packages, as expected in a
major release.</p>
<p>We started experimenting with dist-upgraded Buster hosts a couple of months
before the official release, as soon as Buster got in “freeze” state. This
strategy would give us a taste of what to expect with the new software versions
and how to get better prepared to smoothly upgrade the operating system
underneath our services with minimum disruption.</p>
<h2 id="the-problem">The problem</h2>
<p>The issue we’re going to discuss in this post manifests pretty simply: after
dist-upgrading a virtual machine to Buster and rebooting it, it took a couple
of minutes before we could actually regain access via ssh. Virtual machine
reboots are part of routine maintenance work to keep our services up-to-date
and secure. When orchestrating such work across a fleet of hundreds of hosts, we
certainly would like to avoid spending minutes before verifying that each host
did come back up and healthy.</p>
<h2 id="investigation">Investigation</h2>
<p>It’s widely known that virtual machines do not enjoy the privilege of high
quality randomness as the physical hosts do, since a virtual machine’s devices
are emulated by design, thus do not feature unpredictable behavior, a useful
ingredient for randomness <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>
<p>Various references, e.g. Debian bug reports <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>, suggested that this
behavior was to be attributed to OpenSSL and how it gathers entropy via the
<code class="language-plaintext highlighter-rouge">getrandom()</code> system call. But all these online references were not descriptive
enough or conclusive, so we opted for digging deeper and understand the issue.</p>
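<p>The blocking behavior of <code class="language-plaintext highlighter-rouge">getrandom()</code> is easy to observe from userspace. The following sketch uses Python’s <code class="language-plaintext highlighter-rouge">os.getrandom()</code> wrapper of the same system call; the <code class="language-plaintext highlighter-rouge">GRND_NONBLOCK</code> flag makes the call fail instead of blocking on an uninitialized CRNG (Linux-only, so we fall back to <code class="language-plaintext highlighter-rouge">os.urandom()</code> elsewhere):</p>

```python
import os

def read_entropy(n):
    """Read n random bytes without blocking on an uninitialized CRNG.

    With GRND_NONBLOCK, getrandom(2) fails with EAGAIN (BlockingIOError
    in Python) instead of blocking -- the very wait that stalls services
    such as sshd during early boot.
    """
    if not hasattr(os, "getrandom"):
        return os.urandom(n)            # non-Linux fallback
    try:
        return os.getrandom(n, os.GRND_NONBLOCK)
    except BlockingIOError:
        return b""                      # CRNG not seeded yet
```

On a long-running machine the CRNG is already seeded and this returns the requested bytes; during the first seconds of a VM boot it would return <code class="language-plaintext highlighter-rouge">b""</code>.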
<p>The kernel ring buffer displays important information coming from kernelspace,
and it’s the first place we looked. Consider this snippet from a Buster VM
that had just booted:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># journalctl -k | grep random</span>
Apr 17 12:05:06 somevm kernel: random: get_random_bytes called from start_kernel+0x93/0x531 with <span class="nv">crng_init</span><span class="o">=</span>0
Apr 17 12:05:06 somevm kernel: random: fast init <span class="k">done
</span>Apr 17 12:05:06 somevm kernel: random: systemd: uninitialized urandom <span class="nb">read</span> <span class="o">(</span>16 bytes <span class="nb">read</span><span class="o">)</span>
Apr 17 12:05:06 somevm kernel: random: systemd: uninitialized urandom <span class="nb">read</span> <span class="o">(</span>16 bytes <span class="nb">read</span><span class="o">)</span>
Apr 17 12:05:06 somevm kernel: random: systemd: uninitialized urandom <span class="nb">read</span> <span class="o">(</span>16 bytes <span class="nb">read</span><span class="o">)</span>
Apr 17 12:06:48 somevm kernel: random: crng init <span class="k">done
</span>Apr 17 12:06:48 somevm kernel: random: 7 urandom warning<span class="o">(</span>s<span class="o">)</span> missed due to ratelimiting</code></pre></figure>
<p>Three important points stand out:</p>
<ul>
<li>
<p>before anything else it’s the kernel entry point which requests randomness
with <code class="language-plaintext highlighter-rouge">get_random_bytes()</code> kernel function. We will explain its behavior and
usage below.</p>
</li>
<li>
<p>systemd (userspace) is also requesting randomness while bringing up the
system’s services</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">crng init</code> (crng stands for cryptographic random number generator) takes
almost 2 minutes since boot</p>
</li>
</ul>
<h3 id="kernels-get_random_bytes">kernel’s <code class="language-plaintext highlighter-rouge">get_random_bytes()</code></h3>
<p><code class="language-plaintext highlighter-rouge">get_random_bytes()</code> is an in-kernel interface to provide random bytes. In our
case, it is called from kernel’s entry point <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup> if <code class="language-plaintext highlighter-rouge">CONFIG_STACKPROTECTOR</code>
is set, which is true for kernels packaged in Debian. That message is printed
if <code class="language-plaintext highlighter-rouge">CONFIG_WARN_ALL_UNSEEDED_RANDOM</code> is not set (again true for Debian) to
inform us that we don’t have a fully seeded CRNG. In case you’re curious, these
numbers are required for GCC’s “stack-protector” feature. When a function gets
called, a random number is placed on the stack, just before the return address.
This number is called the “canary” and is validated when the function returns.
If an attacker performs a stack-based buffer overflow, the canary value will be
overwritten. The kernel will detect this attack and throw a kernel panic <sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>.</p>
<p>A quick look at the kernel codebase shows that it is unlikely that the
boot process will actually block here; rather, we have a clear indication that
the kernel’s CRNG is not properly initialized, and we’ll see how that affects
userspace processes that depend on it.</p>
<h3 id="systemd-sshservice">systemd ssh.service</h3>
<p>The following lines in dmesg show that systemd has started as well and that it
actually reads bytes from urandom, albeit uninitialized.</p>
<p>systemd allows us to print a tree of the time-critical chain of systemd units
(including services), along with the time spent starting each one. This is done
via:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># systemd-analyze critical-chain</span>
The <span class="nb">time </span>after the unit is active or started is printed after the <span class="s2">"@"</span> character.
The <span class="nb">time </span>the unit takes to start is printed after the <span class="s2">"+"</span> character.
graphical.target @1min 45.121s
└─multi-user.target @1min 45.121s
└─ssh.service @1min 34.242s +10.857s
└─network.target @3.887s
└─networking.service @1.096s +2.790s
└─network-pre.target @1.095s
└─ferm.service @288ms +807ms
└─systemd-journald.socket @287ms
└─system.slice @282ms
└─-.slice @282ms</code></pre></figure>
<p>It’s clear that the ssh service takes somewhat longer than usual to come up. Its
journal reads:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># journalctl -u ssh.service</span>
<span class="nt">--</span> Logs begin at Wed 2019-04-17 11:53:56 EEST, end at Wed 2019-04-17 11:56:43 EEST. <span class="nt">--</span>
Apr 17 11:54:00 somevm systemd[1]: Starting OpenBSD Secure Shell server...
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Start-pre operation timed out. Terminating.
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Control process exited, <span class="nv">code</span><span class="o">=</span>killed, <span class="nv">status</span><span class="o">=</span>15/TERM
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Failed with result <span class="s1">'timeout'</span><span class="nb">.</span>
Apr 17 11:55:30 somevm systemd[1]: Failed to start OpenBSD Secure Shell server.
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Service <span class="nv">RestartSec</span><span class="o">=</span>100ms expired, scheduling restart.
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Scheduled restart job, restart counter is at 1.
Apr 17 11:55:30 somevm systemd[1]: Stopped OpenBSD Secure Shell server.
Apr 17 11:55:30 somevm systemd[1]: Starting OpenBSD Secure Shell server...
Apr 17 11:55:41 somevm sshd[1184]: Server listening on 0.0.0.0 port 22.
Apr 17 11:55:41 somevm sshd[1184]: Server listening on :: port 22.
Apr 17 11:55:41 somevm systemd[1]: Started OpenBSD Secure Shell server.</code></pre></figure>
<p>It seems that ssh.service gets stuck in its <code class="language-plaintext highlighter-rouge">ExecStartPre</code> command:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># systemctl cat ssh.service | ag ExecStartPre</span>
<span class="nv">ExecStartPre</span><span class="o">=</span>/usr/sbin/sshd <span class="nt">-t</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">sshd -t</code> just checks the validity of configuration files and the sanity of keys.
So, why is it blocking? To get insight into why <code class="language-plaintext highlighter-rouge">ExecStartPre</code> times out, we
decided to instrument it like this:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c">#!/bin/sh</span>
strace <span class="nt">-f</span> <span class="nt">-c</span> <span class="nt">-w</span> /usr/sbin/sshd <span class="nt">-t</span> <span class="o">></span> /tmp/sshd_strace_<span class="sb">`</span><span class="nb">date</span> +%s<span class="sb">`</span> 2>&1</code></pre></figure>
<p>We basically wrap the <code class="language-plaintext highlighter-rouge">sshd</code> invocation with <code class="language-plaintext highlighter-rouge">strace</code> and instruct it to keep
aggregate time statistics about each system call made by the executable. Our
intention is to identify the system call sshd spends most of its time in
before finally getting killed by systemd.</p>
<p>After rebooting the VM we got our sshd strace logfiles:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># ls -l /tmp/sshd_strace*</span>
<span class="nt">-rw-r--r--</span> 1 root root 2152 Apr 17 12:49 /tmp/sshd_strace_1555494448
<span class="nt">-rw-r--r--</span> 1 root root 2152 Apr 17 12:49 /tmp/sshd_strace_1555494538</code></pre></figure>
<p>This is the output of the first attempt (which gets killed by systemd):</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># cat sshd_strace_1555494448</span>
% <span class="nb">time </span>seconds usecs/call calls errors syscall
<span class="nt">------</span> <span class="nt">-----------</span> <span class="nt">-----------</span> <span class="nt">---------</span> <span class="nt">---------</span> <span class="nt">----------------</span>
99.96 101.669156 101669156 1 getrandom
0.01 0.007609 7609 1 execve
0.01 0.006644 120 55 <span class="nb">read
</span>0.01 0.006289 49 128 mmap
0.00 0.004297 104 41 mprotect
<span class="o">[</span>...]
<span class="nt">------</span> <span class="nt">-----------</span> <span class="nt">-----------</span> <span class="nt">---------</span> <span class="nt">---------</span> <span class="nt">----------------</span>
100.00 101.706415 444 7 total</code></pre></figure>
<p>It’s self-evident that sshd spends virtually all of its time trying to acquire randomness
via the <code class="language-plaintext highlighter-rouge">getrandom()</code> system call.</p>
<p>The second systemd attempt to get sshd up actually succeeds with the strace log
reading:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># cat sshd_strace_1555494538</span>
% <span class="nb">time </span>seconds usecs/call calls errors syscall
<span class="nt">------</span> <span class="nt">-----------</span> <span class="nt">-----------</span> <span class="nt">---------</span> <span class="nt">---------</span> <span class="nt">----------------</span>
99.94 11.543144 11543143 1 getrandom
0.02 0.001813 34 52 close
0.01 0.001594 12 128 mmap
0.01 0.000753 16 47 openat
0.01 0.000585 10 55 <span class="nb">read
</span>0.00 0.000564 13 41 mprotect
<span class="o">[</span>...]
<span class="nt">------</span> <span class="nt">-----------</span> <span class="nt">-----------</span> <span class="nt">---------</span> <span class="nt">---------</span> <span class="nt">----------------</span>
100.00 11.549977 444 7 total</code></pre></figure>
<p>Notice that the second attempt succeeds (12:49:10) at exactly the moment
<code class="language-plaintext highlighter-rouge">getrandom()</code> returns a result, which coincides with the timestamp at which the
kernel’s entropy pool gets initialized:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># journalctl -k | grep random</span>
Apr 17 12:47:25 somevm kernel: random: get_random_bytes called from start_kernel+0x93/0x531 with <span class="nv">crng_init</span><span class="o">=</span>0
Apr 17 12:47:25 somevm kernel: random: fast init <span class="k">done
</span>Apr 17 12:47:25 somevm kernel: random: systemd: uninitialized urandom <span class="nb">read</span> <span class="o">(</span>16 bytes <span class="nb">read</span><span class="o">)</span>
Apr 17 12:47:25 somevm kernel: random: systemd: uninitialized urandom <span class="nb">read</span> <span class="o">(</span>16 bytes <span class="nb">read</span><span class="o">)</span>
Apr 17 12:47:25 somevm kernel: random: systemd: uninitialized urandom <span class="nb">read</span> <span class="o">(</span>16 bytes <span class="nb">read</span><span class="o">)</span>
Apr 17 12:49:10 somevm kernel: random: crng init <span class="k">done
</span>Apr 17 12:49:10 somevm kernel: random: 7 urandom warning<span class="o">(</span>s<span class="o">)</span> missed due to ratelimiting
<span class="c"># journalctl -u ssh.service</span>
<span class="nt">--</span> Logs begin at Wed 2019-04-17 12:47:25 EEST, end at Wed 2019-04-17 12:52:23 EEST. <span class="nt">--</span>
Apr 17 12:47:28 somevm systemd[1]: Starting OpenBSD Secure Shell server...
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Start-pre operation timed out. Terminating.
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Control process exited, <span class="nv">code</span><span class="o">=</span>killed, <span class="nv">status</span><span class="o">=</span>15/TERM
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Failed with result <span class="s1">'timeout'</span><span class="nb">.</span>
Apr 17 12:48:58 somevm systemd[1]: Failed to start OpenBSD Secure Shell server.
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Service <span class="nv">RestartSec</span><span class="o">=</span>100ms expired, scheduling restart.
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Scheduled restart job, restart counter is at 1.
Apr 17 12:48:58 somevm systemd[1]: Stopped OpenBSD Secure Shell server.
Apr 17 12:48:58 somevm systemd[1]: Starting OpenBSD Secure Shell server...
Apr 17 12:49:10 somevm systemd[1]: Started OpenBSD Secure Shell server.</code></pre></figure>
<p>Quick sidenote: we were curious why sshd calls <code class="language-plaintext highlighter-rouge">getrandom()</code> even when it is
merely validating its configuration. A quick look at sshd’s source code
shows that it seeds its RNG unconditionally during startup:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">ac</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">av</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">[...]</span>
<span class="n">seed_rng</span><span class="p">();</span>
<span class="p">[...]</span>
<span class="k">if</span> <span class="p">(</span><span class="n">test_flag</span> <span class="o">></span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="p">[...]</span>
<span class="n">parse_server_match_config</span><span class="p">(</span><span class="o">&</span><span class="n">options</span><span class="p">,</span> <span class="n">connection_info</span><span class="p">);</span>
<span class="n">dump_config</span><span class="p">(</span><span class="o">&</span><span class="n">options</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">[...]</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">seed_rng()</code> invokes <code class="language-plaintext highlighter-rouge">RAND_status()</code>, an OpenSSL library function which
ultimately executes <code class="language-plaintext highlighter-rouge">getrandom()</code>.</p>
<h3 id="changes-for-getrandom-system-call">Changes for <code class="language-plaintext highlighter-rouge">getrandom()</code> system call</h3>
<p>So we’ve identified that <code class="language-plaintext highlighter-rouge">ssh.service</code> blocks waiting for the <code class="language-plaintext highlighter-rouge">getrandom()</code> syscall.
Our focus then shifted to understanding why and when <code class="language-plaintext highlighter-rouge">getrandom()</code> blocks, and how that
relates to the kernel’s CRNG.</p>
<p>First, whether <code class="language-plaintext highlighter-rouge">getrandom()</code> reads from <code class="language-plaintext highlighter-rouge">/dev/urandom</code> or
<code class="language-plaintext highlighter-rouge">/dev/random</code>, and whether or not it blocks, is controlled by the relevant
flags: <code class="language-plaintext highlighter-rouge">GRND_RANDOM</code> and <code class="language-plaintext highlighter-rouge">GRND_NONBLOCK</code> (check <code class="language-plaintext highlighter-rouge">getrandom(2)</code> for more). A
quick search showed that neither OpenSSH nor OpenSSL (which OpenSSH relies on
for cryptography) sets any of these flags, meaning <code class="language-plaintext highlighter-rouge">getrandom()</code> exhibits its
default behavior: it blocks until the kernel’s CRNG is ready.</p>
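<p>These behaviors are easy to observe from userspace. Here is a minimal sketch, assuming a Linux host with Python 3.6+ (Python’s <code class="language-plaintext highlighter-rouge">os.getrandom</code> is a thin wrapper over the syscall):</p>

```python
import os

# Default behavior (no flags): reads the urandom source, but blocks
# until the kernel CRNG is initialized -- exactly what stalled sshd.
data = os.getrandom(16)
assert len(data) == 16

# GRND_NONBLOCK turns the potential block into an immediate error,
# so a caller can probe CRNG readiness instead of hanging.
try:
    os.getrandom(16, os.GRND_NONBLOCK)
    print("CRNG is ready")
except BlockingIOError:
    print("CRNG not yet initialized; getrandom() would block")
```

<p>On a long-running system the CRNG is already initialized, so both calls return immediately; only during early boot does the default path block.</p>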
<p>Since these flags are not set, either the system call or the CRNG must have changed
in the meantime. And that meant digging into kernel source code and git history… :D
Debian Stretch features kernels from the 4.9.x linux-stable tree, while
Debian Buster features kernels from the 4.19.x series.</p>
<p>Pondering over the output of <code class="language-plaintext highlighter-rouge">git log -p v4.9..v4.19 -- drivers/char/random.c</code>
is a truly enjoyable activity, but we’ll spare you the time and directly point
you to commit
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=43838a23a05fbd13e47d750d3dfd77001536dd33">43838a23a05fbd13e47</a>
by Theodore Ts’o. This commit is entitled <code class="language-plaintext highlighter-rouge">random: fix crng_ready() test</code> and
was introduced in linux 4.17 as a response to multiple
security
<a href="https://bugs.chromium.org/p/project-zero/issues/detail?id=1559">issues</a>
reported by Google’s Project Zero. It basically changes the <code class="language-plaintext highlighter-rouge">crng_ready()</code>
function to be stricter about when Linux’s CRNG is safe for cryptographic
use cases:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="n">diff</span> <span class="o">--</span><span class="n">git</span> <span class="n">a</span><span class="o">/</span><span class="n">drivers</span><span class="o">/</span><span class="kt">char</span><span class="o">/</span><span class="n">random</span><span class="p">.</span><span class="n">c</span> <span class="n">b</span><span class="o">/</span><span class="n">drivers</span><span class="o">/</span><span class="kt">char</span><span class="o">/</span><span class="n">random</span><span class="p">.</span><span class="n">c</span>
<span class="n">index</span> <span class="n">e027e7fa1472</span><span class="p">..</span><span class="n">c8ec1e70abde</span> <span class="mi">100644</span>
<span class="o">---</span> <span class="n">a</span><span class="o">/</span><span class="n">drivers</span><span class="o">/</span><span class="kt">char</span><span class="o">/</span><span class="n">random</span><span class="p">.</span><span class="n">c</span>
<span class="o">+++</span> <span class="n">b</span><span class="o">/</span><span class="n">drivers</span><span class="o">/</span><span class="kt">char</span><span class="o">/</span><span class="n">random</span><span class="p">.</span><span class="n">c</span>
<span class="err">@@</span> <span class="o">-</span><span class="mi">427</span><span class="p">,</span><span class="mi">7</span> <span class="o">+</span><span class="mi">427</span><span class="p">,</span><span class="mi">7</span> <span class="err">@@</span> <span class="k">struct</span> <span class="n">crng_state</span> <span class="n">primary_crng</span> <span class="o">=</span> <span class="p">{</span>
<span class="o">*</span> <span class="n">its</span> <span class="n">value</span> <span class="p">(</span><span class="n">from</span> <span class="mi">0</span><span class="o">-></span><span class="mi">1</span><span class="o">-></span><span class="mi">2</span><span class="p">).</span>
<span class="err">*/</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">crng_init</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="o">-</span><span class="err">#</span><span class="n">define</span> <span class="n">crng_ready</span><span class="p">()</span> <span class="p">(</span><span class="n">likely</span><span class="p">(</span><span class="n">crng_init</span> <span class="o">></span> <span class="mi">0</span><span class="p">))</span>
<span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="n">crng_ready</span><span class="p">()</span> <span class="p">(</span><span class="n">likely</span><span class="p">(</span><span class="n">crng_init</span> <span class="o">></span> <span class="mi">1</span><span class="p">))</span></code></pre></figure>
<p>But how does this commit affect the <code class="language-plaintext highlighter-rouge">getrandom()</code> syscall? The following block is
getrandom’s definition from Linux v4.9.144 (a kernel version shipped on Stretch
hosts), i.e. before <code class="language-plaintext highlighter-rouge">random: fix crng_ready() test</code> was applied.</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">crng_ready</span><span class="p">())</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&</span> <span class="n">GRND_NONBLOCK</span><span class="p">)</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EAGAIN</span><span class="p">;</span>
<span class="n">crng_wait_ready</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">signal_pending</span><span class="p">(</span><span class="n">current</span><span class="p">))</span>
<span class="k">return</span> <span class="o">-</span><span class="n">ERESTARTSYS</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nf">urandom_read</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">count</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span></code></pre></figure>
<p>Upon early boot, <code class="language-plaintext highlighter-rouge">getrandom()</code> would treat <code class="language-plaintext highlighter-rouge">crng_init == 1</code> as good enough and
would just return <code class="language-plaintext highlighter-rouge">urandom_read</code>, i.e. it would not block. This was not
considered “secure” enough. After <code class="language-plaintext highlighter-rouge">random: fix crng_ready() test</code> was applied,
<code class="language-plaintext highlighter-rouge">getrandom()</code>’s behavior changed: it would block (unless called with
<code class="language-plaintext highlighter-rouge">GRND_NONBLOCK</code>) until the CRNG was <em>really</em> cryptographically ready, i.e.
<code class="language-plaintext highlighter-rouge">crng_init == 2</code>.</p>
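<p>Restating the gate change in plain code makes the effect obvious. In the early-boot window where fast init has completed but the CRNG is not yet fully seeded, <code class="language-plaintext highlighter-rouge">crng_init == 1</code>; a sketch (Python, used here purely to mirror the kernel’s C macros) of the old versus new check:</p>

```python
# crng_init progresses 0 -> 1 (fast init done) -> 2 (crng init done).

def crng_ready_old(crng_init):
    # pre-4.17: #define crng_ready() (likely(crng_init > 0))
    return crng_init > 0

def crng_ready_new(crng_init):
    # after the fix: #define crng_ready() (likely(crng_init > 1))
    return crng_init > 1

# During the early-boot window the old kernel serves getrandom()
# immediately, while the patched kernel makes it block:
assert crng_ready_old(1) is True
assert crng_ready_new(1) is False

# Once "crng init done" is logged (crng_init == 2), both agree:
assert crng_ready_old(2) and crng_ready_new(2)
```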
<h2 id="resolution">Resolution</h2>
<p>As soon as we pinpointed why ssh (and other userspace software) could
block early in boot when calling <code class="language-plaintext highlighter-rouge">getrandom()</code>, we set out to evaluate possible
solutions. Our goal was to help the virtual machine gather “good enough”
entropy early in the boot process. Providing QEMU guests with quality entropy is
not a novel issue; rather, it’s a recurring one whenever a
cryptographically intensive application needs to operate inside a virtual machine.</p>
<p>We discarded the option of running a userspace daemon, such as HAVEGED, inside
every VM. Currently, as far as we know, there are no practical attacks
against HAVEGED, but it has received a lot of criticism for low-quality
entropy, state leaking, etc <sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup> <sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>. Also, from an infrastructure perspective,
we aim to provide everything that’s necessary to VMs without having to perform
modifications inside the guests. Users should be able to use our
virtualization infrastructure without having to modify images due to an
“unwanted” side-effect on the host’s kernel.</p>
<p>Instead, we preferred a cleaner approach and turned our attention to VirtIO RNG
<sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>. VirtIO RNG is a paravirtualized device for QEMU that exposes a hardware
RNG inside the guest. Enabling it for QEMU instances allows physical
hosts to inject randomness into virtual guests by exposing a special-purpose
device, <code class="language-plaintext highlighter-rouge">/dev/hwrng</code>. VirtIO RNG is configurable and can be wired up on the
host to retrieve entropy from various sources, such as <code class="language-plaintext highlighter-rouge">/dev/{,u}random</code> or
even a hardware RNG. The downside of this solution for us was that it was not
immediately available in our virtual machine cluster manager, Ganeti. Such a
missing feature can also be seen as a contribution opportunity, though! So Nikos
set out to implement what was missing for the KVM hypervisor in Ganeti <sup id="fnref:11" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">11</a></sup>.</p>
<p>In the meantime another possible solution emerged: <code class="language-plaintext highlighter-rouge">RDRAND</code>. This is an x86 CPU
instruction, available on modern Intel (Ivy Bridge and later) and AMD
processors, that returns random numbers supplied by the hardware’s
cryptographically secure pseudorandom number generator <sup id="fnref:12" role="doc-noteref"><a href="#fn:12" class="footnote" rel="footnote">12</a></sup>. In other words, one
may <em>trust</em> the physical CPU to provide “cryptographically secure” numbers. Using
<code class="language-plaintext highlighter-rouge">RDRAND</code> is possible under certain conditions, which we luckily met:</p>
<ul>
<li>
<p>The physical hosts’ CPUs have to support this instruction. In our case, all the bare
metal servers comprising our Ganeti cluster did indeed feature modern
enough Intel CPUs.</p>
</li>
<li>
<p>The Linux kernel has to use the randomness provided by the CPU. Indeed, this
functionality was added in Linux v4.19 by
<a href="https://lwn.net/ml/linux-kernel/20180718014344.1309-1-tytso@mit.edu/">Theodore Ts’o</a>
and has been enabled in
<a href="https://salsa.debian.org/kernel-team/linux/commit/9954895622f9a">Debian</a>
since <code class="language-plaintext highlighter-rouge">debian/4.19.20-1~9</code>.</p>
</li>
</ul>
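<p>Checking whether a fleet meets the first condition boils down to inspecting CPU flags. A small sketch (the <code class="language-plaintext highlighter-rouge">cpu_has_flag</code> helper is ours, for illustration) that parses <code class="language-plaintext highlighter-rouge">/proc/cpuinfo</code> on an x86 Linux host:</p>

```python
import os

def cpu_has_flag(flag, cpuinfo_path="/proc/cpuinfo"):
    """Return True if every CPU listed in cpuinfo advertises the flag."""
    flags_per_cpu = []
    with open(cpuinfo_path) as f:
        for line in f:
            # x86 cpuinfo lists capabilities on "flags" lines, one per CPU.
            if line.startswith("flags"):
                _, _, value = line.partition(":")
                flags_per_cpu.append(set(value.split()))
    return bool(flags_per_cpu) and all(flag in s for s in flags_per_cpu)

if os.path.exists("/proc/cpuinfo"):
    print("rdrand:", cpu_has_flag("rdrand"))
    print("rdseed:", cpu_has_flag("rdseed"))
```

<p>The same check from a shell is just <code class="language-plaintext highlighter-rouge">grep -q rdrand /proc/cpuinfo</code>. Note that inside a guest the flag shows up only if the hypervisor exposes it to the virtual CPU.</p>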
<p>Apart from <code class="language-plaintext highlighter-rouge">RDRAND</code>, newer Intel x86 CPUs expose yet another instruction, called
<code class="language-plaintext highlighter-rouge">RDSEED</code>. <code class="language-plaintext highlighter-rouge">RDSEED</code> returns numbers of “seed-grade entropy”: the output
of a true RNG, intended for software that seeds a pseudo-RNG. This would
provide even better quality entropy to our hosts, together with a possible
speed gain. Unfortunately, not all hosts in our fleet support this instruction,
so we dismissed the idea.</p>
<p>Finally, we were able to expose the <code class="language-plaintext highlighter-rouge">RDRAND</code> CPU flag to all our guests by simply
modifying the Ganeti cluster’s KVM <code class="language-plaintext highlighter-rouge">cpu_type</code> hypervisor parameter like so:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh">gnt-cluster modify <span class="nt">-H</span> kvm:cpu_type<span class="o">=</span>SandyBridge<span class="se">\\</span>,<span class="se">\\</span>+pcid<span class="se">\\</span>,<span class="se">\\</span>+invpcid<span class="se">\\</span>,<span class="se">\\</span>+rdrand</code></pre></figure>
<p>This allowed Buster guests to properly initialize their kernel CRNG, so that
subsequent calls to <code class="language-plaintext highlighter-rouge">getrandom()</code> no longer blocked.</p>
<p>Trusting the CPU to provide “cryptographically secure” random numbers may raise
some concerns, given that hardware vendors have been found to compromise their
products’ security and integrity when pressured or instructed by high-power,
high-influence institutions. ^_^ This is even highlighted by Theodore Ts’o in
the aforementioned commit. Our decision to use the <code class="language-plaintext highlighter-rouge">RDRAND</code>
instruction and trust the CPU came after weighing the relevant
parameters: we already trust the CPU for practically everything else, it being the
dominant component, and Debian has enabled this behavior by default.</p>
<h3 id="conclusion">Conclusion</h3>
<ul>
<li>
<p><code class="language-plaintext highlighter-rouge">RDRAND</code> and <code class="language-plaintext highlighter-rouge">RDSEED</code> help the kernel quickly initialize its CRNG,
so calls to <code class="language-plaintext highlighter-rouge">getrandom()</code> do not block and boot does not lag.
<code class="language-plaintext highlighter-rouge">RDRAND</code> provides an acceptable <em>seed</em> for randomness, not necessarily a high-quality
entropy flow. This should be acceptable for most applications/cases
where a pseudo-random generator like <code class="language-plaintext highlighter-rouge">urandom</code> is sufficient.</p>
</li>
<li>
<p>VirtIO RNG also solves the CRNG early boot starvation issue.</p>
</li>
<li>
<p>VirtIO RNG is the way to go when guest machines need high-quality (and
probably high volumes of) entropy.</p>
</li>
<li>
<p>VirtIO RNG support was not available for Ganeti at the time of our
investigation, but we worked on adding such a feature. We therefore judged <code class="language-plaintext highlighter-rouge">RDRAND</code>
an acceptable short-term solution and went for it.</p>
</li>
</ul>
<p>If you have any questions, ideas, thoughts or considerations, feel free to
leave a comment below.</p>
<h3 id="links">Links</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Publikationen/Studien/ZufallinVMS/Randomness-in-VMs.pdf <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>https://blogs.gentoo.org/marecki/2018/01/23/randomness-in-virtual-machines/ <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>https://elixir.bootlin.com/linux/v4.9.144/source/drivers/char/random.c#L52 <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910504 <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=912087 <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>https://lwn.net/Articles/584225/ <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:8" role="doc-endnote">
<p>http://www.diva-portal.org/smash/get/diva2:1141835/FULLTEXT01.pdf <a href="#fnref:8" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:9" role="doc-endnote">
<p>https://lwn.net/Articles/525459/ <a href="#fnref:9" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:10" role="doc-endnote">
<p>https://wiki.qemu.org/Features/VirtIORNG <a href="#fnref:10" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:11" role="doc-endnote">
<p>https://github.com/nkorb/ganeti/commits/feature/virtio-rng <a href="#fnref:11" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:12" role="doc-endnote">
<p>https://software.intel.com/en-us/blogs/2012/11/17/the-difference-between-rdrand-and-rdseed <a href="#fnref:12" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p><a href="https://engineering.skroutz.gr/blog/entropy-changes-in-debian/">Entropy changes in Debian or 'why a VM boots in 5 minutes?'</a> was originally published by Alexandros Afentoulis, Nikos Kormpakis at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on September 09, 2019.</p>https://engineering.skroutz.gr/blog/searching-at-skroutz-from-kafka-to-elasticsearch2019-08-26T00:00:00+00:002019-08-26T00:00:00+00:00George Papanikolaouhttps://engineering.skroutz.gr<h2 id="powering-search">Powering Search</h2>
<p>At Skroutz we make extensive use of
<a href="https://www.elastic.co/products/elasticsearch">Elasticsearch</a>. One of the
major use cases is powering the site’s search and filtering capabilities, which
assist our users in finding the product they are looking for. We are happy to
serve around 1.2M searches on an average day.</p>
<p>At the heart of search lies Elasticsearch and its documents. Each document
corresponds to a categorized, manufactured item available for sale, namely a
<em>Stock Keeping Unit</em>, or SKU for short. Searches require complex queries that
involve multiple attributes, composed from several of the SKU’s database record values
along with several fields calculated during serialization.
Some of those attributes are:</p>
<ul>
<li>SKU name</li>
<li>Category name</li>
<li>Manufacturer name</li>
<li>Minimum price</li>
<li>Current Availability</li>
</ul>
<p>Numerous changes, such as product price updates, are performed on our
relational database almost constantly. Modifications should be reflected in the
Elasticsearch index state with as little latency as possible, thus keeping the
search results up to date. The nature and origin of the changes varies, as we
collect the availability and price information from shops that we collaborate
with at regular intervals. In addition, our Content Teams continuously enrich
SKU, manufacturer or category information, which also may happen through
automated, complex pipelines, such as category classification operations.</p>
<p>It becomes apparent that we need a robust way to keep the database and the
Elasticsearch documents in sync. Our choice is asynchronous updates triggered
by hooking into
<a href="https://guides.rubyonrails.org/active_record_callbacks.html">ActiveRecord</a>, as
we are powered by <a href="https://rubyonrails.org/">Ruby on Rails</a>. We are writing to
the database synchronously, since we consider it our <em>ground truth</em>. However,
doing the same in Elasticsearch for every single event would add a major
performance overhead on each transaction, as the serialization process is
inherently expensive. Asynchronous updates allow for retries in case of
possible intermittent failures. The indexing operations are designed to be
idempotent and resilient to certain failure scenarios, so the sequence of
updates for a single document can be repeated or reordered, thus the index
state will eventually converge.</p>
<h2 id="the-beanstalk-era">The Beanstalk era</h2>
<p>Our legacy implementation used a popular tool called
<a href="https://github.com/beanstalkd/beanstalkd">beanstalk</a>; a work
queue daemon with a simple architecture. It accepts messages through the
network and holds everything in memory, while also employing a write-ahead log
for persistence. A <code class="language-plaintext highlighter-rouge">beanstalkd</code> process was co-located with every application
server of our fleet and every time an update occurred in the database, the
application enqueued a message to beanstalk. The worker process would then
consume the message and perform the necessary work.</p>
<figure>
<a href="../../../images/elastic_pipeline/beanstalk.png" class="image-popup">
<img src="../../../images/elastic_pipeline/beanstalk.png" alt="image" />
</a>
<figcaption>
<a href="../../images/elastic_pipeline/beanstalk.png">
Beanstalk pipeline architecture
</a>
</figcaption>
</figure>
<p>This pipeline has a few problems. The beanstalk ensemble is not centralized,
which translates to an uneven load distribution among workers. That
decentralized aspect also complicated our deployment process, as we had to
account for many hosts whenever we wanted to retry or debug something. Consider
what happens when a change affects multiple documents. An update on an associated
entity (such as a category name change) means that we need to update the
affected fields for all SKUs associated with said entity. As with individual
updates, this associated-entity update would be handled by a single application
server, so it would block that server’s entire queue while all the other workers sit
idle. As mentioned before, we have to use denormalization in several cases in
order to make the SKU attributes searchable.</p>
<p>Another big concern of ours was that updates for a single entity were not
ordered. An example will clarify the situation. Imagine two update events
for the same SKU occurring simultaneously or very close to each other. It’s a
matter of chance which application server will handle each request, and it is
almost certain that they will end up at different servers, and thus different
beanstalk queues. If the processing times overlap, a race condition can
occur. This is highly unlikely to happen, but we wanted to remove the
possibility entirely, since it adds mental overhead, particularly as the
application scales.</p>
<p>Here is a diagram illustrating the race condition:</p>
<figure>
<a href="../../../images/elastic_pipeline/beanstalk-flow.png" class="image-popup">
<img src="../../../images/elastic_pipeline/beanstalk-flow.png" alt="image" />
</a>
<figcaption>
<a href="../../images/elastic_pipeline/beanstalk-flow.png">
Beanstalk flow race condition
</a>
</figcaption>
</figure>
<p>This solution served us very well for many years, but due to our scaling and
operational needs, we decided it was time to move on to more sophisticated
pipelines.</p>
<h2 id="considerations-for-the-new-message-queue">Considerations for the new Message Queue</h2>
<p>Given that we usually have to process hundreds of thousands of updates daily, it was
necessary to decouple them from the primary database updates and
to be able to keep track of, monitor, and possibly automatically retry them in
case of an intermittent failure. Another concern, of an operational nature, is the
ability to perform a point-in-time recovery process, which may be needed in
case of a bug or when an index modification requires
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.4/docs-reindex.html">re-indexing</a>.
In this case, we need to identify which documents were modified during a given
time range and be able to perform the necessary update operations again,
so that the Elasticsearch state eventually converges.</p>
<p>As discussed, we needed to:</p>
<ul>
<li>Eliminate race conditions (strict ordering)</li>
<li>Introduce distributed processing (horizontal scaling)</li>
<li>Introduce persistence</li>
<li>Introduce pause and rewind capabilities</li>
</ul>
<p>Regarding the concurrency issues, we could take advantage of <a href="https://www.elastic.co/blog/elasticsearch-versioning-support">Elasticsearch
versioning</a>.
Provided that we always sent the current version of the document along
with each update request, this technique would eliminate our potential race
conditions. However, it would increase contention on both the Elasticsearch
cluster and our database, since the database would also have to be involved
to store the current Elasticsearch document version.</p>
<p>After some whiteboard sketches, we decided to go with <a href="http://kafka.apache.org">Apache
Kafka</a>, as the use case seemed well suited for it. We
are already <a href="https://engineering.skroutz.gr/blog/kafka-rails-integration/">huge
fans</a> of the
system and we have a production cluster deployed for <a href="https://engineering.skroutz.gr/blog/rewriting-web-analytics-tracking-in-go/">other company
projects</a>,
so this was a no-brainer.</p>
<h2 id="the-new-pipeline">The new pipeline</h2>
<p>Kafka is a distributed log at its
core, offering by default both distributed processing and strict ordering
guarantees. Both of these aspects are a result of an ingenious and pretty
simple decision. In Kafka, a stream of records is called a topic. A topic is
split into partitions, and the cluster allows only a single consumer within a
consumer group to read from a partition. To accomplish strict ordering and avoid race conditions, we
also need all messages that concern the same entity to be consistently stored
at the same partition. Since the client determines the partition of the topic
that a message will be stored in, this can be accomplished by using the
document ID (database primary key) as a key. Partitioning schemes may vary,
with the simplest being hashing the key value and applying a modulo operation,
with the divisor being the total number of partitions.</p>
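<p>The hash-modulo scheme can be sketched in a few lines of Ruby. This is a toy partitioner for illustration only: Kafka’s default partitioner hashes keys with murmur2, while CRC32 is used here simply because it ships with the Ruby standard library.</p>

```ruby
require "zlib"

# Toy partitioner illustrating the hash-modulo scheme described above.
# Kafka's default partitioner uses murmur2 on the message key; CRC32 is
# used here purely for illustration.
def partition_for(key, num_partitions)
  Zlib.crc32(key.to_s) % num_partitions
end

# Messages keyed by the same SKU id always map to the same partition,
# which is what gives us per-document ordering.
```

<p>Since the mapping is deterministic, every update for a given document ID lands on the same partition, and therefore on the same consumer, in publication order.</p>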
<p>Furthermore, Kafka also offers <a href="https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines">substantial
throughput</a>,
by distributing partitions evenly across many machines (called brokers). All
published messages are persisted on disk, so there is no possibility of message
loss. Messages are not removed after being consumed and Kafka stores the
per-partition offset that each consumer group has reached. The retention
period is customizable, so messages remain available for several days after
they have been consumed.</p>
<p>This explanation could go on forever if we were to get into more intricate
details about Kafka, so we’ll refer you to the <a href="https://kafka.apache.org/documentation/">official
documentation</a>.</p>
<p>Our architecture can now distribute the load to multiple consumers while
also having persistent and centralized storage. It looks like this:</p>
<figure>
<a href="../../../images/elastic_pipeline/kafka.png" class="image-popup">
<img src="../../../images/elastic_pipeline/kafka.png" alt="image" />
</a>
<figcaption>
<a href="../../images/elastic_pipeline/kafka.png">
Kafka pipeline architecture
</a>
</figcaption>
</figure>
<p>This offers us a much more future-proof architecture that can withstand growth.
It gives us the ability to quickly add more resources to a bottlenecked
component. In case our load increases in the future, topics can easily be
repartitioned to allow for more consumers in a matter of minutes, thus allowing
us to add more workers to the pool. Kafka guarantees that after the rebalance,
the order is still strict and the updates are distributed and blazingly fast.</p>
<p>The use of Kafka also allows us to have more visibility and finer operational
control on the whole pipeline process. In the old architecture, all workers had
to be stopped for the process to be paused. However, in Kafka, the position of
each consumer (which is called <em>offset</em>) is maintained by the cluster and can
be rewound based either on a timestamp or on an explicit offset position.
Therefore, we are now one command away from rewinding the consumers to the
position they were, say, two hours ago. This is a tremendous gain, in cases of
bugs or maintenance windows.</p>
<h2 id="achieving-strict-ordering">Achieving strict ordering</h2>
<p>One of the biggest problems that we faced while implementing the aforementioned solution was bulk
updates. As described, there are some kinds of updates that concern multiple
documents, such as a category update. On our legacy pipeline, these updates
were handled by the Elasticsearch <a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.4/docs-bulk.html">Bulk
API</a>
mainly for performance reasons.</p>
<p>However, since we wanted to preserve strict ordering, we needed to do some kind
of <em>unrolling</em> of those bulk updates into their respective document level
updates and enqueue those documents consistently using the same topic and
message key. We’ll take the category update as an example again. If the
category has <code class="language-plaintext highlighter-rouge">N</code> SKUs, we need a service to produce <code class="language-plaintext highlighter-rouge">N</code> messages,
one update message for each SKU.</p>
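<p>As a rough sketch of this unrolling step in Ruby (the message shape, field names and <code class="language-plaintext highlighter-rouge">sku_ids</code> input here are hypothetical, not our actual schema):</p>

```ruby
# Hypothetical unrolling of a category update into per-SKU update messages.
# `sku_ids` stands in for the ids fetched from the primary database; each
# message is keyed by the document ID so it lands on the right partition.
def unroll_category_update(category_id, sku_ids)
  sku_ids.map do |sku_id|
    {
      key:   sku_id.to_s, # document ID (primary key) as the partition key
      value: { type: "sku_update", sku_id: sku_id, cause: "category_#{category_id}" }
    }
  end
end
```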
<p>Besides correctness, another reason to implement the unrolling process was to
ensure that processing time on the consumer remains low. Kafka is generally
optimized for short per-message processing times, and consumers are required to
send periodic heartbeats as a liveness check. Failing to send heartbeats
causes a session timeout. The timeout is configurable via the
<code class="language-plaintext highlighter-rouge">session.timeout.ms</code> setting, but a high value is not recommended.</p>
<p>If a consumer is executing a long-running process, the broker can potentially
consider the consumer inactive and will trigger a rebalance, thus removing it
from the consumer group. That same message, however, will be picked again by
another consumer, after the rebalance, since the cluster thinks that the
message has not been consumed yet. One can understand that if the job is
inherently big, this can go on forever, triggering rebalances and timeouts
every time and effectively bringing the whole pipeline to a halt.</p>
<p>Implementing the above correctly was tricky because the unrolling
process itself can end up going over the Kafka processing limits. We ended up
with a “two-level unrolling” technique. Processing a bulk update message will
first split the entire document collection to batches of a predefined size
(e.g. 1000) and produce one message for each batch. When each batch message is
in turn consumed, it produces the corresponding update messages in the
document-level update topic.</p>
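<p>The two levels can be sketched as follows; the batch size, message shapes and function names are illustrative assumptions, not our production code:</p>

```ruby
# Sketch of the two-level unrolling. Level 1 splits the affected ids into
# fixed-size batch messages; level 2, run when a batch message is consumed,
# emits the individual document-level updates.
BATCH_SIZE = 1_000

def batch_messages(sku_ids, batch_size: BATCH_SIZE)
  sku_ids.each_slice(batch_size).map { |slice| { type: "batch", ids: slice } }
end

def document_messages(batch_message)
  batch_message[:ids].map { |id| { key: id.to_s, type: "doc_update" } }
end
```

<p>This keeps each individual consume short: no single message ever requires producing more than one batch worth of downstream messages.</p>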
<p>For code simplicity, developer sanity, and correctness, we considered having a
dedicated topic for each different type of update, but we settled on two.
The first topic and its consumers handle the bulk updates and enqueue into the
second topic which actually performs the Elasticsearch write requests. Of
course, most flows in our application enqueue directly into the document-level
update topic.</p>
<h2 id="adaptive-throttling">Adaptive Throttling</h2>
<p>Early on during development, we encountered a problem. Now that the bulk
updates that come through are translated to document level updates, our system
could easily flood itself: producing a message to Kafka takes on the order
of a millisecond, and during unrolling we can potentially produce
hundreds of thousands of messages.</p>
<p>Therefore, bulk updates are expected to complete at a later time
(depending on the number of affected SKUs). Individual updates, on the
other hand, should be processed with low latency, as their changes are
generally expected to be visible in search results within seconds.</p>
<p>Kafka does not support priorities at all, and we could not implement a priority
system on top of it, because we would lose the strict ordering guarantee. We
needed a mechanism which would monitor and throttle the bulk consumer processes
specifically when there were more urgent updates that needed to pass through.</p>
<p>We ended up using an external counter in order to coordinate that process. The
concept was that we would allow only a certain number of updates that originate
from a bulk update operation to be enqueued within a certain time interval.</p>
<p>The flow is as follows:</p>
<ol>
<li>A new bulk update is generated and is consumed.</li>
<li>It is <em>unrolled</em> into smaller batches, each one covering a different range
of the SKU primary-key space.</li>
<li>Batch messages are again consumed by the same consumer. If the counter is
zero, the consumer will increment it by the size of the batch. Otherwise, it
switches to a polling mode until it becomes zero, thus throttling the
process.</li>
<li>The consumer will then proceed to enqueue the document level messages.</li>
<li>The document-level consumers will pick them up, and upon
completion the counter will be decremented by one.</li>
</ol>
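<p>The counter protocol in steps (3)–(5) can be modelled with a small in-memory class. In production the counter lives in Redis and is shared by all consumer processes; this single-process sketch only illustrates the claim/decrement logic:</p>

```ruby
# In-memory model of the throttling counter (Redis INCRBY/DECR with a TTL
# in the real pipeline). A batch consumer may only enqueue its batch when
# the counter is zero; every completed document update decrements it.
class ThrottleCounter
  attr_reader :value

  def initialize
    @value = 0
  end

  # Claim capacity for a batch; returns false if the previous batch is
  # still in flight, in which case the batch consumer keeps polling.
  def try_claim(batch_size)
    return false unless @value.zero?

    @value += batch_size
    true
  end

  # Called by a document-level consumer when one update completes.
  def complete_one
    @value -= 1
  end
end

counter = ThrottleCounter.new
counter.try_claim(3)             # batch of 3 enters the pipeline => true
counter.try_claim(2)             # previous batch still in flight => false
3.times { counter.complete_one } # document-level consumers finish
counter.try_claim(2)             # window opens for the next batch => true
```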
<p>Eventually the counter will reach zero, when the batch is done, effectively
allowing the next batch to be enqueued. This enables time windows for other
updates to be enqueued and processed. Note that we also check whether we are
about to cross the Kafka <code class="language-plaintext highlighter-rouge">session.timeout.ms</code> limit at step (2) above, since the
total processing time should not exceed this threshold. So there are two
termination conditions for the polling loop.</p>
<p>The concept is that a feedback loop is established between the two consumers,
giving the batch consumer insight into whether the document-level consumers
have the capacity to process the next batch. Additionally, in cases where we
need to throttle more aggressively, we can reduce the batch size and the system
will adapt.</p>
<p><a href="https://redis.io/">Redis</a> was a strong candidate for such a counter since it
is accessible from all the consumers and can be easily monitored and operated
upon in case we needed to run ad-hoc commands for debugging reasons. Its
<a href="https://redis.io/commands/incr">atomic</a> operations and <a href="https://redis.io/commands/TTL">TTL
capabilities</a> were also important properties, as
we also have a TTL on the counter in case something goes wrong and becomes
stale.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We are pretty satisfied with this new pipeline, and we enjoyed the ride,
learning a lot about Kafka and distributed systems in general. Apart from much
greater performance, we feel our new architecture will last for many years to
come, as it offers huge flexibility to both our developer and operations teams.</p>
<p>If you have any questions, ideas, thoughts or considerations, feel free to
leave a comment below.</p>
<p><a href="https://engineering.skroutz.gr/blog/searching-at-skroutz-from-kafka-to-elasticsearch/">Searching at Skroutz: from Kafka to Elasticsearch</a> was originally published by George Papanikolaou at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on August 26, 2019.</p>https://engineering.skroutz.gr/blog/speeding-up-build-pipelines-with-mistry2019-08-23T00:00:00+00:002019-08-23T00:00:00+00:00Agis Anastasopouloshttps://engineering.skroutz.gr<p>Maintaining a high velocity in development teams requires us to continuously
improve our daily workflows. Build pipelines in particular
make up a big chunk of these workflows, since they’re involved whether
we’re developing, testing or deploying our code.</p>
<p>At Skroutz it’s not unusual to perform over 30 deployments during the
course of a day, while the test suite needs to be run even more frequently.
And that’s for the main application only.</p>
<p>As our organization grew, certain build pipelines got slow to the point where
they became too disruptive. After all, each minute we’re
waiting for a deployment to finish means we can’t work on things that matter.</p>
<p>In this post we will see how these issues led us to create
<a href="https://github.com/skroutz/mistry">mistry</a>, an open source general-purpose
build server.</p>
<h2 id="background">Background</h2>
<p>Our infrastructure is hosted and maintained in-house, so it was a
straightforward process to determine where the majority of time was spent
during our most critical pipelines.</p>
<p>With proper instrumentation set up, we could start pinpointing significantly
slow processes in our daily workflows.</p>
<h4 id="asset-compilation">Asset compilation</h4>
<p><a href="https://guides.rubyonrails.org/asset_pipeline.html">Asset Pipeline</a> is
the Ruby on Rails component that takes care of minifying, concatenating,
obfuscating and compressing web assets (mostly JS and CSS files); a process
called <em>asset compilation</em>. The compiled asset files are those served to the
end users. This can be a slow process depending on the size of the
application.</p>
<p>In most conventional Rails setups, asset compilation
happens as part of the deployment process. To deploy the main application,
we use <a href="https://capistranorb.com/">Capistrano</a>:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>cap production deploy</code></pre></figure>
<p>Capistrano then takes over and sequentially executes a bunch of
commands (copy the new code to the application servers, restart services, etc.).
One of these commands is the following:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>rails assets:precompile</code></pre></figure>
<p>This compiles the asset files and saves them to a specific path in the
local file system. Eventually the files are copied to the
application servers ready to be served to the end users.</p>
<p>In our setup, deployment commands (including asset compilation) are executed
by a dedicated machine, unsurprisingly called the Deployer. At a high level, the
process is illustrated in the following diagram.</p>
<figure>
<a href="../../../images/mistry/before.png" class="image-popup">
<img src="../../../images/mistry/before.png" alt="image" />
</a>
<figcaption>
<a href="../../images/mistry/before.png">
Deployment flow before mistry
</a>
</figcaption>
</figure>
<p>Deployer is a black box for most development teams, which means there
is no visibility into the asset compilation process.
For example, one cannot easily inspect the compiled assets for development
or debugging purposes.</p>
<p>Most important is the fact that asset compilation is <em>tightly coupled to
the deployment process</em>. This has important ramifications, one of them being the
fact that when a revision is deployed to staging and then to production,
assets have to be compiled separately <em>each time</em> for both
environments, even though the resulting files are identical.</p>
<h4 id="dependency-resolution">Dependency resolution</h4>
<p>Another process significantly slowing down our workflows was dependency
resolution.</p>
<p>In order for the main application to boot, its runtime dependencies must be
present in the system. This means that CI workers,
application servers and engineers must all go through dependency
resolution multiple times a day.</p>
<p>Dependencies are essentially Ruby libraries (a.k.a. gems) that are
managed by <a href="https://bundler.io/">Bundler</a>. Given some files that describe
the set of application dependencies along with their version constraints, Bundler
decides which gems are needed and downloads them.</p>
<p>A typical Rails monolith contains hundreds of dependencies, which makes
dependency resolution a slow process since it involves a lot of network I/O.</p>
<h2 id="the-premise">The premise</h2>
<p>By reflecting on the aforementioned processes, we spotted an opportunity to
save significant amounts of time and resources in a non-disruptive manner;
that is, without major changes to our infrastructure or
workflows.</p>
<p>We noticed a common pattern among these pipelines: a command is executed
with a certain input, <em>we wait until it’s finished</em> and then use its output. The
key observation however, is that the <em>output is purely dependent on the input</em>.</p>
<ul>
<li>in asset compilation the input is the application source code
(anyone with the code can compile the assets), while the output
is the actual assets (CSS, JS files).</li>
<li>in dependency resolution the input is the set of files that describe the
application’s dependencies and their versions, <a href="https://bundler.io/man/gemfile.5.html">Gemfile and Gemfile.lock</a>,
while the output is the resulting gem bundle (i.e. Ruby
source files).</li>
</ul>
<p>Given the above observations, we had some ideas in mind.</p>
<p>Since we know the command
will be executed sooner or later
(e.g. assets <em>will have</em> to be compiled when we eventually deploy), <strong>we can
execute it now and save its output for whenever it’s needed</strong>. So by
the time it’s actually needed, the output will be readily available,
saving a lot of time in the otherwise slow process.</p>
<p>For example, we can compile the assets right after a commit is
pushed to the master branch. This way deployment will not stall waiting for
the asset compilation; the assets will be ready and will be shipped right away
to the application servers.</p>
<p>Furthermore, the fact that the output is purely dependent on the input means
<strong>we can save outputs of individual command executions and reuse them when
identical commands (i.e. same input) are to be executed</strong>.</p>
<p>For example,
given a Gemfile and Gemfile.lock,
we can perform the dependency resolution once, save the resulting bundle and
reuse it between multiple machines that would otherwise have to go through the
same resolution process again.</p>
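<p>The second idea can be sketched as a content-addressed cache: hash the inputs, and reuse a stored artifact for identical invocations. This is a toy model of the concept, not mistry’s actual implementation:</p>

```ruby
require "digest"

# Toy model of the core idea: when the output depends only on the input,
# a build result can be cached under a digest of its inputs and reused
# for identical invocations instead of being recomputed.
class BuildCache
  def initialize
    @store = {}
  end

  # `inputs` is e.g. the contents of Gemfile and Gemfile.lock; the block
  # is the expensive build step, executed at most once per distinct input.
  def build(inputs)
    key = Digest::SHA256.hexdigest(inputs.sort.join("\0"))
    @store.fetch(key) { @store[key] = yield }
  end
end

cache = BuildCache.new
cache.build(["gem 'rails'"]) { "expensive bundle" } # runs the build
cache.build(["gem 'rails'"]) { "expensive bundle" } # cache hit, no work
```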
<p>Both of these optimizations could save us a lot of time and computational
resources.</p>
<h2 id="the-solution">The solution</h2>
<p>To bring the above ideas to life, we imagined some kind of build server able
to execute arbitrary commands inside isolated environments (we’ll call
these executions “builds”).</p>
<p>Builds produce a desired output (we’ll call these “artifacts”) that is
saved on the server and is readily available to anyone who needs it.</p>
<p>Builds can be scheduled by humans and machines alike and the resulting
artifacts can be downloaded from the server. Progress of builds can be
inspected via a web interface exposed by the server.</p>
<p>Together with the server we imagined an accompanying CLI client, offering a
drop-in replacement for the currently slow commands in our existing
pipelines. So <strong>instead of executing the actual command, we would execute the
CLI that schedules a build in the server, waits until it’s complete and then
downloads the resulting artifacts</strong>.</p>
<p>The end result would be the same as before: some files (the artifacts) are saved
in the system that executes the command. In the case of web assets,
the asset files are placed under <code class="language-plaintext highlighter-rouge">public/assets</code>. In the case of Bundler, the
gem files are placed under <code class="language-plaintext highlighter-rouge">vendor/bundle</code>.</p>
<p>This way changes in our workflows are kept to a minimum. For example, in the
deployment process only a single line would have to change, from:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># compiles assets and saves them to public/assets/</span>
<span class="nv">$ </span>rails assets:precompile</code></pre></figure>
<p>to:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># schedules a build to compile the assets, waits until it's finished and</span>
<span class="c"># downloads the resulting artifacts to public/assets/</span>
<span class="nv">$ </span>imaginary-cli build rails-assets <span class="nt">--path</span> public/assets/</code></pre></figure>
<p>After this seemingly small change however, the pipeline would be much more
efficient:</p>
<ul>
<li>work is performed <em>at most once</em> since results are reused between identical
command invocations.</li>
<li>work is performed eagerly so that <em>results are readily available by the time
they’re needed</em>.</li>
</ul>
<p>These optimizations minimize resource consumption in terms of CPU, memory and
network bandwidth but, more importantly, they make the develop-test-ship cycle
faster by reducing the execution time of our core pipelines.</p>
<h2 id="implementation">Implementation</h2>
<p>After some brainstorming sessions we had the main idea sketched out. We moved
forward with a prototype implementation after setting the initial requirements:</p>
<ol>
<li>custom build recipes and execution environments should be supported
(we call these “projects”). Anyone should be able to add their
own project.</li>
<li>builds should run in isolation from one another and in a sandboxed
environment.</li>
<li>builds should be parameterized. For example, we should be able to compile the assets
of our Rails application for a <em>specific revision</em> (i.e. SHA1 of a commit).</li>
<li>builds should be optionally incremental (a.k.a. partial builds). The Rails Asset
Pipeline for example, caches intermediate files when compiling assets so that
subsequent compilations are faster. Similarly, Bundler skips gems that are
already present in the file system. To support such cases, the server should
optionally persist selected files across builds of the same project.</li>
</ol>
<p>Containers were a natural fit for the first two requirements. We
decided that build recipes would be provided in the form of Dockerfiles.
This makes builds essentially Docker images that are executed to produce
the desired artifacts. Containers provide us with the isolation we want, while
engineers can run the builds in their own machines for debugging purposes,
using the very same images the server uses.</p>
<p>We decided that the server would expose a JSON API for clients to interact with.
Together with the server (mistryd), a client CLI (mistry) would
be used to schedule builds by interacting with the JSON API.
When scheduling a build,
one has to specify the project (recipe) and optionally some build
parameters. After it’s scheduled, the CLI blocks until the build is finished and
finally downloads the resulting artifacts using rsync.</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># schedule a build with a custom parameter (commit) using the CLI client and</span>
<span class="c"># download the artifacts when finished</span>
<span class="nv">$ </span>mistry build <span class="nt">--host</span> mistry.skroutz.gr <span class="nt">--project</span> rails-assets <span class="nt">--commit</span><span class="o">=</span>ab34af</code></pre></figure>
<p>We chose the <a href="https://en.wikipedia.org/wiki/Rsync">rsync protocol</a> for
transferring build artifacts. This means network usage is
minimized since files
are only downloaded if they are not present (or if they have changed) in the local
file system. This is important since we knew the majority of web assets
remain unchanged between application revisions. The same is true of
dependencies: they don’t change very often between commits.</p>
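<p>A toy model of the saving: only artifacts whose content digest differs from the local copy need to be transferred. (rsync itself works on rolling block checksums rather than whole-file digests; this sketch only illustrates the effect.)</p>

```ruby
# `remote` and `local` map artifact paths to content digests. Only files
# that are missing locally, or whose content differs, cross the network.
def files_to_transfer(remote, local)
  remote.reject { |path, digest| local[path] == digest }.keys
end

remote = { "application.js" => "aaa", "application.css" => "bbb" }
local  = { "application.js" => "aaa", "application.css" => "old" }
files_to_transfer(remote, local) # => ["application.css"]
```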
<h4 id="choosing-a-file-system">Choosing a file system</h4>
<p>We knew artifacts could potentially occupy a lot of disk space, since
for instance, assets would have to be saved in the
server for <em>each revision</em> of the application. As another example, in case of
dependency resolution a gem bundle can easily result in hundreds of megabytes.
Keeping different gem bundles in the server
would quickly result in excessive disk space consumption. Fortunately there
was a way to tackle this issue.</p>
<p>The key observation here is that
<em>many of these artifacts are identical between builds</em>. For example, as we
mentioned above only some
assets usually change (if at all) between revisions of the application. Also
most of the dependencies are not changed between revisions, which means a
large portion of the gem bundles remains unchanged.</p>
<p><a href="https://en.wikipedia.org/wiki/Copy-on-write">Copy-on-write</a> (CoW) file
systems to the rescue. To minimize disk usage under such
access patterns and also support incremental builds, a file system with
copy-on-write semantics was a natural fit. In a CoW file system, even if multiple
copies of the same file exist (or large files with very few differences between
them), the data blocks <em>are not actually duplicated</em>. In our case where
most of the application assets and dependencies remain unchanged, this translates
to significant disk space savings.</p>
<p>In CoW file systems, <a href="https://en.wikipedia.org/wiki/Btrfs#Cloning">cloning</a>
files or entire directories is naturally a fast operation, since data blocks
are not actually copied in the traditional sense (i.e. they’re not duplicated).
This is a great fit for incremental builds, since we can
almost instantly copy the artifacts of a previous build to serve as a starting
point for a new one.</p>
<p>We went with <a href="https://en.wikipedia.org/wiki/Btrfs">Btrfs</a>, with which we were
already familiar, as our production file system. However, we designed mistry to
support <a href="https://github.com/skroutz/mistry/wiki/File-system-adapters">pluggable file system adapters</a>.
In that sense, adding support for another file system like ZFS is
fairly straightforward.</p>
<h2 id="the-result">The result</h2>
<p>After a few iterations we had a working build server
that served all of our aforementioned needs.</p>
<p>By incorporating mistry in our build pipelines, deployment times were reduced
by up to 11 minutes (that’s how much compiling the assets previously took).
The migration was transparent and didn’t disrupt any
workflows of the engineering teams. Nothing has changed on the surface,
yet things <em>have</em> changed under the hood. During
deployment for example, Deployer does not actually
compile the assets anymore but merely fetches them from mistry.</p>
<figure>
<a href="../../../images/mistry/after.png" class="image-popup">
<img src="../../../images/mistry/after.png" alt="image" />
</a>
<figcaption>
<a href="../../images/mistry/after.png">
Deployment flow after mistry
</a>
</figcaption>
</figure>
<p>We call mistry a <em>general-purpose</em> build server because it can be used to
speed up different kinds of pipelines. Asset compilation and Bundler dependency
resolution happened to be the cases that affected <em>us</em> the most, but there are
many other potential use cases. For instance, we plan on using it to speed up
<code class="language-plaintext highlighter-rouge">yarn install</code> invocations and we recently started using it for generating our
static documentation pages.</p>
<p><a href="https://github.com/skroutz/mistry">mistry</a> is open
sourced under the GPLv3 license. There are still a lot of rough edges
(e.g. the web view is a bare-bones page without much functionality outside of
showing logs) but the core is fully functional. It can be deployed with
different kinds of file systems, although Btrfs is recommended for production
environments.</p>
<p>As a next step, we are planning to open source our build recipes for everyone
to use.</p>
<p>Documentation can be found in the <a href="https://github.com/skroutz/mistry/blob/master/README.md">README</a>
and in the <a href="https://github.com/skroutz/mistry/wiki">wiki</a>. Please let us know
if something is missing.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This was the story of how we spotted the opportunity for improvements in our
daily workflows and built a tool to implement them.</p>
<p>We’ve been using mistry in production for a year and we are pretty happy with
it. There are a lot of <a href="https://github.com/skroutz/mistry/issues/">features and enhancements</a> to be done yet; contributions are more than welcome.</p>
<p>We encourage you to give <a href="https://github.com/skroutz/mistry">mistry</a>
a try if you believe it might be a good fit for your projects. Feel free to open
an <a href="https://github.com/skroutz/mistry/issues/new">issue</a> for bugs, questions
or ideas.</p>
<p>We’d be happy to hear any feedback in the comments section.</p>
<p><a href="https://engineering.skroutz.gr/blog/speeding-up-build-pipelines-with-mistry/">Speeding Up Our Build Pipelines</a> was originally published by Agis Anastasopoulos at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on August 23, 2019.</p>