<h1 id="growing-the-documentation-of-our-android-project">Growing the documentation of our Android project</h1>
<p>Leonidas Partsas, Skroutz Engineering, 2024-02-29</p>
<p>In the last few years the Android team has grown significantly and, with that, so has our codebase.
We have reached a point where the lack of documentation has become an issue, but not for the reason you might think.
Documenting how a class works is not as essential as making that class easy to discover!</p>
<p>A couple of our problems:</p>
<ul>
<li>The biggest issue we have is the inability to reason about what we support.
For example, we have a concept called <code class="language-plaintext highlighter-rouge">Section</code>. Each section has its own type, and based on that type we
render it with a different layout. Being able to see, at a glance, which types we render has become
nearly impossible, since every relevant component might reside in a different package or even module.</li>
<li><code class="language-plaintext highlighter-rouge">Do we have anything for [place need here]?</code> This question is asked a bit too often, and its answer
depends either on the memory of the rest of the team or on the efficiency of the IDE’s search, provided the
name of the function/class etc. is descriptive enough.</li>
</ul>
<h2 id="our-goal">Our goal</h2>
<p>It is clear that we need some kind of documentation that allows us to easily <strong>discover</strong> what can help us.
Documentation that, apart from listing all classes, functions etc., can include custom lists like the one with all of our sections.</p>
<p>So, based on that we decided that we need to:</p>
<ol>
<li>Have a way to group code, from different files/packages, together.</li>
<li>Be able to add a visual hint such as an image (a picture is worth a thousand words).</li>
<li>Have docs that contain <strong>only</strong> the code that has comments. Everything else is just a distraction.</li>
</ol>
<h2 id="dokka">Dokka</h2>
<p>We decided to use <a href="https://github.com/Kotlin/dokka">Dokka</a> to achieve our goal.
It is a tool written and maintained by JetBrains, and it can be extended through a plugin system, allowing each team
to add the functionality it needs.</p>
<h4 id="dokkas-flow">Dokka’s flow</h4>
<p>In a very abstract and simplified way, we can describe Dokka’s flow like this:
<img src="https://engineering.skroutz.gr/images/growing-documentation/dokka-in-high-level.png" alt="img" /></p>
<ul>
<li>First, you provide it with anything that can be represented by modules, classes, functions etc. This is the <code class="language-plaintext highlighter-rouge">Input</code>.</li>
<li>That input is translated into a list of <code class="language-plaintext highlighter-rouge">Documentables</code>, where each documentable is one of
the aforementioned concepts.</li>
<li>The documentables are then transformed into a tree of <code class="language-plaintext highlighter-rouge">Pages</code> (one page per documentable), where each page is a collection of information
represented by structures such as titles, texts, links etc.</li>
<li>Finally, these pages are rendered to a desired format such as an HTML or Markdown page. This is the <code class="language-plaintext highlighter-rouge">Output</code>.</li>
</ul>
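<p>The flow above can be sketched as a simple pipeline. The following is a toy model, not Dokka’s actual API; all type and function names here are illustrative:</p>

```kotlin
// Toy model of Dokka's flow: Input -> Documentables -> Pages -> Output.
// These types are illustrative stand-ins; Dokka's real API is far richer.
data class Documentable(val name: String, val comment: String?)
data class Page(val title: String, val content: String)

// 1. Translate the input (here: name/comment pairs) into documentables.
fun translate(input: List<Pair<String, String?>>): List<Documentable> =
    input.map { (name, comment) -> Documentable(name, comment) }

// 2. Transform the documentables into pages, one page per documentable.
fun toPages(documentables: List<Documentable>): List<Page> =
    documentables.map { Page(it.name, it.comment.orEmpty()) }

// 3. Render the pages to the desired output format (here: Markdown-like text).
fun render(pages: List<Page>): String =
    pages.joinToString("\n\n") { "# ${it.title}\n${it.content}" }
```

<p>Chaining the three steps (<code class="language-plaintext highlighter-rouge">render(toPages(translate(input)))</code>) mirrors the Input → Documentables → Pages → Output arrows in the diagram.</p>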
<h4 id="entry-points-for-plugins">Entry points for plugins</h4>
<p>You might be wondering: where do we write our plugin’s code? For that we need to see the above flow in more detail:
<img src="https://engineering.skroutz.gr/images/growing-documentation/dokka-in-low-level.png" alt="img" /></p>
<p>Here, every arrow is an extension point:</p>
<ol>
<li>By default, Dokka provides a way to translate Java/Kotlin code into documentables, but it also allows us to add our own translations.
The resulting documentables are organized in modules. These are not necessarily the modules we have in our project,
even though that is the case in an Android project.</li>
<li>At this point Dokka provides us with a list of modules and allows us to transform them however we need.
We can add, remove or change all kinds of documentables, including the list of provided modules.</li>
<li>Here is where all modules are merged into one. Dokka expects a single merger and
provides a default implementation for it. Anything we provide must override the default one.</li>
<li>Yet another transformation point, like in step 2, only this time we have a single module with all documentables in it.</li>
<li>Moving from documentables to pages, Dokka expects a single translator. Again,
it provides a default implementation, and anything we provide must override it.</li>
<li>At this point Dokka provides us with a tree of pages and the ability to add one or more transformations
for that tree. We can modify the tree by adding, removing or updating a page.</li>
<li>The final entry point is where Dokka allows us to provide our own renderer. By default it uses one of
its own implementations, which renders the tree of pages into HTML pages.</li>
</ol>
<h4 id="documentation-node">Documentation node</h4>
<p>Creating documentation relies on two things: the code and, of course, the comments.</p>
<p>If a piece of code has a doc-comment, its corresponding documentable will have a documentation node
which is nothing more than a list of <code class="language-plaintext highlighter-rouge">TagWrapper</code>s.
A <code class="language-plaintext highlighter-rouge">TagWrapper</code> is used to represent anything that KDoc supports (the description, both summary and detailed,
the author, the since tag etc.) plus any custom tag that is used to extend KDoc. Such a custom tag
is represented in code by <code class="language-plaintext highlighter-rouge">CustomTagWrapper</code>.</p>
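<p>As a rough sketch of that structure (simplified stand-ins whose names mirror Dokka’s <code class="language-plaintext highlighter-rouge">TagWrapper</code>/<code class="language-plaintext highlighter-rouge">CustomTagWrapper</code>, not the actual classes):</p>

```kotlin
// Simplified stand-ins for Dokka's tag wrappers and documentation node.
sealed interface TagWrapper { val text: String }
data class Description(override val text: String) : TagWrapper
data class Author(override val text: String) : TagWrapper
// A custom tag also carries the tag's name, e.g. "tags" for our @tags block-tag.
data class CustomTagWrapper(val name: String, override val text: String) : TagWrapper

// A documentation node is nothing more than a list of tag wrappers.
data class DocumentationNode(val children: List<TagWrapper>)

// Convenience lookup for a custom tag by name.
fun DocumentationNode.customTag(name: String): CustomTagWrapper? =
    children.filterIsInstance<CustomTagWrapper>().firstOrNull { it.name == name }
```

<p>A doc-comment with a description and a <code class="language-plaintext highlighter-rouge">@tags</code> line would map to a node with one <code class="language-plaintext highlighter-rouge">Description</code> and one <code class="language-plaintext highlighter-rouge">CustomTagWrapper</code>.</p>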
<h2 id="skroutz-dokka-plugin">Skroutz Dokka Plugin</h2>
<h4 id="first-steps">First steps</h4>
<p>We decided to have the plugin as part of our repository.</p>
<p>For that we:</p>
<ol>
<li>created a Java/Kotlin library module and made it depend on <code class="language-plaintext highlighter-rouge">org.jetbrains.dokka:dokka-core</code> and <code class="language-plaintext highlighter-rouge">org.jetbrains.dokka:dokka-base</code>.</li>
<li>created a class that extends <code class="language-plaintext highlighter-rouge">DokkaPlugin</code> and</li>
<li>added a file named <em>org.jetbrains.dokka.plugability.DokkaPlugin</em> in the module’s resource folder (src/main/resources) under the path <code class="language-plaintext highlighter-rouge">META-INF/services</code>.
The file points to the class we created: <code class="language-plaintext highlighter-rouge">gr.skroutz.dokka.plugin.SkzDokkaPlugin</code>.</li>
</ol>
<p>Now, every time we run one of Dokka’s Gradle tasks (e.g. dokkaHtmlMultiModule), our plugin’s code is loaded
and executed for every module that is configured to create documentation.</p>
<p>Configuring a module:</p>
<ol>
<li>Dokka must be added in the <code class="language-plaintext highlighter-rouge">plugins { }</code> section and</li>
<li>Our plugin must be given as a dependency: <code class="language-plaintext highlighter-rouge">dokkaPlugin(project(":skroutz-dokka-plugin"))</code></li>
</ol>
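<p>Put together, the build script of a documented module might look roughly like this. This is a Gradle Kotlin DSL sketch; the plugin id and module name come from the steps above, everything else is an assumption about your setup:</p>

```kotlin
// build.gradle.kts of a module that should produce documentation (sketch)
plugins {
    id("org.jetbrains.dokka")
}

dependencies {
    // Attach our plugin to this module's Dokka tasks only
    dokkaPlugin(project(":skroutz-dokka-plugin"))
}
```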
<h4 id="have-docs-that-contain-only-the-code-that-has-comments">Have docs that contain only the code that has comments</h4>
<p>Even though it was not first on our list, this was the place to start, since we did not want the
clutter of many documentables that offer nothing because they have no comments.</p>
<p>By default, Dokka creates a page for every documentable. We didn’t want that. If our documentation has
a page, it is because there is a comment behind it.</p>
<p>For that we chose to go with entry point #2 and wrote a new <code class="language-plaintext highlighter-rouge">PreMergeDocumentableTransformer</code>.</p>
<p>Its job is to filter the provided list of modules and keep only those that have at least one package which, in turn,
has at least one documentable with a comment.</p>
<p>Implementation notes:</p>
<ul>
<li>We used <code class="language-plaintext highlighter-rouge">SuppressedByConditionDocumentableFilterTransformer</code>, which is designed for exactly that: deciding whether a documentable should be suppressed or not:</li>
</ul>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">private</span> <span class="kd">class</span> <span class="nc">KeepOnlyDocumentablesWithComments</span><span class="p">(</span>
<span class="n">context</span><span class="p">:</span> <span class="nc">DokkaContext</span>
<span class="p">):</span> <span class="nc">SuppressedByConditionDocumentableFilterTransformer</span><span class="p">(</span><span class="n">context</span><span class="p">)</span> <span class="p">{</span>
<span class="k">override</span> <span class="k">fun</span> <span class="nf">shouldBeSuppressed</span><span class="p">(</span><span class="n">d</span><span class="p">:</span> <span class="nc">Documentable</span><span class="p">):</span> <span class="nc">Boolean</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">d</span> <span class="p">!</span><span class="k">is</span> <span class="nc">DPackage</span> <span class="p">&&</span> <span class="p">!</span><span class="n">d</span><span class="p">.</span><span class="nf">hasDocumentation</span><span class="p">()</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<ul>
<li>We used an extension function for checking if a documentable has comments:</li>
</ul>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">internal</span> <span class="k">fun</span> <span class="nc">Documentable</span><span class="p">.</span><span class="nf">hasDocumentation</span><span class="p">():</span> <span class="nc">Boolean</span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">hasDocumentation</span> <span class="p">=</span> <span class="n">documentation</span><span class="p">.</span><span class="n">values</span><span class="p">.</span><span class="nf">flatMap</span> <span class="p">{</span> <span class="n">it</span><span class="p">.</span><span class="n">children</span> <span class="p">}.</span><span class="nf">isNotEmpty</span><span class="p">()</span>
<span class="k">if</span> <span class="p">(</span><span class="n">hasDocumentation</span><span class="p">)</span> <span class="k">return</span> <span class="k">true</span>
<span class="k">return</span> <span class="n">children</span><span class="p">.</span><span class="nf">any</span> <span class="p">{</span> <span class="n">it</span><span class="p">.</span><span class="nf">hasDocumentation</span><span class="p">()</span> <span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>The key part here is the recursion. It supports cases like a class
that, on its own, does not have a comment but one of its properties/methods does.</p>
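<p>The same recursive idea, stripped down to a self-contained toy (plain data classes instead of Dokka’s <code class="language-plaintext highlighter-rouge">Documentable</code>; the names are illustrative):</p>

```kotlin
// Toy documentable: an optional comment plus child nodes (properties, methods, ...).
data class Node(val comment: String? = null, val children: List<Node> = emptyList())

// A node "has documentation" if it, or any of its descendants, carries a comment.
fun Node.hasDocumentation(): Boolean =
    comment != null || children.any { it.hasDocumentation() }
```

<p>An uncommented class whose only property is documented is still kept, because the property’s comment bubbles up through the recursion.</p>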
<h4 id="have-a-way-to-group-code-from-different-filespackages-together">Have a way to group code, from different files/packages, together.</h4>
<p>The combination of Dokka and KDoc allows the usage of custom block-tags, so we decided to leverage that
to create groups of code. Each time we want a certain class/function etc. to be part of a group,
we <em>tag it</em> by using <code class="language-plaintext highlighter-rouge">@tags name-of-group</code> in its doc-comment:</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="cm">/**
* Renders a SKU in the list layout.
*
* @tags section item, rendered sku
*/</span></code></pre></figure>
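<p>Since the tag’s value is a comma-separated list of group names, extracting them is a simple split-and-trim. This is a sketch with a hypothetical helper name; in the plugin the names come out of the <code class="language-plaintext highlighter-rouge">CustomTagWrapper</code>, not a raw string:</p>

```kotlin
// Extract the group names from a raw "@tags" value such as "section item, rendered sku".
fun parseTagNames(raw: String): List<String> =
    raw.split(',').map { it.trim() }.filter { it.isNotEmpty() }
```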
<p>For that we had to implement yet another <code class="language-plaintext highlighter-rouge">PreMergeDocumentableTransformer</code>.</p>
<p>Its job is to</p>
<ol>
<li>collect, from all modules, all the documentables whose comment includes our custom block tag</li>
<li>group them by the tag’s name</li>
<li>create a package for every group (tag)</li>
<li>create a module that has all these new packages</li>
</ol>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">internal</span> <span class="kd">class</span> <span class="nc">CreateTagsModule</span> <span class="p">:</span> <span class="nc">PreMergeDocumentableTransformer</span> <span class="p">{</span>
<span class="k">override</span> <span class="k">fun</span> <span class="nf">invoke</span><span class="p">(</span><span class="n">modules</span><span class="p">:</span> <span class="nc">List</span><span class="p"><</span><span class="nc">DModule</span><span class="p">>):</span> <span class="nc">List</span><span class="p"><</span><span class="nc">DModule</span><span class="p">></span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">allDocumentables</span> <span class="p">=</span> <span class="n">modules</span>
<span class="p">.</span><span class="nf">flatMap</span> <span class="p">{</span> <span class="n">module</span> <span class="p">-></span> <span class="n">module</span><span class="p">.</span><span class="n">packages</span> <span class="p">}</span>
<span class="p">.</span><span class="nf">flatMap</span> <span class="p">{</span> <span class="n">pckg</span> <span class="p">-></span> <span class="n">pckg</span><span class="p">.</span><span class="nf">allDocumentables</span><span class="p">()</span> <span class="p">}</span>
<span class="kd">val</span> <span class="py">hasTags</span> <span class="p">=</span> <span class="n">allDocumentables</span>
<span class="p">.</span><span class="nf">any</span> <span class="p">{</span> <span class="n">documentable</span> <span class="p">-></span> <span class="n">documentable</span><span class="p">.</span><span class="nf">hasTags</span><span class="p">()</span> <span class="p">}</span>
<span class="kd">val</span> <span class="py">sourceSets</span> <span class="p">=</span> <span class="n">modules</span><span class="p">.</span><span class="nf">first</span><span class="p">().</span><span class="n">sourceSets</span>
<span class="k">return</span> <span class="k">if</span> <span class="p">(</span><span class="n">hasTags</span><span class="p">)</span> <span class="n">modules</span> <span class="p">+</span> <span class="nf">createTagsModule</span><span class="p">(</span><span class="n">allDocumentables</span><span class="p">,</span> <span class="n">sourceSets</span><span class="p">)</span> <span class="k">else</span> <span class="n">modules</span>
<span class="p">}</span>
<span class="k">private</span> <span class="k">fun</span> <span class="nf">createTagsModule</span><span class="p">(</span><span class="n">allDocumentables</span><span class="p">:</span> <span class="nc">List</span><span class="p"><</span><span class="nc">Documentable</span><span class="p">>,</span> <span class="n">sourceSets</span><span class="p">:</span> <span class="nc">Set</span><span class="p"><</span><span class="nc">DokkaConfiguration</span><span class="p">.</span><span class="nc">DokkaSourceSet</span><span class="p">>):</span> <span class="nc">DModule</span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">tagPackages</span> <span class="p">=</span> <span class="n">allDocumentables</span>
<span class="p">.</span><span class="nf">filter</span> <span class="p">{</span> <span class="n">documentable</span> <span class="p">-></span> <span class="n">documentable</span><span class="p">.</span><span class="nf">hasTags</span><span class="p">()</span> <span class="p">}</span>
<span class="p">.</span><span class="nf">flatMap</span> <span class="p">{</span> <span class="n">documentable</span> <span class="p">-></span> <span class="n">documentable</span><span class="p">.</span><span class="nf">allTags</span><span class="p">().</span><span class="nf">map</span> <span class="p">{</span> <span class="n">tag</span> <span class="p">-></span> <span class="n">tag</span> <span class="n">to</span> <span class="n">documentable</span> <span class="p">}</span> <span class="p">}</span>
<span class="p">.</span><span class="nf">groupBy</span> <span class="p">{</span> <span class="n">entry</span> <span class="p">-></span> <span class="n">entry</span><span class="p">.</span><span class="n">first</span> <span class="p">}</span>
<span class="p">.</span><span class="nf">mapValues</span> <span class="p">{</span> <span class="n">entry</span> <span class="p">-></span> <span class="n">entry</span><span class="p">.</span><span class="n">value</span><span class="p">.</span><span class="nf">map</span> <span class="p">{</span> <span class="n">it</span><span class="p">.</span><span class="n">second</span> <span class="p">}</span> <span class="p">}</span>
<span class="p">.</span><span class="nf">map</span> <span class="p">{</span> <span class="n">entry</span> <span class="p">-></span> <span class="nf">createTagPackage</span><span class="p">(</span><span class="n">entry</span><span class="p">.</span><span class="n">key</span><span class="p">,</span> <span class="n">entry</span><span class="p">.</span><span class="n">value</span><span class="p">,</span> <span class="n">sourceSets</span><span class="p">)</span> <span class="p">}</span>
<span class="k">return</span> <span class="nc">DModule</span><span class="p">(</span>
<span class="n">name</span> <span class="p">=</span> <span class="nc">TAGS</span><span class="p">,</span>
<span class="n">packages</span> <span class="p">=</span> <span class="n">tagPackages</span><span class="p">,</span>
<span class="n">documentation</span> <span class="p">=</span> <span class="nf">emptyMap</span><span class="p">(),</span>
<span class="n">sourceSets</span> <span class="p">=</span> <span class="nf">emptySet</span><span class="p">()</span>
<span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>Implementation notes:</p>
<ul>
<li>Dokka does not allow a documentable to be part of more than one page. This means that
simply creating a new package with the tagged documentables would cause a failure. That is why,
for every new package, we made copies of the necessary documentables and added those to it.</li>
</ul>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">private</span> <span class="k">fun</span> <span class="nc">Documentable</span><span class="p">.</span><span class="nf">makeCopyForTag</span><span class="p">(</span><span class="n">tag</span><span class="p">:</span> <span class="nc">String</span><span class="p">):</span> <span class="nc">Documentable</span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">newDri</span> <span class="p">=</span> <span class="n">dri</span><span class="p">.</span><span class="nf">copy</span><span class="p">(</span><span class="n">extra</span> <span class="p">=</span> <span class="n">tag</span><span class="p">)</span>
<span class="k">return</span> <span class="k">when</span> <span class="p">(</span><span class="k">this</span><span class="p">)</span> <span class="p">{</span>
<span class="k">is</span> <span class="nc">DFunction</span> <span class="p">-></span> <span class="nf">copy</span><span class="p">(</span><span class="n">dri</span> <span class="p">=</span> <span class="n">newDri</span><span class="p">,</span> <span class="n">extra</span> <span class="p">=</span> <span class="nc">PropertyContainer</span><span class="p">.</span><span class="nf">withAll</span><span class="p">(</span><span class="nc">IsCopy</span><span class="p">))</span>
<span class="k">is</span> <span class="nc">DProperty</span> <span class="p">-></span> <span class="nf">copy</span><span class="p">(</span><span class="n">dri</span> <span class="p">=</span> <span class="n">newDri</span><span class="p">,</span> <span class="n">extra</span> <span class="p">=</span> <span class="nc">PropertyContainer</span><span class="p">.</span><span class="nf">withAll</span><span class="p">(</span><span class="nc">IsCopy</span><span class="p">))</span>
<span class="k">is</span> <span class="nc">DTypeAlias</span> <span class="p">-></span> <span class="nf">copy</span><span class="p">(</span><span class="n">dri</span> <span class="p">=</span> <span class="n">newDri</span><span class="p">,</span> <span class="n">extra</span> <span class="p">=</span> <span class="nc">PropertyContainer</span><span class="p">.</span><span class="nf">withAll</span><span class="p">(</span><span class="nc">IsCopy</span><span class="p">))</span>
<span class="k">is</span> <span class="nc">DClasslike</span> <span class="p">-></span> <span class="nf">makeCopy</span><span class="p">(</span><span class="n">newDri</span><span class="p">)</span>
<span class="k">is</span> <span class="nc">DParameter</span> <span class="p">-></span> <span class="nf">copy</span><span class="p">(</span><span class="n">dri</span> <span class="p">=</span> <span class="n">newDri</span><span class="p">,</span> <span class="n">extra</span> <span class="p">=</span> <span class="nc">PropertyContainer</span><span class="p">.</span><span class="nf">withAll</span><span class="p">(</span><span class="nc">IsCopy</span><span class="p">))</span>
<span class="k">else</span> <span class="p">-></span> <span class="k">throw</span> <span class="nc">IllegalStateException</span><span class="p">(</span><span class="s">"I don't know what to do with $this"</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<ul>
<li>This transformer is set to run after the one that filters out all documentables with no comments:</li>
</ul>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">internal</span> <span class="kd">val</span> <span class="py">createTagsModule</span> <span class="k">by</span> <span class="nf">extending</span> <span class="p">{</span>
<span class="n">dokkaBase</span><span class="p">.</span><span class="n">preMergeDocumentableTransformer</span> <span class="n">with</span> <span class="nc">CreateTagsModule</span><span class="p">()</span> <span class="nf">order</span> <span class="p">{</span> <span class="nf">after</span><span class="p">(</span><span class="n">suppressDocumentablesWithNoDocumentation</span><span class="p">)</span> <span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<h6 id="showing-tags-in-the-documentables-page">Showing tags in the documentable’s page</h6>
<p>One thing we wanted was to have our custom tags render in a page just like <code class="language-plaintext highlighter-rouge">@since</code> or <code class="language-plaintext highlighter-rouge">@author</code> do.</p>
<p>For that, Dokka provides an abstraction (<code class="language-plaintext highlighter-rouge">CustomTagContentProvider</code>) that you can implement to define the way you want your
custom tag to be structured.</p>
<p>For our <code class="language-plaintext highlighter-rouge">@tags</code> tag we chose to go with a title and the tags underneath it:</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">override</span> <span class="k">fun</span> <span class="nc">PageContentBuilder</span><span class="p">.</span><span class="nc">DocumentableContentBuilder</span><span class="p">.</span><span class="nf">contentForDescription</span><span class="p">(</span>
<span class="n">sourceSet</span><span class="p">:</span> <span class="nc">DokkaConfiguration</span><span class="p">.</span><span class="nc">DokkaSourceSet</span><span class="p">,</span>
<span class="n">customTag</span><span class="p">:</span> <span class="nc">CustomTagWrapper</span>
<span class="p">)</span> <span class="p">{</span>
<span class="nf">group</span><span class="p">(</span><span class="n">sourceSets</span> <span class="p">=</span> <span class="nf">setOf</span><span class="p">(</span><span class="n">sourceSet</span><span class="p">),</span> <span class="n">styles</span> <span class="p">=</span> <span class="nf">emptySet</span><span class="p">())</span> <span class="p">{</span>
<span class="nf">header</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="nc">TAGS</span><span class="p">)</span>
<span class="nf">comment</span><span class="p">(</span><span class="n">customTag</span><span class="p">.</span><span class="n">root</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<h6 id="making-tags-searchable">Making tags searchable</h6>
<p>One of the <code class="language-plaintext highlighter-rouge">PageTransformer</code>s (entry point #6) that Dokka offers out of the box is <code class="language-plaintext highlighter-rouge">SearchbarDataInstaller</code>.
Its job is to create the file that populates the search functionality.</p>
<p>We decided to add a descendant of <code class="language-plaintext highlighter-rouge">SearchbarDataInstaller</code> that creates a search record for every
tag we come across. For that we made sure that, when a package-related page gets processed, we check
whether it contains a tag-package documentable and, if it does, we create a search record for that tag:</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">override</span> <span class="k">fun</span> <span class="nf">processPage</span><span class="p">(</span><span class="n">page</span><span class="p">:</span> <span class="nc">PageNode</span><span class="p">):</span> <span class="nc">List</span><span class="p"><</span><span class="nc">SignatureWithId</span><span class="p">></span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">page</span><span class="p">.</span><span class="nf">isCopy</span><span class="p">())</span> <span class="k">return</span> <span class="nf">emptyList</span><span class="p">()</span>
<span class="k">if</span> <span class="p">(</span><span class="n">page</span> <span class="p">!</span><span class="k">is</span> <span class="nc">PackagePageNode</span><span class="p">)</span> <span class="k">return</span> <span class="k">super</span><span class="p">.</span><span class="nf">processPage</span><span class="p">(</span><span class="n">page</span><span class="p">)</span>
<span class="kd">val</span> <span class="py">tagPackage</span> <span class="p">=</span> <span class="n">page</span><span class="p">.</span><span class="n">documentables</span><span class="p">.</span><span class="nf">firstOrNull</span> <span class="p">{</span> <span class="n">it</span> <span class="k">is</span> <span class="nc">DPackage</span> <span class="p">&&</span> <span class="n">it</span><span class="p">.</span><span class="n">extra</span><span class="p">[</span><span class="nc">IsTagPackage</span><span class="p">]</span> <span class="p">!=</span> <span class="k">null</span> <span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">tagPackage</span> <span class="p">!=</span> <span class="k">null</span><span class="p">)</span> <span class="p">{</span>
<span class="n">tagPackageNames</span><span class="p">.</span><span class="nf">add</span><span class="p">(</span><span class="n">page</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
<span class="k">return</span> <span class="n">page</span><span class="p">.</span><span class="n">dri</span><span class="p">.</span><span class="nf">map</span> <span class="p">{</span> <span class="nc">SignatureWithId</span><span class="p">(</span><span class="n">it</span><span class="p">,</span> <span class="n">page</span><span class="p">)</span> <span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="k">super</span><span class="p">.</span><span class="nf">processPage</span><span class="p">(</span><span class="n">page</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">override</span> <span class="k">fun</span> <span class="nf">createSearchRecord</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="nc">String</span><span class="p">,</span> <span class="n">description</span><span class="p">:</span> <span class="nc">String</span><span class="p">?,</span> <span class="n">location</span><span class="p">:</span> <span class="nc">String</span><span class="p">,</span> <span class="n">searchKeys</span><span class="p">:</span> <span class="nc">List</span><span class="p"><</span><span class="nc">String</span><span class="p">>):</span> <span class="nc">SearchRecord</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">name</span> <span class="p">!</span><span class="k">in</span> <span class="n">tagPackageNames</span><span class="p">)</span> <span class="k">return</span> <span class="k">super</span><span class="p">.</span><span class="nf">createSearchRecord</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">description</span><span class="p">,</span> <span class="n">location</span><span class="p">,</span> <span class="n">searchKeys</span><span class="p">)</span>
<span class="kd">val</span> <span class="py">tag</span> <span class="p">=</span> <span class="n">name</span><span class="p">.</span><span class="nf">removePrefix</span><span class="p">(</span><span class="nc">TAG_PACKAGE_PREFIX</span><span class="p">)</span>
<span class="k">return</span> <span class="nc">SearchRecord</span><span class="p">(</span>
<span class="n">name</span><span class="p">,</span>
<span class="n">tag</span><span class="p">,</span>
<span class="n">location</span><span class="p">,</span>
<span class="nf">listOf</span><span class="p">(</span><span class="n">tag</span><span class="p">)</span>
<span class="p">)</span>
<span class="p">}</span></code></pre></figure>
<p>Implementation notes:</p>
<ul>
<li>In order to have our transformer executed, we had to override the default one:</li>
</ul>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">internal</span> <span class="kd">val</span> <span class="py">makeTagsSearchable</span> <span class="k">by</span> <span class="nf">extending</span> <span class="p">{</span>
<span class="n">dokkaBase</span><span class="p">.</span><span class="n">htmlPreprocessors</span> <span class="n">providing</span> <span class="o">::</span><span class="nc">MakeTagsSearchable</span> <span class="k">override</span> <span class="n">dokkaBase</span><span class="p">.</span><span class="n">baseSearchbarDataInstaller</span>
<span class="p">}</span></code></pre></figure>
<ul>
<li>Every documentable provides a container where you can add custom properties.
We used that to mark every copied documentable with the property <code class="language-plaintext highlighter-rouge">IsCopy</code> and
every tag-package with <code class="language-plaintext highlighter-rouge">IsTagPackage</code> during the documentables’ transformations.</li>
</ul>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">internal</span> <span class="n">data</span> <span class="kd">object</span> <span class="nc">IsCopy</span> <span class="p">:</span> <span class="nc">ExtraProperty</span><span class="p"><</span><span class="nc">Documentable</span><span class="p">>,</span> <span class="nc">ExtraProperty</span><span class="p">.</span><span class="nc">Key</span><span class="p"><</span><span class="nc">Documentable</span><span class="p">,</span> <span class="nc">IsCopy</span><span class="p">></span> <span class="p">{</span>
<span class="k">override</span> <span class="kd">val</span> <span class="py">key</span><span class="p">:</span> <span class="nc">ExtraProperty</span><span class="p">.</span><span class="nc">Key</span><span class="p"><</span><span class="nc">Documentable</span><span class="p">,</span> <span class="err">*</span><span class="p">></span> <span class="p">=</span> <span class="nc">IsCopy</span>
<span class="p">}</span>
<span class="k">internal</span> <span class="kd">class</span> <span class="nc">IsTagPackage</span><span class="p">(</span><span class="kd">val</span> <span class="py">tag</span><span class="p">:</span> <span class="nc">String</span><span class="p">)</span> <span class="p">:</span> <span class="nc">ExtraProperty</span><span class="p"><</span><span class="nc">DPackage</span><span class="p">></span> <span class="p">{</span>
<span class="k">override</span> <span class="kd">val</span> <span class="py">key</span><span class="p">:</span> <span class="nc">ExtraProperty</span><span class="p">.</span><span class="nc">Key</span><span class="p"><</span><span class="nc">DPackage</span><span class="p">,</span> <span class="err">*</span><span class="p">></span> <span class="k">get</span><span class="p">()</span> <span class="p">=</span> <span class="nc">IsTagPackage</span>
<span class="k">internal</span> <span class="k">companion</span> <span class="k">object</span> <span class="p">:</span> <span class="nc">ExtraProperty</span><span class="p">.</span><span class="nc">Key</span><span class="p"><</span><span class="nc">DPackage</span><span class="p">,</span> <span class="nc">IsTagPackage</span><span class="p">></span>
<span class="p">}</span></code></pre></figure>
<p>This way we were able to keep only the pages that contained our tags.</p>
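<p>To make the pattern concrete, here is a self-contained sketch of the marker-property idea. The types below are simplified stand-ins for Dokka’s <code class="language-plaintext highlighter-rouge">ExtraProperty</code> and <code class="language-plaintext highlighter-rouge">PropertyContainer</code> machinery, and the names <code class="language-plaintext highlighter-rouge">Doc</code> and <code class="language-plaintext highlighter-rouge">keepTagPackages</code> are ours, not Dokka’s: each documentable carries a bag of marker properties, and a later pass filters on them.</p>

```kotlin
// Simplified stand-ins for Dokka's ExtraProperty/PropertyContainer types;
// names and shapes here are illustrative, not the actual Dokka API.
interface ExtraProperty

object IsCopy : ExtraProperty
data class IsTagPackage(val tag: String) : ExtraProperty

data class Doc(val name: String, val extras: Set<ExtraProperty> = emptySet())

// A later pass keeps only the synthetic tag-packages, dropping everything else.
fun keepTagPackages(all: List<Doc>): List<Doc> =
    all.filter { doc -> doc.extras.any { it is IsTagPackage } }

fun main() {
    val docs = listOf(
        Doc("PillsSection", setOf(IsCopy, IsTagPackage("section"))),
        Doc("NetworkClient")
    )
    println(keepTagPackages(docs).map { it.name }) // [PillsSection]
}
```

<p>The real transformer performs the equivalent check against each documentable’s <code class="language-plaintext highlighter-rouge">extra</code> container.</p>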
<h4 id="be-able-to-add-a-visual-hint-such-as-an-image">Be able to add a visual hint such as an image</h4>
<p>Grouping code is very helpful. There are cases though, like the one with sections, where it wasn’t enough.
We wanted every group item to have a preview of how it looks so that we can easily pick and choose
what fits our needs.</p>
<p>To support that we had to break it into two parts:</p>
<ul>
<li>First we needed to add support for one more block-tag. One that will be used to provide the name of an image.</li>
<li>Then we had to make sure that the image is rendered in the resulting page.</li>
</ul>
<h6 id="the-block-tag">The block-tag</h6>
<p>We wanted to make it as easy as possible for the commenter:</p>
<ol>
<li>Take a screenshot</li>
<li>Give it the name you want (ex: <code class="language-plaintext highlighter-rouge">image-name.png</code>)</li>
<li>Move it to a specific folder (ex: <code class="language-plaintext highlighter-rouge">images/previews</code>)</li>
<li>Add the block tag <code class="language-plaintext highlighter-rouge">@preview image-name.png</code> to the comment</li>
</ol>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="cm">/**
* Renders a list of pills horizontally.
*
* @tags section
* @preview section-pills.png
*/</span> </code></pre></figure>
<p>Then, another implementation of <code class="language-plaintext highlighter-rouge">CustomTagContentProvider</code> makes sure that the block-tag is
structured as an image:</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">private</span> <span class="k">fun</span> <span class="nc">PageContentBuilder</span><span class="p">.</span><span class="nc">DocumentableContentBuilder</span><span class="p">.</span><span class="nf">previewComment</span><span class="p">(</span><span class="n">customTag</span><span class="p">:</span> <span class="nc">CustomTagWrapper</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">text</span> <span class="p">=</span> <span class="p">(</span><span class="n">customTag</span><span class="p">.</span><span class="n">root</span><span class="p">.</span><span class="n">children</span><span class="p">.</span><span class="nf">first</span><span class="p">().</span><span class="n">children</span><span class="p">.</span><span class="nf">first</span><span class="p">()</span> <span class="k">as</span> <span class="nc">Text</span><span class="p">)</span>
<span class="kd">val</span> <span class="py">customDocTag</span> <span class="p">=</span> <span class="nc">CustomDocTag</span><span class="p">(</span>
<span class="n">children</span> <span class="p">=</span> <span class="nf">listOf</span><span class="p">(</span>
<span class="nc">Img</span><span class="p">(</span>
<span class="n">params</span> <span class="p">=</span> <span class="nf">mapOf</span><span class="p">(</span>
<span class="s">"href"</span> <span class="n">to</span> <span class="s">"images/previews/${text.body}"</span><span class="p">,</span>
<span class="s">"alt"</span> <span class="n">to</span> <span class="nc">ALT_SKZ</span>
<span class="p">)</span>
<span class="p">)</span>
<span class="p">),</span>
<span class="n">name</span> <span class="p">=</span> <span class="n">customTag</span><span class="p">.</span><span class="n">name</span>
<span class="p">)</span>
<span class="nf">comment</span><span class="p">(</span><span class="n">customDocTag</span><span class="p">)</span>
<span class="p">}</span></code></pre></figure>
<h6 id="rendering-the-image">Rendering the image</h6>
<p>The content provider sets the image’s structure but, at this stage, it does not know anything about the page
that will use it. So the image’s path is not correct and the page will not be able to find it.</p>
<p>To fix it we wrote a <code class="language-plaintext highlighter-rouge">PageTransformer</code> that changes the image’s path after taking into consideration
the page’s position in the tree of pages:</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">override</span> <span class="k">fun</span> <span class="nf">invoke</span><span class="p">(</span><span class="n">input</span><span class="p">:</span> <span class="nc">RootPageNode</span><span class="p">):</span> <span class="nc">RootPageNode</span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">locationProvider</span> <span class="p">=</span> <span class="n">locationProviderFactory</span><span class="p">.</span><span class="nf">getLocationProvider</span><span class="p">(</span><span class="n">input</span><span class="p">)</span>
<span class="k">return</span> <span class="n">input</span><span class="p">.</span><span class="nf">transformContentPagesTree</span> <span class="p">{</span> <span class="n">contentPage</span> <span class="p">-></span>
<span class="kd">val</span> <span class="py">hasPreviewImage</span> <span class="p">=</span> <span class="n">contentPage</span><span class="p">.</span><span class="n">content</span><span class="p">.</span><span class="nf">allContentNodes</span><span class="p">().</span><span class="nf">any</span> <span class="p">{</span> <span class="n">it</span> <span class="k">is</span> <span class="nc">ContentEmbeddedResource</span> <span class="p">&&</span> <span class="n">it</span><span class="p">.</span><span class="n">altText</span> <span class="p">==</span> <span class="nc">ALT_SKZ</span> <span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">hasPreviewImage</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">count</span> <span class="p">=</span> <span class="n">locationProvider</span><span class="p">.</span><span class="nf">ancestors</span><span class="p">(</span><span class="n">contentPage</span><span class="p">).</span><span class="nf">count</span><span class="p">()</span>
<span class="k">return</span><span class="nd">@transformContentPagesTree</span> <span class="n">contentPage</span><span class="p">.</span><span class="nf">modified</span><span class="p">(</span>
<span class="n">content</span> <span class="p">=</span> <span class="n">contentPage</span><span class="p">.</span><span class="n">content</span><span class="p">.</span><span class="n">mapTransform</span><span class="p"><</span><span class="nc">ContentEmbeddedResource</span><span class="p">,</span> <span class="nc">ContentNode</span><span class="p">></span> <span class="p">{</span>
<span class="kd">val</span> <span class="py">prefix</span> <span class="p">=</span> <span class="s">"../"</span> <span class="p">*</span> <span class="n">count</span>
<span class="n">it</span><span class="p">.</span><span class="nf">copy</span><span class="p">(</span><span class="n">address</span> <span class="p">=</span> <span class="n">prefix</span> <span class="p">+</span> <span class="n">it</span><span class="p">.</span><span class="n">address</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="p">}</span>
<span class="n">contentPage</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
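<p>One small note on the snippet above: <code class="language-plaintext highlighter-rouge">"../" * count</code> is not standard Kotlin, as <code class="language-plaintext highlighter-rouge">String</code> has no <code class="language-plaintext highlighter-rouge">times</code> operator in the stdlib, so it presumably relies on a tiny operator extension. Here is a hedged, self-contained sketch of the prefix computation; the extension and the <code class="language-plaintext highlighter-rouge">relativize</code> helper are our own illustration, not the plugin’s actual code:</p>

```kotlin
// "../" * count is not in the Kotlin stdlib; the transformer presumably
// defines a small operator extension like this one:
operator fun String.times(n: Int): String = repeat(n)

// Prefix a root-relative asset address with one "../" per ancestor page,
// so the image resolves correctly from a nested page's directory.
// The name relativize is hypothetical, for illustration only.
fun relativize(address: String, ancestorCount: Int): String =
    "../" * ancestorCount + address

fun main() {
    println(relativize("images/previews/section-pills.png", 2))
    // ../../images/previews/section-pills.png
}
```

<p>With one <code class="language-plaintext highlighter-rouge">../</code> per ancestor, the address resolves correctly regardless of how deep the page sits in the generated tree.</p>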
<h2 id="final-result">Final result</h2>
<p>As we already said, an image is worth a thousand words, so this is how our docs are starting to look:
<img src="https://engineering.skroutz.gr/images/growing-documentation/list-with-previews.png" alt="img" />
<em>this is the page for the tag <code class="language-plaintext highlighter-rouge">section</code></em></p>
<h2 id="links">Links:</h2>
<ol>
<li><a href="https://kotlin.github.io/dokka/1.9.10/developer_guide/introduction/">Developer’s guide for writing a Dokka plugin</a></li>
<li><a href="https://kotlinlang.org/docs/dokka-introduction.html">Documentation for using Dokka</a></li>
<li><a href="https://github.com/Kotlin/dokka/blob/master/dokka-subprojects/README.md">Code of default plugins that come with Dokka</a></li>
<li><a href="https://kotlinlang.org/docs/kotlin-doc.html">KDoc</a></li>
</ol>
<p><a href="https://engineering.skroutz.gr/blog/growing-the-documentation-of-our-android-project/">Growing the documentation of our android project using Dokka</a> was originally published by Leonidas Partsas at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on February 29, 2024.</p>https://engineering.skroutz.gr/blog/the-importance-of-having-a-healthy-chapter2022-12-06T22:00:00+00:002022-12-06T22:00:00+00:00Leonidas Partsashttps://engineering.skroutz.gr<p>At Skroutz, every product engineer belongs both to a product team and a chapter. A product team contains people from all crafts and
is responsible for delivering new features to our users. A chapter, on the other hand, contains only engineers of a certain craft and
is responsible for all technical aspects of a project.</p>
<p>The mobile team has two chapters. One for Android engineers and one for iOS. Both chapters have weekly meetings where we inform each other on what we are doing and
discuss ways to move our codebase and project forward.</p>
<p>As you can probably guess, for a big project like this, a meeting once a week is not enough to keep it scalable, maintainable and up to date.
This is a constant effort which requires organization and, most of all, good communication. This is where a healthy chapter shines. This is what our Android chapter is!</p>
<h2 id="the-example">The example</h2>
<p>In a recent PR I noticed that we keep using the following convention:</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="kd">class</span> <span class="nc">CampaignTracking</span><span class="p">(</span>
<span class="kd">val</span> <span class="py">tag</span><span class="p">:</span> <span class="nc">String</span><span class="p">,</span>
<span class="kd">val</span> <span class="py">trackableActions</span><span class="p">:</span> <span class="nc">List</span><span class="p"><</span><span class="nc">ActionType</span><span class="p">></span>
<span class="p">)</span> <span class="p">:</span> <span class="nc">RootObject</span> <span class="p">{</span>
<span class="k">fun</span> <span class="nf">shouldTrackClick</span><span class="p">():</span> <span class="nc">Boolean</span> <span class="p">=</span> <span class="n">trackableActions</span><span class="p">.</span><span class="nf">contains</span><span class="p">(</span><span class="nc">ActionType</span><span class="p">.</span><span class="nc">CLICK</span><span class="p">)</span>
<span class="k">fun</span> <span class="nf">shouldTrackImpression</span><span class="p">():</span> <span class="nc">Boolean</span> <span class="p">=</span> <span class="n">trackableActions</span><span class="p">.</span><span class="nf">contains</span><span class="p">(</span><span class="nc">ActionType</span><span class="p">.</span><span class="nc">IMPRESSION</span><span class="p">)</span>
<span class="p">}</span></code></pre></figure>
<p>where we add a helper method for each supported enum value.</p>
<p>This, in my opinion, violates the <a href="https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle">open-closed principle</a> since every change in <code class="language-plaintext highlighter-rouge">ActionType</code> will require a change in <code class="language-plaintext highlighter-rouge">CampaignTracking</code> too.
Because I feel comfortable with the team I am in, I didn’t just keep the thought to myself; I shared it in our Slack channel.
The main argument for having the convention was readability, so I even argued that, in Kotlin, something like this</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="k">if</span> <span class="p">(</span><span class="nc">TrackableActionType</span><span class="p">.</span><span class="nc">IMPRESSION</span> <span class="k">in</span> <span class="n">campaign</span><span class="p">.</span><span class="n">trackableActions</span><span class="p">)</span> <span class="p">{</span>
<span class="o">..</span><span class="p">.</span>
<span class="p">}</span></code></pre></figure>
<p>is readable too!</p>
<p>Soon after my comment a discussion started where a colleague suggested a simple and elegant solution:</p>
<figure class="highlight"><pre><code class="language-kotlin" data-lang="kotlin"><span class="kd">class</span> <span class="nc">CampaignTracking</span><span class="p">(</span>
<span class="kd">val</span> <span class="py">tag</span><span class="p">:</span> <span class="nc">String</span><span class="p">,</span>
<span class="kd">val</span> <span class="py">trackableActions</span><span class="p">:</span> <span class="nc">List</span><span class="p"><</span><span class="nc">ActionType</span><span class="p">></span>
<span class="p">)</span> <span class="p">:</span> <span class="nc">RootObject</span> <span class="p">{</span>
<span class="k">fun</span> <span class="nf">isActionTracked</span><span class="p">(</span><span class="n">type</span><span class="p">:</span> <span class="nc">ActionType</span><span class="p">):</span> <span class="nc">Boolean</span> <span class="p">=</span> <span class="n">trackableActions</span><span class="p">.</span><span class="nf">contains</span><span class="p">(</span><span class="n">type</span><span class="p">)</span>
<span class="p">}</span></code></pre></figure>
<p>which fixes both the initial problem and the one I introduced, by removing the per-enum methods altogether.</p>
<p>You see, by having a method like <code class="language-plaintext highlighter-rouge">isActionTracked</code> we hide implementation details like the fact that we use a list for trackable actions.
Exposing something like this makes the code hard to change/scale.</p>
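<p>To see why hiding the collection matters, consider a hypothetical follow-up change: swapping the backing <code class="language-plaintext highlighter-rouge">List</code> for a <code class="language-plaintext highlighter-rouge">Set</code> to deduplicate actions. With <code class="language-plaintext highlighter-rouge">isActionTracked</code> as the single entry point, no caller has to change. The class below is a simplified sketch for illustration, not our production code:</p>

```kotlin
enum class ActionType { CLICK, IMPRESSION }

// Simplified sketch: the backing collection changed from List to Set,
// but callers of isActionTracked are completely unaffected.
class CampaignTracking(
    val tag: String,
    trackableActions: List<ActionType>
) {
    private val trackable: Set<ActionType> = trackableActions.toSet()

    fun isActionTracked(type: ActionType): Boolean = type in trackable
}

fun main() {
    val campaign = CampaignTracking("sale", listOf(ActionType.CLICK, ActionType.CLICK))
    println(campaign.isActionTracked(ActionType.CLICK))      // true
    println(campaign.isActionTracked(ActionType.IMPRESSION)) // false
}
```

<p>The same would hold for any other representation change, as long as the method’s contract stays intact.</p>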
<p>This example might seem trivial and the solution simple but imagine having lots of these small changes every day. The project will self heal in no time!
And all that because we, as a chapter, are not afraid of suggesting things.</p>
<h2 id="a-healthy-chapter">A healthy chapter</h2>
<p>In a healthy chapter every member is trusted, is not afraid to ask questions, can express an opinion and, above all, listens to the other team members.
In such an environment ego comes last and knowledge/information flows through the team ending up in having all decisions shaped and accepted by everyone.</p>
<p>The fact that we are such a team has helped in applying certain practices that allow the project, and us, to grow both on a day-to-day basis and in the long term:</p>
<h5 id="day-to-day">Day to day</h5>
<p>Having a group of talented and capable engineers is not enough if they don’t communicate.</p>
<p>This is why we have adopted two rules in our chapter:</p>
<ol>
<li>
<p><strong>Don’t remain stuck for more than a couple of hours, ask!</strong>
The project is big and chances are that the problem you are facing has already been solved, so ask!
Someone will either point you to the proper file or will search / pair with you and help you solve it.
At the end of the day, the team can be an excellent rubber duck. Try forming the question and an answer might pop up on its own!</p>
</li>
<li>
<p><strong>If you feel that you want to challenge a decision, do it!</strong>
As we saw from the example above both the project and the team will benefit from it.</p>
</li>
</ol>
<h5 id="long-term">Long term</h5>
<p>Our project is old and the codebase big, so to keep it up to date we need a plan and small steps over a long period of time.
This is why we have a board where we add, discuss and monitor our long running tasks.</p>
<p>Tasks that aim to help the project move forward but cannot be resolved
by one person or in one “sprint”. Tasks like the migration from Java to Kotlin, the migration from callbacks to coroutines, moving from the deprecated <code class="language-plaintext highlighter-rouge">onActivityResult</code>
to something that suits our needs (spoiler: we ended up creating a tiny library for that) and many more.</p>
<p>The process has four steps:</p>
<ul>
<li>Every new idea and suggestion is added to an inbox. Nothing detailed. Just a short description like “Usage of Hilt” or “Introduction of Jetpack Compose”.</li>
<li>If someone wants to investigate the proposed task, she assigns it to herself and delivers a proof of concept to the chapter.</li>
<li>With the POC in hand, the chapter discusses whether it’s worth moving forward or not.</li>
<li>If the suggestion gets accepted, we refine the proposed change and extract a detailed action plan.</li>
</ul>
<p>A simple process for sure, but it can only work effectively with a healthy team:</p>
<ul>
<li>Not fearing criticism results in these ideas and suggestions being voiced</li>
<li>Feeling trusted results in people stepping up and taking the initiative to investigate and propose a solution</li>
<li>Expressing our opinions freely results in solid and structured plans</li>
</ul>
<p>The mobile apps have changed a lot during the last couple of years and have managed to close the gap with the mobile web, offering our users a great and complete experience.
All that, while still maintaining a proper codebase, wouldn’t have been possible if the chapter didn’t have such professional engineers who love their craft.</p>
<p>Stay tuned, more to come!</p>
<p>feature image: <a href="https://unsplash.com/photos/tFTYlAc9pyw">unsplash</a></p>
<p><a href="https://engineering.skroutz.gr/blog/the-importance-of-having-a-healthy-chapter/">The Importance of Having a Healthy Chapter</a> was originally published by Leonidas Partsas at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on December 06, 2022.</p>https://engineering.skroutz.gr/blog/handling-inertial-scroll-in-combination-with-scroll-snapping2022-05-15T22:00:00+00:002022-05-15T22:00:00+00:00Angelos Chalarishttps://engineering.skroutz.gr<p>At Skroutz, we aspire to provide the most intuitive and hassle-free user experience. As a result, we constantly iterate over interface elements, redesigning, polishing and tailoring them to users’ needs. One such iteration was the recent redesign of the image gallery on fashion pages.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/FivPWyhhlgY" frameborder="0"> </iframe>
<p>The aim of the redesign was to provide a more premium user experience on fashion categories. We opted to increase the main image size, as such categories are mainly image-driven, while also making image browsing easier and faster via scroll and thumbnail interactions. Additionally, we redesigned the image preview modal on desktop to better cater to user needs.</p>
<h2 id="implementation-details">Implementation details</h2>
<p>Before we go any further, it’s worth explaining how the component works from a technical standpoint. Without getting into too much detail, here’s a quick overview of the scrollable gallery area implementation:</p>
<ul>
<li>The outer container layer, <code class="language-plaintext highlighter-rouge">.slides-container</code>, has a 3:4 aspect ratio which locks it into a fixed size. This is done to ensure fashion images which are always cropped to this ratio are displayed correctly.</li>
<li>The inner container layer, <code class="language-plaintext highlighter-rouge">.slides</code>, fits the outer container and uses <code class="language-plaintext highlighter-rouge">overflow-y: auto</code> to be vertically scrollable. It also uses <code class="language-plaintext highlighter-rouge">scroll-snap-type: y mandatory</code> to create a snapping behavior on scroll.</li>
<li>Inside the inner container, there are multiple <code class="language-plaintext highlighter-rouge">.slide</code> elements. Each one is sized to fill the area and has a <code class="language-plaintext highlighter-rouge">scroll-snap-align: start</code> property to ensure that it snaps to the top of the container.</li>
</ul>
<p>There are also various other implementation details that come into play, such as JavaScript event handling, updating component state, highlighting the current slide thumbnail and so on.</p>
<h2 id="the-problem">The problem</h2>
<p>After implementing and deploying the new design, we received internal reports about the gallery not responding correctly to certain user interactions. Specifically, some users reported that touchpad scrolling would lock the page to the gallery after reaching the end of the gallery slides. This would effectively prevent users from scrolling down the rest of the page until they scrolled up again. Here’s what this looked like in action:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/LdVKtYSoQP0" frameborder="0"> </iframe>
<h2 id="an-in-depth-investigation">An in-depth investigation</h2>
<p>Bug reports aren’t always clear or easy to reproduce. In this case, we had some difficulty tracking down the issue. We finally managed to pinpoint it to touchpads and, more specifically, to their inertial scrolling behavior. Due to the nature of this behavior, OS and browser made a huge difference in reproducing it. This only made it harder to track down and understand the inner workings of the problem. From what we know now, macOS touchpad inertia was the main culprit.</p>
<p>After realizing the behavioral cause, we had to understand the technical one, too. After some investigation, it seemed like <code class="language-plaintext highlighter-rouge">scroll-snap-type: y mandatory</code> was to blame. There are various conflicting reports of bugs with this property on macOS related to inertia on different browsers and OS versions. The bottom line is that the <code class="language-plaintext highlighter-rouge">mandatory</code> part can cause certain problems under the right circumstances.</p>
<p>Oddly enough, using plain <code class="language-plaintext highlighter-rouge">scroll-snap-type: y</code> worked correctly and didn’t cause any bugs, but the behavior wasn’t the desired one. As expected, the scroll position would only snap at certain parts of the image instead of always. At this point, we thought we could use the <code class="language-plaintext highlighter-rouge">:hover</code> pseudo-selector to make snapping mandatory only when the mouse was inside the gallery container. While this CSS-only approach made sense on paper, it started to cause some very unexpected issues.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/KbTDUovBcIE" frameborder="0"> </iframe>
<p>Clearly, this approach didn’t work as well as we’d hoped. However, it pushed us closer to a solution. After all, using <code class="language-plaintext highlighter-rouge">:hover</code> was a straightforward way to detect whether the user intended to scroll the gallery or the entire page. Thus, we could disable the vertical scroll (<code class="language-plaintext highlighter-rouge">overflow-y: hidden</code>) when the gallery wasn’t hovered. This was far more stable, but would cause gallery slides to get stuck halfway through being scrolled if the cursor exited the gallery container.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/q1mG_qVk1Mw" frameborder="0"> </iframe>
<p>The next step towards a solution was to add some JavaScript. A simple 300ms interval that checks whether the container is hovered and snaps the slide into position should solve the problem, we thought. And it worked, for the most part. However, the user experience didn’t feel great.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/a55TkXvK1uY" frameborder="0"> </iframe>
<p>There was a little bit of a visual stutter involved, which we weren’t pleased with. After all, great care was put into making the gallery scroll experience feel smooth and premium. So, we had to deal with this stutter by using some sort of transition.</p>
<p>Unfortunately, <code class="language-plaintext highlighter-rouge">overflow</code> is a discrete, non-animatable CSS property and, much like <code class="language-plaintext highlighter-rouge">display</code>, cannot be transitioned. The CSS engine has no clue what such a transition would look like. Fortunately, CSS animations can be leveraged for this kind of thing. By creating an animation with a <code class="language-plaintext highlighter-rouge">from { overflow: auto; }</code> keyframe, we can make it so that the stutter is less pronounced.</p>
<p>By now, the average reader wouldn’t expect this to work without a hitch. And, like clockwork, it did not. While the animation worked, it required about 600ms to feel smooth. This would lock the page scroll for a little too long and the user would feel like the page was unresponsive.</p>
<p>Luckily, the animation timing highlighted a potential solution. By slowing down the start of the animation and speeding it up towards the end, we could simulate an inertial snap. After some tinkering, we ended up with a <code class="language-plaintext highlighter-rouge">cubic-bezier(.35, -.7, 1, 1)</code> timing function.</p>
<p><img src="https://engineering.skroutz.gr/images/2022-gallery-snap-bug/gallery-bezier-curve.png" alt="Inertial snap animation timing function" /></p>
<p>This timing function enabled us to shorten the animation duration back to 300ms, matching the snap interval. This was the last piece in this puzzle. While the inertial snap isn’t perfect, it’s far less noticeable and the page doesn’t lock anymore when the user reaches the last slide.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/KdLg_G9Stxo" frameborder="0"> </iframe>
<p>Putting it all together, we had to make the following changes to the gallery component:</p>
<ol>
<li>Set an interval that runs every 300ms from the gallery component. Whenever it’s run it checks if the <code class="language-plaintext highlighter-rouge">.slides</code> element is still hovered. If it isn’t, it snaps it to the correct gallery slide.</li>
<li>Use the <code class="language-plaintext highlighter-rouge">:hover</code> CSS pseudo-selector to change <code class="language-plaintext highlighter-rouge">overflow-y</code> behavior in the <code class="language-plaintext highlighter-rouge">.slides</code> element, effectively preventing scroll events from occurring in the gallery when the mouse is not over it. This prevents the scroll from getting locked when the user reaches the end of the gallery with inertial scroll.</li>
<li>Define a CSS animation for the <code class="language-plaintext highlighter-rouge">overflow</code> property that animates the transition from hovered to not hovered on the <code class="language-plaintext highlighter-rouge">.slides</code> element. An appropriate timing function effectively produces an inertia-like transition while the JavaScript-based slide snapping happens.</li>
</ol>
<p>Here’s a <a href="https://codepen.io/chalarangelo/pen/ExQWqdR?editors=0110">CodePen with the final gallery implementation</a>. Note that internal implementation details have been omitted, as they’re unrelated to this example.</p>
<h2 id="impact-on-user-experience">Impact on user experience</h2>
<p>After fixing the bug, we took a look at the numbers to see the potential impact on user experience. On the surface, this was a localized issue that would only affect certain users under very specific conditions. As it turns out, that wasn’t exactly the case. While the affected sessions are only a small fraction of the total (roughly <strong>2%</strong>), about <strong>500,000 monthly Skroutz users</strong> are on the appropriate OS and browser combination to experience this bug. This means that, even though the percentage is small, the absolute number of users that could end up on an almost unusable page was still pretty high. This goes to show that even small, localized bugs can spiral into a lot of user frustration if left unaddressed.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1737820">CSS Scroll Snap momentum flaky on macOS Monterey</a></li>
<li><a href="https://stackoverflow.com/a/41221543/1650200">Can I apply a CSS transition to the overflow property?</a></li>
</ul>
<p><a href="https://engineering.skroutz.gr/blog/handling-inertial-scroll-in-combination-with-scroll-snapping/">Handling inertial scroll in combination with scroll snapping</a> was originally published by Angelos Chalaris at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on May 15, 2022.</p>https://engineering.skroutz.gr/blog/core-web-vitals-at-skroutz-gr2022-02-27T21:00:00+00:002022-02-27T21:00:00+00:00Skroutz Engineering Teamhttps://engineering.skroutz.gr<p>At Skroutz, we believe that for a modern web experience, it’s important to <strong>get fast and stay fast</strong>.</p>
<p>For this, speed has always been a critical component for our Engineering and SEO Teams and we were monitoring speed KPIs early on.</p>
<p>Image 1: SpeedIndex graphs for Skroutz.gr back in 2015.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-1.png" alt="SpeedIndex graphs for Skroutz.gr back in 2015" /></p>
<p>As <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> shifted from a price comparison site to a fully operational Marketplace, we made some serious changes to our core product. At the same time, our Engineering teams grew rapidly and on top of this, we architecturally moved our front-end stack toward heavier Javascript rendering, from a static to a more reactive fashion.</p>
<p>Occasionally, rendering performance would get worse, and until recently we ran ad-hoc sprints to improve Skroutz.gr’s speed (<a href="https://engineering.skroutz.gr/blog/speed-the-journey-to-delivering-a-faster-experience-at-skroutz-gr/" target="_blank">read a post about one such sprint here</a>).</p>
<p>Although we achieved better performance after each sprint -and hopefully a better user experience for our visitors-, we knew that this was not an ideal, sustainable process.</p>
<p>To solve this, we <strong>established an additional continuous monitoring and alerting</strong> system of <strong>Core Web Vitals</strong> using field data (real users) with a new set of tools and methodologies that we apply, in order to have these new metrics under our daily radars.</p>
<p>This continuous monitoring helps to not only be proactive from an SEO perspective, but also allows engineering teams to be in touch with rendering and speed issues and to organically <strong>establish a “fast speed mentality”</strong>.</p>
<p>In this article, we describe what we did, some real life cases we’ve dealt with, and some takeaways from our experience during the symbiosis with the Core Web Vitals real-time monitoring.</p>
<hr />
<h1 id="core-web-vitals-continuous-real-time-monitoring">Core Web Vitals Continuous Real-Time Monitoring</h1>
<h2 id="lab-data-is-not-enough">Lab data is not enough</h2>
<p>While lab tools are invaluable, the data they provide isn’t always predictive of how a website performs for real users.</p>
<p>For example, <a href="https://developers.google.com/web/tools/lighthouse" target="_blank">Lighthouse</a> runs tests with simulated throttling in a simulated desktop or mobile environment. While such simulations of slower network and device conditions often help surface user experience problems better than native network and device conditions, they’re just a single slice of the large variety in network conditions and device capabilities across a website’s entire user base [<a href="https://web.dev/vitals-tools/" target="_blank">web.dev/vitals-tools</a>].</p>
<p>On the other hand, there is the <a href="https://developers.google.com/web/tools/chrome-user-experience-report" target="_blank">Chrome User Experience Report</a> (CrUX), a BigQuery dataset of field data gathered from a segment of real Google Chrome users, which presents Core Web Vitals with sufficient traffic, but only at the origin level. CrUX is still useful since one could compare it with field or lab data to see how they align.</p>
<p><a href="https://support.google.com/webmasters/answer/9205520?hl=en" target="_blank">Search Console’s Core Web Vitals</a> section assesses groups of similar pages (for example, our Product pages) and also includes a Core Web Vitals report based on field data from CrUX, offering useful insights into how performance improvements impact entire sections of the site and different page templates.</p>
<p>All these tools are extremely <strong>useful, but they alert us about issues long after they have occurred</strong>, arguably a bit too late, as organic performance has already been affected at scale.</p>
<h2 id="how-we-measure-core-web-vitals">How we measure Core Web Vitals</h2>
<p>Since the Core Web Vitals metrics represent the user’s experience when interacting with a web page and they were confirmed ranking factors in Google Search as of May 2021 (along with mobile-friendliness, HTTPS-security, and intrusive interstitial guidelines), the importance of incorporating Web Vitals into our site hygiene monitoring practice was greater than ever.</p>
<p>We decided to collect field data from Skroutz.gr’s thousands of daily visitors in real time, process it and apply some alerting heuristics. We used the <a href="https://github.com/GoogleChrome/web-vitals" target="_blank">web-vitals library</a>, a tiny (~1K), modular library for measuring all the Web Vitals metrics on real users, in a way that accurately matches how they’re measured by Chrome and reported to other Google tools (e.g. Chrome User Experience Report, Page Speed Insights, Search Console’s Speed Report).</p>
<p>In mid-July 2021, we launched live monitoring for Core Web Vitals. Using this library, we essentially render the web-vitals JavaScript bundles and invoke the measurement functions for the 3 Core Web Vitals on Skroutz.gr.</p>
<p>We send a portion of the traffic (1% of random anonymized sessions, that is more than 100k pageviews &amp; data points daily) to <a href="https://grafana.com/" target="_blank">Grafana</a>, an open-source visualisation and analytics software providing tools to turn time-series data into graphs and visualisations.</p>
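<p>As a rough sketch of the client-side part of this pipeline (the function names, the sampling helper and the <code class="language-plaintext highlighter-rouge">/vitals</code> endpoint are illustrative assumptions, not our actual code), this could look like:</p>

```javascript
// Decide once per session whether it belongs to the ~1% sample.
// The second argument exists only to make the function deterministic in tests.
function inSample(rate, roll = Math.random()) {
  return roll < rate;
}

// Shape a web-vitals metric object into a flat data point that a
// time-series backend behind Grafana could ingest.
function toDataPoint(metric, pageType, deviceType) {
  return {
    name: metric.name,   // "LCP", "CLS" or "FID"
    value: metric.value,
    page: pageType,      // e.g. "plp" or "pdp"
    device: deviceType,  // "mobile" or "desktop"
    ts: Date.now(),
  };
}

// Browser-only wiring with the web-vitals library (v2-era API) might be:
//
//   import { getCLS, getFID, getLCP } from 'web-vitals';
//
//   if (inSample(0.01)) {
//     const report = (metric) => navigator.sendBeacon(
//       '/vitals', JSON.stringify(toDataPoint(metric, pageType, deviceType)));
//     getCLS(report); getFID(report); getLCP(report);
//   }
```

<code class="language-plaintext highlighter-rouge">navigator.sendBeacon</code> is a good fit here because metrics are often finalized as the page unloads.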
<p>We have created dedicated dashboards for our most important site sections and we furthermore distinguish them into mobile and desktop traffic. More specifically, we are monitoring and visualising the scores of the 3 Core Web Vital Metrics (LCP, CLS, FID) per page type (Product Listing Pages (PLPs) and Product Detail Pages (PDPs)) and device type (mobile, desktop).</p>
<p>Image 2: Core Web Vitals (LCP) Real-Time Continuous Monitoring dashboard for Skroutz.gr.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-2.png" alt="Core Web Vitals (LCP) Real-Time Continuous Monitoring dashboard for Skroutz.gr" /></p>
<h2 id="how-we-get-alerted-for-core-web-vitals-issues">How we get alerted for Core Web Vitals issues</h2>
<p>When a Core Web Vital metric falls outside the “Good Performance” range, an alert is fired in a dedicated channel on Slack, our main communication tool. This way we are informed instantly when one of the Web Vital metrics drops to the “Medium Performance - Needs Improvement” state, and we also learn the exact section of the site that was affected.</p>
<p>Image 3: Web Vitals alert notifications in Growth Team’s slack channel.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-3.png" alt="Web Vitals alert notifications in Growth Team’s slack channel" /></p>
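<p>The alert condition described above boils down to classifying each field value against the published Core Web Vitals thresholds. A minimal sketch (the thresholds are the official ones; the function names are our own illustration, not our production code):</p>

```javascript
// Official Core Web Vitals thresholds: at or below "good" is green,
// above "poor" is red, anything in between needs improvement.
const THRESHOLDS = {
  LCP: { good: 2500, poor: 4000 }, // milliseconds
  FID: { good: 100,  poor: 300 },  // milliseconds
  CLS: { good: 0.1,  poor: 0.25 }, // unitless layout-shift score
};

function classify(name, value) {
  const { good, poor } = THRESHOLDS[name];
  if (value <= good) return 'good';
  if (value <= poor) return 'needs-improvement';
  return 'poor';
}

// Fire an alert as soon as a metric leaves the "good" range.
function shouldAlert(name, value) {
  return classify(name, value) !== 'good';
}
```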
<p>Thus, we get alerted as soon as an issue appears, oftentimes even before Google is able to spot the affected area, and we can take immediate action to remedy the situation.</p>
<p>Image 4: CLS of Product Pages on Desktop exceeded the 0.10 threshold and an alert was fired.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-4.png" alt="CLS of Product Pages on Desktop exceeded the 0.10 threshold and an alert was fired" /></p>
<p>For each Web Vital metric we monitor two time series, one for the current time and one from a week earlier, making it easier to compare them and decide whether performance has significantly declined or not.</p>
<p>Image 5: Core Web Vitals (LCP) Real-Time Continuous Monitoring dashboard at Skroutz.gr.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-5.png" alt="Core Web Vitals (LCP) Real-Time Continuous Monitoring dashboard at Skroutz.gr" /></p>
<p>There is also a toggle option to see all the deployments. The exact time of each deployment, as well as other details linking to the GitHub page, are easily accessible. This can prove very useful when an alert pops up, as it can direct the team straight to the source of the issue.</p>
<p>Image 6: Deployments annotation in the Core Web Vitals dashboard.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-6.png" alt="Deployments annotation in the Core Web Vitals dashboard" /></p>
<p>With the help of all these advanced monitoring systems and procedures, we keep <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> fast and steady, we find and fix any rendering issues promptly, and we optimise user experience, which in turn leads to increased user engagement, more conversions, and -hopefully- higher user satisfaction.</p>
<p>Incorporating Core Web Vitals monitoring has led Skroutz.gr to an impressive <strong>98.5% of ~26 million pages seen as providing a “good page experience”</strong>!</p>
<p>Image 7: Page Experience Score of Skroutz.gr at Search Console.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-7.png" alt="Page Experience Score of Skroutz.gr at Search Console" /></p>
<p>Image 8: Core Web Vitals of Skroutz.gr at Search Console.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-8.png" alt="Core Web Vitals of Skroutz.gr at Search Console" /></p>
<hr />
<h1 id="examples-of-how-core-web-vitals-helped-us">Examples Of How Core Web Vitals Helped Us</h1>
<p>Let us show you 3 examples of how Core Web Vitals real-time monitoring has helped us resolve issues that we might not have detected otherwise.</p>
<h2 id="1-server-side-rendering-gone-wrong">1. Server-side rendering gone wrong</h2>
<p>The first example is from September 2021, when we saw an abnormal increase, almost 2x, in the pages’ rendering stability score, CLS (Cumulative Layout Shift), specifically on Product pages (PLPs) on both mobile and desktop.</p>
<p>This was very strange, because mobile and desktop views are styled by different code (the CSS applied to the DOM), so it is not really possible for independent styling changes to cause such (relatively) huge layout shifts on both simultaneously.</p>
<p>Up until then, we had only seen cases where a major page change caused significant layout shift in either the desktop or the mobile view (usually the desktop one, where the larger viewport leaves more room for the layout to shift).</p>
<p>Image 9: CLS for Product Pages almost doubled in September 2021.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-9.png" alt="CLS for Product Pages almost doubled in September 2021" /></p>
<p>We deep-dived, but we couldn’t find any reason for layout shifts caused by CSS changes - everything seemed okay.</p>
<p>However, a more careful examination showed that we had introduced a <strong>critical bug in the rendering process</strong>: we normally send a fully rendered page to the client from the server (server-side rendering) at the initial load; then the client’s JavaScript bundle takes over and manipulates the DOM depending on the user’s interactions. This approach was chosen as the more SEO-friendly one. What we saw, in this case, was that during a major refurb of the Product page, we had accidentally disabled server-side rendering and the page was being rendered in the browser.</p>
<p>Since our pages are often heavy and rich in content, browsers struggled to composite and paint, resulting in more layout shifts compared with server-side rendering.</p>
<p>Had we not caught this error early, our SEO and organic performance would probably have been severely impacted. Product prices, reviews, info, etc. change very frequently and, especially in the ecommerce industry, content freshness is very important.</p>
<h2 id="2-new-fashion-categories-layout-shifts">2. New fashion categories layout shifts</h2>
<p>The second incident began in December 2021, when a number of alerts started popping up regarding the CLS score of our Product Listing Pages (namely Categories) in desktop views. These alerts informed us of an increase of the CLS score up to 0.37, while a score of more than 0.25 is considered poor performance.</p>
<p>Image 10: CLS on Product Listing Pages exceeded alert thresholds.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-10.png" alt="CLS on Product Listing Pages exceeded alert thresholds" /></p>
<p>After examining the deployments that happened in that exact period, one stood out. All image-driven PLPs (mainly Fashion, see an example <a href="https://www.skroutz.gr/c/1009/andrika-mpoufan.html" target="_blank">here</a>) had been switched to a <strong>new layout</strong>, going from the usual 4-tile layout to a wider 3-tile one. The new layout didn’t render in a solid and stable way, so users saw content pushed further and further down while the page was loading.</p>
<p>Image 11: New Fashion layout at Skroutz.gr.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-11.png" alt="New Fashion layout at Skroutz.gr" /></p>
<p>Images in this layout have a fixed ratio, which is very helpful since we only set their width to fill their container and their height is auto-calculated. We already knew that we had one unknown variable, the image height. However, the width of the images was also unknown, since it depends on the viewport, the grid, the grid gaps and the resulting columns. This meant we had practically no control over either the width or the height of our images.</p>
<p>Setting a height or width on our images was therefore impossible, since we could not calculate either correctly. Using aspect-ratio was not a safe resort back then either, since it was a fairly new property.</p>
<p>So, we used an old CSS trick, originally meant for creating responsive squares, whose logic applies to rectangles as well: <strong>the % vertical padding of an element is always relative to its width</strong> and not its height, as one might expect. To avoid CLS issues while using fixed-ratio images, we reserve an empty area of fixed ratio, based on the available width, which the image then fills when it loads, without shifting the content of the whole page. Finally, we had to absolutely position the images and the gallery so that they land in the correct place.</p>
<p>We had a stabler layout.</p>
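<p>In CSS terms, the trick described above looks roughly like the following (class names and the 3:4 ratio are illustrative; our actual styles differ):</p>

```css
/* Reserve space before the image loads: percentage padding is computed
   from the element's WIDTH, so padding-top: 133.33% always yields a 3:4
   (width:height) box, whatever the grid column width turns out to be. */
.tile-media {
  position: relative;
  width: 100%;
  padding-top: 133.33%; /* height = width * 4 / 3 */
}

/* The image is absolutely positioned to fill the reserved area,
   so loading it never pushes the rest of the page down. */
.tile-media img {
  position: absolute;
  top: 0;
  left: 0;
  width: 100%;
  height: 100%;
  object-fit: cover;
}
```

On modern browsers the same effect can now be achieved with the aspect-ratio property, which, as noted above, was not yet a safe choice at the time.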
<h2 id="3-css-grid-module-issues">3. CSS Grid module issues</h2>
<p>The third example is again about CLS issues, yet again for Product listing pages in desktop view.</p>
<p>Product Listing Pages had a marginally good performance score (&lt;0.1) for a long time; however, this was okay for us.</p>
<p>Unfortunately, on January 10, a huge layout shift triggered alerts in our Slack channel. Something really bad had happened. The increase was observed only in desktop views, while at the same time the mobile view showed a small decrease.</p>
<p>Image 12: CLS for Listing Page Desktop almost tripled in January 2022.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-12.png" alt="CLS for Listing Page Desktop almost tripled in January 2022" /></p>
<p>When something like this happens, we usually search the latest deployments, where the bug is most likely to be found. In this case, however, we didn’t find anything that had changed on the Listing pages, front-end wise. Moreover, the increase started after working hours, in a strange and unusual way.</p>
<p>Image 13: CLS for Listing Pages Desktop didn’t seem to correlate with a deployment.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-13.png" alt="CLS for Listing Pages Desktop didn’t seem to correlate with a deployment" /></p>
<p>When we investigated carefully, we saw that this was a multi-factor event. First, Listing pages had not been optimal in terms of stability for a long time. Second, a Chrome update (97.0.4692) had rolled out at that time, and the new Chrome seemed to evaluate this suboptimal layout in a more rigorous manner.</p>
<p>Normally, the Product Listing page has a left sidebar with the filters and a right -main- section with all the products.</p>
<p>Image 14: Normal Listing Page rendering on Desktop.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-14.png" alt="Normal Listing Page rendering on Desktop" /></p>
<p>After we ran some tests we figured out that the layout shifts were caused by the main section of the page. What was happening?</p>
<p>Image 15: The main section contributed mostly to the problem.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-15.png" alt="The main section contributed mostly to the problem" /></p>
<p>Playing with network throttling and CPU slowdown, we caught the bug: on desktop, the order of the elements (main, sidebar) in the page source is reversed at the markup level, so we were using CSS Grid to reorder them. Until then, we had specified only the grid position of the sidebar (which comes after the main content in the DOM), while the position of the main section was left unspecified. Since in some specific cases the sidebar was delayed, the main content would take its place in the grid template.</p>
<p>Naturally, this caused a minor yet noticeable issue for the user, and subsequently hurt the Page Experience and the CLS score.</p>
<p>Image 16: A middle state of Listing Page rendering on Desktop: content is pushed to the left due to lack of content in the sidebar.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-16.png" alt="A middle state of Listing Page rendering on Desktop: content is pushed to the left due to lack of content in the sidebar" /></p>
<p>The fix to this issue proved to be a very quick tweak to our CSS. The main change was <strong>explicitly</strong> specifying the grid column in which the main section should sit.</p>
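<p>A sketch of the fix (selectors and column sizes are illustrative, not our actual stylesheet): pinning both children to explicit grid columns means a late-loading sidebar can no longer cause the main section to be auto-placed into the wrong column:</p>

```css
.listing {
  display: grid;
  grid-template-columns: 280px 1fr; /* sidebar | main */
}

/* Before: only the sidebar was placed explicitly, so while the sidebar
   content was delayed the main section auto-flowed into column 1
   and shifted right once the sidebar arrived. */
.listing .sidebar { grid-column: 1; }

/* The fix: also pin the main section to its column explicitly. */
.listing .main { grid-column: 2; }
```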
<p>After the fix, product listing pages improved and they are now much more stable than before.</p>
<p>Image 17: It is pretty amazing how 2 lines of CSS can make or break a page. Pay attention to your CSS grid module and make sure you specify all elements’ position to avoid any unexpected layout shifts.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-17.png" alt="It is pretty amazing how 2 lines of CSS can make or break a page. Pay attention to your CSS grid module and make sure you specify all elements’ position to avoid any unexpected layout shifts" /></p>
<p>We have also spotted changes in the other Core Web Vital metrics, Largest Contentful Paint (LCP) and First Input Delay (FID); however, the metric most sensitive to changes has so far proved to be Cumulative Layout Shift (CLS).</p>
<hr />
<h1 id="conclusion">Conclusion</h1>
<p>Having the ability to measure and report on real-world rendering performance is critical for diagnosing issues promptly and improving performance over time. Without field data, it’s impossible to know whether certain changes are actually producing the desired results.</p>
<p>Core Web Vitals helped Skroutz.gr provide a faster, stabler, and more responsive experience. Web Vitals real-time monitoring proved to be essential to delivering a great user experience, in terms of loading time, interactivity, and visual stability.</p>
<p>Image 18: Core Web Vitals Phone State for Skroutz.gr - January 2022.
<img src="https://engineering.skroutz.gr/images/core-web-vitals-at-skroutz/corewebvitals-skroutz-18.png" alt="Core Web Vitals Phone State for Skroutz.gr - January 2022" /></p>
<p>Core Web Vitals represent the best available signals we have today to measure the quality of experience across the web. However, these signals and the available free tools are far from perfect and we expect future improvements or additions. This fact creates a crucial need for an engineering team that caters for all aspects of performance, while a good relationship between SEO and engineering is invaluable for a successful site.</p>
<p>Speed, stability and responsiveness are foundational parts of a good user experience. Since we are committed to offering better user experiences, striving for great site performance is a never-ending journey.</p>
<p><strong>SEO Team</strong>.</p>
<p>💡 Feel free to connect and follow our fresh <a href="https://twitter.com/SkroutzSEO" target="_blank">Skroutz SEO Team Twitter account</a> for more SEO insights and news, or follow <a href="https://twitter.com/skroutzdevs" target="_blank">Skroutz Engineering at Twitter</a>.</p>
<hr />
<p>Hero image source: <a href="https://unsplash.com/photos/w7ZyuGYNpRQ" target="_blank">Unsplash</a>.</p>
<style type="text/css">
.entry-content p > img {
padding-top: 5px;
}
</style>
<p><a href="https://engineering.skroutz.gr/blog/core-web-vitals-at-skroutz-gr/">Core Web Vitals Real-time Monitoring at Skroutz.gr</a> was originally published by Skroutz Engineering Team at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on February 27, 2022.</p>https://engineering.skroutz.gr/blog/contributions_to_hotwire_upstream2021-11-01T22:00:00+00:002021-11-01T22:00:00+00:00John Kapantzakishttps://engineering.skroutz.gr<p>As we mentioned in a <a href="https://engineering.skroutz.gr/blog/using_hotwire_to_lazy_load_data/">previous post</a>, we have started to investigate Hotwire and its techniques, which claim to bring the speed of a single-page web application without writing any JavaScript. It seems that Hotwire, and especially Turbo, keeps its promise by providing useful tools that make your application more dynamic with almost no custom JavaScript.</p>
<p>From our experience with Turbo so far, we have found Turbo-Frames to be very handy and easy to use out of the box. But, as Hotwire is a relatively new tool, we often come across situations where something seems to be missing or doesn’t work as it is supposed to. Skroutz’s engineers always look for opportunities to contribute to open source projects, and this seemed like a perfect one, so we proceeded to open some pull requests against the Hotwire repos.</p>
<p>Now, let’s take a look at the pull requests that have already been merged and see what problem each one of them tries to solve.</p>
<h4 id="including-url-in-turbobefore-fetch-request-event">Including url in <code class="language-plaintext highlighter-rouge">turbo:before-fetch-request</code> event</h4>
<p><a href="https://github.com/hotwired/turbo/pull/289">Pull request #289</a> by <a href="https://github.com/ctrochalakis">Christos Trochalakis</a></p>
<p>Turbo fires the <code class="language-plaintext highlighter-rouge">turbo:before-fetch-request</code> event before it issues a network request. Let’s say that we have multiple Turbo-Frame elements in the page and each one of them uses a different endpoint to update its contents. Let’s also say that we have the following event listener attached to the document:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="nb">document</span><span class="p">.</span><span class="nf">addEventListener</span><span class="p">(</span><span class="dl">'</span><span class="s1">turbo:before-fetch-request</span><span class="dl">'</span><span class="p">,</span> <span class="nx">handleBeforeFetchRequest</span><span class="p">);</span></code></pre></figure>
<p>Before <a href="https://github.com/hotwired/turbo/pull/289">#289</a> got merged, we didn’t have a way to distinguish between those events; we just knew that some Turbo element had issued a network request. By making the <code class="language-plaintext highlighter-rouge">url</code> to which the network request is issued available on the respective event, we can add custom logic that handles the different urls.</p>
<p>For example, we can do this:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">const</span> <span class="nx">handleBeforeFetchRequest</span> <span class="o">=</span> <span class="p">({</span> <span class="na">detail</span><span class="p">:</span> <span class="p">{</span> <span class="nx">url</span> <span class="p">}</span> <span class="p">})</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">switch </span><span class="p">(</span><span class="nx">url</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// handle different urls</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<h4 id="adding-the-target-element-to-turbobefore-fetch-requestresponseevents">Adding the target element to <code class="language-plaintext highlighter-rouge">turbo:before-fetch-(request|response)</code> events</h4>
<p><a href="https://github.com/hotwired/turbo/pull/367">Pull request #367</a> by <a href="https://github.com/kapantzak">John Kapantzakis</a></p>
<p><a href="https://github.com/hotwired/turbo-site/pull/68">Docs update regarding #367</a></p>
<p>Similarly to <code class="language-plaintext highlighter-rouge">turbo:before-fetch-request</code>, <code class="language-plaintext highlighter-rouge">turbo:before-fetch-response</code> fires after the network request completes. Both events used to be fired on the document, so from an event listener attached to the document we couldn’t identify the element that caused the network request/response.</p>
<p>This PR adds the target element to the <code class="language-plaintext highlighter-rouge">turbo:before-fetch-request</code> and <code class="language-plaintext highlighter-rouge">turbo:before-fetch-response</code> events, so that we can listen for those events coming from specific elements, like this:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="nx">myTurboFrame</span><span class="p">.</span><span class="nf">addEventListener</span><span class="p">(</span><span class="dl">'</span><span class="s1">turbo:before-fetch-request</span><span class="dl">'</span><span class="p">,</span> <span class="nx">handleFetchRequest</span><span class="p">);</span></code></pre></figure>
<h4 id="introducing-turboframe-render-and-turboframe-load-events">Introducing <code class="language-plaintext highlighter-rouge">turbo:frame-render</code> and <code class="language-plaintext highlighter-rouge">turbo:frame-load</code> events</h4>
<p><a href="https://github.com/hotwired/turbo/pull/327">Pull request #327</a> by <a href="https://github.com/kapantzak">John Kapantzakis</a></p>
<p><em><code class="language-plaintext highlighter-rouge">turbo:frame-load</code> cherry-picked from</em> <a href="https://github.com/hotwired/turbo/pull/59">#59</a></p>
<p><a href="https://github.com/hotwired/turbo-site/pull/64">Docs update regarding #327</a></p>
<p>Lifecycle events were missing from Turbo-Frames until <code class="language-plaintext highlighter-rouge">turbo:frame-render</code> and <code class="language-plaintext highlighter-rouge">turbo:frame-load</code> were introduced, and gave us the opportunity to hook various handlers on those events.</p>
<p>These get fired as soon as the Turbo-Frame element has rendered its contents and when it has finished loading, respectively. Furthermore, these events get fired on the respective Turbo-Frame element, rather than on the document, making it easier to target specific elements.</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="nx">myTurboFrame</span><span class="p">.</span><span class="nf">addEventListener</span><span class="p">(</span><span class="dl">'</span><span class="s1">turbo:frame-render</span><span class="dl">'</span><span class="p">,</span> <span class="nx">handleMyTurboFrameRender</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">handleMyTurboFrameRender</span> <span class="o">=</span> <span class="p">({</span> <span class="nx">target</span> <span class="p">})</span> <span class="o">=></span> <span class="p">{</span>
<span class="nx">target</span><span class="p">.</span><span class="nf">querySelectorAll</span><span class="p">(</span><span class="dl">'</span><span class="s1">.elements-inside-frame</span><span class="dl">'</span><span class="p">).</span><span class="nf">forEach</span><span class="p">((</span><span class="nx">elem</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span> <span class="p">...</span> <span class="p">})</span>
<span class="p">}</span></code></pre></figure>
<h4 id="introducing-test-runner-options">Introducing test runner options</h4>
<p><a href="https://github.com/hotwired/turbo/pull/311">Pull request #311</a> by <a href="https://github.com/ctrochalakis">Christos Trochalakis</a></p>
<p>This PR doesn’t directly affect the tools that Turbo provides, but it makes the development of Turbo’s features a lot easier by adding some options to the testing process. Specifically, it adds the <code class="language-plaintext highlighter-rouge">--grep</code> and <code class="language-plaintext highlighter-rouge">--environment</code> options.</p>
<p>You can use the <code class="language-plaintext highlighter-rouge">--grep</code> option when you want to target a specific test case.</p>
<figure class="highlight"><pre><code class="language-terminal" data-lang="terminal"><span class="gp">$</span><span class="w"> </span>yarn <span class="nb">test</span> <span class="nt">--grep</span> <span class="s1">'triggers before-render and render events'</span></code></pre></figure>
<p>You can use the <code class="language-plaintext highlighter-rouge">--environment</code> option when you want to set the environment on which you want to perform the tests.</p>
<figure class="highlight"><pre><code class="language-terminal" data-lang="terminal"><span class="gp">$</span><span class="w"> </span>yarn <span class="nb">test</span> <span class="nt">--environment</span> <span class="s1">'Firefox'</span></code></pre></figure>
<h4 id="avoiding-race-condition-between-visit-tests">Avoiding race condition between visit tests</h4>
<p><a href="https://github.com/hotwired/turbo/pull/310">Pull request #310</a> by <a href="https://github.com/ctrochalakis">Christos Trochalakis</a></p>
<p>This is another PR that improves the development experience of Turbo features, by fixing a race condition that occurred when the page location was changed asynchronously and an event log array was getting out of sync. You can inspect the PR for more details on the relevant changes.</p>
<h1 id="summary">Summary</h1>
<p>Summing it up, here’s a list of the commits sent upstream so far:</p>
<ul>
<li><a href="https://github.com/hotwired/turbo/commit/4d42a38658d892e5617144362a4a96863c6c860e">Introduce test runner options</a></li>
<li><a href="https://github.com/hotwired/turbo/commit/3b70866f1a8f92c313a90aba305fb208428d175d">Include url in turbo:before-fetch-request event</a></li>
<li><a href="https://github.com/hotwired/turbo/commit/9dfca8ffa0e8f7ef613c02db03e5a4a93630c484">Avoid race between visit tests</a></li>
<li><a href="https://github.com/hotwired/turbo/commit/c9c1c11610f12442a6342200e396c12a30ed957d">Add the target element to turbo:before-fetch-request and turbo:before-fetch-response events</a></li>
<li><a href="https://github.com/hotwired/turbo/commit/84b0a89902d48ac455b08e70a975abad3e1b14b9">Fire turbo:frame-render event after turbo frame renders the view</a></li>
</ul>
<p><a href="https://engineering.skroutz.gr/blog/contributions_to_hotwire_upstream/">Skroutz contributes to Hotwire's upstream</a> was originally published by John Kapantzakis at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on November 01, 2021.</p>https://engineering.skroutz.gr/blog/monolith-diaries-upgrading-rails2021-10-22T13:30:00+00:002021-10-22T13:30:00+00:00Lazarus Lazaridishttps://engineering.skroutz.gr<p>We recently upgraded our monolith application from Rails 6.0 to Rails 6.1.
Drawing on our prior experience with Rails upgrades, we have streamlined the process and want to share it with you.</p>
<p>In this post we are going to give some insights on our workflow, from organizing such a milestone to actually delivering it without blocking an engineering team of more than 160 developers, building an application that peaks at more than 100k requests per minute.</p>
<h2 id="introduction">Introduction</h2>
<p>The core application of Skroutz is a large Rails monolith heavily utilizing MariaDB, MongoDB, Elasticsearch, Kafka, Redis and Memcached.
We also use Jenkins for our CI and various tools like Sentry, NewRelic and Grafana for monitoring.</p>
<p>Even though we were upgrading to a minor version, Rails 6.1 introduced a <a href="https://guides.rubyonrails.org/6_1_release_notes.html">notable amount of changes</a> affecting many parts of our codebase and the aforementioned components.</p>
<p>We will describe the process we followed including some key points that allowed us to have a smooth release (such as our <a href="#deprecations">deprecation handling mechanism</a>, how we approached <a href="#working-for-the-upgrade">backportable and non-backportable changes</a>, <a href="#canary-release">canary deployment</a> and more).</p>
<h2 id="organizing-the-upgrade">Organizing the upgrade</h2>
<p>Spending time and resources to properly organize such a milestone is crucial for a successful delivery so we started with brainstorming and discussions on the following three questions: <strong>who</strong>, <strong>how</strong> and <strong>when</strong>.</p>
<h3 id="who">Who</h3>
<p>We have a core team named Kernel that, among other things, is responsible for keeping the application healthy, modern and productive.</p>
<p>Although the whole upgrade was driven by this team, all of Skroutz’s teams were involved in much of the work to be done. Why?</p>
<ul>
<li>
<p><strong>Share the knowledge</strong></p>
<p>With every upgrade, new things become available, some things start working in a different way than before and some others are no longer there.</p>
<p>Having engineers work directly on these changes familiarizes them with the new version much more effectively than just reading its changelog. Additionally, the knowledge they gain is communicated much more easily and directly to the other members of their team.</p>
</li>
<li>
<p><strong>Cross team work is beneficial in many ways</strong></p>
<p>This is a very good opportunity for engineers to</p>
<ul>
<li>familiarize themselves with sections of the codebase that don’t belong to their domain</li>
<li>meet and work with engineers outside of their team</li>
<li>exchange knowledge, share tips, hacks and cat photos :P</li>
</ul>
</li>
<li>
<p><strong>Speed up the process</strong></p>
<p>It’s much easier and more productive to investigate problems and make changes in specific code sections when the work is done by the team that owns them.</p>
</li>
</ul>
<hr />
<p>At Skroutz we have organized the engineering team under product groups with each group consisting of a handful of teams.</p>
<p>For the upgrade process, each product group assigned the role of <strong>Contact Person</strong> to one of its members, with the following responsibilities:</p>
<ul>
<li>
<p><strong>Single point of reference</strong></p>
<p>Address any requests for help or information coming from the Core team.</p>
</li>
<li>
<p><strong>Delegation</strong></p>
<p>Work directly to address a group’s issue regarding the upgrade or pass it on to the proper member of the group.</p>
</li>
<li>
<p><strong>Sync</strong></p>
<p>Stay up to date with the status of the upgrade, communicate developments affecting the group’s pipeline, raise the flag and request help in case of delays or blocking items.</p>
</li>
</ul>
<h3 id="how">How</h3>
<p>For a milestone of this size, effective communication and task breakdown is critical.</p>
<h4 id="tracking">Tracking</h4>
<p>Since upgrading Rails is a recurring task, we use a dedicated project in our tracking system and create a milestone for each individual upgrade.</p>
<p>The workboard contains columns categorizing the tasks based on their nature, so we can easily have a good overview of the state of the upgrade process, what’s left to be done, what’s blocked etc.</p>
<figure>
<a href="../../../images/2021-upgrading-rails/phabricator.png" class="image-popup">
<img src="../../../images/2021-upgrading-rails/phabricator.png" alt="Workboard" />
</a>
</figure>
<p>The nature of the tasks varies for each application but the following categories should be pretty common for everyone.</p>
<ul>
<li>
<p><strong>Preparations</strong></p>
<p>Tasks for preparing the upgrade process before the actual work starts - find more in the <a href="#preparation">Preparation</a> section below</p>
</li>
<li>
<p><strong>Investigations</strong></p>
<p>Tasks for items that need investigation - for example, checking whether a specific gem has a version compatible with the target Rails version, or whether the CI needs modifications to play well with the new version</p>
</li>
<li>
<p><strong>Deprecations</strong></p>
<p>Tasks for complying with suggestions deriving from Rails active support deprecations for the target version - find more in the <a href="#deprecations">Deprecations</a> section below</p>
</li>
<li>
<p><strong>Gem updates</strong></p>
<p>Tasks for updating internal or external gems to their new Rails compatible version</p>
</li>
<li>
<p><strong>Changes & Fixes</strong></p>
<p>This category contains all the tasks that actually make the codebase compatible with the new Rails version. Most commonly, these tasks involve fixing bugs due to changes that were not resolved by the deprecations or modifying code to use a newly introduced Rails feature.</p>
</li>
<li>
<p><strong>Pre-release tasks</strong></p>
<p>Tasks for actions that need to be done after everything seems to be in place and before the actual release (such as running smoke tests, creating the deploy plan etc.)</p>
</li>
<li>
<p><strong>Post-release</strong></p>
<p>Tasks for actions that need to be taken after the new version is released - these could be cleanups, performance monitoring etc.</p>
</li>
</ul>
<h4 id="communication">Communication</h4>
<p>We created a Slack channel joined by the Core Team, the Group Contact Persons and any other engineer interested in the upgrade and we set up our tracking system to publish notifications of the Rails upgrade milestone to it.</p>
<p>Having a dedicated place for communication had many benefits:</p>
<ul>
<li>Anything related to the upgrade was shared in the channel - the information was not scattered across emails, private conversations or other communication channels. We didn’t have to remember what was discussed and where; everything was available and discoverable in a single place, and we could revisit the channel at any time in the future and find what we were looking for.</li>
<li>Every member was constantly in sync with the upgrade developments - any accomplishments, resolutions, blocking factors or discussions were communicated to the channel - even if someone got involved at a later phase of the milestone, the information was there.</li>
<li>Something that possibly affected a specific group’s code area was visible to any member of the channel - everyone could contribute and familiarize themselves with almost all introduced changes of the upgrade.</li>
</ul>
<hr />
<p>Given the above, the “How” could be summed up to:</p>
<ul>
<li>The upgrade has to be <strong>well broken down</strong> in tasks on the milestone <strong>workboard</strong> in the tracking system - when everything is resolved, we’re ready for the release.</li>
<li>Whatever we need - <strong>help, raise a flag, share a finding - use the Slack channel</strong> and let the discussion begin.</li>
</ul>
<h3 id="when">When</h3>
<p>Even though planning a Rails upgrade is hard and can easily go off track, there are one or two things that can help us accomplish it in a safer manner.</p>
<h4 id="cross-team-work">Cross team work</h4>
<p>The upgrade should be a cross-team effort.</p>
<p>Our Core team already had this task in its pipeline, but involving other teams, each with its own planning, at the last minute would not work.
To avoid this, we had to evaluate the required effort and how it was distributed across the other teams’ components <strong>early in the process</strong> - three months before the date we wanted the release to take place.</p>
<h4 id="take-baby-steps---dont-jump-at-once">Take baby steps - don’t jump at once</h4>
<p>Having a well-tested application with a green CI build doesn’t mean that everything will be fine once we go live. There are many things that could go wrong - from degraded performance to bugs showing up only in production - and the sooner we learn about them, the better.</p>
<p>Upgrading an application to a newer Rails version usually means:</p>
<ul>
<li>updating the gems to a compatible version</li>
<li>modifying the codebase to conform to the new conventions</li>
<li>replacing previously deprecated mechanisms with the suggested ones (for example, the <code class="language-plaintext highlighter-rouge">dalli_store</code>, which doesn’t <a href="https://github.com/petergoldstein/dalli/issues/771">play well</a> with Rails 6.1, can be replaced by the <code class="language-plaintext highlighter-rouge">mem_cache_store</code> implementation, as suggested by both Rails and the Dalli gem)</li>
</ul>
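<p>The cache-store switch mentioned above, for instance, is a one-line configuration change. A sketch of what it could look like in <code class="language-plaintext highlighter-rouge">config/environments/production.rb</code> (the hostname below is a placeholder, not our actual setup):</p>

```ruby
# Before: the dalli_store adapter, which doesn't play well with Rails 6.1
# config.cache_store = :dalli_store, 'cache-1.internal:11211'

# After: the mem_cache_store implementation suggested by both Rails and Dalli
config.cache_store = :mem_cache_store, 'cache-1.internal:11211'
```

<p>Since <code class="language-plaintext highlighter-rouge">mem_cache_store</code> also works on the current Rails version, a change like this can ship well before the upgrade itself.</p>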
<p>Instead of packing all of the above in a single deployment, we isolated any backwards compatible changes and shipped them as soon as possible in the current Rails version.</p>
<h2 id="preparation">Preparation</h2>
<p>We have decided how to organize the upgrade. Time to start preparing for it - we couldn’t just shout on Slack, “Hey everybody, start upgrading the application”.</p>
<p>As previously mentioned, we wanted to measure the effort and break it down efficiently in tasks.
How do we do this though?</p>
<h3 id="changelogs">Changelogs</h3>
<p>Obviously, the first step was to read the changelogs to get an idea of what is changing in the new version.
Besides learning about new features that our application could use, this step is also crucial for understanding and resolving more easily any failures that show up later in the upgrade process.</p>
<p>But there will be a lot of changelog entries for which it’s not obvious how they affect our application.</p>
<p>This one for example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Fix complicated has_many :through with nested where condition
</code></pre></div></div>
<p>Does this mean that we are already affected by this bug in our current version? If yes, are we already using a workaround?</p>
<p>Since we are talking about a monolith built by a multi-member engineering team, no single person can know every bit of it. They can’t answer the questions above unless they actually coded something that revealed this specific bug. But even in that case, what about the rest of the changelog entries?</p>
<p>So, after this step, what will help us get a better idea of what is going on is to take a look at our CI. How many failures do we have in the new version?
But there’s a prerequisite for that step. Updating our gems…</p>
<p>At this point, we should create a branch (we named ours <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code>) in which we will start adding the commits that will be merged to our <code class="language-plaintext highlighter-rouge">main</code> branch when we will be ready to ship the upgrade.</p>
<h3 id="gems-update">Gems update</h3>
<p>It would be great if we could just change the Rails gem version in the Gemfile, run <code class="language-plaintext highlighter-rouge">bundle</code> and get the green message.</p>
<p>But that’s pretty uncommon. A monolith usually comes with a Gemfile full of dependencies and it’s almost certain that you’ll have to upgrade some or many of them to a version compatible with the target Rails version.</p>
<p>So, after changing the Rails version in the Gemfile, we run <code class="language-plaintext highlighter-rouge">bundle update rails</code> and start resolving any failures that arise due to other gem incompatibilities.</p>
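<p>As a sketch, the only change we make by hand is the Rails constraint in the Gemfile; <code class="language-plaintext highlighter-rouge">bundle update rails</code> then surfaces every dependency that cannot be resolved against it:</p>

```ruby
# Gemfile (excerpt) - bump only the rails constraint and let Bundler
# report any gems that are incompatible with the target version.
gem 'rails', '~> 6.1.0'
```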
<p>This usually means that we have to</p>
<ul>
<li>visit the gem’s homepage to locate the appropriate version</li>
<li>read the changelogs and check if the changes affect the gem’s usages in our codebase</li>
</ul>
<p>We use <a href="https://github.com/thoughtbot/appraisal"><code class="language-plaintext highlighter-rouge">Appraisal</code></a> in all of our internal gems, so testing their compatibility with the new Rails version was as simple as creating a new appraisal definition and making sure that the tests were green. In most cases, all we had to do was extend their Rails dependency to include the new version.</p>
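<p>A minimal sketch of what such an <code class="language-plaintext highlighter-rouge">Appraisals</code> file could look like for a hypothetical internal gem (one definition per supported Rails version):</p>

```ruby
# Appraisals file (hypothetical internal gem): `appraisal install`
# generates a dedicated gemfile for each definition below.
appraise 'rails-6.0' do
  gem 'rails', '~> 6.0.0'
end

appraise 'rails-6.1' do
  gem 'rails', '~> 6.1.0'
end
```

<p>Running the suite against the new definition (for example, <code class="language-plaintext highlighter-rouge">appraisal rails-6.1 rspec</code>) verifies the compatibility.</p>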
<p>A very good practice here is to check whether the new gem version is also compatible with the current Rails version. <strong>If it is, then this version bump should be brought to the <code class="language-plaintext highlighter-rouge">main</code> branch and deployed early</strong>. This allows us to identify and deal with gem issues incrementally, gem by gem, instead of dealing with all of them upon the upgrade release. So, instead of pushing the gem version bumps to the <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> branch, we push them to the <code class="language-plaintext highlighter-rouge">main</code> branch and ship them one by one, or as we see fit.</p>
<p>Ideally, from this step, the <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> branch should contain only the commit that bumps the <code class="language-plaintext highlighter-rouge">rails</code> gem version to the target one.</p>
<h3 id="rake-appupdate"><code class="language-plaintext highlighter-rouge">rake app:update</code></h3>
<p>So, we have a branch whose <code class="language-plaintext highlighter-rouge">rails</code> version is the target one and we can <code class="language-plaintext highlighter-rouge">bundle</code> successfully.</p>
<p>At this point, we need to execute the <code class="language-plaintext highlighter-rouge">rake app:update</code> task as noted <a href="https://guides.rubyonrails.org/upgrading_ruby_on_rails.html#the-update-task">here</a> and
also <a href="https://guides.rubyonrails.org/upgrading_ruby_on_rails.html#configure-framework-defaults">configure the framework defaults</a>.</p>
<blockquote>
<p>The new Rails version might have different configuration defaults than the previous version. However, after following the steps described above, your application would still run with configuration defaults from the previous Rails version. That’s because the value for config.load_defaults in config/application.rb has not been changed yet.</p>
</blockquote>
<p>We follow the interactive session and proceed based on our application’s setup. At the end, we should carefully review the changes, especially those related to the new version’s defaults, and commit them to the <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> branch.</p>
<h3 id="test-suite">Test Suite</h3>
<p>We have successfully bundled, we adapted to the new version’s configuration and we want to run the test suite to see what’s going on.</p>
<p>Extensively testing our application makes milestones like the Rails upgrade much safer and gives us more confidence that everything will be fine.</p>
<p>In our application, we have ~75k RSpec examples, and we have set up our CI to distribute them across a group of servers, decreasing the duration from hours for a sequential run to just 15 minutes.</p>
<p>Our first execution finished with more than 1.5k failures. Even though this seemed kind of disappointing, we already knew the root cause, along with the fix, for the majority of them. Deprecations :)</p>
<h3 id="deprecations">Deprecations</h3>
<p>Rails comes with a deprecation API, <code class="language-plaintext highlighter-rouge">ActiveSupport::Deprecation</code>, and every framework component, like ActiveRecord, uses it to warn about usages that are deprecated and subject to removal, replacement or change in an upcoming release (in most cases the warnings include a suggestion on how to deal with them).</p>
<p>At Skroutz, we have set up this deprecation mechanism to work along with Rails’ instrumentation API, <a href="https://api.rubyonrails.org/classes/ActiveSupport/Notifications.html"><code class="language-plaintext highlighter-rouge">ActiveSupport::Notifications</code></a>.</p>
<p>Instead of raising an error or just logging a deprecation, we configured all of our environments to <code class="language-plaintext highlighter-rouge">notify</code> in case of a deprecation</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">config</span><span class="p">.</span><span class="nf">active_support</span><span class="p">.</span><span class="nf">deprecation</span> <span class="o">=</span> <span class="ss">:notify</span>
</code></pre></div></div>
<p>and in an initializer we subscribed to the related event in order to implement our deprecation handling.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">ActiveSupport</span><span class="o">::</span><span class="no">Notifications</span><span class="p">.</span><span class="nf">subscribe</span><span class="p">(</span><span class="s1">'deprecation.rails'</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">payload</span><span class="o">|</span>
<span class="c1"># Deprecation handling goes here</span>
<span class="k">end</span>
</code></pre></div></div>
<p>We define an allowed list of deprecation messages - deprecations that we don’t have to deal with at the moment and should be ignored.</p>
<p>The following table shows how our handling works:</p>
<table>
<thead>
<tr>
<th>Environment</th>
<th>Allowed deprecation</th>
<th>Action</th>
</tr>
</thead>
<tbody>
<tr>
<td>Production</td>
<td>Yes</td>
<td>Nothing</td>
</tr>
<tr>
<td>Production</td>
<td>No</td>
<td>Send event to Sentry</td>
</tr>
<tr>
<td>All other environments</td>
<td>Yes</td>
<td>Log the deprecation</td>
</tr>
<tr>
<td>All other environments</td>
<td>No</td>
<td>Raise it as an error</td>
</tr>
</tbody>
</table>
<p>With some simplifications, the code would look like this:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DeprecationHandler</span>
<span class="no">ALLOWED_LIST</span> <span class="o">=</span> <span class="p">[</span>
<span class="sr">/You should not do this/</span>
<span class="p">]</span>
<span class="k">def</span> <span class="nc">self</span><span class="o">.</span><span class="nf">handle</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span>
<span class="n">allowed</span> <span class="o">=</span> <span class="no">ALLOWED_LIST</span><span class="p">.</span><span class="nf">any?</span> <span class="p">{</span> <span class="o">|</span><span class="n">pattern</span><span class="o">|</span> <span class="n">pattern</span><span class="p">.</span><span class="nf">match?</span><span class="p">(</span><span class="n">payload</span><span class="p">[</span><span class="ss">:message</span><span class="p">])</span> <span class="p">}</span>
<span class="k">if</span> <span class="no">Rails</span><span class="p">.</span><span class="nf">env</span><span class="p">.</span><span class="nf">production?</span>
<span class="k">return</span> <span class="k">if</span> <span class="n">allowed</span>
<span class="n">report_to_sentry</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span>
<span class="k">else</span>
<span class="k">if</span> <span class="n">allowed</span>
<span class="no">Rails</span><span class="p">.</span><span class="nf">logger</span><span class="p">.</span><span class="nf">tagged</span><span class="p">(</span><span class="s1">'active_support'</span><span class="p">,</span> <span class="s1">'deprecation'</span><span class="p">)</span> <span class="k">do</span>
<span class="no">Rails</span><span class="p">.</span><span class="nf">logger</span><span class="p">.</span><span class="nf">warn</span><span class="p">(</span><span class="n">payload</span><span class="p">[</span><span class="ss">:message</span><span class="p">])</span>
<span class="k">end</span>
<span class="k">else</span>
<span class="k">raise</span> <span class="no">ActiveSupport</span><span class="o">::</span><span class="no">DeprecationException</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">payload</span><span class="p">[</span><span class="ss">:message</span><span class="p">])</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="no">ActiveSupport</span><span class="o">::</span><span class="no">Notifications</span><span class="p">.</span><span class="nf">subscribe</span><span class="p">(</span><span class="s1">'deprecation.rails'</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">payload</span><span class="o">|</span>
<span class="no">DeprecationHandler</span><span class="p">.</span><span class="nf">handle</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span>
<span class="k">end</span>
</code></pre></div></div>
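<p>The allowed-list check at the heart of the handler is just a regexp scan over the deprecation message. A standalone demo, with a toy pattern:</p>

```ruby
# Demo of the allowed-list matching used by the DeprecationHandler above.
ALLOWED_LIST = [
  /update_attributes(!)? is deprecated/
].freeze

def allowed?(message)
  ALLOWED_LIST.any? { |pattern| pattern.match?(message) }
end

puts allowed?('update_attributes is deprecated and will be removed from Rails 6.1')
# => true
puts allowed?('Some brand new deprecation we have not triaged yet')
# => false
```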
<hr />
<p>Given the above, most of the messages in our allowed list before we started the upgrade were deprecations generated in Rails 6.0 for behaviour that would change in our target version, 6.1.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">...</span>
<span class="sr">/Initialization autoloaded the constants/</span><span class="p">,</span>
<span class="sr">/Class level methods will no longer inherit scoping from `/</span><span class="p">,</span>
<span class="sr">/update_attributes(!)? is deprecated and will be removed from Rails 6.1 \(please, use update(!)? instead\)/</span><span class="p">,</span>
<span class="sr">/ActionMailer::Base\.receive is deprecated and will be removed in Rails 6\.1\. Use Action Mailbox to process inbound email\./</span><span class="p">,</span>
<span class="sr">/ActionView::Base instances should be constructed with a lookup context, assignments, and a controller/</span><span class="p">,</span>
<span class="sr">/ActionView::Base instances must implement `compiled_method_container` or use the class method `with_empty_template_cache` for constructing an ActionView::Base instance that has an empty cache/</span><span class="p">,</span>
<span class="sr">/Rails 6\.1 will return Content-Type header without modification/</span><span class="p">,</span>
<span class="sr">/render file: should be given the absolute path to a file/</span><span class="p">,</span>
<span class="sr">/NOT conditions will no longer behave as NOR/</span><span class="p">,</span>
<span class="o">...</span>
</code></pre></div></div>
<p>So, before starting to check each one of the 1.5k failing specs mentioned in the previous section, we first worked on dealing with these deprecations. How?</p>
<p>For <strong>each deprecation</strong>:</p>
<ol>
<li>we created a branch from our <code class="language-plaintext highlighter-rouge">main</code> branch</li>
<li>we removed the deprecation from the allowed list</li>
<li>we ran the test suite on the branch and we located the parts that were generating the deprecations - remember that our deprecation handling raises errors for non-allowed messages</li>
<li>engineers from each group prepared commits to the branch fixing the deprecations relevant to their team</li>
<li>when the suite got green, we shipped it in production, and</li>
<li>we checked our production monitoring system for deprecation events a.k.a. deprecations that occurred from code that was not fully tested</li>
</ol>
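<p>Most of these fixes are mechanical. For the <code class="language-plaintext highlighter-rouge">update_attributes</code> entry from the allowed list, for example, the fix is a straight rename to <code class="language-plaintext highlighter-rouge">update</code>. A self-contained illustration with a hypothetical stand-in model (plain Ruby, not ActiveRecord):</p>

```ruby
# Stand-in model illustrating the rename; in a real ActiveRecord model,
# `update_attributes` warns in Rails 6.0 and is removed in 6.1.
class User
  attr_accessor :name

  def update(attrs)
    attrs.each { |key, value| public_send("#{key}=", value) }
    true
  end
end

user = User.new
# was: user.update_attributes(name: 'Alice')
user.update(name: 'Alice')
puts user.name
# => Alice
```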
<p>Note here that the changes were <strong>backwards compatible</strong> - fixes were merged in the <code class="language-plaintext highlighter-rouge">main</code> branch and not deferred to <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> for the final upgrade.</p>
<h2 id="working-for-the-upgrade">Working for the upgrade</h2>
<p>After fixing all the deprecations for Rails 6.1, the test suite ended up failing with only 50 or so errors. Good news, right?</p>
<p>Well, this is the trickiest part of the upgrade process. For each failure we have to investigate and try to find out which changelog entry caused it, in order to get a good understanding of what changed and how to fix it.</p>
<p>As previously noted, we can’t know exactly how a changelog entry actually affects the codebase and in many cases we will have to check the Rails PRs that have been merged to the new version in order to gather more information.</p>
<p>Also, note that some failures might actually happen due to a framework’s bug introduced in the new version, such as <a href="https://github.com/rails/rails/issues/42525">this one</a> that we located in one of our specs and for which we <a href="https://github.com/rails/rails/pull/43100">opened a Rails PR upstream</a>.</p>
<hr />
<p>For each of the failing specs in our suite, we created a task in the tracking system and we assigned it to the proper contact person to either work on it or delegate it to one or more team members.</p>
<p>Normally, any work that has to be done from now on would be committed to the <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> branch.
The whole process might take weeks or even months to complete, and this branch should be rebased onto <code class="language-plaintext highlighter-rouge">main</code> on a weekly basis, if not more frequently.</p>
<p>There are some things we can do, though, to reduce the conflict-resolution effort of those rebases.</p>
<h2 id="backportable-changes">Backportable changes</h2>
<p>There will be changes that work in both the current and the target Rails version - these could, and should, be committed directly to the <code class="language-plaintext highlighter-rouge">main</code> branch.</p>
<p>For example, in one of our specs we made use of the <code class="language-plaintext highlighter-rouge">last_migration</code> method of <code class="language-plaintext highlighter-rouge">ActiveRecord::MigrationContext</code>, which was <a href="https://github.com/rails/rails/commit/4705ba82dbf303b5eb84c46d1c7112a75d3273e5#diff-c7b2018646f254d00541db2d6cdb3b02b64ac8cc7a7dc2fb0f1b67e9c8cb7ff8L1101-L1103">removed in Rails 6.1</a>, so we now had to calculate it ourselves. Since the calculation also works with the Rails version of our <code class="language-plaintext highlighter-rouge">main</code> branch, we pushed the fix there instead of to the <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> branch.</p>
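<p>One way to compute the last migration is simply to pick the one with the highest version number. A sketch with stand-in structs instead of real migration objects (the names below are invented for illustration):</p>

```ruby
# Stand-ins for ActiveRecord migrations: computing the "last" migration
# ourselves, now that MigrationContext#last_migration is gone.
Migration = Struct.new(:version, :name)

migrations = [
  Migration.new(2021_01_01, 'create_users'),
  Migration.new(2021_03_01, 'add_index_to_orders'),
  Migration.new(2021_02_01, 'create_orders')
]

last_migration = migrations.max_by(&:version)
puts last_migration.name
# => add_index_to_orders
```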
<h2 id="non-backportable-changes">Non-backportable changes</h2>
<p>For the rest of them, if a change is relatively small and contained, we can use a condition and alter the implementation based on it.</p>
<p>In a base module of the application we added the following helper methods:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">module</span> <span class="nn">Skroutz</span>
<span class="k">def</span> <span class="nc">self</span><span class="o">.</span><span class="nf">rails_next_version</span>
<span class="no">Gem</span><span class="o">::</span><span class="no">Version</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="s1">'7.0'</span><span class="p">)</span> <span class="c1"># Your target version here</span>
<span class="k">end</span>
<span class="k">def</span> <span class="nc">self</span><span class="o">.</span><span class="nf">rails_next?</span>
<span class="no">Rails</span><span class="p">.</span><span class="nf">gem_version</span> <span class="o">>=</span> <span class="n">rails_next_version</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Then, when introducing a small change like the following, we use the helper above to differentiate the behaviour.</p>
<p>Assume that there is a Rails framework method <code class="language-plaintext highlighter-rouge">rails_method</code> that returns a number in the current Rails version, and that we use it in a file that changes frequently:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ClassWithManyChanges</span>
<span class="k">def</span> <span class="nf">a_method</span>
<span class="k">if</span> <span class="no">Rails</span><span class="p">.</span><span class="nf">rails_method</span><span class="p">(</span><span class="n">params</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
<span class="n">logger</span><span class="p">.</span><span class="nf">info</span> <span class="s1">'All good'</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>but in the next Rails version it returns a boolean instead of a number.</p>
<p>Instead of changing the condition to use <code class="language-plaintext highlighter-rouge">true</code> instead of <code class="language-plaintext highlighter-rouge">1</code> (a change that would work only in the <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> branch)</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ClassWithManyChanges</span>
<span class="k">def</span> <span class="nf">a_method</span>
<span class="k">if</span> <span class="no">Rails</span><span class="p">.</span><span class="nf">rails_method</span><span class="p">(</span><span class="n">params</span><span class="p">)</span> <span class="o">==</span> <span class="kp">true</span>
<span class="n">logger</span><span class="p">.</span><span class="nf">info</span> <span class="s1">'All good'</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>we can instead do the following and push it to the <code class="language-plaintext highlighter-rouge">main</code> branch.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ClassWithManyChanges</span>
<span class="k">def</span> <span class="nf">a_method</span>
<span class="c1"># TODO(rails6.1): Cleanup after upgrade</span>
<span class="n">against_value</span> <span class="o">=</span> <span class="no">Skroutz</span><span class="p">.</span><span class="nf">rails_next?</span> <span class="p">?</span> <span class="kp">true</span> <span class="p">:</span> <span class="mi">1</span>
<span class="k">if</span> <span class="no">Rails</span><span class="p">.</span><span class="nf">rails_method</span><span class="p">(</span><span class="n">params</span><span class="p">)</span> <span class="o">==</span> <span class="n">against_value</span>
<span class="n">logger</span><span class="p">.</span><span class="nf">info</span> <span class="s1">'All good'</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>This might seem a bit weird, but besides saving time on conflict resolution upon rebase, it also acts as a warning for engineers when they attempt to change a part of the <code class="language-plaintext highlighter-rouge">main</code> branch that behaves differently in the next Rails version.</p>
<h2 id="delivering-the-upgrade">Delivering the upgrade</h2>
<p>At this point the suite is green and the most important milestone of the upgrade has been completed successfully.
Now the target is to deliver it safely and without surprises. Well, at least with as few surprises as possible :)</p>
<h3 id="sanity-testing">Sanity testing</h3>
<p>It is very common in our field to test something in the development environment and see it working, write specs for it and get them green, but once it goes live, users see the 500 page instead of our new feature.</p>
<p>To eliminate such cases, having a staging environment that is very close to production is a lifesaver.</p>
<h4 id="core-testing">Core testing</h4>
<p>This is a list of items to test against the new framework version:</p>
<ul>
<li>
<p><strong>Migrations:</strong> we run at least one ActiveRecord migration to make sure that everything works as expected, and we review the generated changes to the schema.</p>
</li>
<li>
<p><strong>Caching:</strong> when upgrading to a new version, it is very common to get failures when deserializing an object that was cached by the previous one. We must try to identify such cases and note them down, so that we are prepared to clear the affected keys from the cache upon releasing the upgrade - unless, as in our case, we can afford a full cache clear.</p>
</li>
<li>
<p><strong>Encryption:</strong> if we use Rails’ encryption (ex. encrypted cookies), we have to make sure that the decryption succeeds in the new version (and vice versa in case of a rollback).</p>
</li>
<li><strong>Integrations:</strong> the following checks should also be done (depending on the setup):
<ul>
<li><strong>rake:</strong> make sure that the application loads and the execution completes successfully for the most important tasks. In addition, if we are using libraries like <a href="https://github.com/javan/whenever"><code class="language-plaintext highlighter-rouge">whenever</code></a> for <strong>cron</strong> tasks, we should check that the generation of the crontab list succeeds and the result is identical to the previous version’s one.</li>
<li><strong>Background jobs:</strong> in our setup, we use <a href="https://github.com/resque/resque">resque</a> and <a href="https://kafka.apache.org/">kafka</a> for background processing - we queued jobs to both and made sure that their execution completed with the desired results.</li>
<li><strong>Benchmarking:</strong> at this point, we have to monitor the performance of the application. We used <a href="https://github.com/tmm1/stackprof">StackProf</a> along with flamegraphs and Ruby’s <a href="https://ruby-doc.org/stdlib-2.5.5/libdoc/benchmark/rdoc/Benchmark.html">Benchmark</a> module and compared the performance (memory usage and timings) of our most critical flows.</li>
<li><strong>Elasticsearch:</strong> index documents to ensure that changes to ActiveRecord models in the new version haven’t affected the generated JSON that gets indexed on the server.</li>
</ul>
</li>
<li><strong>Traffic replay</strong>: we replay a large sample of production requests against the old <em>and</em> the new implementation and verify that the results are identical.</li>
</ul>
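<p>The caching check in particular can be rehearsed ahead of time. Rails cache stores serialize entries with <code class="language-plaintext highlighter-rouge">Marshal</code> under the hood, so a dump-and-load round trip of a representative payload can surface values whose serialized layout changed between versions. A minimal sketch (the payload is hypothetical; in a real check the dump would be produced under the old Rails version and the load performed under the new one):</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code>require 'time'

# A value as the OLD version would have cached it. Rails cache stores
# serialize entries with Marshal, so dumping and re-loading approximates
# a new-version read of an old cache entry.
# (The payload below is a hypothetical example.)
old_entry = Marshal.dump(
  { product_id: 42, cached_at: Time.parse('2021-10-01 10:00:00 UTC') }
)

# What the NEW version does when it hits the cache key.
restored = Marshal.load(old_entry)

raise 'cache round-trip failed' unless restored[:product_id] == 42
puts 'cache entry deserialized cleanly'
</code></pre></div></div>
<p>Running the load side against dumps produced in the previous version’s console is what actually catches the deserialization failures described above.</p>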
<h4 id="application-testing">Application testing</h4>
<p>We deployed the <code class="language-plaintext highlighter-rouge">rails-upgrade-main</code> branch to our staging environment and asked all product groups to perform manual tests at least for the most important flows of their domain. In our case, this step led to a couple of important bug fixes that would otherwise have reached production.</p>
<h3 id="spread-the-news">Spread the news</h3>
<p>We checked everything. We’re ready to move on.</p>
<p>Given that the deployment of the upgrade will require some time and that our engineering team has more than 160 members, it is important to <strong>inform everyone about the release date a few days beforehand</strong>:</p>
<ul>
<li>
<p><strong>Product engineers</strong>: Teams with a tight schedule to release an important feature should not be informed about the upgrade at the last minute. We have to make sure that we will not block any important operations and we might even end up postponing the release for a few days if another milestone has a higher priority.</p>
</li>
<li>
<p><strong>Platform engineers</strong>: Our platform team, which is responsible for the infrastructure and site reliability, also has to be informed early enough to reserve the appropriate time to help us with the upgrade and its monitoring afterwards.</p>
</li>
<li>
<p><strong>Contact persons</strong>: The Core team is the one to deploy the upgrade, though the Contact persons have to be available throughout the process to help, investigate and hotfix if something related to their domain comes up.</p>
</li>
</ul>
<h3 id="deployment">Deployment</h3>
<p>We found the date. What are we actually going to do on that day?</p>
<h4 id="canary-release">Canary Release</h4>
<p>Our setup consists of many servers grouped by their purpose:</p>
<ul>
<li>application servers: serving the application to our end users</li>
<li>workers: executing the background jobs</li>
<li>internal tools: serving parts of the monolith to internal users (ex. content editing, reporting…)</li>
<li>etc</li>
</ul>
<p>Instead of deploying the Rails upgrade to all of the servers at once, we follow the <a href="https://martinfowler.com/bliki/CanaryRelease.html">Canary Release</a> technique.
In a nutshell, with this approach the changes are deployed to a subset of the servers and in an order that will reduce the impact in case of failure.</p>
<p>In our case, it was obvious that we should start with the servers dedicated to our internal tools. This would help us get immediate feedback from our internal users and our monitoring system and also avoid causing unnecessary frustration to our end users. So we deployed the upgrade to one of the group’s servers, everything went well, and we moved on to deploying to the rest of them.</p>
<p>Even though the workers group seemed a good next candidate, we decided to deploy to it last because, in case of failure, on top of resolving the error we would have a large amount of operational work to do for the failed jobs.</p>
<h4 id="create-a-detailed-plan">Create a detailed plan</h4>
<p>As we described above, the deployment is a multistep process.
It is extremely helpful to have a document with all the steps that we will need to follow on the release date.</p>
<p>We created a task in our tracking system in the milestone’s board in which we documented each specific deploy action along with notes, commands and resources (ex. monitoring links).</p>
<p>Here’s a sample:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Lock deploy
2. Merge `rails-upgrade-main` to `main`
3. Ensure successful build in Jenkins
4. Deploy to internals-1
Command: $ TARGET=internals-1 bundle exec deploy
Monitoring: https://monitoring/events?host=internals-1
5. Deploy to all internals
Command: $ TARGET=internals bundle exec deploy
Monitoring: https://monitoring/events?host=internals
...
10. Clear Rails.cache
Command: $ Rails.cache.clear
...
</code></pre></div></div>
<p>During the deployment, we might end up executing different commands, adding new steps, etc. Updating the task with these changes is valuable, since we will revisit it to create the next upgrade’s release plan.</p>
<h4 id="monitoring">Monitoring</h4>
<p>We use Sentry for reporting exceptions in production and Grafana with a great amount of dashboards with configured alerts on most of them. Both tools send notifications to one or more Slack channels.</p>
<p>During the release, of course, we were not just waiting for notifications to appear in Slack - we had the critical dashboards open in our browsers and checked their state constantly until we felt confident that everything was fine.</p>
<p>After cross-checking with the Platform team that things looked good on their side as well, we considered the release successful!</p>
<p>Well, not quite yet, but close. We might have tasks scheduled to run during the night or on specific days of the week, so we have to remember to check the monitoring tools occasionally for errors triggered by them, until all of them have completed successfully at least once after the upgrade.</p>
<h2 id="next-steps">Next steps</h2>
<p>Rails 7 is around the corner and there are a couple of things we can do to be better prepared for the next upgrade.</p>
<p>– <strong>Gem updates</strong>: we can schedule more frequent updates of our gems (especially those required by or depending on Rails), saving time in the next upgrade milestone</p>
<p>– <strong>Deprecations</strong>: new deprecations appeared in the current version and we can already start working on them, moving our codebase to a more compatible state for the next version</p>
<p>– <strong>Release information</strong>: we need to keep our eyes open for any major changes, new features, etc. in the new release</p>
<p>Now we’re done :)</p>
<hr />
<p>If you like providing a top-notch development environment or you get intrigued by working with Ruby & Rails, make sure to check our <a href="https://www.skroutz.gr/careers/162">Core team’s open position</a>, or our other <a href="https://www.skroutz.gr/careers#Engineering">job openings</a>.</p>
<p>Thank you for reading!</p>
<hr />
<p>PS: We almost forgot to acknowledge the upgrade’s coordinator.</p>
<figure>
<a href="../../../images/2021-upgrading-rails/engineering-cat.jpg" class="image-popup">
<img src="../../../images/2021-upgrading-rails/engineering-cat.jpg" alt="Engineering cat" />
</a>
</figure>
<p><a href="https://engineering.skroutz.gr/blog/monolith-diaries-upgrading-rails/">Monolith Diaries: Upgrading Rails</a> was originally published by Lazarus Lazaridis at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on October 22, 2021.</p>https://engineering.skroutz.gr/blog/using_hotwire_to_lazy_load_data2021-07-11T22:00:00+00:002021-07-11T22:00:00+00:00John Kapantzakishttps://engineering.skroutz.gr<p>At Skroutz we constantly try to find ways to make our website faster and consequently optimize our users’ experience. In this context, Hotwire couldn’t escape our attention as it aroused the interest of the developers community from the moment it was <a href="https://twitter.com/dhh/status/1341420143239450624?lang=en">announced</a> by its creators.</p>
<h1 id="what-is-hotwire">What is Hotwire?</h1>
<p>Hotwire, as described in the <a href="https://hotwire.dev/">official website</a>, is</p>
<blockquote>
<p>an alternative approach to building modern web applications without using much JavaScript by sending HTML instead of JSON over the wire</p>
</blockquote>
<p>In other words, Hotwire creates HTML markup, instead of JSON objects, and sends it as a response to the client’s request. This way, we avoid manipulating the response data with Javascript.</p>
<p>Furthermore, Hotwire can automatically inject the received HTML into the right place in the DOM, using <a href="https://turbo.hotwire.dev/">Turbo</a>, a set of techniques that eliminates the need to write custom Javascript in order to handle form submissions, partial DOM updates, history changes and many more.</p>
<p>As stated in their documentation, Turbo is able to handle at least 80% of the cases by itself on the client side, without the need for any Javascript to be written by you. For the remaining 20% of the cases, Hotwire provides <a href="https://stimulus.hotwire.dev/">Stimulus</a>, a lightweight Javascript framework that works well with Turbo. Stimulus can be used to create reusable components that can be bound to any HTML element and enhance it with custom behaviour.</p>
<h1 id="the-order-show-page">The order show page</h1>
<p>Let’s get started by setting the context of our example. At Skroutz we have developed a portal, known as <a href="https://merchants.skroutz.gr/merchants">Skroutz Merchants</a>, that provides useful tools to our partners to facilitate the operation of their stores. In one of its views, we show the order’s details alongside a list of any tickets related to this order.</p>
<p>In order to reduce the initial rendering time, we chose to load the tickets list asynchronously, as soon as the initial render has finished.</p>
<p>The following image illustrates a simplified wireframe of the order show page. The parts that are loaded on the initial render, such as the sidebar, the top bar and the order itself, are colored green. The tickets list section is colored orange, indicating that it gets loaded asynchronously, after the initial render.</p>
<figure>
<a href="../../../images/hotwire_lazy_load_tickets/order_show.png" class="image-popup">
<img src="../../../images/hotwire_lazy_load_tickets/order_show.png" alt="image" />
</a>
<figcaption>
<a href="../../images/hotwire_lazy_load_tickets/order_show.png">
Image 1: Merchants panel: Order show
</a>
</figcaption>
</figure>
<h1 id="lazy-load-with-vanilla-javascript">Lazy load with vanilla Javascript</h1>
<p>The process is simple: as soon as the page loads, a javascript function makes a request to <code class="language-plaintext highlighter-rouge">/merchants/orders/:code/tickets</code> path in order to fetch the tickets, if any.</p>
<p>As shown in the following block, <code class="language-plaintext highlighter-rouge">order_tickets</code> queries the database, checks to see if there are any tickets and creates the HTML from the respective partial template.</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># orders_controller.rb</span>
<span class="c1"># GET /merchants/orders/:code/tickets</span>
<span class="k">def</span> <span class="nf">order_tickets</span>
<span class="n">tickets</span> <span class="o">=</span> <span class="c1"># db query</span>
<span class="n">options</span> <span class="o">=</span> <span class="p">{</span> <span class="ss">layout: </span><span class="kp">false</span><span class="p">,</span> <span class="ss">formats: :html</span> <span class="p">}</span>
<span class="n">view</span> <span class="o">=</span> <span class="k">if</span> <span class="n">tickets</span><span class="p">.</span><span class="nf">present?</span>
<span class="n">options</span><span class="p">.</span><span class="nf">merge!</span><span class="p">(</span><span class="ss">partial: </span><span class="s1">'merchants/tickets/ticket'</span><span class="p">,</span>
<span class="ss">collection: </span><span class="n">tickets</span><span class="p">,</span>
<span class="ss">as: :ticket</span><span class="p">)</span>
<span class="k">else</span>
<span class="n">options</span><span class="p">.</span><span class="nf">merge!</span><span class="p">(</span><span class="ss">partial: </span><span class="s1">'merchants/tickets/no_tickets_message'</span><span class="p">)</span>
<span class="k">end</span>
<span class="n">respond_to</span> <span class="k">do</span> <span class="o">|</span><span class="nb">format</span><span class="o">|</span>
<span class="nb">format</span><span class="p">.</span><span class="nf">json</span> <span class="k">do</span>
<span class="n">render</span> <span class="ss">json: </span><span class="p">{</span> <span class="ss">html: </span><span class="n">render_to_string</span><span class="p">(</span><span class="n">view</span><span class="p">).</span><span class="nf">squish</span> <span class="p">},</span> <span class="ss">status: :ok</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span></code></pre></figure>
<p>Now, let’s see the frontend part. <code class="language-plaintext highlighter-rouge">OrderTicketsView</code> is the class that is responsible for fetching the tickets data and injecting the received markup into the DOM. More specifically, <code class="language-plaintext highlighter-rouge">_getOrderTicketsData</code> performs the asynchronous request, finds the <code class="language-plaintext highlighter-rouge">#js-tickets-wrapper</code> element and replaces it with the received markup.</p>
<figure class="highlight"><pre><code class="language-erb" data-lang="erb"><span class="c"><%# show.html.erb %></span>
...
<span class="nt"><div</span> <span class="na">id=</span><span class="s">"js-tickets-wrapper"</span> <span class="na">data-order-code=</span><span class="s">"</span><span class="cp"><%=</span> <span class="n">order</span><span class="p">.</span><span class="nf">code</span> <span class="cp">%></span><span class="s">"</span><span class="nt">></span>
<span class="nt"><div</span> <span class="na">class=</span><span class="s">"loading-tickets flex-row"</span><span class="nt">></span>
<span class="cp"><%=</span> <span class="n">render</span> <span class="s1">'merchants/shared/spinner'</span> <span class="cp">%></span>
<span class="nt"></div></span>
<span class="nt"></div></span>
...</code></pre></figure>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="c1">// order_tickets_view.js</span>
<span class="k">export</span> <span class="k">default</span> <span class="kd">class</span> <span class="nc">OrderTicketsView</span> <span class="p">{</span>
<span class="nf">constructor</span><span class="p">()</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nf">_cacheElements</span><span class="p">();</span>
<span class="k">this</span><span class="p">.</span><span class="nf">_getOrderTicketsData</span><span class="p">();</span>
<span class="p">}</span>
<span class="nf">_cacheElements</span><span class="p">()</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nx">_ticketsWrapper</span> <span class="o">=</span> <span class="nb">document</span><span class="p">.</span><span class="nf">getElementById</span><span class="p">(</span><span class="dl">'</span><span class="s1">js-tickets-wrapper</span><span class="dl">'</span><span class="p">);</span>
<span class="k">if </span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">_ticketsWrapper</span><span class="p">)</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nx">_orderCode</span> <span class="o">=</span> <span class="k">this</span><span class="p">.</span><span class="nx">_ticketsWrapper</span><span class="p">.</span><span class="nx">dataset</span><span class="p">.</span><span class="nx">orderCode</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="nf">_getOrderTicketsData</span><span class="p">()</span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">_orderCode</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">orderTicketsUrl</span> <span class="o">=</span> <span class="s2">`</span><span class="p">${</span><span class="k">this</span><span class="p">.</span><span class="nx">_orderCode</span><span class="p">}</span><span class="s2">/tickets`</span><span class="p">;</span>
<span class="nx">axios</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="nx">orderTicketsUrl</span><span class="p">)</span>
<span class="p">.</span><span class="nf">then</span><span class="p">(({</span> <span class="nx">data</span> <span class="p">})</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nf">_appendTicketsGrid</span><span class="p">(</span><span class="nx">data</span><span class="p">.</span><span class="nx">html</span><span class="p">);</span>
<span class="p">})</span>
<span class="p">.</span><span class="k">catch</span><span class="p">(()</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nf">_showErrorMessage</span><span class="p">();</span>
<span class="p">});</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="nf">_appendTicketsGrid</span><span class="p">(</span><span class="nx">tickets</span><span class="p">)</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nx">_ticketsWrapper</span><span class="p">.</span><span class="nx">parentElement</span><span class="p">.</span><span class="nx">innerHTML</span> <span class="o">=</span> <span class="nx">tickets</span><span class="p">;</span>
<span class="p">}</span>
<span class="nf">_showErrorMessage</span><span class="p">()</span> <span class="p">{</span>
<span class="k">this</span><span class="p">.</span><span class="nx">_ticketsWrapper</span><span class="p">.</span><span class="nx">innerHTML</span> <span class="o">=</span> <span class="s2">`<div class="box-alert error"></span><span class="p">${</span><span class="nf">__</span><span class="p">(</span>
<span class="dl">'</span><span class="s1">Failed loading tickets</span><span class="dl">'</span>
<span class="p">)}</span><span class="s2"></div>`</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>As we can see from the <code class="language-plaintext highlighter-rouge">order_tickets_view.js</code> file, we have to write a fair amount of custom javascript code to achieve lazy loading behaviour. Wouldn’t it be nice if we had a way to apply this lazy loading feature without the boilerplate javascript code?</p>
<h1 id="introducing-turbo-frames">Introducing Turbo Frames</h1>
<p>Fortunately, Turbo provides <a href="https://turbo.hotwire.dev/handbook/frames">Turbo Frames</a>, a set of techniques that help us decompose a page into independent parts that get updated individually.</p>
<p>A turbo frame is nothing more than a custom HTML element with the <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag. Every turbo frame element must have a unique id that is used by Turbo in order to update its contents. Anything that is wrapped within a <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag belongs to a separate context that gets updated independently of the rest of the page.</p>
<p><a href="https://turbo.hotwire.dev/handbook/frames#lazily-loading-frames">Lazily loading frames</a> are a special case of turbo frames that fits our case perfectly. In order to create a lazily loading frame, we just have to provide a <code class="language-plaintext highlighter-rouge">src</code> attribute on the <code class="language-plaintext highlighter-rouge"><turbo-frame></code> element with a url as its value. As soon as the <code class="language-plaintext highlighter-rouge"><turbo-frame></code> element gets rendered, Turbo will make a request to the provided url and try to update the frame’s contents with the received HTML (as we said earlier, Hotwire responds with HTML instead of JSON). This update happens automatically, and we don’t have to write any custom javascript to handle the response.</p>
<h1 id="applying-lazily-loading-frames">Applying lazily loading frames</h1>
<p>Introducing turbo frames to an existing codebase is quite simple. Just wrap the desired part of the page with a <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag and you have created a frame.</p>
<p>In this way, in the <code class="language-plaintext highlighter-rouge">show.html.erb</code> view, we replace the <code class="language-plaintext highlighter-rouge">#js-tickets-wrapper</code> div with a <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag. The new turbo frame element must have a unique id, so we assign it the <code class="language-plaintext highlighter-rouge">order_tickets</code> id, along with a url as the value of the <code class="language-plaintext highlighter-rouge">src</code> attribute. Finally, we add the <code class="language-plaintext highlighter-rouge">loading: 'lazy'</code> attribute so that the request to the provided url happens only when the turbo frame element becomes visible in the viewport. More details about the available HTML attributes can be found <a href="https://turbo.hotwire.dev/reference/frames#html-attributes">here</a>.</p>
<figure class="highlight"><pre><code class="language-erb" data-lang="erb"><span class="c"><%# show.html.erb %></span>
...
<span class="c"><%# <div id="js-tickets-wrapper" data-order-code="<%= order.code %></span>"> %>
<span class="cp"><%=</span> <span class="n">turbo_frame_tag</span> <span class="ss">:order_tickets</span><span class="p">,</span>
<span class="ss">src: </span><span class="n">tickets_merchants_order_path</span><span class="p">(</span><span class="ss">code: </span><span class="vi">@order</span><span class="p">.</span><span class="nf">code</span><span class="p">),</span>
<span class="ss">loading: </span><span class="s1">'lazy'</span> <span class="k">do</span> <span class="cp">%></span>
<span class="nt"><div</span> <span class="na">class=</span><span class="s">"loading-tickets flex-row"</span><span class="nt">></span>
<span class="cp"><%=</span> <span class="n">render</span> <span class="s1">'merchants/shared/spinner'</span> <span class="cp">%></span>
<span class="nt"></div></span>
<span class="cp"><%</span> <span class="k">end</span> <span class="cp">%></span>
<span class="c"><%# </div> %></span></code></pre></figure>
<p>Then, we have to adjust the response of the action that gets called when the turbo frame element requests the provided url. The turbo frame expects a response that contains HTML markup, so we alter the contents of the <code class="language-plaintext highlighter-rouge">respond_to</code> block in order to return the respective partial view. Furthermore, we no longer need the <code class="language-plaintext highlighter-rouge">options</code> and <code class="language-plaintext highlighter-rouge">view</code> objects, because we don’t build the HTML manually as we did before with <code class="language-plaintext highlighter-rouge">render_to_string</code>.</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># orders_controller.rb</span>
<span class="c1"># GET /merchants/orders/:code/tickets</span>
<span class="k">def</span> <span class="nf">order_tickets</span>
<span class="n">tickets</span> <span class="o">=</span> <span class="c1"># db query</span>
<span class="c1"># options = { layout: false, formats: :html }</span>
<span class="c1"># view = if tickets.present?</span>
<span class="c1"># options.merge!(partial: 'merchants/tickets/ticket',</span>
<span class="c1"># collection: tickets,</span>
<span class="c1"># as: :ticket)</span>
<span class="c1"># else</span>
<span class="c1"># options.merge!(partial: 'merchants/tickets/no_tickets_message')</span>
<span class="c1"># end</span>
<span class="n">respond_to</span> <span class="k">do</span> <span class="o">|</span><span class="nb">format</span><span class="o">|</span>
<span class="c1"># format.json do</span>
<span class="c1"># render json: { html: render_to_string(view).squish }, status: :ok</span>
<span class="c1"># end</span>
<span class="nb">format</span><span class="p">.</span><span class="nf">html</span> <span class="k">do</span>
<span class="k">if</span> <span class="n">tickets</span><span class="p">.</span><span class="nf">present?</span>
<span class="n">render</span> <span class="ss">partial: </span><span class="s1">'merchants/tickets/tickets'</span><span class="p">,</span> <span class="ss">locals: </span><span class="p">{</span> <span class="ss">tickets: </span><span class="n">tickets</span> <span class="p">}</span>
<span class="k">else</span>
<span class="n">render</span> <span class="ss">partial: </span><span class="s1">'merchants/tickets/no_tickets_message'</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span></code></pre></figure>
<p>There is one more thing we need to do: adjust the <code class="language-plaintext highlighter-rouge">merchants/tickets/no_tickets_message</code> partial so that it responds with the expected markup. <code class="language-plaintext highlighter-rouge">merchants/tickets/tickets</code> was created from the start to wrap the collection of tickets in a way that Turbo can handle.</p>
<p>Turbo has to match the content it receives from the request to the provided url with the part of the page that it needs to update. As we said earlier, we gave the <code class="language-plaintext highlighter-rouge">order_tickets</code> id to the turbo frame element. Turbo will try to find a <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag with the same id inside the response body and, if it finds one, it takes its contents and replaces the contents of the page’s <code class="language-plaintext highlighter-rouge">#order_tickets</code> turbo frame element with them.</p>
<p>So, nothing scary, just wrap the contents with a <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag with the appropriate id as shown in the following blocks.</p>
<figure class="highlight"><pre><code class="language-erb" data-lang="erb"><span class="c"><%# _tickets.html.erb %></span>
<span class="cp"><%=</span> <span class="n">turbo_frame_tag</span> <span class="ss">:order_tickets</span> <span class="k">do</span> <span class="cp">%></span>
<span class="cp"><%=</span> <span class="n">render</span> <span class="ss">partial: </span><span class="s1">'merchants/tickets/ticket'</span><span class="p">,</span>
<span class="ss">collection: </span><span class="n">tickets</span><span class="p">,</span>
<span class="ss">as: :ticket</span> <span class="cp">%></span>
<span class="cp"><%</span> <span class="k">end</span> <span class="cp">%></span></code></pre></figure>
<figure class="highlight"><pre><code class="language-erb" data-lang="erb"><span class="c"><%# _no_tickets_message.html.erb %></span>
<span class="c"><%# Previously %></span>
<span class="c"><%# <div class="box-alert warning"> %></span>
<span class="c"><%# <%= _('No tickets found') %></span>
<span class="c"><%# </div> %></span>
<span class="c"><%# Add turbo_frame_tag %></span>
<span class="cp"><%=</span> <span class="n">turbo_frame_tag</span> <span class="ss">:order_tickets</span> <span class="k">do</span> <span class="cp">%></span>
<span class="nt"><div</span> <span class="na">class=</span><span class="s">"box-alert warning"</span><span class="nt">></span>
<span class="cp"><%=</span> <span class="n">_</span><span class="p">(</span><span class="s1">'No tickets found'</span><span class="p">)</span> <span class="cp">%></span>
<span class="nt"></div></span>
<span class="cp"><%</span> <span class="k">end</span> <span class="cp">%></span></code></pre></figure>
<p>Oh, and don’t forget, we no longer need the custom javascript code from <code class="language-plaintext highlighter-rouge">order_tickets_view.js</code>, so, we can safely delete it!</p>
<p>And that’s it! In three simple steps we have introduced Turbo Frames to our codebase in order to achieve the same lazy loading behaviour, without the use of custom javascript.</p>
<h1 id="summary">Summary</h1>
<p>In this post, we tried to demonstrate the ease with which we can use Turbo Frames. We have completed the refactoring in three simple steps:</p>
<ul>
<li>Wrap the desired part of the page with a <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag and give it a unique id and a url as the value of the <code class="language-plaintext highlighter-rouge">src</code> attribute</li>
<li>Refactor the controller’s response as needed</li>
<li>Add the <code class="language-plaintext highlighter-rouge"><turbo-frame></code> tag with the appropriate id to the partials that get rendered from the controller</li>
</ul>
<p>Apart from the simplicity of this refactoring, we have achieved a small reduction of our codebase (as shown in the following image from Github), as a result of removing the custom javascript code that handled these updates, which are now handled automatically by Turbo.</p>
<figure>
<a href="../../../images/hotwire_lazy_load_tickets/github_lines.png" class="image-popup">
<img src="../../../images/hotwire_lazy_load_tickets/github_lines.png" alt="image" />
</a>
<figcaption>
<a href="../../images/hotwire_lazy_load_tickets/github_lines.png">
Image 2: Github: Lines removed and added
</a>
</figcaption>
</figure>
<h1 id="next-steps">Next steps</h1>
<p>Turbo comes with many more techniques, apart from Turbo Frames. <a href="https://turbo.hotwire.dev/reference/streams">Turbo Streams</a> is another powerful feature that can improve the dynamic nature of any app. We can use streams to broadcast changes to our models from the server to the client. This is done over a <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSocket">WebSocket</a> connection that Turbo automatically establishes and handles for us.</p>
<p>In our case, we can take advantage of the power of Turbo Streams and push any updates of a specific order’s tickets to the client, so users will be able to see live updates (the insertion of a new ticket, a deletion or an edit) on their screen, without having to constantly refresh the page to fetch the latest state.</p>
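<p>As a rough sketch of what this could look like with the turbo-rails gem (the model callback below is an illustration based on its <code class="language-plaintext highlighter-rouge">Turbo::Broadcastable</code> API, not our actual implementation), a newly created ticket could broadcast itself to its order’s stream:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code># app/models/ticket.rb (illustrative sketch)
class Ticket < ApplicationRecord
  belongs_to :order

  # After the ticket is committed, prepend its partial to the
  # "order_tickets" target on every client subscribed to this
  # order's stream - no page refresh needed.
  after_create_commit do
    broadcast_prepend_to order,
                         target: 'order_tickets',
                         partial: 'merchants/tickets/ticket',
                         locals: { ticket: self }
  end
end
</code></pre></div></div>
<p>The view would then subscribe to the stream with <code class="language-plaintext highlighter-rouge">turbo_stream_from @order</code> next to the existing turbo frame, and Turbo would establish the WebSocket connection and apply the prepend on its own.</p>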
<p><a href="https://engineering.skroutz.gr/blog/using_hotwire_to_lazy_load_data/">Hotwire @ Skroutz: Lazy load data with minimum effort</a> was originally published by John Kapantzakis at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on July 11, 2021.</p>https://engineering.skroutz.gr/blog/seo-in-skroutz-our-top-5-principles-and-values2021-06-17T21:00:00+00:002021-06-17T21:00:00+00:00Vasilis Giannakourishttps://engineering.skroutz.gr<blockquote>
<p><strong>Table of Contents</strong></p>
<p><a href="#serve-the-human-not-the-machine">Serve the human, not the machine</a> <br />
› <a href="#great-user-experience-should-be-your-top-priority">Great User Experience should be your top priority</a> <br />
› <a href="#you-cant-fool-google-in-the-long-run">You can’t fool Google in the long run</a> <br /></p>
<p><a href="#user-intent-is-your-guiding-principle-for-great-content">User intent is your guiding principle for great content</a> <br />
› <a href="#know-your-audience">Know your audience</a> <br />
› <a href="#have-a-quality-page-for-every-important-to-your-company-query">Have a quality page for every important (to your company) query</a> <br /></p>
<p><a href="#an-excellent-site-performance--usability-should-be-a-company-objective-not-just-a-task">Excellent site performance & usability should be a company objective, not just a task</a> <br /></p>
<p><a href="#understand-how-google-sees-your-property-must-be-a-top-priority">Understanding how Google sees your property must be a top priority</a> <br />
› <a href="#google-doesnt-have-to-know-everything-you-can-help">Google doesn’t have to know everything; you can help!</a> <br />
› <a href="#are-you-confident-that-googlebot-can-always-parse-all-your-content">Are you confident that GoogleBot can always parse all your content?</a> <br /></p>
<p><a href="#seo-is-a-team-sport">SEO is a team sport</a> <br />
› <a href="#seo-should-be-in-the-dna-of-the-company-not-just-an-extra-task">SEO should be in the DNA of the company, not just an extra task</a> <br />
› <a href="#seo-unveils-helpful-actionable-data-and-creates-tools-that-help-the-other-teams-objectives">SEO unveils helpful, actionable data and creates tools that help the other teams’ objectives</a> <br /></p>
<p><a href="#final-words">Final Words</a> <br /></p>
</blockquote>
<p>With almost 10,000 stores and more than 10 million products on its platform, <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> is currently the <a href="https://www.similarweb.com/top-websites/greece/" target="_blank">fourth most visited site</a> (after Google, Facebook, and YouTube) and the leading Marketplace in Greece. <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> has on average 35M visits per month, with the vast majority of the traffic coming from Organic and Direct channels; we have never used paid ads (Adwords, etc.) for driving traffic to categories and products.</p>
<p>From the early days of Skroutz, back in 2005, we focused on <strong>quality content and experience</strong> in order to drive organic traffic. Although we didn’t always have a dedicated SEO team, the SEO mentality was present throughout the company. This mentality is what gave us an extremely good performance in the Greek SERPs and a steady year-over-year organic growth.</p>
<p>In this article, we share the <strong>most important values and principles</strong> that we have followed all these years. Although <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> is a marketplace and many of the principles focus on e-commerce SEO aspects, we believe that any member of the SEO community could find some useful information that can be applied to their websites.</p>
<h1 id="serve-the-human-not-the-machine">Serve the human, not the machine</h1>
<h3 id="great-user-experience-should-be-your-top-priority">Great User Experience should be your top priority.</h3>
<p>This is basically derived from our <a href="https://www.skroutz.gr/careers#journey" target="_blank">core values</a>, as a company. We <strong>focus on the user journey</strong> and seek to give the customer the best experience in every step.</p>
<p>Although we always strive to get our content and structure accessible and optimized for search engines, we never build things <strong>exclusively for SEO reasons</strong>. So, be it a new feature or a page redesign, our primary focus is always an <strong>excellent user experience</strong>.</p>
<p>After all, according to Google’s <a href="https://web.dev/vitals/" target="_blank">Web Vitals</a>: <em>“Optimizing for quality of user experience is key to the long-term success of any site on the web.”</em></p>
<h3 id="you-cant-fool-google-in-the-long-run">You can’t fool Google in the long run.</h3>
<p>We are not going to talk about this extensively, but there are many grey and black hat SEO techniques that can lead to some good short-term results and aren’t endorsed, of course, by the <a href="https://developers.google.com/search/docs/advanced/guidelines/overview#quality" target="_blank">Google Quality Guidelines</a>.</p>
<p>Well, we think that if you want to build a site on solid “SEO” ground, get all the tremendous benefits of organic traffic in the long term, and not lose sleep over every Google Core Update, you should stay away from any shady techniques. Google, despite its flaws, has evolved a lot over the years, and sooner or later you are likely to get caught and penalized.</p>
<p>After all, who wants to spend a lot of time on something that won’t pay off in the future when they could work on things that <strong>create value</strong> for their visitors?</p>
<p>User experience, content, and more technical stuff like performance, crawlability & indexability, and website architecture are some of the things you might invest your time in!</p>
<h1 id="user-intent-is-your-guiding-principle-for-great-content">User intent is your guiding principle for great content</h1>
<h3 id="know-your-audience">Know your audience</h3>
<p>If you can deeply understand your audience, you have made the first step toward structuring your pages to serve the user’s intent; that’s something that Google rewards in the long term.</p>
<p>By “deeply understand”, we don’t mean only <strong>how</strong> they search (Search Intent) but also:</p>
<ul>
<li><strong>What type</strong> of information is likely to help them most.</li>
<li><strong>How</strong> should you serve that content to help the user.</li>
<li><strong>Which</strong> piece of content can remove any doubts from the user to continue their journey.</li>
</ul>
<p>At Skroutz, we use many techniques to learn about our users. We start with the <strong>Search Intent</strong> (how the users search on Google), and then we try to unveil valuable insights about their behavior <strong>after landing</strong> on our site.</p>
<blockquote>
<p><strong>Skroutz Info:</strong> Apart from quantitative research, in order to deeply understand what information we need on our <strong>Product</strong> or <strong>Category</strong> pages, our User Research Team runs comprehensive qualitative & UX research.<br /><br />
For example, they use Live Chats, User Surveys and Live Usability Tests with scenarios like “I want to get a Refrigerator for my family”. They gather all the pain points of the User Journey and use them to enhance our products.</p>
</blockquote>
<h3 id="have-a-quality-page-for-every-important-to-your-company-query">Have a quality page for every important (to your company) query</h3>
<p>This is one of the <strong>fundamental principles</strong> of SEO, yet many sites neglect it for fear of duplicate content or <a href="https://engineering.skroutz.gr/blog/SEO-Crawl-Budget-Optimization-2019/" target="_blank">crawl budget issues</a>. Especially for E-commerce Sites, there are a lot of “traditional” rules that many (or their CMS) blindly follow:</p>
<p><em>“You should always no-follow & no-index category facets (filters).”</em></p>
<p><em>“Product variations are duplicate pages and should always be blocked from Google.”</em></p>
<p><em>“Out-of-Stock products have no value and should be removed from the Google Index as soon as possible.”</em></p>
<p>At Skroutz, we think that everything should be decided based on the user <strong>search intent</strong> and the company’s <strong>objectives</strong>. If one page has value for the company and can drive high-quality organic traffic, there is <strong>no reason why this page shouldn’t be indexed</strong>.</p>
<p>If we want to be more specific:</p>
<ul>
<li>Many facet combinations (Category Filters) can rank for many short and long-tail searches. If something has value for the visitors, index it; if not, save your crawl budget for another quality page.</li>
</ul>
<blockquote>
<p><strong>Skroutz Info:</strong> We use a sophisticated & automated way of indexing filter combinations (Faceted Navigation) and following their links, mainly based on traffic and internal searches.</p>
</blockquote>
<ul>
<li>In some cases, product variations may have a substantial difference regarding search intent. For example, some color variations in fashion products have a decent search volume for many different colors. This means that the user wants to see a specific variation of one product. Hence, a dedicated page for each color might be more relevant and helpful (e.g. recommending a suited color-complementary product) and cumulatively drive more traffic than a single page that contains every variation.</li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-principles-and-values-2021/keywords-skroutz-seo.png" alt="" /></p>
<ul>
<li>A large number of out-of-stock products can drive a lot of traffic, even if they have been discontinued for months. In some cases (e.g., a newer model came out), it’s beneficial to test if you could add value to a visitor by promoting more recent/ related products in the out-of-stock product pages.</li>
</ul>
<blockquote>
<p><strong>Skroutz Info:</strong> We keep our out-of-stock products until there is no genuine search interest for them. Some of them drive quality traffic to relevant (linked) products for many months after the day of being out-of-stock.</p>
</blockquote>
<h1 id="an-excellent-site-performance--usability-should-be-a-company-objective-not-just-a-task">Excellent site performance & usability should be a company objective, not just a task</h1>
<p>There has been a lot of chatter in the SEO community lately about the <a href="https://web.dev/vitals/" target="_blank">Web Vitals</a> and Google’s <a href="https://developers.google.com/search/blog/2021/04/more-details-page-experience" target="_blank">page experience update</a> that is taking place. Some are rushing now to fix those metrics to increase or preserve their organic performance after the update.</p>
<p>At Skroutz, we believe that delivering a great user experience on the web is <strong>heavily impacted</strong> by <strong>site performance and usability</strong>. That’s why <a href="https://engineering.skroutz.gr/blog/speed-the-journey-to-delivering-a-faster-experience-at-skroutz-gr/" target="_blank">speed was always a critical factor for Skroutz.gr</a>.</p>
<p>In order to preserve an excellent performance and usability:</p>
<ul>
<li>We are actively monitoring all the SEO-specific metrics like <a href="https://web.dev/lcp/" target="_blank">LCP</a>, <a href="https://web.dev/fid/" target="_blank">FID</a>, <a href="https://web.dev/cls/" target="_blank">CLS</a>.</li>
<li>We have set up a “speed mentality” for our Front-End engineers, especially for the latest and greatest things on rendering performance.</li>
<li>Our Systems Team is actively monitoring all requests, response volumes, and timings to ensure a stable and fast performance of our servers.</li>
</ul>
<h1 id="understand-how-google-sees-your-property-must-be-a-top-priority">Understanding how Google sees your property must be a top priority</h1>
<h3 id="google-doesnt-have-to-know-everything-you-can-help">Google doesn’t have to know everything; you can help!</h3>
<p>Google has improved its crawling capabilities over the years and, in most cases, GoogleBot can crawl a site efficiently, regardless of the technology used in the backend or the site’s size.</p>
<p>However, crawl efficiency is not always guaranteed for large sites (1 million+ unique pages) or sites with daily updated content. In those cases, <strong>prioritizing</strong> what to crawl is a vital aspect and should be considered in your SEO strategy.</p>
<p>How can you help Google?</p>
<ul>
<li><strong>Sitemaps</strong>: Help Google understand what YOU think should be prioritized.</li>
<li><strong>Content Pruning</strong>: Remove pages that are of little value to your audience and save crawl budget.</li>
<li><strong>Site Architecture</strong>: Help Google find & crawl your site easily, and understand the importance of every page.</li>
<li><strong>Internal Linking</strong>: Help Google understand how each page is related to each other and boost crawl rate for your important pages.</li>
</ul>
<blockquote>
<p><strong>Skroutz Info:</strong> 2 years ago, we optimized our crawl budget by removing 72% of Skroutz indexed URLs. If you are curious about how we did it, you can read the <a href="https://engineering.skroutz.gr/blog/SEO-Crawl-Budget-Optimization-2019/" target="_blank">detailed case study</a>.</p>
</blockquote>
<h3 id="are-you-confident-that-googlebot-can-always-parse-all-your-content">Are you confident that GoogleBot can always parse all your content?</h3>
<p>The web is changing, and so is SEO. The need for better website design and user experience has accelerated the adoption of new technologies and frameworks, like ReactJS and VueJS, that can change the content of a web page dynamically. This can create problems for SEO teams.</p>
<p>If your site makes heavy use of JavaScript, you also have to know:</p>
<ul>
<li>If Google can crawl and parse your content.</li>
<li>If your most important information like meta robots, titles & descriptions are always served correctly to GoogleBot.</li>
<li>In client-side rendering, you should be aware of the time needed for all your content to be indexed, especially if there are frequent updates; in such cases, GoogleBot will crawl and index the HTML first and come back later to render the JavaScript when its resources become available.</li>
</ul>
<h1 id="seo-is-a-team-sport">SEO is a team sport</h1>
<h3 id="seo-should-be-in-the-dna-of-the-company-not-just-an-extra-task">SEO should be in the DNA of the company, not just an extra task</h3>
<p>SEO, especially in enterprise-level websites with millions of pages, shouldn’t be one team’s job; it has to be embedded in the company’s DNA. Imagine how much easier it would be for SEO teams if non-SEO teams had a clear knowledge of:</p>
<ul>
<li>what SEO is,</li>
<li>why their job affects SEO,</li>
<li>how they can help the SEO Team and vice versa,</li>
<li>when they should proactively get in touch with the SEO team.</li>
</ul>
<p>At Skroutz, we are trying to embed SEO throughout the company as a mindset for every individual, from Product & Design to Content and Engineering. We use training, workshops and meetings with individuals/ teams so that everyone is involved.</p>
<blockquote>
<p><strong>Skroutz Info:</strong> Content teams are actively involved in many “SEO”-type tasks, like Keyword Research for Category & Product Titles.</p>
</blockquote>
<h3 id="seo-unveils-helpful-actionable-data-and-creates-tools-that-help-the-other-teams-objectives">SEO unveils helpful, actionable data and creates tools that help the other teams’ objectives</h3>
<p>Knowing how a user searches to find a specific piece of information in Google is an invaluable asset for site owners. In addition, this knowledge is something that the SEO team specializes in and can use to create value for many other teams and the company.</p>
<p>Some examples of different cases where the SEO team can really offer value are the following:</p>
<ul>
<li>Help customer support teams by sharing information about how customers search for the information they need from the site. For example, if many people search for “how to return a product in site X” or “cost of the X service”, the SEO team can propose some changes or a new section/ landing page, thus decreasing the number of phone calls/ emails.</li>
<li>Help Merchandising & Marketing Teams with prioritizing their promotional efforts (Site Banners, Social Media Posts, etc) especially for Seasonal products/ services, by providing them with weekly or monthly organic trends for some keywords or landing pages.</li>
</ul>
<blockquote>
<p><strong>Skroutz Info:</strong> We have created a Data Studio Dashboard with Organic Trends for Categories, Landing Pages, and Keyphrases using Search Console data. This Dashboard is used by many members of the Merchandising and Marketing teams.</p>
</blockquote>
<p><img src="https://engineering.skroutz.gr/images/seo-principles-and-values-2021/google-trends-skroutz-seo.png" alt="" /></p>
<ul>
<li>Educate Content teams about SEO and create tools that help their everyday job, like creating new Product or Category pages. For example, Search Console can be used to create a tool (via API or Data Studio) where members of Content Teams can find popular keyphrases and use them in titles or main content.</li>
</ul>
<h1 id="final-words">Final Words</h1>
<p>Having good organic performance is a long, difficult journey, especially for large and complex websites. However, if you stay focused on providing the best user experience, you will be rewarded with great results in the long term.</p>
<p>We hope that you found this article useful as a source of inspiration for your SEO adventure!</p>
<p>What are the values and principles that you follow, regarding SEO? Let us know, in a comment below (we’ll reply to all questions).</p>
<p>On Behalf of <a href="https://www.skroutz.gr/careers#Growth" target="_blank">Growth Team</a>,<br />
Vasilis.</p>
<hr />
<style type="text/css">
.entry-content h3 {
line-height: 1.2;
}
.entry-content img {
margin: 20px 0;
}
.entry-content td {
background: #fafafa;
font-size: 12px;
}
.entry-content blockquote {
background: #f6f6f6;
padding: 20px 25px;
border: 0;
margin: 30px 0;
transition: none;
}
.entry-content blockquote {
font-style: normal;
}
.entry-content blockquote ~ blockquote p {
border: 0;
}
.entry-content blockquote p {
border-bottom: 1px dotted #ccc;
padding-bottom: 5px;
}
.entry-content blockquote > p > a {
color: #1d1db8;
}
.entry-content blockquote p,
.entry-content blockquote li {
font-size: .9rem;
}
.entry-content p:last-child {
margin-bottom: 0;
}
.entry-content blockquote {
font-style: normal;
}
.entry-content blockquote p,
.entry-content blockquote li {
font-size: .9rem;
}
@media screen and (min-width: 48em) {
.entry-content blockquote p,
.entry-content blockquote li {
font-size: 1rem;
}
}
.entry-content a,
.entry-content code {
white-space: normal;
word-break: break-word;
}
</style>
<p><a href="https://engineering.skroutz.gr/blog/seo-in-skroutz-our-top-5-principles-and-values/">SEO at Skroutz.gr: Our Top 5 Principles & Values</a> was originally published by Vasilis Giannakouris at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on June 17, 2021.</p>https://engineering.skroutz.gr/blog/refactor-react-app-to-progressively-load-its-data2021-04-07T22:00:00+00:002021-04-07T22:00:00+00:00John Kapantzakishttps://engineering.skroutz.gr<p>Apart from its main product, Skroutz provides various internal tools to its people. These tools are developed in-house and are highly customized for our specific needs. One of these tools provides its users with statistics related to orders. Let’s refer to this page as the <strong>statistics page</strong> from now on.</p>
<h1 id="the-problem">The problem</h1>
<p>The statistics page is rendered using React and is responsible for fetching specific data from the backend and displaying them through various charts to the end user. The problem is twofold: on the one hand, all required data come from a single database query which is quite heavy and takes some time to finish (depending on the requested time period and the number of shops); on the other hand, there is the way React handles this waiting.</p>
<p>The following image shows what the user sees while waiting for the page to finish loading.</p>
<figure>
<a href="../../../images/react_progressive_load/before.png" class="image-popup">
<img src="../../../images/react_progressive_load/before.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/before.png">
Image 1: Before the refactoring
</a>
</figcaption>
</figure>
<p>This is not the best user experience because we have to <strong>wait for all data to be available</strong> before React starts to render the child components that are going to display the desired data. Furthermore, there is too much <strong>blank space</strong> on the screen while the page loads.</p>
<p>The following illustration depicts the way that the current implementation is organized. Each solid-lined rectangle represents a React component and each arrow represents data that flow from one component to another.</p>
<p>There is a wrapper component (the outer rectangle) that is responsible for fetching the data from the backend. Inside the wrapper component there is another component (the intermediate rectangle) that holds various child components (the colored rectangles) which are going to display the respective data.</p>
<p>When the data are available, the wrapper component updates its internal state and all child components get re-rendered, because we pass the data as props to each one of them through the intermediate component.</p>
<figure>
<a href="../../../images/react_progressive_load/single_source_of_data.png" class="image-popup">
<img src="../../../images/react_progressive_load/single_source_of_data.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/single_source_of_data.png">
Image 2: Single source of data
</a>
</figcaption>
</figure>
<p>The components are colored this way on purpose. Different color means different data. We can see that some components request different data from each other, but others, like A and B, or C and E, request the same data, only they display them in a slightly different way.</p>
<h1 id="proposed-solution">Proposed solution</h1>
<p>As mentioned before, the main problem with the initial implementation is that React has to wait for a heavy query to finish before it gets all the required data.</p>
<p>What if each component requested its own data independently from the backend? We could split the one heavy query into smaller ones that would be called by the respective components. We may end up with multiple network requests instead of one, but we can render each component as soon as its data are available.</p>
<figure>
<a href="../../../images/react_progressive_load/independent_fetch.png" class="image-popup">
<img src="../../../images/react_progressive_load/independent_fetch.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/independent_fetch.png">
Image 3: Independent data fetch
</a>
</figcaption>
</figure>
<p>What do we want to achieve with these changes?</p>
<ul>
<li><strong>Better user experience</strong>, because the user will see various components in a loading state (instead of a full page loader) and, gradually, each one of them will render the respective chart as soon as it gets its data. This is a valid benefit here because the nature of the specific page is to provide various, <strong>independent metrics</strong> that can be consumed individually by the user and provide valuable insights. In other words, the user doesn’t have to view all the data that the page will eventually render in order to draw a conclusion.</li>
<li><strong>Avoiding a single point of failure</strong>: by executing multiple requests, we avoid the case where a single failed request prevents all the components from being rendered, leaving a blank page with an error message. With multiple, independent components we can render the ones that have data, while for the ones where an error has occurred, we can render an error message with a retry button.</li>
</ul>
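<p>As a rough sketch of how this per-component state could be modeled (the names here are illustrative, not our actual code), each metric can carry its own lifecycle status, so a failed request only affects its own panel:</p>

```javascript
// Illustrative reducer: each metric key tracks its own loading/success/error
// state, so one failing request only marks its own panel as errored.
function metricsReducer(state, action) {
  switch (action.type) {
    case "fetch_start":
      return { ...state, [action.metric]: { status: "loading" } };
    case "fetch_success":
      return { ...state, [action.metric]: { status: "success", data: action.data } };
    case "fetch_error":
      return { ...state, [action.metric]: { status: "error", error: action.error } };
    default:
      return state;
  }
}
```

<p>Plugged into something like React’s <code class="language-plaintext highlighter-rouge">useReducer</code>, each child component would dispatch actions for its own metric and decide between a loader, a chart, or a retry button based solely on its own entry in the state.</p>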
<h1 id="implementation">Implementation</h1>
<p>Now let’s move on to the implementation. We are not going to dive into much detail or provide the full code, as that is not the goal of this article; instead, we will highlight the most interesting parts of the current implementation and briefly explain the changes we made to achieve the final result.</p>
<blockquote>
<p>Names of classes, methods and components may have changed. Many parts of code have been omitted for reasons of simplicity.</p>
</blockquote>
<h4 id="create-the-api">Create the API</h4>
<p>First of all, we have to provide the API that the React components are going to use in order to fetch their data. Until now, we had an endpoint that, when called, executed a heavy query against the database in order to return a hash containing all the required data.</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># GET /path/to/stats_data</span>
<span class="k">def</span> <span class="nf">stats_data</span>
<span class="n">render</span> <span class="ss">json: </span><span class="no">DataClass</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span>
<span class="ss">from: </span><span class="n">params</span><span class="p">[</span><span class="ss">:from</span><span class="p">],</span>
<span class="ss">to: </span><span class="n">params</span><span class="p">[</span><span class="ss">:to</span><span class="p">]</span>
<span class="p">).</span><span class="nf">stats</span>
<span class="k">end</span></code></pre></figure>
<p>After adding the appropriate methods to <code class="language-plaintext highlighter-rouge">DataClass</code> in order to return the respective portion of data, we make the <code class="language-plaintext highlighter-rouge">stats_data</code> action accept the <code class="language-plaintext highlighter-rouge">metric</code> param so that it can call the respective <code class="language-plaintext highlighter-rouge">DataClass</code> method.</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># GET /path/to/stats_data</span>
<span class="k">def</span> <span class="nf">stats_data</span>
<span class="n">data_summary</span> <span class="o">=</span> <span class="no">DataClass</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span>
<span class="ss">from: </span><span class="n">params</span><span class="p">[</span><span class="ss">:from</span><span class="p">],</span>
<span class="ss">to: </span><span class="n">params</span><span class="p">[</span><span class="ss">:to</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">data</span><span class="p">[</span><span class="n">params</span><span class="p">[</span><span class="ss">:metric</span><span class="p">]]</span> <span class="o">=</span> <span class="n">data_summary</span><span class="p">.</span><span class="nf">public_send</span><span class="p">(</span><span class="n">params</span><span class="p">[</span><span class="ss">:metric</span><span class="p">])</span>
<span class="n">render</span> <span class="ss">json: </span><span class="n">data</span>
<span class="k">rescue</span> <span class="o">=></span> <span class="n">e</span>
<span class="n">respond_error</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="ss">:unprocessable_entity</span><span class="p">)</span>
<span class="k">end</span></code></pre></figure>
<p>Now, each component will be able to call <code class="language-plaintext highlighter-rouge">stats_data</code>, providing its own <code class="language-plaintext highlighter-rouge">metric</code> param to get the desired data.</p>
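<p>On the client side, a minimal helper for calling this endpoint could look like the following sketch (the function names are hypothetical; the actual call sites in our app differ):</p>

```javascript
// Hypothetical helper: append the metric param to the stats URI,
// reusing an existing query string when present.
function buildMetricUrl(searchUri, metric) {
  const separator = searchUri.includes("?") ? "&" : "?";
  return `${searchUri}${separator}metric=${encodeURIComponent(metric)}`;
}

// Fetch a single metric's data; rejecting on a non-2xx response lets
// the caller render a per-panel error state instead of a blank page.
function getStatsMetric(searchUri, metric) {
  return fetch(buildMetricUrl(searchUri, metric)).then((response) => {
    if (!response.ok) throw new Error(`Request failed: ${response.status}`);
    return response.json();
  });
}
```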
<h4 id="how-it-used-to-work-initially">How it used to work initially</h4>
<p>Let’s take a look at the initial state. There are two wrapper components, <code class="language-plaintext highlighter-rouge">Stats</code> and <code class="language-plaintext highlighter-rouge">StatsMetrics</code> as we saw in image 2. <code class="language-plaintext highlighter-rouge">Stats</code> component fetches the data and passes them to <code class="language-plaintext highlighter-rouge">StatsMetrics</code>, as we can see in the following snippet.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">Stats</span><span class="p">({</span> <span class="nx">options</span> <span class="p">})</span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">searchUri</span><span class="p">,</span> <span class="nx">setSearchUri</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="kc">null</span><span class="p">);</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">statsData</span><span class="p">,</span> <span class="nx">setStatsData</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="kc">null</span><span class="p">);</span>
<span class="p">...</span>
<span class="nf">useEffect</span><span class="p">(()</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="o">!</span><span class="nx">shouldFetchStats</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
<span class="nf">getStats</span><span class="p">(</span><span class="nx">searchUri</span><span class="p">)</span>
<span class="p">.</span><span class="nf">then</span><span class="p">((</span><span class="nx">data</span><span class="p">)</span> <span class="o">=></span> <span class="nf">setStatsData</span><span class="p">(</span><span class="nx">data</span><span class="p">))</span>
<span class="p">.</span><span class="k">catch</span><span class="p">((</span><span class="nx">error</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span> <span class="p">...</span> <span class="p">})</span>
<span class="p">},</span> <span class="p">[</span><span class="nx">shouldFetchStats</span><span class="p">,</span> <span class="nx">searchUri</span><span class="p">]);</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><></span>
...
<span class="p"><</span><span class="nt">div</span><span class="p">></span><span class="si">{</span><span class="nx">statsData</span> <span class="o">&&</span> <span class="p"><</span><span class="nc">StatsMetrics</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">statsData</span><span class="si">}</span> <span class="p">/></span><span class="si">}</span><span class="p"></</span><span class="nt">div</span><span class="p">></span>
<span class="p"></></span>
<span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">StatsMetrics</code>, in turn, gets the data from its parent and renders the child components, passing the respective data to each one of them. You can see a comment after each component that indicates the respective rectangle from image 2 (and 3).</p>
<p>As we explained earlier, some components require the same data as other components do, like component A, which requires <code class="language-plaintext highlighter-rouge">data.order.all</code>, just like component B does. The same goes for components C and E which require the <code class="language-plaintext highlighter-rouge">data.order.billed</code> part.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">StatsMetrics</span><span class="p">({</span> <span class="nx">data</span> <span class="p">})</span> <span class="p">{</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><></span>
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">orders</span><span class="p">.</span><span class="nx">all</span><span class="si">}</span> <span class="p">/></span> /* A */
<span class="p"><</span><span class="nc">StatsOrderCountLine</span> <span class="na">order</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">orders</span><span class="p">.</span><span class="nx">all</span><span class="si">}</span> <span class="p">/></span> /* B */
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">orders</span><span class="p">.</span><span class="nx">billed</span><span class="si">}</span> <span class="p">/></span> /* C */
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">orders</span><span class="p">.</span><span class="nx">pending_billing</span><span class="si">}</span> <span class="p">/></span> /* D */
<span class="p"><</span><span class="nc">StatsAverageGroup</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">orders</span><span class="p">.</span><span class="nx">billed</span><span class="si">}</span> <span class="p">/></span> /* E */
<span class="p"><</span><span class="nc">StatsRatiosGroup</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">ratios</span><span class="si">}</span> <span class="p">/></span> /* F */
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">orders</span><span class="p">.</span><span class="nx">cancelled</span><span class="si">}</span> <span class="p">/></span> /* G */
<span class="p"><</span><span class="nc">CancellationGroup</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">cancellation_per_reason</span><span class="si">}</span> <span class="p">/></span> /* H */
<span class="p"></></span>
<span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>Taking a look at one of the child components, say <code class="language-plaintext highlighter-rouge">StatsOrderGroup</code>, we can see that it takes the <code class="language-plaintext highlighter-rouge">data</code> prop and displays parts of the data object via helper components.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">StatsOrderGroup</span><span class="p">({</span> <span class="nx">data</span> <span class="p">})</span> <span class="p">{</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><></span>
<span class="p"><</span><span class="nc">StatsQuantityMetric</span> <span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">count</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsCurrencyMetric</span> <span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">revenue</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsCurrencyMetric</span> <span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">data</span><span class="p">.</span><span class="nx">commission</span><span class="si">}</span> <span class="p">/></span>
<span class="p"></></span>
<span class="p">)</span>
<span class="p">}</span></code></pre></figure>
<h4 id="move-responsibility-of-data-fetching-to-children-components">Move responsibility of data fetching to children components</h4>
<p>As we explained in the previous section, the plan is to assign the responsibility of data fetching to each one of the child components. So, the first step is to remove the <code class="language-plaintext highlighter-rouge">useEffect</code> hook from the <code class="language-plaintext highlighter-rouge">Stats</code> component and pass the <code class="language-plaintext highlighter-rouge">searchUri</code> prop to the <code class="language-plaintext highlighter-rouge">StatsMetrics</code> component, instead of <code class="language-plaintext highlighter-rouge">statsData</code>.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">Stats</span><span class="p">({</span> <span class="nx">options</span> <span class="p">})</span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">searchUri</span><span class="p">,</span> <span class="nx">setSearchUri</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="kc">null</span><span class="p">);</span>
<span class="c1">// const [statsData, setStatsData] = useState(null);</span>
<span class="p">...</span>
<span class="c1">// useEffect(() => {</span>
<span class="c1">// if (!shouldFetchStats) return;</span>
<span class="c1">// getStats(searchUri)</span>
<span class="c1">// .then((data) => setStatsData(data))</span>
<span class="c1">// .catch((error) => { ... })</span>
<span class="c1">// }, [shouldFetchStats, searchUri]);</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><></span>
...
<span class="si">{</span><span class="cm">/* <div>{statsData && <StatsMetrics data={statsData} />}</div> */</span><span class="si">}</span>
<span class="p"><</span><span class="nt">div</span><span class="p">></span>
<span class="si">{</span><span class="nx">searchUri</span> <span class="o">!==</span> <span class="kc">null</span> <span class="o">&&</span> <span class="p"><</span><span class="nc">StatsMetrics</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span><span class="si">}</span>
<span class="p"></</span><span class="nt">div</span><span class="p">></span>
<span class="p"></></span>
<span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>Then, we change the <code class="language-plaintext highlighter-rouge">StatsMetrics</code> component to receive the <code class="language-plaintext highlighter-rouge">searchUri</code> prop and pass it down to the child components.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">StatsMetrics</span><span class="p">({</span> <span class="nx">searchUri</span> <span class="p">})</span> <span class="p">{</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><></span>
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsOrderCountLine</span> <span class="na">order</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsAverageGroup</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsRatiosGroup</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nc">CancellationGroup</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span>
<span class="p"></></span>
<span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<h4 id="we-need-a-loading-state-and-error-reporting">We need a loading state and error reporting</h4>
<p>Until now, a child component was rendered if and only if its data was available: the parent passed the data in, so the child was immediately ready to render its markup.</p>
<p>Now the situation is different. Each child component is rendered immediately on page load and waits until its data arrives from the backend before it can display anything to the user.</p>
<p>So, we need a <strong>loading state</strong> for each component and an <strong>error reporting state</strong> in case of an error response from the API call. For this reason, we introduced some wrapper components that are responsible for the following:</p>
<ul>
<li>Fetch data from the backend</li>
<li>Render a loading state while waiting for data to be available</li>
<li>Display an error message in case of error, and a retry button (for on-demand data fetching)</li>
</ul>
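<p>The decision each wrapper makes can be summed up in a tiny, framework-free helper. This is a sketch for illustration only (the function name <code class="language-plaintext highlighter-rouge">selectView</code> is hypothetical); the real wrappers render React components instead of returning strings.</p>

```javascript
// Decide which UI a wrapper should render, given its state:
// an error (with a retry button) wins over everything, then the
// loading state, and finally the fetched data itself.
function selectView({ isLoading, showError }) {
  if (showError) return 'error';   // error message + retry button
  if (isLoading) return 'loading'; // loading placeholder
  return 'data';                   // the actual metrics
}

console.log(selectView({ isLoading: true,  showError: false })); // 'loading'
console.log(selectView({ isLoading: false, showError: true  })); // 'error'
console.log(selectView({ isLoading: false, showError: false })); // 'data'
```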
<h4 id="introducing-children-components-wrappers">Introducing children components wrappers</h4>
<p>We are going to need a wrapper for each component that needs to fetch its data, so we create three wrappers:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">StatsOrderGroupWrapper</code> for <code class="language-plaintext highlighter-rouge">StatsOrderGroup</code></li>
<li><code class="language-plaintext highlighter-rouge">StatsRatiosGroupWrapper</code> for <code class="language-plaintext highlighter-rouge">StatsRatiosGroup</code></li>
<li><code class="language-plaintext highlighter-rouge">CancellationGroupWrapper</code> for <code class="language-plaintext highlighter-rouge">CancellationGroup</code></li>
</ul>
<p>We did not create wrappers for <code class="language-plaintext highlighter-rouge">StatsOrderCountLine</code> and <code class="language-plaintext highlighter-rouge">StatsAverageGroup</code>, because these components get their data indirectly, from other components (pairs A - B and C - E).</p>
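<p>Conceptually, this indirect data sharing works like a tiny publish/subscribe store: the component that fetches publishes its result, and the sibling that needs the same data subscribes to it. The snippet below is a simplified, framework-free sketch of that idea (<code class="language-plaintext highlighter-rouge">createStatsStore</code> is hypothetical); the actual implementation uses React context and a reducer.</p>

```javascript
// Minimal stand-in for the shared context: component C publishes the
// 'billed' stats it fetched, and component E just reads them.
function createStatsStore() {
  const state = { all: null, billed: null };
  const listeners = [];
  return {
    dispatch({ type, payload }) {
      if (type === 'SET_ALL') state.all = payload;
      if (type === 'SET_BILLED') state.billed = payload;
      listeners.forEach((fn) => fn(state)); // notify subscribers
    },
    subscribe(fn) { listeners.push(fn); },
    getState() { return state; }
  };
}

const store = createStatsStore();
let seenByE = null;
store.subscribe((s) => { seenByE = s.billed; }); // E waits for C's data
store.dispatch({ type: 'SET_BILLED', payload: { count: 95 } }); // C fetched
// seenByE is now { count: 95 } without E fetching anything itself
```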
<p>The following snippet shows the <code class="language-plaintext highlighter-rouge">StatsOrderGroupWrapper</code> component in its final form. The wrapper fetches the data it needs, renders a loading state while waiting, displays the data as soon as it becomes available, or shows an error message if the fetch fails.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">import</span> <span class="p">{</span> <span class="nx">getStats</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">path/to/api</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="p">{</span> <span class="nx">useGetCpsOrderStats</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">path/to/useGetCpsOrderStats</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="p">{</span> <span class="nx">useCpsOrdersStats</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">path/to/cpsOrdersStatsContext</span><span class="dl">'</span><span class="p">;</span>
<span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">StatsOrderGroupWrapper</span><span class="p">({</span> <span class="nx">searchUri</span><span class="p">,</span> <span class="nx">metric</span><span class="p">,</span> <span class="nx">updateRequestsState</span> <span class="p">})</span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">initialStats</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">count</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="na">revenue</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="na">commission</span><span class="p">:</span> <span class="mi">0</span>
<span class="p">};</span>
<span class="kd">const</span> <span class="p">{</span> <span class="nx">dispatch</span> <span class="p">}</span> <span class="o">=</span> <span class="nf">useCpsOrdersStats</span><span class="p">();</span>
<span class="kd">const</span> <span class="p">{</span> <span class="nx">stats</span><span class="p">,</span> <span class="nx">isLoading</span><span class="p">,</span> <span class="nx">showError</span><span class="p">,</span> <span class="nx">getData</span> <span class="p">}</span> <span class="o">=</span> <span class="nf">useGetCpsOrderStats</span><span class="p">({</span>
<span class="nx">getStats</span><span class="p">,</span>
<span class="nx">searchUri</span><span class="p">,</span>
<span class="nx">metric</span><span class="p">,</span>
<span class="nx">initialStats</span><span class="p">,</span>
<span class="nx">dispatch</span>
<span class="p">});</span>
<span class="nf">useEffect</span><span class="p">(()</span> <span class="o">=></span> <span class="p">{</span>
<span class="nf">updateRequestsState</span><span class="p">(</span><span class="nx">metric</span><span class="p">,</span> <span class="nx">isLoading</span><span class="p">);</span>
<span class="p">},</span> <span class="p">[</span><span class="nx">updateRequestsState</span><span class="p">,</span> <span class="nx">metric</span><span class="p">,</span> <span class="nx">isLoading</span><span class="p">]);</span>
<span class="k">return</span> <span class="nx">showError</span> <span class="p">?</span> <span class="p">(</span>
<span class="p"><</span><span class="nc">StatsErrorSection</span> <span class="na">errorMessage</span><span class="p">=</span><span class="si">{</span><span class="nx">stats</span><span class="p">.</span><span class="nx">error</span><span class="si">}</span> <span class="na">retryFunc</span><span class="p">=</span><span class="si">{</span><span class="nx">getData</span><span class="si">}</span> <span class="p">/></span>
<span class="p">)</span> <span class="p">:</span> <span class="p">(</span>
<span class="p"><</span><span class="nc">StatsOrderGroup</span> <span class="na">isLoading</span><span class="p">=</span><span class="si">{</span><span class="nx">isLoading</span><span class="si">}</span> <span class="na">data</span><span class="p">=</span><span class="si">{</span><span class="nx">stats</span><span class="si">}</span> <span class="p">/></span>
<span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>As you can see, all the functionality is delegated to two custom hooks, <code class="language-plaintext highlighter-rouge">useGetCpsOrderStats</code> and <code class="language-plaintext highlighter-rouge">useCpsOrdersStats</code>. The other two wrappers use the same hooks and are similarly organized.</p>
<p>Let’s see what’s going on inside <code class="language-plaintext highlighter-rouge">cpsOrdersStatsContext</code>, which exposes the <code class="language-plaintext highlighter-rouge">useCpsOrdersStats</code> hook along with two other context components that we are going to see in action later:</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="kd">const</span> <span class="nx">initialStats</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">all</span><span class="p">:</span> <span class="kc">null</span><span class="p">,</span>
<span class="na">billed</span><span class="p">:</span> <span class="kc">null</span>
<span class="p">};</span>
<span class="kd">const</span> <span class="nx">CpsOrdersStatsContext</span> <span class="o">=</span> <span class="nx">React</span><span class="p">.</span><span class="nf">createContext</span><span class="p">(</span><span class="nx">initialStats</span><span class="p">);</span>
<span class="kd">function</span> <span class="nf">cpsOrdersStatsReducer</span><span class="p">(</span><span class="nx">state</span><span class="p">,</span> <span class="p">{</span> <span class="nx">type</span><span class="p">,</span> <span class="nx">payload</span> <span class="p">})</span> <span class="p">{</span>
<span class="k">switch </span><span class="p">(</span><span class="nx">type</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="dl">'</span><span class="s1">SET_ALL</span><span class="dl">'</span><span class="p">:</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">{</span>
<span class="p">...</span><span class="nx">state</span><span class="p">,</span>
<span class="na">all</span><span class="p">:</span> <span class="nx">payload</span>
<span class="p">};</span>
<span class="p">}</span>
<span class="k">case</span> <span class="dl">'</span><span class="s1">SET_BILLED</span><span class="dl">'</span><span class="p">:</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">{</span>
<span class="p">...</span><span class="nx">state</span><span class="p">,</span>
<span class="na">billed</span><span class="p">:</span> <span class="nx">payload</span>
<span class="p">};</span>
<span class="p">}</span>
<span class="k">case</span> <span class="dl">'</span><span class="s1">RESET_ALL</span><span class="dl">'</span><span class="p">:</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">{</span>
<span class="p">...</span><span class="nx">state</span><span class="p">,</span>
<span class="na">all</span><span class="p">:</span> <span class="kc">null</span>
<span class="p">};</span>
<span class="p">}</span>
<span class="k">case</span> <span class="dl">'</span><span class="s1">RESET_BILLED</span><span class="dl">'</span><span class="p">:</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">{</span>
<span class="p">...</span><span class="nx">state</span><span class="p">,</span>
<span class="na">billed</span><span class="p">:</span> <span class="kc">null</span>
<span class="p">};</span>
<span class="p">}</span>
<span class="nl">default</span><span class="p">:</span> <span class="p">{</span>
<span class="k">return</span> <span class="nx">state</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nf">CpsOrdersStatsProvider</span><span class="p">({</span> <span class="nx">children</span> <span class="p">})</span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">stats</span><span class="p">,</span> <span class="nx">dispatch</span><span class="p">]</span> <span class="o">=</span> <span class="nx">React</span><span class="p">.</span><span class="nf">useReducer</span><span class="p">(</span><span class="nx">cpsOrdersStatsReducer</span><span class="p">,</span> <span class="nx">initialStats</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">value</span> <span class="o">=</span> <span class="nx">React</span><span class="p">.</span><span class="nf">useMemo</span><span class="p">(</span>
<span class="p">()</span> <span class="o">=></span> <span class="p">({</span>
<span class="nx">stats</span><span class="p">,</span>
<span class="nx">dispatch</span>
<span class="p">}),</span>
<span class="p">[</span><span class="nx">stats</span><span class="p">]</span>
<span class="p">);</span>
<span class="k">return</span> <span class="p"><</span><span class="nc">CpsOrdersStatsContext</span><span class="p">.</span><span class="nc">Provider</span> <span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">value</span><span class="si">}</span><span class="p">></span><span class="si">{</span><span class="nx">children</span><span class="si">}</span><span class="p"></</span><span class="nc">CpsOrdersStatsContext</span><span class="p">.</span><span class="nc">Provider</span><span class="p">>;</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nf">CpsOrdersStatsConsumer</span><span class="p">({</span> <span class="nx">children</span> <span class="p">})</span> <span class="p">{</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><</span><span class="nc">CpsOrdersStatsContext</span><span class="p">.</span><span class="nc">Consumer</span><span class="p">></span>
<span class="si">{</span><span class="p">(</span><span class="nx">context</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="nx">context</span> <span class="o">===</span> <span class="kc">undefined</span><span class="p">)</span> <span class="p">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nc">Error</span><span class="p">(</span>
<span class="dl">'</span><span class="s1">CpsOrdersStatsConsumer should be used inside a CpsOrdersStatsProvider</span><span class="dl">'</span>
<span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nf">children</span><span class="p">(</span><span class="nx">context</span><span class="p">);</span>
<span class="p">}</span><span class="si">}</span>
<span class="p"></</span><span class="nc">CpsOrdersStatsContext</span><span class="p">.</span><span class="nc">Consumer</span><span class="p">></span>
<span class="p">);</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nf">useCpsOrdersStats</span><span class="p">()</span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">context</span> <span class="o">=</span> <span class="nx">React</span><span class="p">.</span><span class="nf">useContext</span><span class="p">(</span><span class="nx">CpsOrdersStatsContext</span><span class="p">);</span>
<span class="k">if </span><span class="p">(</span><span class="nx">context</span> <span class="o">===</span> <span class="kc">undefined</span><span class="p">)</span> <span class="p">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nc">Error</span><span class="p">(</span><span class="dl">'</span><span class="s1">useCpsOrdersStats should be used inside a CpsOrdersStatsProvider</span><span class="dl">'</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nx">context</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">export</span> <span class="p">{</span> <span class="nx">CpsOrdersStatsProvider</span><span class="p">,</span> <span class="nx">CpsOrdersStatsConsumer</span><span class="p">,</span> <span class="nx">useCpsOrdersStats</span> <span class="p">};</span></code></pre></figure>
<p>First of all we create a context object and store it in the <code class="language-plaintext highlighter-rouge">CpsOrdersStatsContext</code> constant.</p>
<p>After that, we declare the <code class="language-plaintext highlighter-rouge">cpsOrdersStatsReducer</code> function that we are going to use as the reducer inside the <code class="language-plaintext highlighter-rouge">CpsOrdersStatsProvider</code> component, which we create right after.</p>
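<p>To see what the reducer does outside of React, here is the same switch logic exercised directly in plain JavaScript (illustration only; the payload value is made up):</p>

```javascript
// Same transitions as cpsOrdersStatsReducer above: SET_* stores a
// payload, RESET_* clears it, anything else leaves the state as is.
function cpsOrdersStatsReducer(state, { type, payload }) {
  switch (type) {
    case 'SET_ALL':      return { ...state, all: payload };
    case 'SET_BILLED':   return { ...state, billed: payload };
    case 'RESET_ALL':    return { ...state, all: null };
    case 'RESET_BILLED': return { ...state, billed: null };
    default:             return state;
  }
}

const initial = { all: null, billed: null };
const afterSet = cpsOrdersStatsReducer(initial, {
  type: 'SET_ALL',
  payload: { count: 120 } // hypothetical stats payload
});
// afterSet.all is { count: 120 }, afterSet.billed is still null
const afterReset = cpsOrdersStatsReducer(afterSet, { type: 'RESET_ALL' });
// afterReset.all is back to null
```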
<p><code class="language-plaintext highlighter-rouge">CpsOrdersStatsProvider</code> provides a value to its children, notifying them about changes and exposing the <code class="language-plaintext highlighter-rouge">dispatch</code> function so that they can update the context state.</p>
<p>But in order for <code class="language-plaintext highlighter-rouge">CpsOrdersStatsProvider</code>’s children to be informed about changes in our context (the stats data), they need to be wrapped inside a context consumer component. For this reason, we create the <code class="language-plaintext highlighter-rouge">CpsOrdersStatsConsumer</code> component, which does exactly that.</p>
<p>Finally, we create the <code class="language-plaintext highlighter-rouge">useCpsOrdersStats</code> custom hook, to be used by our wrappers (like <code class="language-plaintext highlighter-rouge">StatsOrderGroupWrapper</code>) in order to access the context and, in particular, the <code class="language-plaintext highlighter-rouge">dispatch</code> function. <code class="language-plaintext highlighter-rouge">StatsOrderGroupWrapper</code> calls <code class="language-plaintext highlighter-rouge">dispatch</code> (through the <code class="language-plaintext highlighter-rouge">useGetCpsOrderStats</code> hook) every time it needs to inform its sibling components that it has the data they need.</p>
<p>Now let’s take a look at the contents of <code class="language-plaintext highlighter-rouge">useGetCpsOrderStats</code>:</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">import</span> <span class="p">{</span> <span class="nx">useState</span><span class="p">,</span> <span class="nx">useCallback</span><span class="p">,</span> <span class="nx">useEffect</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">react</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="nx">camelCase</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">lodash/camelCase</span><span class="dl">'</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">useGetCpsOrderStats</span> <span class="o">=</span> <span class="p">({</span> <span class="nx">getStats</span><span class="p">,</span> <span class="nx">searchUri</span><span class="p">,</span> <span class="nx">metric</span><span class="p">,</span> <span class="nx">initialStats</span><span class="p">,</span> <span class="nx">dispatch</span> <span class="o">=</span> <span class="kc">null</span> <span class="p">})</span> <span class="o">=></span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">stats</span><span class="p">,</span> <span class="nx">setStats</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="nx">initialStats</span><span class="p">);</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">isLoading</span><span class="p">,</span> <span class="nx">setIsLoading</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="kc">false</span><span class="p">);</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">count</span><span class="p">,</span> <span class="nx">setCount</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">dispatchCallback</span> <span class="o">=</span> <span class="nf">useCallback</span><span class="p">(</span>
<span class="p">(</span><span class="nx">payload</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="nx">dispatch</span> <span class="o">&&</span> <span class="p">(</span><span class="nx">metric</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">all</span><span class="dl">'</span> <span class="o">||</span> <span class="nx">metric</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">billed</span><span class="dl">'</span><span class="p">))</span> <span class="p">{</span>
<span class="nf">dispatch</span><span class="p">({</span> <span class="na">type</span><span class="p">:</span> <span class="s2">`SET_</span><span class="p">${</span><span class="nx">metric</span><span class="p">.</span><span class="nf">toUpperCase</span><span class="p">()}</span><span class="s2">`</span><span class="p">,</span> <span class="nx">payload</span> <span class="p">});</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="p">[</span><span class="nx">dispatch</span><span class="p">,</span> <span class="nx">metric</span><span class="p">]</span>
<span class="p">);</span>
<span class="kd">const</span> <span class="nx">getCpsOrderStats</span> <span class="o">=</span> <span class="nf">useCallback</span><span class="p">(</span>
<span class="p">(</span><span class="nx">isMounted</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="nx">metric</span> <span class="o">&&</span> <span class="nx">searchUri</span> <span class="o">!==</span> <span class="kc">null</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="nx">dispatch</span><span class="p">)</span> <span class="p">{</span>
<span class="nf">dispatch</span><span class="p">({</span> <span class="na">type</span><span class="p">:</span> <span class="s2">`RESET_</span><span class="p">${</span><span class="nx">metric</span><span class="p">.</span><span class="nf">toUpperCase</span><span class="p">()}</span><span class="s2">`</span> <span class="p">});</span>
<span class="p">}</span>
<span class="nf">setIsLoading</span><span class="p">(</span><span class="kc">true</span><span class="p">);</span>
<span class="nf">getStats</span><span class="p">(</span><span class="nx">searchUri</span><span class="p">,</span> <span class="nx">metric</span><span class="p">)</span>
<span class="p">.</span><span class="nf">then</span><span class="p">((</span><span class="nx">data</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="nx">isMounted</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">thisData</span> <span class="o">=</span> <span class="nx">data</span><span class="p">[</span><span class="nf">camelCase</span><span class="p">(</span><span class="nx">metric</span><span class="p">)];</span>
<span class="nf">setStats</span><span class="p">(</span><span class="nx">thisData</span><span class="p">);</span>
<span class="nf">dispatchCallback</span><span class="p">(</span><span class="nx">thisData</span><span class="p">);</span>
<span class="nf">setIsLoading</span><span class="p">(</span><span class="kc">false</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">})</span>
<span class="p">.</span><span class="k">catch</span><span class="p">((</span><span class="nx">e</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="nx">isMounted</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">errorMessage</span> <span class="o">=</span> <span class="p">(</span><span class="nx">e</span><span class="p">.</span><span class="nx">response</span> <span class="o">||</span> <span class="p">{}).</span><span class="nx">statusText</span> <span class="o">||</span> <span class="dl">'</span><span class="s1">An error occurred</span><span class="dl">'</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">errorData</span> <span class="o">=</span> <span class="p">{</span> <span class="na">error</span><span class="p">:</span> <span class="nx">errorMessage</span> <span class="p">};</span>
<span class="nf">setIsLoading</span><span class="p">(</span><span class="kc">false</span><span class="p">);</span>
<span class="nf">setStats</span><span class="p">(</span><span class="nx">errorData</span><span class="p">);</span>
<span class="nf">dispatchCallback</span><span class="p">(</span><span class="nx">errorData</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">});</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="p">[</span><span class="nx">getStats</span><span class="p">,</span> <span class="nx">dispatch</span><span class="p">,</span> <span class="nx">metric</span><span class="p">,</span> <span class="nx">searchUri</span><span class="p">,</span> <span class="nx">dispatchCallback</span><span class="p">]</span>
<span class="p">);</span>
<span class="kd">const</span> <span class="nx">getData</span> <span class="o">=</span> <span class="p">()</span> <span class="o">=></span> <span class="p">{</span>
<span class="nf">setCount</span><span class="p">(</span><span class="nx">count</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">};</span>
<span class="nf">useEffect</span><span class="p">(()</span> <span class="o">=></span> <span class="p">{</span>
<span class="kd">let</span> <span class="nx">mounted</span> <span class="o">=</span> <span class="kc">true</span><span class="p">;</span>
<span class="nf">getCpsOrderStats</span><span class="p">(</span><span class="nx">mounted</span><span class="p">);</span>
<span class="k">return </span><span class="p">()</span> <span class="o">=></span> <span class="p">{</span>
<span class="nx">mounted</span> <span class="o">=</span> <span class="kc">false</span><span class="p">;</span>
<span class="p">};</span>
<span class="p">},</span> <span class="p">[</span><span class="nx">getCpsOrderStats</span><span class="p">,</span> <span class="nx">count</span><span class="p">,</span> <span class="nx">searchUri</span><span class="p">]);</span>
<span class="kd">const</span> <span class="nx">showError</span> <span class="o">=</span> <span class="o">!</span><span class="nx">isLoading</span> <span class="o">&&</span> <span class="p">(</span><span class="o">!</span><span class="nx">stats</span> <span class="o">||</span> <span class="nx">stats</span><span class="p">.</span><span class="nx">error</span> <span class="o">!==</span> <span class="kc">undefined</span><span class="p">);</span>
<span class="k">return</span> <span class="p">{</span> <span class="nx">stats</span><span class="p">,</span> <span class="nx">isLoading</span><span class="p">,</span> <span class="nx">showError</span><span class="p">,</span> <span class="nx">getData</span> <span class="p">};</span>
<span class="p">};</span>
<span class="k">export</span> <span class="p">{</span> <span class="nx">useGetCpsOrderStats</span> <span class="p">};</span></code></pre></figure>
<p>This custom hook encapsulates the logic of fetching the desired data (<code class="language-plaintext highlighter-rouge">metric</code>), updates the stats context by calling the <code class="language-plaintext highlighter-rouge">dispatch</code> function for the specified metric, and derives the loading state as well as whether an error message should be shown.</p>
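<p>For example, for the metric <code class="language-plaintext highlighter-rouge">billed</code>, the reset action dispatched by the hook boils down to:</p>

```javascript
// The action type is derived from the metric name, exactly as in the hook above.
const metric = 'billed';
const resetAction = { type: `RESET_${metric.toUpperCase()}` };
// resetAction.type === 'RESET_BILLED'
```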
<h4 id="prevent-new-requests-until-all-components-have-finished-loading">Prevent new requests until all components have finished loading</h4>
<p>There are two buttons under the stats page filters: <code class="language-plaintext highlighter-rouge">Search</code> and <code class="language-plaintext highlighter-rouge">Clear</code>. Every time we click either of them, all components should request their data again. We need to prevent the user from clicking either button until all components have finished loading their data. Otherwise, we might end up with multiple asynchronous requests racing each other, with no guarantee about which one finishes first.</p>
<p>For this reason we introduce the <code class="language-plaintext highlighter-rouge">LoadingInspectionContext</code>, which is responsible for keeping the loading state of the whole page; in other words, it checks whether at least one component on the page is still waiting for its request to finish.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="kd">const</span> <span class="nx">LoadingInspectionContext</span> <span class="o">=</span> <span class="nx">React</span><span class="p">.</span><span class="nf">createContext</span><span class="p">({});</span>
<span class="kd">function</span> <span class="nf">atLeastOneIsPending</span><span class="p">(</span><span class="nx">collection</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="nb">Object</span><span class="p">.</span><span class="nf">entries</span><span class="p">(</span><span class="nx">collection</span><span class="p">).</span><span class="nf">some</span><span class="p">((</span><span class="nx">x</span><span class="p">)</span> <span class="o">=></span> <span class="nx">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">===</span> <span class="kc">true</span><span class="p">);</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nf">LoadingInspectionProvider</span><span class="p">({</span> <span class="nx">children</span> <span class="p">})</span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">requests</span><span class="p">,</span> <span class="nx">setRequests</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">({});</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">isLoading</span><span class="p">,</span> <span class="nx">setIsLoading</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="kc">false</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">updateRequest</span> <span class="o">=</span> <span class="nf">useCallback</span><span class="p">((</span><span class="nx">metric</span><span class="p">,</span> <span class="nx">metricIsLoading</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="nf">setRequests</span><span class="p">((</span><span class="nx">r</span><span class="p">)</span> <span class="o">=></span> <span class="p">({</span> <span class="p">...</span><span class="nx">r</span><span class="p">,</span> <span class="p">[</span><span class="nx">metric</span><span class="p">]:</span> <span class="nx">metricIsLoading</span> <span class="p">}));</span>
<span class="p">},</span> <span class="p">[]);</span>
<span class="kd">const</span> <span class="nx">value</span> <span class="o">=</span> <span class="nx">React</span><span class="p">.</span><span class="nf">useMemo</span><span class="p">(</span>
<span class="p">()</span> <span class="o">=></span> <span class="p">({</span>
<span class="nx">isLoading</span><span class="p">,</span>
<span class="nx">requests</span><span class="p">,</span>
<span class="na">updateRequestsState</span><span class="p">:</span> <span class="nx">updateRequest</span>
<span class="p">}),</span>
<span class="p">[</span><span class="nx">updateRequest</span><span class="p">,</span> <span class="nx">requests</span><span class="p">,</span> <span class="nx">isLoading</span><span class="p">]</span>
<span class="p">);</span>
<span class="nf">useEffect</span><span class="p">(()</span> <span class="o">=></span> <span class="p">{</span>
<span class="nf">setIsLoading</span><span class="p">(</span><span class="nf">atLeastOneIsPending</span><span class="p">(</span><span class="nx">requests</span><span class="p">));</span>
<span class="p">},</span> <span class="p">[</span><span class="nx">requests</span><span class="p">]);</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><</span><span class="nc">LoadingInspectionContext</span><span class="p">.</span><span class="nc">Provider</span> <span class="na">value</span><span class="p">=</span><span class="si">{</span><span class="nx">value</span><span class="si">}</span><span class="p">></span><span class="si">{</span><span class="nx">children</span><span class="si">}</span><span class="p"></</span><span class="nc">LoadingInspectionContext</span><span class="p">.</span><span class="nc">Provider</span><span class="p">></span>
<span class="p">);</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nf">LoadingInspectionConsumer</span><span class="p">({</span> <span class="nx">children</span> <span class="p">})</span> <span class="p">{</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><</span><span class="nc">LoadingInspectionContext</span><span class="p">.</span><span class="nc">Consumer</span><span class="p">></span>
<span class="si">{</span><span class="p">(</span><span class="nx">context</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="k">if </span><span class="p">(</span><span class="nx">context</span> <span class="o">===</span> <span class="kc">undefined</span><span class="p">)</span> <span class="p">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nc">Error</span><span class="p">(</span>
<span class="dl">'</span><span class="s1">LoadingInspectionConsumer should be used inside a LoadingInspectionProvider</span><span class="dl">'</span>
<span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nf">children</span><span class="p">(</span><span class="nx">context</span><span class="p">);</span>
<span class="p">}</span><span class="si">}</span>
<span class="p"></</span><span class="nc">LoadingInspectionContext</span><span class="p">.</span><span class="nc">Consumer</span><span class="p">></span>
<span class="p">);</span>
<span class="p">}</span>
<span class="k">export</span> <span class="p">{</span> <span class="nx">LoadingInspectionProvider</span><span class="p">,</span> <span class="nx">LoadingInspectionConsumer</span> <span class="p">};</span></code></pre></figure>
<p>As we can see, the above file exposes the <code class="language-plaintext highlighter-rouge">LoadingInspectionProvider</code> and <code class="language-plaintext highlighter-rouge">LoadingInspectionConsumer</code> components. We use them as follows: we wrap the filters and the main page (which contains the stats components) with a <code class="language-plaintext highlighter-rouge">LoadingInspectionProvider</code>. Then, we wrap each of the wrapper components (the components that fetch the data) with a <code class="language-plaintext highlighter-rouge">LoadingInspectionConsumer</code>. When a stats component’s data are available, we call the <code class="language-plaintext highlighter-rouge">updateRequestsState</code> function provided by the context object of the <code class="language-plaintext highlighter-rouge">LoadingInspectionConsumer</code> components, in order to update the page’s loading state.</p>
<h4 id="combining-them-all-together">Combining them all together</h4>
<p>Now, let’s see how all the above work together.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">Stats</span><span class="p">({</span> <span class="nx">options</span> <span class="p">})</span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">[</span><span class="nx">searchUri</span><span class="p">,</span> <span class="nx">setSearchUri</span><span class="p">]</span> <span class="o">=</span> <span class="nf">useState</span><span class="p">(</span><span class="kc">null</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">searchCallback</span> <span class="o">=</span> <span class="nf">useCallback</span><span class="p">(({</span> <span class="nx">queryString</span> <span class="p">})</span> <span class="o">=></span> <span class="p">{</span>
<span class="nf">setSearchUri</span><span class="p">(</span><span class="nx">queryString</span><span class="p">);</span>
<span class="p">},</span> <span class="p">[]);</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><</span><span class="nc">LoadingInspectionProvider</span><span class="p">></span>
<span class="p"><</span><span class="nc">StatsFilters</span> <span class="na">OnSearchCallback</span><span class="p">=</span><span class="si">{</span><span class="nx">searchCallback</span><span class="si">}</span> <span class="p">/></span>
<span class="p"><</span><span class="nt">div</span><span class="p">></span>
<span class="si">{</span><span class="nx">searchUri</span> <span class="o">!==</span> <span class="kc">null</span> <span class="o">&&</span> <span class="p"><</span><span class="nc">StatsMetrics</span> <span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span> <span class="p">/></span><span class="si">}</span>
<span class="p"></</span><span class="nt">div</span><span class="p">></span>
<span class="p"></</span><span class="nc">LoadingInspectionProvider</span><span class="p">></span>
<span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>As we said earlier, <code class="language-plaintext highlighter-rouge">LoadingInspectionProvider</code> wraps the filters and the <code class="language-plaintext highlighter-rouge">StatsMetrics</code> components. If we take a look inside the <code class="language-plaintext highlighter-rouge">StatsMetrics</code> component, we can see how the <code class="language-plaintext highlighter-rouge">LoadingInspectionConsumer</code> is used together with the <code class="language-plaintext highlighter-rouge">CpsOrdersStatsProvider</code> and <code class="language-plaintext highlighter-rouge">CpsOrdersStatsConsumer</code> components.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="k">import</span> <span class="p">{</span> <span class="nx">CpsOrdersStatsProvider</span><span class="p">,</span> <span class="nx">CpsOrdersStatsConsumer</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">path/to/cpsOrdersStatsContext</span><span class="dl">'</span><span class="p">;</span>
<span class="k">import</span> <span class="p">{</span> <span class="nx">LoadingInspectionConsumer</span> <span class="p">}</span> <span class="k">from</span> <span class="dl">'</span><span class="s1">path/to/loadingInspectionContext</span><span class="dl">'</span><span class="p">;</span>
<span class="k">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="nf">StatsMetrics</span><span class="p">({</span> <span class="nx">searchUri</span> <span class="p">})</span> <span class="p">{</span>
<span class="k">return </span><span class="p">(</span>
<span class="p"><</span><span class="nc">CpsOrdersStatsProvider</span><span class="p">></span>
/* A */
<span class="p"><</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">({</span> <span class="nx">updateRequestsState</span> <span class="p">})</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nc">StatsOrderGroupWrapper</span>
<span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span>
<span class="na">metric</span><span class="p">=</span><span class="s">"all"</span>
<span class="na">updateRequestsState</span><span class="p">=</span><span class="si">{</span><span class="nx">updateRequestsState</span><span class="si">}</span>
<span class="p">/></span>
<span class="p">)</span><span class="si">}</span>
<span class="p"></</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
/* B */
<span class="p"><</span><span class="nc">CpsOrdersStatsConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">(</span><span class="nx">context</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">{</span>
<span class="na">stats</span><span class="p">:</span> <span class="p">{</span> <span class="nx">all</span> <span class="p">}</span>
<span class="p">}</span> <span class="o">=</span> <span class="nx">context</span> <span class="o">||</span> <span class="p">{</span> <span class="na">stats</span><span class="p">:</span> <span class="p">{}</span> <span class="p">};</span>
<span class="k">return</span> <span class="p"><</span><span class="nc">StatsOrderCountLine</span> <span class="na">metric</span><span class="p">=</span><span class="si">{</span><span class="nx">all</span><span class="si">}</span> <span class="p">/>;</span>
<span class="p">}</span><span class="si">}</span>
<span class="p"></</span><span class="nc">CpsOrdersStatsConsumer</span><span class="p">></span>
/* C */
<span class="p"><</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">({</span> <span class="nx">updateRequestsState</span> <span class="p">})</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nc">StatsOrderGroupWrapper</span>
<span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span>
<span class="na">metric</span><span class="p">=</span><span class="s">"billed"</span>
<span class="na">updateRequestsState</span><span class="p">=</span><span class="si">{</span><span class="nx">updateRequestsState</span><span class="si">}</span>
<span class="p">/></span>
<span class="p">)</span><span class="si">}</span>
<span class="p"></</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
/* D */
<span class="p"><</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">({</span> <span class="nx">updateRequestsState</span> <span class="p">})</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nc">StatsOrderGroupWrapper</span>
<span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span>
<span class="na">metric</span><span class="p">=</span><span class="s">"pending_billing"</span>
<span class="na">updateRequestsState</span><span class="p">=</span><span class="si">{</span><span class="nx">updateRequestsState</span><span class="si">}</span>
<span class="p">/></span>
<span class="p">)</span><span class="si">}</span>
<span class="p"></</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
/* E */
<span class="p"><</span><span class="nc">CpsOrdersStatsConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">(</span><span class="nx">context</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="kd">const</span> <span class="p">{</span>
<span class="na">stats</span><span class="p">:</span> <span class="p">{</span> <span class="nx">billed</span> <span class="p">}</span>
<span class="p">}</span> <span class="o">=</span> <span class="nx">context</span> <span class="o">||</span> <span class="p">{</span> <span class="na">stats</span><span class="p">:</span> <span class="p">{}</span> <span class="p">};</span>
<span class="k">return</span> <span class="p"><</span><span class="nc">StatsAverageGroup</span> <span class="na">billed</span><span class="p">=</span><span class="si">{</span><span class="nx">billed</span><span class="si">}</span> <span class="p">/>;</span>
<span class="p">}</span><span class="si">}</span>
<span class="p"></</span><span class="nc">CpsOrdersStatsConsumer</span><span class="p">></span>
/* F */
<span class="p"><</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">({</span> <span class="nx">updateRequestsState</span> <span class="p">})</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nc">StatsRatiosGroupWrapper</span>
<span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span>
<span class="na">metric</span><span class="p">=</span><span class="s">"ratios"</span>
<span class="na">updateRequestsState</span><span class="p">=</span><span class="si">{</span><span class="nx">updateRequestsState</span><span class="si">}</span>
<span class="p">/></span>
<span class="p">)</span><span class="si">}</span>
<span class="p"></</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
/* G */
<span class="p"><</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">({</span> <span class="nx">updateRequestsState</span> <span class="p">})</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nc">StatsOrderGroupWrapper</span>
<span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span>
<span class="na">metric</span><span class="p">=</span><span class="s">"cancelled"</span>
<span class="na">updateRequestsState</span><span class="p">=</span><span class="si">{</span><span class="nx">updateRequestsState</span><span class="si">}</span>
<span class="p">/></span>
<span class="p">)</span><span class="si">}</span>
<span class="p"></</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
/* H */
<span class="p"><</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
<span class="si">{</span><span class="p">({</span> <span class="nx">updateRequestsState</span> <span class="p">})</span> <span class="o">=></span> <span class="p">(</span>
<span class="p"><</span><span class="nc">CancellationGroupWrapper</span>
<span class="na">searchUri</span><span class="p">=</span><span class="si">{</span><span class="nx">searchUri</span><span class="si">}</span>
<span class="na">metric</span><span class="p">=</span><span class="s">"cancellation_reasons"</span>
<span class="na">updateRequestsState</span><span class="p">=</span><span class="si">{</span><span class="nx">updateRequestsState</span><span class="si">}</span>
<span class="p">/></span>
<span class="p">)</span><span class="si">}</span>
<span class="p"></</span><span class="nc">LoadingInspectionConsumer</span><span class="p">></span>
<span class="p"></</span><span class="nc">CpsOrdersStatsProvider</span><span class="p">></span>
<span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>To sum up, we can say that the flow goes something like this:</p>
<ol>
<li>Wrapper components (like <code class="language-plaintext highlighter-rouge">StatsOrderGroupWrapper</code>) render with a loading state enabled</li>
<li><code class="language-plaintext highlighter-rouge">useGetCpsOrderStats</code> requests the data</li>
<li>When the data are available, the <code class="language-plaintext highlighter-rouge">dispatch</code> method is called in order to update the context</li>
<li>The component’s loading state becomes <code class="language-plaintext highlighter-rouge">false</code></li>
<li>The wrapper notifies the <code class="language-plaintext highlighter-rouge">LoadingInspectionProvider</code> context component that it has finished loading its data</li>
<li>The stats context has been updated, so the <code class="language-plaintext highlighter-rouge">CpsOrdersStatsConsumer</code>s notify their children to render the desired data</li>
</ol>
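<p>Stripped of its React specifics, the bookkeeping performed by <code class="language-plaintext highlighter-rouge">LoadingInspectionProvider</code> in this flow can be sketched in plain JavaScript (a simplification of the code above, for illustration only):</p>

```javascript
// A map of metric -> isLoading, plus the derived page-wide loading flag.
function atLeastOneIsPending(requests) {
  return Object.values(requests).some((isLoading) => isLoading === true);
}

const requests = {};
function updateRequest(metric, metricIsLoading) {
  requests[metric] = metricIsLoading; // what setRequests does immutably in React
}

updateRequest('all', true);     // a wrapper starts fetching
updateRequest('billed', true);  // another wrapper starts fetching
updateRequest('all', false);    // the first one finished
const stillLoading = atLeastOneIsPending(requests); // true: "billed" is pending
updateRequest('billed', false);
const pageReady = !atLeastOneIsPending(requests);   // true: new searches allowed
```

<p>The <code class="language-plaintext highlighter-rouge">Search</code> and <code class="language-plaintext highlighter-rouge">Clear</code> buttons stay disabled while this derived flag is <code class="language-plaintext highlighter-rouge">true</code>.</p>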
<h1 id="final-result">Final result</h1>
<p>It is time to take a look at our final result. As soon as the document loads, each component sends a request to the backend and renders a placeholder, indicating that it is waiting for its data to become available. The user now has a much better idea of what this page is going to render.</p>
<figure>
<a href="../../../images/react_progressive_load/after.png" class="image-popup">
<img src="../../../images/react_progressive_load/after.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/after.png">
Image 4: The final result
</a>
</figcaption>
</figure>
<p>If an unexpected error occurs in a specific component, an error message is rendered alongside a retry button that gives the user the opportunity to request the data for that component again. The components that managed to retrieve their data successfully can still visualize them via their respective charts.</p>
<figure>
<a href="../../../images/react_progressive_load/component_error.png" class="image-popup">
<img src="../../../images/react_progressive_load/component_error.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/component_error.png">
Image 5: Unexpected error to one or more components
</a>
</figcaption>
</figure>
<p>Now let’s take a look at the network tab to figure out what has changed. The next two images illustrate the time it took for each request to be completed.</p>
<blockquote>
<p>All measurements were made in a development environment</p>
</blockquote>
<p>In the first image we can see that, in the initial case, it took about 34 seconds for the one (and only) request to complete.</p>
<figure>
<a href="../../../images/react_progressive_load/network_one.png" class="image-popup">
<img src="../../../images/react_progressive_load/network_one.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/network_one.png">
Image 6: One network request
</a>
</figcaption>
</figure>
<p>In the second image we see that it takes about 42 seconds for all requests to complete. Moreover, instead of 1 request, we now have 6 requests running concurrently. At first glance this does not seem efficient; on the contrary, it seems to have made things worse.</p>
<p>But if we take a second look, we can see that the first request completes in less than 12 seconds. This means that 12 seconds after the initial load, the first component renders its data. About 5 seconds later, a second component renders its own data, and so on. In other words, <strong>the page loads progressively</strong>!</p>
<figure>
<a href="../../../images/react_progressive_load/network_multiple.png" class="image-popup">
<img src="../../../images/react_progressive_load/network_multiple.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/network_multiple.png">
Image 7: Multiple network requests
</a>
</figcaption>
</figure>
<p>Comparing the two cases (in the first, all components render their data after 34 seconds; in the second, each component renders its data as soon as it is available), we see that the second case provides a better user experience, even though the last component gets rendered after 42 seconds (vs the 34 seconds of the first case).</p>
<p>The fact that the page finishes loading the first batch of information in 12 seconds (instead of 34) reduces the <a href="https://web.dev/tti/">TTI</a>. As we can see in the following Lighthouse report, the page becomes interactive at 1.8 seconds.</p>
<figure>
<a href="../../../images/react_progressive_load/lighthouse.png" class="image-popup">
<img src="../../../images/react_progressive_load/lighthouse.png" alt="image" />
</a>
<figcaption>
<a href="../../images/react_progressive_load/lighthouse.png">
Image 8: Lighthouse report
</a>
</figcaption>
</figure>
<h1 id="summary-and-next-steps">Summary and next steps</h1>
<p>In this post we examined the implementation of a progressively loading React app and provided some technical details. We saw how React’s <code class="language-plaintext highlighter-rouge">context</code> object helped us achieve this functionality. Finally, we presented some performance metrics and saw that the final solution is a bit slower overall than the initial one, but the user experience is clearly better.</p>
<p>But there is always room for improvement. In our case, we could implement a mechanism that lets the user change the applied filters without having to wait for all data to be fetched. Since we use <code class="language-plaintext highlighter-rouge">axios</code> to fetch our data, we can use the <code class="language-plaintext highlighter-rouge">CancelToken</code> object provided by the library.</p>
<p>Below we can see the <code class="language-plaintext highlighter-rouge">getStats</code> function, the function that we call to fetch the stats data for a specific metric. <code class="language-plaintext highlighter-rouge">getStats</code> uses the <code class="language-plaintext highlighter-rouge">get</code> function, which is a custom wrapper around axios that we call <code class="language-plaintext highlighter-rouge">SkroutzAxios</code>.</p>
<p>It’s pretty straightforward to implement the cancellation functionality here: we just have to pass the <code class="language-plaintext highlighter-rouge">cancelToken</code> property to the options object.</p>
<figure class="highlight"><pre><code class="language-jsx" data-lang="jsx"><span class="kd">const</span> <span class="nx">CancelToken</span> <span class="o">=</span> <span class="nx">axios</span><span class="p">.</span><span class="nx">CancelToken</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">source</span> <span class="o">=</span> <span class="nx">CancelToken</span><span class="p">.</span><span class="nf">source</span><span class="p">();</span>
<span class="kd">const</span> <span class="nx">httpRequest</span> <span class="o">=</span> <span class="nx">SkroutzAxios</span><span class="p">;</span>
<span class="kd">function</span> <span class="nf">get</span><span class="p">(</span><span class="nx">url</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="nf">httpRequest</span><span class="p">(</span><span class="nx">url</span><span class="p">,</span> <span class="p">{</span>
<span class="na">method</span><span class="p">:</span> <span class="dl">'</span><span class="s1">GET</span><span class="dl">'</span><span class="p">,</span>
<span class="na">credentials</span><span class="p">:</span> <span class="dl">'</span><span class="s1">same-origin</span><span class="dl">'</span><span class="p">,</span>
<span class="na">mode</span><span class="p">:</span> <span class="dl">'</span><span class="s1">cors</span><span class="dl">'</span><span class="p">,</span>
<span class="na">cache</span><span class="p">:</span> <span class="dl">'</span><span class="s1">default</span><span class="dl">'</span><span class="p">,</span>
<span class="na">cancelToken</span><span class="p">:</span> <span class="nx">source</span><span class="p">.</span><span class="nx">token</span> <span class="c1">// Provide a cancellation token</span>
<span class="p">}).</span><span class="nf">then</span><span class="p">(</span><span class="nx">checkStatus</span><span class="p">);</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nf">getStats</span><span class="p">(</span><span class="nx">searchUri</span> <span class="o">=</span> <span class="dl">''</span><span class="p">,</span> <span class="nx">metric</span> <span class="o">=</span> <span class="dl">''</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">let</span> <span class="nx">endpoint</span> <span class="o">=</span> <span class="s2">`</span><span class="p">${</span><span class="nx">STATS_ENDPOINT</span><span class="p">}${</span><span class="nx">searchUri</span><span class="p">}</span><span class="s2">`</span><span class="p">;</span>
<span class="k">if </span><span class="p">(</span><span class="nx">metric</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">symbol</span> <span class="o">=</span> <span class="nx">searchUri</span> <span class="p">?</span> <span class="dl">'</span><span class="s1">&</span><span class="dl">'</span> <span class="p">:</span> <span class="dl">'</span><span class="s1">?</span><span class="dl">'</span><span class="p">;</span>
<span class="nx">endpoint</span> <span class="o">+=</span> <span class="s2">`</span><span class="p">${</span><span class="nx">symbol</span><span class="p">}</span><span class="s2">metric=</span><span class="p">${</span><span class="nx">metric</span><span class="p">}</span><span class="s2">`</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nf">get</span><span class="p">(</span><span class="nx">endpoint</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>Then, we can just call <code class="language-plaintext highlighter-rouge">source.cancel();</code> when we want to cancel the pending requests. For more details about the usage of <code class="language-plaintext highlighter-rouge">CancelToken</code> visit the axios <a href="https://github.com/axios/axios#cancellation">docs</a>.</p>
<p>Finally, as we can see from the previous performance report, we can take some actions in order to improve the overall score:</p>
<ul>
<li>Reduce the initial server response time, because React waits for the document to be served before it starts fetching data</li>
<li>Remove potentially unused JavaScript, as it adds unnecessary network activity</li>
<li>Eliminate render-blocking resources and deliver non-critical assets asynchronously</li>
</ul>
<p><a href="https://engineering.skroutz.gr/blog/refactor-react-app-to-progressively-load-its-data/">Refactoring a React app to progressively load its data</a> was originally published by John Kapantzakis at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on April 07, 2021.</p>https://engineering.skroutz.gr/blog/how-we-classify-products2021-03-05T22:00:00+00:002021-03-05T22:00:00+00:00George Hadjigeorgiouhttps://engineering.skroutz.gr<p>Skroutz is a marketplace that hosts more than 8,500 merchants and keeps
adding 500 new merchants per month. This translates to more than 80,000
new offers per day with peaks as high as 200,000 on certain occasions.</p>
<p>Our product content team is one of the largest in the organization (140
people as of March 2021), but to handle high product loads we had to
implement a number of automation tools for product classification.</p>
<p>Merchants have two ways of uploading products to Skroutz:</p>
<ol>
<li>Via an XML file that always reflects the merchant’s up-to-date offers,
including new ones</li>
<li>Through our merchants CMS (used by merchants without a platform of
their own)</li>
</ol>
<p>When a new offer (or a product, in Skroutz jargon) is detected, it is identified
and, if possible, placed in the appropriate category and merged into the corresponding SKU.</p>
<p>Before we move on to describe how the classification is achieved it is
important to describe some of the basics:</p>
<ul>
<li>An <a href="https://en.wikipedia.org/wiki/Stock_keeping_unit">SKU</a> is a brand’s unique product that is uniquely described by a part number or an <a href="https://en.wikipedia.org/wiki/International_Article_Number">EAN</a>.</li>
<li>An offer is unique to a merchant but generally describes
an SKU. The offer should in reality carry all the necessary attributes of the
SKU so that it can be correctly identified but, unfortunately, that’s rarely the
case.</li>
<li>A category represents a class of SKUs (e.g. smartphones, sneakers).
Merchants and Skroutz almost always have different categorization
hierarchies. Our category tree has many levels, but only leaves contain
SKUs; top-level categories are there to help consumers navigate.</li>
<li>SKUs have specifications in a structured format that can be used by
consumers to filter results. E.g. a smartphone will have a screen size
specification whereas a dress will have a color specification. These
specifications are defined on a category level.</li>
<li>An SKU belongs to a brand or a manufacturer (e.g. Samsung)</li>
</ul>
<p>The above relations are better depicted in the diagram below:</p>
<p><img src="https://engineering.skroutz.gr/images/2021-classify/model.png" alt="" /></p>
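<p>For illustration, the entities above can be sketched as a handful of Python data classes (names and fields here are hypothetical, not Skroutz’s actual schema):</p>

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    # Only leaf categories contain SKUs; upper levels aid navigation.
    name: str
    parent: "Category | None" = None
    specifications: "list[str]" = field(default_factory=list)

@dataclass
class Sku:
    # A brand's unique product, identified by a part number or an EAN.
    part_number: str
    brand: str
    category: Category
    specs: "dict[str, str]" = field(default_factory=dict)

@dataclass
class Offer:
    # Unique to a merchant; should carry enough attributes
    # to identify the SKU it describes, but rarely does.
    merchant: str
    title: str
    sku: "Sku | None" = None  # None until classified

leaf = Category("Smartphones", parent=Category("Technology"),
                specifications=["screen size", "storage"])
sku = Sku("M2007J20CG", "Xiaomi", leaf, {"screen size": '6.67"'})
offer = Offer("some-shop", 'Xiaomi Poco X3 6.67" 6GB/128GB', sku)
```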
<p>On average, 60% of all incoming products/offers belong to an existing
SKU. Ideally, the SKU part numbers or EAN should be enough for SKU
classification but in reality those attributes are often either missing or are
just plain wrong.</p>
<h2 id="classification">Classification</h2>
<p>Our classification tool goes by the name of Tron with two major
subtools:</p>
<ul>
<li><strong>Megatron</strong>: classifies incoming products to categories using
machine learning</li>
<li><strong>Ngntron (or new generation tron)</strong>: classifies products into SKUs using feature extraction</li>
</ul>
<p>The purpose of this post is to describe Ngntron and how feature analysis
has helped us build a myriad of satellite tools beyond classification
itself.</p>
<h3 id="sku-classification">SKU classification</h3>
<p>Incoming products are plain-text representations of their attributes,
ideally including everything necessary, such as brand name, color and
size.</p>
<p>Below are examples of such products from various categories:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Xiaomi Poco X3 Dual Sim 6.67" 6GB/128GB 4G NFC Γκρι M2007J20CG
LG Ψυγειοκαταψύκτης GBP62DSNFN (384lt, A+++) Total No Frost
Γυναικεία Παπούτσια Vans | Old Skool Platform Black | Womens Shoes Black VN0A3B3UY281
</code></pre></div></div>
<p>As evident from the examples above, product descriptions follow no standard pattern
and in some cases include marketing information not relevant to the
product such as special discounts. Below are the most common problems found in
product descriptions:</p>
<ul>
<li>Part numbers or EANs can refer to product families (e.g. Apple iPhone 12)
and not specific variants (e.g. Apple iPhone 12 64GB Black)</li>
<li>Random strings instead of part numbers</li>
<li>Missing or partial information</li>
<li>Country or region specific part numbers / EANs</li>
<li>Multiple part numbers</li>
<li>Redundant or irrelevant information</li>
</ul>
<p>Our first approach, using plain TF-IDF, yielded poor results. After
all, our purpose was not to rank products by relevance, but to find
that one true perfect match, or to determine that this is a new product
matching none of the existing ones.</p>
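<p>A toy bag-of-words cosine similarity (a crude stand-in for TF-IDF, purely illustrative) shows the problem: near-miss variants score almost as high as the exact match, so no relevance threshold can safely separate “perfect match” from “different product”:</p>

```python
import math
from collections import Counter

def cosine(a, b):
    # Plain bag-of-words cosine similarity over lowercased tokens.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values())) *
            math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm

offer = "Apple iPhone 12 64GB Black"
skus = ["Apple iPhone 12 64GB Black",
        "Apple iPhone 12 128GB Black",
        "Apple iPhone 12 Mini 64GB Black"]
scores = {s: round(cosine(offer, s), 2) for s in skus}
# The 'wrong' variants score nearly as high as the exact match,
# so ranking by relevance cannot decide match vs. new product.
```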
<h3 id="feature-extraction">Feature extraction</h3>
<p>The process of feature extraction aims to identify specific attributes
in the text representation of the product and tag them or even better
link them to known models.</p>
<p>For example, the first product in the list above yields the following results:</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">product_name</span> <span class="o">=</span> <span class="s1">'Xiaomi Poco X3 Dual Sim 6.67" 6GB/128GB 4G NFC Γκρι M2007J20CG'</span>
<span class="n">analyzer</span> <span class="o">=</span> <span class="no">Ngntron</span><span class="o">::</span><span class="no">Analyzers</span><span class="o">::</span><span class="no">ProductAnalyzer</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">product_name</span><span class="p">)</span>
<span class="nb">puts</span> <span class="n">analyzer</span><span class="p">.</span><span class="nf">phrase</span></code></pre></figure>
<p>The above snippet yields the following results:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[manufacturer] 0,0 => Xiaomi
[model] 1,2 => Poco X3
[filter] 3,4 => Dual Sim
[feature] 3,3 => Dual
[feature] 5,5 => 6.67"
[] 6,6 => 6GB/128GB
[feature] 7,7 => 4G
[filter, feature] 8,8 => NFC
[filter, feature, color] 9,9 => Γκρι
[] 10,10 => M2007J20CG
[pn] 11,11 => 30371
</code></pre></div></div>
<p>Each identified word in the original phrase has been tagged with one or
more tags that correspond to a specific attribute. Similarly:</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">product_name</span> <span class="o">=</span> <span class="s1">'Γυναικεία Παπούτσια Vans | Old Skool Platform Black | Womens Shoes Black VN0A3B3UY281'</span>
<span class="n">analyzer</span> <span class="o">=</span> <span class="no">Ngntron</span><span class="o">::</span><span class="no">Analyzers</span><span class="o">::</span><span class="no">ProductAnalyzer</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">product_name</span><span class="p">)</span>
<span class="nb">puts</span> <span class="n">analyzer</span><span class="p">.</span><span class="nf">phrase</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[filter, feature] 0,0 => Γυναικεία
[category] 1,1 => Παπούτσια
[manufacturer] 2,2 => Vans
[filter, feature] 3,4 => Old Skool
[model] 5,5 => Platform
[filter, feature, color] 6,6 (9, 9) => Black
[] 7,7 => Womens
[category] 8,8 => Shoes
[pn] 10,10 => VN0A3B3UY281
</code></pre></div></div>
<p>Each tag contains relevant information as to how it was identified and
what model it is referencing. For example the tag <code class="language-plaintext highlighter-rouge">manufacturer</code> would
have the manufacturer <code class="language-plaintext highlighter-rouge">id</code> that it matched.</p>
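<p>A rough sketch of what such dictionary-driven span tagging could look like (all names, tags and ids here are hypothetical; the real analyzer is far more involved):</p>

```python
# Toy span tagger: look up known attribute values in a phrase, longest
# match first, and emit (tags, start, end, text, model_id) tuples
# resembling the listings above.
KNOWN = {
    ("manufacturer",): {"xiaomi": 1, "vans": 2},   # value -> model id
    ("model",): {"poco x3": 10, "platform": 11},
    ("filter", "feature", "color"): {"black": 20},
}

def tag_phrase(phrase):
    words = phrase.lower().split()
    results, i = [], 0
    while i < len(words):
        matched = False
        for length in (2, 1):              # prefer two-word matches
            if i + length > len(words):
                continue
            chunk = " ".join(words[i:i + length])
            for tags, values in KNOWN.items():
                if chunk in values:
                    results.append((list(tags), i, i + length - 1,
                                    chunk, values[chunk]))
                    i += length
                    matched = True
                    break
            if matched:
                break
        if not matched:                    # unknown word: empty tag list
            results.append(([], i, i, words[i], None))
            i += 1
    return results

tags = tag_phrase("Xiaomi Poco X3 Black")
```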
<p>The analyzer employs various heuristics and tricks to make sure that
all tags are identified, such as aliases (Western Digital vs WD, Call of
Duty vs COD), years (2008 vs 08), numbers (IV vs 4), and the list goes
on.</p>
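<p>A naive version of such normalization heuristics might look like the following sketch (the data and rules are illustrative; the real analyzer applies them with far more context, e.g. to avoid turning a model number like “12” into a year):</p>

```python
# Toy normalization pass: map known aliases, roman numerals and
# two-digit years to canonical forms before matching.
ALIASES = {"wd": "western digital", "cod": "call of duty"}
ROMAN = {"ii": "2", "iii": "3", "iv": "4"}

def normalize(token):
    t = token.lower()
    if t in ALIASES:
        return ALIASES[t]
    if t in ROMAN:
        return ROMAN[t]
    if len(t) == 2 and t.isdigit():   # '08' -> '2008' (deliberately naive)
        return str(2000 + int(t))
    return t

def normalize_phrase(phrase):
    return " ".join(normalize(t) for t in phrase.split())
```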
<h3 id="feature-comparison">Feature comparison</h3>
<p>When a new product arrives, its analysis is stored in a serialized
format and updated every time the product is changed.</p>
<p>After the category classification has ended the SKU classification takes place
by retrieving the product’s analysis and comparing it with existing
SKUs.</p>
<p>Based on some predefined strategies, such as <code class="language-plaintext highlighter-rouge">absolute high entropy PN match</code>, the
comparison phase will yield a match with a certain confidence level. We
have 3 confidence levels:</p>
<ul>
<li><strong>Auto</strong>: the product is classified with no human intervention</li>
<li><strong>Semi-Auto</strong>: the product is classified but a human must confirm at some
point</li>
<li><strong>Manual</strong>: a human will approve this classification but until then the
new product is not classified</li>
</ul>
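<p>The dispatch on confidence levels could be sketched as follows (the strategy names and the mapping are hypothetical):</p>

```python
from enum import Enum

class Confidence(Enum):
    AUTO = "auto"          # classified, no human needed
    SEMI_AUTO = "semi"     # classified now, confirmed later
    MANUAL = "manual"      # held until a human approves

# Hypothetical mapping of matching strategies to confidence levels.
STRATEGY_CONFIDENCE = {
    "absolute_high_entropy_pn_match": Confidence.AUTO,
    "model_plus_features_match": Confidence.SEMI_AUTO,
    "fuzzy_title_match": Confidence.MANUAL,
}

def classify(strategy):
    """Return (classified_immediately, needs_human)."""
    conf = STRATEGY_CONFIDENCE[strategy]
    if conf is Confidence.AUTO:
        return True, False
    if conf is Confidence.SEMI_AUTO:
        return True, True    # live now, confirmed at some point
    return False, True       # waits for approval before going live
```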
<h3 id="stats">Stats</h3>
<p>As of today, more than 45% of incoming products that belong to existing
SKUs are classified with no human intervention, another 40% is classified
but requires approval, and 10% is classified after a human approves the
match. Only 5% of new products escape Ngntron and require a human to
look for a match.</p>
<p>With the help of Ngntron, merchants with thousands of products can
go live with more than 90% of their product catalog listed on Skroutz in
just an hour.</p>
<h3 id="other-uses">Other uses</h3>
<p>We use Ngntron’s feature extraction capabilities not just for
classification but for other cases as well. Our internal project QuLA
will use the same pipeline to determine in advance whether an XML feed
is suitable for Skroutz and advise the account team accordingly.</p>
<p>We also use extracted features to guide the content team when
retroactively adding specifications to a category.</p>
<h3 id="scaling">Scaling</h3>
<p>Since we expect to reach 20,000 merchants and 250,000 products per day in the
near future, classification automation is one of the most important and
high impact processes in Skroutz.</p>
<p>We have already tweaked the algorithm to learn from past classifications
and adapt its category based confidence levels. In some categories, more
than 90% of products are auto classified, greatly reducing the load of
the content team and thus enabling us to scale our merchant base.</p>
<p>Of course, even 5% of manual classification at such a scale is a huge
load, which is why the content engineering team is already optimizing Ngntron
to further reduce the amount of manual work.</p>
<p>Oh, and by the way, <a href="https://www.skroutz.gr/careers/52">they are hiring</a>!</p>
<hr />
<p>If you enjoyed reading this post and are curious to learn how Ngntron
and other tools in Skroutz work, <a href="https://www.skroutz.gr/careers#job-openings">checkout our open positions</a></p>
<p><a href="https://engineering.skroutz.gr/blog/how-we-classify-products/">How we classify products at Skroutz</a> was originally published by George Hadjigeorgiou at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on March 05, 2021.</p>https://engineering.skroutz.gr/blog/uncovering-a-24-year-old-bug-in-the-linux-kernel2021-02-10T22:00:00+00:002021-02-10T22:00:00+00:00Apollon Oikonomopouloshttps://engineering.skroutz.gr<p>As part of our standard toolkit, we provide each developer at Skroutz
with a writable database snapshot against which she can develop. These
snapshots are updated daily through a pipeline that involves taking an
LVM snapshot of production data, anonymizing the dataset by stripping
all personal data, and transferring it via rsync to the development
database servers. The development servers in turn use ZFS snapshots to
expose a copy-on-write snapshot to each developer, with
self-service tools allowing rollback or upgrades to newer snapshots.</p>
<p>We use the same pipeline to expose MariaDB and MongoDB data, with a
full dataset size of 600GB and 200GB respectively, and a slightly
modified pipeline for Elasticsearch. While on-disk data changes
significantly for all data sources, rsync still saves significant time
by transferring roughly 1/3 of the full data set every night. This
setup has worked rather well for the better part of a decade and has
managed to scale from 15 developers to 150. However, as with most
systems, it has had its fair share of maintenance and has given us
some interesting moments.</p>
<p>One of the most interesting issues we encountered led to the discovery
of a fairly old bug in the Linux kernel TCP implementation: every now
and then, an rsync transfer from a source server would hang
indefinitely for no apparent reason, as — apart from the stuck transfer —
everything else seemed to be in order. What’s more, for reasons that became
apparent later, the issue could not be reproduced at will, although
some actions (e.g. adding an rsync-level rate limit) seemed to make
the issue less frequent, with frequency ranging from once or twice per
week to once every three months.</p>
<p>As is not unusual in these cases, we had more urgent systems and issues to
attend to, so we labeled this a “race condition in rsync” that we
should definitely look into at some point, and worked around it by
throttling the rsync transfers.</p>
<p>Until it started biting us every single day.</p>
<h2 id="rsync-as-a-pipeline">rsync as a pipeline</h2>
<p>While not strictly necessary, knowing how rsync works internally will help
understand the analysis that follows. The rsync site contains <a href="https://rsync.samba.org/how-rsync-works.html">a thorough
description</a> of rsync’s internal architecture, so I’ll try to
summarize the most important points here:</p>
<ol>
<li>
<p>rsync starts off as a single process on the client and a single
process on the server, communicating via a socket pair. When using
the rsync daemon, as in our case, communication is done over a
plain TCP connection.</p>
</li>
<li>
<p>Based on the direction of sync, after the initial handshake is
over, each end assumes a <em>role</em>, either that of the <em>sender</em>, or
that of the <em>receiver</em>. In our case the client is the receiver,
and the server is the sender.</p>
</li>
<li>
<p>The receiver forks an additional process called the <em>generator</em>,
sharing the socket with the <em>receiver</em> process. The <em>generator</em>
figures out what it needs to ask from the <em>sender</em>, and the
<em>sender</em> subsequently sends the data to the <em>receiver</em>. What we
essentially have after this step is a pipeline, <em>generator</em> →
<em>sender</em> → <em>receiver</em>, where the arrows are the two directions of
<em>the same</em> TCP connection. While there is some signaling involved,
the pipeline operates in a <em>blocking</em> fashion and relies on OS
buffers and TCP receive windows to apply backpressure.</p>
</li>
</ol>
<h2 id="a-ghost-in-the-network">A ghost in the network?</h2>
<p>Our first reaction when we encountered the issue was to suspect the
network for errors, which was a <em>reasonable</em> thing to do since we had
recently upgraded our servers and switches. After eliminating the
usual suspects (NIC firmware bugs involving TSO/GSO/GRO/VLAN
offloading, excessive packet drops or CRC errors at the switches etc),
we came to the conclusion that everything was normal and something
else had to be going on.</p>
<p>Attaching to the hung processes with strace and gdb told us little: the
generator was hung on <code class="language-plaintext highlighter-rouge">send()</code> and the sender and receiver were hung
on <code class="language-plaintext highlighter-rouge">recv()</code>, yet no data was moving. However, turning to the kernel on
both systems revealed something interesting! On the client the rsync
socket shared between the <em>generator</em> and the <em>receiver</em> processes was
in the following state:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>ss <span class="nt">-mito</span> dst :873
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 392827 ❶ 2001:db8:2a::3:38022 2001:db8:2a::18:rsync timer:<span class="o">(</span>persist,1min56sec,0<span class="o">)</span>
skmem:<span class="o">(</span>r0,rb4194304,t0,tb530944,f3733,w401771,o0,bl0,d757<span class="o">)</span> ts sack cubic wscale:7,7 rto:204 backoff:15 rtt:2.06/0.541 ato:40 mss:1428 cwnd:10 ssthresh:46 bytes_acked:22924107 bytes_received:100439119971 segs_out:7191833 segs_in:70503044 data_segs_out:16161 data_segs_in:70502223 send 55.5Mbps lastsnd:16281856 lastrcv:14261988 lastack:3164 pacing_rate 133.1Mbps retrans:0/11 rcv_rtt:20 rcv_space:2107888 notsent:392827 minrtt:0.189</code></pre></figure>
<p>while on the server, the socket state was the following:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>ss <span class="nt">-mito</span> src :873
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 2001:db8:2a::18:rsync 2001:db8:2a::3:38022 timer:<span class="o">(</span>keepalive,3min7sec,0<span class="o">)</span>
skmem:<span class="o">(</span>r0,rb3540548,t0,tb4194304,f0,w0,o0,bl0,d292<span class="o">)</span> ts sack cubic wscale:7,7 rto:204 rtt:1.234/1.809 ato:40 mss:1428 cwnd:1453 ssthresh:1431 bytes_acked:100439119971 bytes_received:22924106 segs_out:70503089 segs_in:7191833 data_segs_out:70502269 data_segs_in:16161 send 13451.4Mbps lastsnd:14277708 lastrcv:16297572 lastack:7012576 pacing_rate 16140.1Mbps retrans:0/794 rcv_rtt:7.5 rcv_space:589824 minrtt:0.026</code></pre></figure>
<p>The interesting thing here is that there are roughly 390KB of data on the
client, queued to be sent (❶ in the first output) by the
<em>generator</em> to the server; however, while the server has an empty <code class="language-plaintext highlighter-rouge">Recv-Q</code>
and can accept data, nothing seems to be moving forward. If <code class="language-plaintext highlighter-rouge">Recv-Q</code>
in the second output were non-zero, we would be looking at rsync on the
server being stuck and not reading from the network; here, however, it
is obvious that rsync has consumed all incoming data and is not to
blame.</p>
<p>So why is data queued up on one end of the connection, while the other end is
obviously able to accept it? The answer is conveniently hidden in the <code class="language-plaintext highlighter-rouge">timer</code>
fields of both <code class="language-plaintext highlighter-rouge">ss</code> outputs, especially in
<code class="language-plaintext highlighter-rouge">timer:(persist,1min56sec,0)</code>. Quoting <code class="language-plaintext highlighter-rouge">ss(8)</code>:</p>
<figure class="highlight"><pre><code class="language-man" data-lang="man"> -o, --options
Show timer information. For TCP protocol, the output format is:
timer:(<timer_name>,<expire_time>,<retrans>)
<timer_name>
the name of the timer, there are five kind of timer names:
on : means one of these timers: TCP retrans timer, TCP
early retrans timer and tail loss probe timer
keepalive: tcp keep alive timer
timewait: timewait stage timer
persist: zero window probe timer
unknown: none of the above timers</code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">persist</code> means that the connection has received a zero window
advertisement and is waiting for the peer to advertise a non-zero
window.</p>
<h2 id="tcp-zero-windows-and-zero-window-probes">TCP Zero Windows and Zero Window Probes</h2>
<p>TCP implements flow control by limiting the data in flight using a sliding
window called the <em>receive window</em>. Wikipedia has a <a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Flow_control">good description</a>, but in short each end of a TCP connection advertises how much
data it is willing to buffer for the connection, i.e. how much data the other
end may send before waiting for an acknowledgment.</p>
<p>When one side’s receive buffer (<code class="language-plaintext highlighter-rouge">Recv-Q</code>) fills up (in this case
because the rsync process is doing disk I/O at a speed slower than the
network’s), it will send out a zero window advertisement, which will
put that direction of the connection on hold. When buffer space
eventually frees up, the kernel will send an unsolicited window update
with a non-zero window size, and the data transfer continues. To be
safe, just in case this unsolicited window update is lost, the other
end will regularly poll the connection state using the so-called Zero
Window Probes (the <code class="language-plaintext highlighter-rouge">persist</code> mode we are seeing here).</p>
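<p>The buffer accounting behind zero window advertisements can be illustrated with a toy model (pure illustration, not kernel code):</p>

```python
class ToyReceiver:
    """Toy model of TCP receive-window accounting."""
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.queued = 0   # bytes received but not yet read by the app

    def advertised_window(self):
        # The window shrinks as unread data accumulates in Recv-Q.
        return self.buffer_size - self.queued

    def receive(self, nbytes):
        assert nbytes <= self.advertised_window(), "sender overran the window"
        self.queued += nbytes

    def app_read(self, nbytes):
        # Freeing buffer space re-opens the window; the kernel then
        # sends an unsolicited window update to the peer.
        self.queued -= min(nbytes, self.queued)

rx = ToyReceiver(buffer_size=65536)
rx.receive(65536)                      # slow reader: buffer fills up...
zero_window = rx.advertised_window()   # ...and a zero window is advertised
rx.app_read(16384)                     # application drains some data
reopened = rx.advertised_window()      # window update advertises 16384
```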
<h2 id="the-window-is-stuck-closed">The window is stuck closed</h2>
<p>It’s now time to dive a couple of layers deeper and use <code class="language-plaintext highlighter-rouge">tcpdump</code> to
see what’s going on at the network level:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">[…]
09:34:34.165148 0c:c4:7a:f9:68:e4 > 0c:c4:7a:f9:69:78, ethertype IPv6 (0x86dd), length 86: (flowlabel 0xcbf6f, hlim 64, next-header TCP (6) payload length: 32) 2001:db8:2a::3.38022 > 2001:db8:2a::18.873: Flags [.], cksum 0x711b (incorrect -> 0x4d39), seq 4212361595, ack 1253278418, win 16384, options [nop,nop,TS val 2864739840 ecr 2885730760], length 0
09:34:34.165354 0c:c4:7a:f9:69:78 > 0c:c4:7a:f9:68:e4, ethertype IPv6 (0x86dd), length 86: (flowlabel 0x25712, hlim 64, next-header TCP (6) payload length: 32) 2001:db8:2a::18.873 > 2001:db8:2a::3.38022: Flags [.], cksum 0x1914 (correct), seq 1253278418, ack 4212361596, win 13831, options [nop,nop,TS val 2885760967 ecr 2863021624], length 0
[… repeats every 2 mins]</code></pre></figure>
<p>The first packet is the rsync client’s zero window probe, the second
packet is the server’s response. Surprisingly enough, the server is
advertising a non-zero window size of 13831 bytes¹, which the client
apparently ignores.</p>
<p>¹ actually multiplied by 128 because of a <a href="https://en.wikipedia.org/wiki/TCP_window_scale_option">window scaling</a> factor
of 7</p>
<p>We are finally making some progress and have a case to work on! At
some point the client encountered a zero window advertisement from the
server as part of regular TCP flow control, but then the window failed
to re-open for some reason. The client seems to be still ignoring
the new window advertised by the server and this is why the transfer
is stuck.</p>
<h2 id="linux-tcp-input-processing">Linux TCP input processing</h2>
<p>By now it’s obvious that the TCP connection is in a weird state on the
rsync client. Since TCP flow control happens at the kernel level, to
get to the root of this we need to look at how the Linux kernel
handles incoming TCP acknowledgments and try to figure out why it
ignores the incoming window advertisement.</p>
<p>Incoming TCP packet processing happens in
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_input.c"><code class="language-plaintext highlighter-rouge">net/ipv4/tcp_input.c</code></a>. Despite
the <code class="language-plaintext highlighter-rouge">ipv4</code> component in the path, this is mostly shared code between
IPv4 and IPv6.</p>
<p>Digging a bit through the code we find out that incoming window
updates are handled in
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_input.c?id=c3df39ac9b0e3747bf8233ea9ce4ed5ceb3199d3#n3552"><code class="language-plaintext highlighter-rouge">tcp_ack_update_window</code></a>
and actually updating the window is guarded by the following function:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cm">/* Check that window update is acceptable.
* The function assumes that snd_una<=ack<=snd_next.
*/</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="n">bool</span> <span class="nf">tcp_may_update_window</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">tcp_sock</span> <span class="o">*</span><span class="n">tp</span><span class="p">,</span>
<span class="k">const</span> <span class="n">u32</span> <span class="n">ack</span><span class="p">,</span> <span class="k">const</span> <span class="n">u32</span> <span class="n">ack_seq</span><span class="p">,</span>
<span class="k">const</span> <span class="n">u32</span> <span class="n">nwin</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">after</span><span class="p">(</span><span class="n">ack</span><span class="p">,</span> <span class="n">tp</span><span class="o">-></span><span class="n">snd_una</span><span class="p">)</span> <span class="o">||</span> <span class="err">❶</span>
<span class="n">after</span><span class="p">(</span><span class="n">ack_seq</span><span class="p">,</span> <span class="n">tp</span><span class="o">-></span><span class="n">snd_wl1</span><span class="p">)</span> <span class="o">||</span> <span class="err">❷</span>
<span class="p">(</span><span class="n">ack_seq</span> <span class="o">==</span> <span class="n">tp</span><span class="o">-></span><span class="n">snd_wl1</span> <span class="o">&&</span> <span class="n">nwin</span> <span class="o">></span> <span class="n">tp</span><span class="o">-></span><span class="n">snd_wnd</span><span class="p">);</span> <span class="err">❸</span>
<span class="p">}</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">ack</code>, <code class="language-plaintext highlighter-rouge">ack_seq</code>, <code class="language-plaintext highlighter-rouge">snd_wl1</code> and <code class="language-plaintext highlighter-rouge">snd_una</code> variables hold TCP
sequence numbers that are used in TCP’s sliding window to keep track
of the data exchanged over the wire. These sequence numbers are 32-bit
unsigned integers (<code class="language-plaintext highlighter-rouge">u32</code>) and are incremented by 1 for each byte that
is exchanged, beginning from an arbitrary initial value (<em>initial
sequence number</em>). In particular:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">ack_seq</code> is the sequence number of the incoming segment</li>
<li><code class="language-plaintext highlighter-rouge">ack</code> is the <em>acknowledgment number</em> contained in the incoming
segment, i.e. it acknowledges the sequence number of the last
segment the peer received from us.</li>
<li><code class="language-plaintext highlighter-rouge">snd_wl1</code> is the sequence number of the incoming segment that last
updated the peer’s receive window.</li>
<li><code class="language-plaintext highlighter-rouge">snd_una</code> is the sequence number of the first <em>unacknowledged</em>
segment, i.e. a segment we have sent but has not been yet
acknowledged by the peer.</li>
</ul>
<p>Being fixed-size integers, the sequence numbers will eventually wrap
around, so the <code class="language-plaintext highlighter-rouge">after()</code> macro takes care of comparing two sequence
numbers <a href="https://en.wikipedia.org/wiki/Serial_number_arithmetic">in the face of wraparounds</a>.</p>
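<p>The same wraparound-safe comparison can be expressed in Python, mirroring the kernel’s macro, which casts the difference to a signed 32-bit integer:</p>

```python
MASK = 0xFFFFFFFF  # TCP sequence numbers are 32-bit unsigned

def after(seq2, seq1):
    """True if seq2 comes after seq1 in sequence space (kernel: after())."""
    # Equivalent to the kernel's (s32)(seq1 - seq2) < 0: the difference,
    # viewed as a signed 32-bit value, is negative.
    return ((seq1 - seq2) & MASK) >= 0x80000000

# A plain '>' breaks across the 2**32 wraparound; after() does not:
naive = 100 > 0xFFFFFFFA          # False, although 100 is 'later'
wrapped = after(100, 0xFFFFFFFA)  # True: the sequence space wrapped
```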
<p>For the record, the <code class="language-plaintext highlighter-rouge">snd_una</code> and <code class="language-plaintext highlighter-rouge">snd_wl1</code> names come directly from
the <a href="https://tools.ietf.org/html/rfc793#section-3.2">original TCP specification in RFC 793</a>, back in 1981!</p>
<p>Translating the rather cryptic check into plain English, we are
willing to accept a window update from a peer if:</p>
<dl>
<dt>❶</dt>
<dd>our peer acknowledges the receipt of data we previously sent; <em>or</em></dd>
<dt>❷</dt>
<dd>our peer is sending new data since the previous window update; <em>or</em></dd>
<dt>❸</dt>
<dd>our peer isn’t sending us new data since the previous window update,
but is advertising a larger window</dd>
</dl>
<p>Note that the comparison of <code class="language-plaintext highlighter-rouge">ack_seq</code> with <code class="language-plaintext highlighter-rouge">snd_wl1</code> is done to make
sure that the window is not accidentally updated by a
(retransmission of a) segment that was seen earlier.</p>
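<p>For experimentation, the check transliterates almost directly to Python (with the wraparound-safe <code class="language-plaintext highlighter-rouge">after()</code> inlined as a helper):</p>

```python
MASK32 = 0xFFFFFFFF

def after(seq2, seq1):
    # Wraparound-safe 'seq2 comes after seq1' (the kernel's after() macro).
    return ((seq1 - seq2) & MASK32) >= 0x80000000

def tcp_may_update_window(snd_una, snd_wl1, snd_wnd, ack, ack_seq, nwin):
    return (after(ack, snd_una) or                     # (1) acks new data
            after(ack_seq, snd_wl1) or                 # (2) peer sent new data
            (ack_seq == snd_wl1 and nwin > snd_wnd))   # (3) same seq, larger window

# Condition (3) in isolation: nothing newly acked, no new data, yet the
# advertised window grew from 0, so the update should be accepted.
accepts = tcp_may_update_window(snd_una=1000, snd_wl1=500, snd_wnd=0,
                                ack=1000, ack_seq=500, nwin=13831)
```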
<p>In our case, at least condition ❸ should be able to re-open the window, but
apparently it doesn’t and we need access to these variables to figure out what
is happening. Unfortunately, these variables are part of the internal kernel
state and are not directly exposed to userspace, so it’s time to get a bit
creative.</p>
<h2 id="accessing-the-internal-kernel-state">Accessing the internal kernel state</h2>
<p>To get access to the kernel state, we somehow need to run code inside
the kernel. One way would be to patch the kernel with a few <code class="language-plaintext highlighter-rouge">printk()</code>
calls here and there, but that would require rebooting the machine and
waiting for rsync to hang again. Rather, we opted to live-patch the
kernel using <a href="https://sourceware.org/systemtap/">systemtap</a> with the following script:</p>
<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="nv">probe</span> <span class="nv">kernel</span><span class="o">.</span><span class="nv">statement</span><span class="p">("</span><span class="s2">tcp_ack@./net/ipv4/tcp_input.c:3751</span><span class="p">")</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="nv">$sk</span><span class="o">-></span><span class="nv">sk_send_head</span> <span class="o">!=</span> <span class="nv">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="nv">ack_seq</span> <span class="o">=</span> <span class="nv">@cast</span><span class="p">(</span><span class="nv">&$skb</span><span class="o">-></span><span class="nv">cb</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">"</span><span class="s2">tcp_skb_cb</span><span class="p">",</span> <span class="p">"</span><span class="s2">kernel<net/tcp.h></span><span class="p">")</span><span class="o">-></span><span class="nv">seq</span>
<span class="nb">printf</span><span class="p">("</span><span class="s2">ack: %d, ack_seq: %d, prior_snd_una: %d</span><span class="se">\n</span><span class="p">",</span> <span class="nv">$ack</span><span class="p">,</span> <span class="nv">ack_seq</span><span class="p">,</span> <span class="nv">$prior_snd_una</span><span class="p">)</span>
<span class="nv">seq</span> <span class="o">=</span> <span class="nv">@cast</span><span class="p">(</span><span class="nv">&$sk</span><span class="o">-></span><span class="nv">sk_send_head</span><span class="o">-></span><span class="nv">cb</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">"</span><span class="s2">tcp_skb_cb</span><span class="p">",</span> <span class="p">"</span><span class="s2">kernel<net/tcp.h></span><span class="p">")</span><span class="o">-></span><span class="nv">seq</span>
<span class="nv">end_seq</span> <span class="o">=</span> <span class="nv">@cast</span><span class="p">(</span><span class="nv">&$sk</span><span class="o">-></span><span class="nv">sk_send_head</span><span class="o">-></span><span class="nv">cb</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">"</span><span class="s2">tcp_skb_cb</span><span class="p">",</span> <span class="p">"</span><span class="s2">kernel<net/tcp.h></span><span class="p">")</span><span class="o">-></span><span class="nv">end_seq</span>
<span class="nb">printf</span><span class="p">("</span><span class="s2">sk_send_head seq:%d, end_seq: %d</span><span class="se">\n</span><span class="p">",</span> <span class="nv">seq</span><span class="p">,</span> <span class="nv">end_seq</span><span class="p">)</span>
<span class="nv">snd_wnd</span> <span class="o">=</span> <span class="nv">@cast</span><span class="p">(</span><span class="nv">$sk</span><span class="p">,</span> <span class="p">"</span><span class="s2">tcp_sock</span><span class="p">",</span> <span class="p">"</span><span class="s2">kernel<linux/tcp.h></span><span class="p">")</span><span class="o">-></span><span class="nv">snd_wnd</span>
<span class="nv">snd_wl1</span> <span class="o">=</span> <span class="nv">@cast</span><span class="p">(</span><span class="nv">$sk</span><span class="p">,</span> <span class="p">"</span><span class="s2">tcp_sock</span><span class="p">",</span> <span class="p">"</span><span class="s2">kernel<linux/tcp.h></span><span class="p">")</span><span class="o">-></span><span class="nv">snd_wl1</span>
<span class="nv">ts_recent</span> <span class="o">=</span> <span class="nv">@cast</span><span class="p">(</span><span class="nv">$sk</span><span class="p">,</span> <span class="p">"</span><span class="s2">tcp_sock</span><span class="p">",</span> <span class="p">"</span><span class="s2">kernel<linux/tcp.h></span><span class="p">")</span><span class="o">-></span><span class="nv">rx_opt</span><span class="o">-></span><span class="nv">ts_recent</span>
<span class="nv">rcv_tsval</span> <span class="o">=</span> <span class="nv">@cast</span><span class="p">(</span><span class="nv">$sk</span><span class="p">,</span> <span class="p">"</span><span class="s2">tcp_sock</span><span class="p">",</span> <span class="p">"</span><span class="s2">kernel<linux/tcp.h></span><span class="p">")</span><span class="o">-></span><span class="nv">rx_opt</span><span class="o">-></span><span class="nv">rcv_tsval</span>
<span class="nb">printf</span><span class="p">("</span><span class="s2">snd_wnd: %d, tcp_wnd_end: %d, snd_wl1: %d</span><span class="se">\n</span><span class="p">",</span> <span class="nv">snd_wnd</span><span class="p">,</span> <span class="nv">$prior_snd_una</span> <span class="o">+</span> <span class="nv">snd_wnd</span><span class="p">,</span> <span class="nv">snd_wl1</span><span class="p">)</span>
<span class="nb">printf</span><span class="p">("</span><span class="s2">flag: %x, may update window: %d</span><span class="se">\n</span><span class="p">",</span> <span class="nv">$flag</span><span class="p">,</span> <span class="nv">$flag</span> <span class="o">&</span> <span class="mh">0x02</span><span class="p">)</span>
<span class="nb">printf</span><span class="p">("</span><span class="s2">rcv_tsval: %d, ts_recent: %d</span><span class="se">\n</span><span class="p">",</span> <span class="nv">rcv_tsval</span><span class="p">,</span> <span class="nv">ts_recent</span><span class="p">)</span>
<span class="k">print</span><span class="p">("</span><span class="se">\n</span><span class="p">")</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>Systemtap works by converting systemtap scripts into C and building a
kernel module that hot-patches the kernel and overrides specific
instructions. Here we overrode <code class="language-plaintext highlighter-rouge">tcp_ack()</code>, hooked at its end and
dumped the internal TCP connection state. The <code class="language-plaintext highlighter-rouge">$sk->sk_send_head !=
NULL</code> check is a quick way to only match connections that have a
non-empty <code class="language-plaintext highlighter-rouge">Send-Q</code>.</p>
<p>Loading the resulting module into the kernel gave us the following:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">ack: 4212361596, ack_seq: 1253278418, prior_snd_una: 4212361596
sk_send_head seq:4212361596, end_seq: 4212425472
snd_wnd: 0, tcp_wnd_end: 4212361596, snd_wl1: 1708927328
flag: 4100, may update window: 0
rcv_tsval: 2950255047, ts_recent: 2950255047</code></pre></figure>
<p>The two things of interest here are <code class="language-plaintext highlighter-rouge">snd_wl1: 1708927328</code> and
<code class="language-plaintext highlighter-rouge">ack_seq: 1253278418</code>. Not only are they not identical as we would
expect, but actually <code class="language-plaintext highlighter-rouge">ack_seq</code> is <em>smaller</em> than <code class="language-plaintext highlighter-rouge">snd_wl1</code>, indicating
that <code class="language-plaintext highlighter-rouge">ack_seq</code> wrapped around at some point and <code class="language-plaintext highlighter-rouge">snd_wl1</code> has not been
updated for a while. Using the <a href="https://en.wikipedia.org/wiki/Serial_number_arithmetic">serial number arithmetic</a> rules, we can figure out that this end has
received (at least) 3.8 GB since the last update of <code class="language-plaintext highlighter-rouge">snd_wl1</code>.</p>
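<p>That serial-arithmetic estimate is easy to reproduce. A small sketch using the values from the systemtap dump above (the <code>&gt;&gt;&gt; 0</code> forces an unsigned 32-bit wrap-around, mirroring TCP's modulo-2^32 sequence space):</p>

```javascript
// Distance travelled in TCP sequence space from one sequence number to
// another, computed modulo 2^32 (sequence numbers wrap around).
function seqDistance(from, to) {
  return (to - from) >>> 0; // unsigned 32-bit subtraction
}

// Values from the systemtap dump: snd_wl1 and the (wrapped) ack_seq.
const bytesSinceUpdate = seqDistance(1708927328, 1253278418);
console.log((bytesSinceUpdate / 1e9).toFixed(2) + ' GB'); // prints "3.84 GB"
```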
<p>We already saw that <code class="language-plaintext highlighter-rouge">snd_wl1</code> contains the last sequence number used
to update the peer’s receive window (and thus our send window), with
the ultimate purpose of guarding against window updates from old
segments. It should be okay if <code class="language-plaintext highlighter-rouge">snd_wl1</code> is not updated for a while,
but it should not lag too far behind <code class="language-plaintext highlighter-rouge">ack_seq</code>, or else we risk
rejecting valid window updates, as in this case. So it looks like the
Linux kernel fails to update <code class="language-plaintext highlighter-rouge">snd_wl1</code> under some circumstances, which
leads to an inability to recover from a zero-window condition.</p>
<p>Having tangible proof that something was going on in the kernel, it
was time to get people familiar with the networking code in the loop.</p>
<h2 id="taking-things-upstream">Taking things upstream</h2>
<p>After sleeping on this, we wrote a good summary of what we knew so far
and what we supposed was happening, and reached out to <a href="https://lore.kernel.org/netdev/87eelz4abk.fsf@marvin.dmesg.gr/T/#u">the Linux
networking maintainers</a>. Confirmation came less than two
hours later, <a href="https://lore.kernel.org/netdev/87eelz4abk.fsf@marvin.dmesg.gr/T/#mf568052a4f9d76d847ae192d3632b8e87083d75a">together with a patch by Neal
Cardwell</a>.</p>
<p>Apparently, the bug was in the <em>bulk receiver fast-path</em>, a code path
that skips most of the expensive, strict TCP processing to optimize
for the common case of bulk data reception. This is a significant
optimization, outlined 28 years ago² by Van Jacobson in his <a href="https://www.pdl.cmu.edu/mailinglists/ips/mail/msg00133.html">“TCP
receive in 30 instructions” email</a>. It turned out that
the Linux implementation did not update <code class="language-plaintext highlighter-rouge">snd_wl1</code> while in the
receiver fast path. If a connection uses the fast path for too long,
<code class="language-plaintext highlighter-rouge">snd_wl1</code> will fall so far behind that <code class="language-plaintext highlighter-rouge">ack_seq</code> will wrap around with
respect to it. And if this happens while the receive window is zero,
there is no way to re-open the window, as demonstrated above. What’s
more, this bug had been present in Linux <a href="https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/net/ipv4/tcp_input.c?h=2.1.8&id=0f9cac5b27076f801b29a0867868e1bce7310e00&ignorews=1">since v2.1.8</a>, dating
back to 1996!</p>
<p>² This optimization is still relevant today: a relatively recent
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=45f119bf936b1f9f546a0b139c5b56f9bb2bdc78">attempt</a> to remove the header prediction code and associated fast
paths to simplify the code was <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=31770e34e43d6c8dee129bfee77e56c34e61f0e5">reverted</a> on performance
regression grounds.</p>
<p>As soon as we got the patch, we applied it, rebuilt the kernel,
deployed it on the affected machines and waited to see if the issue
was fixed. A couple of days later we were certain that the fix was indeed
correct and did not cause any ill side-effects.</p>
<p>After a bit of discussion, the <a href="https://patchwork.ozlabs.org/project/netdev/patch/20201022143331.1887495-1-ncardwell.kernel@gmail.com/">final commit</a> landed in
<code class="language-plaintext highlighter-rouge">linux-net</code>, and from there it was merged into Linux mainline for 5.10-rc1.
Eventually it found its way to the stable 4.9 and 4.19 kernel series that we
use on our Debian systems, in 4.9.241 and 4.19.153 respectively.</p>
<h2 id="aftermath">Aftermath</h2>
<p>With the fix in place, we still had a couple of questions to answer,
namely:</p>
<ul>
<li>
<p>How is it possible for a TCP bug that leads to stuck connections to
go unnoticed for 24 years?</p>
</li>
<li>
<p>Out of an infrastructure with more than 600 systems running all kinds of
software, how come we only witnessed this bug when using rsync?</p>
</li>
</ul>
<p>It’s hard to give a definitive answer to these questions, but we can
reason about it this way:</p>
<ol>
<li>
<p>This bug will not be triggered by most L7 protocols. In
“synchronous” request-response protocols such as HTTP, usually
each side will consume all available data before sending. In this
case, even if <code class="language-plaintext highlighter-rouge">snd_wl1</code> wraps around, the bulk receiver will be
left with a non-zero window and will still be able to send out
data, causing the next acknowledgment to update the window and
adjust <code class="language-plaintext highlighter-rouge">snd_wl1</code> through check ❶ in <code class="language-plaintext highlighter-rouge">tcp_may_update_window</code>. <code class="language-plaintext highlighter-rouge">rsync</code> on the
other hand uses a pretty aggressive pipeline where the server might send
out multi-GB responses without consuming incoming data in the process.
Even in <code class="language-plaintext highlighter-rouge">rsync</code>’s case, using <code class="language-plaintext highlighter-rouge">rsync</code> over SSH (a rather common
combination) rather than the plain TCP transport would not expose this bug,
as SSH framing/signaling would most likely not allow data to queue up on
the server this way.</p>
</li>
<li>
<p>Regardless of the application protocol, the receiver must remain in the
fast path with a zero send window long enough (receiving at least 2 GB)
to cause a wrap-around, but not so long that <code class="language-plaintext highlighter-rouge">ack_seq</code>
overtakes <code class="language-plaintext highlighter-rouge">snd_wl1</code> again. For this to happen, there must be no
packet loss or other conditions that would cause the fast path’s header
prediction to fail. This is very unlikely to happen in practice as TCP
itself determines the network capacity by actually causing packets to be
lost.</p>
</li>
<li>
<p>Most applications care about network timeouts and will either fail or
reconnect, making the problem appear as a “random network glitch” and
leaving behind no trace to debug.</p>
</li>
</ol>
<p>Finally, even if none of the above happens and you end up with a stuck
TCP connection, it takes a lot of annoyance to decide to deal with it
and drill deep in kernel code. And when you do, you are rewarded with
a nice adventure, where you get to learn about internet protocol
history, have a glimpse at kernel internals, and witness open source
work in motion!</p>
<hr />
<p>If you enjoyed reading this post and you like hunting weird bugs and
looking at kernel code, you might want to drop us a line
— we are always looking for talented <a href="https://apply.workable.com/skroutz/j/485671FB1F/">SREs</a> and <a href="https://apply.workable.com/skroutz/j/9D8A0589DE/">DevOps
Engineers</a>!</p>
<p><a href="https://engineering.skroutz.gr/blog/uncovering-a-24-year-old-bug-in-the-linux-kernel/">Uncovering a 24-year-old bug in the Linux Kernel</a> was originally published by Apollon Oikonomopoulos at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on February 10, 2021.</p>https://engineering.skroutz.gr/blog/speed-the-journey-to-delivering-a-faster-experience-at-skroutz-gr2020-10-22T21:00:00+00:002020-10-22T21:00:00+00:00Skroutz Engineering Teamhttps://engineering.skroutz.gr<h1 id="tldr">TL;DR</h1>
<p>We’ve always placed the user experience first, here at Skroutz. Since a performant application is essential for a seamless journey, speed has always been at our core.</p>
<p>Our rapidly evolving environment (the growing number of development teams, the adoption of new technologies, the addition of new features, etc.) gradually slowed us down.</p>
<p>We knew we had to take action.</p>
<p>For this, we formed a non-typical task-force team to speed us up. We identified the problems, chose our measurement tools and methods and took the plunge.</p>
<p>Measuring performance is not an easy task. It involves both user perception and strictly defined metrics and thresholds.</p>
<p>In order to improve the speed, we tried various solutions. Some worked. Some didn’t. Below you can read in short the key takeaways.</p>
<p><strong>Assets</strong>. Our main goal was to optimize the number and timing of requests. By initially loading only the necessary above-the-fold images and fine-tuning our lazy-loading mechanisms, we saw significant gains in initial requests (almost half in our Product page and up to 30 fewer in our Listing) and therefore a worthwhile improvement in Speed Index metrics (in some cases up to ~4.5%).</p>
<p><strong>HTML</strong>. Excessive DOM size was one of our most critical performance bottlenecks. Our Product pages (our most important section) could reach up to ~8k nodes in some cases, far from Google’s recommendation of 1.5k.<br />
We tried various solutions, including windowing (rejected), asynchronously loading product cards’ content, and showing fewer user reviews (at the risk of losing valuable user-generated content).<br />
What did make a huge difference was timing: loading the information only when it actually needed to exist. We achieved this by implementing a mechanism that notifies each card when it is about to appear in the viewport; the only element needed beforehand is a single-node placeholder. In some cases the DOM nodes were reduced by 45%, which resulted in an increase of ~10 points in our overall Lighthouse score!</p>
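<p>A viewport-notification mechanism like the one described can be sketched with the <code>IntersectionObserver</code> API. This is an illustrative outline, not our production code; the selector, <code>renderCard</code> callback and <code>rootMargin</code> value are all hypothetical:</p>

```javascript
// Hydrate each product-card placeholder just before it scrolls into view.
// `renderCard` is a hypothetical function that replaces the single-node
// placeholder with the card's full DOM subtree.
function observeCards(renderCard, rootMargin = '200px') {
  if (typeof IntersectionObserver === 'undefined') return null; // no API: caller should render eagerly
  const observer = new IntersectionObserver((entries) => {
    for (const entry of entries) {
      if (entry.isIntersecting) {
        renderCard(entry.target);
        observer.unobserve(entry.target); // hydrate each card only once
      }
    }
  }, { rootMargin }); // start rendering shortly before the card enters the viewport
  document.querySelectorAll('.card-placeholder').forEach((el) => observer.observe(el));
  return observer;
}
```

<p>The <code>rootMargin</code> acts as a pre-fetch buffer, so the card's DOM is usually in place by the time the user actually scrolls to it.</p>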
<p><strong>CSS</strong>. Although our styling architecture was in pretty good shape, we thought it might be worth trying critical CSS. The concept was to initially load only the styles necessary to render everything above the fold. This would improve metrics such as First Contentful Paint &amp; Largest Contentful Paint while making the loading feel faster. It turned out that the improvement in these metrics was too slight to justify the effort needed to add this step to our pipeline. In short, this didn’t work for us.</p>
<p><strong>Javascript</strong>. Moving gradually from static to interactive pages caused code bloat, especially on the JavaScript side. Our main JS file included lots of libraries that were not used on every page. This is a problem, especially on mobile devices, because JS runs on the main thread.<br />
Our actions, aimed at reducing our webpack bundle size to free up main-thread work during the initial load, and at iterating on our Redux architecture to improve speed after user interaction, led to slightly better performance.</p>
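<p>One common way to shrink a main bundle is code splitting with dynamic <code>import()</code>, so rarely-used libraries are fetched only when first needed. A hedged sketch; the module path and <code>Gallery</code> class are illustrative, and the loader is injectable here purely to keep the logic testable outside webpack:</p>

```javascript
// Load a heavy library on demand instead of shipping it in the main bundle.
// With webpack, the dynamic import() below is emitted as a separate chunk.
async function openGallery(container, loadGallery = () => import('./gallery')) {
  const { default: Gallery } = await loadGallery(); // network fetch happens only now
  return new Gallery(container);
}
```

<p>Until the user triggers the feature, none of the library's bytes are parsed or executed on the main thread.</p>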
<p>During this journey, we also started addressing some issues on new <strong>Web Vitals</strong> user-centric metrics. We mainly focused on visual stability, by eliminating any layout shifts.</p>
<p>After a year’s work, we <strong>made Skroutz.gr faster</strong>. And more stable.</p>
<p>If you are interested in more details, and you’re ready for a deeper technical dive, make yourself a coffee and keep on reading (it will take ~30 minutes to read).</p>
<hr />
<blockquote>
<p><strong>Table of Contents</strong></p>
<p><a href="#a-brief-history">A Brief History</a> <br /></p>
<p><a href="#speed-not-a-metric-but-a-users-issue">Speed: not a Metric, but a Users’ Issue</a> <br /></p>
<p><a href="#evolution-of-performance-metrics-from-speed-index-to-core-web-vitals">Evolution of Performance Metrics: from Speed Index to Core Web Vitals</a> <br />
› <a href="#pagespeed-insights-psi">Pagespeed Insights (PSI)</a> <br />
› <a href="#core-web-vitals">Core Web Vitals</a> <br /></p>
<p><a href="#the-problems-of-skroutzgr">The Problems of Skroutz.gr</a> <br />
› <a href="#html">HTML</a> <br />
› <a href="#css">CSS</a> <br />
› <a href="#javascript">Javascript</a> <br />
› <a href="#assets">Assets</a> <br /></p>
<p><a href="#the-journey-what-worked-and-what-didnt">The Journey: What Worked and What Didn’t</a> <br />
› <a href="#assets-networking">Assets</a> <br />
› <a href="#html-1">HTML</a> <br />
› <a href="#css-1">CSS</a> <br />
› <a href="#javascript-1">Javascript</a> <br />
› <a href="#core-web-vitals-cumulative-layout-shifts-cls-issues">Core Web Vitals: Cumulative Layout Shift (CLS)</a> <br /></p>
<p><a href="#onwards---closing">Onwards - Closing</a> <br /></p>
</blockquote>
<h1 id="a-brief-history">A Brief History</h1>
<p><a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> has always been a fast and sophisticated web application.</p>
<p>Speed has always been a critical component for <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> since we believe
that for a modern web experience, it’s important to get fast and stay fast.</p>
<p>Historically, the biggest problem we faced regarding speed (and our biggest blessing at the same time)
was the huge amount of content (DOM) in some of our most popular pages, which contain a lot of shops and user-generated content, like reviews, questions, etc.
This problem grows as we add extra information for Products and Categories or extra services
(we have developed a <a href="https://www.skroutz.gr/ecommerce/landing">Marketplace functionality</a> where users can buy directly from Skroutz.gr).</p>
<p>Back in 2016, the huge DOM of some pages was causing crashes due to memory restrictions on some devices (e.g. the iPad),
while at the same time rendering and painting performance was poor.
<a href="https://engineering.skroutz.gr/blog/Skroutz-redesign-how-we-designed-and-implemented-our-own-Design-System/#html" target="_blank">To solve these issues at that time</a>,
we started requesting and rendering elements asynchronously.</p>
<p>However, since <a href="https://engineering.skroutz.gr/blog/Skroutz-redesign-how-we-designed-and-implemented-our-own-Design-System/" target="_blank">our last major redesign in 2016</a>,
lots of things have changed.</p>
<p>Facts like the rapidly growing number of development teams, the adoption of new technologies (e.g. React, CSS Grid),
and the addition of more and more features to our pages led to worse rendering performance, despite the fact
that our applications now run on better and more powerful devices.</p>
<p>Rendering speed took a backseat.</p>
<p>On the other hand, one of the main questions we’re regularly asking ourselves here at Skroutz, is whether our website responds to our users’ expectations and what we can do in order to help them with their buying decisions. When it comes to user experience, speed matters.</p>
<p>Today, consumers are more demanding than they’ve ever been. When they weigh up the experience on a site, they aren’t just
comparing it with its competitors, they’re rating it against the best-in-class services they use every day.</p>
<p>Being of “Moderate Speed” was not acceptable for us, so we decided to take action in order to resolve the issues.</p>
<p>We formed a non-typical task-force team, consisting of engineers, SEO specialists and product owners,
and started working on improving our speed.</p>
<p>In what follows, we describe what we did, how we measured our actions, what worked for us, what didn’t, and some
takeaways from our experience during the journey.</p>
<hr />
<h1 id="speed-not-a-metric-but-a-users-issue">Speed: not a Metric, but a Users’ Issue</h1>
<p>Imagine you’re walking through an unfamiliar city to get to an important appointment. <br />
You walk through various streets and city centers on your way. But here and there, there are slow automatic doors
you have to wait for to open and unexpected construction detours lead you astray. All of these events interrupt
your progress, increase stress and distract you from reaching your destination.</p>
<p>People using the web are also on a journey, with each of their actions constituting one step in what would ideally be a continuous flow.
And just like in the real world, they can be interrupted by delays, distracted from their tasks and led to make errors. <br />
These events, in turn, can lead to reduced satisfaction and abandonment of a site or the whole journey.</p>
<p>In both cases, removing interruptions and obstacles is the key to a smooth journey and a satisfied user
[<a href="https://blog.chromium.org/2020/05/the-science-behind-web-vitals.html" target="_blank">chromium blog</a>].</p>
<p>When it comes to user experience, speed matters. A
<a href="https://www.ericsson.com/en/press-releases/2016/2/streaming-delays-mentally-taxing-for-smartphone-users-ericsson-mobility-report" target="_blank">consumer study</a>
shows that the <strong>stress response to delays in mobile speed is similar to that of watching a horror movie or solving
a mathematical problem</strong>, and greater than waiting in a checkout line at a retail store [<a href="https://web.dev/why-speed-matters/" target="_blank">ref</a>]. <br /></p>
<p>Website performance is crucial to a web application’s success. <br /></p>
<p>Amazon found that each additional 1/10th of a second of load time corresponded with a 1% reduction in sales.
Walmart found that for every second they improved their page load times they added an additional 2% to their conversion rate
[<a href="https://www.alphabetcreative.com/speed-matters-website-performance-and-perception/" target="_blank">ref</a>].
eBay saw a 0.5% increase in “Add to Cart” count for every 100 milliseconds improvement in search page loading time
[<a href="https://web.dev/shopping-for-speed-on-ebay/" target="_blank">ref</a>].</p>
<p>Besides conversion rates, you may know that <a href="https://webmasters.googleblog.com/2018/01/using-page-speed-in-mobile-search.html" target="_blank">Google uses the performance of a website as a ranking factor</a> in search results as well!</p>
<p>In his book <a href="https://www.nngroup.com/books/usability-engineering/" target="_blank">Usability Engineering (1993), Jakob Nielsen</a>*
identifies three main response time limits.</p>
<ul>
<li><strong>0.1 second</strong> — Operations that complete in 100ms or fewer feel instantaneous to the user.
This is the gold standard to aim for when optimising a website.</li>
<li><strong>1 second</strong> — Operations that take 1 second to finish are generally OK, but the user will feel the pause.
If all operations take 1 second to complete, a website may feel a little sluggish.</li>
<li><strong>10 seconds</strong> — If an operation takes 10 seconds or more to complete, the user may switch over to a new tab,
or give up on the website completely (this depends on what operation is being completed.
For example, users are more likely to stick around if they’ve just submitted their card details in the checkout
than if they’re waiting to load a product page).</li>
</ul>
<p>* <em>Since these limits were published back in 1993, internet speeds have increased and we now browse the web
at a lightning pace, so there is speculation that the upper limit is much smaller today, closer to 5 seconds or even lower.</em></p>
<p><strong>Takeaway: Performance is important</strong>! It can mean the difference between making a sale, or losing a customer to the competition.</p>
<hr />
<h1 id="evolution-of-performance-metrics-from-speed-index-to-core-web-vitals">Evolution of Performance Metrics: from Speed Index to Core Web Vitals</h1>
<p>Performance is a foundational aspect of good user experiences.</p>
<p><strong>But what exactly is Performance?</strong></p>
<p>And how do we put a page in the fast or in the slow bucket?</p>
<p>Users of the web expect the pages they visit to render quickly, be interactive and feel smooth.
Pages should not only load quickly, but also run well; scrolling should be stick-to-finger fast, and animations and interactions should be silky smooth.</p>
<p>Performance is more about user perception and less about the actual, objective duration.
How fast a website feels like it’s loading and rendering has a greater impact on user experience than how fast the website actually loads and renders.</p>
<p>How fast or slow something feels depends a lot on whether the user is actively or passively waiting for it to happen. Waits have an active and a passive phase. When the user is active - moving the mouse,
thinking, being entertained - they are in the active phase. <br />
The passive phase occurs when the user is passively waiting, like staring at a monochrome screen. If the passive and active wait times were objectively equal, users would estimate that the passive wait was longer than the active one. If a load, render, or response time cannot be objectively minimized any further, turning a passive wait into an
active wait can make it feel faster.</p>
<p>Besides perception, as the web evolves over time, the metrics and the thresholds evolve too.</p>
<p>How we measure and assort a page today regarding their rendering speed, may be completely irrelevant tomorrow.</p>
<p>While a lot of things constantly change, there is something that remains the same: <strong>human perceptual abilities</strong>, which are critical in evaluating an experience.</p>
<p>But how do we practically evaluate whether a page is fast or not in Skroutz all these years?</p>
<p>There are 2 main phases regarding this.</p>
<p>We used to focus on low level timings, like the Time to First Byte (server response, networking), the <a href="https://developer.mozilla.org/en-US/docs/Glossary/Speed_index" target="_blank">Speed Index</a> (visual display),
the First Paint etc. <br />
Now, we try to incorporate more quality user metrics.</p>
<p>Let’s see the most important ones… starting from Google.</p>
<p>According to Google too, speed matters. For this, Google encourages developers to think broadly about how performance affects a user’s experience of their page and to consider a variety of user experience metrics.</p>
<p>At the time of writing, the following are some resources we use at Skroutz to evaluate a page’s performance:</p>
<ul>
<li><a href="https://developers.google.com/web/tools/lighthouse/" target="_blank">Lighthouse</a>, an automated tool and a part of
Chrome Developer Tools for auditing the quality (performance, accessibility, and more) of web pages. <br /></li>
<li><a href="https://developers.google.com/speed/pagespeed/insights/" target="_blank">PageSpeed Insights</a>, a tool that
indicates how well a page performs on the Chrome UX Report and suggests performance optimizations.</li>
<li><a href="https://web.dev/vitals/" target="_blank">Web Vitals</a> is the latest initiative by Google, to provide unified
guidance for quality signals that are essential to delivering a great user experience on the web.</li>
<li><a href="https://developers.google.com/web/tools/chrome-user-experience-report/" target="_blank">Chrome User Experience Report</a>,
a public dataset of key user experience metrics for popular destinations on the web, as experienced by <strong>Chrome users under real-world conditions</strong>.</li>
</ul>
<p>Google has long used page speed as a signal for rankings, and the new (and different) approach in this signal uses
data measured directly by Chrome on users’ desktop and mobile devices. As a result, Google <a href="https://webmasters.googleblog.com/2020/05/evaluating-page-experience.html" target="_blank">announced</a>
that in 2021 the Core Web Vitals metrics will join other user experience (UX) signals to become <strong>a ranking signal</strong>.</p>
<h2 id="pagespeed-insights-psi">PageSpeed Insights (PSI)</h2>
<p><a href="https://developers.google.com/speed/docs/insights/v5/about" target="_blank">Google’s PageSpeed Insights (PSI)</a> reports on the performance of a page on both mobile and desktop devices, and provides suggestions on how that
page may be improved.</p>
<p>PSI provides both <strong>lab and field data</strong> about a page. Lab data is useful for debugging performance issues, as it is
collected in a controlled environment. However, it may not capture real-world bottlenecks. Field data is useful for
capturing true, real-world user experience - but has a more limited set of metrics. See <a href="https://developers.google.com/web/fundamentals/performance/speed-tools" target="_blank">How To Think About Speed Tools</a> for more information on the 2 types of data.</p>
<p>At the top of the report, PSI provides a score which summarizes the page’s performance. This score is determined by running Lighthouse to collect and analyze lab data about the page. A score of 90 or above is considered good, 50 to 89 needs improvement, and below 50 is considered poor.</p>
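<p>These thresholds are easy to encode. A minimal helper for bucketing a score (the function name is ours, not part of the PSI API):</p>

```javascript
// Map a Lighthouse/PSI performance score (0-100) to its rating bucket.
function psiRating(score) {
  if (score >= 90) return 'good';
  if (score >= 50) return 'needs improvement';
  return 'poor';
}
```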
<h2 id="core-web-vitals">Core Web Vitals</h2>
<p><a href="https://web.dev/vitals/#core-web-vitals" target="_blank">Core Web Vitals</a> are the subset of Web Vitals that apply to all web pages, should be measured by all site owners, and will be surfaced across all Google tools.</p>
<p>Each of the Core Web Vitals represents a distinct facet of the user experience, is measurable in the field, and reflects the real-world experience of a critical user-centric outcome.</p>
<p>Although the metrics that make up Core Web Vitals will evolve over time, the current set for 2020 focuses on three aspects of the user experience: <strong>loading</strong>, <strong>interactivity</strong>, and <strong>visual stability</strong>:</p>
<ul>
<li>Largest Contentful Paint (LCP): measures loading performance. To provide a good user experience, LCP should occur within 2.5 seconds of when the page first starts loading.</li>
<li>First Input Delay (FID): measures interactivity. To provide a good user experience, pages should have a FID of less than 100 milliseconds.</li>
<li>Cumulative Layout Shift (CLS): measures visual stability. To provide a good user experience, pages should maintain a CLS of less than 0.1.</li>
</ul>
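<p>The thresholds above can be captured in a small helper. The sketch below is our own illustration (the function and constant names are not part of any Google library); the “poor” boundaries (4000ms LCP, 300ms FID, 0.25 CLS) follow Google’s published guidance:</p>

```javascript
// Hypothetical helper: rate a Core Web Vitals measurement against the
// 2020 thresholds listed above.
const THRESHOLDS = {
  LCP: [2500, 4000], // milliseconds
  FID: [100, 300],   // milliseconds
  CLS: [0.1, 0.25],  // unitless layout-shift score
};

function rateVital(metric, value) {
  const [good, poor] = THRESHOLDS[metric];
  if (value <= good) return 'good';
  if (value <= poor) return 'needs improvement';
  return 'poor';
}

// e.g. rateVital('LCP', 2100) → 'good', rateVital('CLS', 0.3) → 'poor'
```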
<hr />
<h1 id="the-problems-of-skroutzgr">The Problems of Skroutz.gr</h1>
<p>Generally, when a user types a URL in her browser, the browser makes a GET request to a remote server; the server responds with some resources which, when they arrive at the browser, are combined to render the page.</p>
<p>For this procedure to complete, besides networking timings and delays, one of the most critical factors is the <strong>weight of the requested resources</strong>.</p>
<p>These resources are usually the HTML (from which the DOM gets built), the CSS (from which the CSSOM gets built), probably one or more JS scripts, and images and fonts (assets). Let’s break down each one.</p>
<h2 id="html">HTML</h2>
<p>A large DOM tree can slow down page performance in multiple ways.</p>
<p>First of all, a large DOM tree often includes many nodes that aren’t visible when the user first loads the page, which unnecessarily increases data costs for the users and slows down load time. Furthermore, as users and scripts interact with the page, the browser must constantly recompute the position and styling of nodes, causing rendering lags. Last but not least, targeting elements (through CSS or JS) applies to a large number of nodes,
which can overwhelm the memory capabilities of devices.</p>
<p>Skroutz’s main issue at the time was the excessive DOM size, especially on Product pages.</p>
<p>Unfortunately, our Product pages are the most important sections of our application and have a lot of content, user generated or not. Even worse, Product pages with a lot of content (and excessive DOM) are the most popular ones, since the content regards many shops, a lot of product information, multimedia, many user reviews etc.</p>
<p>Although many sections were already coming in asynchronously, they were still too heavy. At that time, our heaviest Product pages had <strong>~8K nodes</strong>. This was far from Google’s Lighthouse recommendation of <strong>1.5K nodes total maximum</strong>.</p>
<p>Below is a graph of our 3,000 most visited Product pages, showing their shop and review counts.</p>
<p><img src="https://engineering.skroutz.gr/images/speed_at_skroutz/shops_and_reviews.png" alt="'shops and reviews'" /></p>
<p class="caption">Shops & reviews of top 3.000 products</p>
<p>With the most popular pages having more than 30 shop cards and at least 30 user reviews each, it was clear that we had to find ways to lighten the weight without running the risk of getting hit by an SEO issue (rankings).</p>
<p>That was a quite difficult exercise to solve.</p>
<h2 id="css">CSS</h2>
<p>CSS is, besides HTML, the most critical component for a browser.</p>
<p>The browser can only paint the page once it has downloaded the CSS and built the CSS object model. For this reason, CSS is render blocking.</p>
<p>Browsers follow a specific rendering path: paint only occurs after layout, which occurs after the render tree is created, which in turn requires both the DOM and the CSSOM trees.</p>
<p>Our styling architecture was in pretty good shape (<a href="https://engineering.skroutz.gr/blog/Skroutz-redesign-how-we-designed-and-implemented-our-own-Design-System/#css" target="_blank">you can read our approach in detail here, which is close to the current state</a>).</p>
<p>We bundle our CSS files depending on the viewport (mobile-first approach) and we further separate them in a few major sections in order for them to be easily handled from the browser (i.e. Books section, logged section etc).</p>
<p>As it was all the hype during this period, we thought we could try critical CSS, especially on mobile viewports, to test whether it could speed up the rendering process.</p>
<h2 id="javascript">Javascript</h2>
<p>When a browser runs many events, it’s going to do it on the same thread that handles user input (called the main thread).</p>
<p>By default, the main thread of the renderer process typically handles most code: it parses the HTML and builds the DOM, parses the CSS and applies the specified styles, and parses, evaluates, and executes Javascript.</p>
<p>The main thread also processes user events. So, any time the main thread is busy doing something else, a web page may not respond to user interactions, leading to a bad experience.</p>
<p>Loading too much Javascript into the main thread (via <code class="language-plaintext highlighter-rouge"><script></code>, etc.) was the main issue for us, especially for mobile devices.</p>
<p>The size of our JS bundle (named skr_load.js) was 312KB after compression (1.2MB uncompressed)!</p>
<p>The main issues regarding Javascript were the following:</p>
<ul>
<li>Lack of Tree shaking, many unused components and dead code</li>
<li>Lots of application and library code were in the same big fat JS bundle</li>
<li>Lots of libraries like <strong>lodash</strong> were fully imported instead of partially</li>
<li>Heavy dependencies included in the abovementioned JS bundle even though they were not needed on every page</li>
</ul>
<h2 id="assets">Assets</h2>
<p>According to HTTP Archive, as of November 2018, images make up on average 21% of a total webpage’s weight.</p>
<p>So when it comes to optimizing a website, after video content, images are by far the first place one should start!</p>
<p>Optimizing images is more important than scripts and fonts.</p>
<p>And ironically, a good image optimization workflow is one of the easiest things to implement, yet a lot of website owners overlook this.</p>
<p>This was true for us too.</p>
<p>We found many images in different sections that got requested initially, although they weren’t rendered unless the users scrolled down a lot.</p>
<hr />
<h1 id="the-journey-what-worked-and-what-didnt">The Journey: What Worked and What Didn’t</h1>
<p>Having written down the total set of performance bottlenecks, it was time for action.</p>
<p>Although for most web pages it’s pretty straightforward what’s necessary for a better rendering performance, this was not true for us.</p>
<p>Because there is one magic word, regarding speed: <strong>diet</strong>!</p>
<p>In general, page speed could be improved by reducing the payload across all resources. By simply loading less code. Trimming all the unused and unnecessary bytes of JavaScript, CSS, HTML, and JSON responses served to users.</p>
<p>However, <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> is a popular web application with more than <strong>30 million sessions per month</strong>.</p>
<p>We had to be very careful in terms of user experience, since even a small change could add up to a huge drop in sales.</p>
<p>Furthermore, the majority of our visitors come from organic searches, so we had to deploy that diet without running the risk to negatively impact our SEO performance.</p>
<p>Here is how we did it.</p>
<h2 id="assets-networking">Assets (networking)</h2>
<p>While, according to our initial analysis, the main bottlenecks were DOM size (HTML) and JS scripting, we opted for the low-hanging fruit first.</p>
<p><strong>Assets loading</strong> was the first and most obvious place to look for unnecessary initial calls that could easily be made async.</p>
<p><strong>Images’ optimization</strong> was our best shot regarding assets, since we don’t have any non-safe webfonts or any other assets.</p>
<p>For the most part, images were loading on scroll and were adequately lightweight and optimized. But we had room for improvement.</p>
<h3 id="product-page">Product page</h3>
<p>In our Product pages, however, there were a few exceptions, mostly due to our -somewhat outdated- image lightbox.</p>
<p>Although UI-wise there are only 5 visible thumbnails on large screens (image below) and none on mobile, all images’ thumbs were loaded beforehand, along with the first high-res image of the carousel. The more the images, the bigger the problem. Note that our most popular Product pages, like mobile phones, could have anywhere from 20 to 30 images each.</p>
<p><img src="https://engineering.skroutz.gr/images/speed_at_skroutz/thumbnail_gallery.png" alt="'listing page speed index'" /></p>
<p class="caption">Product page's gallery thumbnails</p>
<p>The lightbox was indeed outdated, but so was the structure of the list holding the thumbs. A brief refactor not only saved the redundant image requests, it also saved 3 DOM nodes per lightbox image (minus the 5 visible thumbs on desktop).</p>
<p>Most notably, we removed the <code class="language-plaintext highlighter-rouge"><img></code> tags, which also held the data-attributes used to populate the lightbox. We moved the data-attributes to the parent <code class="language-plaintext highlighter-rouge"><li></code> and used anchor tags only for the 5 visible thumbs, placing the images as background-image directly on them.</p>
<p>Background-image, unlike regular <code class="language-plaintext highlighter-rouge"><img></code>, does not load unless visible*, thus saving the extra requests from mobile viewports without the need to have a different markup structure.</p>
<p>Taking into account some additional minor cuts (e.g. async load the 3 images of product suggestions, load first high-res lightbox image only after opening), image requests were reduced to almost half.</p>
<p>In numbers, one of our most popular phones with 25 images now makes 20 image requests instead of 39, all of them necessary above-the-fold images.</p>
<p>After deploying, Speed Index showed a decrease of ~4.5% (see image below, red line for the Product score).</p>
<p><img src="https://engineering.skroutz.gr/images/speed_at_skroutz/thumbnail_diet.png" alt="'Thumbnail diet results graph'" /></p>
<p class="caption">Thumbnails "diet" results graph</p>
<p>Apart from the top (above the fold) section, we already used lazy loading on the product cards in the Product pages, but there was some room for improvement and this involved the reviews section at the bottom of the page.</p>
<p>Down there, we noticed that the user thumbnails were loaded immediately even though they were far down below the fold. After some code inspection, we realized that there was a lazy loading mechanism (using an external library) but it didn’t work properly.</p>
<p>This was caused by a CSS rule that was setting the user thumbnail as a background image on the appropriate element. Thumbnails were loaded immediately* and the lazy loading library didn’t have to do any work at all.</p>
<p>We fixed this by removing the specific CSS rule and replacing the old lazy loading mechanism with a newer one (using Intersection Observer).</p>
<p>The results on pages with 30 reviews were:</p>
<ul>
<li>30 fewer HTTP requests <br /></li>
<li>30 - 100KB less data on page load</li>
</ul>
<p>Test results (table below) showed a small improvement, though this may just be score fluctuation in Pagespeed.
In any case, it was an easy fix that reduced HTTP calls and network traffic.</p>
<table>
<tr>
<th>User thumbnails load initially</th>
<th>User thumbnails load asynchronously</th>
<th>Difference</th>
</tr>
<tr>
<td>60.4</td>
<td>63.1</td>
<td>4.5%</td>
</tr>
</table>
<p class="caption">Pagespeed scores for user thumbnails</p>
<p>* <small>Images in stylesheets will trigger an HTTP request only after the render tree has been calculated and the corresponding elements are about to be rendered. However there are <a href="https://csswizardry.com/2018/06/image-inconsistencies-how-and-when-browsers-download-images/" target="_blank">inconsistencies among browsers</a>.</small></p>
<h3 id="listing-page">Listing page</h3>
<p>In <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a> we have 2 types of Listing layouts: normal & tile.</p>
<ul>
<li><strong>Normal (list) layout</strong>: every row has one product which translates to one image per row.</li>
<li><strong>Tile layout</strong>: every row has more than one product, which means more images per row (4 in desktop, 2 in mobile viewports).</li>
</ul>
<p>In normal layout, we had an average Pagespeed performance score range from 80 to 90+ and in tile layout from 40+ to 50+.</p>
<p>Truth be told, tile rows are taller than list rows, so the ratio is not exactly 4:1, but generally speaking tile lists load more images/products than normal lists.</p>
<p>In tile layout lists, we had more than 60 HTTP requests for images, totalling about 800KB of data.</p>
<p>That’s a lot of requests and data we could shave off!</p>
<p>We tried solving this with the native HTML attribute “loading”.</p>
<p>This posed 2 problems: <br />
First, browser coverage is somewhat low (~70%), mainly because Safari does not support the feature (as of 07/2020). <br />
Second, browsers implement native lazy load differently. The biggest difference is between Chrome and Firefox.
Chrome is playing it safe, loading a lot of images before being scrolled into view (they’re trying to find the sweet spot).
On the other hand, Firefox is really aggressive with lazy loading, only loading images that are 50% or more inside the viewport.</p>
<p>As we couldn’t rely on HTML for this, JS came to the rescue.</p>
<p>We created a React Higher Order Component (<a href="https://reactjs.org/docs/higher-order-components.html" target="_blank">HOC</a>) that utilises <a href="https://developer.mozilla.org/en-US/docs/Web/API/Intersection_Observer_API" target="_blank">IntersectionObserver</a> capabilities. <br /></p>
<p>Using this HOC, we implemented lazy loading in Listing images that works in the same way in every browser that supports Intersection Observer API (almost 90% including Safari).</p>
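<p>Our actual implementation is a React HOC, but the core pattern can be sketched framework-free. The snippet below is illustrative, not our production code: images start with a <code>data-src</code> attribute instead of <code>src</code>, so no request fires until the observer copies the attribute over.</p>

```javascript
// Sketch of IntersectionObserver-based lazy loading (not the actual
// Skroutz HOC). The browser makes no image request until we assign src.
function lazyLoadImages(images, rootMargin = '200px') {
  const observer = new IntersectionObserver((entries, obs) => {
    for (const entry of entries) {
      if (!entry.isIntersecting) continue;
      const img = entry.target;
      img.src = img.dataset.src; // triggers the actual HTTP request
      obs.unobserve(img);        // each image only needs loading once
    }
  }, { rootMargin });            // start fetching a bit before visibility
  images.forEach((img) => observer.observe(img));
  return observer;
}
```

The 200px <code>rootMargin</code> here is an illustrative threshold, not our production value.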
<p>We now have control over the loading threshold and we don’t rely on every different native implementation of every browser.</p>
<p>Running tests with Pagespeed Insights on quite heavy Listing pages (like <a href="https://www.skroutz.gr/c/1009/andrika-mpoufan.html" target="_blank">men’s jackets</a>) yielded some really good results (~10 points improvement).</p>
<p>Below is the graph of a heavy Listing’s page Speed Index and the average pagespeed improvement.</p>
<p><img src="https://engineering.skroutz.gr/images/speed_at_skroutz/lazy_load_listing.png" alt="'listing page speed index'" /></p>
<p class="caption">Speed index improvement graph</p>
<table>
<tr>
<th>Before lazy load</th>
<th>After lazy load</th>
<th>Difference</th>
</tr>
<tr>
<td>48.5</td>
<td>59.5</td>
<td>22.6%</td>
</tr>
</table>
<p class="caption">Pagespeed score for tile layout list</p>
<h2 id="html-1">HTML</h2>
<p>As already mentioned above, the excessive number of DOM nodes was one of our most critical performance bottlenecks in our pages.</p>
<h3 id="product-page-1">Product page</h3>
<p>Product pages render the shops that sell the product. In some popular ones, due to the large number of shops, the DOM nodes exceeded 8,000!</p>
<p>Undoubtedly, for our Product pages this was the most challenging part, since clicking through to shops is the most critical step in a buyer’s journey.</p>
<p>Google’s Lighthouse suggests that in order to optimize large lists, one should use a library called <strong>react-window</strong>. With this library, only the list items inside the viewport are rendered. <br />
In other words, while a user is scrolling through the shops’ list, the items actually rendered are the ones currently in the viewport, along with a few items before and after those already displayed.</p>
<p>This eventually did not work for us; the main obstacle was that the product cards did not have a fixed height.
Although the library provides a solution for dynamic list items, our shop cards contain a lot of information that has to be rendered, and the result wasn’t the expected one: many shop cards failed to render at the right time, mostly on “faster” scrolls, and the overall experience felt broken.</p>
<p>The solution was in another direction.</p>
<p>We had to load the information at the right time, when it was actually needed. It was crucial that the cards maintain their fixed height while loading, in order to avoid layout shifts.</p>
<p>In order to achieve this we had to separate the primary information, which defines the card’s height, from the secondary. We considered product links primary, because they designate the card’s height, and price, shop location, ratings etc. secondary.</p>
<p>The solution was to <strong>render a single node as a placeholder instead of a bunch of nodes</strong> that represent secondary information on the initial page load.</p>
<p>The next step was to implement a mechanism that would notify each card when it appeared in the viewport, and IntersectionObserver suited this perfectly!</p>
<p>Last and final step, for each card displayed in viewport, we replaced the placeholder with the actual information.</p>
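<p>The mechanism can be sketched as follows. This is a simplified, framework-free illustration with hypothetical names; our real version lives inside React components:</p>

```javascript
// Sketch: each shop card initially contains one fixed-height placeholder
// node. When the card nears the viewport, we swap in the secondary
// information (price, shop location, ratings) produced by renderSecondary.
function hydrateCardsOnVisible(placeholders, renderSecondary) {
  const observer = new IntersectionObserver((entries, obs) => {
    for (const entry of entries) {
      if (!entry.isIntersecting) continue;
      const node = entry.target;
      // The placeholder keeps the card's height fixed, so replacing its
      // contents causes no layout shift.
      node.innerHTML = renderSecondary(node.dataset.cardId);
      obs.unobserve(node);
    }
  }, { rootMargin: '300px' }); // hydrate slightly before the card is visible
  placeholders.forEach((node) => observer.observe(node));
  return observer;
}
```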
<p>By completing all the aforementioned steps, the number of the DOM nodes was
reduced dramatically.</p>
<p>In some cases the <strong>DOM nodes were reduced by 45%</strong>, which results in an increase of <strong>~10 points</strong> in our overall page score.</p>
<p>In addition to the abovementioned, we kept an eye on the <strong>users’ reviews section</strong>.</p>
<p>The reviews reduction experiment was part of our effort to reduce DOM elements in the Product page, without running the risk of dropping in organic results, from an SEO perspective.</p>
<p>User reviews are the most typical form of user-generated content (UGC). User reviews about a product is one of the most critical things that can impact purchasing decisions. Consumers are actively looking for content that is unique, relevant, and trustworthy. In fact, according to BrightLocal, 88 percent of consumers trust online reviews written by other consumers as much as they trust suggestions from their personal network
[<a href="https://www.brightlocal.com/research/local-consumer-review-survey-2014/" target="_blank">ref.</a>].</p>
<p>Yet what many don’t know is that UGC is also good for SEO. Search engines such as Google and Bing rank websites based on whether the sites’ content is relevant and useful. Over 25% of the search results for the 20 largest brands in the world are linked to user-generated content [<a href="https://www.pixlee.com/blog/seo-ideas-how-to-improve-seo-with-user-generated-content/" target="_blank">ref.</a>].</p>
<p>In order to reduce reviews’ number at initial load, we had to carefully implement and deploy an experiment first to see if the SEO can be impacted.</p>
<p>We currently render the first 30 reviews, with a “load more” button for loading the rest. Every review has roughly 30 DOM elements, which translates to about 900 elements on every Product page.</p>
<p>For the experiment, we divided Product pages into two groups, one with twelve (12) initial reviews and the other with seven (7).</p>
<p>First of all, we wanted to see how the reduced reviews impact rendering performance. <br />
Second, we kept an eye on conversion rates and the users’ onsite flow. <br />
Third, we monitored SEO performance, comparing the 2 groups with reduced review counts against a control group (no change). <br /></p>
<p>After running a number of Pagespeed index tests for every before and after state, we got the following results.</p>
<table>
<tr>
<th>30 reviews (group 1)</th>
<th>7 reviews (group 1)</th>
<th>Difference</th>
</tr>
<tr>
<td>60.6</td>
<td>69.2</td>
<td>14.2%</td>
</tr>
</table>
<table>
<tr>
<th>30 reviews (group 2)</th>
<th>12 reviews (group 2)</th>
<th>Difference</th>
</tr>
<tr>
<td>64.7</td>
<td>73.3</td>
<td>13.3%</td>
</tr>
</table>
<p class="caption">Review reduction experiment results on 2 groups of products</p>
<p>We had an improvement of almost 9 points for both groups which leads us to believe that:</p>
<ul>
<li>We probably reached the biggest improvement we can get from DOM elements reduction.</li>
<li>There is no reason to reduce our initial reviews number to 7 since 12 yields the same improved scores.</li>
</ul>
<p>Also, regarding the users’ flow, conversion rates and sales, we didn’t record any unusual fluctuations.</p>
<p>Last but not least, we didn’t notice statistically significant SEO performance changes, that would discourage us from exposing the change across the site.</p>
<h2 id="css-1">CSS</h2>
<p><a href="https://web.dev/extract-critical-css/" target="_blank">Critical CSS</a> was a really weird concept the first time we came around it.</p>
<p>The general idea is: Take all the CSS rules you need for rendering above-the-fold elements and put them in your HTML file.</p>
<p>The pros of this trick are that the browser will instantly read this “Critical CSS” and start rendering the above-the-fold elements with their applied rules instead of waiting for a CSS file to download and then do the rendering.</p>
<p>The rest of the CSS is downloaded when the onload event fires, thus not blocking the browser from rendering.</p>
<p>Critical CSS affects metrics like <strong>First Contentful Paint</strong> & <strong>Largest Contentful Paint</strong>.</p>
<p>After some research for possible implementation methods and an experiment that ran in selected Product pages, we reached the following conclusions:</p>
<ul>
<li>The change in scores was minuscule (1-2 points) and probably was caused by fluctuations in Pagespeed Index results.</li>
<li>The implementation of critical CSS for production needed a lot of effort. We would probably have to set up an automated job, generating all the critical CSS rules every time a change in our styles was pushed into master.</li>
</ul>
<p>The combination of high effort & low gains made us stop focusing on this idea and pursue other ways to improve performance and lower rendering times.</p>
<p><strong>Takeaway</strong>: Critical CSS didn’t work for Skroutz.gr!</p>
<h2 id="javascript-1">Javascript</h2>
<p>In order to optimize our JS performance, we worked on reducing the main bundle file that was overloading the main thread (initial request), and on our Redux architecture for faster response to user input. <br />
We finally came up with the following solutions:</p>
<h3 id="ways-to-reduce-our-webpack-bundle-size">Ways to reduce our webpack bundle size</h3>
<p>After some analysis, we started by avoiding libraries’ global imports and enforcing this rule with ESLint.
For example, requiring only the specific lodash functions we needed resulted in a <strong>9% bundle reduction</strong>.
Enforcing the rule with ESLint made sure we wouldn’t come across this issue again.</p>
<p>Then we tried code splitting. With webpack you can split your bundle up into many smaller ones and only load the bundles each page needs. We tried to split our code and ship it in different bundles, but unfortunately this
didn’t work for us, because of the many heavy dependencies shared between our main pages.</p>
<p><strong>It did not reduce overall bundle size (it even slightly increased it), so we decided not to proceed with it</strong>.</p>
<h3 id="redesign-the-state-of-one-main-page-of-our-react-redux-application-into-a-normalised-shape">Redesign the state of one main page of our React Redux application into a normalised shape</h3>
<p>This initiative was about improving the performance (response) after a user’s action on a page (i.e. filtering the results of a Listing), not for the initial request.</p>
<p>Keeping state normalised plays a key role in improving performance and avoiding unnecessary re-renders of the React components.</p>
<p>In a normalised state, each type of data gets its own “table”. Each “data table” stores the individual items in an object, with the items’ IDs as keys and the items themselves as values; any reference to an individual item is done by storing the item’s ID, and ordering is indicated by arrays of IDs.</p>
<p>With this normalised shape, no changes in multiple places are required when an item is updated, the reducer logic doesn’t have to deal with deep levels of nesting and the logic for retrieving or updating a given item is now fairly simple and consistent [<a href="https://redux.js.org/recipes/structuring-reducers/normalizing-state-shape" target="_blank">read more on this</a>].</p>
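<p>As a toy illustration (the entity and field names here are hypothetical, not our actual state), a normalised state and an update that touches only one flat “table” might look like this:</p>

```javascript
// Each entity type gets its own flat "table"; ordering lives in ID arrays.
const state = {
  products: {
    byId: {
      p1: { id: 'p1', name: 'Phone X', reviewIds: ['r1', 'r2'] },
      p2: { id: 'p2', name: 'Phone Y', reviewIds: [] },
    },
    allIds: ['p1', 'p2'],
  },
  reviews: {
    byId: {
      r1: { id: 'r1', rating: 5 },
      r2: { id: 'r2', rating: 3 },
    },
    allIds: ['r1', 'r2'],
  },
};

// Updating one review touches a single slice - no deep nesting, and
// unrelated slices keep their references, avoiding needless re-renders.
function updateReview(state, id, changes) {
  return {
    ...state,
    reviews: {
      ...state.reviews,
      byId: {
        ...state.reviews.byId,
        [id]: { ...state.reviews.byId[id], ...changes },
      },
    },
  };
}
```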
<h3 id="react-hydration-takes-long">React hydration takes long</h3>
<p>Another problem we found was the hydration on the client.</p>
<p>Hydration is the process by which React attaches event listeners to the existing markup on the client side. It is also important because it validates that the markup generated on the server and the markup on the client are the same, proof that SSR works as expected.</p>
<p>Hydration is a time-consuming process that increases load time and delays TTI. The usual solution to this problem is progressive hydration; unfortunately, due to our SSR implementation, we couldn’t adopt it.</p>
<p>However, we can implement lazy hydration as a stopgap, and React is already considering including progressive hydration in its core soon.</p>
<h2 id="core-web-vitals-cumulative-layout-shifts-cls-issues">Core Web Vitals: Cumulative Layout Shifts (CLS) issues</h2>
<p>In late May 2020, while we had already progressed in our making-Skroutz-faster journey, Google announced they’ll be “<a href="https://webmasters.googleblog.com/2020/05/evaluating-page-experience.html" target="_blank">Evaluating page experience for a better web</a>”.</p>
<p>What this meant for us, is that we had to focus on enhancing page experience, according to Google’s <a href="https://web.dev/vitals/#core-web-vitals" target="_blank">Core Web Vitals</a> metrics.</p>
<p>As Google announced, the above metrics will evolve over time, so it’s likely that we would be chasing a moving target here. Still, we focused on <a href="https://web.dev/cls/" target="_blank">CLS, a user-centric metric for measuring visual stability</a>, which was our main issue at the time, according to <a href="https://search.google.com/search-console/about" target="_blank">Google’s Search Console</a>.</p>
<p>There were 2 main areas that induced layout shifts: <strong>Image loads & user interactions</strong>.</p>
<h3 id="image-loads">Image loads</h3>
<p>Although it is quite common for image loading to cause layout shifts (LS), it is also easy to solve by defining image dimensions.</p>
<p>Our most affected page was the Product page with color variations (on desktop), which had two types of images causing LS when loading: main image and color variation thumbs.</p>
<p>The latter was easier to solve, by adding a fixed height placeholder on the container element.</p>
<p>Fixing the main image LS was trickier, because of its orientation-dependent, variable height.
Predefining its height was not an option, at least not for all products: while a predefined height seemed to solve the problem for portrait images, this wasn’t the case for landscape ones.</p>
<p>We then tried preloading the main image. If the network is fast enough to fetch the image before page rendering starts, no LS is caused.</p>
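<p>Preloading can be done either with a <code>&lt;link rel="preload"&gt;</code> tag in the document head or programmatically. A minimal sketch of the latter (the function name and URL are ours, for illustration):</p>

```javascript
// Sketch: ask the browser to fetch the main product image early, before
// rendering starts, by inserting a preload hint into the document head.
function preloadImage(url, doc = document) {
  const link = doc.createElement('link');
  link.rel = 'preload'; // high-priority early fetch
  link.as = 'image';    // lets the browser pick the right priority rules
  link.href = url;
  doc.head.appendChild(link);
  return link;
}
```

Usage: <code>preloadImage('/images/main-product.jpg')</code> as early as possible in page load.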
<p>The above fixes eliminated LS that occurred on Product page initial load, which essentially zeroed out lab data CLS.</p>
<p>Although the initial CLS score caused by image loads was not that significant (~0.03), any gain that keeps our pages’ score below 0.1 (marked as fast by Google) is important.</p>
<h3 id="user-interactions">User interactions</h3>
<p>Google Search Console marked a large number of our URLs as poor, the issue being CLS.
The marked issues concerned both Product & Listing pages on mobile viewports.</p>
<p>After some investigation, the cause was found.</p>
<p>CLS was caused by our <strong>sticky header</strong>.</p>
<p>The header becomes sticky after users scroll past a certain point, at which fixed positioning is applied. Apart from the header itself, the issue involved the sticky navigation on the Product page and the sticky filters on the Listing page.</p>
<p>While the issue was a bit more complex (e.g. paddings were added to other elements to keep everything in place) simply put, adding or removing these sticky elements from the static flow of the document caused a Layout Shift.</p>
<p>Even more, this LS kept adding up each time our header got stuck or unstuck, resulting in significant CLS scores.</p>
<p>A simplified description of the solution is that we explicitly declared the heights of the sticky element containers.
The containers then functioned as placeholders, maintaining the sticky element heights, even when they got out of the static flow.</p>
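<p>In code terms, the idea reduces to something like this sketch (the function names and the <code>is-sticky</code> class are hypothetical):</p>

```javascript
// Sketch: before taking an element out of the static flow with
// position: fixed, give its container an explicit height equal to the
// element's own. The container then acts as a placeholder, so nothing
// around it shifts when the element sticks or unsticks.
function makeSticky(container, element) {
  container.style.height = `${element.offsetHeight}px`;
  element.classList.add('is-sticky'); // class that applies position: fixed
}

function unstick(container, element) {
  element.classList.remove('is-sticky');
  container.style.height = ''; // let the element size the container again
}
```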
<p>A similar problem occurred in our <strong>product cards</strong>, where the shop’s rating and location were displayed.
This information is fetched asynchronously which means that in the initial render the content of that section is empty.
Once the data is fetched and the section populated, the container’s height changes, causing the next cards to be pushed down.</p>
<p>The solution was simple in that case too, we just had to specify the height of the placeholder’s container.</p>
<p>After the abovementioned fixes, our pages improved and are now marked as “good URLs” instead of “URLs that need improvement”, as the images below show.</p>
<p><strong>Yeah</strong>!</p>
<p><img src="https://engineering.skroutz.gr/images/speed_at_skroutz/crux_report.png" alt="'Chrome User Experience CLS Report'" /></p>
<p class="caption">Skroutz's good CLS improved by 45%! (based on Chrome User Experience report)</p>
<p><img src="https://engineering.skroutz.gr/images/speed_at_skroutz/skroutz_mobile_web_vitals.png" alt="'Web vitals for mobile in Google's Search Console'" /></p>
<p class="caption">Good URLs rising after our CLS fixes (Google Search Console)</p>
<h1 id="onwards---closing">Onwards - Closing</h1>
<p>After more than a year of hard and fun work, ranging from low-effort fixes to a few advanced ones, we’ve done it.</p>
<p><strong>We’ve made Skroutz.gr faster.</strong></p>
<p>Performance is a feature at Skroutz. But it is also a competitive advantage. Optimized experiences lead to higher user engagement, conversions, and ROI.</p>
<p>Striving for speed is a never-ending journey. Although we achieved a better performance during the last year -and hopefully a better user experience for our visitors-, this is not the end of the story.</p>
<p>We are now in training mode: we are <strong>setting up a “speed mentality”</strong> for our Front-End engineers, especially regarding the latest and greatest in rendering performance (Core Web Vitals). This post is part of the training!</p>
<p>We are also <strong>establishing an additional continuous monitoring system</strong>: a set of tools and methodologies that we will apply on top of the existing ones, in order to keep the new performance metrics on our daily radar.</p>
<p>We strive for <strong>fast pages and fast development</strong>. At the same time.</p>
<p>We have lots to do more! :)</p>
<p>Congratulations if you made it to the bottom of this huge post.</p>
<p>We hope you got some valuable points from our speed journey.</p>
<p>Have you tried optimizing your speed before? <br />
Yes? No? Kinda? <br />
Let us know by sharing your experience and findings in a comment below.</p>
<p><strong>Best, <br />
Skroutz Devs.</strong></p>
<hr />
<p><em>top image source: <a href="https://www.google.com/url?q=https://unsplash.com/photos/0ZBRKEG_5no&sa=D&ust=1603379132565000&usg=AOvVaw0Y5xRp43UuO_ubGksLeIvx" target="_blank">unsplash</a></em></p>
<style type="text/css">
#tldr {
font-size: 20px;
}
.entry-content .caption {
text-align: center;
font-size: 14px;
font-style: italic;
margin-bottom: 3rem;
}
.entry-content > h1 {
margin-top: 3rem;
}
.entry-content > h2 {
font-size: 1.5rem;
}
.entry-content > h3 {
font-size: 1.2rem;
text-decoration: underline;
}
.entry-content blockquote {
background: #f6f6f6;
padding: 20px 25px;
border: 0;
margin: 30px 0;
transition: none;
}
.entry-content blockquote {
font-style: normal;
}
.entry-content blockquote p {
border-bottom: 1px dotted #ccc;
padding-bottom: 5px;
}
.entry-content blockquote > p > a {
color: #1d1db8;
}
.entry-content blockquote p,
.entry-content blockquote li {
font-size: .9rem;
}
.entry-content td {
background: #fdfdfd;
}
.entry-content small a {
color: #549B70;
}
@media screen and (min-width: 48em) {
.entry-content blockquote p,
.entry-content blockquote li {
font-size: 1rem;
}
}
</style>
<p><a href="https://engineering.skroutz.gr/blog/speed-the-journey-to-delivering-a-faster-experience-at-skroutz-gr/">Speed: The Journey to Delivering a Faster Experience at Skroutz.gr</a> was originally published by Skroutz Engineering Team at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on October 22, 2020.</p>https://engineering.skroutz.gr/blog/process-optimization2020-06-07T08:23:09+00:002020-06-07T08:23:09+00:00George Hadjigeorgiouhttps://engineering.skroutz.gr<p>In every stage of a business, there will be some processes as part of everyday operations. Some examples include the onboarding of a new customer, replying to customer support tickets, or interviewing candidates for a new position.</p>
<p><strong>All processes,</strong> whether well defined and documented or just common knowledge, <strong>start small and simple</strong> and almost certainly <strong>end up huge and complicated</strong>. Drawing a parallel with the world of physics, processes obey the rule of inverted entropy: <strong>every process wants to transition from small and simple (low energy) to big and complex (high energy)</strong>.</p>
<p>This transition will not happen overnight but with small distinctive steps that will eventually slow down the team’s performance. In most of those cases, the slowdown will not be attributed to the increasing complexity of one or more processes but to the high load of the team which will result in more hires and further performance degradation.</p>
<p>Apart from the obvious problems a complex process has (taking more time to complete and requiring more resources), there is one more, hidden in the background, that poses an even bigger threat than a slowdown: teams with complex processes don’t scale. Onboarding a new member takes a huge amount of time, and the more people you add, the more managers are required to control the complexity.</p>
<p>So how does someone optimize a process? As a rule of thumb, expect a significant efficiency boost in any process optimization effort. In <a href="https://www.goodreads.com/book/show/324750.High_Output_Management">High output management</a>, Andy Grove says:</p>
<blockquote>
<p>This is called work simplification. To get leverage this way, you first need to create a flow chart of the production process as it exists. Every single step must be shown on it; no step should be omitted in order to pretty things up on paper. Second, count the number of steps in the flow chart so that you know how many you started with. Third, set a rough target for reduction of the number of steps. In the first round of work simplification, our experience shows that you can reasonably expect a 30 to 50 percent reduction.</p>
</blockquote>
<p>Andy refers to production steps in a factory, but work simplification can be applied to any process. To <strong>keep</strong> your processes <strong>small and simple</strong> you need to do two things: <strong>don’t allow them to get complex, and inspect them frequently to reduce the number of steps</strong>.</p>
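Grove’s counting exercise is simple enough to sketch in a few lines of Python. The process steps and the 30% default target below are purely hypothetical, for illustration only:

```python
# Toy illustration of Grove's work-simplification exercise:
# chart every step, count them, then set a rough reduction target.
def simplification_target(steps, reduction=0.30):
    """Return the step count to aim for after cutting roughly `reduction` of the steps."""
    return len(steps) - round(len(steps) * reduction)

onboarding = [  # hypothetical process; every step listed, none omitted
    "collect documents", "verify identity", "manager sign-off",
    "create account", "send welcome email", "schedule training",
]

print(len(onboarding))                    # steps we start with: 6
print(simplification_target(onboarding))  # rough target after a ~30% cut: 4
```

The point is not the arithmetic but the discipline: you cannot set a reduction target for a process whose steps you have never written down.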
<p>Both actions require asking the same questions either while introducing a new step or when inspecting a process to optimize it.</p>
<h2 id="optimizing">Optimizing</h2>
<p>Process optimization is mostly about inspecting a process and eliminating all unnecessary steps or optimizing them if elimination is not possible. To identify those steps we need to take a look at the most common causes of process complexity which are described below.</p>
<h3 id="better-safe-than-sorry">Better safe than sorry</h3>
<p>Some processes have a number of steps to ensure that nothing ever goes wrong. Say, for example, your support team has a process for handling customer requests of a certain type that is working fairly well. At some point, an angry customer reports a not-so-common case where your process failed in a way that created a lot of frustration or actual damage. One or a few members of the team took a lot of heat and the customer eventually churned.</p>
<p>To avoid this happening again the manager of the team will add an extra step with additional checks to make sure the process is fail-safe. With time, a few of those not so common cases will translate to a number of additional steps being added to the process.</p>
<h3 id="legacy">Legacy</h3>
<p>This is a classic especially for processes that have been around for a long time. A step was introduced to gather some extra information required by law but that law no longer exists. Because of the distance between those handling the process and those designing it the step will sit there for quite some time even though it’s no longer required.</p>
<h3 id="scope-creep">Scope creep</h3>
<p>It’s not so uncommon to find processes with steps that apply only to a specific part of the business but are not scoped accordingly. An e-commerce platform, for example, may require specific handling of a certain category that will be introduced as a new step. That step, however, isn’t critical to all other categories and could easily be scoped to affect only a small share of the team’s resources.</p>
<h3 id="premature-steps">Premature steps</h3>
<p>Every process that has more than a few steps will probably have dependencies between them. Signing up a new customer, for example, will require an up-front setup fee and probably a few other steps. That setup-fee step should be placed at the top, and all other steps should be blocked until the customer has made the initial payment. In many cases, processes execute all steps independently of each other, resulting in extra work that is wasted if that critical step is never completed.</p>
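To avoid that waste, the critical step can act as a gate: nothing downstream runs until it succeeds. A minimal sketch, where the customer fields, step names, and `setup_fee_paid` check are all hypothetical:

```python
# Hypothetical signup pipeline: the setup-fee step gates everything after it,
# so no work is wasted on customers who never complete the initial payment.
def run_signup(customer, steps, gate):
    if not gate(customer):                 # critical step goes first
        return []                          # stop early: later steps never run
    return [step(customer) for step in steps]

def setup_fee_paid(customer):
    return customer.get("fee_paid", False)

later_steps = [
    lambda c: f"provisioned account for {c['name']}",
    lambda c: f"sent welcome email to {c['name']}",
]

print(run_signup({"name": "Alice", "fee_paid": True}, later_steps, setup_fee_paid))
print(run_signup({"name": "Bob"}, later_steps, setup_fee_paid))  # []: blocked at the gate
```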
<h3 id="requiring-various-levels-of-authority">Requiring various levels of authority</h3>
<p>Processes may have steps that can’t be executed by the same individual and require a different, usually higher, level of authority. People with authority are usually far fewer than the people executing tasks, which results in a bottleneck.</p>
<h3 id="no-automation">No automation</h3>
<p>This is probably the easiest one to address. Some steps gradually degrade into something that can be automated, or the technology required for automation simply wasn’t there when the process was designed (e.g. checking the creditworthiness of an individual). Note that automating parts of a step, rather than the whole step, will still increase efficiency.</p>
<p>Designing a process is equally important as maintaining it after it has been deployed. The most common cause of process inefficiency is lack of maintenance; it’s not uncommon to find processes that haven’t been inspected for years. Heavy-load processes should be inspected every 3-6 months, while less frequently used ones can be inspected at longer intervals.</p>
<p><a href="https://engineering.skroutz.gr/blog/process-optimization/">Process Optimization</a> was originally published by George Hadjigeorgiou at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on June 07, 2020.</p>https://engineering.skroutz.gr/blog/hiring-engineers-while-working-from-home2020-05-31T21:00:00+00:002020-05-31T21:00:00+00:00Nikos Fertakishttps://engineering.skroutz.gr<h4 id="or-how-we-learned-to-stop-worrying-and-love-the-engineering-interview-process">Or: How We Learned to Stop Worrying and Love the Engineering Interview Process</h4>
<p>Skroutz is hiring! We write this phrase on our social media and blog posts, and
we discuss it internally within our hiring teams. We are growing rapidly and
hiring people is one of our top priorities.</p>
<p>Even when the pandemic hit and we started working from home, not only did we not
stop our hiring efforts, but we doubled down on them and quickly adapted our
interview process to the new realities. To put it into perspective, of the 29
engineers we have hired in 2020, 25 were hired after March! In this post we will
discuss how we think about interviewing engineers, how our approach has evolved
over time, and how with a few tweaks the same approach worked well when it
became fully remote.</p>
<p>When this big hiring initiative started, our engineering hiring process was in
need of rethinking. It wasn’t really broken - we have hired many good people -
but with more colleagues joining our hiring team and more positions open than
ever, it had started to show its age. So we wrote down the things that concerned
us about it and spent a couple of weeks thinking, discussing, and reading
relevant articles and book chapters. In the end, we came up with a process that,
while not radically different, seemed to iron a few kinks out.</p>
<p>Of course, a hiring process that looks good on paper might not withstand
colliding with the real world. So we decided to try this process out on one of
our openings first and then share the results. Meanwhile, other divisions within
the company are experimenting with different variations of the process, which
means we get to meet and exchange experiences after!</p>
<p>Before moving forward, we must note that hiring is an inherently flawed
process. <em>Judging individuals for their skills is messy, hard, and at times
unfair</em>. Candidates are called to compete in something that barely resembles
their everyday work. In practice, the whole hiring process boils down to
minimizing “false negatives” in the early stages - candidates that should have
moved forward but didn’t - and “false positives” in the hiring decision -
candidates that were hired but weren’t a good fit after all. This is tricky and
you’re bound to make mistakes. Andy Grove wrote that <em>“careful interviewing
doesn’t guarantee you anything, it merely increases your odds of getting lucky”</em>.
In this post we share our experience and our current understanding which might
change in the future, so please take everything we say with a grain of salt.</p>
<h2 id="search-team-job-opening">Search Team Job Opening</h2>
<p>The search team was looking to hire two engineers, and we chose that job opening
to try our process. We received the first resume on 7 February and the last (we
removed the job listing) on 12 March. 83 candidates applied in total, and we
eventually hired 3 of them - two joined the search team and one joined the
content engineering team. Overall we were quite happy with the way it turned
out: we are confident that we made the right decisions and are excited to start
working together with our new colleagues.</p>
<p>What didn’t go that well was our response times: on average we needed 10 days
from the day we received a resume to the day we conducted the first screening
with a candidate. The average time from resume to job offer was 44 days. We can
attribute this to three main reasons: the first is the fact that we had many
open positions simultaneously and our HR department was, at the time,
understaffed for the candidate load. The second reason is the fact that only
four people were involved in the interviews, which put a cap on the total
interviews we could arrange per day. The third reason is of course the elephant
in the room: the COVID-19 pandemic and the resulting lockdown which made us
switch the whole recruitment process online.</p>
<h2 id="interview-process-revamped">Interview Process Revamped</h2>
<p>Our old process consisted of three parts: first filter through the incoming
resumes. Then, do a screening call with those that we think might fit the role.
Finally, do an onsite interview with the most promising candidates. The onsite
interview consisted of two parts: a coding exercise and some database-related
questions.</p>
<p>Before the lockdown happened, we were thinking of adapting that process in order
to address a few issues we had identified. First, the screening call would be
more structured, with specific things to check for.</p>
<p>Furthermore, we decided to introduce a second screening call that would include
a simple coding exercise. The reasoning behind this was that onsite interviews
are very “expensive” both for the candidate that would have to come over to our
offices and spend a few hours there, as well as for the interviewers who would
end up spending a large portion of their day preparing for and conducting the
interview. It made sense then to only call the most promising candidates for
onsite interviews, and we had identified that the coding exercise would help us
do that.</p>
<p>Finally, the onsite interview would be split into three distinct parts: first
another coding exercise, a bit harder this time. Then, a question about system
design. Finally, a chat around the candidate’s past experience.</p>
<h2 id="interview-process-turned-remote">Interview Process Turned Remote</h2>
<p>…then one day we stopped going to the office altogether!</p>
<p>Fortunately, the process above was easy to adapt for a remote setting. First of
all, there would only be one screening call. Onsite didn’t make sense anymore
since there was no “site” to go to and the whole process was just a series of
video calls on Google Meet. So we decided to split that into two calls: first
the coding challenge, and then the system design and past experience
discussions.</p>
<p>This is an overview of the full hiring process, assuming the candidate always
reaches the next step:</p>
<ol>
<li>Screen resume</li>
<li>Do a screening call</li>
<li>Do a coding exercise call</li>
<li>Do a system design/past experience call</li>
<li>Make an offer they can’t refuse</li>
</ol>
<p>Note that all interviews are conducted by engineers who are members of the
recruiting team and there are always at least two interviewers in each call to
reduce bias. The whole process is facilitated by the HR department, members of
which also join the call on the last step to ask a few questions themselves.</p>
<h2 id="screening-call">Screening Call</h2>
<p>The goal of a screening call is to get a first impression of the candidate and
try to establish whether they might be a good fit for the position. We spoke
with 17 people (20% of the total applicants), and 10 of them went forward to the
next step.</p>
<p>We try to keep the length of screening calls at around 45 minutes, and having a
predetermined structure helps a lot. First, we ask the candidate to talk a bit
about their experience, urging them to be concise. This serves as a nice
icebreaker and also becomes the starting point for follow up questions so we can
understand a bit more about their background.</p>
<p>We then ask them a bit about their motivation, the reason they applied for the
specific position, and also what they want to primarily focus on for the next
couple of years. Of course there is no “right” answer here, but this is a nice
way to move the conversation forward and learn more about the candidate and
their expectations.</p>
<p>Finally, and depending on the position the candidate is interviewing for, we ask
a couple of “knowledge” questions. These can be of two types: the first is checking
whether the candidate knows about something that is considered a requirement for
the role. For example, we might ask about Ruby symbols if the position requires
Ruby background. The second type is something based on the candidate’s
experience. For example, a candidate might talk about (or mention on their
resume) their experience with a distributed systems project, and we could follow
up by asking about consistency or availability issues. This way we can get an
impression of the depth of their knowledge, but also of their skills in
communicating technical topics.</p>
<h2 id="coding-exercise">Coding Exercise</h2>
<p>Of the 17 people we did a screening call with, we moved forward with the coding
exercise call with 10 (59%).</p>
<p>What’s nice about having the coding exercise as a separate step is that you can
be less strict in the screening call, because screening calls can get pretty messy;
<em>there were times that we were left uncertain whether we actually learned something
useful about the candidate</em>. In such cases, moving forward with the coding
exercise was an easy decision, since it would give an opportunity to the
candidate to do well, while being easier for us to judge. An example would be
giving a chance to people with little interviewing experience that were visibly
stressed during the screening call.</p>
<p><em>The goal of this step is to determine the problem-solving, coding, and
communication skills of the candidate</em>. First, we let them know that while
reaching a solution is important, we also care about communicating their
thinking out loud. We also note that the quality of the code is important and
that we want to simulate a scenario where we work as a team to solve a problem
but the candidate takes the lead and we just follow along. Finally, we let the
candidate know that there is a time limit of 45 minutes and we actually try to
conclude the call within that range.</p>
<p>In practical terms, we use <a href="https://coderpad.io/">Coderpad</a> for this step.
What’s nice about it is it allows us to watch the candidate’s progress in real
time, and it offers an environment on which we can actually run and test the
code. In preparation for the call we will have created a “pad” with the exercise
description, and some boilerplate code in the language the candidate is most
experienced with.</p>
<p>As for the exercise itself, we try to pick problems that are not trivial but
also that do not require a certain “aha moment” to figure out the solution.
Rather, we prefer problems that are amenable to the sort of incremental problem
solving that is common in day to day work. During the call, we encourage the
candidate to take some time to think about the problem, even use pencil and
paper if they want to. We also try to give them hints if they are stuck, and try
to steer them towards a solution, sometimes by giving them specific examples to
work with.</p>
<p>In order to gain confidence about the exercises we picked, we did a couple of
simulations with Skroutz engineers: we logged on to Coderpad, gave
them the exercise, and watched them trying to solve it within the predetermined
45 minute limit. While of course this isn’t realistic as there is no stress
involved, it gave us a nice (albeit optimistic) baseline and reduced the doubt
about the quality of the exercise significantly. What’s more, we have decided to
adopt this approach before approving any exercise to be used in Skroutz
interviews.</p>
<p>Finally, we gave the same exercise to every candidate. While this might sound a
bit risky (it might leak, or they might already be familiar with it), in
practice it was very helpful in that we were able to judge the candidates
compared to each other, rather than in absolute terms. This helped increase our
confidence about the people we chose to move to the next step of the process.</p>
<h2 id="system-design--past-experience">System Design & Past Experience</h2>
<p>Of the 10 people that did the coding exercise, 5 went forward to the next step.
We should note here that all five candidates were good engineers, able to
communicate their thinking, and we would probably be happy working with any of
them.</p>
<p>The final session has two parts. The first and most technical is a 45 minute
discussion around a system design topic. The second is a 30 minute discussion
based on the candidate’s past experience.</p>
<p>In the system design, we ask the candidate to assume that we are an engineering
team that gets assigned to create a new system/service/website, etc. We want
them to take the lead and tell us how they would approach this task. These are
intentionally open ended: for example “create a twitter clone”, “design a system
that enables users to ‘like’ posts”, and “design a bit.ly clone”, are all
potential topics.</p>
<p>Note that there is not a single right answer and that what we are looking for
can be adapted based on the candidate’s experience and seniority. For a junior
candidate we could focus more on database schema design, API endpoints, and
queries. On the other hand, we would expect a more senior candidate to do
requirement gathering and trade-off discussion before diving into a design
proposal.</p>
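For instance, a first pass at the “bit.ly clone” prompt might pair a <code class="language-plaintext highlighter-rouge">links(id, url)</code> table with base-62 slug encoding. The sketch below is purely illustrative (an in-memory dict stands in for the database), not a model answer we expect from candidates:

```python
import string

ALPHABET = string.ascii_letters + string.digits  # 62 characters for short slugs

def encode(n):
    """Encode a numeric row id as a base-62 slug."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def decode(slug):
    """Turn a base-62 slug back into the numeric row id."""
    n = 0
    for ch in slug:
        n = n * 62 + ALPHABET.index(ch)
    return n

links = {}  # in-memory stand-in for a links(id, url) table

def shorten(url):
    link_id = len(links) + 1  # a real system would use an auto-increment id
    links[link_id] = url
    return encode(link_id)

def resolve(slug):
    return links.get(decode(slug))

slug = shorten("https://www.skroutz.gr")
print(slug, "->", resolve(slug))
```

From a starting point like this, the follow-up questions practically write themselves: what happens under a traffic spike, how do slugs survive a restart, is a relational database the right store for redirects?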
<p>What’s nice about this type of question is that we can dive as deep as we want:
we can explore potential issues with the proposed systems, e.g. “what happens if
there’s a spike in traffic?”, we can ask how the proposed systems could be
extended, e.g. “how can we support lists in our twitter clone?”, or even
alternative technologies, e.g. “is a relational database ideal for this
scenario?” It goes without saying that this type of question requires
preparation from the interviewers as well.</p>
<p>We conclude this call with a 30 minute discussion based on the candidate’s past
experience. This is mostly informal and includes questions on technical topics,
for example we could ask “what is a project you worked on you are especially
proud of?”, “what is the weirdest bug you have encountered?”, but also possibly
touch on teamwork-related topics, for example “what was for you a good
experience of a well-functioning team?”, or “how did you resolve disagreements
with your lead?”, etc.</p>
<p>Touching on such topics can let us determine the seniority of a candidate in
various areas, not strictly technical. This can play a role in determining the
team and the manager we assign them to, should they join us. Moreover, many of
the answers we get are truly interesting, informative and sometimes even
surprising.</p>
<h2 id="conclusion-and-lessons-learned">Conclusion and Lessons Learned</h2>
<p>We hired 3 of the 5 people that made it to the last interview stage, about 4% of the
total applicants. All in all we are quite happy with the process and we felt we
learned a lot along the way.</p>
<p>Other teams have started adopting some key parts of this process:</p>
<ul>
<li>The coding & design step is permanently split into two calls instead of a very
long (~3 hours) one. We believe this helps with scheduling and can reduce
fatigue for both the candidates and the interviewers.</li>
<li>The interviewers are always suitably prepared and are expected to take notes
during the interview and submit their feedback on the candidate within a
couple of days.</li>
<li>For each open position we try to determine “knowledge requirements”
beforehand. That is, things a candidate must know in order to be considered
for the position.</li>
<li>When researching coding exercises we take care that they are not trivial and
do not require a single aha moment to solve.</li>
<li>We have a common pool of coding exercises, so we can experience how different
people try to solve them and judge their performance compared to each other,
rather than in absolute terms.</li>
<li>Before a coding exercise enters the pool, we do a simulation where our
colleagues try their hand at solving them!</li>
<li>Similarly, we have prepared a pool of system design questions.</li>
</ul>
<p>Of course the process is continuously evolving and we try to get better in time.
Asking the candidates for feedback is very helpful in that regard. We want to
treat the whole process as we would a product proposal: first develop some
assumptions on what we can improve, then try the changes out, and finally adapt
the process or the assumptions accordingly, based on the outcome.</p>
<p><em>We believe that people are what matters most in an organization</em>. A proper
hiring process then is critical for growing the organization successfully -
maintaining the core values intact and an excellent level of technical aptitude.
It also shapes the candidate’s initial impression of the organisation and its
people. Thus we believe that we should keep working on it, and that sharing our
experience is important.</p>
<p><a href="https://engineering.skroutz.gr/blog/hiring-engineers-while-working-from-home/">Hiring engineers while working from home</a> was originally published by Nikos Fertakis at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on May 31, 2020.</p>https://engineering.skroutz.gr/blog/performance-management-skroutz2020-02-13T22:00:00+00:002020-02-13T22:00:00+00:00Roza Tapinihttps://engineering.skroutz.gr<h1 id="introduction">Introduction</h1>
<p>This is a post on how we are managing performance at Skroutz and how
we transitioned from informal semi-annual feedback meetings to a
structured continuous performance management framework. It is about
how we translated our value <a href="https://www.skroutz.gr/careers#goals">Set big goals. Take small steps</a> to an
actual set of events, processes, and tools that drive our performance
daily and support our career development.</p>
<h1 id="history">History</h1>
<p>At Skroutz, we have always cared about our people’s personal and
professional development.</p>
<p>In our early days, George H., Vassilis and George A. were having
meetings with all team members giving them feedback and helping them
grow. As our team grew bigger, this task was assigned to people
managers who continued meeting with their team members talking about
strengths and improvement points as well as setting developmental goals.</p>
<p>In both instances, it was an informal discussion where both sides
shared feedback based mainly on recent events and was ending with a
few actionables vaguely stated. It was a process that could serve its
purpose if we were to stay a small-medium sized company.</p>
<p>However, our vision is greater than this, and therefore we needed to
set up a process in which <strong>our people’s happiness and professional
development would remain the focal point</strong>.</p>
<p>Following the example of companies like Adobe and Google, we
introduced a new approach called Continuous Performance Management (CPM).</p>
<p>The differences between CPM and the traditional performance appraisals
are summarized in the table below.</p>
<table>
<thead>
<tr>
<th>Continuous Performance Management</th>
<th>Performance Appraisals</th>
</tr>
</thead>
<tbody>
<tr>
<td>Continuous feedback</td>
<td>Annual or semi-annual feedback</td>
</tr>
<tr>
<td>Coaching</td>
<td>Directing</td>
</tr>
<tr>
<td>Democratic</td>
<td>Autocratic</td>
</tr>
<tr>
<td>Process focused</td>
<td>Outcome focused</td>
</tr>
<tr>
<td>Strength-based</td>
<td>Weakness-based</td>
</tr>
<tr>
<td>Fact driven</td>
<td>Prone to bias</td>
</tr>
</tbody>
</table>
<h1 id="cpm-in-action">CPM in action</h1>
<p>Continuous performance management (CPM) is a framework where managers
and team members collaborate to create short-term developmental goals
and meet on a more regular basis to promote growth, recognition and
happiness. The idea is that everyone can rise to the top and be
successful with their current set of skills.</p>
<p>CPM consists of various components, each serving a different
purpose, and all of them are complementary to each other.</p>
<h1 id="one-on-one-conversations">One-on-One conversations</h1>
<p>Collaboration is a core element of work-life at Skroutz and 1-1s
ensure that a manager-team member relationship has this
characteristic. In a nutshell, one-on-one conversations promote an
ongoing forward-looking dialogue between us and our manager.
So, every two weeks we meet with our manager for 30 to 45 minutes.
Topics of discussion vary; we share updates on work progress, ask for
guidance and support, ask questions regarding tasks, team, and company
OKRs, talk about personal matters, follow up on developmental goals,
and the list goes on. This time is about us!!</p>
<p>In a recent internal survey regarding CPM, there was a unanimous
response that having regular 1-1 conversations was one of the best
practices we have ever rolled out. During the last 8 months, we all
have experienced genuine communication with our manager and we have
received the support and guidance we needed to achieve our tasks and goals.</p>
<p>On the downside, finding an available meeting room at the Skroutz Awesome
Factory resembles a treasure hunt. :)</p>
<h1 id="performance--career-development-discussions">Performance & Career development discussions</h1>
<p>Another component of CPM is the performance discussion, which takes
place quarterly and it serves as a feedback and development mechanism.</p>
<p>This event gives us the opportunity to look back on our 1-1
conversations, on feedback that was shared over the previous 3 months
and have a future-forward talk with our manager about our career development.</p>
<p>At the beginning of every quarter, we complete a self-assessment, which we
send to our manager. S/he then prepares a performance review doc and
sends it to us prior to our discussion, so that we are all
prepared. During our talk, we recognize superpowers and
accomplishments, but most importantly we discuss our future: our
career aspirations and the skills we need to develop in order to fulfill
them. We set priorities and agree on action items for us and our manager.</p>
<p>Goals and action items set in this discussion will be a recurring
topic in our 1-1s for the following quarter.</p>
<h1 id="peer-and-manager-feedback">Peer and Manager feedback</h1>
<p>Skroutz grew on receiving feedback. Getting feedback on features,
services and processes is part of who we are. We believe that feedback
can help us become better at what we do. This mindset applies to all of us, as well.</p>
<p>In the context of the CPM framework, we run peer-review surveys as
well as manager-review surveys. This way, we have the opportunity to
give and receive actionable feedback from our peers and from our team
members, in the case of people managers. The purpose of a feedback
survey is to assist each one of us to better understand our strengths
and weaknesses and to get an insight into aspects of our work needing
professional development.</p>
<p>We ran our first peer and manager reviews in June. Each one of us got
feedback that rang true and that, deep inside, we already knew. Yes,
we got a bit defensive when we first read our report, but then we
distilled from it action items that fueled some of our 1-1 talks.</p>
<h1 id="the-benefits">The benefits</h1>
<p>We transitioned to the CPM framework less than a year ago, and the
positive impact on our daily work-life was obvious from the beginning.</p>
<p>To start with, we are now more aware of where we stand. At any time,
we know what we did well, what we need to work on and we have the
support and guidance we need to achieve our goals. Feedback on
performance is given with specific actionables and in a timely manner.
Good efforts and accomplishments are given the appropriate recognition.</p>
<p>Performance discussions are currently more fruitful since recency bias
has been eliminated; they are more focused on the future and we are
constantly examining opportunities for development.</p>
<p>Our relationships with our managers have improved significantly and
our interaction is more meaningful. Our people managers act as coaches
and mentors and focus their attention on how they could help each one
of us to grow and work towards our goals.</p>
<h1 id="to-sum-up">To sum up</h1>
<p>Continuous performance management has helped us reinforce our culture
of continuous growth, feedback, and recognition. It has contributed to
making our values come alive.</p>
<p>We still have some fine-tuning to do, but we are all certain that
this framework will keep our people at the centre of our attention
and efforts, no matter how big Skroutz becomes.</p>
<p><a href="https://engineering.skroutz.gr/blog/performance-management-skroutz/">Performance Management @ Skroutz</a> was originally published by Roza Tapini at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on February 13, 2020.</p>https://engineering.skroutz.gr/blog/SEO-Crawl-Budget-Optimization-20192019-10-30T21:00:00+00:002019-10-30T21:00:00+00:00Vasilis Giannakourishttps://engineering.skroutz.gr<h1 id="introduction">Introduction</h1>
<p>This is a story about the technical side of SEO on a large e-commerce website like <a href="https://www.skroutz.gr/" target="_blank">Skroutz.gr</a>, with nearly 1 million sessions daily and how we dealt with some significant technical issues we found a year and a half ago.</p>
<p>Let’s give you a sneak peek at the milestones of our efforts, which are covered in this case study. Over the last year and a half, we managed to:</p>
<ol>
<li>Decrease our index size by <strong>18 million</strong> URLs while <strong>improving</strong> our Impressions, Clicks and Average Position.</li>
<li>Create a <strong>real-time</strong> crawl analyzer tool that can handle millions of URLs.</li>
<li>Implement a custom <strong>alert mechanism</strong> for important SEO index and crawl issues.</li>
<li>Automate the technical SEO process of merging or splitting e-commerce categories.</li>
</ol>
<p>If you are interested to see why and how we did all the above, grab a seat!</p>
<blockquote>
<p><strong>Table of Contents</strong></p>
<p><a href="#part-1-seo-analysis-february-2018">Part 1: SEO Analysis (February 2018)</a> <br />
› <a href="#what-issues-initiated-our-analysis">What issues initiated our analysis</a> <br />
› <a href="#how-we-did-the-analysis">How we did the Analysis</a></p>
<p><a href="#part-2-action-plan-and-execution-feb-2018---june-2019">Part 2: Action Plan and Execution (Feb 2018 - June 2019)</a> <br />
› <a href="#action-plan">Action Plan</a> <br />
› <a href="#execution">Execution</a></p>
<p><a href="#part-3-results">Part 3: Results</a></p>
<p><a href="#what-we-learned">What We Learned</a></p>
</blockquote>
<p>But before we take off, let us introduce ourselves.</p>
<p><a href="https://www.skroutz.gr/">Skroutz.gr</a> is the leading price comparison search engine and marketplace of Greece and a top-1000 ranked website globally by <a href="https://www.similarweb.com/website/skroutz.gr">Similar Web</a>. <a href="https://www.skroutz.gr/">Skroutz.gr</a> helped its merchants generate a Gross Merchandise Volume (GMV) of €535 Million in 2018 (≈20% of Greece’s total Retail Ecommerce GMV).</p>
<p>Besides the main B2C price comparison service, <a href="https://www.skroutz.gr/">Skroutz.gr</a> also provides <a href="https://www.skroutz.gr/c/2978/epaggelmatikos-exoplismos-b2b.html">a B2B price comparison service</a> and a new <a href="https://www.skroutz.gr/food">food online delivery service</a> for the Greek market, namely SkroutzFood. Finally, Skroutz.gr operates its own <a href="https://www.skroutz.gr/ecommerce/landing">marketplace</a> of 500+ merchants.</p>
<h3 id="seo-challenges-in-large-sites-the-case-of-skroutzgr">SEO Challenges in Large Sites: The case of Skroutz.gr</h3>
<p>So, what challenges does a Site with millions of pages encounter?</p>
<p>First of all, imagine the difficulties of optimizing rankings for an average-sized site: keyword research and monitoring, on-page SEO, and so on. Now think about doing the same on a website with millions of pages; you have to deal with a vast amount of data and automate things in a way that does not compromise quality.</p>
<p>Besides this, SEO is not just rankings…</p>
<p>Indeed, large website SEOs have another big headache: Crawling and Indexing. These essential steps take place even before Google ranks your content and can be extremely complicated on huge sites.</p>
<blockquote>
<p><strong>Note</strong>: In this case study we focus on the Google Search Engine and Googlebot. However, all search engines operate similarly.</p>
</blockquote>
<p>Most of the problems we encountered relate to <a href="https://webmasters.googleblog.com/2017/01/what-crawl-budget-means-for-googlebot.html">Crawl Budget</a> and <a href="https://support.google.com/webmasters/answer/66359?hl=en">Duplicate Content</a>. More specifically:</p>
<ul>
<li><strong>Crawl Budget</strong>: Google has a crawl rate limit for every website. If the website has fewer than a few thousand URLs, it will usually be crawled just fine. However, if you have a site with a million or more pages, you need to enhance your structure so that crawlers have a far easier time accessing and crawling your most important pages.</li>
<li><strong>Duplicate Content</strong>: If the same content appears at more than one web address, you’ve got duplicate content. While there is no duplicate content penalty, duplicate content can sometimes lead to ranking and traffic drops.
As <a href="https://moz.com/learn/seo/duplicate-content">Moz</a> puts it, this happens because Googlebot doesn’t know whether to direct the link metrics (trust, authority, anchor text, link equity, etc.) to one page or keep them separated between multiple versions and, secondly, it doesn’t know which version(s) to rank for query results.</li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-1.png" alt="" /></p>
<p>For example, Skroutz.gr has more than 3,000,000 products in 3,000 categories. It also uses a faceted navigation with more than 13,000 filters (which can be combined - up to 3 filters), three sorting options and an internal search function. Most of these options produce <strong>a unique page</strong> (URL).</p>
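<p>To get a feel for how quickly faceted navigation multiplies URLs, here is a rough back-of-the-envelope sketch. The 30-filter category is purely illustrative, and the bound assumes any filters can be combined, which real categories restrict:</p>

```python
from math import comb

def filter_page_upper_bound(num_filters: int, max_combo: int = 3) -> int:
    """Upper bound on distinct filter-combination pages for one category,
    counting every combination of 1 up to max_combo filters."""
    return sum(comb(num_filters, k) for k in range(1, max_combo + 1))

# A hypothetical category with just 30 applicable filters:
pages = filter_page_upper_bound(30)  # 30 + 435 + 4060 = 4525 potential URLs
```

<p>Even this modest hypothetical category yields thousands of potential URLs, before sorting options and internal search queries multiply them further.</p>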
<p>Thus, large sites’ administrators have to:</p>
<ul>
<li>automate the monitoring of the site’s SEO performance</li>
<li>watch out for thin or duplicate content issues so that they don’t confuse Google about which pages are essential</li>
<li>control which pages are crawled and indexed</li>
</ul>
<p><br /></p>
<h1 id="part-1-seo-analysis-february-2018">Part 1: SEO Analysis (February 2018)</h1>
<h3 id="what-issues-initiated-our-analysis">What issues initiated our analysis</h3>
<p>If everything goes like clockwork, with most of your rankings in the top 3 positions and stable organic traffic growth, it is hard to suspect that something might not be going so well SEO-wise. That was the case with Skroutz.gr back in 2018.</p>
<p>If you look at the graph of GA sessions over the past five years below, it’s evident that our traffic is increasing every year with a 15-20% YoY organic growth, even surpassing 30Μ monthly sessions (80% of that traffic is organic).</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-2.png" alt="" /></p>
<p>So what was the first sign that something was not going as expected?
Three issues raised flags, especially when we realized the correlation between them.</p>
<h4 id="1-index-size">1. Index Size</h4>
<p>The first one was the index size we saw on Search Console (nearly 25 million URLs), compared with the “real” number of pages we thought we had.</p>
<p>
<img src="/images/seo-crawl-budget-2019/seo-2019-3.png" style="width:auto" />
</p>
<h4 id="2-increased-time-for-new-pages-to-get-indexed-and-rank-high">2. Increased time for new pages to get indexed and rank high</h4>
<p>Delays in ranking recovery were more evident in cases where we had to break a broad category into 2-3 subcategories. This kind of splitting produces many new URLs, as well as many 301 redirects from the old URLs to the new ones (e.g., old filter URLs).</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-4.png" alt="Ahrefs History Chart for &quot;foam roller&quot; keyword" />
<small style="display: block; margin: 7px 0;">Ahrefs History Chart for “foam roller” keyword</small></p>
<p>For example, look at the data above from <a href="https://help.ahrefs.com/en/articles/580856-can-i-see-the-ranking-history-of-a-given-keyword">Ahrefs History Charts</a> for the keyphrase “foam roller” (click <a href="https://www.google.gr/search?q=foam%20roller&glp=1&adtest=on&tci=g:2300&uule=w+CAIQICIGR3JlZWNl&safe=images&safe=high">here</a> to see Greek SERPs). Foam Roller products used to be in a broader category called Gym Balance Equipment (<span style="color:green;">green line</span>). On 03/03/2018, the content team decided to create a new category named Foam Rollers (<span style="color:blue;">blue line</span>) and moved the relevant products there.</p>
<p>As you can see, historically, we ranked on 1st place for “foam rollers” with the internal search page <code class="language-plaintext highlighter-rouge">skroutz.gr/c/1338/balance_gym.html?keyphrase=foam+roller</code>. On 03/03/2018, we created <code class="language-plaintext highlighter-rouge">skroutz.gr/c/2900/foam-rollers.html</code> category and we redirected the first URL, plus a few hundred relevant URLs (e.g., <code class="language-plaintext highlighter-rouge">skroutz.gr/c/1338/balance_gym.html?keyphrase=foam+rollers</code>), to the latter.</p>
<p>Based on previous years’ stats, a new URL needed just a few days to a couple of weeks to recover its rankings, after the consolidation of the signals. Yet, in this case, it took almost three months (!) to rank in first place. Besides this, old redirected URLs remained indexed for months instead of being removed after a few days. That indicated that our crawling efficiency had decreased over the years.</p>
<h4 id="3-increased-time-for-metadata-to-refresh-in-google-index">3. Increased time for metadata to refresh in Google Index</h4>
<p>Titles and Meta Descriptions weren’t updated in Google’s index as fast as in the previous years, especially for pages with low traffic.</p>
<p>As a result, fresh content and schema markups (availability, reviews, and others) weren’t reflected in Google SERPs within a reasonable time.</p>
<h3 id="how-we-did-the-analysis">How we did the Analysis</h3>
<h4 id="step-1---a-first-look-at-the-problem">Step 1 - A first look at the problem</h4>
<p>At first, we wanted to validate our concerns about the index bloat of 25M pages. So, we tried to figure out how many of the 25M pages were actually supposed to be in the SERPs.</p>
<p>We drilled down into the different types of landing pages, estimating:</p>
<ul>
<li>the number of currently indexed pages per type, using <a href="https://moz.com/learn/seo/search-operators">Google Search Operators</a></li>
<li>their share of the total traffic, using Internal Analytics tools</li>
<li>the number of indexed pages we should have, based on some criteria like current or potential traffic</li>
</ul>
<table>
<thead>
<tr>
<th>Type Of Page</th>
<th style="text-align: center">Current Estimated Indexed Pages</th>
<th style="text-align: center">Share of Total Organic Traffic</th>
<th style="text-align: center">Indexed Pages we should have</th>
</tr>
</thead>
<tbody>
<tr>
<td>Homepage</td>
<td style="text-align: center">1</td>
<td style="text-align: center">24%</td>
<td style="text-align: center">1</td>
</tr>
<tr>
<td>Product Pages [like <a href="https://www.skroutz.gr/s/15809760/Apple-iPhone-XR-64GB.html">Apple iPhone XR (64GB)</a>]</td>
<td style="text-align: center">4.5M</td>
<td style="text-align: center">20%</td>
<td style="text-align: center">3M</td>
</tr>
<tr>
<td>Clean Category Pages [like <a href="https://www.skroutz.gr/c/3363/sneakers.html">Sneakers</a>]</td>
<td style="text-align: center">2500</td>
<td style="text-align: center">25%</td>
<td style="text-align: center">2450</td>
</tr>
<tr>
<td>Category Filter Pages [like <a href="https://www.skroutz.gr/c/3363/sneakers/f/936064/Stan-Smith.html">Stan Smith Sneakers</a> or <a href="https://www.skroutz.gr/c/3363/sneakers/m/1464/Nike.html">Nike Sneakers</a>]</td>
<td style="text-align: center">2.5M</td>
<td style="text-align: center">10%</td>
<td style="text-align: center">1.5M</td>
</tr>
<tr>
<td>Internal Search Pages [like <a href="https://www.skroutz.gr/c/108/game_consoles/m/2/Sony.html?keyphrase=ps4">ps4</a>]</td>
<td style="text-align: center">14M</td>
<td style="text-align: center">20%</td>
<td style="text-align: center">1M</td>
</tr>
<tr>
<td>Other Pages (Blog, Guides, <a href="https://www.skroutz.gr/comparelists/40?compare=17437356,19023344">Compare Lists</a>, Pagination, Parameters)</td>
<td style="text-align: center">4M</td>
<td style="text-align: center">1%</td>
<td style="text-align: center">1M</td>
</tr>
<tr>
<td>Total</td>
<td style="text-align: center">25M</td>
<td style="text-align: center">100%</td>
<td style="text-align: center">6.5M</td>
</tr>
</tbody>
</table>
<p>The results were stunning.</p>
<ul>
<li>Actually indexed pages versus our estimated pages differed by nearly <strong>19 million</strong> URLs</li>
<li>The <strong>Internal Search pages</strong> index bloat seemed the most crucial issue; we probably had tons of indexed pages with negligible traffic</li>
<li>Product and Filter Pages had a reasonable amount of low-quality pages</li>
<li>Pagination Pages were the top suspect for the 4M pages of the “Other Pages” type. <a href="https://searchengineland.com/google-no-longer-supports-relnext-prev-314319">This</a> announcement might explain why :-)</li>
</ul>
<p>To tackle these issues, we all agreed to face the problem starting with a quick sprint and following up with more sophisticated solutions down the road.</p>
<h4 id="step-2---setting-up-the-team-and-the-tools">Step 2 - Setting up the team and the tools</h4>
<p>After the above analysis, we formed a wider vertical “SEO purpose” team, which included SEO Analysts, Developers and System Engineers. This team would analyze the problem deeper, create an action plan and implement the proposals.</p>
<p>In our first meeting, we decided that an extensive crawl analysis was needed to fully understand the magnitude of the problem. We chose to set up an in-house real-time crawl monitoring tool instead of a paid solution, for the following reasons:</p>
<ul>
<li><strong>Scalability</strong>: analyze more than 25 million pages and see the changes in behavior every time we needed.</li>
<li><strong>Real-Time Data</strong>: see the impact on the behavior of the crawler, right after a significant change</li>
<li><strong>Customization</strong>: customize the tool and add whatever function we wanted for every different situation</li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-5.png" alt="" /></p>
<blockquote>
<p><strong>Note</strong>: Depending on your needs, you can use other paid tools such as <a href="https://www.deepcrawl.com/">Deepcrawl</a> or <a href="https://www.botify.com">Botify</a>, which have some handy ready-to-use features.</p>
</blockquote>
<p>As we already had some experience with the <a href="https://www.elastic.co/what-is/elk-stack">ELK Stack</a> (this is one of our primary analytics tools), we decided to set up an internal crawl monitoring tool using <a href="https://www.elastic.co/products/kibana">Kibana</a>.</p>
<p>Kibana is a powerful tool and helped us find a lot of significant crawl issues. If we had to choose just one thing that expanded our capabilities on crawl monitoring, that would be the annotation of pageviews with rich meta tags. With the use of rich meta tags, URLs carry additional structured information which provides a way to query a specific subset.</p>
<p>For example, let’s say that we have the URL: <a href="https://www.skroutz.gr/c/3363/sneakers/m/1464/Nike/f/935450_935460/Flats-43.html?order_by=popularity">skroutz.gr/c/3363/sneakers/m/1464/Nike/f/935450_935460/Flats-43.html?order_by=popularity</a></p>
<p>Some of the information that we inject on that URL is the following:</p>
<ul>
<li><strong>Page Type</strong>: Filter Page (other option could be Internal Search Page for example)</li>
<li><strong>Category ID</strong>: 3363</li>
<li><strong>Number of Filters Applied</strong>: 3</li>
<li><strong>Type of Filters</strong>: Normal Filter (Flats), Brand (Nike), Size (43)</li>
<li><strong>HTTP Status</strong>: 200</li>
<li><strong>URL Parameters</strong>: ?order_by=popularity</li>
</ul>
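<p>The annotation step can be approximated by parsing the URL itself. The sketch below is a hypothetical re-implementation, not our production code; the URL patterns follow the example above:</p>

```python
import re
from urllib.parse import urlparse, parse_qs

def annotate(url: str) -> dict:
    """Derive structured meta information (page type, category ID,
    number of filters, URL parameters) from a category URL."""
    parsed = urlparse(url)
    path, query = parsed.path, parse_qs(parsed.query)
    meta = {"page_type": None, "category_id": None,
            "num_filters": 0, "url_params": sorted(query)}
    cat = re.search(r"/c/(\d+)/", path)
    if cat:
        meta["category_id"] = int(cat.group(1))
    if re.search(r"/m/(\d+)/", path):
        meta["num_filters"] += 1            # a brand counts as one filter
    flt = re.search(r"/f/([\d_]+)/", path)
    if flt:
        meta["num_filters"] += len(flt.group(1).split("_"))
    if "keyphrase" in query:
        meta["page_type"] = "internal_search"
    elif meta["num_filters"]:
        meta["page_type"] = "filter"
    elif cat:
        meta["page_type"] = "category"
    return meta
```

<p>Run against the Nike sneakers URL above, this recovers the category ID 3363, the three applied filters and the <code class="language-plaintext highlighter-rouge">order_by</code> parameter.</p>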
<p>With that kind of information we are able to answer questions like:</p>
<ol>
<li>Does GoogleBot crawl pages with more than 2 filters enabled?</li>
<li>How much does the Googlebot Crawl a specific popular category?</li>
<li>Which are the top URL parameters that GoogleBot crawls?</li>
<li>Does GoogleBot crawl pages with filters like “Size” which are nofollow by default?</li>
</ol>
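<p>Once every Googlebot hit carries this metadata, such questions reduce to simple filters and aggregations. A toy illustration over a hand-made log (field names mirror the list above; the records are made up):</p>

```python
from collections import Counter

# Hypothetical annotated Googlebot hits, shaped like the meta fields above.
crawl_log = [
    {"page_type": "filter", "category_id": 3363, "num_filters": 3,
     "url_params": ["order_by"]},
    {"page_type": "internal_search", "category_id": 25, "num_filters": 0,
     "url_params": ["keyphrase"]},
    {"page_type": "filter", "category_id": 3363, "num_filters": 1,
     "url_params": []},
]

# Q1: does Googlebot crawl pages with more than 2 filters enabled?
deep_filter_hits = [h for h in crawl_log if h["num_filters"] > 2]

# Q3: which are the top URL parameters that Googlebot crawls?
param_counts = Counter(p for h in crawl_log for p in h["url_params"])
```
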
<blockquote>
<p><strong>Tip</strong>: You can use this information, not only for SEO purposes, but also for debugging.</p>
<p>For example, we use page load speed information to monitor the page speed per Page Type (Product Page, Category Page etc.) instead of monitoring just the average site speed.</p>
</blockquote>
<p>Imagine how much you can drill down to find Googlebot’s crawl patterns using simple <a href="https://www.elastic.co/guide/en/beats/packetbeat/current/kibana-queries-filters.html">Kibana Queries</a>.</p>
<blockquote>
<p><strong>How we inject the URL information</strong></p>
<p>We use custom HTTP headers. These headers flow through our application stack, and any component, like our Realtime SEO Analyser, can extract and process the information it needs. At the end, before the response is returned to the client, we strip the meta headers off.</p>
</blockquote>
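<p>One way to implement the stripping step is middleware at the edge of the stack. A minimal WSGI-style sketch, assuming a hypothetical <code class="language-plaintext highlighter-rouge">X-Meta-*</code> header prefix (our actual header names differ):</p>

```python
class StripMetaHeaders:
    """WSGI middleware sketch: internal X-Meta-* headers flow through the
    stack for analytics, then get stripped before the response reaches
    the client. The header prefix is illustrative."""

    def __init__(self, app, prefix="x-meta-"):
        self.app = app
        self.prefix = prefix

    def __call__(self, environ, start_response):
        def filtered_start(status, headers, exc_info=None):
            # keep only headers that do not carry internal metadata
            kept = [(k, v) for k, v in headers
                    if not k.lower().startswith(self.prefix)]
            return start_response(status, kept, exc_info)
        return self.app(environ, filtered_start)
```

<p>Upstream components can still read the internal headers; the client only ever sees the filtered set.</p>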
<p>To sum things up, Kibana gave us the ability to do three critical things:</p>
<ol>
<li>
<p>See every single Google Bot crawl hit on a <strong>real-time</strong> basis
<img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-6.png" alt="" /></p>
</li>
<li>
<p><strong>Narrow results</strong> with filters such as Product Category, URL Type (Product, Internal Search, etc.) and many more
<img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-7.png" alt="" /></p>
</li>
<li>
<p>Create <strong>Visualizations or Tables</strong> to monitor the crawl behavior thoroughly
<img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-8.png" alt="" />
<small style="display: block; margin: 7px 0;">The line chart shows monthly crawls for Category Filters, Internal Search and Product Pages.</small></p>
</li>
</ol>
<h4 id="step-3---conclusions-of-the-analysis">Step 3 - Conclusions of the Analysis</h4>
<p>After much digging through the crawl reports combined with traffic stats, and at least one month of continuous monitoring of both real-time data and previous months’ log data (at least ten months), we at last had our first findings. The most important are summed up below:</p>
<p><strong>The Good</strong></p>
<ol>
<li>GoogleBot crawled our most popular product pages (200k out of 4.5M) almost every day. These pages had high authority and many backlinks, so it was kind of expected</li>
</ol>
<p><strong>The Bad</strong></p>
<ol>
<li>With an average daily crawl budget of 1M and our index of 25M, Googlebot could only crawl 4% of our total pages every day</li>
<li>More than 50% of our daily crawl budget was spent on internal search pages, most of which had no traffic at all</li>
<li>
<p>In addition to the above, we saw a weird pattern: a significant volume of internal search URLs shared the same generic keyphrase. For example, the “v2” keyphrase appeared on thousands of URLs:</p>
<ul>
<li><a href="https://www.skroutz.gr/c/3363/sneakers.html?keyphrase=v2">skroutz.gr/c/3363/sneakers.html?keyphrase=v2</a></li>
<li><a href="https://www.skroutz.gr/c/663/Gamepads.html?keyphrase=v2">skroutz.gr/c/663/Gamepads.html?keyphrase=v2</a></li>
<li><a href="https://www.skroutz.gr/c/1850/Gaming_Headsets.html?keyphrase=v2">skroutz.gr/c/1850/Gaming_Headsets.html?keyphrase=v2</a></li>
</ul>
</li>
</ol>
<p>We never thought that the combinations of internal searches with category pages would be crawled and indexed at such a high rate.</p>
<p><br /></p>
<h1 id="part-2-action-plan-and-execution-feb-2018---june-2019">Part 2: Action Plan and Execution (Feb 2018 - June 2019)</h1>
<h3 id="action-plan">Action Plan</h3>
<p>After the analysis, our team decided on the next actions. Based on the findings, the most crucial problem was the crawl and index bloat of URLs with Internal Search Queries. We suspected the index bloat to be the main cause of the issues mentioned in Part 1.</p>
<p>We devised an action plan for the upcoming months consisting of two different projects:</p>
<h4 id="a-primary-crawl-budget-optimization-cbo-project">A. Primary Crawl Budget Optimization (CBO) Project:</h4>
<ol>
<li>Find and fix crawling loopholes which create more and more indexable internal search pages</li>
<li>Decrease the index size of internal search pages by removing or consolidating those pages accordingly</li>
</ol>
<h4 id="b-secondary-crawl-budget-optimization-cbo-project">B. Secondary Crawl Budget Optimization (CBO) Project:</h4>
<ol>
<li>Enhance the crawling and indexing of new URLs when we create a new category or merge two or more categories into one. We saw that rankings recovered very slowly in such cases</li>
<li>Create an alert mechanism for important crawl issues</li>
</ol>
<h3 id="execution">Execution</h3>
<h4 id="a-primary-crawl-budget-optimization-cbo-project-1">A. Primary Crawl Budget Optimization (CBO) Project:</h4>
<h5 id="1-find-crawling-loopholes">1. Find crawling loopholes</h5>
<p>At first, we wanted to see if any loopholes in our link structure allowed Googlebot to find new crappy internal search pages.</p>
<p>Before we get to the execution, it helps to understand how our search engine works and how internal search pages are created.</p>
<blockquote>
<p><strong>Search Function on Skroutz.gr</strong></p>
<p>As we said earlier, Skroutz.gr has always had search at the forefront, meaning that the vast majority of our users search for a product instead of just browsing. In fact, we have more than 600,000 searches per day!</p>
<p>That’s why we have a dedicated Search Team of 5 engineers who strive to enhance the user’s experience after they type a query into the search box. The Search Team has created dozens of mechanisms to make our search engine return, in most cases, high-quality and relevant results to the user. That’s why our bounce rate on those pages is very low (under 30%), near the site’s average.</p>
<p><strong>Internal Search Pages: How are they created?</strong></p>
<p>Firstly, we should point out that all internal search pages of Skroutz.gr have the parameter “?keyphrase=” on the URL.</p>
<p>There are two types of internal search pages; let’s see what they are.</p>
<p>After a user inputs a query into the search box, our search mechanism will try to find the most relevant results from all the categories and return</p>
<ul>
<li>a mixed category search page. Example: <a href="https://www.skroutz.gr/search?keyphrase=shoes">skroutz.gr/search?keyphrase=shoes</a></li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-9.png" alt="" /></p>
<ul>
<li>a dedicated category search page. Example: <a href="https://www.skroutz.gr/c/3363/sneakers.html?from=catspan&keyphrase=shoes">skroutz.gr/c/3363/sneakers.html?keyphrase=shoes</a></li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-10.png" alt="" /></p>
<p><strong>Important note:</strong> Every link on a mixed category search page points to a dedicated category search page. In our example, if a user is on <a href="https://www.skroutz.gr/search?keyphrase=shoes">skroutz.gr/search?keyphrase=shoes</a> and clicks on “Sneakers”, they will be taken to <a href="https://www.skroutz.gr/c/3363/sneakers.html?from=catspan&keyphrase=shoes">skroutz.gr/c/3363/sneakers.html?keyphrase=shoes</a>.</p>
<p>That’s how an internal search page is created. 95% of indexed internal search pages are dedicated category pages.</p>
</blockquote>
<p>It is now apparent what the loophole was… The few mixed category search pages had dozens of follow links to dedicated category search pages with the same query. With this loophole, every different search query could create hundreds of category internal search pages.</p>
<p>That’s why <a href="https://www.skroutz.gr/search?keyphrase=V2">skroutz.gr/search?keyphrase=v2</a> was creating tons of new dedicated category internal search pages like</p>
<ul>
<li><a href="https://www.skroutz.gr/c/3363/sneakers.html?keyphrase=v2">skroutz.gr/c/3363/sneakers.html?keyphrase=v2</a></li>
<li><a href="https://www.skroutz.gr/c/663/Gamepads.html?keyphrase=v2">skroutz.gr/c/663/gamepads.html?keyphrase=v2</a></li>
<li><a href="https://www.skroutz.gr/c/1850/Gaming_Headsets.html?keyphrase=v2">skroutz.gr/c/1850/Gaming_Headsets.html?keyphrase=v2</a></li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-11.png" alt="" />
<small style="display: block; margin: 7px 0;">Googlebot can follow all red links. Thus every time it had access to a mixed category page, it could follow and crawl hundreds of new internal search pages for every different category that matched with user query.</small></p>
<p>We fixed this issue by</p>
<ul>
<li>making all those links nofollow, except for some valuable, valid keyphrases (we will explain what a valid keyphrase is shortly)</li>
<li>checking the browsing and UX stats of the 70,000 most popular internal searches and redirecting 20,000 of them directly to a specific category filter or internal category search. As a result, both Googlebot and users won’t see the mixed category pages when there is no reason to.</li>
</ul>
<p>For example, we saw that more than 95% of the users who searched for <code class="language-plaintext highlighter-rouge">iphone</code> wanted to see the mobile phone and not any accessories. So, instead of showing a mixed category page:</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-12.png" alt="" /></p>
<p>We redirect the user directly to a dedicated category search, based on their search intent:</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-13.png" alt="" /></p>
<h4 id="2-decrease-index-size">2. Decrease index size</h4>
<p>After the fix of the issue with the mixed category pages, like <a href="https://www.skroutz.gr/search?keyphrase=V2">skroutz.gr/search?keyphrase=v2</a>, which created more and more new dedicated category search pages, it was time to deal with the latter.</p>
<p>Dedicated category search pages, like <a href="https://www.skroutz.gr/c/3363/sneakers.html?keyphrase=v2">skroutz.gr/c/3363/sneakers.html?keyphrase=v2</a>, made up an index of enormous size. So, we had to see how many of those pages were crawled by Googlebot and which of them had the quality to be indexed.</p>
<p>This task took us more than one year to finish (February 2018 till June 2019). It was massive and expensive in terms of hours and workforce, but it was worth it.</p>
<p>For this task, we decided to create a mechanism so that the SEO team could consolidate our no-index pages without the involvement of a developer.</p>
<p>But how and where could we consolidate the internal search pages?</p>
<p>That was pretty easy! We found out that most of the internal search URLs were near-duplicates of existing category filters. Example:</p>
<ul>
<li><a href="http://skroutz.gr/c/25/laptop.html?keyphrase=ultrabook">skroutz.gr/c/25/laptop.html?keyphrase=ultrabook</a> (Internal Search Page)</li>
<li><a href="https://www.skroutz.gr/c/25/laptop/f/343297/Ultrabook.html">skroutz.gr/c/25/laptop/f/343297/Ultrabook.html</a> (Filter Page)</li>
</ul>
<p>So, what have we done?</p>
<p>At first, we created a dashboard with all internal search keyphrases for every category, combined with traffic and number of crawls (we called it Keyphrase Curation Dashboard). As we said earlier, every keyphrase may be present on more than one internal search URL.</p>
<p>Then, we added quick action buttons, so the SEO team could take the following actions without the help of a developer:</p>
<ul>
<li>redirect (Consolidate)</li>
<li>noindex or</li>
<li>mark the keyphrase as a valid, valuable internal search URL</li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-14.png" alt="" /></p>
<p>When someone chooses the Redirect action, they are presented with a pop-up so they can choose the redirect targets (maximum 2 Filters + 1 Manufacturer Filter).</p>
<blockquote>
<p>Why did we group by keyphrase and not just URLs?</p>
<p>Because the same keyphrase is present in many URL combinations (Filter + Keyphrase), every action taken for one keyphrase could affect dozens of similar URLs and save us time.</p>
<p>For example, let’s say that we have these two internal search URLs with the keyphrase <code class="language-plaintext highlighter-rouge">ultrabook</code> in the Laptop category:</p>
<ol>
<li><a href="http://skroutz.gr/c/25/laptop.html?keyphrase=ultrabook">skroutz.gr/c/25/laptop.html?keyphrase=ultrabook</a> (Keyphrase)</li>
<li><a href="http://skroutz.gr/c/25/laptop/m/355/Asus.html?keyphrase=ultrabook">skroutz.gr/c/25/laptop/m/355/Asus.html?keyphrase=ultrabook</a> (Filter + Keyphrase)</li>
</ol>
<p>For both URLs, the dashboard would show us <code class="language-plaintext highlighter-rouge">ultrabook</code> as the keyphrase, but we know that the Laptop category has a filter for Ultrabooks.</p>
<p>We could select the Redirect action and choose the Ultrabook filter as the redirect target. The mechanism would then redirect the above URLs to the following URLs respectively:</p>
<ol>
<li><a href="https://www.skroutz.gr/c/25/laptop/f/343297/Ultrabook.html">skroutz.gr/c/25/laptop/f/343297/Ultrabook.html</a></li>
<li><a href="http://skroutz.gr/c/25/laptop/m/355/Asus/f/343297/Ultrabook.html">skroutz.gr/c/25/laptop/m/355/Asus/f/343297/Ultrabook.html</a></li>
</ol>
</blockquote>
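<p>The redirect action itself is mostly mechanical URL rewriting. A simplified sketch of that consolidation step (real slugs and filter IDs come from the curation dashboard; error handling omitted):</p>

```python
import re
from urllib.parse import urlsplit

def redirect_target(url: str, filter_id: int, filter_slug: str) -> str:
    """Rewrite an internal-search URL to the equivalent filter URL,
    preserving any manufacturer segment already in the path."""
    parts = urlsplit(url)
    base = re.sub(r"\.html$", "", parts.path)   # drop the trailing ".html"
    # rebuild without the query string, so "?keyphrase=..." disappears
    return f"{parts.scheme}://{parts.netloc}{base}/f/{filter_id}/{filter_slug}.html"
```

<p>Applied to the two ultrabook URLs above, this yields the plain and the Asus-filtered Ultrabook URLs respectively.</p>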
<p>The mechanism gathered an immense number of keyphrases, <strong>reaching 2.7 million in total!</strong> These 2.7M keyphrases were part of an estimated 14M indexed URLs.</p>
<p>After that, our team began manually curating these keyphrases, starting from the most popular in terms of traffic and crawl hits. Our dev team also helped with some handy automations, like grouping keyphrases with the same product results and handling them all together with one action.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-15.png" alt="" />
<small style="display: block; margin: 7px 0;">All the above internal search keyphrases had the same number of product results in the Laptops category. As you can see, they are all about Dell laptops, so they could all be redirected at once to the Dell filter.</small></p>
<p>This step helped to curate around 5% of the total keyphrases. The index size decreased in July 2018, from 25M to 21M, but it wasn’t enough.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-16.png" alt="" /></p>
<p>Along with our manual efforts, we created some automated scripts and mechanisms for redirecting and mostly no-indexing internal search pages. Some of the most important were the following:</p>
<ol>
<li><strong>No-index Scripts</strong>: We no-indexed all dedicated category search pages:
<ul>
<li>with zero organic sessions in the last two months or</li>
<li>nearly zero organic sessions and up to 3 crawls over the previous 6 months</li>
</ul>
</li>
<li><strong>Redirect Scripts</strong>: We Redirected a dedicated category search page:
<ul>
<li>to the clean category URL, if the search (keyphrase) was returning all the products of the category.
For example, the “sneaker” keyphrase was returning 100% of the products in the “Sneakers” category, so <a href="http://skroutz.gr/c/3363/sneakers.html?keyphrase=sneakers">skroutz.gr/c/3363/sneakers.html?keyphrase=sneakers</a> is redirected to the clean category listing</li>
<li>to a specific filter URL, using a script that could linguistically identify combinations of category names with filters or manufacturers just from the keyphrase.
For example, the “stan smith black” query matches two different filters: “Stan Smith” and “Black”. So, if a user searches for “<a href="http://skroutz.gr/c/3363/sneakers.html?keyphrase=stan+smith+black">stan smith black</a>”, they will be redirected to the category page with the two filters enabled.</li>
</ul>
</li>
</ol>
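<p>The linguistic matching in the second rule can be sketched roughly as follows. This is a hedged, minimal illustration: the filter names, made-up ids and greedy longest-first matching strategy are our assumptions, not the actual Skroutz script:</p>

```python
# Hypothetical filter table: lowercased filter name -> filter id.
# The ids (111, 222) are invented for illustration only.
FILTERS = {
    "stan smith": 111,   # filter "Stan Smith"
    "black": 222,        # filter "Black"
}

def match_filters(keyphrase):
    """Greedily match known filter names inside a keyphrase,
    preferring two-word names over single-word ones."""
    tokens = keyphrase.lower().split()
    matched, i = [], 0
    while i < len(tokens):
        two_words = " ".join(tokens[i:i + 2])
        if two_words in FILTERS:        # two-word filter name first
            matched.append(FILTERS[two_words])
            i += 2
        elif tokens[i] in FILTERS:      # then single-word filter names
            matched.append(FILTERS[tokens[i]])
            i += 1
        else:                           # unmatched token: skip it
            i += 1
    return matched
```

A query matching at least one filter would then be redirected to the corresponding filter (or filter-combination) URL; anything unmatched would stay in the manual curation queue.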
<blockquote>
<p><strong>Note</strong>: In the last few months, we have been running a more sophisticated mechanism that uses some intelligence from the above linguistic identifier script combined with other factors. The mechanism can decompose every search query, match its keyphrases to existing filters and redirect the internal search URL to the specific filter (or filter-combination) URL.</p>
<p>This mechanism handles a significant number of the daily internal search queries: 120,000 (18%).</p>
</blockquote>
<p>When everything that could be done by the SEO team, manually or automatically, was done, we took one final big step to curate the long tail of the internal search pages. We created a basic SEO training course, with workshops, 1-1 hands-on sessions and wiki guides, for many of the Content Team members. These members could, in turn, help us with the procedure. The SEO team, of course, always kept an eye on this ongoing process.</p>
<p>The sharing of this knowledge has greatly benefited us in many ways. For example, because of human curation for crawl budget optimization, our content teams gained a better view of the things our visitors are searching for, which helped them to create more useful category filters.</p>
<p>In conclusion, after nearly one year of manual and automated curation, we finally curated 2,700,000 keyphrases, which correspond to approximately 14,000,000 URLs!</p>
<p>Specifically, of the 2,700,000 internal search keyphrases:</p>
<ul>
<li><strong>2,200,000 were no-indexed</strong> (don’t expect these to be removed from the Google Index immediately; we saw delays ranging from a few days to a few months)</li>
<li><strong>300,000 were redirected</strong> to a filter page URL or a clean category URL</li>
<li><strong>200,000</strong> were marked as <strong>valid keyphrases</strong></li>
</ul>
<p>And that was our primary project.</p>
<p>Before we look at the results of our efforts, let’s briefly see what else we have done.</p>
<h4 id="secondary-seo-project-optimize-mergesplit-categories-and-alert-mechanism">Secondary SEO Project (Optimize Merge/Split Categories and Alert Mechanism)</h4>
<h5 id="1-optimize-seo-when-mergingsplitting-categories">1. Optimize SEO when merging/splitting categories</h5>
<p>While working on our primary project (crawl budget), we also allocated time for some secondary tasks.</p>
<p>The first was to optimize and automate our procedure for merging or splitting categories, so that we wouldn’t lose any SEO value and could provide a better user experience.</p>
<ul>
<li>By <strong>merging</strong> categories, we mean merging two different categories, like “Baby Shampoo” and “Kids Shampoo”, into one</li>
<li>By <strong>splitting</strong> categories, we mean dividing one category into two or more categories. For example, the “Jackets” category can be divided into “Women’s Jackets” and “Men’s Jackets”.</li>
</ul>
<p>All of the above result in lots of redirects, so the SEO juice must be “transferred” from the old URLs to the new ones. To optimize the whole procedure, we created an easy-to-use Merge/Split Tool, so that the content team (which is responsible for the products) can easily map the old URLs to the new ones.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-17.png" alt="" />
<small style="display: block; margin: 7px 0;">The merge tool shows all the filters from both categories, so the content team can map them or copy them. The mechanism will then use this information to make the redirects automatically.</small></p>
<h5 id="2-create-alert-mechanism">2. Create alert mechanism</h5>
<p>Alongside keyphrase curation, we built a mechanism that sends notifications to the SEO team when a critical crawling or indexing issue arises.</p>
<p>How does this mechanism work?</p>
<p>Depending on the alert type, the mechanism sends an alert when a numeric metric:</p>
<ul>
<li>exceeds a specific threshold (for example 20,000 Not Found Pages)</li>
<li>differs significantly from the normal statistical fluctuation of the last 30 days</li>
</ul>
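<p>Both rules can be sketched in a few lines; the 3-sigma cutoff below is an illustrative assumption (the real rules are configured in Grafana, not in code like this):</p>

```python
import statistics

def should_alert(value, history, threshold=None, sigma=3.0):
    """Fire when a metric exceeds a fixed threshold, or deviates more than
    `sigma` standard deviations from its recent (e.g. 30-day) history."""
    # rule 1: fixed threshold (e.g. 20,000 Not Found pages)
    if threshold is not None and value > threshold:
        return True
    # rule 2: significant deviation from normal statistical fluctuation
    if len(history) >= 2:
        mean = statistics.fmean(history)
        spread = statistics.stdev(history)
        if spread and abs(value - mean) > sigma * spread:
            return True
    return False
```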
<p>As for the tools we use, we have set up alert rules in <a href="https://grafana.com/docs/alerting/rules/">Grafana Alerting Engine</a> that get delivered to a Slack channel.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-18.png" alt="" /></p>
<p>After a notification is received, we use the Kibana monitoring tool to analyze the root of the problem in more depth.</p>
<p>Some examples of the alerts we have set:</p>
<ul>
<li><strong>Sitemap Differences</strong>: Before the daily update of our sitemap files, the mechanism compares each generated file with the already submitted one. If they differ significantly, the alert mechanism informs us and instantly blocks the sitemap submission until we validate the data</li>
<li><strong>Noindex Crawls</strong>: If crawls of Noindex Pages fall outside of a specified safe range</li>
<li><strong>Not Found Crawls</strong>: If crawls of 404 Pages fall outside of a specified safe range</li>
<li><strong>Redirect Counts</strong>: If crawls of pages with redirects enabled fall outside of a specified safe range</li>
</ul>
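<p>The sitemap-difference guard, for instance, could look roughly like this; the 10% cutoff and URL-set comparison are assumed values for illustration, not the actual implementation:</p>

```python
def safe_to_submit(old_urls, new_urls, max_change_ratio=0.10):
    """Allow sitemap submission only when the newly generated URL set does
    not differ too much from the already-submitted one."""
    if not old_urls:
        return True                     # first submission: nothing to compare
    changed = len(old_urls ^ new_urls)  # symmetric difference: added + removed
    return changed / len(old_urls) <= max_change_ratio
```

When this returns <code>False</code>, the submission would be blocked and an alert sent, exactly as the “Sitemap Differences” rule above describes.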
<p><br /></p>
<h1 id="part-3-results">Part 3: Results</h1>
<h3 id="1-decreased-index-size">1. Decreased Index Size</h3>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-19.png" alt="" /></p>
<p>The above graph shows our index size after seven months of hard work and 90% of keyphrases being curated.</p>
<p>As for today?</p>
<p>Even better!</p>
<p>We have now dramatically closed the gap between the actual and expected indexed pages, meaning we reduced the size from 25M to only 7.6M.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-20.png" alt="" /></p>
<p>One interesting thing that we observed is this:</p>
<p>GoogleBot doesn’t stop crawling a URL immediately, even if you mark it as no-index. So, if you think that a no-index tag will save your crawl budget instantly, you are wrong.</p>
<p>Notably, we saw that in some cases, GoogleBot returned after 2 or 3 months to crawl a no-index page. We created some metrics for these, and we saw that:</p>
<ul>
<li><strong>Only half of our no-index</strong> internal search URLs haven’t been crawled for at least <strong>three months</strong></li>
<li>Only <strong>38.28% of our no-index</strong> internal search URLs haven’t been crawled for at least <strong>six months</strong></li>
</ul>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-21.png" alt="" /></p>
<h3 id="2-increased-filter-crawl-rate">2. Increased Filter Crawl Rate</h3>
<p>If we take, for example, the fluctuation of Internal Search Pages Crawls (<span style="color:blue;">blue</span>) versus Filter Pages Crawls (<span style="color:green;">green</span>) during the last year, it’s clear that we forced Googlebot to crawl the Filter Pages more frequently than Internal Search Pages.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-22.png" alt="" /></p>
<h3 id="3-decreased-time-for-new-urls-to-be-indexed-and-ranked">3. Decreased time for new URLs to be indexed and ranked</h3>
<p>Instead of taking 2-3 months to index and rank unique URLs, as we saw in a previous example, the indexing and ranking phases now take only a few days.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-23.png" alt="" />
<small style="display: block; margin: 7px 0;">On 29/03/2019, we redirected an internal search page skroutz.gr/c/1487/Soutien.html?keyphrase=bralette to a Category page skroutz.gr/c/3361/Bralettes.html.</small></p>
<h3 id="4-filter-urls-have-increasing-visibility">4. Filter URLs have increasing visibility</h3>
<p>Take a look at the Data Studio chart below, with Search Console data (clicks) from June 2018 to May 2019. You can see how the organic traffic of Filter URLs is increasing, while the traffic of Internal Search Keyphrase URLs is slightly decreasing.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-24.png" alt="" /></p>
<h3 id="5-average-position-improved-pushing-up-impressions-and-clicks">5. Average Position Improved, pushing up impressions and clicks</h3>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-25.png" alt="" />
<small style="display: block; margin: 7px 0;">The table from Google Search Console compares the summer of 2018 (exactly when we started the SEO Project) versus the summer of 2019.</small></p>
<p><br /></p>
<h1 id="what-we-learned">What We Learned</h1>
<p>Over the past two years, we’ve learned a lot during this technical SEO project, and we want to share some things which could eventually help the community.</p>
<p>So here it goes; these are the five most important things we learned:</p>
<h5 id="takeaway-1">Takeaway 1:</h5>
<p>Crawl monitoring is a must for large sites. You can find insights you would never have guessed at. By monitoring, we don’t necessarily mean real-time monitoring like ours. You can also run a website crawler like <a href="https://www.screamingfrog.co.uk/seo-spider/">Screaming Frog</a> or <a href="https://sitebulb.com/">Sitebulb</a> every month, or after a critical change to your site. You would be amazed by the value you can gain by doing this.</p>
<p>An interesting example of the insights you can get from crawl monitoring comes from some critical issues we found when we switched our category listing product pages to React. Without getting into details, after the React deployment GoogleBot started crawling, at a very high rate, no-indexed pages that shouldn’t be crawled at all, despite their being nofollowed from every other link. With crawl monitoring, we were able to see immediately which types of pages had that issue.</p>
<p><img src="https://engineering.skroutz.gr/images/seo-crawl-budget-2019/seo-2019-26-27.png" alt="" /></p>
<p>We saw that most of the crawls on no-indexed pages were combinations of size and manufacturer filters on the category with ID 1764.</p>
<p>In the end, we found out that GoogleBot executed an inline <code class="language-plaintext highlighter-rouge"><script /></code> and interpreted some relative URL paths in it as regular URLs, which it then crawled at a high rate. We validated this assumption by adding a dummy URL to the script, which we later saw GoogleBot crawl.</p>
<hr />
<h5 id="takeaway-2">Takeaway 2:</h5>
<p>Googlebot doesn’t stop crawling immediately after you change a page to no-index. It can take some time. We saw no-indexed URLs being crawled for months before they were removed from the Google Index.</p>
<hr />
<h5 id="takeaway-3">Takeaway 3:</h5>
<p>Consolidating URLs can easily backfire if not done right. Every URL that is redirected to another must be highly related (nearly a duplicate) to its target. We have seen that redirects to irrelevant pages had the opposite effect.</p>
<hr />
<h5 id="takeaway-4">Takeaway 4:</h5>
<p>Always pay attention when merging or splitting categories. We saw that even if you keep your rankings stable, there might be a delay of up to a few months during which you can lose many clicks. Mapping old URLs to new ones with 301 redirects can really help.</p>
<hr />
<h5 id="takeaway-5">Takeaway 5:</h5>
<p>SEO is not a one-person show, or even a one-team show. Sharing SEO knowledge and cooperating with other teams can empower the entire organization in many ways. For example, the Search Team of Skroutz.gr did magnificent work by setting up most of the technical infrastructure for the tools and mechanisms we used in our SEO project.</p>
<p>Finally, you can’t imagine how many SEO issues we found thanks to feedback from other departments, such as the Content Teams and Marketing. Even the <a href="https://www.linkedin.com/in/bandito/">CEO of Skroutz.gr</a> himself helped a lot with technical issues we had (scripts etc.).</p>
<hr />
<p>That’s all folks.</p>
<p>Congratulations on getting to the very end of this quite large :-) case study!</p>
<p>Have you ever used any insights from the crawling behavior of GoogleBot to solve issues on your site? How did you deal with them? Let us know in the comments section below!</p>
<p><a href="https://www.linkedin.com/in/vgiannakouris">Vasilis Giannakouris</a>,<br />
on behalf of <a href="mailto:growth@teams.skroutz.gr">Skroutz SEO Team</a></p>
<style type="text/css">
.entry-content h3 {
line-height: 1.2;
}
.entry-content img {
margin: 20px 0;
}
.entry-content td {
background: #fafafa;
font-size: 12px;
}
.entry-content blockquote {
background: #f5f5f5;
padding: 20px 25px;
border-radius: 3px;
border: 1px solid #b5b5b5;
margin: 30px 0;
}
.entry-content p:last-child {
margin-bottom: 0;
}
.entry-content blockquote {
font-style: normal;
}
.entry-content blockquote p,
.entry-content blockquote li {
font-size: .9rem;
}
@media screen and (min-width: 48em) {
.entry-content blockquote p,
.entry-content blockquote li {
font-size: 1rem;
}
}
.entry-content a,
.entry-content code {
white-space: normal;
word-break: break-word;
}
</style>
<p><a href="https://engineering.skroutz.gr/blog/SEO-Crawl-Budget-Optimization-2019/">[Case Study] How we optimized our Crawl Budget</a> was originally published by Vasilis Giannakouris at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on October 30, 2019.</p>https://engineering.skroutz.gr/blog/agile-summit-athens-20192019-10-03T21:00:00+00:002019-10-03T21:00:00+00:00John Makridishttps://engineering.skroutz.gr<p>On September 19th and 20th we* attended <a href="https://agilesummit.gr/">Agile Summit in Athens</a>. Agile Summit is an international conference gathering world-class speakers, agile experts & practitioners from around the world. <a href="https://www.skroutz.gr/">Skroutz</a> supports Agile Summit and last year’s conference was quite inspiring, so we decided to attend it again this year. Here are our notes.</p>
<h1 id="applying-the-heart-of-agile">Applying the Heart of Agile</h1>
<p class="note">
From Alistair Cockburn, creator of Heart of Agile
</p>
<p><a href="https://agilesummit.gr/alistair-cockburn/">Alistair Cockburn</a> is one of the authors of the Agile Manifesto (2001) and shared with us the principles of the <a href="http://heartofagile.com/">Heart of Agile</a>. His point of view is that the whole idea of the Agile Manifesto is simple, but since 2001 Agile has become more and more complicated; more and more things have been piled onto it, and Agile became a complete industry. Heart of Agile says that we should go back to the essence, which is four words:</p>
<ul>
<li>Collaborate</li>
<li>Deliver</li>
<li>Reflect</li>
<li>Improve</li>
</ul>
<figure>
<a href="../../../images/agile-summit-2019/heart-of-agile.png" class="image-popup">
<img src="../../../images/agile-summit-2019/heart-of-agile.png" alt="image" />
</a>
</figure>
<p>He didn’t go extensively through the framework, and prompted the audience to see his full presentation from a <a href="https://heartofagile.com/video-of-the-latest-talk-on-heart-of-agile-by-alistair-cockburn-denmark-october-2018/">conference in Denmark</a>. He described a framework for learning and mastering skills, called Shu Ha Ri and Kokoro, which is also explained in the video, so it’s highly recommended.</p>
<h1 id="innovation-at-scale">Innovation at scale</h1>
<p class="note">
From Yariv Adan, Product Manager @Google
</p>
<p><a href="https://agilesummit.gr/yariv-adan/">Yariv</a> shared a few insights on how Google enables innovative products. His main focus was the <a href="https://www.youtube.com/watch?v=QMW8ZsXxOKw">20% time</a> projects, which are responsible for multiple products with more than 1B users, like Gmail, Google Translate or Google News. Achieving and maintaining that in these highly competitive markets requires constant product & technology innovation. He shared the observations and principles from his 10+ years of experience at Google: how ideas are generated and shared through TGIF meetings, Google’s <a href="https://rework.withgoogle.com/guides/managers-identify-what-makes-a-great-manager/steps/learn-about-googles-manager-research/">“what makes a great manager” research</a>, and continuous iteration. Some key points:</p>
<ul>
<li>Always focus on the User</li>
<li>Launch & iterate, rather than perfection</li>
<li>Ideas come from everywhere -> Share everything</li>
<li>Empower people -> Data, not opinions</li>
<li>Let people pursue their dreams</li>
</ul>
<figure>
<a href="../../../images/agile-summit-2019/innovation-at-scale-1.png" class="image-popup">
<img src="../../../images/agile-summit-2019/innovation-at-scale-1.png" alt="image" />
</a>
</figure>
<figure>
<a href="../../../images/agile-summit-2019/innovation-at-scale-2.png" class="image-popup">
<img src="../../../images/agile-summit-2019/innovation-at-scale-2.png" alt="image" />
</a>
</figure>
<figure>
<a href="../../../images/agile-summit-2019/innovation-at-scale-3.png" class="image-popup">
<img src="../../../images/agile-summit-2019/innovation-at-scale-3.png" alt="image" />
</a>
</figure>
<h1 id="lessons-from-an-ex-project-manager-turned-product-manager">Lessons from an ex-Project Manager turned Product Manager</h1>
<p class="note">
From Emma Sephton, Account Manager @ProdPad
</p>
<p><a href="https://agilesummit.gr/emma-sephton/">Emma Sephton</a> talked about the principles and best practices she learned and applied to help her with the transition from Project Manager to her new role as Product Manager. Key points:</p>
<figure>
<a href="../../../images/agile-summit-2019/lesson-from-1.jpg" class="image-popup">
<img src="../../../images/agile-summit-2019/lesson-from-1.jpg" alt="image" />
</a>
</figure>
<ul>
<li><strong>The customer should be at the centre</strong>. Typically, stakeholders and clients ask for a feature, not a solution. Listen to them to deeply understand their problems.</li>
<li><strong>Define the strategy and focus on it</strong>. Make it the first priority and <strong>learn to say no</strong> to irrelevant requests. The ‘<a href="https://www.mindtools.com/pages/article/newTMC_5W.htm">Five whys</a>’ technique will help you focus on the whys and not on the hows.</li>
<li>Represent the plan in a way that can be understood by everyone in the business. A <strong>roadmap</strong> (now-next-later) will help in that direction, while a <strong>time-based</strong> project plan, like a Gantt chart, might be more challenging since it may need to be redone many times.</li>
</ul>
<figure>
<a href="../../../images/agile-summit-2019/lesson-from-2.png" class="image-popup">
<img src="../../../images/agile-summit-2019/lesson-from-2.png" alt="image" />
</a>
</figure>
<ul>
<li><strong>Find a balance between</strong> focusing on <strong>strategy</strong> and day-to-day <strong>development</strong> involvement (which may be time-consuming).</li>
<li><strong>Use</strong> the <a href="https://www.eisenhower.me/eisenhower-matrix/">Eisenhower matrix</a> <strong>technique</strong> to define what is important and needs immediate action, and what can be delegated or even eliminated.</li>
</ul>
<figure>
<a href="../../../images/agile-summit-2019/lesson-from-3.png" class="image-popup">
<img src="../../../images/agile-summit-2019/lesson-from-3.png" alt="image" />
</a>
</figure>
<ul>
<li><strong>Focus on outcomes and not on outputs</strong>. An output is just an implemented feature, while an outcome is about meeting the objectives; it is a learning experience.</li>
</ul>
<figure>
<a href="../../../images/agile-summit-2019/lesson-from-4.jpg" class="image-popup">
<img src="../../../images/agile-summit-2019/lesson-from-4.jpg" alt="image" />
</a>
</figure>
<h1 id="empathy-is-a-technical-skill">Empathy is a technical skill</h1>
<p class="note">
From Andrea Goulet, CEO of Corgibytes
</p>
<p>What is empathy? Is it a feeling? Is it something technical people can’t access? Is it just a high-level, touchy-feely fad? Nope. <a href="https://agilesummit.gr/empathy-is-a-technical-skill/">Andrea</a> demonstrated how empathy is a crucial skill for developing software and focused on giving us practical, immediately actionable advice for <strong>making empathy a central focus of our daily development practice</strong>.</p>
<p><a href="https://agilesummit.gr/empathy-is-a-technical-skill/">Andrea</a> described the differences between cognitive and mirrored empathy, and how to train yourself to build stronger empathy, for example:</p>
<ul>
<li>Start with a broad topic</li>
<li>Use the fewest number of words</li>
<li>Avoid introducing words the speaker may not have heard</li>
<li>Try not to say “I”</li>
<li>Be supportive and present</li>
<li>Resist the urge to demonstrate how smart you are</li>
<li>Neutralize your reactions</li>
</ul>
<figure>
<a href="../../../images/agile-summit-2019/empathy-1.png" class="image-popup">
<img src="../../../images/agile-summit-2019/empathy-1.png" alt="image" />
</a>
</figure>
<figure>
<a href="../../../images/agile-summit-2019/empathy-2.png" class="image-popup">
<img src="../../../images/agile-summit-2019/empathy-2.png" alt="image" />
</a>
</figure>
<h1 id="conclusion">Conclusion</h1>
<p>Wrapping up, <strong>Agile Summit</strong>, as one of the biggest agile conferences in Southern Europe, was quite inspiring once again. The organisation of the conference was great, as were the talks. With so many interesting people to interact with, learn from and exchange experiences with, it clearly met our expectations. See you there next year! You are more than welcome to leave a comment.</p>
<p>Written By:</p>
<p>*
<em><a href="https://www.linkedin.com/in/stavroula-vasilopoulou-b7248730/">Stavroula Vasilopoulou</a>, <a href="https://twitter.com/giorgostsiftsis">Giorgos Tsiftsis</a>, <a href="https://www.linkedin.com/in/ioannismakridis" rel="nofollow">John Makridis</a>, <a href="https://gr.linkedin.com/in/dimitris-promponas-06131761">Dimitris Promponas</a>, <a href="http://vangeltzo.com/">Vagelis Tzortzis</a></em></p>
<p><a href="https://engineering.skroutz.gr/blog/agile-summit-athens-2019/">Agile Summit Athens 2019</a> was originally published by John Makridis at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on October 03, 2019.</p>https://engineering.skroutz.gr/blog/entropy-changes-in-debian2019-09-09T00:00:00+00:002019-09-09T00:00:00+00:00Alexandros Afentoulis, Nikos Kormpakishttps://engineering.skroutz.gr<h2 id="intro">Intro</h2>
<p>At Skroutz we operate a wide variety of services comprising the ecosystem
behind <a href="https://www.skroutz.gr">Skroutz.gr</a>, a comparison shopping engine which
evolved to an e-commerce marketplace. We run these services on our own
infrastructure, bare metal servers and virtual machines. All hosts are running
Debian GNU/Linux, which on July 6th 2019 had its latest stable release, called
Buster. Buster came with lots of changes in included packages, as expected in a
major release.</p>
<p>We started experimenting with dist-upgraded Buster hosts a couple of months
before the official release, as soon as Buster got in “freeze” state. This
strategy would give us a taste of what to expect with the new software versions
and how to get better prepared to smoothly upgrade the operating system
underneath our services with minimum disruption.</p>
<h2 id="the-problem">The problem</h2>
<p>The issue we’re going to discuss in this post manifests pretty simply: after
dist-upgrading a virtual machine to Buster and rebooting it, it took a couple
of minutes before we could actually regain access via ssh. Virtual machine
reboots are part of routine maintenance work to keep our services up-to-date
and secure. When orchestrating such work across a fleet of hundreds of hosts, we
certainly would like to avoid spending minutes before verifying that each host
did come back up and healthy.</p>
<h2 id="investigation">Investigation</h2>
<p>It’s widely known that virtual machines do not enjoy the privilege of high
quality randomness as the physical hosts do, since a virtual machine’s devices
are emulated by design, thus do not feature unpredictable behavior, a useful
ingredient for randomness <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>
<p>Various references, e.g. Debian bug reports <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>, suggested that this
behavior was to be attributed to OpenSSL and how it gathers entropy via the
<code class="language-plaintext highlighter-rouge">getrandom()</code> system call. But all these online references were not descriptive
enough or conclusive, so we opted for digging deeper and understand the issue.</p>
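<p>The blocking behavior of <code class="language-plaintext highlighter-rouge">getrandom()</code> is easy to observe from userspace. The following sketch uses Python’s <code class="language-plaintext highlighter-rouge">os.getrandom()</code> wrapper of the same system call; the <code class="language-plaintext highlighter-rouge">GRND_NONBLOCK</code> flag makes the call fail instead of blocking on an uninitialized CRNG (Linux-only, so we fall back to <code class="language-plaintext highlighter-rouge">os.urandom()</code> elsewhere):</p>

```python
import os

def read_entropy(n):
    """Read n random bytes without blocking on an uninitialized CRNG.

    With GRND_NONBLOCK, getrandom(2) fails with EAGAIN (BlockingIOError
    in Python) instead of blocking -- the very wait that stalls services
    such as sshd during early boot.
    """
    if not hasattr(os, "getrandom"):
        return os.urandom(n)            # non-Linux fallback
    try:
        return os.getrandom(n, os.GRND_NONBLOCK)
    except BlockingIOError:
        return b""                      # CRNG not seeded yet
```

On a long-running machine the CRNG is already seeded and this returns the requested bytes; during the first seconds of a VM boot it would return <code class="language-plaintext highlighter-rouge">b""</code>.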
<p>The kernel ring buffer displays important information coming from kernelspace,
and it’s the first place we looked. Consider this snippet from a Buster VM
that had just booted:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># journalctl -k | grep random</span>
Apr 17 12:05:06 somevm kernel: random: get_random_bytes called from start_kernel+0x93/0x531 with <span class="nv">crng_init</span><span class="o">=</span>0
Apr 17 12:05:06 somevm kernel: random: fast init <span class="k">done
</span>Apr 17 12:05:06 somevm kernel: random: systemd: uninitialized urandom <span class="nb">read</span> <span class="o">(</span>16 bytes <span class="nb">read</span><span class="o">)</span>
Apr 17 12:05:06 somevm kernel: random: systemd: uninitialized urandom <span class="nb">read</span> <span class="o">(</span>16 bytes <span class="nb">read</span><span class="o">)</span>
Apr 17 12:05:06 somevm kernel: random: systemd: uninitialized urandom <span class="nb">read</span> <span class="o">(</span>16 bytes <span class="nb">read</span><span class="o">)</span>
Apr 17 12:06:48 somevm kernel: random: crng init <span class="k">done
</span>Apr 17 12:06:48 somevm kernel: random: 7 urandom warning<span class="o">(</span>s<span class="o">)</span> missed due to ratelimiting</code></pre></figure>
<p>Three important points stand out:</p>
<ul>
<li>
<p>before anything else it’s the kernel entry point which requests randomness
with <code class="language-plaintext highlighter-rouge">get_random_bytes()</code> kernel function. We will explain its behavior and
usage below.</p>
</li>
<li>
<p>systemd (userspace) is also requesting randomness while bringing up the
system’s services</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">crng init</code> (crng stands for cryptographic random number generator) takes
almost 2 minutes since boot</p>
</li>
</ul>
<h3 id="kernels-get_random_bytes">kernel’s <code class="language-plaintext highlighter-rouge">get_random_bytes()</code></h3>
<p><code class="language-plaintext highlighter-rouge">get_random_bytes()</code> is an in-kernel interface to provide random bytes. In our
case, it is called from kernel’s entry point <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup> if <code class="language-plaintext highlighter-rouge">CONFIG_STACKPROTECTOR</code>
is set, which is true for kernels packaged in Debian. That message is printed
if <code class="language-plaintext highlighter-rouge">CONFIG_WARN_ALL_UNSEEDED_RANDOM</code> is not set (again true for Debian) to
inform us that we don’t have a fully seeded CRNG. In case you’re curious, these
numbers are required for GCC’s “stack-protector” feature. When a function gets
called, a random number is placed on the stack, just before the return address.
This number is called the “canary” and is validated when the function returns.
If an attacker performs a stack-based buffer overflow, the canary value will be
overwritten. The kernel will detect this attack and throw a kernel panic <sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>.</p>
<p>A quick look at the kernel codebase shows that it is unlikely that the
boot process will actually block here; rather, we have a clear indication that
the kernel’s CRNG is not properly initialized, and we’ll see how that affects
userspace processes that depend on it.</p>
<h3 id="systemd-sshservice">systemd ssh.service</h3>
<p>The following lines in dmesg show that systemd has started as well and that it
actually reads bytes from urandom, albeit uninitialized.</p>
<p>systemd allows us to print a tree of the time-critical chain of systemd units
(including services), along with the time spent starting each one. This is done
via:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># systemd-analyze critical-chain</span>
The <span class="nb">time </span>after the unit is active or started is printed after the <span class="s2">"@"</span> character.
The <span class="nb">time </span>the unit takes to start is printed after the <span class="s2">"+"</span> character.
graphical.target @1min 45.121s
└─multi-user.target @1min 45.121s
└─ssh.service @1min 34.242s +10.857s
└─network.target @3.887s
└─networking.service @1.096s +2.790s
└─network-pre.target @1.095s
└─ferm.service @288ms +807ms
└─systemd-journald.socket @287ms
└─system.slice @282ms
└─-.slice @282ms</code></pre></figure>
<p>It’s clear that the ssh service takes somewhat longer than usual to come up. Its
journal reads:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># journalctl -u ssh.service</span>
<span class="nt">--</span> Logs begin at Wed 2019-04-17 11:53:56 EEST, end at Wed 2019-04-17 11:56:43 EEST. <span class="nt">--</span>
Apr 17 11:54:00 somevm systemd[1]: Starting OpenBSD Secure Shell server...
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Start-pre operation timed out. Terminating.
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Control process exited, <span class="nv">code</span><span class="o">=</span>killed, <span class="nv">status</span><span class="o">=</span>15/TERM
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Failed with result <span class="s1">'timeout'</span><span class="nb">.</span>
Apr 17 11:55:30 somevm systemd[1]: Failed to start OpenBSD Secure Shell server.
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Service <span class="nv">RestartSec</span><span class="o">=</span>100ms expired, scheduling restart.
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Scheduled restart job, restart counter is at 1.
Apr 17 11:55:30 somevm systemd[1]: Stopped OpenBSD Secure Shell server.
Apr 17 11:55:30 somevm systemd[1]: Starting OpenBSD Secure Shell server...
Apr 17 11:55:41 somevm sshd[1184]: Server listening on 0.0.0.0 port 22.
Apr 17 11:55:41 somevm sshd[1184]: Server listening on :: port 22.
Apr 17 11:55:41 somevm systemd[1]: Started OpenBSD Secure Shell server.</code></pre></figure>
<p>It seems that ssh.service gets stuck in its <code class="language-plaintext highlighter-rouge">ExecStartPre</code> command:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># systemctl cat ssh.service | ag ExecStartPre</span>
<span class="nv">ExecStartPre</span><span class="o">=</span>/usr/sbin/sshd <span class="nt">-t</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">sshd -t</code> just checks the validity of configuration files and the sanity of keys.
So, why is it blocking? To get insight into why <code class="language-plaintext highlighter-rouge">ExecStartPre</code> times out, we
decided to instrument it like this:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c">#!/bin/sh</span>
strace <span class="nt">-f</span> <span class="nt">-c</span> <span class="nt">-w</span> /usr/sbin/sshd <span class="nt">-t</span> <span class="o">></span> /tmp/sshd_strace_<span class="sb">`</span><span class="nb">date</span> +%s<span class="sb">`</span> 2>&1</code></pre></figure>
<p>We basically wrap the <code class="language-plaintext highlighter-rouge">sshd</code> invocation with <code class="language-plaintext highlighter-rouge">strace</code> and instruct it to keep
aggregate time statistics about each system call made by the executable. Our
intention is to identify the system call sshd spends most of its time in
before finally getting killed by systemd.</p>
<p>After rebooting the VM we got our sshd strace logfiles:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># ls -l /tmp/sshd_strace*</span>
<span class="nt">-rw-r--r--</span> 1 root root 2152 Apr 17 12:49 /tmp/sshd_strace_1555494448
<span class="nt">-rw-r--r--</span> 1 root root 2152 Apr 17 12:49 /tmp/sshd_strace_1555494538</code></pre></figure>
<p>This is the output of the first attempt (which gets killed by systemd):</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># cat sshd_strace_1555494448</span>
% <span class="nb">time </span>seconds usecs/call calls errors syscall
<span class="nt">------</span> <span class="nt">-----------</span> <span class="nt">-----------</span> <span class="nt">---------</span> <span class="nt">---------</span> <span class="nt">----------------</span>
99.96 101.669156 101669156 1 getrandom
0.01 0.007609 7609 1 execve
0.01 0.006644 120 55 <span class="nb">read
</span>0.01 0.006289 49 128 mmap
0.00 0.004297 104 41 mprotect
<span class="o">[</span>...]
<span class="nt">------</span> <span class="nt">-----------</span> <span class="nt">-----------</span> <span class="nt">---------</span> <span class="nt">---------</span> <span class="nt">----------------</span>
100.00 101.706415 444 7 total</code></pre></figure>
<p>It’s self-evident that sshd spends virtually all of its time trying to acquire randomness
via the <code class="language-plaintext highlighter-rouge">getrandom()</code> system call.</p>
<p>The second systemd attempt to get sshd up actually succeeds with the strace log
reading:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># cat sshd_strace_1555494538</span>
% <span class="nb">time </span>seconds usecs/call calls errors syscall
<span class="nt">------</span> <span class="nt">-----------</span> <span class="nt">-----------</span> <span class="nt">---------</span> <span class="nt">---------</span> <span class="nt">----------------</span>
99.94 11.543144 11543143 1 getrandom
0.02 0.001813 34 52 close
0.01 0.001594 12 128 mmap
0.01 0.000753 16 47 openat
0.01 0.000585 10 55 <span class="nb">read
</span>0.00 0.000564 13 41 mprotect
<span class="o">[</span>...]
<span class="nt">------</span> <span class="nt">-----------</span> <span class="nt">-----------</span> <span class="nt">---------</span> <span class="nt">---------</span> <span class="nt">----------------</span>
100.00 11.549977 444 7 total</code></pre></figure>
<p>Notice that the second attempt succeeds (12:49:10) at exactly the moment
<code class="language-plaintext highlighter-rouge">getrandom()</code> returns a result, which coincides with the timestamp at which the
kernel’s entropy pool gets initialized:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># journalctl -k | grep random</span>
Apr 17 12:47:25 somevm kernel: random: get_random_bytes called from start_kernel+0x93/0x531 with <span class="nv">crng_init</span><span class="o">=</span>0
Apr 17 12:47:25 somevm kernel: random: fast init <span class="k">done
</span>Apr 17 12:47:25 somevm kernel: random: systemd: uninitialized urandom <span class="nb">read</span> <span class="o">(</span>16 bytes <span class="nb">read</span><span class="o">)</span>
Apr 17 12:47:25 somevm kernel: random: systemd: uninitialized urandom <span class="nb">read</span> <span class="o">(</span>16 bytes <span class="nb">read</span><span class="o">)</span>
Apr 17 12:47:25 somevm kernel: random: systemd: uninitialized urandom <span class="nb">read</span> <span class="o">(</span>16 bytes <span class="nb">read</span><span class="o">)</span>
Apr 17 12:49:10 somevm kernel: random: crng init <span class="k">done
</span>Apr 17 12:49:10 somevm kernel: random: 7 urandom warning<span class="o">(</span>s<span class="o">)</span> missed due to ratelimiting
<span class="c"># journalctl -u ssh.service</span>
<span class="nt">--</span> Logs begin at Wed 2019-04-17 12:47:25 EEST, end at Wed 2019-04-17 12:52:23 EEST. <span class="nt">--</span>
Apr 17 12:47:28 somevm systemd[1]: Starting OpenBSD Secure Shell server...
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Start-pre operation timed out. Terminating.
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Control process exited, <span class="nv">code</span><span class="o">=</span>killed, <span class="nv">status</span><span class="o">=</span>15/TERM
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Failed with result <span class="s1">'timeout'</span><span class="nb">.</span>
Apr 17 12:48:58 somevm systemd[1]: Failed to start OpenBSD Secure Shell server.
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Service <span class="nv">RestartSec</span><span class="o">=</span>100ms expired, scheduling restart.
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Scheduled restart job, restart counter is at 1.
Apr 17 12:48:58 somevm systemd[1]: Stopped OpenBSD Secure Shell server.
Apr 17 12:48:58 somevm systemd[1]: Starting OpenBSD Secure Shell server...
Apr 17 12:49:10 somevm systemd[1]: Started OpenBSD Secure Shell server.</code></pre></figure>
<p>Quick sidenote: we were curious why sshd calls <code class="language-plaintext highlighter-rouge">getrandom()</code> even when it is
merely validating its configuration. A quick look at sshd’s source code
shows that it seeds its RNG unconditionally during startup:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">ac</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">av</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">[...]</span>
<span class="n">seed_rng</span><span class="p">();</span>
<span class="p">[...]</span>
<span class="k">if</span> <span class="p">(</span><span class="n">test_flag</span> <span class="o">></span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="p">[...]</span>
<span class="n">parse_server_match_config</span><span class="p">(</span><span class="o">&</span><span class="n">options</span><span class="p">,</span> <span class="n">connection_info</span><span class="p">);</span>
<span class="n">dump_config</span><span class="p">(</span><span class="o">&</span><span class="n">options</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">[...]</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">seed_rng()</code> invokes <code class="language-plaintext highlighter-rouge">RAND_status()</code>, an OpenSSL library function which
ultimately executes <code class="language-plaintext highlighter-rouge">getrandom()</code>.</p>
<h3 id="changes-for-getrandom-system-call">Changes for <code class="language-plaintext highlighter-rouge">getrandom()</code> system call</h3>
<p>So we’ve identified that <code class="language-plaintext highlighter-rouge">ssh.service</code> blocks waiting for the <code class="language-plaintext highlighter-rouge">getrandom()</code> syscall.
Our focus then shifted to understanding why and when <code class="language-plaintext highlighter-rouge">getrandom()</code> blocks, and how that
relates to the kernel’s CRNG.</p>
<p>First, whether <code class="language-plaintext highlighter-rouge">getrandom()</code> reads from <code class="language-plaintext highlighter-rouge">/dev/urandom</code> or
<code class="language-plaintext highlighter-rouge">/dev/random</code>, and whether or not it blocks, is controlled by the relevant
flags: <code class="language-plaintext highlighter-rouge">GRND_RANDOM</code> and <code class="language-plaintext highlighter-rouge">GRND_NONBLOCK</code> (check <code class="language-plaintext highlighter-rouge">getrandom(2)</code> for more). A
quick search showed that neither OpenSSH nor OpenSSL (which OpenSSH relies on
for cryptography) sets any of these flags, meaning <code class="language-plaintext highlighter-rouge">getrandom()</code> exhibits its
default behavior: it blocks until the kernel’s CRNG is ready.</p>
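<p>These behaviors are easy to observe from userspace. Here is a minimal sketch, assuming a Linux host with Python 3.6+ (Python’s <code class="language-plaintext highlighter-rouge">os.getrandom</code> is a thin wrapper over the syscall):</p>

```python
import os

# Default behavior (no flags): reads the urandom source, but blocks
# until the kernel CRNG is initialized -- exactly what stalled sshd.
data = os.getrandom(16)
assert len(data) == 16

# GRND_NONBLOCK turns the potential block into an immediate error,
# so a caller can probe CRNG readiness instead of hanging.
try:
    os.getrandom(16, os.GRND_NONBLOCK)
    print("CRNG is ready")
except BlockingIOError:
    print("CRNG not yet initialized; getrandom() would block")
```

<p>On a long-running system the CRNG is already initialized, so both calls return immediately; only during early boot does the default path block.</p>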
<p>Since these flags are not set, either the system call or the CRNG must have changed
in the meantime. And that meant digging into kernel source code and git history… :D
Debian Stretch features kernels from the 4.9.x linux-stable tree, while
Debian Buster features kernels from the 4.19.x series.</p>
<p>Pondering over the output of <code class="language-plaintext highlighter-rouge">git log -p v4.9..v4.19 -- drivers/char/random.c</code>
is a truly enjoyable activity, but we’ll spare you the time and directly point
you to commit
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=43838a23a05fbd13e47d750d3dfd77001536dd33">43838a23a05fbd13e47</a>
by Theodore Ts’o. This commit is entitled <code class="language-plaintext highlighter-rouge">random: fix crng_ready() test</code> and
was introduced in linux 4.17 as a response to multiple
security
<a href="https://bugs.chromium.org/p/project-zero/issues/detail?id=1559">issues</a>
reported by Google’s Project Zero. It basically changes the <code class="language-plaintext highlighter-rouge">crng_ready()</code>
function to be stricter about when Linux’s CRNG is safe for cryptographic
use cases:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="n">diff</span> <span class="o">--</span><span class="n">git</span> <span class="n">a</span><span class="o">/</span><span class="n">drivers</span><span class="o">/</span><span class="kt">char</span><span class="o">/</span><span class="n">random</span><span class="p">.</span><span class="n">c</span> <span class="n">b</span><span class="o">/</span><span class="n">drivers</span><span class="o">/</span><span class="kt">char</span><span class="o">/</span><span class="n">random</span><span class="p">.</span><span class="n">c</span>
<span class="n">index</span> <span class="n">e027e7fa1472</span><span class="p">..</span><span class="n">c8ec1e70abde</span> <span class="mi">100644</span>
<span class="o">---</span> <span class="n">a</span><span class="o">/</span><span class="n">drivers</span><span class="o">/</span><span class="kt">char</span><span class="o">/</span><span class="n">random</span><span class="p">.</span><span class="n">c</span>
<span class="o">+++</span> <span class="n">b</span><span class="o">/</span><span class="n">drivers</span><span class="o">/</span><span class="kt">char</span><span class="o">/</span><span class="n">random</span><span class="p">.</span><span class="n">c</span>
<span class="err">@@</span> <span class="o">-</span><span class="mi">427</span><span class="p">,</span><span class="mi">7</span> <span class="o">+</span><span class="mi">427</span><span class="p">,</span><span class="mi">7</span> <span class="err">@@</span> <span class="k">struct</span> <span class="n">crng_state</span> <span class="n">primary_crng</span> <span class="o">=</span> <span class="p">{</span>
<span class="o">*</span> <span class="n">its</span> <span class="n">value</span> <span class="p">(</span><span class="n">from</span> <span class="mi">0</span><span class="o">-></span><span class="mi">1</span><span class="o">-></span><span class="mi">2</span><span class="p">).</span>
<span class="err">*/</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">crng_init</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="o">-</span><span class="err">#</span><span class="n">define</span> <span class="n">crng_ready</span><span class="p">()</span> <span class="p">(</span><span class="n">likely</span><span class="p">(</span><span class="n">crng_init</span> <span class="o">></span> <span class="mi">0</span><span class="p">))</span>
<span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="n">crng_ready</span><span class="p">()</span> <span class="p">(</span><span class="n">likely</span><span class="p">(</span><span class="n">crng_init</span> <span class="o">></span> <span class="mi">1</span><span class="p">))</span></code></pre></figure>
<p>But how does this commit affect the <code class="language-plaintext highlighter-rouge">getrandom()</code> syscall? The following block is
getrandom’s definition from Linux v4.9.144 (a kernel version shipped on Stretch
hosts), i.e. before <code class="language-plaintext highlighter-rouge">random: fix crng_ready() test</code> was applied.</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">crng_ready</span><span class="p">())</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&</span> <span class="n">GRND_NONBLOCK</span><span class="p">)</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EAGAIN</span><span class="p">;</span>
<span class="n">crng_wait_ready</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">signal_pending</span><span class="p">(</span><span class="n">current</span><span class="p">))</span>
<span class="k">return</span> <span class="o">-</span><span class="n">ERESTARTSYS</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nf">urandom_read</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">count</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span></code></pre></figure>
<p>Upon early boot, <code class="language-plaintext highlighter-rouge">getrandom()</code> would treat <code class="language-plaintext highlighter-rouge">crng_init == 1</code> as good enough and
would just return <code class="language-plaintext highlighter-rouge">urandom_read</code>, i.e. it would not block. This was not
considered “secure” enough. After <code class="language-plaintext highlighter-rouge">random: fix crng_ready() test</code> was applied,
<code class="language-plaintext highlighter-rouge">getrandom()</code>’s behavior changed: it would block (unless called with
<code class="language-plaintext highlighter-rouge">GRND_NONBLOCK</code>) until the CRNG was <em>really</em> cryptographically ready, i.e.
<code class="language-plaintext highlighter-rouge">crng_init == 2</code>.</p>
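<p>Restating the gate change in plain code makes the effect obvious. In the early-boot window where fast init has completed but the CRNG is not yet fully seeded, <code class="language-plaintext highlighter-rouge">crng_init == 1</code>; a sketch (Python, used here purely to mirror the kernel’s C macros) of the old versus new check:</p>

```python
# crng_init progresses 0 -> 1 (fast init done) -> 2 (crng init done).

def crng_ready_old(crng_init):
    # pre-4.17: #define crng_ready() (likely(crng_init > 0))
    return crng_init > 0

def crng_ready_new(crng_init):
    # after the fix: #define crng_ready() (likely(crng_init > 1))
    return crng_init > 1

# During the early-boot window the old kernel serves getrandom()
# immediately, while the patched kernel makes it block:
assert crng_ready_old(1) is True
assert crng_ready_new(1) is False

# Once "crng init done" is logged (crng_init == 2), both agree:
assert crng_ready_old(2) and crng_ready_new(2)
```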
<h2 id="resolution">Resolution</h2>
<p>As soon as we pinpointed why ssh (and other userspace software) could
block early in boot when calling <code class="language-plaintext highlighter-rouge">getrandom()</code>, we set out to evaluate possible
solutions. Our goal was to help the virtual machine gather “good enough”
entropy early in the boot process. Providing QEMU guests with quality entropy is
not a novel issue; rather, it’s a recurring one whenever a
cryptographically intensive application needs to operate inside a virtual machine.</p>
<p>We discarded the option of running a userspace daemon, such as HAVEGED, inside
every VM. Currently, as far as we know, there are no practical attacks
against HAVEGED, but it has received a lot of criticism for low-quality
entropy, state leaking, etc <sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup> <sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>. Also, from an infrastructure perspective,
we aim to provide everything that’s necessary to VMs without having to perform
modifications inside the guests. Users should be able to use our
virtualization infrastructure without having to modify images due to an
“unwanted” side-effect on the host’s kernel.</p>
<p>Instead, we preferred a cleaner approach and turned our attention to VirtIO RNG
<sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>. VirtIO RNG is a paravirtualized device for QEMU that exposes a hardware
RNG inside the guest. Enabling it for QEMU instances allows physical
hosts to inject randomness into virtual guests by exposing a special-purpose
device, <code class="language-plaintext highlighter-rouge">/dev/hwrng</code>. VirtIO RNG is configurable and can be wired up on the
host to retrieve entropy from various sources, such as <code class="language-plaintext highlighter-rouge">/dev/{,u}random</code> or
even a hardware RNG. The downside of this solution for us was that it was not
immediately available in our virtual machine cluster manager, Ganeti. Such a
missing feature can also be seen as a contribution opportunity, though! So Nikos
set out to implement what was missing for the KVM hypervisor in Ganeti <sup id="fnref:11" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">11</a></sup>.</p>
<p>In the meantime another possible solution emerged: <code class="language-plaintext highlighter-rouge">RDRAND</code>. This is an x86 CPU
instruction, available on modern Intel (Ivy Bridge and later) and AMD
processors, that returns random numbers supplied by the hardware’s
cryptographically secure pseudorandom number generator <sup id="fnref:12" role="doc-noteref"><a href="#fn:12" class="footnote" rel="footnote">12</a></sup>. In other words, one
may <em>trust</em> the physical CPU to provide “cryptographically secure” numbers. Using
<code class="language-plaintext highlighter-rouge">RDRAND</code> is possible under certain conditions, which we luckily met:</p>
<ul>
<li>
<p>The physical hosts’ CPUs have to support this instruction. In our case, all the bare
metal servers comprising our Ganeti cluster did indeed feature modern
enough Intel CPUs.</p>
</li>
<li>
<p>The Linux kernel has to use the randomness provided by the CPU. Indeed, this
functionality was added in Linux v4.19 by
<a href="https://lwn.net/ml/linux-kernel/20180718014344.1309-1-tytso@mit.edu/">Theodore Ts’o</a>
and has been enabled in
<a href="https://salsa.debian.org/kernel-team/linux/commit/9954895622f9a">Debian</a>
since <code class="language-plaintext highlighter-rouge">debian/4.19.20-1~9</code>.</p>
</li>
</ul>
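<p>Checking whether a fleet meets the first condition boils down to inspecting CPU flags. A small sketch (the <code class="language-plaintext highlighter-rouge">cpu_has_flag</code> helper is ours, for illustration) that parses <code class="language-plaintext highlighter-rouge">/proc/cpuinfo</code> on an x86 Linux host:</p>

```python
import os

def cpu_has_flag(flag, cpuinfo_path="/proc/cpuinfo"):
    """Return True if every CPU listed in cpuinfo advertises the flag."""
    flags_per_cpu = []
    with open(cpuinfo_path) as f:
        for line in f:
            # x86 cpuinfo lists capabilities on "flags" lines, one per CPU.
            if line.startswith("flags"):
                _, _, value = line.partition(":")
                flags_per_cpu.append(set(value.split()))
    return bool(flags_per_cpu) and all(flag in s for s in flags_per_cpu)

if os.path.exists("/proc/cpuinfo"):
    print("rdrand:", cpu_has_flag("rdrand"))
    print("rdseed:", cpu_has_flag("rdseed"))
```

<p>The same check from a shell is just <code class="language-plaintext highlighter-rouge">grep -q rdrand /proc/cpuinfo</code>. Note that inside a guest the flag shows up only if the hypervisor exposes it to the virtual CPU.</p>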
<p>Apart from <code class="language-plaintext highlighter-rouge">RDRAND</code>, newer Intel x86 CPUs expose yet another instruction, called
<code class="language-plaintext highlighter-rouge">RDSEED</code>. <code class="language-plaintext highlighter-rouge">RDSEED</code> returns numbers of “seed-grade entropy”: the output
of a true RNG, intended for software that seeds a pseudo-RNG. This would
provide even better quality entropy to our hosts, together with a possible
speed gain. Unfortunately, not all hosts in our fleet support this instruction,
so we dismissed the idea.</p>
<p>Finally, we were able to expose the <code class="language-plaintext highlighter-rouge">RDRAND</code> CPU flag to all our guests by simply
modifying the Ganeti cluster’s KVM <code class="language-plaintext highlighter-rouge">cpu_type</code> hypervisor parameter like so:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh">gnt-cluster modify <span class="nt">-H</span> kvm:cpu_type<span class="o">=</span>SandyBridge<span class="se">\\</span>,<span class="se">\\</span>+pcid<span class="se">\\</span>,<span class="se">\\</span>+invpcid<span class="se">\\</span>,<span class="se">\\</span>+rdrand</code></pre></figure>
<p>This allowed Buster guests to properly initialize their kernel CRNG, so that
subsequent calls to <code class="language-plaintext highlighter-rouge">getrandom()</code> no longer blocked.</p>
<p>Trusting the CPU to provide “cryptographically secure” random numbers may raise
some concerns, given that hardware vendors have been found to compromise their
products’ security and integrity when pressured or instructed by high-power,
high-influence institutions. ^_^ This is even highlighted by Theodore Ts’o in
the aforementioned commit. Our decision to use the <code class="language-plaintext highlighter-rouge">RDRAND</code>
instruction and trust the CPU came after weighing the relevant
parameters: we already trust the CPU for practically everything else, it being the
dominant component, and Debian has enabled this behavior by default.</p>
<h3 id="conclusion">Conclusion</h3>
<ul>
<li>
<p><code class="language-plaintext highlighter-rouge">RDRAND</code> and <code class="language-plaintext highlighter-rouge">RDSEED</code> help the kernel quickly initialize its CRNG,
so calls to <code class="language-plaintext highlighter-rouge">getrandom()</code> do not block and boot does not lag.
<code class="language-plaintext highlighter-rouge">RDRAND</code> provides an acceptable <em>seed</em> for randomness, not necessarily a high-quality
entropy flow. This should be acceptable for most applications/cases
where a pseudo-random generator like <code class="language-plaintext highlighter-rouge">urandom</code> is sufficient.</p>
</li>
<li>
<p>VirtIO RNG also solves the CRNG early boot starvation issue.</p>
</li>
<li>
<p>VirtIO RNG is the way to go when guest machines need high-quality (and
probably high volumes of) entropy.</p>
</li>
<li>
<p>VirtIO RNG support was not available for Ganeti at the time of our
investigation, but we worked on adding such a feature. We therefore judged <code class="language-plaintext highlighter-rouge">RDRAND</code>
an acceptable short-term solution and went for it.</p>
</li>
</ul>
<p>If you have any questions, ideas, thoughts or considerations, feel free to
leave a comment below.</p>
<h3 id="links">Links</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Publikationen/Studien/ZufallinVMS/Randomness-in-VMs.pdf <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>https://blogs.gentoo.org/marecki/2018/01/23/randomness-in-virtual-machines/ <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>https://elixir.bootlin.com/linux/v4.9.144/source/drivers/char/random.c#L52 <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910504 <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=912087 <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>https://lwn.net/Articles/584225/ <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:8" role="doc-endnote">
<p>http://www.diva-portal.org/smash/get/diva2:1141835/FULLTEXT01.pdf <a href="#fnref:8" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:9" role="doc-endnote">
<p>https://lwn.net/Articles/525459/ <a href="#fnref:9" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:10" role="doc-endnote">
<p>https://wiki.qemu.org/Features/VirtIORNG <a href="#fnref:10" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:11" role="doc-endnote">
<p>https://github.com/nkorb/ganeti/commits/feature/virtio-rng <a href="#fnref:11" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:12" role="doc-endnote">
<p>https://software.intel.com/en-us/blogs/2012/11/17/the-difference-between-rdrand-and-rdseed <a href="#fnref:12" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p><a href="https://engineering.skroutz.gr/blog/entropy-changes-in-debian/">Entropy changes in Debian or 'why a VM boots in 5 minutes?'</a> was originally published by Alexandros Afentoulis, Nikos Kormpakis at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on September 09, 2019.</p>https://engineering.skroutz.gr/blog/searching-at-skroutz-from-kafka-to-elasticsearch2019-08-26T00:00:00+00:002019-08-26T00:00:00+00:00George Papanikolaouhttps://engineering.skroutz.gr<h2 id="powering-search">Powering Search</h2>
<p>At Skroutz we make extensive use of
<a href="https://www.elastic.co/products/elasticsearch">Elasticsearch</a>. One of the
major use cases is powering the site’s search and filtering capabilities, which
assist our users in finding the product they are looking for. We are happy to
serve around 1.2M searches on an average day.</p>
<p>At the heart of search lies Elasticsearch and its documents. Each document
corresponds to a categorized, manufactured item available for sale, namely a
<em>Stock Keeping Unit</em>, or SKU for short. Searches require complex queries that
involve multiple attributes, composed from several of the SKU’s database record values
along with several fields calculated during serialization.
Some of those attributes are:</p>
<ul>
<li>SKU name</li>
<li>Category name</li>
<li>Manufacturer name</li>
<li>Minimum price</li>
<li>Current Availability</li>
</ul>
<p>Numerous changes, such as product price updates, are performed on our
relational database almost constantly. Modifications should be reflected in the
Elasticsearch index state with as little latency as possible, thus keeping the
search results up to date. The nature and origin of the changes varies, as we
collect the availability and price information from shops that we collaborate
with at regular intervals. In addition, our Content Teams continuously enrich
SKU, manufacturer or category information, which also may happen through
automated, complex pipelines, such as category classification operations.</p>
<p>It becomes apparent that we need a robust way to keep the database and the
Elasticsearch documents in sync. Our choice is asynchronous updates triggered
by hooking into
<a href="https://guides.rubyonrails.org/active_record_callbacks.html">ActiveRecord</a>, as
we are powered by <a href="https://rubyonrails.org/">Ruby on Rails</a>. We are writing to
the database synchronously, since we consider it our <em>ground truth</em>. However,
doing the same in Elasticsearch for every single event would add a major
performance overhead on each transaction, as the serialization process is
inherently expensive. Asynchronous updates allow for retries in case of
possible intermittent failures. The indexing operations are designed to be
idempotent and resilient to certain failure scenarios, so the sequence of
updates for a single document can be repeated or reordered, thus the index
state will eventually converge.</p>
<h2 id="the-beanstalk-era">The Beanstalk era</h2>
<p>Our legacy implementation used a popular tool called
<a href="https://github.com/beanstalkd/beanstalkd">beanstalk</a>; a work
queue daemon with a simple architecture. It accepts messages through the
network and holds everything in memory, while also employing a write-ahead log
for persistence. A <code class="language-plaintext highlighter-rouge">beanstalkd</code> process was co-located with every application
server of our fleet and every time an update occurred in the database, the
application enqueued a message to beanstalk. The worker process would then
consume the message and perform the necessary work.</p>
<figure>
<a href="../../../images/elastic_pipeline/beanstalk.png" class="image-popup">
<img src="../../../images/elastic_pipeline/beanstalk.png" alt="image" />
</a>
<figcaption>
<a href="../../images/elastic_pipeline/beanstalk.png">
Beanstalk pipeline architecture
</a>
</figcaption>
</figure>
<p>This pipeline has a few problems. The beanstalk ensemble is not centralized,
which translates to an uneven load distribution among workers. That
decentralized aspect also complicated our deployment process, as we had to
account for many hosts whenever we wanted to retry or debug something. Consider
what happens when a change affects multiple documents. An update on an associated
entity (such as a category name change) means that we need to update the
affected fields for all SKUs associated with said entity. As with individual
updates, this associated-entity update would be handled by a single application
server, so it would block that server’s entire queue while all the other workers sit
idle. As mentioned before, we have to use denormalization in several cases in
order to make the SKU attributes searchable.</p>
<p>Another big concern of ours was that updates for a single entity were not
ordered. An example will clarify the situation. Imagine two update events
for the same SKU occurring simultaneously or very close to each other. It’s a
matter of chance which application server will handle each request, and it is
almost certain that they will end up at different servers, and thus different
beanstalk queues. If the processing times overlap, a race condition can
occur. This is highly unlikely to happen, but we wanted to remove the
possibility entirely, since it adds mental overhead, particularly as the
application scales.</p>
<p>Here is a diagram illustrating the race condition:</p>
<figure>
<a href="../../../images/elastic_pipeline/beanstalk-flow.png" class="image-popup">
<img src="../../../images/elastic_pipeline/beanstalk-flow.png" alt="image" />
</a>
<figcaption>
<a href="../../images/elastic_pipeline/beanstalk-flow.png">
Beanstalk flow race condition
</a>
</figcaption>
</figure>
<p>This solution served us very well for many years, but due to our scaling and
operational needs, we decided it was time to move on to more sophisticated
pipelines.</p>
<h2 id="considerations-for-the-new-message-queue">Considerations for the new Message Queue</h2>
<p>Given that we usually have to process hundreds of thousands of updates daily, it was
necessary to decouple them from the primary database updates and
to be able to keep track of, monitor, and possibly automatically retry them in
case of an intermittent failure. Another concern, of an operational nature, is the
ability to perform a point-in-time recovery process, which may be needed in
case of a bug or when an index modification requires
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.4/docs-reindex.html">re-indexing</a>.
In this case, we need to identify which documents were modified during a given
time range and be able to perform the necessary update operations again,
so that the Elasticsearch state eventually converges.</p>
<p>As discussed, we needed to:</p>
<ul>
<li>Eliminate race conditions (strict ordering)</li>
<li>Introduce distributed processing (horizontal scaling)</li>
<li>Introduce persistence</li>
<li>Introduce pause and rewind capabilities</li>
</ul>
<p>Regarding the concurrency issues, we could take advantage of <a href="https://www.elastic.co/blog/elasticsearch-versioning-support">Elasticsearch
versioning</a>.
Provided that we always sent the current version of the document along
with each update request, this technique would eliminate our potential race
conditions. However, it would increase contention on both the Elasticsearch
cluster and our database, since the database would also have to be involved
to store the current Elasticsearch document version.</p>
<p>After some whiteboard sketches, we decided to go with <a href="http://kafka.apache.org">Apache
Kafka</a>, as the use case seemed well suited for it. We
are already <a href="https://engineering.skroutz.gr/blog/kafka-rails-integration/">huge
fans</a> of the
system and we have a production cluster deployed for <a href="https://engineering.skroutz.gr/blog/rewriting-web-analytics-tracking-in-go/">other company
projects</a>,
so this was a no-brainer.</p>
<h2 id="the-new-pipeline">The new pipeline</h2>
<p>Kafka is a distributed log at its
core, offering by default both distributed processing and strict ordering
guarantees. Both of these aspects are a result of an ingenious and pretty
simple decision. In Kafka, a stream of records is called a topic. A topic is
split into partitions, and the cluster allows only a single consumer within a
consumer group to read from a partition. To accomplish strict ordering and avoid race conditions, we
also need all messages that concern the same entity to be consistently stored
at the same partition. Since the client determines the partition of the topic
that a message will be stored in, this can be accomplished by using the
document ID (database primary key) as a key. Partitioning schemes may vary,
with the simplest being hashing the key value and applying a modulo operation,
with the divisor being the total number of partitions.</p>
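<p>The hash-modulo scheme can be sketched in a few lines of Ruby. This is a toy partitioner for illustration only: Kafka’s default partitioner hashes keys with murmur2, while CRC32 is used here simply because it ships with the Ruby standard library.</p>

```ruby
require "zlib"

# Toy partitioner illustrating the hash-modulo scheme described above.
# Kafka's default partitioner uses murmur2 on the message key; CRC32 is
# used here purely for illustration.
def partition_for(key, num_partitions)
  Zlib.crc32(key.to_s) % num_partitions
end

# Messages keyed by the same SKU id always map to the same partition,
# which is what gives us per-document ordering.
```

<p>Since the mapping is deterministic, every update for a given document ID lands on the same partition, and therefore on the same consumer, in publication order.</p>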
<p>Furthermore, Kafka also offers <a href="https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines">substantial
throughput</a>,
by distributing partitions evenly across many machines (called brokers). All
published messages are persisted on disk, so there is no possibility of message
loss. Messages are not removed after being consumed and Kafka stores the
per-partition offset that each consumer group has reached. The retention
period is customizable, so messages remain available for several days after
they have been consumed.</p>
<p>This explanation could go on forever if we were to get into more intricate
details about Kafka, so we’ll refer you to the <a href="https://kafka.apache.org/documentation/">official
documentation</a>.</p>
<p>Our architecture can now distribute the load to multiple consumers while
also having persistent and centralized storage. It looks like this:</p>
<figure>
<a href="../../../images/elastic_pipeline/kafka.png" class="image-popup">
<img src="../../../images/elastic_pipeline/kafka.png" alt="image" />
</a>
<figcaption>
<a href="../../images/elastic_pipeline/kafka.png">
Kafka pipeline architecture
</a>
</figcaption>
</figure>
<p>This offers us a much more future-proof architecture that can withstand growth.
It gives us the ability to quickly add more resources to a bottlenecked
component. In case our load increases in the future, topics can easily be
repartitioned to allow for more consumers in a matter of minutes, thus allowing
us to add more workers to the pool. Kafka guarantees that after the rebalance,
the order is still strict and the updates are distributed and blazingly fast.</p>
<p>The use of Kafka also allows us to have more visibility and finer operational
control on the whole pipeline process. In the old architecture, all workers had
to be stopped for the process to be paused. However, in Kafka, the position of
each consumer (which is called <em>offset</em>) is maintained by the cluster and can
be rewound based either on a timestamp or on an explicit offset position.
Therefore, we are now one command away from rewinding the consumers to the
position they were, say, two hours ago. This is a tremendous gain, in cases of
bugs or maintenance windows.</p>
<h2 id="achieving-strict-ordering">Achieving strict ordering</h2>
<p>One of the biggest problems that we faced while implementing the aforementioned solution was bulk
updates. As described, there are some kinds of updates that concern multiple
documents, such as a category update. On our legacy pipeline, these updates
were handled by the Elasticsearch <a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.4/docs-bulk.html">Bulk
API</a>
mainly for performance reasons.</p>
<p>However, since we wanted to preserve strict ordering, we needed to do some kind
of <em>unrolling</em> of those bulk updates into their respective document level
updates and enqueue those documents consistently using the same topic and
message key. We’ll take the category update as an example again. If the
category has <code class="language-plaintext highlighter-rouge">N</code> SKUs, we need a service to produce <code class="language-plaintext highlighter-rouge">N</code> messages,
one update message for each SKU.</p>
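<p>As a rough sketch of this unrolling step in Ruby (the message shape, field names and <code class="language-plaintext highlighter-rouge">sku_ids</code> input here are hypothetical, not our actual schema):</p>

```ruby
# Hypothetical unrolling of a category update into per-SKU update messages.
# `sku_ids` stands in for the ids fetched from the primary database; each
# message is keyed by the document ID so it lands on the right partition.
def unroll_category_update(category_id, sku_ids)
  sku_ids.map do |sku_id|
    {
      key:   sku_id.to_s, # document ID (primary key) as the partition key
      value: { type: "sku_update", sku_id: sku_id, cause: "category_#{category_id}" }
    }
  end
end
```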
<p>Besides correctness, another reason to implement the unrolling process was to
ensure that processing time on the consumer remains low. Kafka is generally
optimized for short per-message processing times, and consumers are required to
send periodic heartbeats as a liveness check. Failing to send heartbeats
causes a session timeout. The timeout is configurable via the
<code class="language-plaintext highlighter-rouge">session.timeout.ms</code> setting, but a high value is not recommended.</p>
<p>If a consumer is executing a long-running process, the broker can potentially
consider the consumer inactive and will trigger a rebalance, thus removing it
from the consumer group. That same message, however, will be picked again by
another consumer, after the rebalance, since the cluster thinks that the
message has not been consumed yet. One can understand that if the job is
inherently big, this can go on forever, triggering rebalances and timeouts
every time and effectively bringing the whole pipeline to a halt.</p>
<p>Implementing the above correctly was tricky because the unrolling
process itself can end up going over the Kafka processing limits. We ended up
with a “two-level unrolling” technique. Processing a bulk update message will
first split the entire document collection to batches of a predefined size
(e.g. 1000) and produce one message for each batch. When each batch message is
in turn consumed, it produces the corresponding update messages in the
document-level update topic.</p>
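<p>The two levels can be sketched as follows; the batch size, message shapes and function names are illustrative assumptions, not our production code:</p>

```ruby
# Sketch of the two-level unrolling. Level 1 splits the affected ids into
# fixed-size batch messages; level 2, run when a batch message is consumed,
# emits the individual document-level updates.
BATCH_SIZE = 1_000

def batch_messages(sku_ids, batch_size: BATCH_SIZE)
  sku_ids.each_slice(batch_size).map { |slice| { type: "batch", ids: slice } }
end

def document_messages(batch_message)
  batch_message[:ids].map { |id| { key: id.to_s, type: "doc_update" } }
end
```

<p>This keeps each individual consume short: no single message ever requires producing more than one batch worth of downstream messages.</p>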
<p>For code simplicity, developer sanity, and correctness, we considered having a
dedicated topic for each different type of update, but we settled on two.
The first topic and its consumers handle the bulk updates and enqueue into the
second topic which actually performs the Elasticsearch write requests. Of
course, most flows in our application enqueue directly into the document-level
update topic.</p>
<h2 id="adaptive-throttling">Adaptive Throttling</h2>
<p>Early on during development, we encountered a problem. Now that the bulk
updates that come through are translated to document level updates, our system
could easily flood itself: producing a message to Kafka takes on the order
of a millisecond, and during unrolling we can potentially produce
hundreds of thousands of messages.</p>
<p>Therefore, bulk updates are expected to complete at a later time
(depending on the number of affected SKUs). Individual updates, on the
other hand, should be processed with low latency, as their changes are
generally expected to be visible in search results within seconds.</p>
<p>Kafka does not support priorities at all, and we could not implement a priority
system on top of it, because we would lose the strict ordering guarantee. We
needed a mechanism which would monitor and throttle the bulk consumer processes
specifically when there were more urgent updates that needed to pass through.</p>
<p>We ended up using an external counter in order to coordinate that process. The
concept was that we would allow only a certain number of updates that originate
from a bulk update operation to be enqueued within a certain time interval.</p>
<p>The flow is as follows:</p>
<ol>
<li>A new bulk update is generated and is consumed.</li>
<li>It is <em>unrolled</em> into smaller batches, each one covering a different range
of the SKU primary-key space.</li>
<li>Batch messages are again consumed by the same consumer. If the counter is
zero, the consumer will increment it by the size of the batch. Otherwise, it
switches to a polling mode until it becomes zero, thus throttling the
process.</li>
<li>The consumer will then proceed to enqueue the document level messages.</li>
<li>The document-level consumers will pick them up, and upon
completion the counter will be decremented by one.</li>
</ol>
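<p>The counter protocol in steps (3)–(5) can be modelled with a small in-memory class. In production the counter lives in Redis and is shared by all consumer processes; this single-process sketch only illustrates the claim/decrement logic:</p>

```ruby
# In-memory model of the throttling counter (Redis INCRBY/DECR with a TTL
# in the real pipeline). A batch consumer may only enqueue its batch when
# the counter is zero; every completed document update decrements it.
class ThrottleCounter
  attr_reader :value

  def initialize
    @value = 0
  end

  # Claim capacity for a batch; returns false if the previous batch is
  # still in flight, in which case the batch consumer keeps polling.
  def try_claim(batch_size)
    return false unless @value.zero?

    @value += batch_size
    true
  end

  # Called by a document-level consumer when one update completes.
  def complete_one
    @value -= 1
  end
end

counter = ThrottleCounter.new
counter.try_claim(3)             # batch of 3 enters the pipeline => true
counter.try_claim(2)             # previous batch still in flight => false
3.times { counter.complete_one } # document-level consumers finish
counter.try_claim(2)             # window opens for the next batch => true
```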
<p>Eventually the counter will reach zero, when the batch is done, effectively
allowing the next batch to be enqueued. This enables time windows for other
updates to be enqueued and processed. Note that we also check whether we are
about to cross the Kafka <code class="language-plaintext highlighter-rouge">session.timeout.ms</code> limit at step (2) above, since the
total processing time should not exceed this threshold. So there are two
termination conditions for the polling loop.</p>
<p>The concept is that a feedback loop is established between the two consumers,
giving the batch consumer insight into whether the document-level consumers
have the capacity to process the next batch. Additionally, in cases where we
need to throttle more aggressively, we can reduce the batch size and the system
will adapt.</p>
<p><a href="https://redis.io/">Redis</a> was a strong candidate for such a counter since it
is accessible from all the consumers and can be easily monitored and operated
upon in case we needed to run ad-hoc commands for debugging reasons. Its
<a href="https://redis.io/commands/incr">atomic</a> operations and <a href="https://redis.io/commands/TTL">TTL
capabilities</a> were also important properties, as
we also have a TTL on the counter in case something goes wrong and becomes
stale.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We are pretty satisfied with this new pipeline, and we enjoyed the ride,
learning a lot about Kafka and distributed systems in general. Apart from much
greater performance, we feel our new architecture will last for many years to
come, as it offers huge flexibility to both our developer and operations teams.</p>
<p>If you have any questions, ideas, thoughts or considerations, feel free to
leave a comment below.</p>
<p><a href="https://engineering.skroutz.gr/blog/searching-at-skroutz-from-kafka-to-elasticsearch/">Searching at Skroutz: from Kafka to Elasticsearch</a> was originally published by George Papanikolaou at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on August 26, 2019.</p>https://engineering.skroutz.gr/blog/speeding-up-build-pipelines-with-mistry2019-08-23T00:00:00+00:002019-08-23T00:00:00+00:00Agis Anastasopouloshttps://engineering.skroutz.gr<p>Maintaining a high velocity in development teams requires us to continuously
improve our daily workflows. Build pipelines in particular
make up a big chunk of these workflows, since they’re involved whether
we’re developing, testing or deploying our code.</p>
<p>At Skroutz it’s not unusual to perform over 30 deployments during the
course of a day, while the test suite needs to be run even more frequently.
And that’s for the main application only.</p>
<p>As our organization grew, certain build pipelines got slow to the point where
they became too disruptive. After all, each minute we’re
waiting for a deployment to finish means we can’t work on things that matter.</p>
<p>In this post we will see how these issues led us to create
<a href="https://github.com/skroutz/mistry">mistry</a>, an open source general-purpose
build server.</p>
<h2 id="background">Background</h2>
<p>Our infrastructure is hosted and maintained in-house, so it was a
straightforward process to determine where the majority of time was spent
during our most critical pipelines.</p>
<p>With proper instrumentation set up, we could start pinpointing significantly
slow processes in our daily workflows.</p>
<h4 id="asset-compilation">Asset compilation</h4>
<p><a href="https://guides.rubyonrails.org/asset_pipeline.html">Asset Pipeline</a> is
the Ruby on Rails component that takes care of minifying, concatenating,
obfuscating and compressing web assets (mostly JS and CSS files); a process
called <em>asset compilation</em>. The compiled asset files are those served to the
end users. This can be a slow process depending on the size of the
application.</p>
<p>In most conventional Rails setups, asset compilation
happens as part of the deployment process. To deploy the main application,
we use <a href="https://capistranorb.com/">Capistrano</a>:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>cap production deploy</code></pre></figure>
<p>Capistrano then takes over and sequentially executes a bunch of
commands (copy the new code to the application servers, restart services, etc.).
One of these commands is the following:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="nv">$ </span>rails assets:precompile</code></pre></figure>
<p>This compiles the asset files and saves them to a specific path in the
local file system. Eventually the files are copied to the
application servers ready to be served to the end users.</p>
<p>In our setup, deployment commands (including asset compilation) are executed
by a dedicated machine, unsurprisingly called the Deployer. At a high level, the
process is illustrated in the following diagram.</p>
<figure>
<a href="../../../images/mistry/before.png" class="image-popup">
<img src="../../../images/mistry/before.png" alt="image" />
</a>
<figcaption>
<a href="../../images/mistry/before.png">
Deployment flow before mistry
</a>
</figcaption>
</figure>
<p>Deployer is a black box for most development teams, which means there
is no visibility into the asset compilation process.
For example, one cannot easily inspect the compiled assets for development
or debugging purposes.</p>
<p>Most important is the fact that asset compilation is <em>tightly coupled to
the deployment process</em>. This has important ramifications, one of them being the
fact that when a revision is deployed to staging and then to production,
assets have to be compiled separately <em>each time</em> for both
environments, even though the resulting files are identical.</p>
<h4 id="dependency-resolution">Dependency resolution</h4>
<p>Another process significantly slowing down our workflows was dependency
resolution.</p>
<p>In order for the main application to boot, its runtime dependencies must be
present in the system. This means that CI workers,
application servers and engineers must all go through dependency
resolution multiple times a day.</p>
<p>Dependencies are essentially Ruby libraries (a.k.a. gems) that are
managed by <a href="https://bundler.io/">Bundler</a>. Given some files that describe
the set of application dependencies along with their version constraints, Bundler
decides which gems are needed and downloads them.</p>
<p>A typical Rails monolith contains hundreds of dependencies, which makes
dependency resolution a slow process since it involves a lot of network I/O.</p>
<h2 id="the-premise">The premise</h2>
<p>By reflecting on the aforementioned processes, we spotted an opportunity to
save significant amounts of time and resources in a non-disruptive manner;
that is, without major changes to our infrastructure or
workflows.</p>
<p>We noticed a common pattern among these pipelines: a command is executed
with a certain input, <em>we wait until it’s finished</em> and then use its output. The
key observation however, is that the <em>output is purely dependent on the input</em>.</p>
<ul>
<li>in asset compilation the input is the application source code
(anyone with the code can compile the assets), while the output
is the actual assets (CSS, JS files).</li>
<li>in dependency resolution the input is the set of files that describe the
application’s dependencies and their versions, <a href="https://bundler.io/man/gemfile.5.html">Gemfile and Gemfile.lock</a>,
while the output is the resulting gem bundle (i.e. Ruby
source files).</li>
</ul>
<p>Given the above observations, we had some ideas in mind.</p>
<p>Since we know the command
will be executed sooner or later
(e.g. assets <em>will have</em> to be compiled when we eventually deploy), <strong>we can
execute it now and save its output for whenever it’s needed</strong>. So by
the time it’s actually needed, the output will be readily available,
saving a lot of time in the otherwise slow process.</p>
<p>For example, we can compile the assets right after a commit is
pushed to the master branch. This way deployment will not stall waiting for
the asset compilation; the assets will be ready and will be shipped right away
to the application servers.</p>
<p>Furthermore, the fact that the output is purely dependent on the input means
<strong>we can save outputs of individual command executions and reuse them when
identical commands (i.e. same input) are to be executed</strong>.</p>
<p>For example,
given a Gemfile and Gemfile.lock,
we can perform the dependency resolution once, save the resulting bundle and
reuse it between multiple machines that would otherwise have to go through the
same resolution process again.</p>
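<p>The second idea can be sketched as a content-addressed cache: hash the inputs, and reuse a stored artifact for identical invocations. This is a toy model of the concept, not mistry’s actual implementation:</p>

```ruby
require "digest"

# Toy model of the core idea: when the output depends only on the input,
# a build result can be cached under a digest of its inputs and reused
# for identical invocations instead of being recomputed.
class BuildCache
  def initialize
    @store = {}
  end

  # `inputs` is e.g. the contents of Gemfile and Gemfile.lock; the block
  # is the expensive build step, executed at most once per distinct input.
  def build(inputs)
    key = Digest::SHA256.hexdigest(inputs.sort.join("\0"))
    @store.fetch(key) { @store[key] = yield }
  end
end

cache = BuildCache.new
cache.build(["gem 'rails'"]) { "expensive bundle" } # runs the build
cache.build(["gem 'rails'"]) { "expensive bundle" } # cache hit, no work
```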
<p>Both of these optimizations could save us a lot of time and computational
resources.</p>
<h2 id="the-solution">The solution</h2>
<p>To bring the above ideas to life, we imagined some kind of build server able
to execute arbitrary commands inside isolated environments (we’ll call
these executions “builds”).</p>
<p>Builds produce a desired output (we’ll call these “artifacts”) that is
saved on the server and is readily available to anyone who needs it.</p>
<p>Builds can be scheduled by humans and machines alike and the resulting
artifacts can be downloaded from the server. Progress of builds can be
inspected via a web interface exposed by the server.</p>
<p>Together with the server we imagined an accompanying CLI client, offering a
drop-in replacement for the currently slow commands in our existing
pipelines. So <strong>instead of executing the actual command, we would execute the
CLI that schedules a build in the server, waits until it’s complete and then
downloads the resulting artifacts</strong>.</p>
<p>The end result would be the same as before: some files (the artifacts) are saved
in the system that executes the command. In the case of web assets,
the asset files are placed under <code class="language-plaintext highlighter-rouge">public/assets</code>. In the case of Bundler, the
gem files are placed under <code class="language-plaintext highlighter-rouge">vendor/bundle</code>.</p>
<p>This way changes in our workflows are kept to a minimum. For example, in the
deployment process only a single line would have to change, from:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># compiles assets and saves them to public/assets/</span>
<span class="nv">$ </span>rails assets:precompile</code></pre></figure>
<p>to:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># schedules a build to compile the assets, waits until it's finished and</span>
<span class="c"># downloads the resulting artifacts to public/assets/</span>
<span class="nv">$ </span>imaginary-cli build rails-assets <span class="nt">--path</span> public/assets/</code></pre></figure>
<p>After this seemingly small change however, the pipeline would be much more
efficient:</p>
<ul>
<li>work is performed <em>at most once</em> since results are reused between identical
command invocations.</li>
<li>work is performed eagerly so that <em>results are readily available by the time
they’re needed</em>.</li>
</ul>
<p>These optimizations minimize resource consumption in terms of CPU, memory and
network bandwidth but, more importantly, they make the develop-test-ship cycle
faster by reducing the execution time of our core pipelines.</p>
<h2 id="implementation">Implementation</h2>
<p>After some brainstorming sessions we had the main idea sketched out. We moved
forward with a prototype implementation after setting the initial requirements:</p>
<ol>
<li>custom build recipes and execution environments should be supported
(we call these “projects”). Anyone should be able to add their
own project.</li>
<li>builds should run in isolation from one another and in a sandboxed
environment.</li>
<li>builds should be parameterized. For example, we should be able to compile the assets
of our Rails application for a <em>specific revision</em> (i.e. SHA1 of a commit).</li>
<li>builds should be optionally incremental (a.k.a. partial builds). The Rails Asset
Pipeline for example, caches intermediate files when compiling assets so that
subsequent compilations are faster. Similarly, Bundler skips gems that are
already present in the file system. To support such cases, the server should
optionally persist selected files across builds of the same project.</li>
</ol>
<p>Containers were a natural fit for the first two requirements. We
decided that build recipes would be provided in the form of Dockerfiles.
This makes builds essentially Docker images that are executed to produce
the desired artifacts. Containers provide us with the isolation we want, while
engineers can run the builds in their own machines for debugging purposes,
using the very same images the server uses.</p>
<p>We decided that the server would expose a JSON API for clients to interact with.
Together with the server (mistryd), a client CLI (mistry) would
be used to schedule builds by interacting with the JSON API.
When scheduling a build,
one has to specify the project (recipe) and optionally some build
parameters. After it’s scheduled, the CLI blocks until the build is finished and
finally downloads the resulting artifacts using rsync.</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c"># schedule a build with a custom parameter (commit) using the CLI client and</span>
<span class="c"># download the artifacts when finished</span>
<span class="nv">$ </span>mistry build <span class="nt">--host</span> mistry.skroutz.gr <span class="nt">--project</span> rails-assets <span class="nt">--commit</span><span class="o">=</span>ab34af</code></pre></figure>
<p>We chose the <a href="https://en.wikipedia.org/wiki/Rsync">rsync protocol</a> for
transferring build artifacts. This means network usage is
minimized since files
are only downloaded if they are not present (or if they have changed) in the local
file system. This is important since we knew the majority of web assets
remain unchanged between application revisions. The same is true of
dependencies: they don’t change very often between commits.</p>
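<p>A toy model of the saving: only artifacts whose content digest differs from the local copy need to be transferred. (rsync itself works on rolling block checksums rather than whole-file digests; this sketch only illustrates the effect.)</p>

```ruby
# `remote` and `local` map artifact paths to content digests. Only files
# that are missing locally, or whose content differs, cross the network.
def files_to_transfer(remote, local)
  remote.reject { |path, digest| local[path] == digest }.keys
end

remote = { "application.js" => "aaa", "application.css" => "bbb" }
local  = { "application.js" => "aaa", "application.css" => "old" }
files_to_transfer(remote, local) # => ["application.css"]
```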
<h4 id="choosing-a-file-system">Choosing a file system</h4>
<p>We knew artifacts could potentially occupy a lot of disk space, since
for instance, assets would have to be saved in the
server for <em>each revision</em> of the application. As another example, in case of
dependency resolution a gem bundle can easily result in hundreds of megabytes.
Keeping different gem bundles in the server
would quickly result in excessive disk space consumption. Fortunately there
was a way to tackle this issue.</p>
<p>The key observation here is that
<em>many of these artifacts are identical between builds</em>. For example, as we
mentioned above only some
assets usually change (if at all) between revisions of the application. Also
most of the dependencies are not changed between revisions, which means a
large portion of the gem bundles remains unchanged.</p>
<p><a href="https://en.wikipedia.org/wiki/Copy-on-write">Copy-on-write</a> (CoW) file
systems to the rescue. To minimize disk usage under such
access patterns and also support incremental builds, a file system with
copy-on-write semantics was a natural fit. In a CoW file system, even if multiple
copies of the same file exist (or large files with very few differences between
them), the data blocks <em>are not actually duplicated</em>. In our case where
most of the application assets and dependencies remain unchanged, this translates
to significant disk space savings.</p>
<p>In CoW file systems, <a href="https://en.wikipedia.org/wiki/Btrfs#Cloning">cloning</a>
files or entire directories is naturally a fast operation, since data blocks
are not actually copied in the traditional sense (i.e. they’re not duplicated).
This is a great fit for incremental builds, since we can
almost instantly copy the artifacts of a previous build to serve as a starting
point for a new one.</p>
<p>We went with <a href="https://en.wikipedia.org/wiki/Btrfs">Btrfs</a>, with which we were
already familiar, as our production file system. However, we designed mistry to
support <a href="https://github.com/skroutz/mistry/wiki/File-system-adapters">pluggable file system adapters</a>.
In that sense, adding support for another file system like ZFS is
fairly straightforward.</p>
<h2 id="the-result">The result</h2>
<p>After a few iterations we had a working build server
that served all of our aforementioned needs.</p>
<p>By incorporating mistry in our build pipelines, deployment times were reduced
by up to 11 minutes (that’s how much compiling the assets previously took).
The migration was transparent and didn’t disrupt any
workflows of the engineering teams. Nothing has changed on the surface,
yet things <em>have</em> changed under the hood. During
deployment for example, Deployer does not actually
compile the assets anymore but merely fetches them from mistry.</p>
<figure>
<a href="../../../images/mistry/after.png" class="image-popup">
<img src="../../../images/mistry/after.png" alt="image" />
</a>
<figcaption>
<a href="../../images/mistry/after.png">
Deployment flow after mistry
</a>
</figcaption>
</figure>
<p>We call mistry a <em>general-purpose</em> build server because it can be used to
speed up different kinds of pipelines. Asset compilation and Bundler dependency
resolution happened to be the cases that affected <em>us</em> the most, but there are
many other potential use cases. For instance, we plan on using it to speed up
<code class="language-plaintext highlighter-rouge">yarn install</code> invocations and we recently started using it for generating our
static documentation pages.</p>
<p><a href="https://github.com/skroutz/mistry">mistry</a> is open
sourced under the GPLv3 license. There are still a lot of rough edges
(e.g. the web view is a bare-bones page without much functionality outside of
showing logs) but the core is fully functional. It can be deployed with
different kinds of file systems, although Btrfs is recommended for production
environments.</p>
<p>As a next step, we are planning to open source our build recipes for everyone
to use.</p>
<p>Documentation can be found in the <a href="https://github.com/skroutz/mistry/blob/master/README.md">README</a>
and in the <a href="https://github.com/skroutz/mistry/wiki">wiki</a>. Please let us know
if something is missing.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This was the story of how we spotted the opportunity for improvements in our
daily workflows and built a tool to implement them.</p>
<p>We’ve been using mistry in production for a year and we are pretty happy with
it. There are a lot of <a href="https://github.com/skroutz/mistry/issues/">features and enhancements</a> to be done yet; contributions are more than welcome.</p>
<p>We encourage you to give <a href="https://github.com/skroutz/mistry">mistry</a>
a try if you believe it might be a good fit for your projects. Feel free to open
an <a href="https://github.com/skroutz/mistry/issues/new">issue</a> for bugs, questions
or ideas.</p>
<p>We’d be happy to hear any feedback in the comments section.</p>
<p><a href="https://engineering.skroutz.gr/blog/speeding-up-build-pipelines-with-mistry/">Speeding Up Our Build Pipelines</a> was originally published by Agis Anastasopoulos at <a href="https://engineering.skroutz.gr">Skroutz Engineering</a> on August 23, 2019.</p>