State of Performance Test Engineering (H1/2019)

Late in 2018 I stepped out of the familiar position of automation engineer, and into the unknown as an engineering manager. A new team was formed for me to manage, focusing on performance test engineering. Now here we are, just over six months in, and I’m excited to share some updates!

Team

The team consists of 10 engineers based in California, Toronto, Montreal, and Romania.

The following team members joined us in H1/2019:

  • Alexandru Ionescu
  • Marian Raicof
  • Alexandru Irimovici
  • Arnold Iakab
  • Ken Rubin

We also welcomed Greg Mierzwinski, who converted from contractor to full-time employee.

Despite some early signs of interest, there were no community contributions in this period.

Tests

We currently support 4 frameworks:

  • Are We Slim Yet (AWSY) – memory consumption by the browser
  • build_metrics – build times, installer size, and other compiler-specific insights
  • Raptor – page load and browser benchmarks
  • Talos – various time-based performance KPIs on the browser

At the time of writing, there are 263 test suites (query) for the above frameworks.

The following are some highlights of tests introduced in H1/2019:

  • Raptor power tests for Android:
    • Idle foreground
    • Idle background
  • Raptor page load tests for Android:
    • Expanded from 4 sites to 26
  • Raptor cold page load tests for Android and desktop
  • Raptor page load tests for Reference Browser and Firefox Preview

Hardware

Our tests are running on the following hardware in continuous integration:

  • 35 Moto G5 devices (with two of these dedicated to power testing)
  • 38 Pixel 2 devices (with two of these dedicated to power testing)
  • 16 Windows UX laptops (2017 reference hardware)
  • 35 Windows ARM64 laptops (Lenovo Yoga C630)

Sheriffs

We have 2 performance sheriffs dedicating up to 50% of their time to this role. Outside of this, they assist with improving our tools and test harnesses. We also have 2 sheriffs in training to meet the needs of the additional tests we’re adding, and to provide support when needed.

Q1/2019

During the first quarter of 2019 our performance tools generated 863 alert summaries. This is an average of 9.7 alert summaries every day. Of the four test frameworks, build_metrics caused the most alert summaries, accounting for 35.1% of the total. The raptor framework was the second biggest contributor, with 29.9%.
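As an aside on the arithmetic, figures like these are easy to reproduce from the totals. The short Python sketch below is illustrative only: the 89-day window and the per-framework counts backed out of the stated percentages are my assumptions, not numbers pulled from Perfherder.

```python
total_alerts = 863   # Q1/2019 alert summaries (figure from this report)
days = 89            # assumed length of the reporting window

print(f"{total_alerts / days:.1f} alert summaries per day")  # -> 9.7

# The post only gives percentages, so the counts here are backed out
# of those shares purely for illustration.
shares = {"build_metrics": 0.351, "raptor": 0.299}
for framework, share in shares.items():
    print(f"{framework}: ~{share * total_alerts:.0f} alert summaries ({share:.1%})")
```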

"Alert

Of the alert summaries generated, 106 (12.3%) were improvements. 13 alerts associated with regressions were fixed, and 9 were backed out. The remaining alerts were reassigned, marked as downstream, marked as invalid, accepted, or are still under investigation.

Here are some highlights for some of our sheriffed frameworks:

Are We Slim Yet (AWSY)

  • On February 15th we detected up to 20.66% improvement in AWSY. This was caused by Paul Bone’s patch on bug 1433007, which allowed the nursery to use less than a single chunk.
  • The largest fixed regression detected by AWSY was noticed on March 20th, with a regression of up to 9.53% in heap unclassified. This was attributed to bug 1429796 and was fixed a week later by Myk Melez.
  • There are no open regressions for AWSY for this period!

Raptor

  • The improvement alert with the highest magnitude for Raptor was created on March 25th, which saw up to a massive 41.04% improvement to page load time on Android. The gains were attributed to bug 1537964, which raised the importance of services launched from the main process.
  • The largest regression alert that was ultimately fixed for Raptor was generated early in the quarter, on January 3rd. This showed up to a 120.65% regression to the assorted-dom benchmark, caused by bug 1467124. A fix was quickly identified by the author, Jan de Mooij, and pushed within 24 hours of the notification from our sheriff.
  • Our biggest regression that remains unresolved was created on January 23rd and includes a 20.38% page load regression for Reddit on desktop, which appears to have been caused by bug 1485216. Due to excessive noise, we have since disabled the Time to First Interactive (TTFI) measurement, which is the metric that caused this regression. This will make confirming any fix here more difficult.

Talos

  • For Talos, the alert showing the largest improvement was created on January 9th, which showed up to 38.14% improvement to tsvg_static, tsvgr_opacity, and tsvgx. It was attributed to bug 1518773, which was a WebRender update.
  • The largest fixed regression alert for Talos was spotted on January 23rd, and includes a 26.73% hit to tscrollx and tp5o_scroll. It was caused by bug 1522028, and was fixed by Glenn Watson via bug 1523210 within a couple of days of being notified by our sheriffs.
  • The worst open regression alert for Talos was created on February 5th, and contains a 28.66% slow down to the about-preferences test. It looks like this has stalled with a patch to enable ASAP mode for the test, which caused failures and was subsequently backed out. According to Mike Conley in this comment, “ASAP mode causes us to paint ASAP after the DOM has been dirtied, without waiting for vsync”.

Q2/2019

In the second quarter, our performance tools generated 831 alert summaries. This is an average of 9.23 alert summaries every day. Of the four test frameworks, raptor caused the most alert summaries, accounting for 42.2% of the total. The build_metrics framework was the second biggest contributor, with 32.7%. Additional tests landing for Raptor, new page recordings, and a focus on improving page load performance are all likely reasons for the increase in Raptor alerts this period.

"Alert

Of the alert summaries generated, 98 (11.79%) were improvements. 19 alerts associated with regressions were fixed, and 6 were backed out. The remaining alerts were reassigned, marked as downstream, marked as invalid, accepted, or are still under investigation.

In Q1 we introduced a first_triaged timestamp in Perfherder (bug 1521025), which allows us to track how much time passes between the alert generation and the initial triage from a performance sheriff. Using this, we can see that our perf sheriffs triaged 56.1% of alerts within a day, and an additional 24.7% within three days.

"Days

In Q2 we introduced a bug_updated timestamp in Perfherder (bug 1536040) to more closely track the time between the alert and the correct bug being identified, which is where we hand over to the regression author and test owners. The data doesn’t cover all of Q2, but so far we see that 63.3% of alerts complete this step within 3 days, and 87.4% complete it within 1 week.
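To illustrate how percentages like these can be derived from the first_triaged and bug_updated timestamps, here is a minimal sketch. The alert records below are made-up stand-ins, and the real Perfherder models and API are more involved than this.

```python
from datetime import datetime, timedelta

# Simplified stand-in for Perfherder alert summaries: each record carries the
# creation time plus the first_triaged and bug_updated timestamps (hypothetical data).
alerts = [
    {"created": datetime(2019, 4, 1, 9, 0),
     "first_triaged": datetime(2019, 4, 1, 15, 30),
     "bug_updated": datetime(2019, 4, 3, 11, 0)},
    {"created": datetime(2019, 4, 2, 10, 0),
     "first_triaged": datetime(2019, 4, 4, 8, 0),
     "bug_updated": datetime(2019, 4, 10, 16, 0)},
]

def share_within(alerts, field, limit):
    """Fraction of alerts where `field` was set within `limit` of creation."""
    done = [a for a in alerts if a.get(field) is not None]
    within = [a for a in done if a[field] - a["created"] <= limit]
    return len(within) / len(done) if done else 0.0

print(f"Triaged within 1 day:  {share_within(alerts, 'first_triaged', timedelta(days=1)):.1%}")
print(f"Triaged within 3 days: {share_within(alerts, 'first_triaged', timedelta(days=3)):.1%}")
print(f"Bug identified within 1 week: {share_within(alerts, 'bug_updated', timedelta(weeks=1)):.1%}")
```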

"Days

Here are some highlights for some of our sheriffed frameworks:

Are We Slim Yet (AWSY)

  • The largest improvement noticed for AWSY was on June 9th, where we saw up to a 5.98% decrease in heap unclassified and base content js. This was attributed to a patch from Emilio Cobos Álvarez in bug .
  • On May 30th, a 9.54% regression was noticed across several tests, caused by bug 1493420. Fortunately, this was promptly followed by a fix from Emilio Cobos Álvarez.
  • The open regression with the highest impact for AWSY was detected on June 12th. Whilst there were a large number of improvements in this alert, the 12.67% regression to images was also significant. The regression was caused by bug 1558763, which changed the value of a preference within Marionette. It looks like this may have been fixed, but our performance sheriffs have yet to verify this.

Raptor

  • The most significant improvement detected by Raptor was on May 22nd, where we saw a boost of up to 20.42% to page load across several sites, thanks to Andrew Creskey’s investigation and Michal Novotny’s patch in bug 1553166 to make disk cache sizing less restrictive on mobile.
  • On April 30th, a regression alert of 52.62% was reported against the Raptor page load test for bing.com on desktop. It turned out that bug 1514655 caused the unexpected regression, and Emilio Cobos Álvarez was able to post a fix within a day of the regression report!
  • Due to many page recordings being recreated, there are a lot of open alerts that require closer examination.

Talos

  • On April 13th we detected an improvement of up to 99.24% to the tp5n nonmain_startup_fileio test. This was thanks to Doug Thayer’s work on optimising prefetching of Windows DLLs in bug 1538279.
  • Talos detected a 3.78% regression to tsvgr_opacity on March 31st, which was caused by bug 1539026. After an initial fix was backed out, Lee Salzman was able to fix the regression.
  • The open regression alert with the highest magnitude was opened on April 29th, and reports a 38.25% regression in the average frame interval while animating a simple WebGL scene. It was discovered by the glterrain test, and attributed to bug 1547775. It does look like this test has returned to its pre-regression values, but we have yet to confirm whether this was expected or whether the regression remains.

Survey

I’d really like to find out how our team is doing, and hear if you have any feedback or suggestions for any way we can improve. Please take a couple of minutes to complete the very simple satisfaction survey I have created. Thanks!
