TestingOnBots.md

Branch

Show log
Commit

Hash : ce2bc8e8
Author :
Date : 2024-05-31T17:29:24

Docs: Add a little guide for repro'ing failures on bots

Bug: angleproject:344019996
Change-Id: I89168b1ebdd62d2ac5775c300faeda6a8c06a0a7
Reviewed-on: https://chromium-review.googlesource.com/c/angle/angle/+/5587906
Reviewed-by: Yuly Novikov <ynovikov@chromium.org>
Reviewed-by: Amirali Abdolrashidi <abdolrashidi@google.com>

Download

Testing on bots

Sometimes a failure happens consistently (or flakily) on bots but is difficult to reproduce locally due to a different platform, driver version, etc. The same build can be triggered on a matching bot with additional arguments using the following steps. Triggering swarming tasks from a local build can also sometimes be useful (see scripts/trigger.py).

Navigate to the shard

Test shard failure

Note the task dimensions as well as the “CAS inputs” identifier

Swarming task info

Task dimensions is a filter that limits which bots in the swarming pool will pick up the task. For example, here we can limit to the same OS and GPU with -d os=Windows-10 and -d gpu=8086:9bc5-31.0.101.2127 (note: this numeric GPU id encodes both the vendor and the specific driver version)

Find additional args required to repro

The failure may or may not repro in isolation. Usually the test log will contain --gtest_filter= with the batch that was used when the failure occurred:

Test batch failure

If not reproducible in isolation, it’s usually easiest to start with the same batch to confirm the failure is reproduced and then trim the list down.
Sometimes additional args are required (can be found in logs of the original task).

Triggering a swarming task

You can trigger swarming tasks directly using tools/luci-go/swarming trigger from an ANGLE checkout.

ACLs: ANGLE realm (e.g. angle:try) is guarded by https://chrome-infra-auth.appspot.com/auth/groups/project-angle-owners. If that shows PermissionDenied, you could also try chromium:try.

For example, trigger that reproduced the failure in the example above - filter had to include the failing test and the test that ran right before it:

% tools/luci-go/swarming trigger \
  -digest=e11fb5a14596dce84e86a4776d65c5da26acda8e5b04257988cf2fa8ac4c5630/399 \
  -realm angle:try \
  -priority=20 \
  -server=https://chromium-swarm.appspot.com \
  -d os=Windows-10 \
  -d pool=chromium.tests.gpu \
  -d cpu=x86-64 \
  -d gpu=8086:9bc5-31.0.101.2127 \
  -service-account=chromium-tester@chops-service-accounts.iam.gserviceaccount.com \
  -env=ISOLATED_OUTDIR=\${ISOLATED_OUTDIR} \
  -relative-cwd=out/Release_x64 \
  -- vpython3 ../../testing/test_env.py \
  ./angle_end2end_tests.exe \
  --isolated-script-test-output=\${ISOLATED_OUTDIR}/output.json \
  --gtest_filter=EGLDisplayTest.InitializeMultipleTimesInDifferentThreads/ES2_D3D11_NoFixture:EGLPresentPathD3D11.ClientBufferPresentPathFast/ES2_D3D11_NoFixture

Additional notes:

It occasionally matters that bots run with --test-launcher-bot-mode - this sets mBotMode=true in ANGLE harness and enables running multiple windows in parallel (however, on some bots multi-processing is deactivated due to flakes, in which case you can find --max-processes arg in the logs). Naturally, multiple windows are not supported on Android bots. If the failure you’re investigating is on a platform which runs with multi-processing, you might want to try experimenting with these flags.
See scripts/trigger.py for triggering tasks from local builds - it first produces a CAS digest and then triggers a task using swarming trigger similar to the command above.
CAS digests from bot builds can also be useful, e.g. by uploading a change and triggering a builder to get a build on a platform you may not have access to, or by taking a digest of a previous CI failure.
-relative-cwd and binary can be figured out by clicking on “CAS inputs” and inspecting out; this can also be found in task logs.

Source

doc/TestingOnBots.md

# Testing on bots

Sometimes a failure happens consistently (or flakily) on bots but is difficult to reproduce locally
due to a different platform, driver version, etc. The same build can be triggered on a matching bot
with additional arguments using the following steps. Triggering swarming tasks from a local build
can also sometimes be useful (see [`scripts/trigger.py`](../scripts/trigger.py)).

## Navigate to the shard

![Test shard failure](img/TestShardFailure.png)

## Note the task dimensions as well as the "CAS inputs" identifier

![Swarming task info](img/SwarmingTaskInfo.png)

* Task dimensions is a filter that limits which bots in the swarming pool will pick up the task. For
  example, here we can limit to the same OS and GPU with `-d os=Windows-10` and
  `-d gpu=8086:9bc5-31.0.101.2127` (note: this numeric GPU id encodes both the vendor and the
  specific driver version)

## Find additional args required to repro

The failure may or may not repro in isolation. Usually the test log will contain `--gtest_filter=`
with the batch that was used when the failure occurred:

![Test batch failure](img/TestBatchFailure.png)

* If not reproducible in isolation, it's usually easiest to start with the same batch to confirm the
  failure is reproduced and then trim the list down.

* Sometimes additional args are required (can be found in logs of the original task).

## Triggering a swarming task

You can trigger swarming tasks directly using `tools/luci-go/swarming trigger` from an ANGLE
checkout.

ACLs: ANGLE realm (e.g. `angle:try`) is guarded by
https://chrome-infra-auth.appspot.com/auth/groups/project-angle-owners. If that shows
`PermissionDenied`, you could also try `chromium:try`.

For example, trigger that reproduced the failure in the example above - filter had to include the
failing test and the test that ran right before it:

```
% tools/luci-go/swarming trigger \
  -digest=e11fb5a14596dce84e86a4776d65c5da26acda8e5b04257988cf2fa8ac4c5630/399 \
  -realm angle:try \
  -priority=20 \
  -server=https://chromium-swarm.appspot.com \
  -d os=Windows-10 \
  -d pool=chromium.tests.gpu \
  -d cpu=x86-64 \
  -d gpu=8086:9bc5-31.0.101.2127 \
  -service-account=chromium-tester@chops-service-accounts.iam.gserviceaccount.com \
  -env=ISOLATED_OUTDIR=\${ISOLATED_OUTDIR} \
  -relative-cwd=out/Release_x64 \
  -- vpython3 ../../testing/test_env.py \
  ./angle_end2end_tests.exe \
  --isolated-script-test-output=\${ISOLATED_OUTDIR}/output.json \
  --gtest_filter=EGLDisplayTest.InitializeMultipleTimesInDifferentThreads/ES2_D3D11_NoFixture:EGLPresentPathD3D11.ClientBufferPresentPathFast/ES2_D3D11_NoFixture
```

Additional notes:

* It occasionally matters that bots run with `--test-launcher-bot-mode` - this sets `mBotMode=true`
  in ANGLE harness and enables running multiple windows in parallel (however, on some bots
  multi-processing is deactivated due to flakes, in which case you can find `--max-processes` arg
  in the logs). Naturally, multiple windows are not supported on Android bots. If the failure
  you're investigating is on a platform which runs with multi-processing, you might want to try
  experimenting with these flags.

* See [`scripts/trigger.py`](../scripts/trigger.py) for triggering tasks from local builds - it
  first produces a CAS digest and then triggers a task using swarming trigger similar to the
  command above.

* CAS digests from bot builds can also be useful, e.g. by uploading a change and triggering a
  builder to get a build on a platform you may not have access to, or by taking a digest of a
  previous CI failure.

* `-relative-cwd` and binary can be figured out by clicking on "CAS inputs" and inspecting `out`;
  this can also be found in task logs.