Branch
Hash :
ce2bc8e8
Author :
Date :
2024-05-31T17:29:24
Docs: Add a little guide for repro'ing failures on bots Bug: angleproject:344019996 Change-Id: I89168b1ebdd62d2ac5775c300faeda6a8c06a0a7 Reviewed-on: https://chromium-review.googlesource.com/c/angle/angle/+/5587906 Reviewed-by: Yuly Novikov <ynovikov@chromium.org> Reviewed-by: Amirali Abdolrashidi <abdolrashidi@google.com>
Sometimes a failure happens consistently (or flakily) on bots but is difficult to reproduce locally
due to a different platform, driver version, etc. The same build can be triggered on a matching bot
with additional arguments using the following steps. Triggering swarming tasks from a local build
can also sometimes be useful (see scripts/trigger.py).
-d os=Windows-10 and
-d gpu=8086:9bc5-31.0.101.2127 (note: this numeric GPU id encodes both the vendor and the
specific driver version)
The failure may or may not repro in isolation. Usually the test log will contain --gtest_filter=
with the batch that was used when the failure occurred:
If not reproducible in isolation, it’s usually easiest to start with the same batch to confirm the failure is reproduced and then trim the list down.
Sometimes additional args are required (can be found in logs of the original task).
You can trigger swarming tasks directly using tools/luci-go/swarming trigger from an ANGLE
checkout.
ACLs: ANGLE realm (e.g. angle:try) is guarded by
https://chrome-infra-auth.appspot.com/auth/groups/project-angle-owners. If that shows
PermissionDenied, you could also try chromium:try.
For example, trigger that reproduced the failure in the example above - filter had to include the failing test and the test that ran right before it:
% tools/luci-go/swarming trigger \
-digest=e11fb5a14596dce84e86a4776d65c5da26acda8e5b04257988cf2fa8ac4c5630/399 \
-realm angle:try \
-priority=20 \
-server=https://chromium-swarm.appspot.com \
-d os=Windows-10 \
-d pool=chromium.tests.gpu \
-d cpu=x86-64 \
-d gpu=8086:9bc5-31.0.101.2127 \
-service-account=chromium-tester@chops-service-accounts.iam.gserviceaccount.com \
-env=ISOLATED_OUTDIR=\${ISOLATED_OUTDIR} \
-relative-cwd=out/Release_x64 \
-- vpython3 ../../testing/test_env.py \
./angle_end2end_tests.exe \
--isolated-script-test-output=\${ISOLATED_OUTDIR}/output.json \
--gtest_filter=EGLDisplayTest.InitializeMultipleTimesInDifferentThreads/ES2_D3D11_NoFixture:EGLPresentPathD3D11.ClientBufferPresentPathFast/ES2_D3D11_NoFixture
Additional notes:
It occasionally matters that bots run with --test-launcher-bot-mode - this sets mBotMode=true
in ANGLE harness and enables running multiple windows in parallel (however, on some bots
multi-processing is deactivated due to flakes, in which case you can find --max-processes arg
in the logs). Naturally, multiple windows are not supported on Android bots. If the failure
you’re investigating is on a platform which runs with multi-processing, you might want to try
experimenting with these flags.
See scripts/trigger.py for triggering tasks from local builds - it
first produces a CAS digest and then triggers a task using swarming trigger similar to the
command above.
CAS digests from bot builds can also be useful, e.g. by uploading a change and triggering a builder to get a build on a platform you may not have access to, or by taking a digest of a previous CI failure.
-relative-cwd and binary can be figured out by clicking on “CAS inputs” and inspecting out;
this can also be found in task logs.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82
# Testing on bots
Sometimes a failure happens consistently (or flakily) on bots but is difficult to reproduce locally
due to a different platform, driver version, etc. The same build can be triggered on a matching bot
with additional arguments using the following steps. Triggering swarming tasks from a local build
can also sometimes be useful (see [`scripts/trigger.py`](../scripts/trigger.py)).
## Navigate to the shard

## Note the task dimensions as well as the "CAS inputs" identifier

* Task dimensions is a filter that limits which bots in the swarming pool will pick up the task. For
example, here we can limit to the same OS and GPU with `-d os=Windows-10` and
`-d gpu=8086:9bc5-31.0.101.2127` (note: this numeric GPU id encodes both the vendor and the
specific driver version)
## Find additional args required to repro
The failure may or may not repro in isolation. Usually the test log will contain `--gtest_filter=`
with the batch that was used when the failure occurred:

* If not reproducible in isolation, it's usually easiest to start with the same batch to confirm the
failure is reproduced and then trim the list down.
* Sometimes additional args are required (can be found in logs of the original task).
## Triggering a swarming task
You can trigger swarming tasks directly using `tools/luci-go/swarming trigger` from an ANGLE
checkout.
ACLs: ANGLE realm (e.g. `angle:try`) is guarded by
https://chrome-infra-auth.appspot.com/auth/groups/project-angle-owners. If that shows
`PermissionDenied`, you could also try `chromium:try`.
For example, trigger that reproduced the failure in the example above - filter had to include the
failing test and the test that ran right before it:
```
% tools/luci-go/swarming trigger \
-digest=e11fb5a14596dce84e86a4776d65c5da26acda8e5b04257988cf2fa8ac4c5630/399 \
-realm angle:try \
-priority=20 \
-server=https://chromium-swarm.appspot.com \
-d os=Windows-10 \
-d pool=chromium.tests.gpu \
-d cpu=x86-64 \
-d gpu=8086:9bc5-31.0.101.2127 \
-service-account=chromium-tester@chops-service-accounts.iam.gserviceaccount.com \
-env=ISOLATED_OUTDIR=\${ISOLATED_OUTDIR} \
-relative-cwd=out/Release_x64 \
-- vpython3 ../../testing/test_env.py \
./angle_end2end_tests.exe \
--isolated-script-test-output=\${ISOLATED_OUTDIR}/output.json \
--gtest_filter=EGLDisplayTest.InitializeMultipleTimesInDifferentThreads/ES2_D3D11_NoFixture:EGLPresentPathD3D11.ClientBufferPresentPathFast/ES2_D3D11_NoFixture
```
Additional notes:
* It occasionally matters that bots run with `--test-launcher-bot-mode` - this sets `mBotMode=true`
in ANGLE harness and enables running multiple windows in parallel (however, on some bots
multi-processing is deactivated due to flakes, in which case you can find `--max-processes` arg
in the logs). Naturally, multiple windows are not supported on Android bots. If the failure
you're investigating is on a platform which runs with multi-processing, you might want to try
experimenting with these flags.
* See [`scripts/trigger.py`](../scripts/trigger.py) for triggering tasks from local builds - it
first produces a CAS digest and then triggers a task using swarming trigger similar to the
command above.
* CAS digests from bot builds can also be useful, e.g. by uploading a change and triggering a
builder to get a build on a platform you may not have access to, or by taking a digest of a
previous CI failure.
* `-relative-cwd` and binary can be figured out by clicking on "CAS inputs" and inspecting `out`;
this can also be found in task logs.