Gate Failure Triage
===================

This page provides guidelines for spotting and assessing Neutron gate
failures. Some hints for triaging failures are also provided.

Spotting Gate Failures
----------------------

This can be achieved using several tools:

* `Grafana dashboard <https://grafana.opendev.org/d/f913631585/neutron-failure-rate>`_
* `OpenSearch <https://opensearch.logs.openstack.org/_dashboards/app/discover?security_tenant=global>`_

For checking gate failures with OpenSearch, please see the
`documentation <https://docs.openstack.org/project-team-guide/testing.html#checking-status-of-other-job-results>`_.
The following query will return failures for a specific job:

> build_status:FAILURE AND message:Finished AND
  build_name:"check-tempest-dsvm-neutron" AND build_queue:"gate"

The result should then be divided by the total number of jobs executed in the
same period:

> message:Finished AND build_name:"check-tempest-dsvm-neutron" AND
  build_queue:"gate"

This gives the failure rate in the selected period for a given job. It is
important to note that failures in the check queue can be misleading, as the
problem causing a failure there is most of the time in the patch being
checked. It is therefore always advisable to work on failures that occurred
in the gate queue. Nevertheless, check queue failures are a precious resource
for assessing the frequency and determining the root cause of failures which
manifest in the gate queue.

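As a minimal sketch (the hit counts here are hypothetical; the real numbers
come from the two queries above), the failure rate is simply the ratio of the
two results::

    failed=12    # hits returned by the FAILURE query
    total=180    # hits returned by the query without build_status
    echo "failure rate: $(( 100 * failed / total ))%"   # prints: failure rate: 6%
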
The step above will provide a quick outlook of where things stand. When the
failure rate rises above 10% for a job in 24 hours, it's time to be on alert.
25% is amber alert. 33% is red alert. Anything above 50% means that somebody
from the infra team probably already has a contract out on you. Whether you
are relaxed, in alert mode, or freaking out because you see a red dot on your
chest, it is always a good idea to check the elastic-recheck pages on a daily
basis.

Under the
`gate pipeline <http://status.openstack.org/elastic-recheck/gate.html>`_
tab, you can see gate failure rates for already known bugs. The bugs on this
page are ordered by decreasing failure rate (over the past 24 hours). If one
of the bugs affecting Neutron is among those at the top of that list, you
should check that the corresponding bug is already assigned and that somebody
is working on it. If not, and there is no good reason for that, make sure
somebody gets a crack at it as soon as possible. The other part of the story
is to check for `uncategorized <http://status.openstack.org/elastic-recheck/data/uncategorized.html>`_
failures. This is where failures for new (unknown) gate-breaking bugs end up;
job failures caused by infra errors also end up here. It is the duty of the
diligent Neutron developer to ensure the classification rate for Neutron jobs
is as close as possible to 100%. To this aim, the diligent Neutron developer
should adopt the procedure outlined in the following sections.

.. _troubleshooting-tempest-jobs:

Troubleshooting Tempest jobs
----------------------------

1. Open the logs for the failed job and look for logs/testr_results.html.gz.
2. If that file is missing, check console.html to see where the job failed.

   1. If there is a failure in devstack-gate-cleanup-host.txt, it is likely
      an infra issue.
   2. If the failure is in devstacklog.txt, it could be a devstack, neutron,
      or infra issue.

3. However, most of the time the failure is in one of the Tempest tests. Take
   note of the error message and go to OpenSearch.
4. On OpenSearch, search for occurrences of this error message and try to
   identify the root cause of the failure (see the example query after this
   list).
5. File a bug for this failure, and push an
   :ref:`Elastic Recheck Query <elastic-recheck-query>` for it.
6. If you are confident with the area of this bug, and you have time, assign
   it to yourself; otherwise look for an assignee or talk to Neutron's
   bug deputy to find one.

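A query of the following form can be used as a starting point; the error
message and job name below are placeholders, to be replaced with the actual
values from your failure:

> build_status:FAILURE AND message:"<your error message>" AND
  build_name:"<name of the failing job>"
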
Troubleshooting functional/fullstack jobs
-----------------------------------------

1. Go to the job link provided by Zuul CI.
2. Look at logs/testr_results.html.gz to see which particular test failed.
3. More logs from a particular test are stored at
   logs/dsvm-functional-logs/<path_of_the_test> (or dsvm-fullstack-logs
   for the fullstack job).
4. Find the error in the logs and search for similar errors in existing
   Launchpad bugs. If no bugs have been reported, create a new bug report.
   Don't forget to put a snippet of the trace into the new Launchpad bug. If
   the log file for a particular job doesn't contain any trace, pick the one
   from testr_results.html.gz. A sketch of how to search downloaded logs is
   shown after this list.
5. Create an :ref:`Elastic Recheck Query <elastic-recheck-query>`.

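If you download the job logs locally, a recursive grep is a quick way to hunt
for the trace; a minimal sketch, assuming the logs were extracted into the
current directory::

    # search all functional test logs for tracebacks and error messages
    grep -rn -E "Traceback|ERROR" dsvm-functional-logs/
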
.. _troubleshooting-grenade-jobs:

Troubleshooting Grenade jobs
----------------------------

Grenade is used in the Neutron gate to test every patch proposed to Neutron
to ensure it will not break the upgrade process.
Upgrading from the N-1 branch to the N branch is constantly being tested, so
if you send a patch to the Neutron ``master`` branch, the Grenade jobs will
first deploy Neutron from the last stable release and then upgrade it to the
master branch with your patch.
Details about how Grenade works are available in
`the documentation <https://docs.openstack.org/grenade/latest/readme.html>`_.

In the Neutron CI, jobs that use Grenade run in a multinode configuration,
which means that OpenStack is deployed on 2 VMs:

- one called ``controller``, which is an "all in one" node, so it runs
  neutron-server as well as the neutron-ovs-agent and nova-compute services,

- one called ``compute1``, which runs only services like nova-compute and
  neutron-ovs-agent.

Neutron guarantees that a neutron-server in version N will always work with
agents running version N-1. To test this scenario, all our Grenade jobs
upgrade OpenStack services only on the ``controller`` node. Services running
on the ``compute1`` node stay on the "old" release for the whole job.

Debugging failures in a Grenade job is very similar to debugging any other
Tempest based job.
The difference is that the logs of a Grenade job always contain "logs/old"
and "logs/new" directories, which hold the Devstack logs from each run of
Devstack's stack.sh script.
The "logs/grenade.sh_log.txt" file contains the full log of the grenade.sh
run, and you should always start checking failures from that file.
Logs of the Neutron services for the "old" and "new" versions are in the same
files, for example "logs/screen-q-svc.txt" for neutron-server logs.
In that log you will find the moment when the service was restarted - that is
when it was upgraded by Grenade and started running the new version.

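As a minimal sketch (assuming the job logs were downloaded locally), you can
look for the first failure reported during the upgrade run like this::

    # find the first tracebacks or errors in the grenade.sh log
    grep -n -E "Traceback|ERROR" logs/grenade.sh_log.txt | head
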
Advanced Troubleshooting of Gate Jobs
-------------------------------------

As a first step in troubleshooting a failing gate job, you should always check
the logs of the job as described above.
Unfortunately, when a tempest/functional/fullstack job is failing, it can
sometimes be hard to reproduce the failure in a local environment, and it can
also be hard to understand the reason for the failure from only reading the
logs of the failed job. In such cases there are some additional ways to debug
the job directly on the test node in a ``live`` setting.

This can be done in two ways:

1. Using the `remote_pdb <https://pypi.org/project/remote-pdb>`_ python
   module and ``telnet`` to directly access the python debugger while in the
   failed test.

   To achieve this, you need to send a ``Do not merge`` patch to gerrit with
   changes as described below:

   * Add an iptables rule to accept incoming telnet connections to remote_pdb.
     This can be done in one of the ansible roles used in the test job, for
     example in the ``neutron/roles/configure_functional_tests`` file for
     functional tests::

       sudo iptables -I openstack-INPUT -p tcp -m state --state NEW -m tcp --dport 44444 -j ACCEPT

   * Increase the ``OS_TEST_TIMEOUT`` value so that the test waits longer
     while remote_pdb is active, which makes debugging easier. This change
     can also be done in the ansible role mentioned above::

       export OS_TEST_TIMEOUT=999999

     Please note that the overall job will still be limited by the job
     timeout, and that cannot be changed from within the job.

   * To make it easier to find the IP address of the test node, you should
     make the ansible role print the IPs configured on the test node.
     For example::

       hostname -I

   * Add the package ``remote_pdb`` to the ``test-requirements.txt`` file.
     That way it will be automatically installed in the venv of the test
     before it is run::

       $ tail -1 test-requirements.txt
       remote_pdb

   * Finally, you need to import and call the remote_pdb module in the part
     of your test code where you want to start the debugger::

       $ diff --git a/neutron/tests/fullstack/test_connectivity.py b/neutron/tests/fullstack/test_connectivity.py
       index c8650b0..260207b 100644
       --- a/neutron/tests/fullstack/test_connectivity.py
       +++ b/neutron/tests/fullstack/test_connectivity.py
       @@ -189,6 +189,8 @@ class TestLinuxBridgeConnectivitySameNetwork(BaseConnectivitySameNetworkTest):
            ]

            def test_connectivity(self):
       +        import remote_pdb; remote_pdb.set_trace('0.0.0.0', port=44444)
       +
                self._test_connectivity()

   Please note that the discovery of public IP addresses is necessary because
   by default remote_pdb binds only to the ``127.0.0.1`` IP address.
   The above is just one possible method; there may be other ways to do this
   as well.

   When all the above changes are done, you must commit them and go to the
   `Zuul status page <https://zuul.openstack.org>`_ to find the status of the
   tests for your ``Do not merge`` patch. Open the console log for your job
   and wait there until ``remote_pdb`` is started.
   You then need to find the IP address of the test node in the console log.
   This is necessary to connect via ``telnet`` and start debugging. It will be
   something like::

     RemotePdb session open at 172.99.68.50:44444, waiting for connection ...

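   You can then connect to that address and port with ``telnet`` and start
   debugging, for example::

     telnet 172.99.68.50 44444
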
   An example of such a ``Do not merge`` patch as described above can be
   found at `<https://review.opendev.org/#/c/558259/>`_.

   Please note that after adding new packages to the ``test-requirements.txt``
   file, the ``requirements-check`` job for your test patch will fail, but
   that is not important for debugging.

2. If root access to the test node is necessary, for example, to check if VMs
   have really been spawned, or if router/dhcp namespaces have been configured
   properly, etc., you can ask a member of the infra team to hold the job for
   troubleshooting. You can ask someone to help with that on the
   ``openstack-infra`` IRC channel. In that case, the infra team will need to
   add your SSH key to the test node, and configure things so that if the job
   fails, the node will not be destroyed. You will then be able to SSH to it
   and debug things further. Please remember to tell the infra team when you
   finish debugging so they can unlock and destroy the node being held.

The above two solutions can be used together. For example, you should be
able to connect to the test node with both methods:

* using ``remote_pdb`` to connect via ``telnet``;
* using ``SSH`` to connect as root to the test node.

You can then ask the infra team to add your key to the specific node on
which you have already started your ``remote_pdb`` session.

Root Causing a Gate Failure
---------------------------

Time-based identification, i.e. finding the naughty patch by log scavenging.

.. _elastic-recheck-query:

Filing An Elastic Recheck Query
-------------------------------

The `elastic recheck <http://status.openstack.org/elastic-recheck/>`_ page
lists all the currently open ER queries.
To file one, please see the `ER Wiki <https://wiki.openstack.org/wiki/ElasticRecheck>`_.