Observability tools

Warning

This feature is experimental, and could have breaking changes or even be removed without notice. Try it out, let us know what you think, but don’t rely on it just yet!

Motivation

Understanding what your code is doing - for example, why your test failed - is often a frustrating exercise in adding some more instrumentation or logging (or print() calls) and running it again. The idea of observability is to let you answer questions you didn’t think of in advance. In slogan form,

Debugging should be a data analysis problem.

By default, Hypothesis only reports the minimal failing example… but sometimes you might want to know something about all the examples. Printing them to the terminal with verbose output might be nice, but isn’t always enough. This feature gives you an analysis-ready dataframe with one row per test case and columns ranging from arguments to code coverage to pass/fail status.

This is deliberately a much lighter-weight and task-specific system than e.g. OpenTelemetry. It’s also less detailed than time-travel debuggers such as rr or pytrace, because there’s no good way to compare multiple traces from these tools and their Python support is relatively immature.

Configuration

If you set the HYPOTHESIS_EXPERIMENTAL_OBSERVABILITY environment variable, Hypothesis will log various observations to jsonlines files in the .hypothesis/observed/ directory. You can load and explore these with e.g. pd.read_json(path, lines=True) for each file matching .hypothesis/observed/*_testcases.jsonl (pandas does not expand the glob pattern itself), or by using the sqlite-utils and datasette libraries:

sqlite-utils insert testcases.db testcases .hypothesis/observed/*_testcases.jsonl --nl --flatten
datasette serve testcases.db
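If you would rather not install anything extra, the same files can be read with the standard library alone. The following is a minimal sketch, assuming the default output directory described above; the helper name load_observations is hypothetical and not part of Hypothesis.

```python
import glob
import json


def load_observations(pattern=".hypothesis/observed/*_testcases.jsonl"):
    """Return every observation found in files matching `pattern`, as dicts."""
    observations = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():  # skip blank lines
                    observations.append(json.loads(line))
    return observations


# Keep only per-test-case rows; an empty list means nothing has been recorded yet.
test_cases = [o for o in load_observations() if o.get("type") == "test_case"]
print(f"loaded {len(test_cases)} test cases")
```

The resulting list of dicts can be passed to whatever analysis tool you prefer.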

If you are experiencing a significant slow-down, you can try setting HYPOTHESIS_EXPERIMENTAL_OBSERVABILITY_NOCOVER instead; this will disable coverage information collection. This should not be necessary on Python 3.12 or later.

Collecting more information

If you want to record more information about your test cases than the arguments and outcome - for example, was x a binary tree? what was the difference between the expected and the actual value? how many queries did it take to find a solution? - Hypothesis makes this easy.

event() accepts a string label, and optionally a string, int, or float observation associated with it. All events are collected and summarized in Test statistics, as well as included on a per-test-case basis in our observations.

target() is a special case of numeric-valued events: as well as recording them in observations, Hypothesis will try to maximize the targeted value. Knowing that, you can use this to guide the search for failing inputs.

Data Format

We dump observations in JSON Lines format, with each line describing either a test case or an information message. The tables below are derived from this machine-readable JSON schema, to provide both readable and verifiable specifications.

Note that we use json.dumps() and can therefore emit non-standard JSON which includes infinities and NaN. This is valid in JSON5, and supported by some JSON parsers including Gson in Java, JSON.parse() in Ruby, and of course in Python.
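The behaviour described above can be reproduced with the standard library. This sketch uses a hypothetical observation fragment; reject_constant and is_strict_json are illustrative helpers, built on the parse_constant parameter of json.loads(), for checking whether a line would survive a strict JSON parser.

```python
import json
import math

# Hypothetical observation fragment containing non-finite floats.
line = json.dumps({"features": {"score": float("inf"), "ratio": float("nan")}})
print(line)

# Python's own parser accepts these constants by default.
record = json.loads(line)
assert record["features"]["score"] == math.inf
assert math.isnan(record["features"]["ratio"])


def reject_constant(name):
    """Raise for NaN, Infinity, or -Infinity, as a strict JSON parser would."""
    raise ValueError(f"non-standard JSON constant: {name}")


def is_strict_json(text):
    """Return True if `text` parses without any non-standard constants."""
    try:
        json.loads(text, parse_constant=reject_constant)
    except ValueError:
        return False
    return True
```

A check like is_strict_json() can help decide whether a line needs preprocessing before it is handed to a tool that only accepts standard JSON.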

Test case

Describes the inputs to and result of running some test function on a particular input. The test might have passed, failed, or been abandoned part way through (e.g. because we failed a `.filter()` condition).
properties
type	A tag which labels this observation as data about a specific test case.
	const	test_case
status	Whether the test passed, failed, or was aborted before completion (e.g. due to use of `.filter()`). Note that if we gave_up partway, values such as arguments and features may be incomplete.
	enum	passed, failed, gave_up
status_reason	If non-empty, the reason for which the test failed or was abandoned. For Hypothesis, this is usually the exception type and location.
	type	string
representation	The string representation of the input.
	type	string
arguments	A structured json-encoded representation of the input. Hypothesis provides a dictionary of argument names to json-ified values, including interactive draws from the `data()` strategy. If ‘status’ is ‘gave_up’, this may be absent or incomplete. In other libraries this can be any object.
	type	object
how_generated	How the input was generated, if known. In Hypothesis this might be an explicit example, generated during a particular phase with some backend, or by replaying the minimal failing example.
	type	string / null
features	Runtime observations which might help explain what this test case did. Hypothesis includes target() scores, tags from event(), and so on.
	type	object
coverage	Mapping of filename to list of covered line numbers, if coverage information is available, or null if not. Hypothesis deliberately omits stdlib and site-packages code.
	type	object / null
	additionalProperties	type	array
		items	type	integer
			minimum	1
		uniqueItems	True
timing	The time in seconds taken by non-overlapping parts of this test case. Hypothesis reports execute:test, and generate:{argname} for each argument.
	type	object
	additionalProperties	type	number
		minimum	0
metadata	Arbitrary metadata which might be of interest, but does not semantically fit in ‘features’. For example, Hypothesis includes the traceback for failing tests here.
	type	object
property	The name or representation of the test function we’re running.
	type	string
run_start	Unix timestamp at which we started running this test function, so that later analysis can group test cases by run.
	type	number
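To show how these properties support the "debugging as data analysis" idea, here is a small sketch using only the standard library. The three records are hypothetical and trimmed to the fields used; real lines carry every property listed in the table above. The helper covered_lines is likewise made up for this example.

```python
from collections import Counter

# Hypothetical records, trimmed to the fields this example needs.
test_cases = [
    {"type": "test_case", "status": "passed", "property": "test_a",
     "coverage": {"mod.py": [1, 2, 3]},
     "timing": {"execute:test": 0.002, "generate:x": 0.001}},
    {"type": "test_case", "status": "failed", "property": "test_a",
     "coverage": {"mod.py": [1, 2, 5]},
     "timing": {"execute:test": 0.004, "generate:x": 0.001}},
    {"type": "test_case", "status": "gave_up", "property": "test_a",
     "coverage": None,
     "timing": {"generate:x": 0.001}},
]

# How many test cases ended in each status?
status_counts = Counter(tc["status"] for tc in test_cases)


def covered_lines(cases, status):
    """Set of (filename, line) pairs covered by test cases with this status."""
    lines = set()
    for tc in cases:
        if tc["status"] == status and tc["coverage"]:  # coverage may be null
            for filename, linenos in tc["coverage"].items():
                lines.update((filename, n) for n in linenos)
    return lines


# Lines executed by failing test cases but never by passing ones.
only_failing = covered_lines(test_cases, "failed") - covered_lines(test_cases, "passed")

# Total time across all recorded parts of all test cases, in seconds.
total_seconds = sum(sum(tc["timing"].values()) for tc in test_cases)

print(status_counts, only_failing, total_seconds)
```

Lines covered only by failing test cases are often a useful first place to look when investigating a failure.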

Information message

Info, alert, and error messages correspond to a group of test cases or the overall run, and are intended for humans rather than machine analysis.
properties
type	A tag which labels this observation as general information to show the user. Hypothesis uses info messages to report statistics; alert or error messages can be provided by plugins.
	enum	info, alert, error
title	The title of this message.
	type	string
content	The body of the message. May use markdown.
	type	string
property	The name or representation of the test function we’re running. For Hypothesis, usually the Pytest nodeid.
	type	string
run_start	Unix timestamp at which we started running this test function, so that later analysis can group test cases by run.
	type	number