Chapter 5 Testing

When we write programs, we want to make sure they work correctly. One way to do this is by running the program ourselves and trying different input data. This is called manual testing. If we just test by hand using random examples we think of, that’s called ad hoc manual testing. This can work for small programs, but it’s easy to miss important cases or forget what we’ve already tested. To test our programs more carefully, we can use a systematic approach. We plan specific test scenarios—a list of input data and the results we expect—and try them one by one.

As our programs get bigger, repeating all these tests by hand takes time and effort. To make this easier, we can write code that tests our code. This is called automated testing. With automated tests, the computer runs all the test cases for us and shows if something goes wrong. This helps us make changes with confidence, knowing our tests will catch any mistakes.

Some programmers go one step further and write the tests before writing the code. This approach is called Test-Driven Development (TDD). First, they write a test that fails, then write just enough code to make it pass. Each step adds one test case and one small improvement or extension to the code. This helps them stay focused and build reliable programs, one piece at a time.

Testing not only increases the likelihood that our programs are functioning correctly, but it also provides a safety net for enhancing other aspects of our software. When we refactor our code to make it more readable or efficient, we’re always taking a risk that we might introduce new bugs. However, with a robust set of tests in place, we can refactor with confidence. We can clean up our code, knowing that our tests will verify that the program’s behavior remains consistent.

Moreover, as we embrace new technologies like generative artificial intelligence (GenAI, AI) in the programming process, testing becomes even more crucial. It acts as a safeguard, ensuring that the programs generated by AI align with our expectations. We do this by crafting an executable specification in the form of tests. These tests serve as a blueprint that the generated code must satisfy, providing us with immediate feedback on whether the AI-generated solutions are adequate.

5.1 Test cases

When we want to test our program, we need a way to check whether our code is doing what we expect. This is where test cases come in. A test case is a single check we use to see if our code works correctly. It’s made up of input data (what we give the program) and the expected output (what we think the program should produce for that input data).

Imagine we’ve defined a simple function in Python to check if a person is an adult:

def is_adult(age):
    return age >= 18

To ensure our is_adult() function is performing correctly, we need to test it. Python provides a handy tool for this called the doctest module. Doctest allows us to write test cases that mimic the interactive Python shell. A test case is essentially an example of how we expect our function to behave. We start a test case with >>>, which is the Python prompt, followed by the function call. On the next line, we specify the expected result.

Here’s how we can create a test case for our is_adult() function. We want to verify that calling is_adult(50) indeed returns True.

"""
>>> is_adult(50)
True
"""

We typically place these test cases within the function’s docstring, which is the string between the triple quotes right below the function definition. This keeps our tests close to the function they are testing, making it more convenient for anyone reading or updating the code in the future.

Here’s the is_adult() function with the test case included in its docstring:

def is_adult(age):        # the function's signature
    """                   # the beginning of the docstring/doctest string 
    >>> is_adult(50)      # the test case (this and the next line)
    True                 
    """                   # the end of the docstring/doctest string 
    return age >= 18      # the function's body 

It’s worth mentioning that documentation strings (docstrings) and the test cases inside them (doctests) are not actually part of the program’s code that runs. Instead, Python (or certain tools) reads them when you ask it to, for example, when generating documentation or running tests. In your editor, they look like long strings – and that’s exactly what they are – so they usually show up in the same color as regular string literals. This helps you tell them apart from the main code that actually executes.

When we run a test case, we compare the actual result from the program to the expected result. If they match, the test passes. If not, the test fails, and we know there’s a problem to fix. To better understand how test cases work, let’s look at a couple of simple examples. The example test above passes because the actual result matches the expected one.

But let’s explore scenarios where a test fails due to an error in our program or in our test specification.

  • We expect is_adult(18) to give us True, but our function is returning False. In this case, the actual result (False) is different from the expected result (True), so the test case fails. Since we are sure that what we expect is correct, we know there’s a mistake in our function that we need to fix.
  • But what if we expect is_adult(18) to be False? In this case, our function might be working correctly and returning True, but the test case still fails. This is because we made a mistake in the expected output.

Let’s summarize these test cases in a table:

Test case          A                 B                 C
To be executed     is_adult(50)      is_adult(18)      is_adult(18)
Expected result    True              True              False (mistake)
Obtained result    True              False (mistake)   True
Test case state    passed/OK         failed            failed

Put all the pieces together and run them in your IDE, such as PyCharm, to see the test cases in action and better understand how they work. If this feels too difficult at this stage, don’t worry – you can simply copy the provided examples and run them to see what happens. Watching how the tests behave is a great first step toward learning how testing works.

Solution. To run our example, we need the function along with our example test case:

def is_adult(age):
    """
    >>> is_adult(50)
    True  
    """ 
    return age >= 18

Additionally, we have to import the doctest module and activate test mode:

if __name__ == "__main__":
    import doctest
    doctest.testmod(verbose=True)

When you execute this program, you should get the following output in the terminal or a visual representation of it in an IDE.

## Finding tests in Example A
## Trying:
##     is_adult(50)
## Expecting:
##     True
## ok

Solution. Let us assume that there is a mistake in our function.


def is_adult(age):
    """
    >>> is_adult(18)
    True  
    """ 
    return age > 18  # <-- mistake

if __name__ == "__main__":
    import doctest
    doctest.testmod(verbose=True)  

Now, the same test case will fail.

## Finding tests in Example B
## Trying:
##     is_adult(18)
## Expecting:
##     True
## **********************************************************************
## Line 2, in Example B
## Failed example:
##     is_adult(18)
## Expected:
##     True
## Got:
##     False

Usually, we are more interested in the tests that fail than in the tests that pass. To keep the output short and focused, we can turn off the detailed (verbose) report with the following adaptation.

if __name__ == "__main__":
    import doctest
    doctest.testmod() # with the default setting, i.e. verbose=False

When you run this, Python quietly checks all the examples in your docstrings behind the scenes. It won’t print anything if everything passes — it only speaks up when something goes wrong. For our example, we get a short, clean report that shows only the important part.

## **********************************************************************
## Line 2, in Example B
## Failed example:
##     is_adult(18)
## Expected:
##     True
## Got:
##     False

Solution. Let us assume that there is a mistake in our test specification.

def is_adult(age):
    """
    >>> is_adult(18)
    False  

    ^ mistake 
    """ 
    return age >= 18

This test case will fail, too.

## **********************************************************************
## Line 2, in Example C
## Failed example:
##     is_adult(18)
## Expected:
##     False
## Got:
##     True

While we usually trust our expectations to be accurate, it’s essential to approach test case writing with caution. Errors can sneak into our test cases just as they can into our code. That’s why, especially in the development of safety-critical software, it’s common practice to have separate teams for programming and testing. This division helps ensure an unbiased and thorough examination of the software’s functionality.

Similarly, when it comes to integrating GenAI into your future projects, it’s wise to maintain a hands-on approach to testing. By writing your own test cases and using GenAI to generate the corresponding program code, you place yourself in a stronger position to catch any discrepancies. This strategy keeps you firmly in control of the software’s quality assurance.

5.2 Test suite

When we have many test cases, it’s helpful to group them together in a test suite. A test suite is a collection of test cases that are run together to check if a program is working correctly. This is especially useful when we make changes or add new features—running the test suite can quickly show us if we broke something by mistake. It also helps us keep track of all the important checks our program needs.
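
As a minimal sketch building on the doctest examples from the previous section, several test cases can live side by side in one docstring, and doctest runs them all as a group:

def is_adult(age):
    """
    Several test cases grouped together form a small test suite.

    >>> is_adult(50)
    True
    >>> is_adult(18)
    True
    >>> is_adult(17)
    False
    """
    return age >= 18

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # runs every test case found in the module's docstrings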

There are different ways to build and organize a test suite, depending on how we want to test our program. The most well-known approaches are black-box and white-box testing. Most programmers use a mix of both to help catch different kinds of problems and make sure their programs are working well inside and out.

5.3 Black-box testing

One common approach is called black-box testing. In black-box testing, we don’t look at the code itself—we just try different input data and check if the outputs are correct. It’s like using a calculator: you type in numbers, press a button, and check the answer, without knowing what’s happening inside.

To choose good test data in black-box testing, we can use some helpful strategies. One of them is called equivalence partitioning. The idea is to group data that should behave the same way. For example, if a program accepts ages from 18 to 100, then we might pick one value from inside the valid range (like 50), and one from outside (like 12 or 150). We assume that testing one value from each group is enough to check how the program handles that kind of input data.
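
Here is a minimal sketch of this idea, assuming a hypothetical function is_valid_age() that accepts ages from 18 to 100 (it is not part of the examples above, just an illustration):

def is_valid_age(age):
    """
    Equivalence partitioning: one representative value from each group.

    >>> is_valid_age(50)    # a value inside the valid range 18-100
    True
    >>> is_valid_age(12)    # a value below the valid range
    False
    >>> is_valid_age(150)   # a value above the valid range
    False
    """
    return 18 <= age <= 100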

Another useful idea is boundary value analysis. Sometimes, bugs happen right at the edges—like just before or just after a limit. In our age example, we’d want to test values like 17, 18, 100, and 101. These are right on the boundary of what’s allowed. Testing the edges helps us catch common mistakes that might not show up with typical input data.
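
Continuing the same hypothetical is_valid_age() sketch, boundary value analysis adds test cases right at the edges of the valid range:

def is_valid_age(age):
    """
    Boundary value analysis: test cases at the edges of the range 18-100.

    >>> is_valid_age(17)    # just below the lower boundary
    False
    >>> is_valid_age(18)    # the lower boundary itself
    True
    >>> is_valid_age(100)   # the upper boundary itself
    True
    >>> is_valid_age(101)   # just above the upper boundary
    False
    """
    return 18 <= age <= 100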

5.4 White-box testing

Another approach is called white-box testing. In white-box testing, we do look inside the code. We try to test all the possible paths the program can take (like every if condition) to make sure each part of the code behaves as expected.

In white-box testing, we often try to test as much of the code as possible. This is called coverage. For example, we may want to make sure that every function in the program is called at least once (function coverage), or that every line of code runs during testing (statement coverage). If the code has an if statement, we try to test both the true and false branches (decision coverage). If the condition is more complex—like if x > 0 and y < 10—we try values that make each part true and false (condition coverage). The more of the code we test, the more likely we are to catch bugs before they cause problems.
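
As a rough sketch of these coverage ideas, assume a hypothetical function check() built around the condition mentioned above; the function and its return values are made up for this illustration. The test cases below exercise both branches of the if statement and make each part of the condition true and false:

def check(x, y):
    """
    Decision coverage: the if condition becomes both true and false.
    Condition coverage: each part of 'x > 0 and y < 10' becomes true and false.

    >>> check(5, 5)     # x > 0 is true,  y < 10 is true  -> if branch
    'accepted'
    >>> check(-1, 5)    # x > 0 is false                  -> else branch
    'rejected'
    >>> check(5, 20)    # x > 0 is true,  y < 10 is false -> else branch
    'rejected'
    """
    if x > 0 and y < 10:
        return 'accepted'
    return 'rejected'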

After we’ve covered the basics of white-box testing, let’s talk about a cool way to check how good our tests really are. It’s called mutation testing. Think of it like giving your code a little “what if” scenario. We make tiny tweaks to the code—these are our mutations. They’re like the common mistakes we all make when coding, such as mixing up a > with a >= or forgetting to add 1. After we’ve tweaked our code with these little changes, we run our tests again. If our tests are on point, they’ll notice these tricky changes and fail. That’s a good thing here—it means they’ve caught the error or “killed the mutant”. But if the tests don’t fail, the sneaky error has slipped through or “the mutant survived”. That tells us we need to improve our tests. It’s like a game of hide and seek with bugs, and it’s a super way to make sure our tests are really looking out for all those potential hiccups in our code.
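
To illustrate, we can reuse our is_adult() function as a sketch. Suppose a mutation tool changes the >= into a >. A test case in the middle of the range does not notice the change, but a test case at the boundary fails for the mutant and thereby kills it:

def is_adult(age):
    """
    >>> is_adult(50)    # passes for the original and for the mutant: does not kill it
    True
    >>> is_adult(18)    # passes for the original, fails for the mutant: kills it
    True
    """
    return age >= 18    # a mutation tool might change >= into > here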

5.5 Exercises

  • To practice, complete the interactive tutorials in the online version of the textbook and the exercises provided in the learning management system.

5.6 Glossary

  • automated testing: Using software to automatically run and repeat tests on your code. This method is efficient and can save time because the computer does the work for you.

  • black-box testing: A method where you test the program based on the expected output for given data, without considering the internal workings of the code. It’s like evaluating a machine based on its function, not how it’s built.

  • boundary value analysis: This technique involves testing at the extreme ends of input data ranges, where errors are often found. It’s a way to ensure that the limits you set in your program are working correctly.

  • coverage: A measure of how much of your code is tested by your test cases. High coverage means that more of the code’s pathways and conditions have been checked for correctness.

  • equivalence partitioning: Grouping data that should be treated similarly by the program and testing a sample from each group. This approach helps to efficiently identify errors in handling different types of data.

  • manual testing: The process of a person manually running tests on the software. This hands-on approach allows the tester to experience the program as an end-user, but it can be time-consuming.

  • mutation testing: A method where the code is deliberately altered with small changes to check if the existing tests can detect the modifications. It’s a way to validate the sensitivity of your tests.

  • testing: The act of running a program with the intention of finding errors and verifying that it behaves as intended. It’s a critical step in ensuring the quality of the software.

  • refactoring: The process of restructuring existing code without changing its external behavior. It’s aimed at improving the code’s readability and reducing complexity.

  • test case: An individual unit of testing that includes a set of data, execution conditions, and expected results. It’s a specific scenario designed to verify a particular function of the code.

  • test-driven development (TDD): A software development approach where tests are written before the code they are meant to validate. It ensures that the code meets its design and behaves as intended.

  • test suite: A collection of test cases that are designed to be executed together to validate the functionality of a software program. It’s a comprehensive evaluation of the software’s performance.

Editorial note

Versions:

  • In 2018, the testing topic was covered in a lecture.
  • In 2020, the first version of the chapter was written.
  • In 2022, the content was significantly reduced and simplified based on students’ feedback.
  • In 2025, the content was again reduced and simplified, mimicking the writing style of the Py4E book using an LLM (ChatGPT 5) and splitting the content into the chapter and tutorials. Special thanks to Malin Schulz for her feedback on the previous version and to Laura Grabher-Meyer for reviewing the current version and providing valuable suggestions.