Chapter 5 Testing

This chapter explains general principles of automated testing and gives technical instructions for integrating test cases into program development.

Remark: there are lecture videos with this content.

5.1 Automated Software Testing

Testing is an integral part of software development. It is one of the practices that distinguishes a programmer who is an engineer from a script kiddie. Initially, it may look to you like additional effort without a direct contribution to project progress, but, hopefully, with time you will discover the usefulness of tests.

5.1.1 Introduction

Let us start with motivation and intuition behind testing.

Motivation

To understand the importance of automated testing, let us take a global perspective. Nowadays, we are surrounded by software and depend on the correct functioning of programs. Software is used in administration, finance, medicine, telecommunications, transportation and many other domains. We assume that it is working correctly, but it may have critical software bugs with unpleasant or even dangerous consequences. You may lose your money if somebody did not diligently test an online shop and your banking data got stolen. You may even lose your life if the software of a plane was insufficiently quality-controlled and it crashed! This should not be the case, as, fortunately, avionics has high quality standards.

I guess you agree that testing is definitely crucial for safety- or mission-critical parts of software, and even for any serious software. There is a general consensus about this, and there are legal regulations for the extent of quality assurance procedures for critical systems (e.g. SIL). Testing, together with other quality assurance procedures, increases our confidence that a piece of software is working correctly.

The importance of testing was recognised by big players in the IT domain. As an example, I would like to cite the summary of the Testing Overview chapter from the book on Software Engineering at Google: Lessons Learned from Programming Over Time.

The adoption of developer-driven automated testing has been one of the most transformational software engineering practices at Google. It has enabled us to build larger systems with larger teams faster than we ever thought possible. It has helped us keep up with the increasing pace of technological change. Over the past 15 years, we have successfully transformed our engineering culture to elevate testing into a cultural norm. Despite the company growing by a factor of almost 100 times since the journey began, our commitment to quality and testing is stronger today than it has ever been.

Intuition

Before we go into definitions related to testing and technical examples, I would like to give you a few analogies from daily life and a general concept of a test.

A test is an assessment intended to measure a test-taker’s knowledge, skill, aptitude, physical fitness, or classification in many other topics [Wikipedia].

We deal with tests in our daily lives, education, medicine or science. In a broad view, by testing (putting under test) we can understand any process letting us know if an artifact or service is meeting our expectations, if an activity enables us to achieve a particular goal, or if a statement is true. Testing is required for validation and verification, the feedback we get from testing enables development.

If you want to read about daily life analogies, open the next box, otherwise, just continue with the regular text below it.

Learning to Speak

In the first years of our lives, we learn to speak in our native languages to express our needs and communicate with other people. In this process, even if unconsciously, we constantly deal with testing: we need to pass the understandability test posed by an interlocutor. If the interlocutor, a parent or sibling, cannot understand us, we fail the test: we are unable to communicate our idea or request to this person. Even if failing a test may be discouraging, it is useful, as it leads to improvement of our skills — we learn from our mistakes. If the interlocutor gives us detailed and frequent feedback, we can learn effectively and quickly. Passing a test is always satisfying, but if the quality of the test is poor, it may give a false conviction that all is fine.

Cooking a meal

Let us go into our kitchens, where we cook a meal and season it to our taste. Usually, this is an iterative process of adding spices and then tasting whether the dish tastes as desired. If it is good, we are done (the taste test passed); if not, we add more spices (the taste test failed). Cooking without tasting would be challenging.

When we think about this example, we may discover that, depending on the person, the same meal may be considered tasty or not. Similarly, in testing we need to take into account who is assessing our system and what its purpose is.

Picking an outfit

To see one more example of how fitness for purpose is important, let us look into our wardrobes and imagine picking an outfit for a special occasion. At first, we think about the general dress code depending on the purpose of the event. By the way, our personal purpose may also influence our choice: we may either follow the dress code or consciously break it. We would try different outfits and look in the mirror to test them. We are also restricted by the contents of our wardrobe (shoes, accessories, …) when deciding whether things fit together. Going with another person, we may also consider his or her outfit when making a decision. If this person sometimes wears black, we should not assume that it will also happen next time; we should check the actual outfit.

Testing is typically an iterative process in which we check whether something is useful for a given purpose or whether several things fit together. Critical thinking is extremely important in testing. Ideally, when testing, you should make no assumptions about anything. In the case of software testing, you should be critical about programs, data and also about the tests themselves… With this kind of skeptical and explorative mind, testing can be a satisfying activity.

5.1.2 Overview

Now, we will go progressively into more detailed concepts directly related to software testing. Please mind that the following definitions were simplified to match what we use in our courses.

Let us start with the general definition of software testing.

Definition 5.1 Software testing is the process of evaluating a program to identify defects or errors in its functionality, performance, and usability, with the goal of ensuring that it meets the requirements and expectations of its users.

As you can see, software testing may consider different aspects. We will just focus on functional correctness of our programs to ensure that their outcome is meeting our expectations.

For efficiency of the program development, we will aim for automated software testing.

Definition 5.2 Automated software testing uses a package to automatically execute tests and compare actual results with expected results.

It involves creating and running automated test cases (the definition will follow) to speed up the testing process and improve accuracy, consistency, and reliability of the software testing. We will learn an example package used for automated testing and how to define test cases.

Please mind that automated testing is done without any user interaction. This means that the tests run automatically, without a person needing to type anything or take any action. Otherwise, testing would take more time, and user actions could have an unpredictable impact on the test results.

There are different testing workflows. Recently, iterative workflows have become more popular. They involve testing from the beginning of the software development process, sometimes even before programming. One such approach is test-driven development.

Definition 5.3 Test-driven development (TDD) is a software development process that relies on the repetition of a very short development cycle: requirements are turned into very specific test cases, then the code is improved so that the tests pass.

We recommend using this approach to avoid developing untestable programs, which might otherwise need to be restructured to enable testing.
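As a minimal sketch of such a cycle (the function multiply() and its stub are purely illustrative, not part of the course material): the expectation is written down first, and the code is then improved until it is met.

```python
# Step 1 ("red"): turn the requirement into a concrete expectation.
# A first stub does not fulfill it yet:
def multiply(a, b):
    return 0  # not implemented yet

assert multiply(2, 3) != 6  # the expectation fails against the stub

# Step 2 ("green"): improve the code so that the expectation is met.
def multiply(a, b):
    return a * b

assert multiply(2, 3) == 6  # now the test passes
```

In real TDD, the expectation would be written as a proper test case (e.g. with doctest, introduced later in this chapter) rather than a plain assert.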

Automated testing not only increases the probability that programs work correctly, it also reduces the fear of modifying the structure of programs and breaking them. In this sense, testing makes code refactoring easier.

Definition 5.4 Code refactoring is the process of restructuring an existing program without changing its behavior. Refactoring is intended to improve the design/structure of the program, while preserving its functionality.

Testing and refactoring help to get programs to work correctly and be clean. Testing helps to achieve a higher level of correctness in our programs, while refactoring helps to get clean code.

For testing, we need to define test cases, put together in test suites. With them, we can define what we expect from our programs and automatically check whether these expectations are met.

5.1.3 Test Case

For the (automated) testing, we need to define test cases. They help to ensure that the program works as intended.

Definition 5.5 A test case is a set of instructions that describes the steps to be taken, inputs to be used, and expected outcomes to be observed when testing a specific program.

We will define test cases for our functions. Typically, we will just specify a function to be called (steps), its arguments (input) and the expected result (outcome). After testing, for every single test case, we will see if our function

  • passed it, meaning the expected result is the same as the actual result, or
  • failed it, meaning the expected result differs from the actual result.

We will need to analyze test reports to see which tests our programs failed and which passed to make the decision on further programming steps.

Test Case Example

For better understanding of passing and failing test cases, we will look at oversimplified examples.

Let’s start with something obviously correct: we define a function add() which should just add the two numbers given as arguments. If we call it with 1 and 2, it should return 3. If we want to define a test case for it, we need to specify what we are testing/executing: add(1, 2), and what we are expecting: 3. If we run a test case, the result from the statement under test (3) will be compared with the expected result (3). For this test case, they will be the same, so the test case state will be OK/passed.

Now, let’s check something obviously wrong.

We expect that add(1, 2) is equal to 3, but our function is returning 2. For this test case, the result from the statement under test will differ from the expected result, therefore the test case state will be failed. As we are sure that the expected result is correct, there must be a mistake in the implementation of our function.

We expect that add(1, 2) is equal to 2, but our function is returning 3. This time, the test failed not because our function is working incorrectly, but because of a mistake in the expected output. Usually, our expectations are correct; however, we need to be aware that it is possible to make mistakes in the test specification, too. Therefore, as mentioned earlier, be critical.

Summarizing and comparing the test cases, we have:

test case         add(1, 2) ?= 3    add(1, 2) ?= 3    add(1, 2) ?= 2
to be executed    add(1, 2)         add(1, 2)         add(1, 2)
expected result   3                 3                 2 (mistake)
obtained result   3                 2 (mistake)       3
test case state   passed/OK         failed            failed

If you want to look ahead and see how this may technically look in Python, open the next box.

We will learn later how to define test cases in Python using doctest, but let’s briefly look at the aforementioned example test cases.

Passing the test case

First, we need to define our add() function.

# definition of function in Python

def add(a, b):
    return a + b

In doctest’s test case definition, the first line(s), starting with >>>, define our statement(s) under test. We can put here whatever Python statement(s) we wish to be executed. In the next line(s), we specify what we expect.

# test case in Python (doctest)
"""
>>> add(1, 2)
3  
"""
## Finding tests in Example
## Trying:
##     add(1, 2)
## Expecting:
##     3
## ok

Failing the test case due to a mistake in the code

Let us assume that there is a mistake in our function.

# definition of function in Python

def add(a, b):
    return b  # <-- mistake 

Now, the same test case will fail.

# test case in Python (doctest)
"""
>>> add(1, 2)
3  
"""
## Finding tests in Example
## Trying:
##     add(1, 2)
## Expecting:
##     3
## **********************************************************************
## Line 2, in Example
## Failed example:
##     add(1, 2)
## Expected:
##     3
## Got:
##     2

Failing the test case due to a mistake in the expected result

Let us assume that our function was corrected (debugged) and works again, as at the beginning.

Now, we use another test case:

# test case in Python (doctest)
"""
>>> add(1, 2)
2  

^ mistake 
"""
## Finding tests in Example
## Trying:
##     add(1, 2)
## Expecting:
##     2
## **********************************************************************
## Line 2, in Example
## Failed example:
##     add(1, 2)
## Expected:
##     2
## Got:
##     3

This time it fails because we made a mistake in the expected result.

5.1.4 Test Suite

For systematic testing, we need a collection of test cases; such a collection is called a test suite.

Definition 5.6 A test suite is a collection of test cases that are intended to be used to test a program to show that it has some specified set of behaviors.

Creating a test suite is the most critical activity in testing. When defining a test suite, we have to think about what our goal is, what behavior we want to test and to what extent.

An important aspect which determines how we approach defining a test suite is whether we have access to the source code of the program under test. If we can and want to look into the source code, we talk about white-box or glass-box testing. If we define a test suite without looking into the source code, we talk about black-box testing. We will use a mixture of white- and black-box testing, which is called gray-box testing, but initially we will focus on cases with a clear distinction. Depending on the type of testing, we have different techniques for defining a test suite.

White-Box Testing

In our course, you are both programming and testing, therefore you have access to the source code when testing. We also have this situation in projects where there is no strict separation between programmers and testers. This can be related to the small size of a project or the methodology used, e.g. TDD.

Definition 5.7 In white-box testing, we have access to the source code of the program under test. This means we know the structure of the program and see the statements, including control flow statements and the conditions used.

As in white-box testing we can see the source code, we can apply techniques to systematically cover the entire program with test cases. We can measure the extent of testing with test coverage.

Definition 5.8 Test coverage is a measure used in white-box testing to describe the degree to which the source code of a program is executed when a particular test suite runs.

A program with high test coverage, measured as a percentage, has had more of its source code executed during testing, which suggests it has a lower chance of containing undetected software bugs compared to a program with low test coverage.

There are many coverage criteria; let’s focus on a few basic ones:

  • Function coverage aims to have each function in the program called at least once,
  • Statement coverage aims to have each statement in the program executed at least once,
  • Decision coverage aims to have each decision outcome to be exercised,
  • Condition coverage is used for conditions with logical expressions and aims to have each sub-expression evaluated once to true and once to false.

As in our course we are working at the level of defining functions, the first criterion is too weak for us. We will explain statement and decision coverage on an example with a simple condition, and condition coverage on an example with conditions containing logical expressions.

Example with a simple condition

Let us consider a function returning a given number as a string with a sign.

def get_number_as_str_with_sign(number):
    result = ""
    if number >= 0:
        result += "+"
    result += str(number)
    return result

We will define a test suite with two test cases for it.

"""
>>> get_number_as_str_with_sign(-1)
-1
>>> get_number_as_str_with_sign(0)
+0 
"""

Considering statement coverage, we have the following reasoning.

  • In our function, we have 6 statements (lines).
  • With the first test case (-1), we will execute 5 of 6 statements. It means, the coverage is 83%.
  • With the second test case (0), we will execute all 6 statements. It means, the coverage is 100%.
  • Obviously, with our test suite, we will have 100% coverage.

Considering decision coverage, we have the following reasoning.

  • In our function, we have one decision point with two decision outcomes: the given number is greater than or equal to zero, or it is not.
  • With the first test case (-1), we will decide that it is not and thus the coverage is 50% (1/2).
  • With the second test case (0), we will decide that it is and thus the coverage is 50% (1/2).
  • Obviously, with our test suite, we will have 100% coverage (we covered both options).

In this example, we get different coverage degrees for particular test cases and criteria, but for our test suite, it was the same (100%).

This is not always the case. Just consider the same example, but with the test suite consisting of one test case, namely with 0. Just with this one test case, we will get the full statement coverage (100%), but only 50% of decision coverage.
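This difference can be checked by hand with a quick sketch (plain assert statements instead of doctest, just for illustration):

```python
def get_number_as_str_with_sign(number):
    result = ""
    if number >= 0:
        result += "+"    # with input 0, this line runs too
    result += str(number)
    return result

# With only this test case, every statement is executed
# (100% statement coverage) ...
assert get_number_as_str_with_sign(0) == "+0"
# ... but the decision "number >= 0" has only been evaluated to True
# (50% decision coverage); a negative input covers the other outcome:
assert get_number_as_str_with_sign(-1) == "-1"
```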

Example with a logical condition

Let us consider a function returning gender and age sensitive greetings.

def gender_greetings(female, adult):
    if female and adult:
        result = "Dear Ladies"
    else:
        result = "Dear All"
    return result 

First, we will define a test suite with two test cases for it. With this test suite, we will get the full statement and full decision coverage (check it).

"""
>>> gender_greetings(True, True)
Dear Ladies
>>> gender_greetings(True, False)
Dear All
"""

However, the condition coverage is only 50%. To get full condition coverage, we would need each sub-expression to be evaluated to true and to false. In our example, we need four test cases: (female, not female) × (adult, not adult). Therefore, we need the following test suite.

"""
>>> gender_greetings(True, True)
Dear Ladies
>>> gender_greetings(True, False)
Dear All
>>> gender_greetings(False, True)
Dear All
>>> gender_greetings(False, False)
Dear All
"""

In testing, not only the degree of coverage is important but also values that we take for testing. To understand it, we need to leave the white-box…

Black-Box Testing

…and dive into the black box. In general, black-box testing is done if we do not have access to the source code (e.g. commercial, closed-source software) or if we deliberately want to separate the programming and testing activities/teams (e.g. in mission-critical systems). However, its techniques can also be used in our course.

Definition 5.9 In black-box testing, we have no access to the source code, so we need to test the program from an external, end-user perspective. Black-box testing is possible even without programming skills, as it focuses on the behavior of the program.

As we cannot look into the source code, we have to observe the behavior of the program under test from the outside. One possible way to analyze the behavior systematically is to use equivalence partitioning and boundary value analysis. Let’s take a closer look at both techniques.

Equivalence partitioning

First, we will take a look at the equivalence partitioning.

Definition 5.10 Equivalence partitioning divides the input data of a program into partitions of equivalent data from which test cases can be derived. In principle, test cases are designed to cover each partition at least once.

The input of the program should be divided into equivalence classes. For input values from the same class, we would obtain the same result.

For example, let us consider a function which determines whether a grade is positive or negative. We would have two equivalence classes: 1 to 4 as positive grades and 5 as the negative grade (in the Austrian education system). We could add one more class, invalid grade, for all numbers below 1 and above 5. Optionally, we could split it into “invalid too low” and “invalid too high”.

Equivalence classes for grading examples

For testing, we could take any value from a class and would obtain the same result. This means that having multiple test cases for the same class usually does not increase the chance of detecting a defect.

In our example, we could pick -1, 3, 5, 8 for testing.
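A possible implementation of the grade check, with one test value per equivalence class (the function name and return values are illustrative; the text does not prescribe them):

```python
def classify_grade(grade):
    """Classify a grade on the Austrian scale (sketch)."""
    if 1 <= grade <= 4:
        return "positive"
    if grade == 5:
        return "negative"
    return "invalid"

# One representative value per equivalence class:
assert classify_grade(-1) == "invalid"   # invalid (too low)
assert classify_grade(3) == "positive"
assert classify_grade(5) == "negative"
assert classify_grade(8) == "invalid"    # invalid (too high)
```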

Boundary value analysis

When considering the usefulness of test cases, note that not all values from a class are equally effective for testing. Values closer to the boundaries between classes are ideal candidates. The technique that takes this into account is called boundary value analysis.

Definition 5.11 Boundary-value analysis is a technique in which tests are designed to include representatives of boundary values in a range. The test values on either side of the boundary between equivalence classes are called boundary values.

Boundary values are sensitive to small changes in the source code and therefore more effective for testing purposes.

For example, from the positive grade class we could pick 1 and 4, as the values closest to the “invalid too low” and “negative” classes, respectively. Going in this direction, we would need to pick the following values for testing: 0, 1, 4, 5, 6.

It is helpful to visualize the values, equivalence classes and boundaries when picking the values for testing.

Test values for grading examples

If we imagine the behavior of the program changing in such a way that the boundaries slightly move, our values should detect these changes. I believe this is intuitive, but it may be a bit abstract.

In our example, for a positive grade we may have the following condition: 1 <= value <= 4. For our testing values: 1 and 4, we will get True. Now, if we make a mistake in the condition and use 1 < value <= 4, the test case with 1 will detect it, because we will obtain False, where we expected True. Of course, any other number from this class (2, 3, 4) would not detect this change. On the other hand, 4 is protecting the other side of the condition (see the table below).

testing value    1 <= value <= 4    1 < value <= 4    1 <= value < 4
0                False              False             False
1                True               False             True
2                True               True              True
3                True               True              True
4                True               True              False
5                False              False             False
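The effect shown in the table can be checked directly in Python (the function names are made up for illustration):

```python
def is_positive_correct(value):
    return 1 <= value <= 4

def is_positive_buggy(value):
    return 1 < value <= 4  # off-by-one mistake on the lower boundary

# The boundary value 1 tells the two versions apart:
assert is_positive_correct(1) is True
assert is_positive_buggy(1) is False

# An interior value such as 3 cannot detect the mistake:
assert is_positive_correct(3) == is_positive_buggy(3)
```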

It is relatively easy to use equivalence partitioning and boundary value analysis when we deal with a single whole-number value. For a continuous scale, we would need to decide on the accuracy with which we pick the closest value. The accuracy should be something meaningful for our domain.

For example, if we had age in years as a real number, we could set the accuracy to one day, which means 1/366 (≈ 0.0027) years.

And it gets even more complicated with functions with more arguments to be used in test cases.

Let’s now look at an example with two input values. We will consider an assessment rule where you take two exams and can score between 0 and 100 points in each of them. An exam is passed when more than 50 points are scored.

For a single exam, we have the following equivalence partition:

Equivalence classes and testing values for assessment example (1D)

Additionally, to pass a course, we may need to pass both exams. Taking into account both exams will extend our input space to two dimensions.

Remark on the graphical representation: Points not filled are excluded from a particular area.

Equivalence classes for assessment example (2D)

Now, we need to take care to pick test values that cover all corners. To reduce the overall number of tests, we can do this also indirectly, i.e. values from two test cases may cover a virtual one, e.g. with (50, 100) and (100, 50), also (50, 50) is covered.

Equivalence classes and testing values for assessment example (2D)
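The assessment rule can be sketched as a function (a hypothetical implementation; the text does not prescribe one), with assertions for some of the corner values of the two-dimensional input space:

```python
def passed_course(exam1, exam2):
    """Each exam scores 0-100; an exam is passed with more than 50
    points; the course is passed only if both exams are passed."""
    return exam1 > 50 and exam2 > 50

# Corner values of the two-dimensional input space:
assert passed_course(100, 100) is True
assert passed_course(51, 51) is True      # just above both boundaries
assert passed_course(50, 100) is False    # exactly 50 is not enough
assert passed_course(100, 50) is False
assert passed_course(0, 0) is False
```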

5.1.5 Glossary

Collection of the definitions from the chapter.

  • Software testing is the process of evaluating a program to identify defects or errors in its functionality, performance, and usability, with the goal of ensuring that it meets the requirements and expectations of its users.

  • Automated software testing uses a package to automatically execute tests and compare actual results with expected results.

  • Test-driven development (TDD) is a software development process that relies on the repetition of a very short development cycle: requirements are turned into very specific test cases, then the code is improved so that the tests pass.

  • Code refactoring is the process of restructuring an existing program without changing its behavior. Refactoring is intended to improve the design/structure of the program, while preserving its functionality.

  • A test case is a set of instructions that describes the steps to be taken, inputs to be used, and expected outcomes to be observed when testing a specific program.

  • A test suite is a collection of test cases that are intended to be used to test a program to show that it has some specified set of behaviors.

  • In white-box testing, we have access to the source code of the program under test. This means we know the structure of the program and see the statements, including control flow statements and the conditions used.

  • Test coverage is a measure used in white-box testing to describe the degree to which the source code of a program is executed when a particular test suite runs.

  • In black-box testing, we have no access to the source code, so we need to test the program from an external, end-user perspective. Black-box testing is possible even without programming skills, as it focuses on the behavior of the program.

  • Equivalence partitioning divides the input data of a program into partitions of equivalent data from which test cases can be derived. In principle, test cases are designed to cover each partition at least once.

  • Boundary-value analysis is a technique in which tests are designed to include representatives of boundary values in a range. The test values on either side of the boundary between equivalence classes are called boundary values.

5.2 Doctest in Python

In Python, there are multiple packages for defining unit tests or conducting advanced testing procedures. In my opinion, the easiest one for beginner programmers is doctest, which allows defining interactive Python examples to be included in documentation and used as tests.

Remark: there are lecture videos with this content. For easier understanding, it is highly recommended to watch the video on doctest before you continue reading.

5.2.1 Configuration

Before we dive into technical details, let us think about the usage scenarios with user-defined functions.

  • On one hand, we develop functions: program and test them.
  • On the other hand, we want to use or reuse them in other programs.
    In this situation, we will import function definitions, but we do not want to test them.

To make this distinction, we can use a conditional statement on the value of the built-in variable __name__ (see the lecture on functions). In the file where we define functions, we have to add this conditional execution of tests. Only if __name__ == "__main__", which means that we are running the file with the function definitions directly, will we activate doctest and execute the test cases.

## Here will come function definitions including test cases.
## As for now, we will just add a definition of a function:

def add(a, b):
    return a + b

if __name__ == "__main__":
    import doctest     # importing doctest module
    doctest.testmod()  # activating test mode

Remark: Typically, all modules are imported at the top of a file. Here, we need doctest only for particular usage scenario, so we can make an exception and import this module inside the if-branch.
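Putting the pieces together, a minimal complete file could look as follows (the docstring test case is explained in detail in the next section):

```python
# A minimal complete file: a function with an embedded doctest
# test case, plus the conditional activation of the test mode.

def add(a, b):
    """Return the sum of a and b.

    >>> add(1, 2)
    3
    """
    return a + b

if __name__ == "__main__":
    import doctest     # importing doctest module
    doctest.testmod(verbose=True)  # activating test mode
```

Running this file directly executes the test case; importing it from another program does not.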

We can determine how much information we want to see when we run the test cases by setting the verbose parameter either to True or to False when calling testmod().

If verbose is set to True, we will get a full report on all test cases. For the example cases from the previous chapter, I used verbose mode to obtain the full report, as here:

## test case in Python (doctest)
"""
>>> add(1, 2)
3  
>>> add(1, 2)
2  
"""
## Finding tests in Example
## Trying:
##     add(1, 2)
## Expecting:
##     3
## ok
## Trying:
##     add(1, 2)
## Expecting:
##     2
## **********************************************************************
## Line 4, in Example
## Failed example:
##     add(1, 2)
## Expected:
##     2
## Got:
##     3

If verbose is set to False, we will only get a report on failed cases. For the same example test cases, we will get the following report:

## **********************************************************************
## Line 4, in Example
## Failed example:
##     add(1, 2)
## Expected:
##     2
## Got:
##     3

If all test cases are OK, we get no report at all. Therefore, to be sure that the tests were actually run, we can set verbose to True.

Alternatively, we can define test cases in separate file(s) instead of having them in the file(s) where our function(s) are defined. If we want to execute tests kept in a separate file, we use doctest.testfile().

If you want to try it yourself, please note that in this case, you need to insert a blank line after your last specified result and before closing the multiline string with triple quotation marks.
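A small, self-contained sketch of this variant (the file name add_tests.txt is made up; normally the test file would be written by hand, not generated):

```python
import doctest

# For illustration, create the external test file programmatically.
# It contains the same kind of >>> examples as a docstring would.
with open("add_tests.txt", "w") as f:
    f.write(">>> 1 + 2\n3\n")

# Run all test cases found in the file; module_relative=False makes
# the path relative to the current working directory.
results = doctest.testfile("add_tests.txt", module_relative=False,
                           verbose=False)
print(results)  # TestResults(failed=0, attempted=1)
```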

5.2.2 Test case

Let us take a closer look at definition, interpretation and result of a test case in a very simple example.

Test case definition

A test case in doctest looks like a step in an interactive Python interpreter session. Python statements come after the >>> prompt, and below them we have the interpreter’s response. The difference is that the test cases are defined within a docstring, which means they are inside a multiline string in triple quotation marks.

Let us start with the same example as in the previous chapter.

"""
>>> add(1, 2)
3  
"""

Test case interpretation

If doctest is imported and testmod() is called, the Python interpreter will look for documentation strings containing >>>. For each such string, a new environment (frame) is created with all globally visible names (variables, functions). Therefore, separately defined tests do not interfere with each other.

A sequence of lines starting with >>> is evaluated and the environment is updated.

In our example, there is only one line, namely add(1, 2). It is evaluated and 3 is stored as the obtained result.

A sequence of lines not starting with >>> is interpreted as the expected result. The sequence ends with the next line starting with >>>, a blank line or the end of the documentation string.

In our example, there is only one line, namely 3, which is stored as the expected result.

Next, the obtained result is compared with the expected result. Results are compared as strings, not as objects; therefore, they must match exactly, character by character. If they are equal, the test case will have the status pass, otherwise fail.

In our example, 3 (as the obtained result) will be compared with 3 (as the expected result). As they are equal, the test case will be OK (pass).

Test case result / report

Doctest shows a report for a single test case or for multiple test cases. For multiple test cases, a summary can be shown as well, containing the number of failed test cases and the total number of test cases. Interactive reports are also shown by integrated development environments, e.g. PyCharm, where you can click on a test case and jump to its definition.

In our example, the test case was OK (passed), as it can be seen in the below verbose report.

## Finding tests in Example
## Trying:
##     add(1, 2)
## Expecting:
##     3
## ok

5.2.3 Example test cases

In this section, we will use oversimplified, user-defined functions just to illustrate the testing mechanism.

Multiple statements

Just as in an interactive Python interpreter, we can execute several statements before we obtain a result.

"""
>>> a = 1
>>> b = 2
>>> add(a, b)
3  
"""
## Finding tests in Example
## Trying:
##     a = 1
## Expecting nothing
## ok
## Trying:
##     b = 2
## Expecting nothing
## ok
## Trying:
##     add(a, b)
## Expecting:
##     3
## ok

From the verbose report, we can see that after the first two lines we left no output line, meaning we expected nothing, and indeed we got nothing from the interpreter. When we have multiple lines for one test case, the report shows several tests passed. In this example, the only relevant information is that the last result was as expected. When you count the number of test cases (e.g. in the exams), you should only consider the actual test cases, here one.

In general, all names defined within the testing session are visible in its scope. Therefore, the testing session is similar to a Python interpreter session. For examples and explanations see the exercise set (see U06_Tests/BestPractices/Session).

Multi-line expected result

Similarly, we can have a multiline expected result. The end of the expected result is either an empty line or a new test case. For example, we may test a function printing several lines (we need to indicate empty lines with <BLANKLINE> so as not to end the expected result too early):

"""
>>> letter_template("Joanna")
Hello,
<BLANKLINE>
Best regards,
Joanna
"""
## Finding tests in Example
## Trying:
##     letter_template("Joanna")
## Expecting:
##     Hello,
##     <BLANKLINE>
##     Best regards,
##     Joanna
## ok

One important remark here: doctest compares expected results and obtained results as strings. By default, the comparison is character exact, so leading or trailing spaces make the results differ. For example, when expecting "Hello,", but getting "Hello, ", the test will fail.

It is possible to normalize whitespace to get a less sensitive form of comparison, see the NORMALIZE_WHITESPACE option flag.

Expecting an exception

We can also test whether a function raises an exception. To test an exception, we need to let doctest know that there will be traceback information, so we need to

  • include the traceback header line as the first line of the expected result,
  • skip the details of the traceback or replace them with a placeholder,
  • include the last line of the traceback with the error type and error message.

Let us restrict our add() function to accept integers only.

def add(a, b):
    if type(a) == int and type(b) == int:
        return a + b
    raise TypeError("Both arguments must be integers.")

Now, we can test for user defined exceptions.

"""
>>> add(1, 2.0)
Traceback (most recent call last):
    ...
TypeError: Both arguments must be integers.
"""
## Finding tests in Example
## Trying:
##     add(1, 2.0)
## Expecting:
##     Traceback (most recent call last):
##         ...
##     TypeError: Both arguments must be integers.
## ok

Literal comparison of results

It is important to know that the expected and obtained results are compared as strings, character by character.

Even though 3 == 3.0, the following test case would fail if we expected 3.0.

"""
>>> add(1, 2)
3.0  
"""
## Finding tests in Example
## Trying:
##     add(1, 2)
## Expecting:
##     3.0
## **********************************************************************
## Line 2, in Example
## Failed example:
##     add(1, 2)
## Expected:
##     3.0
## Got:
##     3

However, we may include the comparison explicitly in the test case statement

"""
>>> add(1, 2) == 3.0  
True 
"""

and it will pass.

## Finding tests in Example
## Trying:
##     add(1, 2) == 3.0
## Expecting:
##     True
## ok

This is especially important to remember when comparing unordered collections, such as dictionaries. The next box goes a bit ahead, so you may come back to this example later.

This is a bit tricky when comparing unordered collections such as dictionaries. To illustrate it, let us define a simple function that merges two dictionaries with unique keys.

def merge_dictionaries(a_dict_1, a_dict_2):
    keys = list(a_dict_1.keys()) + list(a_dict_2.keys())
    assert len(keys) == len(set(keys)), "Keys in dictionaries are not unique"
    a_dict_1.update(a_dict_2)
    return a_dict_1

If the obtained and expected results are equal dictionaries written in a different order, our test case will fail (note that in the report below the string quotes differ as well: Python prints strings with single quotes, while the expected result uses double quotes). For example:

"""
>>> merge_dictionaries({1: "first"}, {2: "second"})
{2: "second", 1: "first"}
"""
## Finding tests in Example
## Trying:
##     merge_dictionaries({1: "first"}, {2: "second"})
## Expecting:
##     {2: "second", 1: "first"}
## **********************************************************************
## Line 2, in Example
## Failed example:
##     merge_dictionaries({1: "first"}, {2: "second"})
## Expected:
##     {2: "second", 1: "first"}
## Got:
##     {1: 'first', 2: 'second'}

A way around this is to let Python compare the structures and just check the result of this comparison. For the above example, we can define the alternative test case as follows:

"""
>>> merge_dictionaries({1: "first"}, {2: "second"}) == {2: "second", 1: "first"}
True
"""
## Finding tests in Example
## Trying:
##     merge_dictionaries({1: "first"}, {2: "second"}) == {2: "second", 1: "first"}
## Expecting:
##     True
## ok

5.2.4 TDD Tutorial

Now let us take a more sophisticated example and define a conversion function from meters to feet. To understand the development process of this function, we will do it step by step.

First, we need to read the description (docstring) and start defining test cases and the body of the function.

def meters2feet(a_distance):
    """
    Calculates feet based on a distance (parameter) given in meters.

    :param a_distance: distance in meters
    :type a_distance: float

    :return: distance in feet
    :rtype: float
    """
    pass

Hint: for a better learning effect, implement this solution yourself following the steps described below. Use the template given in the exercise set (see U06_Tests/Preparation/Feet).

Core functionality

We can add a functional test for our (empty) function.

Remark: if you want to run the source code presented for particular steps, you need to add if __name__ == "__main__": ... at the end of the source code, as described in the Configuration section above. The complete source code is included in the final step at the end of this section.

The first step is to define a test case for 1.0 (float).

Solution. Here we have our first test case.

def meters2feet(a_distance):
    """
    Calculates feet based on a distance (parameter) given in meters.

    :param a_distance: distance in meters
    :type a_distance: float

    :return: distance in feet
    :rtype: float

    >>> meters2feet(1.0) # <--- added in this step -------------------
    3.28084

    """
    pass
## Finding tests in meters2feet
## Trying:
##     meters2feet(1.0)
## Expecting:
##     3.28084
## **********************************************************************
## Line 2, in meters2feet
## Failed example:
##     meters2feet(1.0)
## Expected:
##     3.28084
## Got nothing

The function will fail the test case (0/1), because nothing is implemented in the body of the function. Next, we will make the minimal effort to pass the test case and just print a hard-coded number (in most cases, a very bad practice!).

Solution. Here we have a code just to pass the first test case.

def meters2feet(a_distance):
    """
    Calculates feet based on a distance (parameter) given in meters.

    :param a_distance: distance in meters
    :type a_distance: float

    :return: distance in feet
    :rtype: float

    >>> meters2feet(1.0)
    3.28084

    """
    print(3.28084) # <--- added in this step -------------------------
## Finding tests in meters2feet
## Trying:
##     meters2feet(1.0)
## Expecting:
##     3.28084
## ok

Obviously, the function will pass the test case now (1/1). However, we are far from done. Our function is rather useless, and having one test case is usually not enough. Let us add more functional tests and include a calculation: we add test cases for other values (int) and update the code to pass them.

Solution. Now, we have two more test cases and code to pass all three test cases.

def meters2feet(a_distance):
    """
    Calculates feet based on a distance (parameter) given in meters.

    :param a_distance: distance in meters
    :type a_distance: float

    :return: distance in feet
    :rtype: float

    Examples:

    specific values

    >>> meters2feet(0) # <--- added in this step ---------------------
    0.0

    >>> meters2feet(1/3.28084) # <--- added in this step -------------
    1.0

    >>> meters2feet(1.0)
    3.28084

    """
    print(3.28084 * a_distance) # <--- modified in this step ---------
## Finding tests in meters2feet
## Trying:
##     meters2feet(0)
## Expecting:
##     0.0
## ok
## Trying:
##     meters2feet(1/3.28084)
## Expecting:
##     1.0
## ok
## Trying:
##     meters2feet(1.0)
## Expecting:
##     3.28084
## ok

It seems that we are doing well (3/3). But we are just printing our result, not returning it, so we would not be able to use the converted value in our program (outside the function). We need to add a test case for the type of the returned value. It should be float, as specified in the docstring.

Solution. Now, we have one more test case to check if the type of the returned value is a floating point number.

def meters2feet(a_distance):
    """
    Calculates feet based on a distance (parameter) given in meters.

    :param a_distance: distance in meters
    :type a_distance: float

    :return: distance in feet
    :rtype: float

    Examples:

    specific values

    >>> meters2feet(0)
    0.0

    >>> meters2feet(1/3.28084)
    1.0

    >>> meters2feet(1.0)
    3.28084

    type of returned value

    >>> type(meters2feet(1)) == float # <--- added in this step ----
    True

    """
    print(3.28084 * a_distance)
## Finding tests in meters2feet
## Trying:
##     meters2feet(0)
## Expecting:
##     0.0
## ok
## Trying:
##     meters2feet(1/3.28084)
## Expecting:
##     1.0
## ok
## Trying:
##     meters2feet(1.0)
## Expecting:
##     3.28084
## ok
## Trying:
##     type(meters2feet(1)) == float
## Expecting:
##     True
## **********************************************************************
## Line 8, in meters2feet
## Failed example:
##     type(meters2feet(1)) == float
## Expected:
##     True
## Got:
##     3.28084
##     False

After adding this new test case, our function will fail again (3/4). We need to return the calculated value instead of printing it.

Solution. Now, we replace print with return.

def meters2feet(a_distance):
    """
    Calculates feet based on a distance (parameter) given in meters.

    :param a_distance: distance in meters
    :type a_distance: float

    :return: distance in feet
    :rtype: float

    Examples:

    specific values

    >>> meters2feet(0)
    0.0

    >>> meters2feet(1/3.28084)
    1.0

    >>> meters2feet(1.0)
    3.28084

    type of returned value

    >>> type(meters2feet(1)) == float
    True

    """
    return 3.28084 * a_distance # <--- modified in this step ---------
## Finding tests in meters2feet
## Trying:
##     meters2feet(0)
## Expecting:
##     0.0
## ok
## Trying:
##     meters2feet(1/3.28084)
## Expecting:
##     1.0
## ok
## Trying:
##     meters2feet(1.0)
## Expecting:
##     3.28084
## ok
## Trying:
##     type(meters2feet(1)) == float
## Expecting:
##     True
## ok

Now, we passed all test cases defined so far (4/4).

Error handling

If we want to have domain-specific messages for wrong arguments of our function, we need to define error handling.

We will start with an invalid value, defining both a test case and an extension of the function that raises ValueError with the message "The distance must be a non-negative number."

Solution. We are extending the code and the test case to include a reasonably small negative number.

def meters2feet(a_distance):
    """
    Calculates feet based on a distance (parameter) given in meters.

    :param a_distance: distance in meters
    :type a_distance: float

    :return: distance in feet
    :rtype: float

    Examples:

    invalid argument value
    >>> meters2feet(-0.1) # <--- added in this step ------------------
    Traceback (most recent call last):
    ValueError: The distance must be a non-negative number.

    specific values

    >>> meters2feet(0)
    0.0

    >>> meters2feet(1/3.28084)
    1.0

    >>> meters2feet(1.0)
    3.28084

    type of returned value

    >>> type(meters2feet(1)) == float
    True

    """
    # --- added in this step -----------------------------------------
    if a_distance < 0:
       raise ValueError("The distance must be a non-negative number.")
    # ----------------------------------------------------------------
    return 3.28084 * a_distance
## Finding tests in meters2feet
## Trying:
##     meters2feet(-0.1)
## Expecting:
##     Traceback (most recent call last):
##     ValueError: The distance must be a non-negative number.
## ok
## Trying:
##     meters2feet(0)
## Expecting:
##     0.0
## ok
## Trying:
##     meters2feet(1/3.28084)
## Expecting:
##     1.0
## ok
## Trying:
##     meters2feet(1.0)
## Expecting:
##     3.28084
## ok
## Trying:
##     type(meters2feet(1)) == float
## Expecting:
##     True
## ok

Now, we pass all test cases again (5/5).

Next, we will deal with the wrong type. We will accept both numeric types: float and integer.

Solution. Let us try to do it this way.

def meters2feet(a_distance):
    """
    Calculates feet based on a distance (parameter) given in meters.

    :param a_distance: distance in meters
    :type a_distance: float

    :return: distance in feet
    :rtype: float

    Examples:

    invalid argument type   
    >>> meters2feet("1") # <--- added in this step -------------------
    Traceback (most recent call last):
    TypeError: The distance must be a number.

    invalid argument value
    >>> meters2feet(-0.1) 
    Traceback (most recent call last):
    ValueError: The distance must be a non-negative number.

    specific values

    >>> meters2feet(0)
    0.0

    >>> meters2feet(1/3.28084)
    1.0

    >>> meters2feet(1.0)
    3.28084

    type of returned value

    >>> type(meters2feet(1)) == float
    True

    """
    if a_distance < 0:
        raise ValueError("The distance must be a non-negative number.")
    # --- added in this step -----------------------------------------
    if type(a_distance) != float and type(a_distance) != int:
        raise TypeError("The distance must be a number.")
    # ----------------------------------------------------------------
    return 3.28084 * a_distance
## Finding tests in meters2feet
## Trying:
##     meters2feet("1")
## Expecting:
##     Traceback (most recent call last):
##     TypeError: The distance must be a number.
## **********************************************************************
## Line 2, in meters2feet
## Failed example:
##     meters2feet("1")
## Expected:
##     Traceback (most recent call last):
##     TypeError: The distance must be a number.
## Got:
##     Traceback (most recent call last):
##       File "/usr/lib/python3.12/doctest.py", line 1368, in __run
##         exec(compile(example.source, filename, "single",
##       File "<doctest meters2feet[0]>", line 1, in <module>
##         meters2feet("1")
##       File "<string>", line 40, in meters2feet
##     TypeError: '<' not supported between instances of 'str' and 'int'
## Trying:
##     meters2feet(-0.1)
## Expecting:
##     Traceback (most recent call last):
##     ValueError: The distance must be a non-negative number.
## ok
## Trying:
##     meters2feet(0)
## Expecting:
##     0.0
## ok
## Trying:
##     meters2feet(1/3.28084)
## Expecting:
##     1.0
## ok
## Trying:
##     meters2feet(1.0)
## Expecting:
##     3.28084
## ok
## Trying:
##     type(meters2feet(1)) == float
## Expecting:
##     True
## ok

Our function now fails one test case (5/6). Any idea why?

Solution. We should always check types first, then values (content). For the final solution, we need to swap the order of the checks in the function definition.

Complete solution

We can consider the following definition as the complete solution for this example (6/6). If you would like to extend it, you could add a test case and modify the function.

Solution. Now, we have a version with desired functionality.

def meters2feet(a_distance):
    """
    Calculates feet based on a distance (parameter) given in meters.

    :param a_distance: distance in meters
    :type a_distance: float

    :return: distance in feet
    :rtype: float

    Examples:

    invalid argument type   
    >>> meters2feet("1")
    Traceback (most recent call last):
    TypeError: The distance must be a number.

    invalid argument value
    >>> meters2feet(-0.1)
    Traceback (most recent call last):
    ValueError: The distance must be a non-negative number.

    specific values

    >>> meters2feet(0)
    0.0

    >>> meters2feet(1/3.28084)
    1.0

    >>> meters2feet(1.0)
    3.28084

    type of returned value

    >>> type(meters2feet(1)) == float
    True

    """
    # --- order swapped in this step ---------------------------------
    if type(a_distance) != float and type(a_distance) != int:
       raise TypeError("The distance must be a number.")
    if a_distance < 0:
       raise ValueError("The distance must be a non-negative number.")
    # ----------------------------------------------------------------
    return 3.28084 * a_distance

if __name__ == "__main__":
    import doctest     # importing doctest module
    doctest.testmod()  # activating test mode
## Finding tests in meters2feet
## Trying:
##     meters2feet("1")
## Expecting:
##     Traceback (most recent call last):
##     TypeError: The distance must be a number.
## ok
## Trying:
##     meters2feet(-0.1)
## Expecting:
##     Traceback (most recent call last):
##     ValueError: The distance must be a non-negative number.
## ok
## Trying:
##     meters2feet(0)
## Expecting:
##     0.0
## ok
## Trying:
##     meters2feet(1/3.28084)
## Expecting:
##     1.0
## ok
## Trying:
##     meters2feet(1.0)
## Expecting:
##     3.28084
## ok
## Trying:
##     type(meters2feet(1)) == float
## Expecting:
##     True
## ok

Clean solution

Having test cases defined, we may start refactoring our solution to get a cleaner one. At any moment, we may run the test cases and be assured that our program is correct to the same extent as before refactoring. We will get rid of the magic number in the code by defining METER2FEET_RATIO.

Solution. This is a clean solution, which still passes all our test cases.

METER2FEET_RATIO = 3.28084  # <--- added in this step --------------------------

def meters2feet(a_distance):
    """
    Calculates feet based on a distance (parameter) given in meters.

    :param a_distance: distance in meters
    :type a_distance: float

    :return: distance in feet
    :rtype: float

    Examples:

    invalid argument type   
    >>> meters2feet("1")
    Traceback (most recent call last):
    TypeError: The distance must be a number.

    invalid argument value
    >>> meters2feet(-0.1)
    Traceback (most recent call last):
    ValueError: The distance must be a non-negative number.

    specific values

    >>> meters2feet(0)
    0.0

    >>> meters2feet(1/METER2FEET_RATIO)  # <--- updated in this step -----------
    1.0

    >>> meters2feet(1.0)
    3.28084

    type of returned value

    >>> type(meters2feet(1)) == float
    True

    """
    if type(a_distance) != float and type(a_distance) != int:
       raise TypeError("The distance must be a number.")
    if a_distance < 0:
       raise ValueError("The distance must be a non-negative number.")
    
    return METER2FEET_RATIO * a_distance # <--- updated in this step -----------

if __name__ == "__main__":
    import doctest     # importing doctest module
    doctest.testmod()  # activating test mode
## Finding tests in meters2feet
## Trying:
##     meters2feet("1")
## Expecting:
##     Traceback (most recent call last):
##     TypeError: The distance must be a number.
## ok
## Trying:
##     meters2feet(-0.1)
## Expecting:
##     Traceback (most recent call last):
##     ValueError: The distance must be a non-negative number.
## ok
## Trying:
##     meters2feet(0)
## Expecting:
##     0.0
## ok
## Trying:
##     meters2feet(1/METER2FEET_RATIO)
## Expecting:
##     1.0
## ok
## Trying:
##     meters2feet(1.0)
## Expecting:
##     3.28084
## ok
## Trying:
##     type(meters2feet(1)) == float
## Expecting:
##     True
## ok

5.2.5 Exercises

In the exercise set for this chapter, there are two exercises for practicing testing before the classroom session. Both exercises are continuations of exercises from the Python for Everybody book. In both of them, you are given code and your task is to define test cases.

  • Exercise 1: Testing the program calculating a payment (see U06_Tests/Preparation/ComputePay).
  • Exercise 2: Testing the program calculating a grade (see U06_Tests/Preparation/ComputeGrade).

5.3 Defensive Programming

Defensive programming is an approach intended to ensure the continuing function of a program under unforeseen circumstances. One of the goals is to make a program behave in a predictable manner despite unexpected inputs or user actions. To achieve that we need to include additional conditional execution (sanity checks) and/or deal with possible errors/exceptions (error handling).

5.3.1 Motivation

Implementing the mechanisms required for defensive programming takes additional resources (time). Therefore, the first question we need to ask is when it pays off. That depends on the usage scenario of our program, in particular on the sources of data and the communication with users.

  • Sources of data

    • Unpredictable sources of data should be checked for correctness. Such unpredictable sources of data may be user input or calls from third-party programs. We have absolutely no guarantee what data a user will feed to our program, neither do we have control over arguments somebody else is using for calling our functions. In such situations it is better to check if the data is appropriate for our purpose. If invalid data may cause problems, we should definitely check its correctness.

    • Secure sources of data do not need to be checked. If we are the only user of our program, or another team member or trusted person, and we have only internal communication within our program, we may assume that data passed around is correct.

  • Communication with a user

    • If you need more informative error messages you should use exception handling. If we rely on built-in error handling, the standard programming language error messages may be unclear to a user who does not know this particular programming language. If we define our own error handling, we can make the messages domain-specific.

    • If you need to end your program gracefully and safely, you need to foresee situations in which your program may crash. This way, users do not just see tons of traceback lines they cannot understand, and you do not risk damage or loss of data.

    • If none of the above is relevant to you, you could rely on standard exception handling.

For example, suppose our program runs for a week to collect and process some data. After it finishes, it asks the user how many records of data should be saved, and the user gives text instead of a number.

When our program needs to use this user input as a whole number, we will get an error message, for example,

records = []       
                                           # ... computing a lot of values 
number_of_records = "text"                 # <-- from user input 
number_of_records = int(number_of_records) 

ValueError: invalid literal for int() with base 10: 'text'

If we do not deal with this issue, the program will stop and no data will be saved. A loss of (computation) time! In this situation, we should definitely check the user input.

Moreover, we should write a more informative error message, for example, assuming that we collected 666 records:

ERROR: incorrect number: it should be a positive whole number smaller than 666. 
       Please try again.
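A minimal sketch of such a check (the function name and message wording are illustrative) separates the validation from the rest of the program, so that bad input is detected and reported instead of crashing:

```python
def parse_number_of_records(an_answer, a_maximum):
    """
    Converts user input to a positive whole number not larger than
    a_maximum; returns None if the input is unusable.
    """
    try:
        number = int(an_answer)   # raises ValueError for text like "text"
    except ValueError:
        return None
    if 0 < number <= a_maximum:
        return number
    return None


# Instead of crashing, the program can now report the problem and re-ask.
number = parse_number_of_records("text", 666)
if number is None:
    print("ERROR: incorrect number: it should be a positive whole number"
          " not larger than 666. Please try again.")
```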

In general, we need to ask ourselves whether a case may be problematic; if the answer is yes, we should program precautions, otherwise we do not need to care about it.

5.3.2 Sanity checks

If you want to add sanity checks to your program, you can use preconditions, postconditions or invariants.

A precondition is a condition that must always be true just prior to the execution of some section of code. In a function’s precondition you can check all constraints on the function’s arguments (types, values).

For example, before you start a conversion of temperature from Celsius to Kelvin, you may check whether the given temperature is a number and not lower than absolute zero.

A postcondition is a condition that must always be true just after the execution of some section of code. In a function’s postcondition you can check all constraints on the function’s result (e.g. its range).

In the same example, we could check if the temperature in Kelvin is not lower than zero.
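The temperature example can be sketched as follows (the function and constant names are illustrative): the precondition checks the argument, the postcondition checks the result.

```python
ABSOLUTE_ZERO_CELSIUS = -273.15

def celsius2kelvin(a_temperature):
    """
    Converts a temperature from Celsius to Kelvin
    (a hypothetical function illustrating sanity checks).
    """
    # precondition: constraints on the argument (type and value)
    if type(a_temperature) != float and type(a_temperature) != int:
        raise TypeError("The temperature must be a number.")
    if a_temperature < ABSOLUTE_ZERO_CELSIUS:
        raise ValueError("The temperature must not be below absolute zero.")

    result = a_temperature - ABSOLUTE_ZERO_CELSIUS

    # postcondition: a temperature in Kelvin can never be negative
    assert result >= 0, "Internal error: negative Kelvin temperature."
    return result
```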

It is not especially relevant to this course, but for completeness I will also introduce the definition of an invariant. An invariant is a condition that must always be true during the execution of a program, or during some portion of it (e.g. inside a loop).

In the aforementioned examples, I always mentioned having conditions inside functions. This is one approach; the other would be to have the checks outside the functions. Which is better depends on the usage scenario; in general, we should aim to have only one check.

If we have user input in only one place and we know the restrictions on this input, we could check type and content correctness there. If there are several ways to pass data to our function, it is better to check the data (once) inside the function. Checking data inside a function usually has two advantages: the check typically happens only once, and it is close to the specification and usage.

5.3.3 Mechanism

If you want to make sanity checks, you can use either if-else or try-except blocks. Whenever possible, we should use if-else statements, as they are less costly (in computation time).

For example, we can extend add() by checking whether both arguments are integers (note that add(1, True) fails this check, because True is of type bool, not int).

def add(a, b):
    if type(a) == int and type(b) == int:
        return a + b
    return None   # superfluous, added for readability 


print(add(1, True))
## None

If we return None, the error may pass unnoticed if we do not check the type of the returned value after the call of add().

Alternatively, we can raise an exception.

def add(a, b):
    if type(a) == int and type(b) == int:
        return a + b
    raise TypeError("Both arguments must be integers.")
print(add(1, True))
## Traceback (most recent call last):
##     ...
## TypeError: Both arguments must be integers.

Yet another option would be to use assert and obtain an AssertionError if the arguments are not integers.

def add(a, b):
    assert type(a) == int and type(b) == int, "Both arguments must be integers." 
    return a + b
print(add(True, 1))
## Traceback (most recent call last):
##     ...
## AssertionError: Both arguments must be integers.

The second and third solutions are similar but not the same, because the type of the raised error is different. Moreover, assertion errors are not meant to be caught or tested for, and assert statements can be automatically switched off when running your script.
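To illustrate the last point (a sketch, with add() defined as above): assert statements are only active when the built-in constant __debug__ is True; running the script with python -O sets it to False and strips all asserts.

```python
def add(a, b):
    assert type(a) == int and type(b) == int, "Both arguments must be integers."
    return a + b

if __debug__:
    # normal run: the assert is active and the invalid call raises
    try:
        add(True, 1)
    except AssertionError as an_error:
        print("caught:", an_error)
        # prints: caught: Both arguments must be integers.
else:
    # "python -O" run: asserts are stripped, the error passes silently
    print(add(True, 1))
```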

More examples in the exercises set:

  • Examples with silent passing errors, with exceptions/assertions and with different levels of trust (see U06_Tests/Examples/Defensive).