Writing the most reliable driver ever (Part 1)

DISCLAIMER: This article is an entertainment piece, NOT expert advice. See the full disclaimer at the bottom of this article.

I'm an avid diver and engineer, and I love tech. So in my spare time I've been slowly cracking away at building my own dive computer. For those who aren't familiar, a dive computer keeps track of how long you've been at a particular depth. It then does some fancy calculations to determine if the dissolved nitrogen in your blood is going to boil as you resurface. If it's not obvious, having bubbles of nitrogen gas flowing through your arteries is pretty bad. At best these bubbles in your blood will make you very sick, at worst very dead. No diver should rely 100% on a dive computer, and every diver should be capable of safely resurfacing if their dive computer fails. But it is nice to have a computer keeping track of your dive profile so that you don't have to do a bunch of calculations underwater. In fact it is pretty common, and usually encouraged, to carry two independent dive computers, for a few reasons:

  • If one fails you can use the other while you resurface.
  • You can compare the calculations/depth between the two computers. If they disagree you know that one of them has malfunctioned and you need to resurface.
  • You can use dive computers with different algorithms, and follow the more conservative of the two.

Usually I dive with my Suunto D5 computer/watch and then rent a secondary computer with the rest of my scuba gear. The Suunto D5 is a fantastic watch for recreational use, and has some use in more technical mixed-gas diving as well. While the computer that I have is perfectly suitable for diving, I am just naturally curious about what goes into building one. So the dive computer that I am building will become my secondary/backup for when my Suunto fails.

So given that I love diving, and death usually prevents people from diving, I want to be confident that using my dive computer isn't going to kill me. This is why I've chosen the Rust programming language to do all those fancy calculations.

Rust is a fantastic low-level language that prevents a whole host of common programming bugs. As Rust doesn't need a runtime, it's also suitable for writing bare-metal embedded kernels. But let's be clear, this language doesn't eliminate all bugs. So I'm going to walk through the process I used to eliminate as many bugs as I possibly could, using one of the sensors in the dive computer as an example.

One of the integral components of a dive computer is a pressure sensor. In this case I've chosen the MS5837 pressure/temperature sensor from TE Connectivity. This is a tiny little surface-mount component that comes with a small groove to fit an O-ring to seal against water ingress. It's also rated to a depth of 300m, which makes it suitable for even the most adventurous scuba expeditions.

The MS5837 is a relatively simple I2C device with 11 commands and 8 registers. The workflow for using this pressure sensor looks something like this:

```mermaid
flowchart TB
    A[Uninitialised] -->|Reset| B[Initialised]
    B -->|Get calibration data| C[Ready]
    C -->|Get raw temperature/pressure| D[Raw Samples]
    D -->|Calculate temperature| E[Real temperature]
    E -->|Calculate temperature compensated pressure| F[Real pressure and temperature]
    F --> C
```
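This flowchart maps nicely onto Rust's typestate pattern: each state becomes its own type, and each transition is a method that consumes the previous state, so the compiler itself refuses to let you read pressure from an uninitialised sensor. Here's a minimal sketch of the idea, collapsing the intermediate Initialised state for brevity (the names are illustrative, not the final driver API):

```rust
struct CalibrationData; // The PROM calibration coefficients would live here.

struct TemperaturePressure {
    temperature: f32,
    pressure: f32,
}

struct Uninitialised<I2C> {
    i2c: I2C,
}

struct Ready<I2C> {
    i2c: I2C,
    calibration: CalibrationData,
}

impl<I2C> Uninitialised<I2C> {
    /// Reset the sensor and fetch calibration data, moving to the Ready state.
    fn init(self) -> Result<Ready<I2C>, ()> {
        Ok(Ready {
            i2c: self.i2c,
            calibration: CalibrationData,
        })
    }
}

impl<I2C> Ready<I2C> {
    /// Sample, convert, and return real units; Ready loops back to Ready.
    fn read_temperature_and_pressure(&mut self) -> Result<TemperaturePressure, ()> {
        Ok(TemperaturePressure {
            temperature: 0.0,
            pressure: 0.0,
        })
    }
}
```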

So let's go through how I developed this driver, and how I hardened it so that I could be sure it wasn't going to crash.

Developing the driver

The first step that anyone should take when attempting to write code that needs to be reliable is to make sure that your code is tested. I like to do this from the beginning, following a process called Test Driven Development (TDD). This process involves:

  1. Writing a test to ensure that your code is going to function as required.
  2. Writing minimal code to make your code compile.
  3. Updating your code to ensure that the test passes.
  4. Cleaning up your code to ensure that your code is understandable.
  5. Rinse repeat.

This process encourages writing modular reusable code that is tested from the beginning. So let’s go through that process here.

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use embedded_hal_mock::i2c::{Mock as I2cMock, Transaction as I2cTransaction};

    #[test]
    fn reset() {
        let i2c = I2cMock::new(&[I2cTransaction::write_read(0x76, vec![0x1E], vec![])]);
        let mut ms5837 = new(i2c);
        ms5837.reset().unwrap();
        let mut i2c = ms5837.release();
        // Finalise expectations
        i2c.done();
    }
}
```

So with this initial test we are ensuring that the reset command works as expected. The mocked I2C handle asserts that the reset command 0x1E is sent to the pressure sensor's I2C address (0x76). Of course, running this test will fail, as we haven't yet implemented the driver. So let's go ahead and do that.

```rust
use embedded_hal::blocking::i2c::WriteRead;

#[derive(Debug)]
enum Error<E> {
    I2cError(E),
    // We'll add more errors here later...
}

const I2C_ADDRESS: u8 = 0x76;

struct Uninitialised<I2C: WriteRead> {
    i2c: I2C,
}

fn new<I2C: WriteRead>(i2c: I2C) -> Uninitialised<I2C> {
    Uninitialised { i2c }
}

enum Command {
    Reset,
    // We'll be adding the rest of the commands later.
}

impl From<Command> for u8 {
    fn from(val: Command) -> Self {
        use Command::*;
        match val {
            Reset => 0x1E,
        }
    }
}

impl<I2C: WriteRead> Uninitialised<I2C> {
    /// Reset the ms5837 internal state machine.
    fn reset(&mut self) -> Result<(), Error<I2C::Error>> {
        self.i2c
            .write_read(I2C_ADDRESS, &[Command::Reset.into()], &mut [])
            .map_err(Error::I2cError)
    }

    /// Release the underlying i2c handle.
    fn release(self) -> I2C {
        self.i2c
    }
}
```

Running `cargo test` should now complete successfully. Sweet, so now our driver is capable of resetting the pressure sensor.

Testing and coverage

So first things first, we need to evaluate how well our unit tests did at exploring all the paths through our code. For this I like to use a tool called tarpaulin. You can install this tool using the command:

```sh
cargo install cargo-tarpaulin
```

Tarpaulin is a pretty simple tool that is quite similar to the builtin `cargo test`. The major difference is that running `cargo tarpaulin -v` will build your tests, instrumenting the code using LLVM's coverage sanitizer. It then runs the tests and interprets the coverage data, giving you a nice summary of how much of your code is run during your tests, as well as a list of every line that wasn't run during testing. The output might look something like this:

NOTE: This is the output from testing the driver as of 1/07/2022.

```text
Jul 01 17:13:32.755  INFO cargo_tarpaulin::report: Coverage Results:
|| Uncovered Lines:
|| src/lib.rs: 292, 302, 306, 321-323, 349, 380-382, 385-387, 390-392, 416-417, 428, 430, 432, 435, 440, 452, 454, 456, 459, 464, 469-472, 474-477, 482-484, 486-487, 489-491, 493-506, 509-510, 547-548, 550, 583-584
|| Tested/Total Lines:
|| src/crc4.c: 15/15 +0.00%
|| src/lib.rs: 118/183 +0.00%
||
67.17% coverage, 133/198 lines covered, +0% change in coverage
```

So we are sitting at about 67% code coverage for our driver at the moment. So what should we be aiming for here?

This depends on what you are going to use your code for. In a lot of cases aiming for 100% code coverage ends up being an anti-pattern, as your unit tests for setters/getters end up tightly coupled to your code, making things obscenely difficult to change down the line. Plus, your tests for setters/getters aren't doing anything useful. It's important to remember that a code-coverage percentage is a fundamentally reductive metric; coverage is better judged line by line, in the context of the specific code section. In my opinion around 80-90% code coverage ends up being the sweet spot for most use cases.

However, this is not most use cases: we are trying to prevent my arteries from turning into a bubble bath, so we are writing the most reliable driver we possibly can. We are going to just deal with the tight coupling that comes with 100% code coverage, and fix up that coverage.

At this point you might be thinking: what code hasn't been tested? The answer is usually something to do with error handling. If you don't mock out an error, how could you possibly know how your code is going to behave when there is an error? In fact there is some code in our original example that wasn't captured in our unit test, so let's go ahead and fix that.

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use embedded_hal_mock::{
        i2c::{Mock as I2cMock, Transaction as I2cTransaction},
        MockError,
    };
    use std::io::ErrorKind;

    #[test]
    fn reset() {
        //...
    }

    #[test]
    fn reset_err() {
        let i2c = I2cMock::new(
            &[I2cTransaction::write_read(I2C_ADDRESS, vec![0x1E], vec![])
                // NEW: Tell the mocked implementation to return an error.
                .with_error(MockError::Io(ErrorKind::Other))],
        );
        let mut ms5837 = new(i2c);
        // Expect an i2c error to propagate back.
        let _ = ms5837.reset().unwrap_err();
        let mut i2c = ms5837.release();
        // Finalise expectations
        i2c.done();
    }
}
```

At this point running tarpaulin will result in 100% code coverage. Now we just have to write the rest of the driver, repeating this process.

Quick tip: Take a look at cargo-watch, it's a massive time saver. To install it, run `cargo install cargo-watch`. This tool keeps track of when the files in your project change and then reruns the specified commands. I typically use a command like this:

```sh
cargo watch --clear -x check -x build -x 'tarpaulin -v'
```

I prefer ordering the chained stages by the approximate amount of time each step takes, e.g. `check` is a lot faster than `build` or coverage analysis.

This tool gives me immediate feedback as to the general quality/soundness of my code.

Something unexpected!!

So while I'm not going to go into the nitty-gritty implementation details of how I'm developing this driver, I am going to share some of the more surprising problems that I ran into when developing in Rust.

Numerical calculations can panic 😲

So about a week ago I was trucking away at writing some code that does the calculations to convert the raw temperature and pressure into absolute units. This has to be done for every measurement, as the raw pressure needs to be compensated for the temperature. To my surprise, while testing this functionality my test crashed entirely. Not a regular test failure, a full-blown panic with a backtrace. The code looks something like this:

```rust
fn normalise_raw_measurement(&self, temperature: u32, pressure: u32) -> TemperaturePressure {
    let d_temperature = (temperature as i32)
        - (self.calibration_data.reference_temperature as i32) * 2i32.pow(8);

    let temperature = 2000
        + d_temperature * (self.calibration_data.temperature_coefficient_of_temperature as i32)
            / 2i32.pow(23);

    let temperature_offset = (self.calibration_data.pressure_offset as i32) * 2i32.pow(16)
        + (self
            .calibration_data
            .temperature_coefficient_of_pressure_offset as i32)
            * d_temperature
            / 2i32.pow(7);

    let temperature_sensitivity = (self.calibration_data.pressure_sensitivity as i32)
        * 2i32.pow(15)
        + (self
            .calibration_data
            .temperature_coefficient_of_pressure_sensitivity as i32)
            * d_temperature
            / 2i32.pow(8);

    let pressure = (pressure as i32) * temperature_sensitivity - temperature_offset;

    TemperaturePressure {
        pressure: (pressure as f32) / 10.0,
        temperature: (temperature as f32) / 10.0,
    }
}
```

NOTE: I am partway through refactoring the driver to use only integer math, as the microcontroller I'm using doesn't have an FPU and soft floating-point ops are needlessly expensive. That's why I'm mixing float/integer math.

Do you know why this code would panic? I certainly didn't. It took me some Googling to find out that when Rust is compiled in 'Debug' mode, integer calculations are checked for overflow and divide-by-zero. This is great, as it meant that I caught a potential bug before I was 20-30m below 🤿.
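To illustrate with a tiny standalone example (not from the driver): the function below panics with 'attempt to add with overflow' in a debug build, but silently wraps around in a default release build.

```rust
fn increment(x: i32) -> i32 {
    // Debug build: panics with 'attempt to add with overflow'.
    // Default release build: silently wraps around to i32::MIN.
    x + 1
}

fn main() {
    println!("{}", increment(i32::MAX));
}
```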

To be very clear, a scenario where an integer overflows and shows that I am at a depth of 0m or -67000m when I'm actually at 25m is pretty unlikely. But in the implementation above it was possible, and the consequences could be pretty bad. In other words, your dive computer could continue working as if everything was fine. But everything won't be fine, as the computer will be miscalculating your decompression time, potentially leading to bubbly blood.

Fuzz testing

Inspired to find and fix all the instances where integer math bugs could occur, I started looking into cargo-fuzz. For those not familiar with fuzz testing, here is a short excerpt from Wikipedia:

> In programming and software development, fuzzing or fuzz testing is an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program.

cargo-fuzz uses libFuzzer under the hood and is a coverage-guided fuzzer. This differs slightly from a regular fuzzer: a coverage-guided fuzzer will generate random data and use realtime coverage information to keep/mutate previous inputs to maximize code coverage. This makes finding bugs significantly faster, as the fuzzer will "learn" how to produce inputs that execute new branches in the program flow.

It's important to realize that coverage-guided fuzzing usually won't exhaustively test every possible code path with every possible input. It is a heuristic-guided approach to testing; writing a fuzz test won't prove that your code can never crash.

If you aren’t familiar with cargo-fuzz, I recommend reading the book. At least a cursory understanding of cargo-fuzz will help you understand the next section.

So let's go ahead and write a fuzz test for the conversion code. To start off we need to initialise our repository for fuzz testing:

```sh
cargo fuzz init
```

This will create a "fuzz" directory alongside your library that allows you to test your public interface. Now, if you look carefully at the code above, you'll see a problem with this already: at the time of writing you can't easily fuzz private APIs, so we can't directly fuzz the conversion code without making it public.

So how can we get the fuzzing engine to test the conversion code? Well, we point the fuzzing engine at the public API, and then check that we are getting test coverage through that section, e.g. using `cargo fuzz coverage <target>`.

So let’s go ahead and do that. In our case, the only pathway to get fuzzed inputs into the conversion function is via I2C. So let’s create a fake/fuzzed I2C driver that simply passes through the fuzzed input.

Most Rust hardware drivers (including this one) make use of the traits defined in the embedded-hal crate. The I2C trait used in this driver looks something like this:

```rust
pub trait WriteRead<A: AddressMode = SevenBitAddress> {
    type Error;
    fn write_read(
        &mut self,
        address: A,
        bytes: &[u8],
        buffer: &mut [u8],
    ) -> Result<(), Self::Error>;
}
```

So let's go ahead and implement this trait. First, let's create a type containing the fuzzed data:

```rust
struct I2cFuzz<'a> {
    fuzzy_data: std::slice::Iter<'a, u8>,
}

impl<'a> I2cFuzz<'a> {
    fn new(fuzzy_data: &'a [u8]) -> Self {
        I2cFuzz { fuzzy_data: fuzzy_data.iter() }
    }
}
```

The next step is to implement the WriteRead I2C trait for this type:

```rust
use embedded_hal::blocking::i2c::{AddressMode, WriteRead};

impl<'a, A: AddressMode> WriteRead<A> for I2cFuzz<'a> {
    // We only care if there is an error or not, so we will use the unit type
    // for this implementation.
    type Error = ();
    fn write_read(
        &mut self,
        // We ignore the address as we only care about inputs to the driver.
        _address: A,
        // We ignore the write buffer as we only care about inputs into the driver.
        _ignore_write_buffer: &[u8],
        read_buffer: &mut [u8],
    ) -> Result<(), Self::Error> {
        // Return an error ~50% of the time, based on the fuzzed data.
        if *self.fuzzy_data.next().ok_or(())? > u8::MAX / 2 {
            return Err(());
        }
        // Fuzzed data is not infinite in length, copy the fuzzed data into
        // the read buffer while there is still fuzzed data, otherwise return
        // an error.
        for element in read_buffer.iter_mut() {
            *element = *self.fuzzy_data.next().ok_or(())?;
        }
        Ok(())
    }
}
```

Great, so now we have our fuzzed implementation of the I2C driver. Now we just need to write the fuzz test, so let's add a new one:

```sh
cargo fuzz add read_temperature_and_pressure
```

This will add a new fuzzing target, fuzz/fuzz_targets/read_temperature_and_pressure.rs. Out of the box this file will look something like this:

```rust
#![no_main]

use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    // ...
});
```

So let's fill in this test template with some code to exercise our driver:

```rust
fuzz_target!(|data: &[u8]| {
    let i2c = I2cFuzz::new(data);
    let pressure_sensor = ms5837::new(i2c);
    if let Ok(mut pressure_sensor) = pressure_sensor.init() {
        // We ignore the result as it is likely garbage constructed from the
        // fuzzed data. We don't care about the result/error, just whether it
        // crashes or not.
        let _ = pressure_sensor.read_temperature_and_pressure();
    }
});
```

Finally, we can run the fuzz test:

```sh
cargo fuzz run read_temperature_and_pressure
```

After a short period of time we get a crash due to an integer overflow, reproducing our original crash.

Now, examining the inputs that result in the crash, I can safely say that the pressure sensor should never output those values. However, there shouldn't be a case where my driver can randomly crash based on some arbitrary math logic. Instead, a reliable driver would simply return an error, e.g. something along the lines of 'InvalidTemperature'.
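As a sketch, the Error enum from earlier in the article could grow a variant for exactly this case (the variant name is just the suggestion above, not final):

```rust
#[derive(Debug)]
enum Error<E> {
    I2cError(E),
    /// Returned when the raw samples would overflow the conversion maths.
    /// A real, working sensor should never produce such values.
    InvalidTemperature,
}
```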

So how do we fix this problem, and how can we then be 100% sure that the program can never crash? It turns out the simplest way to do this is to remove the panicking functions. Doing so will result in a build error if there is any way that app code could crash.

No panics

So, one Stack Overflow question later, things are a lot clearer: the integer operations, e.g. Add, Mul, Div etc., are actually implemented as traits. The default behaviour for the numerical types differs between 'Debug' and 'Release' builds, with overflow checks only being performed in 'Debug' builds.

It turns out that there is a way to do 'checked' integer operations that return an Option rather than panicking when there isn't a valid result.
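For example, checked_mul and checked_add return None instead of panicking on overflow, so the conversion maths can be rewritten to surface an error instead. A minimal sketch (the function and error names here are placeholders, not the driver's real API):

```rust
#[derive(Debug)]
enum ConversionError {
    Overflow,
}

/// Multiply a raw sample by a calibration coefficient and divide by 2^8
/// without ever panicking: overflow becomes a recoverable error.
fn scale(raw: i32, coefficient: i32) -> Result<i32, ConversionError> {
    raw.checked_mul(coefficient)
        // Dividing by a positive constant can't overflow or divide by zero.
        .map(|value| value / (1 << 8))
        .ok_or(ConversionError::Overflow)
}
```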

There is even a no_panic crate which will produce a linker error if there is any possible way that your code could panic.
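Based on the crate's documentation, usage looks something like this: annotate a function with #[no_panic], and if the optimiser can't prove the function never panics, linking fails. Note that the crate relies on the optimiser removing dead panic paths, so it's intended for optimised builds; `saturating_depth` here is a made-up example function.

```rust
use no_panic::no_panic;

#[no_panic]
fn saturating_depth(raw: u32, offset: u32) -> u32 {
    // saturating_add has no panicking path, so the optimiser can prove
    // this function never panics and the program links successfully.
    raw.saturating_add(offset)
}
```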

What’s next?

Tune in to part 2 of this series, where we are going to:

  • Reach 100% code coverage library wide.
  • Refactor our library to have absolutely zero code that could ever panic.

Disclaimer

NOTE: This disclaimer is a modified version of the MIT software licence.

THE ‘INFORMATION IN THIS ARTICLE’ IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE ‘INFORMATION IN THIS ARTICLE’ OR THE USE OR OTHER DEALINGS IN THE ‘INFORMATION IN THIS ARTICLE’.
