Intro

This is a follow-up to my previous post on JSON parsing. Back then, I cleaned the data quite a bit, dropping records with missing or heterogeneous fields, to have an easier example to test on. But real-life data, whether you gather it yourself or use it in your experiments, can be messy, and that calls for more flexibility from strongly typed parsing. So this time I wanted to take the previous example as it is and make the system work on the data in the shape it was collected.

Dataset

My previous post probably already has enough information on the structure of the dataset, so I’m not going to repeat most of it here.

One thing worth pointing out is that geometry can have MultiPolygon as a type, which adds one more dimension to the coordinates, as can be seen below.

{
    ...
    "geometry": {
        "type": "Polygon",
        "coordinates": [
            [
                [
                    -122.422003528252475,
                    37.808480096967251,
                    0.0
                ],
                ...
            ]
        ]
    }
},
{
    "geometry": {
        "type": "MultiPolygon",
        "coordinates": [
            [
                [
                    [
                        -122.469813344635554,
                        37.786749747602492,
                        0.0
                    ],
                    ...
                ]
            ]
        ]
    }
}

In addition to that, some property fields are either missing or simply null.

Parsing with Possible Missing Fields

Serde has a good website where the documentation and the examples lay out the overview of the crate very nicely. For instance, if we look here, we can directly see that #[serde(default)] can be used on fields that may be missing, which makes parsing them less of a pain by falling back to a default value. In case the field is still present but its value is just null, we can use Rust’s own Option<T> type: null becomes None, while non-null values are parsed directly from the T type into Some(T).
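
As a minimal sketch, a hypothetical properties struct using both mechanisms could look like the following (the field names here are only illustrative, not the actual dataset schema):

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Properties {
    // If the field is missing entirely, fall back to String::default() ("").
    #[serde(default)]
    block_num: String,
    // If the field is present but its value is null, this becomes None.
    street: Option<String>,
}

With this in place, a record without block_num or with "street": null no longer aborts the whole parse.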

Parsing Heterogeneous Types

This can be done with Rust enum types, which can hold a value of one of several possible types. For instance, Option<T> and Result<T, E> in Rust’s standard library are also enums, which makes nullable fields and error handling easy to express without any extra mechanism.

For parsing the heterogeneous data, there are multiple options to choose from. Since the structure of the data differs only for one specific field (coordinates) of the geometry data, we can:

  1. Parse the coordinates as an untagged enum of two different types,

Or, since the type information supplied in geometry dictates what to expect in the coordinates, we can:

  2. Parse the geometry as an adjacently tagged enum.

Examples of both representations can be found in serde’s documentation.

Untagged Enum Parsing Method

Untagged enums are useful in cases where the data comes in different shapes without any sort of accompanying information to identify which variant is coming.

Our geometry struct and the untagged enum of coordinates would look like this:

#[derive(Serialize, Deserialize)]
struct Geometry {
    #[serde(rename = "type")]
    type_: String,
    coordinates: Coords,
}

#[derive(Serialize, Deserialize)]
#[serde(untagged)]
enum Coords {
    Polygon(Vec<Vec<[f64; 3]>>),
    MultiPolygon(Vec<Vec<Vec<[f64; 3]>>>),
}

This tells serde that two different shapes can be expected in the coordinates field, with no tag in the data to distinguish them; each variant is tried in order until one matches.
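
To see the untagged behaviour in isolation, here is a small standalone sketch (assuming the Coords definition above; the function name is just for illustration) that feeds both shapes into the same type:

// Not part of the benchmark; a quick check of the untagged enum.
fn coords_demo() {
    // A Polygon-shaped value: Vec<Vec<[f64; 3]>>.
    let polygon = "[[[-122.42, 37.80, 0.0]]]";
    // A MultiPolygon-shaped value: one level deeper.
    let multi = "[[[[-122.46, 37.78, 0.0]]]]";

    // Serde tries the variants in order until one fits the shape.
    let a: Coords = serde_json::from_str(polygon).unwrap();
    let b: Coords = serde_json::from_str(multi).unwrap();

    assert!(matches!(a, Coords::Polygon(_)));
    assert!(matches!(b, Coords::MultiPolygon(_)));
}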

Let’s incorporate the same benchmarking setup as the previous post to calculate the average of all coordinates. We can achieve this by applying a match clause on the Coords enum:

// [dependencies]
// serde_json = "1.0.53"
// serde = { version = "1.0.111", features = ["derive"] }
fn main() {
    let data = read_file("citylots.json");
    let v: Features = serde_json::from_str(data.as_str()).unwrap();
    let mut total = 0;
    let mut sum_x = 0.0;
    let mut sum_y = 0.0;
    let mut sum_z = 0.0;
    for i in v.features {
        if let Some(geo) = i.geometry {
            match geo.coordinates {
                Coords::Polygon(ref x) => {
                    for inner in x {
                        for coords in inner {
                            sum_x += coords[0];
                            sum_y += coords[1];
                            sum_z += coords[2];
                            total += 1;
                        }
                    }
                }
                Coords::MultiPolygon(ref x) => {
                    for poly in x {
                        for inner in poly {
                            for coords in inner {
                                sum_x += coords[0];
                                sum_y += coords[1];
                                sum_z += coords[2];
                                total += 1;
                            }
                        }
                    }
                }
            }
        }
    }
    println!("Number of rows: {}", total);
    sum_x = sum_x / (total as f64);
    sum_y = sum_y / (total as f64);
    sum_z = sum_z / (total as f64);
    println!("Averages: {} {} {}", sum_x, sum_y, sum_z);
}

The result is below:

Number of rows: 2625063
Averages: -122.42988623936554 37.75451006828541 0

Notice that the number of rows we are now able to parse has increased by about 600K compared to the cleaned version of the data.

The runtime stats:

User time (seconds): 1.62
System time (seconds): 0.29
Percent of CPU this job got: 75%
Maximum resident set size (kbytes): 399096

Adjacently Tagged Enum Parsing Method

Using untagged enums might not be the best way to handle the current shape of this data, though, since we are not benefiting from the type information that is already provided. We can turn the geometry structure into an enum itself to achieve this:

#[derive(Serialize, Deserialize)]
#[serde(tag = "type", content = "coordinates")]
enum Geometry {
    Polygon(Vec<Vec<[f64; 3]>>),
    MultiPolygon(Vec<Vec<Vec<[f64; 3]>>>)
}

By making geometry itself an enum type, serde reads the type field as the tag and parses the inner structure of coordinates as the matching variant’s content. By doing so, we benefit from the information already available in the type field, which helps the parsing be a tiny bit more efficient.
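
As a quick sanity check (again a hypothetical standalone snippet, assuming the enum above), a single geometry object now deserializes straight into the right variant based on its type field:

// Not part of the benchmark; "type" selects the variant and
// "coordinates" supplies its content, with no trial and error.
fn geometry_demo() {
    let json = r#"{
        "type": "Polygon",
        "coordinates": [[[-122.42, 37.80, 0.0]]]
    }"#;

    let geo: Geometry = serde_json::from_str(json).unwrap();
    assert!(matches!(geo, Geometry::Polygon(_)));
}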

The match clause captures geometry instead of the coordinates this time:

// [dependencies]
// serde_json = "1.0.53"
// serde = { version = "1.0.111", features = ["derive"] }
fn main() {
    let data = read_file("citylots.json");
    let v: Features = serde_json::from_str(data.as_str()).unwrap();
    let mut total = 0;
    let mut sum_x = 0.0;
    let mut sum_y = 0.0;
    let mut sum_z = 0.0;
    for i in v.features {
        if let Some(geo) = i.geometry {
            match *geo {
                Geometry::Polygon(ref x) => {
                    for inner in x {
                        for coords in inner {
                            sum_x += coords[0];
                            sum_y += coords[1];
                            sum_z += coords[2];
                            total += 1;
                        }
                    }
                }
                Geometry::MultiPolygon(ref x) => {
                    for poly in x {
                        for inner in poly {
                            for coords in inner {
                                sum_x += coords[0];
                                sum_y += coords[1];
                                sum_z += coords[2];
                                total += 1;
                            }
                        }
                    }
                }
            }
        }
    }
    println!("Number of rows: {}", total);
    sum_x = sum_x / (total as f64);
    sum_y = sum_y / (total as f64);
    sum_z = sum_z / (total as f64);
    println!("Averages: {} {} {}", sum_x, sum_y, sum_z);
}

The results confirm the performance improvement, even though it costs a bit more memory:

User time (seconds): 0.81
System time (seconds): 0.19
Percent of CPU this job got: 64%
Maximum resident set size (kbytes): 408524

Final words

Parsing uneven data in a strongly typed manner is possible in Rust thanks to enum types and serde’s amazing flexibility in annotating inner structures to derive specific behaviour. A heterogeneous structure can be recognised with untagged enums when no identifying information is available, or when it is simply easier to handle that way. However, using the information the data already provides is always worthwhile, and in this case too, incorporating the type information through adjacent tagging results in a noticeable performance gain.