Intro

I store my research experiments in a JSON file, and it has grown large enough to become clunky to work with. I am considering moving everything to a simple database solution like SQLite. However, that hasn't stopped me from investigating more efficient ways of parsing JSON files. I currently use Python scripts to collect the data and update the JSON structure, so I'm curious to see whether a Rust solution can make a significant difference here.

Dataset

Since my research data is complex in structure and also a bit messy, I decided to run a small demonstration on a dataset of similar size. I found an old San Francisco City Lots dataset on GitHub which is around 200 MB. It consists of 2.5 million rows of property features (I guess), and it seemed worth trying out.

Each entry in the dataset has fields such as type, properties and geometry. The geometry field has its own type field, which is either Polygon or MultiPolygon. Depending on the geometry type, geometry has a coordinates list containing either a single polygon or multiple polygons.

Some of the data is incomplete: individual property fields or the whole geometry can be missing from a feature.

To simplify this parsing experiment, I cleaned the data by dropping entries with missing fields or with MultiPolygon geometry, so that every entry has exactly the same shape (a sketch of this cleanup step follows the structure below). This reduced the total JSON file size to 150 MB.

So, the general structure of the JSON looks like this:

{
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {
                "MAPBLKLOT": "0001001",
                ...
            },
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [
                        [
                            -122.422003528252475,
                            37.808480096967251,
                            0.0
                        ],
                        ...
                    ]
                ]
            }
        },
        ...
    ]
}
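
For reference, the cleanup pass itself can be sketched roughly as follows, jumping ahead to the serde_json crate used later in this post. This is a minimal sketch rather than the exact script I used, and the original file name citylots.json is an assumption; it keeps only the features with a Polygon geometry and no null property values, then writes the result back out.

// [dependencies]
// serde_json = "1.0.53"
use serde_json::Value;
fn main() {
    // Read the original, uncleaned dataset (file name assumed).
    let data = std::fs::read_to_string("citylots.json").expect("could not read file");
    let mut v: Value = serde_json::from_str(&data).unwrap();
    // Keep only features with a Polygon geometry and no null property values.
    let cleaned: Vec<Value> = v["features"]
        .as_array()
        .unwrap()
        .iter()
        .filter(|f| {
            f["geometry"]["type"] == "Polygon"
                && f["properties"]
                    .as_object()
                    .map_or(false, |p| !p.values().any(Value::is_null))
        })
        .cloned()
        .collect();
    v["features"] = Value::Array(cleaned);
    std::fs::write("citylots_cleaned.json", v.to_string()).expect("could not write file");
}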

To give the parsing some purpose, I created a dummy task: calculate the mean of each of the three coordinate components over all points.

Parsing in Python

The Python code is pretty straightforward and self-explanatory:

import json
with open("citylots_cleaned.json", "r") as f:
    v = json.load(f)
sum_x = 0
sum_y = 0
sum_z = 0
total = 0
for i in v["features"]:
    for inner in i["geometry"]["coordinates"]:
        for coords in inner:
            sum_x += float(coords[0])
            sum_y += float(coords[1])
            sum_z += float(coords[2])
            total += 1
print("Number of rows: {} ".format(total))
print("Averages: {} {} {}".format(sum_x/total, sum_y/total, sum_z/total))

While iterating through the top-level features field, we access each geometry's coordinates and accumulate the three coordinate components into three hard-coded variables. Afterwards, we calculate the averages and print the results.

Number of rows: 2073944
Averages: -122.43660122610137 37.75514278478826 0.0

This script completed the task on my machine (Intel i7-6700HQ (8) @ 2.591GHz) in about 8 seconds, using a maximum of ~900 MB of RAM.

User time (seconds): 7.07
System time (seconds): 0.67
Percent of CPU this job got: 94%
Maximum resident set size (kbytes): 895940

Parsing in Rust

Before parsing the JSON structure, reading the data is not quite as easy as a Python with statement, so I've written this read block which returns the contents of the file as a String.

use std::fs::File;
use std::io::BufReader;
use std::io::Read;
pub fn read_file(filepath: &str) -> String {
    let file = File::open(filepath).expect("could not open file");
    let mut buffered_reader = BufReader::new(file);
    let mut contents = String::new();
    buffered_reader
        .read_to_string(&mut contents)
        .expect("could not read file into the string");
    contents
}
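
(As an aside, the standard library can do the same thing in a single call; the explicit BufReader version above just spells out the steps.)

// Equivalent shortcut using std::fs:
let contents = std::fs::read_to_string("citylots_cleaned.json").expect("could not read file");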

Using serde_json, we have 2 options:

  1. Parsing as untyped data. Parse the data into an untyped JSON Value, then verify its integrity at runtime.
  2. Parsing as strongly typed data. Define the structure to parse into at compile time and try to parse the data at runtime; if the data doesn't match the defined structure, abort.

Parsing as Untyped JSON

// [dependencies]
// serde_json = "1.0.53"
use serde_json::Value;
fn main() {
    let data = read_file("citylots_cleaned.json");
    let v: Value = serde_json::from_str(data.as_str()).unwrap();
    let mut total = 0;
    let mut sum_x = 0.0;
    let mut sum_y = 0.0;
    let mut sum_z = 0.0;
    for i in v["features"].as_array().unwrap() {
        for j in i["geometry"]["coordinates"].as_array().unwrap() {
            for k in j.as_array().unwrap() {
                sum_x += k[0].as_f64().unwrap();
                sum_y += k[1].as_f64().unwrap();
                sum_z += k[2].as_f64().unwrap();
                total += 1;
            }
        }
    }
    println!("Number of rows: {}", total);
    sum_x = sum_x / (total as f64);
    sum_y = sum_y / (total as f64);
    sum_z = sum_z / (total as f64);
    println!("Averages: {} {} {}", sum_x, sum_y, sum_z);
}

After reading the data, we parse it into a serde_json::Value. From here we can iterate through its fields. However, we need runtime checks to access the fields as arrays or as floats, using the appropriate methods as_array and as_f64 and unwrapping the resulting Options.

Compiling this code in release mode with the mighty cargo took 16 seconds (Rust 1.44).

User time (seconds): 1.68
System time (seconds): 0.54
Percent of CPU this job got: 82%
Maximum resident set size (kbytes): 1070520

The processing time is reduced to about 2 seconds, but the memory usage is even higher than the Python equivalent. This clearly shows that operating on untyped JSON in serde_json brings its costs as well as its benefits.

Parsing as Strongly Typed JSON

We can define the structs we are going to read into and use serde derive macros to make everything smoother.

use serde::{Deserialize, Serialize};
#[derive(Serialize, Deserialize)]
struct Features {
    #[serde(rename = "type")]
    type_name : String,
    features: Vec<Lot>
}
#[derive(Serialize, Deserialize)]
struct Lot {
    #[serde(rename = "type")]
    type_name: String,
    properties: Box<Property>,
    geometry: Box<Geometry>
}
#[allow(non_snake_case)]
#[derive(Serialize, Deserialize)]
struct Property {
    MAPBLKLOT: String,
    ...
}
#[derive(Serialize, Deserialize)]
struct Geometry {
    #[serde(rename = "type")]
    type_name: String,
    coordinates: Vec<Vec<Vec<f64>>>
}

Note that some Rust reserved keywords (such as type) cannot be used as struct field identifiers, so we have to choose something else and tell serde to rename them.
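
If I'm not mistaken, newer serde versions also strip the r# prefix from Rust 2018 raw identifiers, so keeping the original field name like this should work as an alternative to the rename attribute (just a sketch, untested here):

#[derive(Serialize, Deserialize)]
struct Geometry {
    // raw identifier: serialized/deserialized as "type"
    r#type: String,
    coordinates: Vec<Vec<Vec<f64>>>
}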

// [dependencies]
// serde_json = "1.0.53"
// serde = { version = "1.0.111", features = ["derive"] }
fn main() {
    let data = read_file("citylots_cleaned.json");
    let v: Features = serde_json::from_str(data.as_str()).unwrap();
    let mut total = 0;
    let mut sum_x = 0.0;
    let mut sum_y = 0.0;
    let mut sum_z = 0.0;
    for i in &v.features {
        for inner in &i.geometry.coordinates {
            for coords in inner {
                sum_x += coords[0];
                sum_y += coords[1];
                sum_z += coords[2];
                total += 1;
            }
        }
    }
    println!("Number of rows:{}", total);
    sum_x = sum_x / (total as f64);
    sum_y = sum_y / (total as f64);
    sum_z = sum_z / (total as f64);
    println!("Averages: {} {} {}", sum_x, sum_y, sum_z);
}

By deserializing directly into Features, we fix the shape of the data at compile time, avoid building a generic Value tree, and eliminate the runtime integrity checks.

Compiling the code with serde derive macros took 50 seconds.

User time (seconds): 0.92
System time (seconds): 0.21
Percent of CPU this job got: 70%
Maximum resident set size (kbytes): 460820

The result shows that the process got a bit faster and the memory usage dropped considerably, to ~450 MB.

Actually, I was being quite generous by specifying the innermost coordinates as Vec<f64>; frankly, we can reduce that to a fixed-size array of three elements (i.e. [f64; 3]):

#[derive(Serialize, Deserialize)]
struct Geometry {
    #[serde(rename = "type")]
    type_name: String,
    coordinates: Vec<Vec<[f64; 3]>>
}

This improves both the runtime and the memory efficiency further.

User time (seconds): 0.65
System time (seconds): 0.21
Percent of CPU this job got: 64%
Maximum resident set size (kbytes): 362052

Final Words

Parsing and processing large JSON files in Rust is quite rewarding in terms of runtime. Additionally, parsing into strongly typed data makes the whole process very memory efficient.

For this example project, I simplified the heterogeneous data by eliminating the MultiPolygon geometries and removing any null fields. Dealing with null fields is quite simple using an Option. The heterogeneous parsing might be a bit more challenging in Rust, but I'm sure it should be doable with an enum without a problem (see the sketch below).
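
As a rough idea, here is a minimal sketch of what a heterogeneous Geometry could look like with an adjacently tagged serde enum, where the type field selects the variant and coordinates carries its payload. The field names match the dataset, but the enum itself is my guess and I haven't run it against the uncleaned file:

use serde::{Deserialize, Serialize};
// "type" picks the variant, "coordinates" holds the matching payload.
#[derive(Serialize, Deserialize)]
#[serde(tag = "type", content = "coordinates")]
enum Geometry {
    Polygon(Vec<Vec<[f64; 3]>>),
    MultiPolygon(Vec<Vec<Vec<[f64; 3]>>>),
}
// Missing or null property fields map naturally onto Option:
#[allow(non_snake_case)]
#[derive(Serialize, Deserialize)]
struct Property {
    MAPBLKLOT: Option<String>,
    // ...
}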