Intro

During my PhD studies, I've been playing with itemsets a lot. Itemsets are, as the name suggests, sets consisting of items, and if we represent each item as an integer (enums are also a possibility, but integers are the lowest-level representation possible, so let's stick with them), itemsets become subsets of the whole set of items. In the literature, people take a dataset consisting of a sequence of itemsets (transactions) and apply pattern mining to identify other itemsets that represent connections in the data. This might be the fastest possible introduction to the field, but if you are familiar with basic set concepts, it shouldn't be difficult to get the initial idea.

In the literature, the datasets people work on can vary in the number of items. However, I haven't seen any case where the number of items exceeds 1000. The data I'm usually fiddling with often has fewer than 300 items. This opens up an interesting possible representation for itemsets.

When we call itemsets by that name, semantically we mean Set<Integer>, but they are usually represented as an ordered Vec<Integer>. Since I just mentioned that the distinct item count can be as low as the u8 (8-bit unsigned integer, 0-255) level, we have another option: BitVec.

Representation in BitVec

Some specific details change from implementation to implementation, but the core idea of a BitVec is the same: we can pack information efficiently, bit by bit.

Imagine the set of integers:

{ 0, 1, 3, 4 }

To store this itemset as Vec<Integer> we need at least 4 bytes, since the cheapest we can go for is a small Vec<u8>. However, if we know the maximum possible value the set's integers can take (the number of items), we can pack this even more efficiently. If that number is 8, we can represent the set as:

{ 1, 1, 0, 1, 1, 0, 0, 0 }

which is only 1 byte, since each item is represented as a single bit. This can also be called an occurrence representation, whereas the previous Vec<u8> representation was explicit, directly listing the included numbers.

Each representation has its pros and cons in terms of operation cost and memory efficiency. Even from this small example, you can see that BitVec can be more memory efficient.

Additionally, you may also realise that a plain Vec representation leads to dynamic memory usage, since the number of items can vary from vec to vec, whereas the memory footprint of a BitVec is always the same. This can be a good or a bad thing depending on how long your itemsets are.
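To make the two representations concrete, here is a minimal std-only sketch (hypothetical pack/unpack helpers, not the bit-vec crate) that packs an explicit itemset into a fixed 256-bit occurrence array and back:

```rust
/// Pack an explicit itemset into a fixed 256-bit occurrence array.
fn pack(items: &[u8]) -> [u32; 8] {
    let mut storage = [0u32; 8];
    for &i in items {
        // Bit i lives in word i / 32 at position i % 32.
        storage[(i / 32) as usize] |= 1 << (i % 32);
    }
    storage
}

/// Unpack the occurrence array back into a sorted explicit itemset.
fn unpack(storage: &[u32; 8]) -> Vec<u8> {
    (0u16..256)
        .filter(|&i| storage[(i / 32) as usize] & (1u32 << (i % 32)) != 0)
        .map(|i| i as u8)
        .collect()
}

fn main() {
    let items = vec![0u8, 1, 3, 4];
    let packed = pack(&items);
    assert_eq!(packed[0], 0b11011); // bits 0, 1, 3 and 4 set
    assert_eq!(unpack(&packed), items);
}
```

The packed form is always 32 bytes regardless of cardinality, while the explicit form grows with the number of items.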

Testing Vec<u8> vs BitVec on Memory and Disk Usage

To test these two data structures, I'll generate a hefty number of itemsets. With Rust 1.44, I'll be using the standard Vec and the bit-vec crate.

I'll limit myself to 256 items and use a dynamic number of items per set, drawn from the range [MIN_SIZE..MAX_SIZE]. For this sample test, I'm going to use the rand crate without any custom distributions to generate random values between 0 and 255.

Vec version:

// [dependencies]
// rand = "0.7.3"
use rand::Rng;
fn generate_itemset() -> Vec<u8> {
    let mut rng = rand::thread_rng();
    let card: u8 = rng.gen_range(MIN_SIZE, MAX_SIZE);
    let mut v = Vec::with_capacity(card.into());
    for _ in 0..card {
        'find_i: loop {
            let i: u8 = rng.gen(); // full 0..=255 range; gen_range(0, 255) would never produce 255
            if !v.contains(&i) {
                v.push(i);
                break 'find_i;
            }
        }
    }
    v
}

BitVec version:

// [dependencies]
// rand = "0.7.3"
// bit-vec = { version = "0.6.2", features = ["serde"]}
use rand::Rng;
use bit_vec::BitVec;
fn generate_itemset_bv() -> BitVec {
    let mut rng = rand::thread_rng();
    let card: u8 = rng.gen_range(MIN_SIZE, MAX_SIZE);
    let mut bv = BitVec::from_elem(256, false);
    for _ in 0..card {
        'find_i: loop {
            let i: u8 = rng.gen(); // full 0..=255 range; gen_range(0, 255) would never produce 255
            if !bv.get(i.into()).unwrap() {
                bv.set(i.into(), true);
                break 'find_i;
            }
        }
    }
    bv
}

And the most naive HashSet version would be:

// [dependencies]
// rand = "0.7.3"
use rand::Rng;
use std::collections::HashSet;
fn generate_itemset_hashset() -> HashSet<u8> {
    let mut rng = rand::thread_rng();
    let card: u8 = rng.gen_range(MIN_SIZE, MAX_SIZE);
    let mut v = HashSet::with_capacity(card.into());
    for _ in 0..card {
        'find_i: loop {
            let i: u8 = rng.gen(); // full 0..=255 range; gen_range(0, 255) would never produce 255
            if !v.contains(&i) {
                v.insert(i);
                break 'find_i;
            }
        }
    }
    v
}

To test their behaviour, I'll generate 1M itemsets and dump them into a Vec (no, I won't dump them into a BitVec):

// [dependencies]
// serde_json = "1.0.55"
// serde = "1.0.114"
use std::{fs::File, io::BufWriter, io::Write};
use std::env;
macro_rules! generate_itemsets {
    ($n:expr, $func:ident, $file_name:expr) => {
        let mut v = Vec::with_capacity($n);
        for _ in 0..$n {
            v.push($func());
        }
        let json_str = serde_json::to_string(&v).unwrap();
        write_to_file($file_name, json_str);
    };
}
fn main() {
    let n = 1000000;
    let args: Vec<String> = env::args().collect();
    assert_eq!(args.len(), 2);
    if args[1].contains("bitvec") {
        generate_itemsets!(n, generate_itemset_bv, "bitvec.json");
    } else if args[1].contains("vec") {
        generate_itemsets!(n, generate_itemset, "vec.json");
    } else if args[1].contains("hashset") {
        generate_itemsets!(n, generate_itemset_hashset, "hashset.json");
    }
}
pub fn write_to_file(filepath: &str, content: String) {
    let f = File::create(filepath).expect("Unable to create file");
    let mut buffered_writer = BufWriter::new(f);
    buffered_writer
        .write_all(content.as_bytes())
        .expect("Unable to write data");
}

To avoid writing the same code again, a declarative macro is used to generate some code here. Unlike a textual preprocessor, Rust macros operate on the parsed token tree rather than on raw text: the compiler expands them after constructing the program's AST, and the expanded code then goes through the same checks as hand-written code, so a macro can never sneak invalid code into the program.

I won't go into too much detail here, but as you can see, $n is an expression that takes its value from the variable n and determines how many times we call $func (which is an identifier). In the end, we serialize the vec as JSON to the given file ($file_name). Since this macro doesn't return anything, it is fine to expand it into a sequence of statements without extra curly braces rather than making it an expression.
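As a minimal standalone illustration of the same pattern (with hypothetical names, not the article's actual macro), an $n:expr and a $func:ident can expand into a plain sequence of statements at the call site:

```rust
// A stripped-down version of the pattern: $n is an expression, $func an
// identifier, and $out a binding created at the call site. The macro
// expands into plain statements, not an expression.
macro_rules! collect_n {
    ($n:expr, $func:ident, $out:ident) => {
        let mut $out = Vec::with_capacity($n);
        for _ in 0..$n {
            $out.push($func());
        }
    };
}

fn make_item() -> u8 {
    42
}

fn main() {
    collect_n!(3, make_item, v);
    assert_eq!(v, vec![42u8, 42, 42]);
}
```

Because the binding name is supplied by the caller as an identifier, the expanded `let mut v` is visible after the macro call, which is exactly how the article's macro can build its vec and serialize it in place.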

On Sparse Itemsets

const MIN_SIZE: u8 = 1;
const MAX_SIZE: u8 = 10;

If our itemsets are sparse on average, BitVec is not expected to be more efficient than Vec<u8>.

I have run each measurement a few (>3) times and the results are roughly the same. The runtimes are quite small, so comparing them here doesn't make much sense.

Vec<u8> results:

User time (seconds): 0.21
System time (seconds): 0.04
Percent of CPU this job got: 67%
Maximum resident set size (kbytes): 77012
Resulting file size: 19 MB

BitVec results:

User time (seconds): 0.38
System time (seconds): 0.05
Percent of CPU this job got: 60%
Maximum resident set size (kbytes): 138768
Resulting file size: 58 MB

HashSet results:

User time (seconds): 0.39
System time (seconds): 0.09
Percent of CPU this job got: 78%
Maximum resident set size (kbytes): 121408
Resulting file size: 19 MB

On Dense Itemsets

const MIN_SIZE: u8 = 245;
const MAX_SIZE: u8 = 255;

We went a bit extreme here, but it will be nice to see how much we can gain from BitVec.

Vec results:

User time (seconds): 46.92
System time (seconds): 1.00
Percent of CPU this job got: 81%
Maximum resident set size (kbytes): 1156540
Resulting file size: 850 MB

BitVec results:

User time (seconds): 7.15
System time (seconds): 0.21
Percent of CPU this job got: 93%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.89
Maximum resident set size (kbytes): 192776
Resulting file size: 108 MB

HashSet results (ohh boy):

User time (seconds): 34.82
System time (seconds): 1.90
Percent of CPU this job got: 72%
Maximum resident set size (kbytes): 1960720
Resulting file size: 850 MB

Analysis

As we can see, the gain from BitVec is unmissable. We claimed that BitVec memory usage would be constant, yet there is some increase. This is due to building a string representation for the JSON dump; if we eliminated that part, the memory usage would be almost identical across runs.

The bitvec JSON file size also changed between the sparse and dense runs. We can examine this by pretty-printing the JSON version of both itemsets.

First of all, the BitVec crate uses a Vec<u32> as the storage unit to contain the bits (each u32 captures 32 bits) and a usize field to track the number of bits in use. Because of this, the JSON dump of a BitVec is not a human-readable version of the itemset.

A sparse itemset looks like this:

{
  "storage": [
    0,
    8,
    2,
    9437696,
    0,
    0,
    512,
    0
  ],
  "nbits": 256
}

A dense itemset looks like this:

{
  "storage": [
    4294965247,
    4294963199,
    4294967295,
    4294967279,
    4294950911,
    4294967295,
    4294967039,
    2013247999
  ],
  "nbits": 256
}

As we can see, the storage consists of eight u32 values, and their string representation in the dense itemset is bigger because the numbers have more digits.
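To sanity-check the dump, we can decode the storage words back into items with plain bit arithmetic. This is a std-only sketch (a hypothetical decode helper) assuming bit i lives in word i / 32 at bit position i % 32, which matches what get/set do on the packed form:

```rust
/// Decode a 256-bit storage array into the item indices it contains,
/// assuming bit i lives in word i / 32 at bit position i % 32.
fn decode(storage: &[u32; 8]) -> Vec<u16> {
    let mut items = Vec::new();
    for (w, &word) in storage.iter().enumerate() {
        for j in 0..32usize {
            if word & (1u32 << j) != 0 {
                items.push((w * 32 + j) as u16);
            }
        }
    }
    items
}

fn main() {
    // The "storage" array from the sparse itemset dump above.
    let sparse: [u32; 8] = [0, 8, 2, 9437696, 0, 0, 512, 0];
    println!("{:?}", decode(&sparse)); // [35, 65, 105, 116, 119, 201]
}
```

Six items come back out, which is consistent with the sparse setting of 1 to 9 items per set.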

Final words

Looking at memory and disk usage, BitVec seems like the better candidate for representing sets with a restricted number of items. Additionally, as the sets get denser and denser, the cost of using Vec<u8> (not even mentioning HashSet<u8>) keeps growing while BitVec stays almost constant.

I haven't talked about it here, but the downside of BitVec shows up when doing computations on itemsets. It's much easier to work with integers directly, and using indexed bits to represent items carries its own operational cost. In the end, in my case, BitVec and Vec<u8> gave almost on-par performance, with Vec being a bit better. For some cases where I had more than 255 items, though, the flexible BitVec performed better than Vec<u16>.
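As one concrete example of that trade-off, intersection on the packed form becomes a handful of word-wise ANDs plus a popcount instead of walking two item lists. A std-only sketch over a hypothetical [u32; 8] packed set (the bit-vec crate offers comparable set operations):

```rust
/// Intersect two 256-bit packed itemsets word by word.
fn intersect(a: &[u32; 8], b: &[u32; 8]) -> [u32; 8] {
    let mut out = [0u32; 8];
    for w in 0..8 {
        out[w] = a[w] & b[w];
    }
    out
}

/// Count the items in a packed itemset via popcount.
fn card(set: &[u32; 8]) -> u32 {
    set.iter().map(|w| w.count_ones()).sum()
}

fn main() {
    let mut a = [0u32; 8];
    let mut b = [0u32; 8];
    // a = {0, 1, 3, 4}, b = {1, 4, 7}
    for &i in [0u8, 1, 3, 4].iter() { a[(i / 32) as usize] |= 1 << (i % 32); }
    for &i in [1u8, 4, 7].iter()    { b[(i / 32) as usize] |= 1 << (i % 32); }
    let c = intersect(&a, &b);
    assert_eq!(card(&c), 2); // only items 1 and 4 are in both sets
}
```

The per-operation cost is fixed at eight word ANDs regardless of density, which mirrors the constant-footprint behaviour seen in the memory results: dense sets cost no more than sparse ones.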