's bl.aagh

BSD, Ruby, Rust, Rambling

gcstool / ruby-gcs

Efficient set membership with Golomb Compressed Sets

[rust] [ruby]

Similar to Bloom Filters, Golomb Compressed sets allow for space-efficient probablistic storage of sets. In other words, you can ask a GCS if it’s seen an object, and retrieve either “absolutely not” or “probably not” in response.

gcstool was my first Rust project, developed primarily to play about with the pwned-passwords-2.0.txt database. It can store all half a billion items with a false-positive rate of 1-in-50 million in just 1.6GB, importing them in just a few minutes, though with fairly high memory requirements.

% gcstool -H hex create -p 50000000 pwned-passwords-2.0.txt pwned-passwords-2.0-p50m.gcs
Estimated memory use for 501452315 items: 3825 MB.
^C now and get a better computer if memory constrained
Hashing: 25072615 of 501452315, 5.0%, 2785847/sec
Hashing: 501452300 of 501452315, 100.0%, 2639223/sec
Hashing complete in 190.42s
Normalise complete in 0.32s
Sort complete in 8.44s
Deduplicate complete in 1.19s
Encode: 25072615 of 501452315, 5.0%, 122306/sec
Encode: 501452300 of 501452315, 100.0%, 1649514/sec
Encode complete in 103.96s
Index complete in 0.04s
Complete in 304.65s
% du -h pwned-passwords-2.0-p50m.gcs
1.6G    pwned-passwords-2.0-p50m.gcs
% gcstool query pwned-passwords-2.0-p50m.gcs
Ready for queries on 501636842 items with a 1 in 50000000 false-positive rate.  ^D to exit.
> password
Found in 0.3ms
> not a leaked password
Not found in 1.7ms
> I love dogs
Found in 0.7ms
> I guess it works
Not found in 0.1ms

ruby-gcs is a reimplementation in Ruby, complete with construction and query capabilities. Somewhere in my to-do list is to make a plugin for the Rodauth authentication system using it for leaked password detection.

Two other Rust projects were spawned from this: linereader, a fast zero-copy line-oriented reader, and bitrw, a bitwise IO library.