hur.st's bl.aagh

BSD, Ruby, Rust, Rambling

gcstool / ruby-gcs

Efficient set membership with Golomb Compressed Sets

[rust] [ruby]

Similar to Bloom Filters, Golomb Compressed sets allow for space-efficient probablistic storage of sets. In other words, you can ask a GCS if it’s seen an object, and retrieve either “absolutely not” or “probably not” in response.

gcstool was my first Rust project, developed primarily to play about with the haveibeenpwned.com pwned-passwords-2.0.txt database. It can store all half a billion items with a false-positive rate of 1-in-50 million in just 1.6GB, importing them in just a few minutes, though with fairly high memory requirements.

% gcstool -H hex create -p 50000000 pwned-passwords-2.0.txt pwned-passwords-2.0-p50m.gcs
Estimated memory use for 501452315 items: 3825 MB.
^C now and get a better computer if memory constrained
Hashing: 25072615 of 501452315, 5.0%, 2785847/sec
...
Hashing: 501452300 of 501452315, 100.0%, 2639223/sec
Hashing complete in 190.42s
Normalise complete in 0.32s
Sort complete in 8.44s
Deduplicate complete in 1.19s
Encode: 25072615 of 501452315, 5.0%, 122306/sec
...
Encode: 501452300 of 501452315, 100.0%, 1649514/sec
Encode complete in 103.96s
Index complete in 0.04s
Complete in 304.65s
% du -h pwned-passwords-2.0-p50m.gcs
1.6G    pwned-passwords-2.0-p50m.gcs
% gcstool query pwned-passwords-2.0-p50m.gcs
Ready for queries on 501636842 items with a 1 in 50000000 false-positive rate.  ^D to exit.
> password
Found in 0.3ms
> not a leaked password
Not found in 1.7ms
> I love dogs
Found in 0.7ms
> I guess it works
Not found in 0.1ms

ruby-gcs is a reimplementation in Ruby, complete with construction and query capabilities. Somewhere in my to-do list is to make a plugin for the Rodauth authentication system using it for leaked password detection.

Two other Rust projects were spawned from this: linereader, a fast zero-copy line-oriented reader, and bitrw, a bitwise IO library.