Similar to Bloom Filters, Golomb Compressed sets allow for space-efficient probablistic storage of sets. In other words, you can ask a GCS if it’s seen an object, and retrieve either “absolutely not” or “probably not” in response.
gcstool was my first Rust project, developed primarily to play about with the haveibeenpwned.com pwned-passwords-2.0.txt database. It can store all half a billion items with a false-positive rate of 1-in-50 million in just 1.6GB, importing them in just a few minutes, though with fairly high memory requirements.
% gcstool -H hex create -p 50000000 pwned-passwords-2.0.txt pwned-passwords-2.0-p50m.gcs Estimated memory use for 501452315 items: 3825 MB. ^C now and get a better computer if memory constrained Hashing: 25072615 of 501452315, 5.0%, 2785847/sec ... Hashing: 501452300 of 501452315, 100.0%, 2639223/sec Hashing complete in 190.42s Normalise complete in 0.32s Sort complete in 8.44s Deduplicate complete in 1.19s Encode: 25072615 of 501452315, 5.0%, 122306/sec ... Encode: 501452300 of 501452315, 100.0%, 1649514/sec Encode complete in 103.96s Index complete in 0.04s Complete in 304.65s % du -h pwned-passwords-2.0-p50m.gcs 1.6G pwned-passwords-2.0-p50m.gcs % gcstool query pwned-passwords-2.0-p50m.gcs Ready for queries on 501636842 items with a 1 in 50000000 false-positive rate. ^D to exit. > password Found in 0.3ms > not a leaked password Not found in 1.7ms > I love dogs Found in 0.7ms > I guess it works Not found in 0.1ms
ruby-gcs is a reimplementation in Ruby, complete with construction and query capabilities. Somewhere in my to-do list is to make a plugin for the Rodauth authentication system using it for leaked password detection.
Two other Rust projects were spawned from this:
linereader, a fast zero-copy
line-oriented reader, and
bitrw, a bitwise IO library.