tokenize/hash large amount of strings efficiently
=pod =encoding UTF-8 =head1 NAME Text::SpeedyFx - tokenize/hash large amount of strings efficiently =head1 VERSION version 0.014 =head1 SYNOPSIS use Data::Dumper; use Text::SpeedyFx; my $sfx = Text::SpeedyFx->new; my $words_bag = $sfx->hash('To be or not to be?'); print Dumper $words_bag; #$VAR1 = { # '1422534433' => '1', # '4120516737' => '2', # '1439817409' => '2', # '3087870273' => '1' # }; my $feature_vector = $sfx->hash_fv("thats the question", 8); print unpack('b*', $feature_vector); # 01001000 =head1 DESCRIPTION XS implementation of a very fast combined parser/hasher which works well on a variety of I<bag-of-word> problems. L<Original implementation|http://www.hpl.hp.com/techreports/2008/HPL-2008-91R1.pdf> is in Java and was adapted for a better Unicode compliance. =head1 METHODS =head2 new([$seed, $bits]) Initialize parser/hasher, can be customized with the options: =over 4 =item C<$seed> Hash seed (default: 1). =item C<$bits> How many bits do represent one character. The default value, 8, sacrifices Unicode handling but is fast and low on memory footprint. The value of 18 encompasses I<Basic Multilingual>, I<Supplementary Multilingual> and I<Supplementary Ideographic> planes. See also L</UNICODE SUPPORT> =back =head2 hash($octets) Parses C<$octets> and returns a hash reference (not exactly; see L</CAVEAT>) where keys are the hashed tokens and values are their respective count. C<$octets> are assumed to represent UTF-8 string unless L<Text::SpeedyFx> is instantiated with L</$bits> == 8 (which forces Latin-1 mode, see L</UNICODE SUPPORT>). Note that this is the slowest form due to the (computational) complexity of the L<associative array|https://en.wikipedia.org/wiki/Associative_array> data structure itself: C<hash_fv()>/C<hash_min()> variants are up to 260% faster! =head2 hash_fv($octets, $n) Parses C<$octets> and returns a feature vector (string of bits) with length C<$n>. C<$n> is supposed to be a multiplier of 8, as the length of the resulting feature vector is C<ceil($n / 8)>. See the included utilities C<cosine_cmp> and C<uniq_wc>. =head2 hash_min($octets) Parses C<$octets> and returns the hash with the lowest value. Useful in L<MinHash|http://en.wikipedia.org/wiki/MinHash> implementation. See also the included C<minhash_cmp> utility. =head1 UNICODE SUPPORT Due to the nature of Perl, Unicode support is handled differently from the original implementation. By default, L<Text::SpeedyFx> recognizes UTF-8 encoded code points in the range I<00000-2FFFF>: =over 4 =item * B<Plane 0>, the B<Basic Multilingual Plane> (BMP, I<0000–FFFF>) =item * B<Plane 1>, the B<Supplementary Multilingual Plane> (SMP, I<10000–1FFFF>) =item * B<Plane 2>, the B<Supplementary Ideographic Plane> (SIP, I<20000–2FFFF>) =item * There are planes up to 16; however, as in Perl v5.16.2, there are no code points matching C<isALNUM_utf8()> there (so it's irrelevant for proper algorithm operation). =back Although, there is a major drawback: in this mode, B<each instance> allocates up to 1 MB of memory. If the application doesn't need to support code points beyond the B<Plane 0> (like the original SpeedyFx implementation) it is possible to constraint the address space to 16 bits, which lowers memory allocation to up to 256 KB. In fact, L<Text::SpeedyFx> constructor accepts bit range between 8 and 18 to address code points. =head2 LATIN-1 SUPPORT 8 bit address space has one special meaning: it completely disables multibyte support. In 8 bit mode, each instance will only allocate 256 bytes and hashing will run up to 340% faster! Tokenization will fallback to I<ISO 8859-1 West European languages (Latin-1)> character definitions. =head1 BENCHMARK The test platform configuration: =over 4 =item * Intel® Xeon® E5620 CPU @ 2.40GHz (similar to the one cited in the reference paper); =item * Debian 6.0.6 (Squeeze, 64-bit); =item * Perl v5.16.1 (installed via L<perlbrew>); =item * F<enwik8> from the L<Large Text Compression Benchmark|https://cs.fit.edu/~mmahoney/compression/text.html>. =back Rate murmur_utf8 hash_utf8 hash_min_utf8 hash hash_fv hash_min murmur_utf8 6 MB/s -- -79% -86% -89% -97% -97% hash_utf8 30 MB/s 376% -- -35% -47% -84% -85% hash_min_utf8 47 MB/s 637% 55% -- -18% -76% -77% hash 58 MB/s 803% 90% 23% -- -70% -72% hash_fv 194 MB/s 2946% 541% 313% 237% -- -6% hash_min 206 MB/s 3143% 582% 340% 259% 6% -- All the tests except the ones with C<_utf8> suffix were made in L<Latin-1 mode|/UNICODE SUPPORT>. For comparison, C<murmur_utf8> was implemented using L<Digest::MurmurHash> hasher and native regular expression tokenizer: ++$fv->{murmur_hash(lc $1)} while $data =~ /(\w+)/gx; See also the F<eg/benchmark.pl> script. =head1 CAVEAT For performance reasons, C<hash()> method returns a L<tied hash|perltie> which is an interface to L<nedtries|http://www.nedprod.com/programs/portable/nedtries/>. The interesting property of a L<trie data structure|https://en.wikipedia.org/wiki/Trie> is that the keys are "nearly sorted" (and the first key is guaranteed to be the lowest), so: # This: $fv = $sfx->hash($data); ($min) = each %$fv; # Is the same as this: ($min) = $sfx->hash_min($data); # (albeit the later being 2x faster) The downside is the magic involved, the C<delete> breaking the key order, and the memory usage. The hardcoded limit is 524288 unique keys per result, which consumes ~25MB of RAM on a 64-bit architecture. Exceeding this will C<croak> with the message I<"too many unique tokens in a single data chunk">. The only way to raise this limit is by recompilation of the XS module: perl Makefile.PL DEFINE=-DMAX_TRIE_SIZE=2097152 make make test make install =head1 REFERENCES =over 4 =item * L<Extremely Fast Text Feature Extraction for Classification and Indexing|http://www.hpl.hp.com/techreports/2008/HPL-2008-91R1.pdf> by L<George Forman|http://www.hpl.hp.com/personal/George_Forman/> and L<Evan Kirshenbaum|http://www.kirshenbaum.net/evan/index.htm> =item * L<MinHash — выявляем похожие множества|http://habrahabr.ru/post/115147/> =item * L<Фильтр Блума|http://habrahabr.ru/post/112069/> =item * C<cosine_cmp>, C<minhash_cmp> and C<uniq_wc> utilities from this distribution =back =head1 AUTHOR Stanislaw Pusep <stas@sysd.org> =head1 COPYRIGHT AND LICENSE This software is copyright (c) 2021 by Stanislaw Pusep. This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself. =head1 CONTRIBUTOR =for stopwords Sergey Romanov Sergey Romanov <sromanov-dev@yandex.ru> =cut