smap

Please see the README below or on GitHub at https://github.com/wyager/smap


Keywords
library, program, text, Smap.Commands, Smap.Flags, cabal, stack, process substitution
License
BSD-3-Clause
Install
cabal install smap

Documentation

smap - a command line tool for sets and maps

Installation:

To install from Hackage, run:

cabal install smap

To install from source, you can use the command above, or download this repo and run:

stack install smap

You will need either cabal or stack installed.

Tutorial:

The setup:

cat > patients << EOF
Bob Smith
Jane Doe
John Smith
Carol Carell
EOF

cat > has_cold << EOF
Jane Doe
John Smith
EOF

cat > has_mumps << EOF
Jane Doe
Carol Carell
EOF

Simple usage (sets)

cat - Set Union (and Deduplication)

Sick patients:

$ smap cat has_cold has_mumps
Jane Doe
John Smith
Carol Carell

You can also use - instead of a filename to represent stdin/stdout. (This works for any command.)

$ cat has_cold | smap cat - has_mumps
Jane Doe
John Smith
Carol Carell

If you don't provide any arguments, cat will assume you mean stdin.

$ cat has_cold has_mumps | smap cat
Jane Doe
John Smith
Carol Carell
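
As a rough mental model of what cat computes (not how smap is implemented), the same union with deduplication can be sketched with standard awk. The function name set_union is made up for this illustration, and the sketch keeps every distinct line in memory.

```shell
# Recreate two of the tutorial files so the sketch runs on its own.
printf '%s\n' 'Jane Doe' 'John Smith' > has_cold
printf '%s\n' 'Jane Doe' 'Carol Carell' > has_mumps

# Hypothetical helper: print each line the first time it is seen,
# in input order. With no arguments awk reads stdin; "-" also means stdin.
set_union() { awk '!seen[$0]++' "$@"; }

set_union has_cold has_mumps       # files only
set_union - has_mumps < has_cold   # "-" stands for stdin
```

Both calls print Jane Doe, John Smith and Carol Carell, in that order, which matches the smap cat output shown above.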

sub - Set subtraction

Healthy patients:

$ smap sub patients has_cold has_mumps
Bob Smith
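
The subtraction can be modeled the same way. This is a sketch of the set semantics only, assuming at least one file is subtracted; set_sub is a hypothetical name, not part of smap.

```shell
# Recreate the tutorial files so the sketch runs on its own.
printf '%s\n' 'Bob Smith' 'Jane Doe' 'John Smith' 'Carol Carell' > patients
printf '%s\n' 'Jane Doe' 'John Smith' > has_cold
printf '%s\n' 'Jane Doe' 'Carol Carell' > has_mumps

# Hypothetical helper: keep the lines of the first file that appear
# in none of the remaining files.
set_sub() {
  first=$1; shift
  cat "$@" | awk -v first="$first" '
    { drop[$0] = 1 }                      # remember every line to remove
    END {
      while ((getline line < first) > 0)  # then stream the first file
        if (!(line in drop)) print line
    }'
}

set_sub patients has_cold has_mumps   # prints: Bob Smith
```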

int - Set intersection

Patients with both a cold and mumps:

$ smap int has_cold has_mumps
Jane Doe
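
For two files, intersection amounts to filtering the first file by membership in the second. A minimal awk sketch of that idea follows; set_int is a hypothetical name, and the sketch assumes the second file is not empty.

```shell
# Recreate two of the tutorial files so the sketch runs on its own.
printf '%s\n' 'Jane Doe' 'John Smith' > has_cold
printf '%s\n' 'Jane Doe' 'Carol Carell' > has_mumps

# Hypothetical helper for two files: awk reads the second file first
# (NR == FNR is true only while reading it) and stores its lines as a set,
# then prints the lines of the first file that are in that set.
set_int() { awk 'NR == FNR { keep[$0] = 1; next } $0 in keep' "$2" "$1"; }

set_int has_cold has_mumps   # prints: Jane Doe
```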

xor - Symmetric difference

Patients who have either a cold or mumps, but not both:

$ smap xor has_cold has_mumps
Carol Carell
John Smith
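
For two distinct files, the symmetric difference is the set of lines found in exactly one of them. The sketch below counts, per line, how many files contain it. The name set_xor is hypothetical, and the final sort is there only to make this sketch deterministic; it says nothing about the order smap itself uses.

```shell
# Recreate two of the tutorial files so the sketch runs on its own.
printf '%s\n' 'Jane Doe' 'John Smith' > has_cold
printf '%s\n' 'Jane Doe' 'Carol Carell' > has_mumps

# Hypothetical helper: count each line at most once per file,
# then print the lines that occur in exactly one file.
set_xor() {
  awk '
    !((FILENAME, $0) in seen) { seen[FILENAME, $0] = 1; count[$0]++ }
    END { for (line in count) if (count[line] == 1) print line }
  ' "$@" | sort
}

set_xor has_cold has_mumps
```

This prints Carol Carell and John Smith; Jane Doe is dropped because she appears in both files.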

Advanced usage (maps)

When using smap with sets, the behavior is pretty straightforward. It gets a bit more complicated when dealing with maps.

If you give smap a plain file path, it constructs a map in which each line is both key and value (this is equivalent to a set). If you instead pass +file1,file2 as an argument, smap constructs a map using the lines of file1 as keys and the corresponding lines of file2 as values.

We can extract the patients' last names with cut -f 2 -d ' ' <patient file>.

Pick one patient from each family:

$ smap cat +<(cut -f 2 -d ' ' patients),patients
Bob Smith
Jane Doe
Carol Carell

To understand the above:

  • <(cut -f 2 -d ' ' patients) gets a list of all the patients' last names and creates a virtual file with this list. See bash process substitution.
  • +<(cut -f 2 -d ' ' patients),patients constructs a stream where the keys are the last names and the values are the whole names.

cat deduplicates by key, so only the first person seen from each family is printed; any later member of the same family is skipped.
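
The key/value pairing can be imitated with paste, which joins two files line by line. This is only a sketch of the semantics: cat_by_key and the file last_names are hypothetical names, a plain file stands in for the process substitution, and the lines are assumed to contain no tab characters.

```shell
# Recreate the tutorial file so the sketch runs on its own.
printf '%s\n' 'Bob Smith' 'Jane Doe' 'John Smith' 'Carol Carell' > patients

# Hypothetical helper: $1 is a file of keys, $2 a file of values.
# paste joins them with a tab; awk prints the value the first time
# each key appears.
cat_by_key() { paste "$1" "$2" | awk -F '\t' '!seen[$1]++ { print $2 }'; }

cut -f 2 -d ' ' patients > last_names   # plain file instead of <(...)
cat_by_key last_names patients
```

This prints Bob Smith, Jane Doe and Carol Carell. John Smith is skipped because the key Smith was already seen.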

Patients who have family members with a cold:

$ smap int +<(cut -f 2 -d ' ' patients),patients <(cut -f 2 -d ' ' has_cold)
Bob Smith
Jane Doe
John Smith

To understand the above:

  • <(cut -f 2 -d ' ' patients) gets a list of all the patients' last names.
  • +<(cut -f 2 -d ' ' patients),patients constructs a stream where the keys are the last names and the values are the whole names.
  • <(cut -f 2 -d ' ' has_cold) gets a list of family names of everyone who has a cold.

In other words, int filters the first argument (treated as a stream of key,value pairs) down to the entries whose keys are present in the second argument.
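
The same filtering by key can be sketched with paste and awk. Again this models the behavior rather than the implementation: int_by_key, last_names and cold_families are hypothetical names, plain files replace the process substitutions, and the set of keys is assumed to be non-empty.

```shell
# Recreate the tutorial files so the sketch runs on its own.
printf '%s\n' 'Bob Smith' 'Jane Doe' 'John Smith' 'Carol Carell' > patients
printf '%s\n' 'Jane Doe' 'John Smith' > has_cold

cut -f 2 -d ' ' patients > last_names      # keys for the first argument
cut -f 2 -d ' ' has_cold > cold_families   # the set of keys to keep

# Hypothetical helper: $1 keys, $2 values, $3 the set of keys to keep.
# awk loads $3 as a set, then reads the pasted key<TAB>value pairs
# from stdin ("-") and prints the values whose key is in the set.
int_by_key() {
  paste "$1" "$2" |
    awk -F '\t' 'NR == FNR { keep[$0] = 1; next } $1 in keep { print $2 }' "$3" -
}

int_by_key last_names patients cold_families
```

This prints Bob Smith, Jane Doe and John Smith. Unlike the cat sketch, nothing is deduplicated here, so both members of the Smith family appear.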

Approximate mode

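If you're processing a large number of lines and running up against memory limits, you can use the --approximate option, which keeps track of a 64-bit hash of each line instead of the entire line. The result is approximate because two different lines could, in principle, share the same hash and be treated as one.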
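
The idea behind this trade-off can be sketched by deduplicating on a checksum instead of on the line itself. The sketch uses the POSIX cksum utility, a 32-bit CRC, purely as a stand-in; it is not the hash smap uses, and approx_cat is a hypothetical name.

```shell
# Recreate two of the tutorial files so the sketch runs on its own.
printf '%s\n' 'Jane Doe' 'John Smith' > has_cold
printf '%s\n' 'Jane Doe' 'Carol Carell' > has_mumps

# Hypothetical helper: tag each line with its checksum, then deduplicate
# on the checksum alone, so awk remembers checksums rather than lines.
approx_cat() {
  cat "$@" | while IFS= read -r line; do
    printf '%s\t%s\n' "$(printf '%s' "$line" | cksum | cut -d ' ' -f 1)" "$line"
  done | awk -F '\t' '!seen[$1]++ { print substr($0, index($0, "\t") + 1) }'
}

approx_cat has_cold has_mumps
```

Memory use then grows with the number of distinct checksums rather than with the total length of the distinct lines. The price is that two different lines with the same checksum would be collapsed into one.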