csv_fish

CSV categorical data analyzer. Generates 2x2 contingency tables according to specified row/column group conditions, and applies Fisher's exact test.


License
GPL-3.0+

Documentation

CSV Fish

CSV categorical data analyzer. Generates 2x2 contingency tables according to specified row/column group conditions, and applies Fisher's exact test.

Installation

If you don't want to build the program yourself (see instructions below for that), you can get it via the Cargo tool. First install Cargo and then run

cargo install csv_fish

Usage

To use the program, point to the data, group specifications, and desired output location:

csv-fish --data data.csv --groups groups.csv --output output.csv

See the example files for inspiration about the content of the files, and read the below input/output specification before using the program.

The Groups CSV file contains SQL queries executed in SQLite. Do not run the program on input files you cannot trust.

Inputs

  • All CSV inputs are expected to be separated by semicolons.

  • Data CSV - values of categorical variables for individual samples.

    • First row is a header - names of categorical variables in the data set. Each subsequent row represents one sample.

    • Each column represents a categorical variable.

    • Each cell is a value of a given categorical variable for a given sample.

    • Before running any queries on data, all cell values are trimmed of leading and trailing whitespace, to prevent accidental mismatches. (Don't use leading and trailing whitespace as intended differentiators in data.)

    • (More than two possible values per a categorical variable are supported.)

  • Groups CSV - SQL for selection of the groups being tested.

    • If you're not familiar with SQL, you can get a quick overview of how the where conditions work, and the logical operators you can use in them.

    • First row is a header, subsequent rows represent groups (conditions for selecting groups).

    • The conditions are either for selecting rows of the contingency table, or for selecting columns, or for selecting both rows and columns at the same time.

      • When using row-only and column-only conditions, the program then combines each of the row-only conditions with each of the column-only conditions (Cartesian product) to generate row-and-column conditions, which are in turn used for generating contingency tables.

      • When using row-and-column conditions, each such row in Groups CSV file corresponds to one contingency table.

      • It is possible to combine row-only and column-only conditions with row-and-column conditions in a single Groups CSV file.

      • See the example files, which might be better than a thousand words of description.

    • The header of Groups CSV must be:

      condtype;filter;r1cond;r2cond;c1cond;c2cond
      
    • condtype - required - the values can be row, col, or rowcol. The row and col types are for specifying conditions for rows and columns separately, the rowcol type specifies them together.

    • filter - optional - initial SQL WHERE condition applied to all data before conducting the fisher exact test. Can be used to shrink the sample set before performing any further operations.

      • If your data contains samples where some categorical variables are unknown (empty cell), you probably want to add filter conditions so that queries working with that variable don't use those samples. E.g. my_var != ''. See the groups.csv example. As filter is SQL, you can use the operators (like AND) in the condition if you need to reference multiple colums.
    • r1cond - required when condtype is row or rowcol - SQL WHERE condition for selecting row 1.

    • r2cond - optional - SQL WHERE condition for selecting row 2. If empty, the complement of r1cond is used (the result is still limited by filter).

    • c1cond - required when condtype is col or rowcol - SQL WHERE condition for selecting column 1.

    • c2cond - optional - SQL WHERE condition for selecting column 2. If empty, the complement of c1cond is used (the result is still limited by filter).

    • To avoid accidental sample overlaps between r1cond/r2cond or c1cond/c2cond, it is recommended that r2cond and c2cond are not used empty, and filter is used to pre-select a narrower sample set when desired (e.g. skip samples with unknown value for given variables), and then only r1cond and c1cond are used for row/column selection, making the second row/column always the complement of the first one.

Output

  • Results CSV - Fisher's exact test results.

    • First row is a header, each subsequent row represents one Fisher's exact test.

    • The header is:

      filter;r1cond;r2cond;c1cond;c2cond;r1c1;r1c1;r2c1;r2c2;fisher_l;fisher_r;fisher_2t
      
    • The values in the first 5 columns are the same as in the input Groups CSV.

    • The values in the columns r1c1,r1c2,r2c1,r2c2 are counts of samples in the contingency table, which satisfy the respective conditions.

    • The fisher_l,fisher_r,fisher_2t are left, right, and 2-tail p-values of Fischer's exact test.

Building

Only tested on Linux. To build the binary:

make

To run all tests:

make test

If you're feeling adventurous, you can also cross-compile the program for Windows. This requires podman. Make sure to read the script before running it. Then run it:

bash tools/compile-for-windows.sh

License

GNU GPL v3+