A tool to find the differences between two tables.


License
MIT
Install
pip install pl-compare==0.4.2

Documentation

pl_compare: Compare and find the differences between two Polars DataFrames.

Github - PyPi Page

You will find pl-compare useful if you find yourself writing various SQL/Dataframe operations to:

  • Understand how well two tables Reconcile example
  • Find the schema differences between two tables example
  • Find counts or examples of rows that exist in one table but not another example
  • Find counts or examples of value differences between two tables example
  • Assert that two tables are exactly equal (such as for an automated test) example
  • Assert that two tables have matching schemas, rows or column values example

Click for a jupyter notebook with example usage

With pl-compare you can:

  • Get statistical summaries and/or examples and/or a boolean to indicate:
    • Schema differences
    • Row differences
    • Value differences
  • Easily works for Pandas dataframes and other tabular data formats with conversion using Apache arrow
  • View differences as a text report
  • Get differences as a Polars LazyFrame or DataFrame
  • Use LazyFrames for larger than memory comparisons
  • Specify the equality calculation that is used to dermine value differences

Installation

pip install pl_compare

Examples (click to expand)

Return booleans to check for schema, row and value differences

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("is_schemas_equal:", compare_result.is_schemas_equal())
is_schemas_equal: False
>>> print("is_rows_equal:", compare_result.is_rows_equal())
is_rows_equal: False
>>> print("is_values_equal:", compare_result.is_values_equal())
is_values_equal: False
>>>

Schema differences summary and details

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("schemas_summary()")
schemas_summary()
>>> print(compare_result.schemas_summary())
shape: (6, 2)
┌─────────────────────────────────┬───────┐
│ StatisticCount │
│ ------   │
│ stri64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base3     │
│ Columns in compare4     │
│ Columns in base and compare3     │
│ Columns only in base0     │
│ Columns only in compare1     │
│ Columns with schema differences1     │
└─────────────────────────────────┴───────┘
>>> print("schemas_sample()")
schemas_sample()
>>> print(compare_result.schemas_sample())
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
│ columnbase_formatcompare_format │
│ ---------            │
│ strstrstr            │
╞══════════╪═════════════╪════════════════╡
│ Example2StringInt64          │
│ Example3nullInt64          │
└──────────┴─────────────┴────────────────┘
>>>

Row differences summary and details

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("rows_summary()")
rows_summary()
>>> print(compare_result.rows_summary())
shape: (5, 2)
┌──────────────────────────┬───────┐
│ StatisticCount │
│ ------   │
│ stri64   │
╞══════════════════════════╪═══════╡
│ Rows in base3     │
│ Rows in compare3     │
│ Rows only in base1     │
│ Rows only in compare1     │
│ Rows in base and compare2     │
└──────────────────────────┴───────┘
>>> print("rows_sample()")
rows_sample()
>>> print(compare_result.rows_sample())
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
│ IDvariablevalue           │
│ ---------             │
│ strstrstr             │
╞════════════╪══════════╪═════════════════╡
│ 12345678statusin base only    │
│ 1234567810statusin compare only │
└────────────┴──────────┴─────────────────┘
>>>

Value differences summary and details

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("values_summary()")
values_summary()
>>> print(compare_result.values_summary())
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
│ Value DifferencesCountPercentage │
│ ---------        │
│ stri64f64        │
╞═════════════════════════╪═══════╪════════════╡
│ Total Value Differences150.0       │
│ Example1150.0       │
└─────────────────────────┴───────┴────────────┘
>>> print("values_sample()")
values_sample()
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ IDvariablebasecompare │
│ ------------     │
│ strstri64i64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567Example162       │
└─────────┴──────────┴──────┴─────────┘
>>>

Full report

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> compare_result.report()
--------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------
<BLANKLINE>
SCHEMA DIFFERENCES:
shape: (6, 2)
┌─────────────────────────────────┬───────┐
│ StatisticCount │
│ ------   │
│ stri64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base3     │
│ Columns in compare4     │
│ Columns in base and compare3     │
│ Columns only in base0     │
│ Columns only in compare1     │
│ Columns with schema differences1     │
└─────────────────────────────────┴───────┘
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
│ columnbase_formatcompare_format │
│ ---------            │
│ strstrstr            │
╞══════════╪═════════════╪════════════════╡
│ Example2StringInt64          │
│ Example3nullInt64          │
└──────────┴─────────────┴────────────────┘
--------------------------------------------------------------------------------
<BLANKLINE>
ROW DIFFERENCES:
shape: (5, 2)
┌──────────────────────────┬───────┐
│ StatisticCount │
│ ------   │
│ stri64   │
╞══════════════════════════╪═══════╡
│ Rows in base3     │
│ Rows in compare3     │
│ Rows only in base1     │
│ Rows only in compare1     │
│ Rows in base and compare2     │
└──────────────────────────┴───────┘
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
│ IDvariablevalue           │
│ ---------             │
│ strstrstr             │
╞════════════╪══════════╪═════════════════╡
│ 12345678statusin base only    │
│ 1234567810statusin compare only │
└────────────┴──────────┴─────────────────┘
--------------------------------------------------------------------------------
<BLANKLINE>
VALUE DIFFERENCES:
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
│ Value DifferencesCountPercentage │
│ ---------        │
│ stri64f64        │
╞═════════════════════════╪═══════╪════════════╡
│ Total Value Differences150.0       │
│ Example1150.0       │
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ IDvariablebasecompare │
│ ------------     │
│ strstri64i64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567Example162       │
└─────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>

Compare two pandas dataframes

>>> import polars as pl
>>> import pandas as pd # doctest: +SKIP
>>> from pl_compare import compare
>>>
>>> base_df = pd.DataFrame(data=
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )# doctest: +SKIP
>>> compare_df = pd.DataFrame(data=
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )# doctest: +SKIP
>>>
>>> compare_result = compare(["ID"], pl.from_pandas(base_df), pl.from_pandas(compare_df))# doctest: +SKIP
>>> compare_result.report()# doctest: +SKIP
--------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------

SCHEMA DIFFERENCES:
shape: (6, 2)
┌─────────────────────────────────┬───────┐
│ StatisticCount │
│ ------   │
│ stri64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base3     │
│ Columns in compare4     │
│ Columns in base and compare3     │
│ Columns only in base0     │
│ Columns only in compare1     │
│ Columns with schema differences1     │
└─────────────────────────────────┴───────┘
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
│ columnbase_formatcompare_format │
│ ---------            │
│ strstrstr            │
╞══════════╪═════════════╪════════════════╡
│ Example2StringInt64          │
│ Example3nullInt64          │
└──────────┴─────────────┴────────────────┘
--------------------------------------------------------------------------------

ROW DIFFERENCES:
shape: (5, 2)
┌──────────────────────────┬───────┐
│ StatisticCount │
│ ------   │
│ stri64   │
╞══════════════════════════╪═══════╡
│ Rows in base3     │
│ Rows in compare3     │
│ Rows only in base1     │
│ Rows only in compare1     │
│ Rows in base and compare2     │
└──────────────────────────┴───────┘
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
│ IDvariablevalue           │
│ ---------             │
│ strstrstr             │
╞════════════╪══════════╪═════════════════╡
│ 12345678statusin base only    │
│ 1234567810statusin compare only │
└────────────┴──────────┴─────────────────┘
--------------------------------------------------------------------------------

VALUE DIFFERENCES:
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
│ Value DifferencesCountPercentage │
│ ---------        │
│ stri64f64        │
╞═════════════════════════╪═══════╪════════════╡
│ Total Value Differences150.0       │
│ Example1150.0       │
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ IDvariablebasecompare │
│ ------------     │
│ strstri64i64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567Example162       │
└─────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>

Specify a threshold to control the granularity of the comparison for numeric columns.

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1.111, 6.11, 3.11],
...     }
... )
>>>
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1.114, 6.14, 3.12],
...     },
... )
>>>
>>> print("With equality_resolution of 0.01")
With equality_resolution of 0.01
>>> compare_result = compare(["ID"], base_df, compare_df, resolution=0.01)
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ IDvariablebasecompare │
│ ------------     │
│ strstrf64f64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567Example16.116.14    │
└─────────┴──────────┴──────┴─────────┘
>>> print("With no equality_resolution")
With no equality_resolution
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print(compare_result.values_sample())
shape: (2, 4)
┌─────────┬──────────┬───────┬─────────┐
│ IDvariablebasecompare │
│ ------------     │
│ strstrf64f64     │
╞═════════╪══════════╪═══════╪═════════╡
│ 123456Example11.1111.114   │
│ 1234567Example16.116.14    │
└─────────┴──────────┴───────┴─────────┘
>>>

Example using alias for base and compare dataframes.

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"],
...                          base_df,
...                          compare_df,
...                          base_alias="before_change",
...                          compare_alias="after_change")
>>>
>>> print("values_summary()")
values_summary()
>>> print(compare_result.schemas_sample())
shape: (2, 3)
┌──────────┬──────────────────────┬─────────────────────┐
│ columnbefore_change_formatafter_change_format │
│ ---------                 │
│ strstrstr                 │
╞══════════╪══════════════════════╪═════════════════════╡
│ Example2StringInt64               │
│ Example3nullInt64               │
└──────────┴──────────────────────┴─────────────────────┘
>>> print("values_sample()")
values_sample()
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬───────────────┬──────────────┐
│ IDvariablebefore_changeafter_change │
│ ------------          │
│ strstri64i64          │
╞═════════╪══════════╪═══════════════╪══════════════╡
│ 1234567Example162            │
└─────────┴──────────┴───────────────┴──────────────┘
>>>

Assert two frames are equal for a test

>>> import polars as pl
>>> import pytest
>>> from pl_compare.compare import compare
>>>
>>> def test_example():
...     base_df = pl.DataFrame(
...         {
...             "ID": ["123456", "1234567", "12345678"],
...             "Example1": [1, 6, 3],
...             "Example2": [1, 2, 3],
...         }
...     )
...     compare_df = pl.DataFrame(
...         {
...             "ID": ["123456", "1234567", "12345678"],
...             "Example1": [1, 6, 9],
...             "Example2": [1, 2, 3],
...         }
...     )
...     comparison = compare(["ID"], base_df, compare_df)
...     if not comparison.is_equal():
...         raise Exception(comparison.report())
...
>>> test_example() # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 18, in test_example
Exception: --------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------
No Schema differences found.
--------------------------------------------------------------------------------
No Row differences found (when joining by the supplied id_columns).
--------------------------------------------------------------------------------

VALUE DIFFERENCES:
shape: (3, 3)
┌─────────────────────────┬───────┬────────────┐
│ Value DifferencesCountPercentage │
│ ---------        │
│ stri64f64        │
╞═════════════════════════╪═══════╪════════════╡
│ Total Value Differences116.666667  │
│ Example1133.333333  │
│ Example200.0        │
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌──────────┬──────────┬──────┬─────────┐
│ IDvariablebasecompare │
│ ------------     │
│ strstri64i64     │
╞══════════╪══════════╪══════╪═════════╡
│ 12345678Example139       │
└──────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>

To DO:

  • Linting (Ruff)
  • Make into python package
  • Add makefile for easy linting and tests
  • Statistics should indicate which statistics are referencing columns
  • Add all statistics frame to tests
  • Add schema differences to schema summary
  • Make row examples alternate between base only and compare only so that it is more readable.
  • Add limit value to the examples.
  • Updated value differences summary so that Statistic is something that makes sense.
  • Publish package to pypi
  • Add difference criterion.
  • Add license
  • Make package easy to use (i.e. so you only have to import pl_compare and then you can us pl_compare)
  • Add table name labels that can replace 'base' and 'compare'.
  • Update code to use a config dataclass that can be passed between the class and functions.
  • Write up docstrings
  • Write up readme (with code examples)
  • Add parameter to hide column differences with 0 differences.
  • Add flag to indicate if there are differences between the tables.
  • Update report so that non differences are not displayed.
  • Seperate out dev dependencies from library dependencies?
  • Change 'threshold' to be equality resolution.
  • strict MyPy type checking
  • Raise error and print examples if duplicates are present.
  • Add total number of value differences to the value differences summary.
  • Add percentage column so the value differences summary.
  • Change id_columns to be named 'join_columns'
  • Github actions for publishing
  • Update the duplication validation.
  • Fix report output when tables are exactly equal.
  • Github actions for testing
  • Github actions for linting
  • [] Test for large amounts of data
  • [] Benchmark for different sizes of data.
  • [] Investigate use for very large datasets 50GB-100GB. Can this be done using LazyFrames only?

Ideas:

  • [] Simplify custom equality checks and add example.
  • [] Add a count of the number of rows that have any differences to the value differences summary.
  • [] add a test that checks that abritrary join conditions work.