About
Tablemap is a useful Python data wrangling tool for situations where Pandas or SQL may feel cumbersome when dealing with tasks that go beyond their typical routine.
Instead of going the long way around with Pandas, where a table is simply a list of dictionaries, Tablemap offers a more efficient alternative. It allows you to effortlessly chain together processes on tables without the need to excessively rely on stackoverflow.com for complex Pandas operations.
In addition, this tool is designed for those who want a simpler solution that minimizes memory concerns.
Installation
Requires only built-in Python libraries, without any external dependencies.
pip install tablemap
Tutorial
Saving tables in the database
Let's create a table t1
in sample.db
.
from tablemap import Conn, Rows
t1 = [
{'col1': 'a', 'col2': 4},
{'col1': 'a', 'col2': 5},
{'col1': 'b', 'col2': 1},
]
conn = Conn('sample.db')
conn['t1'] = t1
The right-hand side of the assignment can consist of a list of dictionaries, an iterator that yields dictionaries, or an object fetched from the connection (referred to as the Rows object, which will be introduced shortly). For example, you can use conn['t1']
and then chain table-manipulating methods such as map
, update
, by
and more.
In this context, each dictionary represents a row in a table. For instance, {'col1': 'a', 'col2': 4}
represents a row with two columns, col1
and col2
.
The process of opening and closing the database is handled safely in the background.
To browse tables in the database,
rs = conn['t1']
print(rs)
# to convert it into a familiar data structure,
print(rs.list())
# to display only a portion of it
rs1 = rs[1:]
print(rs1)
# rs[1:] creates a new object, while rs remains unmodified.
print(rs.size(), rs1.size())
# This is equivalent to
print(len(rs), len(rs1))
If you prefer a graphical user interface (GUI), you can open the sample.db
file using software such as SQLiteStudio or DB Browser for SQLite.
Once you have cleaned up the table, you may choose to proceed with the analysis using Pandas.
import pandas as pd
df = pd.DataFrame(conn['t1'].list())
conn['t1_copy'] = df.to_dict('records')
Rows objects
The conn['t1']
expression returns a Rows
object, which represents a list of dictionaries. There are two ways to create Rows objects:
- By passing a table name to the Conn object, such as
conn['t1']
. - By directly passing a list of dictionaries or a dictionary-yielding iterator to the Rows class, for example,
Rows(t1)
.
When you pass a column name to a Rows object, it returns a list of elements for that column. For example, Rows(t1)['col1']
would result in ['a', 'a', 'b']
.
Rows objects provide methods that can be chained together to transform a table.
Methods for table manipulation
-
chain
To concatenate the t1
table with itself and create a new table called t1_double
in the database, you can use the chain
method provided by the Rows
object. Here are a couple of examples:
conn['t1_double'] = conn['t1'].chain(conn['t1'])
This code will create a new table called t1_double
in the database and populate it with the concatenated result of t1
with itself.
Alternatively, if you already have a list of dictionaries or an iterator that yields dictionaries named t1
, you can pass it as an argument to the chain
method:
conn['t1_double'] = conn['t1'].chain(t1)
Make sure that the tables being concatenated (t1
and t1
in this case) have the same columns. The order of the columns does not matter for concatenation to work correctly.
-
Some of the properties of this module that make data-wrangling easier
-
All the methods in this section create a new
Rows
object.rs = conn['t1'] rs1 = rs.chain(t1)
rs
andrs1
are different objects, sors
is notchain
ed.t1 = Rows(t1) t1_listed = t1.list() t1_listed[0]['col1'] = 'x' # Create a new list from t1, and it's a fresh one. t1.list()[0]['col1'] != 'x'
-
rs
(orrs1
) does not contain any data in the table, yet. It simply holds instructions and is executed when it's needed. (when you want it to be saved in the database, to be printed out, to be listed up, or simply to get the size of it)So you can easily combine all the methods safely and freely, for example, (
filter
is not covered yet, hopefully, it's self-evident.)rs = conn['t1'] high = rs.filter(lambda r: r['col2'] > 4) low = rs.filter(lambda r: r['col2'] < 2) rs2 = high.chain(low)
Since
rs2
simply holds instructions without actually performing operations, the above code requires very little computing power unless you want to save it in the database or see the result for yourself. -
Memory requirement is minimal.
conn['t1_1'] = rs2
Now it actually works because you are trying to save the rows
rs2
generates in the tablet1_1
. Still,tablemap
does not load up all ofrs2
on memory. It loads and saves one-by-one. -
Opens and closes the database automatically and safely. Users don't have to worry about it. Even the keyboard interrupts (like ctrl-c) during the table insertion do not corrupt the database.
-
-
filter
andupdate
Each row is simply a dictionary with column names as keys, so you can access a column value by passing a column name to the row (dictionary). To create new columns or update the existing ones,
# \ for line-continuation
conn['t1_1'] = conn['t1']\
.filter(lambda r: r['col2'] > 2)\
.update(
col2=lambda r: r['col2'] + 1,
col3=lambda r: r['col1'] + str(r['col2'])
)
A lambda expression is a nameless function. In the expression lambda r: r['col2'] > 2
, the parameter r
represents a single dictionary and the whole expression returns an evaluated value of r['col2'] > 2
for each iteration.
Columns are updated sequentially, so col3
has a5
and a6
, not a4
and a5
.
-
by
andfold
To sum up col2
grouped by col1
,
conn['t1_col2sum_groupby_col1'] = conn['t1'].by('col1')\
.fold(
col2_sum=lambda rs: sum(rs['col2']),
col2_min=lambda rs: min(rs['col2']),
col2_max=lambda rs: max(rs['col2']),
)
by
takes fields as an argument (list of field names or comma-separated field names) for grouping, and the next process (fold
in this case) takes on each group (a Rows
object).
In the expression lambda rs: sum(rs['col2'])
, the parameter rs
represents a Rows
object. So rs['col2']
returns a list of elements in the column col2
. And of course, for that reason, you may chain up all the methods in this section.
While update
works on a dictionary, fold
does on a Rows
object. (fold
folds n rows to one row. So the lambda expression in fold
must return a single value, like a string or a number.)
fold
must be preceded by grouping methods such as by
or windowed
which shows up soon. filter
may or may not be preceded by grouping methods.
-
rename
To replace old column names with new ones,
conn['t1_1'].rename(
c2min='col2_min',
c2max='col2_max'
)
-
join
To merge tables,
conn['t2'] = [
{'col1': 'b', 'col3': -1},
{'col1': 'c', 'col3': 3},
{'col1': 'b', 'col3': ''},
]
conn['t1_col3'] = conn['t1'].by('col1')\
.join(conn['t2'].by('col1'), 'full')
There are 4 join types, 'inner', 'left', 'right', and 'full'. The default is 'inner'. You may want to check this tutorial if you are not familiar with these terms.
Tables must be grouped to be joined.
If the table t1
and t2
have columns with the same name, t1
columns will be updated with t2
columns.
Empty strings represent missing values.
-
distinct
To group the table t1
by col1
and to leave only the first row in each group (removing duplicates),
conn['t1_1'] = conn['t1'].distinct('col1')
You can pass multiple columns to distinct
as in by
-
select
anddeselect
You can pass columns to select
or deselect
to pick up or delete specific columns in a table
conn['t1_1'] = conn['t1'].update(col3=lambda r: r['col2'] + 1)\
.deselect('col1, col2')
To take the first 2 rows from table t1
,
print(conn['t1'][:2])
Negative values are not supported. Of course, you can chain up other methods after slicing.
Like the other methods, slicing does not execute the operation. conn['t1'][:2]
holds the instruction to take the first two rows, not the rows themselves. However, the print
function enforces taking the first two rows to print out on the screen. So it works as expected.
Grouping methods like by
or windowed
may come right before slicing, and the rows will be flattened.
conn['t1']['col1']
is not slicing; it returns a list of column values, not a Rows
object.
-
takewhile
anddropwhile
takewhile
and dropwhile
take a predicate (a function that returns a value to be considered True
or False
, already seen it in filter
) as an argument to do what these names suggest. Refer to itertools.takewhile and itertools.dropwhile
Grouping methods may be preceded right before these methods.
-
map
When update
or fold
is not powerful enough, you can deal with a row or Rows
in a more sophisticated way.
# Some of you may feel uncomfortable with the naming.
# This is just a lambda function for the 'map' method.
# It can be challenging to justify spending time on naming a function that is used nowhere else.
def fn4t1(rs):
# `rs` is a `Rows` object.
# Now you can apply all the methods in this section to manipulate the table.
# And once again, since these methods create a new `Rows` object instead of modifying the original,
# you can safely build any combinations of methods as you want.
tot = sum(rs['col2'])
# You don't always have to pass a function to the `update` method. The same applies to the `fold` method.
return rs[:1].update(col2_sum=tot)
conn['t1_col2sum_groupby_col1'] = conn['t1'].by('col1')\
.map(fn4t1)\
.deselect('col2')
The argument for map
is a function that returns a Rows
object or a single dictionary or None. It takes a single dictionary as an argument or a Rows
object in case the previous process is by
(group
) or windowed
.
-
zip
andzip_longest
Like chain
, zip
takes a list of dictionaries or an iterator that yields dictionaries or a Rows
object as an argument. The argument updates the Rows
object row by row until either one is depleted.
With zip
, the above fn4t1
can be rewritten as
def fn4t1(rs):
rs2 = [{'col2_sum': sum(rs['col2'])}]
return rs.zip(rs2)
Another example,
conn['t1_1'] = conn['t1'].zip({'idx': i} for i in range(100))
zip_longest
creates empty columns when either one is depleted.
-
merge
The same interface as join
. While join
combines cross-producted rows from two tables,
merge
simply zip_longest
them. For example, when join
combines 2 rows against 3 rows in each group,
6 rows are generated while merge
produces only 3 rows.
-
index
To add an index column,
conn['t1'].index('index_column', start=1, step=2)
conn['t1'].by('col1').index('group_index_column')
-
windowed
When you need to group a chunk of consecutive rows,
conn['t1_1'] = conn['t1'].windowed(4, 2).fold(
sum=lambda rs: sum(rs['col2'])
)
# It works with `by`, for example.
conn['t1'].by('col1').windowed(3).index('cnt')
fold
takes the first 4 consecutive rows (of course a Rows
object) and the next 4 starting from the 3rd (skipping 2 rows) and so on. When rows less than or equal to 4 are left, it will be the last.
Grouped rows can also be windowed.
print(conn['t1'].by('col1').windowed(3).index('group_no'))
-
order
andgroup
Actually, by
is a combination of order
and group
. You can control more precisely by separating these processes,
conn['t1_col2sum_groupby_col1'] = conn['t1']\
.order('col1, col2 desc').group('col1')\
.map(fn4t1)
Now, map
takes a Rows
object where col2
is sorted in descending order.
The keyword desc
can be either uppercased, lowercased, or mixed.
The ascending order is the default.
Regarding the group
method, if no argument is provided, it will group all the preceding rows.
-
split
# `xs` is a list of `Rows`.
xs = conn['t1'].by('col1').split()
-
insert
To concatenate the t1
table with itself,
conn['t1'].insert(conn['t1'])
conn['t1'].insert(t1)
insert
is special in that this is the only method that returns None.
Some remarks
-
Table name gimmmick
If the table name starts with '_', data is appended to the existing table instead of creating a new table.
-
cross-join example
# table2 = conn['t1'].list() is not effective. # You should convert it to a list. # Otherwise, 'map' attempts to fetch # the table 't2' from the database # for every group by 'col1'. table2 = conn['t2'].list() def fn4t1(rs): ...do some work using table3 return something conn['some_table'] = conn['t1'].by('col1').map(fn4t1)
-
Rows
methods do not update objects directly. They create a new object every time aRows
method is invoked.So the following code works as expected.
rs = conn['t1'] rs.by('col1').fold(col2_tot=sum(rs['col2']))
In expression
sum(rs['col2'])
,rs
represents aRows
object when the statementrs = conn['t1']
is evaluated. Methods likeby
orfold
in the statement do not affectrs
insum(rs('col2'))
.Take a close look at the next.
def fn4t1(rs): # The original `rs` is not updated. # Only `newrs` holds the instruction to update the column 'col2'. # The update instruction will not be executed in the next statement. # `newrs` simply keeps the instruction here until it's really needed, # for example, during database insertion or content print-out. newrs = rs.update(col2=lambda r: r['col2'] + 1) return newrs.order('col2').zip(rs.order('col2 desc').rename(col2_1='col2')) conn['t1_1'] = conn['t1'].by('col1').map(fn4t1)
-
Since column names are dictionary keys, they are case-sensitive. However, column names in SQLite3 (on which
tablemap
is powered) are case-insensitive by default. To avoid confusion, it is strongly recommended that you keep them lower-cased, and spaces stripped.tablemap
does not automatically convert uppercase column names. Making any excessive assumptions on users' intentions might add more confusions.