datasette-sqlite-regex

A fast regular expression SQLite extension, written in Rust


Keywords
sqlite, sqlite-extension
Licenses
MIT/Apache-2.0
Install
pip install datasette-sqlite-regex==0.2.0

Documentation

sqlite-regex

A fast and performant SQLite extension for regular expressions. Based on sqlite-loadable-rs, and the regex crate.

See Introducing sqlite-regex: The fastest Regular Expression Extension for SQLite (Jan 2023) for more details!

If your company or organization finds this library useful, consider supporting my work!

Usage

.load ./regex0
select 'foo' regexp 'f';

Find all occurrences of a pattern in a string

select regex_find(
  '[0-9]{3}-[0-9]{3}-[0-9]{4}',
  'phone: 111-222-3333'
);
-- '111-222-3333'

select rowid, *
from regex_find_all(
  '\b\w{13}\b',
  'Retroactively relinquishing remunerations is reprehensible.'
);
/*
β”Œβ”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ rowid β”‚ start β”‚ end β”‚     match     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 0     β”‚ 0     β”‚ 13  β”‚ Retroactively β”‚
β”‚ 1     β”‚ 14    β”‚ 27  β”‚ relinquishing β”‚
β”‚ 2     β”‚ 28    β”‚ 41  β”‚ remunerations β”‚
β”‚ 3     β”‚ 45    β”‚ 58  β”‚ reprehensible β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
*/

Extract capture group values by index or name

select
  regex_capture(captures, 0)        as entire_match,
  regex_capture(captures, 'title')  as title,
  regex_capture(captures, 'year')   as year
from regex_captures(
  regex("'(?P<title>[^']+)'\s+\((?P<year>\d{4})\)"),
  "'Citizen Kane' (1941), 'The Wizard of Oz' (1939), 'M' (1931)."
);
/*
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚       entire_match        β”‚      title       β”‚ year β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€
β”‚ 'Citizen Kane' (1941)     β”‚ Citizen Kane     β”‚ 1941 β”‚
β”‚ 'The Wizard of Oz' (1939) β”‚ The Wizard of Oz β”‚ 1939 β”‚
β”‚ 'M' (1931)                β”‚ M                β”‚ 1931 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
*/

Use RegexSets to match a string on multiple patterns in linear time

select regexset_is_match(
  regexset(
    "bar",
    "foo",
    "barfoo"
  ),
  'foobar'
)

Split the string on the given pattern delimiter

select rowid, *
from regex_split('[ \t]+', 'a b     c d    e');
/*
β”Œβ”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ rowid β”‚ item β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€
β”‚ 0     β”‚ a    β”‚
β”‚ 1     β”‚ b    β”‚
β”‚ 2     β”‚ c    β”‚
β”‚ 3     β”‚ d    β”‚
β”‚ 4     β”‚ e    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
*/

Replace occurrences of a pattern with another string

select regex_replace(
  '(?P<last>[^,\s]+),\s+(?P<first>\S+)',
  'Springsteen, Bruce',
  '$first $last'
);
-- 'Bruce Springsteen'

select regex_replace_all('a', 'abc abc', '');
-- 'bc bc'

Documentation

See docs.md for a full API reference.

Installing

Language Install
Python pip install sqlite-regex PyPI
Datasette datasette install datasette-sqlite-regex Datasette
Node.js npm install sqlite-regex npm
Deno deno.land/x/sqlite_regex deno.land/x release
Ruby gem install sqlite-regex Gem
Github Release GitHub tag (latest SemVer pre-release)
Rust cargo add sqlite-regex Crates.io

The Releases page contains pre-built binaries for Linux x86_64, MacOS, and Windows.

As a loadable extension

If you want to use sqlite-regex as a Runtime-loadable extension, Download the regex0.dylib (for MacOS), regex0.so (Linux), or regex0.dll (Windows) file from a release and load it into your SQLite environment.

Note: The 0 in the filename (regex0.dylib/ regex0.so/regex0.dll) denotes the major version of sqlite-regex. Currently sqlite-regex is pre v1, so expect breaking changes in future versions.

For example, if you are using the SQLite CLI, you can load the library like so:

.load ./regex0
select regex_version();
-- v0.1.0

Or in Python, using the builtin sqlite3 module:

import sqlite3
con = sqlite3.connect(":memory:")
con.enable_load_extension(True)
con.load_extension("./regex0")
print(con.execute("select regex_version()").fetchone())
# ('v0.1.0',)

Or in Node.js using better-sqlite3:

const Database = require("better-sqlite3");
const db = new Database(":memory:");
db.loadExtension("./regex0");
console.log(db.prepare("select regex_version()").get());
// { 'regex_version()': 'v0.1.0' }

Or with Datasette:

datasette data.db --load-extension ./regex0

Supporting

I (Alex πŸ‘‹πŸΌ) spent a lot of time and energy on this project and many other open source projects. If your company or organization uses this library (or you're feeling generous), then please consider supporting my work, or share this project with a friend!

See also