
Indexing package and scripts for the File Catalog

pip install wipac-file-catalog-indexer==2.0.0


How To


from indexer.index import index

  • The flagship indexing function
  • Find files rooted at given path(s), compute their metadata, and upload it to File Catalog
  • Configurable for multi-processing (default: 1 process) and recursive file-traversing (default: on)
  • Internally communicates asynchronously with File Catalog
  • Note: Symbolic links are never followed.
  • Note: index() runs the current event loop (asyncio.get_event_loop().run_until_complete())
  • Ex:
	index_config,  # see config.py for a description of the fields in these typed dictionaries

from indexer.index import index_file

  • Compute metadata of a single file, and upload it to File Catalog, i.e. index one file
  • Single-processed, single-threaded
await index_file(

from indexer.index import index_paths

  • A wrapper around index_file() which indexes multiple files, and returns any nested sub-directories
  • Single-processed, single-threaded
  • Note: Symbolic links are never followed.
sub_dirs = await index_paths(
    paths=['/data/exp/IceCube/2018/filtered/level2/0820', '/data/exp/IceCube/2018/filtered/level2/0825'],

from indexer.metadata_manager import MetadataManager

  • The internal brain of the Indexer. It has minimal guardrails, does not communicate with File Catalog, and does not traverse the directory tree.
  • Metadata is produced for one file at a time.
  • Ex:
manager = MetadataManager(...)  # caches connections & directory info, manages metadata collection
metadata_file = manager.new_file(filepath)  # returns an instance (computationally light)
metadata = metadata_file.generate()  # returns a dict (computationally intense)


python -m indexer.index
  • A command-line alternative to using from indexer.index import index
  • Use with -h to see usage.
  • Note: Symbolic links are never followed.
python -m indexer.generate
  • Like python -m indexer.index, but prints (using pprint) the metadata instead of posting to File Catalog.
  • In short, it wraps file-traversing logic around calls to indexer.metadata_manager.MetadataManager
  • Note: Symbolic links are never followed.
python -m indexer.delocate
  • Find files rooted at given path(s); for each, remove the matching location entry from its File Catalog record.
  • Note: Symbolic links are never followed.
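
The "symbolic links are never followed" rule shared by these entry points can be pictured with a short traversal sketch. This is an illustration built on the standard library, not the package's own code; the name iter_files and its recursive parameter are assumptions made for the example.

```python
import os
from typing import Iterator, List


def iter_files(roots: List[str], recursive: bool = True) -> Iterator[str]:
    """Yield regular files under each root, never following symbolic links.

    Hypothetical helper for illustration; not part of the indexer package.
    """
    for root in roots:
        if os.path.islink(root):
            continue  # a symlinked root is skipped outright
        if os.path.isfile(root):
            yield root
            continue
        # followlinks=False keeps os.walk from descending into symlinked dirs
        for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if not os.path.islink(path):  # skip symlinked files too
                    yield path
            if not recursive:
                dirnames.clear()  # stop os.walk from descending any further
```

Passing recursive=False limits the walk to the top level of each root, which mirrors the "recursive file-traversing (default: on)" option described for index().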

.i3 File Processing-Level Detection and Embedded Filename-Metadata Extraction

Regular expressions are used heavily to detect the processing level of a .i3 file and to extract any metadata embedded in its filename. The exact process depends on the type of data:

Real Data (/data/exp/*)

This is a two-stage process (see MetadataManager._new_file_real()):

  1. Processing-Level Detection (Base Pattern Screening)
    • The filename is matched against multiple generic patterns to detect whether it is L2, PFFilt, PFDST, or PFRaw
    • If the filename does not trigger a match, only basic metadata is collected (logical_name, checksum, file_size, locations, and create_date)
  2. Embedded Filename-Metadata Extraction
    • After the processing level is known, the filename is parsed using one of (possibly) several tokenizing regex patterns for the best match (greedy matching)
    • If the filename does not trigger a match, the function will raise an exception (script will exit). This probably indicates that a new pattern needs to be added to the list.
      • see indexer.metadata.real.filename_patterns
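
The two stages above can be sketched as follows. Every regex here is HYPOTHETICAL and written only for this example (the real patterns live in indexer.metadata.real.filename_patterns), as are the function names. The sketch also assumes that "best match" can be approximated by trying the most specific pattern first.

```python
import re
from typing import Dict, Optional

# HYPOTHETICAL patterns for illustration only; see
# indexer.metadata.real.filename_patterns for the real ones.
BASE_PATTERNS = {
    "L2": re.compile(r".*Level2.*\.i3(\.(gz|bz2|zst))?$"),
    "PFFilt": re.compile(r".*PFFilt.*\.tar\.(gz|bz2|zst)$"),
}
TOKEN_PATTERNS = {
    "L2": [  # ordered most specific first
        re.compile(
            r"Level2_IC(?P<strings>\d+)\.(?P<year>\d{4})_data"
            r"_Run(?P<run>\d+)_Subrun(?P<subrun>\d+)_(?P<part>\d+)\."
        ),
        re.compile(
            r"Level2_IC(?P<strings>\d+)\.(?P<year>\d{4})_data"
            r"_Run(?P<run>\d+)_Subrun(?P<part>\d+)\."
        ),
    ],
    "PFFilt": [
        re.compile(
            r"PFFilt_\w+?_Run(?P<run>\d+)_Subrun(?P<subrun>\d+)_(?P<part>\d+)\."
        ),
    ],
}


def detect_level(filename: str) -> Optional[str]:
    """Stage 1: screen the filename against the generic base patterns."""
    for level, pattern in BASE_PATTERNS.items():
        if pattern.match(filename):
            return level
    return None


def parse_real_filename(filename: str) -> Optional[Dict[str, str]]:
    """Return the extracted tokens, or None if only basic metadata applies."""
    level = detect_level(filename)
    if level is None:
        return None  # caller collects basic metadata only
    # Stage 2: tokenize with the first pattern that matches
    for pattern in TOKEN_PATTERNS[level]:
        match = pattern.match(filename)
        if match:
            return {"processing_level": level, **match.groupdict()}
    # Level detected but no tokenizer fits: a new pattern is probably needed
    raise ValueError(f"no {level} filename pattern matches {filename!r}")
```

Raising on a stage-2 miss, rather than silently falling back to basic metadata, is what surfaces a missing pattern early.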

Simulation Data (/data/sim/*)

This is a three-stage process (see MetadataManager._new_file_simulation()):

  1. Base Pattern Screening
    • The filename is checked for .i3 file extensions: .i3, .i3.gz, .i3.bz2, .i3.zst
    • If the filename does not trigger a match, only basic metadata is collected (logical_name, checksum, file_size, locations, and create_date)
      • there are a couple of hard-coded "anti-patterns" used for rejecting known false positives (see code)
  2. Embedded Filename-Metadata Extraction
    • The filename is parsed using one of MANY (around a thousand) tokenizing regex patterns for the best match (greedy matching)
    • If the filename does not trigger a match, the function will raise an exception (script will exit). This probably indicates that a new pattern needs to be added to the list.
      • see indexer.metadata.sim.filename_patterns
  3. Processing-Level Detection
    • The filename is parsed for substrings corresponding to a processing level
      • see DataSimI3FileMetadata.figure_processing_level()
    • If there is no match, processing_level will be set to None, since the processing level is less important for simulation data.
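
A compact sketch of the three stages above. The extension list comes from the text; the single tokenizing pattern, the substring-to-level table, and the function name are HYPOTHETICAL stand-ins for the real ones (see indexer.metadata.sim.filename_patterns and DataSimI3FileMetadata.figure_processing_level()). The anti-patterns are left out for brevity.

```python
import re
from typing import Dict, Optional

I3_EXTENSIONS = (".i3", ".i3.gz", ".i3.bz2", ".i3.zst")

# HYPOTHETICAL: one illustrative tokenizer instead of the ~1000 real ones
TOKEN_PATTERNS = [
    re.compile(r"(?P<generator>[A-Za-z]+)\.(?P<dataset>\d{6})\.(?P<job>\d{6})\.i3"),
]
# HYPOTHETICAL substring -> processing-level table
LEVEL_SUBSTRINGS = {"level2": "L2", "level3": "L3"}


def parse_sim_filename(filename: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the extracted tokens, or None if only basic metadata applies."""
    # Stage 1: base pattern screening on the file extension
    if not filename.endswith(I3_EXTENSIONS):
        return None
    # Stage 2: embedded filename-metadata extraction
    for pattern in TOKEN_PATTERNS:
        match = pattern.search(filename)
        if match:
            tokens: Dict[str, Optional[str]] = dict(match.groupdict())
            break
    else:
        raise ValueError(f"no simulation filename pattern matches {filename!r}")
    # Stage 3: processing-level detection by substring; None when nothing fits
    lowered = filename.lower()
    tokens["processing_level"] = next(
        (level for sub, level in LEVEL_SUBSTRINGS.items() if sub in lowered),
        None,
    )
    return tokens
```

Note the asymmetry with real data: a missing processing level is tolerated here (None), while a missing tokenizer match is still an error.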

Metadata Schema



Re-indexing Files is Tricky (Two Scenarios)

  1. Indexing files that have not changed locations is okay--this probably means that the rest of the metadata has also not changed. A guardrail query checks whether the file already exists in the FC with that locations entry; if it does, the file is not processed further.
  2. HOWEVER, don't point the indexer at restored files (of the same file-version), i.e. files whose initial locations entry was removed (removed from WIPAC, then moved back). Unlike an unchanged file, a restored file is fully processed locally (opened, read, and check-summed) before the checksum conflict is encountered and processing aborts. These files are skipped (not sent to FC) unless you use --patch, which replaces the locations list wholesale and is DANGEROUS.
    • Example Conflict: a file-version can still exist in FC after passing the initial guardrails
      1. file was at WIPAC & indexed
      2. then moved to NERSC (location added) & deleted from WIPAC (location removed)
      3. file was brought back to WIPAC
      4. now is being re-indexed at WIPAC
      5. CONFLICT -> has the same logical_name+checksum.sha512 but differing locations
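
The two scenarios can be modeled with a small in-memory sketch. It is not the indexer's implementation: the record shape (logical_name, checksum.sha512, and locations as site/path dicts), the function plan_reindex, and its return strings are all assumptions made for illustration, with a plain list standing in for File Catalog.

```python
from typing import Callable, Dict, List


def plan_reindex(
    records: List[Dict],
    logical_name: str,
    location: Dict[str, str],
    compute_sha512: Callable[[], str],
    patch: bool = False,
) -> str:
    """Decide what happens to one file (hypothetical model of the guardrails)."""
    # Scenario 1: cheap guardrail query, made before any local processing
    for rec in records:
        if rec["logical_name"] == logical_name and location in rec["locations"]:
            return "skipped-unchanged"
    # Past this point the file is opened, read, and check-summed (expensive)
    sha512 = compute_sha512()
    # Scenario 2: same file-version already in FC, but under other locations
    for rec in records:
        if rec["logical_name"] == logical_name and rec["checksum"]["sha512"] == sha512:
            if not patch:
                return "skipped-conflict"
            rec["locations"] = [location]  # wholesale replacement: DANGEROUS
            return "patched"
    records.append(
        {
            "logical_name": logical_name,
            "checksum": {"sha512": sha512},
            "locations": [location],
        }
    )
    return "indexed"
```

In this model, patching a restored file drops every other location on the record (the NERSC entry in the example conflict), which is why --patch deserves caution.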