A library to walk through tar archives, simplifying use by handling listing and decompression.
TarWalker provides a method to easily scan files somewhat like os.walk, handling compressed files, recursing through directories and scanning within tarfiles.
The library is very stable and changes are rare. It is well documented, has full unit testing (100% code coverage), and is maintained.
Notes about this library:
Two (2) classes are provided: TarWalker and TarDirWalker. The primary difference is that TarWalker will raise an exception if given a directory.
Install the package using pip, e.g.:
pip install --user tarwalker
pip3 install --user tarwalker
The following is a simple tool that looks for a given string within files. Files can be given as arguments or found within tarballs, and their names must end with either '.log' (with an optional numeric suffix) or '.txt':
import re
import sys
from tarwalker import TarWalker

# Match names ending in '.txt' or '.log', the latter with an optional numeric suffix.
PATTERN = re.compile(r'.*\.(txt|log(\.\d+)?)$')


def handler(fileobj, filename, arch, info, match):
    # Report a file once, at its first line containing the search text.
    try:
        for line in fileobj:
            if text in line:
                path = (arch + ':') if arch else ''
                print("Found in: " + path + filename)
                return
    except IOError:
        pass


text = sys.argv[1]
walker = TarWalker(file_handler=handler, name_matcher=PATTERN.match, recurse=False)
for arg in sys.argv[2:]:
    walker.handle_path(arg)
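To try the tool above without hunting for a suitable archive, a small gzip-compressed tarball can be built with the standard library's tarfile module. This is only a sketch: the helper name make_sample_tarball, the archive name 'sample.tar.gz', and the member name 'logs/app.log.1' are all made up for illustration.

```python
import io
import os
import tarfile
import tempfile


def make_sample_tarball(directory):
    """Write a tiny .tar.gz holding one log file and return its path."""
    path = os.path.join(directory, 'sample.tar.gz')
    data = b"first line\nneedle in a haystack\n"
    with tarfile.open(path, 'w:gz') as archive:
        # The member name ends in '.log.1', so PATTERN above would accept it.
        info = tarfile.TarInfo(name='logs/app.log.1')
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))
    return path


if __name__ == '__main__':
    print(make_sample_tarball(tempfile.mkdtemp()))
```

The printed path can then be passed to the search tool as one of its file arguments, with "needle" as the search string.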
The constructors of TarWalker and TarDirWalker take the same parameters. Note that at most one of file_matcher or name_matcher is allowed.
file_handler (required) a callable taking five (5) positional parameters:
fileobj - a readable file object for the file contents.
filepath - a str with the filename, which is one of:
- the file path given to handle_path(),
- the path of a file found beneath a directory given to handle_path(), or
- the path of a file within an expanded tar archive.
archname - a str path of the tar archive name, when handling a file found within a tar archive. It will be a colon (':') separated list if reading a recursive tar archive.
fileinfo - may be None or an object with the following attributes. See os.stat for more details:
- name - the str name of the file,
- size - the size of the file in bytes,
- mtime - modification time, in POSIX (epoch) time,
- mode - the file permission bits,
- uid - the file owner's User ID, and
- gid - the file owner's Group ID
match - the value returned from the name_matcher or file_matcher call.
NOTE: files with a compression suffix will have the suffix removed, and the file object will return decompressed contents. For example, for "foo.txt.gz" filepath would be "foo.txt" and fileobj would be the equivalent contents of "foo.txt".
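As a rough illustration of the five parameters described above, the sketch below builds a label from filepath, archname, and fileinfo, and counts the lines in fileobj. The helper describe() is hypothetical and not part of tarwalker. The final call uses stand-in arguments rather than a real TarWalker run, and it assumes archname is falsy for files outside an archive, as the earlier example implies.

```python
import io
from types import SimpleNamespace


def describe(filepath, archname, fileinfo):
    """Build a one-line label from the arguments a file_handler receives."""
    # archname is assumed falsy for files that were not found inside a tar archive.
    location = (archname + ':' + filepath) if archname else filepath
    # fileinfo may be None, so only read its attributes when it is present.
    if fileinfo is None:
        return location
    return "{} ({} bytes)".format(location, fileinfo.size)


def handler(fileobj, filepath, archname, fileinfo, match):
    # Counting lines works whether fileobj yields str or bytes.
    count = sum(1 for _ in fileobj)
    print("{}: {} lines".format(describe(filepath, archname, fileinfo), count))


# Stand-in call with fake arguments, in place of a real TarWalker run.
handler(io.StringIO("a\nb\n"), "foo.txt", "outer.tar:inner.tar",
        SimpleNamespace(size=4), True)
```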
file_matcher (optional) a callable that takes two (2) positional parameters and returns true if the file should be opened and passed to the file_handler callback:
- filepath - See filepath above.
- fileinfo - See fileinfo above.
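A file_matcher can combine the name with the metadata in fileinfo. The sketch below accepts only log files under a size limit. The names small_log_matcher and MAX_BYTES, and the 10 MiB limit, are made up for illustration. It returns the regex match object so that the handler's last parameter carries something useful.

```python
import re
from types import SimpleNamespace

LOG_NAME = re.compile(r'.*\.log(\.\d+)?$')
MAX_BYTES = 10 * 1024 * 1024  # hypothetical size limit


def small_log_matcher(filepath, fileinfo):
    """Return a truthy match object for log files no larger than MAX_BYTES."""
    match = LOG_NAME.match(filepath)
    if match is None:
        return None
    # fileinfo may be None; in that case the size check is skipped.
    if fileinfo is not None and fileinfo.size > MAX_BYTES:
        return None
    return match


# Stand-in fileinfo objects; a real run would supply them automatically.
print(bool(small_log_matcher('logs/app.log.2', SimpleNamespace(size=2048))))
print(bool(small_log_matcher('logs/app.log.2', SimpleNamespace(size=MAX_BYTES + 1))))
```

Such a function would be passed as file_matcher=small_log_matcher, in place of name_matcher, since at most one of the two is allowed.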
name_matcher (optional) a callable that takes one (1) positional parameter and returns true if the file should be opened and passed to file_handler:
- filepath - See file_handler, above.
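A name_matcher does not have to be a compiled regex's match method; any callable taking the path will do. The sketch below uses shell-style globs from the standard library's fnmatch module. The names GLOBS and glob_matcher are hypothetical, and the globs are slightly looser than the regex in the earlier example (for instance, '*.log.[0-9]*' also accepts 'x.log.1bak').

```python
import fnmatch

# Shell-style patterns; note that '*' in fnmatch also matches '/' characters.
GLOBS = ('*.txt', '*.log', '*.log.[0-9]*')


def glob_matcher(filepath):
    """Return the first glob that matches filepath, or None if none does."""
    for pattern in GLOBS:
        if fnmatch.fnmatchcase(filepath, pattern):
            return pattern
    return None


print(glob_matcher('var/log/app.log.3'))
print(glob_matcher('image.png'))
```

Because the return value is handed to file_handler as its last argument, the handler can tell which glob caused the file to be selected.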
recurse (optional) If true, the algorithm will recurse into tarballs found within other tarballs. Furthermore, if recurse is a callable it will be called before and after opening an interior tarball, with four (4) positional parameters:
- start - a bool that indicates recursion into the given tarball is starting; it is False on the second call.
- tarname - name of the contained (interior) tarball, see filepath above.
- archive - the name of the containing (exterior) tarball, see archname above.
- fileinfo - See fileinfo above.
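The before-and-after calls described above lend themselves to progress logging. The sketch below records an indented entry each time an interior tarball is entered or left. The factory name make_recursion_logger is hypothetical. The text above does not say whether the callable's return value is consulted, so this sketch returns True in both cases as a precaution; treat that as an assumption.

```python
def make_recursion_logger(log):
    """Return a callable suitable for 'recurse' that appends entries to log."""
    depth = 0

    def on_recurse(start, tarname, archive, fileinfo):
        nonlocal depth
        if start:
            # First call: recursion into the interior tarball is starting.
            log.append("{}enter {} (inside {})".format('  ' * depth, tarname, archive))
            depth += 1
        else:
            # Second call: the interior tarball has been processed.
            depth -= 1
            log.append("{}leave {}".format('  ' * depth, tarname))
        # Assumption: a true result is harmless if the return value is ignored.
        return True

    return on_recurse


# Stand-in calls with fake names, in place of a real TarWalker run.
events = []
callback = make_recursion_logger(events)
callback(True, 'inner.tar', 'outer.tar', None)
callback(False, 'inner.tar', 'outer.tar', None)
print('\n'.join(events))
```

The resulting callable would be passed as recurse=make_recursion_logger(events) when constructing the walker.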
If you think you have found a defect, or wish to request an enhancement, please report it via the GitLab issues page.