Utility to split aggregated logs from Apache Hadoop Yarn applications into a folder hierarchy


Keywords
hadoop, python
License
MIT
Install
pip install splitlog==3.0.2

Documentation

splitlog

Hadoop Yarn application logs aggregate all container logs of a Yarn application into a single file. This makes it very difficult to use Unix command line tools to analyze these logs: Grep will search over all containers and context provided for hits often does not include Yarn container name or host name. splitlog splits a combined logfile for all containers of an application into a file system hierarchy suitable for further analysis:

out
└── hadoopnode
    ├── container_1671326373437_0001_01_000001
    │   ├── directory.info
    │   ├── launch_container.sh
    │   ├── prelaunch.err
    │   ├── prelaunch.out
    │   ├── stderr
    │   ├── stdout
    │   └── syslog
    ├── container_1671326373437_0001_01_000002
    │   ├── directory.info
    │   ├── launch_container.sh
    │   ├── prelaunch.err
    │   ├── prelaunch.out
    │   ├── stderr
    │   ├── stdout
    │   └── syslog
    └── container_1671326373437_0001_01_000003
        ├── directory.info
        ├── launch_container.sh
        ├── prelaunch.err
        ├── prelaunch.out
        ├── stderr
        ├── stdout
        └── syslog

4 directories, 21 files

Installation

Python 3.7+ must be available. Installation via pipx:

pipx install splitlog

How to use

Read logs from standard input:

yarn logs -applicationId application_1582815261257_232080 | splitlog

Read logs from file application_1582815261257_232080.log:

splitlog -i application_1582815261257_232080.log