PegasusTools: An Analysis Package for Pegasus
The goal of this project was to convert some of the miscellaneous Python and MATLAB scripts used by the Pegasus PIC code community into a pip-installable Python package named PegasusTools. At the conclusion of the project, PegasusTools has utilities for reading the binary and ASCII files that Pegasus outputs, along with tools to standardize plotting. All of the existing scripts provided to me can be found here.
Usage details can be found in the documentation (also linked on GitHub). Details on how to add new things can be found in CONTRIBUTING.md.
Repo/Project Structure
The project repository is based on the Cookie template for scientific Python packages. The template came with all the infrastructure for CI/CD, pre-commit, packaging, uploading to PyPI, etc. I made minor modifications but nothing that significantly impacts usage. All current content is well tested with ~96% test coverage.
Features
All the features here are detailed in the documentation; this is just a high-level summary.
Reading NBF Files
NBF binary files from Pegasus can be read with the public `PegasusNBFData` class. This method is 15-22x faster than the previous methods used by the group to load NBF files.
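A minimal usage sketch (the import path and constructor argument are assumptions; see the documentation for the real interface):

```python
from pegasustools import PegasusNBFData  # import path is an assumption

# Hypothetical usage: load a single NBF dump produced by Pegasus.
nbf = PegasusNBFData("output/turb.00042.nbf")
```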
Performance tests were performed on a 431MB NBF file. Note that the Della results can vary significantly depending on network congestion; to account for this, I ran the test with `timeit` several times over about an hour and picked the results that showed reasonably low standard deviations.
| | Della | MacBook (M3 Max, 2TB SSD) |
|---|---|---|
| Previous | 61MB/s | 127MB/s |
| New | 910MB/s | 2.8GB/s |
| Speedup | 15x | 22x |
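For reference, the benchmark harness was roughly of the following form (the file path and reader call are illustrative):

```python
import timeit

# Time repeated loads of one NBF file and report the spread, so that runs
# with high network-induced variance can be discarded and re-run later.
times = timeit.repeat(
    stmt='PegasusNBFData("output/turb.00042.nbf")',
    setup="from pegasustools import PegasusNBFData",
    repeat=5,
    number=1,
)
print(f"best: {min(times):.3f} s, spread: {max(times) - min(times):.3f} s")
```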
Reading Spectra Files
The various spectra files from Pegasus can be read with the `PegasusSpectralData` class. This method is 41-152x faster than the previous one. Testing was done on a 3.44GB `.spec` file with a similar testing strategy on Della to the one discussed in the Reading NBF Files section.
| | Della | MacBook (M3 Max, 2TB SSD) |
|---|---|---|
| Previous | 28MB/s | 52MB/s |
| New | 1.14GB/s | 7.9GB/s |
| Speedup | 41x | 152x |
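Usage mirrors the NBF reader; a minimal sketch (import path and file name are assumptions):

```python
from pegasustools import PegasusSpectralData  # import path is an assumption

# Hypothetical usage: load a .spec file produced by Pegasus.
spectra = PegasusSpectralData("output/turb.spec")
```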
Reading Track Files
Reading particle track files is more complex than the other file types and is where I spent the largest portion of my time on this project. There are two formats: ASCII `.track.dat` files and binary `.track_mpiio_optimized` files. The ASCII files contain one particle per file, but there are typically tens of thousands of files totalling up to a few hundred gigabytes. Each binary file holds the data for all particles in a given time range, but that data is ordered by MPI rank, not particle ID, so the particle IDs are nearly randomly ordered both within one file and across the entire dataset. The total size of the binary track files is up to about 2TB.

This data then needs to be searched to find interesting particle tracks. Rather than reading these files directly during every search, I decided to first convert them into Parquet files, which are optimized for exactly this kind of data and these kinds of searches.
Due to the large data sizes, I needed a framework that supports performant larger-than-memory operations. The three candidates I found were PySpark, Dask DataFrame, and Polars. PySpark requires considerable setup that isn't feasible in all cases. I spent quite a bit of time trying to get Dask DataFrames to work properly but kept running into issues: it would run out of memory, hang, or take an absurdly long time to finish. Given these issues, I ended up using Polars, a parallel dataframe library for Python and Rust. The Polars streaming engine (for larger-than-memory operations) is currently undergoing a rewrite and is missing a few features; notably, `.sort` doesn't work in streaming mode at the moment. As such, I did much of the parallelization manually to avoid having to sort the entire dataset at once, as sketched below.
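A minimal sketch of that manual approach, assuming hypothetical column names (`pid`, `time`) and a simple ID-range partitioning scheme:

```python
import polars as pl

# Illustrative sketch: sort the Parquet dataset one particle-ID range at a
# time so no single sort has to hold the full ~2TB dataset in memory.
# The column names and partition sizes here are assumptions.
N_PARTICLES = 100_000  # total particle count (illustrative)
CHUNK = 5_000          # particles per partition, sized to fit in memory

lf = pl.scan_parquet("tracks_parquet/*.parquet")

for lo in range(0, N_PARTICLES, CHUNK):
    partition = (
        lf.filter(pl.col("pid").is_between(lo, lo + CHUNK - 1))
        .sort(["pid", "time"])  # in-memory sort of just this partition
        .collect()
    )
    partition.write_parquet(f"tracks_sorted/pids_{lo:06d}.parquet")
```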
Converting a large set of ASCII track files takes only a few minutes. Converting a 2TB binary dataset takes about 2 hours on Stellar using 16 cores and 256GB of memory. This is a one-time cost that also computes several useful quantities along the way for easier future searching. I don't have hard numbers for how much faster searching is, but my estimate is that converting to Parquet reduces a single search from hours to minutes; a sketch of such a search follows.
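For example, a search for particles that ever exceed some energy threshold becomes a lazy Polars query (again, the column names are hypothetical):

```python
import polars as pl

# Hypothetical search: find the IDs of particles whose energy ever exceeds
# a threshold. Polars only reads the columns involved and can use Parquet
# row-group statistics to skip much of the data on disk.
hot_pids = (
    pl.scan_parquet("tracks_sorted/*.parquet")
    .filter(pl.col("energy") > 10.0)
    .select("pid")
    .unique()
    .collect()
)
print(hot_pids)
```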
Reading History Files
`.hst` files can be loaded with the `load_hst_file` function, which returns a Polars DataFrame. The loading function accounts for any restarts and only uses data from the latest run. These files are small, so I didn't do any performance testing; load times are usually a fraction of a second.
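A minimal sketch (import path and file name are illustrative):

```python
from pegasustools import load_hst_file  # import path is an assumption

# Load a history file; restart overlaps are resolved internally so only
# the latest run's rows survive. The result is a Polars DataFrame.
hst = load_hst_file("output/turb.hst")
print(hst.columns)
```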
Plotting
PegasusTools includes a matplotlib style sheet, `pegasus_style`, that can be used to set the style of matplotlib plots. PegasusTools also defines a new colormap, the Hawley colormap shown below. The Hawley colormap is registered with matplotlib when PegasusTools is imported, so the `hawley` colormap can be used just like any built-in colormap; it is also the default colormap in the `pegasus_style` style sheet.
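A hedged sketch of how this fits together (the exact name passed to `plt.style.use` is an assumption; see the documentation):

```python
import matplotlib.pyplot as plt
import numpy as np

import pegasustools  # noqa: F401  (importing registers the "hawley" colormap)

# The style-sheet lookup name is an assumption; see the PegasusTools docs.
plt.style.use("pegasus_style")

# The registered colormap can be referenced by name like any built-in one.
plt.imshow(np.random.rand(64, 64), cmap="hawley")
plt.colorbar()
plt.show()
```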