{ "cells": [ { "cell_type": "markdown", "id": "257f4ff7-0a37-45e7-a701-bab07d2b757f", "metadata": {}, "source": [
"# Speed Up Analysis Code with Parquet Cache\n",
"Parsing the XML-like LHE text format and reconstructing the events in memory is slow. If you study LHE files in memory with the awkward library, you can avoid repeating this cost by caching the awkward form of the LHE data in parquet, a file format that is much faster to read than the raw LHE file.\n",
"\n",
"The code below defines a small function that stores a parquet cache file for any LHE file you read, placed in the per-user cache directory reported by `platformdirs`, so that subsequent reads can go through the faster parquet file. The cache file is re-created whenever the original LHE file has a newer modification time than the cache." ] },
{ "cell_type": "code", "execution_count": 5, "id": "e2ae414e-d09a-4792-a60f-b8c4d1a8644e", "metadata": {}, "outputs": [], "source": [
"import os\n",
"\n",
"import awkward as ak\n",
"from platformdirs import user_cache_dir\n",
"\n",
"import pylhe\n",
"\n",
"\n",
"def _parquet_cache(lhe_fp):\n",
"    \"\"\"Return a stable, user-writable cache path for a given LHE file.\"\"\"\n",
"    # e.g. ~/.cache/pylhe on Linux, ~/Library/Caches/pylhe on macOS,\n",
"    # and a directory under %LOCALAPPDATA% on Windows\n",
"    app_cache_dir = user_cache_dir(\"pylhe\")\n",
"    os.makedirs(app_cache_dir, exist_ok=True)\n",
"\n",
"    # NOTE: the cache is keyed on the file name only, so LHE files with the same\n",
"    # name in different directories share a single cache file\n",
"    base = os.path.splitext(os.path.basename(lhe_fp))[0]\n",
"    return os.path.join(app_cache_dir, base + \".parquet\")\n",
"\n",
"\n",
"def _from_pylhe(lhe_fp):\n",
"    \"\"\"Read an LHE file into an awkward array in memory.\"\"\"\n",
"    return pylhe.to_awkward(pylhe.LHEFile.fromfile(lhe_fp).events)\n",
"\n",
"\n",
"def convert_to_parquet(lhe_fp):\n",
"    \"\"\"Convert the input LHE file into a parquet file of the same name in the\n",
"    user cache directory, with the extension replaced by ``.parquet``.\n",
"\n",
"    Converting the LHE file to a parquet file is beneficial because the resulting\n",
"    parquet file is about the same size as the gzipped LHE file but offers a\n",
"    speed-up of about 2 orders of magnitude when reading the data back into an\n",
"    awkward array in memory.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    lhe_fp : str\n",
"        Path to the LHE file to convert\n",
"    \"\"\"\n",
"\n",
"    ak.to_parquet(_from_pylhe(lhe_fp), _parquet_cache(lhe_fp))\n",
"\n",
"\n",
"def from_lhe(filepath, *, parquet_cache=True):\n",
"    \"\"\"Load an awkward array of the events in the passed LHE file.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    filepath : str\n",
"        Path to the LHE file to load\n",
"    parquet_cache : bool, optional\n",
"        If true, cache the parsed events in a parquet file in the user cache\n",
"        directory. The cache is regenerated if the LHE file's modification time\n",
"        is newer than that of the parquet cache. If false, never use a cache.\n",
"    \"\"\"\n",
"\n",
"    # the input file must exist\n",
"    if not os.path.exists(filepath):\n",
"        msg = f\"Input LHE file {filepath} does not exist.\"\n",
"        raise FileNotFoundError(msg)\n",
"\n",
"    # skip all cache handling if the user does not want a cache\n",
"    if not parquet_cache:\n",
"        return _from_pylhe(filepath)\n",
"\n",
"    # if the cache doesn't exist or its last modification time is earlier than\n",
"    # the last modification time of the original LHE file, we need to create\n",
"    # the cache file\n",
"    cache_fp = _parquet_cache(filepath)\n",
"    if not os.path.exists(cache_fp) or os.path.getmtime(cache_fp) < os.path.getmtime(\n",
"        filepath\n",
"    ):\n",
"        convert_to_parquet(filepath)\n",
"\n",
"    # load the data from the cache\n",
"    return ak.from_parquet(cache_fp)" ] },
{ "cell_type": "markdown", "id": "63c527ef-4bb9-4982-badc-2145ff81d031", "metadata": {}, "source": [
"As an example, we can use the scikit-hep test data to show how much faster reading from the parquet cache is."
] }, { "cell_type": "code", "execution_count": 6, "id": "705a9b59-3044-456c-b9b9-3a0e1f5bf711", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.71 s, sys: 19.9 ms, total: 7.73 s\n", "Wall time: 7.73 s\n", "CPU times: user 8.12 s, sys: 96.7 ms, total: 8.22 s\n", "Wall time: 8.2 s\n", "CPU times: user 103 ms, sys: 16.7 ms, total: 120 ms\n", "Wall time: 103 ms\n" ] } ], "source": [ "from skhep_testdata import data_path\n", "\n", "lhe_file = data_path(\"pylhe-drell-yan-ll-lhe.gz\")\n", "\n", "%time events = _from_pylhe(lhe_file)\n", "# first run needs to generate the cache\n", "# so it will be about as slow as normal LHE reading\n", "%time events = from_lhe(lhe_file)\n", "# later runs will be faster\n", "%time events = from_lhe(lhe_file)" ] }, { "cell_type": "code", "execution_count": null, "id": "f7efdbf7-f40c-4b29-8b00-f455b4a25684", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }