How to go about processing large amounts of data?

Hello
I am working on a personal project and was wondering how to go about processing large amounts of data as quickly and efficiently as possible.

So far my efforts have centered on downloading the ROOT Docker image locally so I can base my code on C++.
I was also wondering about techniques for managing memory during processing, since my RAM could run out.

Thanks in advance!

Hi @Dat_Guy ,

It depends on what resources are available to you and what sort of processing you need to do. On a private laptop, something like RDataFrame is pretty darn good if you are comfortable with a ROOT-based environment. If you prefer something Pythonic, the tools from IRIS-HEP (Analysis Systems | Institute for Research and Innovation in Software for High Energy Physics) and Scikit-HEP (Scikit-HEP Project · GitHub) are pretty good.
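
For instance, a minimal RDataFrame sketch from Python could look like the following (the file, tree, branch, and cut are just placeholders for whatever your dataset actually contains):

import ROOT

# use all available cores for the event loop
ROOT.EnableImplicitMT()

# RDataFrame only reads the branches you actually use
df = ROOT.RDataFrame("CollectionTree", "my_sample.root")

# Filter/Histo1D are lazy; the event loop runs only when a result is requested
hist = df.Filter("lep_pt > 25000").Histo1D(
    ("lep_pt", "lepton pT;pT [MeV];events", 100, 0.0, 200000.0), "lep_pt"
)
hist.Draw()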

If you are able to use a small cluster, then something like Scikit-HEP’s integration with Dask will do pretty well, or cutting up the jobs and submitting them to a batch system with either setup would work.
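
As a rough sketch of the Dask route (this assumes uproot’s Dask support, i.e. dask-awkward, is installed; file, tree, and branch names are placeholders):

import uproot

# lazily map many files at once; this builds a Dask task graph rather than reading data now
lazy = uproot.dask({"file1.root": "CollectionTree", "file2.root": "CollectionTree"})

# selections stay lazy; .compute() runs the graph, in parallel if a Dask
# scheduler is available (e.g. dask.distributed.Client on a small cluster)
lep_pt = lazy["lep_pt"]
selected_pt = lep_pt[lep_pt > 25_000.0].compute()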

For memory management, again it depends on what you’re doing. For the input, avoiding reading everything into memory is usually a good idea (be diligent about closing things up). Keeping histogram binning modest is a good idea as well (especially for multi-dimensional histograms). Other recommendations would probably be dependent on your workflow and what tools you end up using.
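
For example, with uproot you can stream the input in chunks instead of loading whole files at once (file, tree, and branch names below are placeholders):

import uproot
import awkward as ak

# read the tree in ~200 MB chunks; each chunk is freed once the loop moves on
for chunk in uproot.iterate(
    {"my_sample.root": "CollectionTree"},
    ["lep_pt"],          # only request the branches you need
    step_size="200 MB",
):
    n_selected = ak.sum(chunk["lep_pt"] > 25_000.0)
    # ...fill histograms / accumulate summaries here instead of keeping the raw chunks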

Good luck!

Hey, thanks for the comment

I am using uproot for data retrieval and parsing.
For the processing I am using vector and awkward.

My workflow is as follows (sharing some code, hope it makes sense).
I have defined an ATLAS_Parser class:

atlas_parser = parser.ATLAS_Parser()
atlas_parser.get_data_index([consts.ATLAS_ELECTROWEAK_BOSON])
atlas_parser.parse_all_files(schemas.TOPQ_SCHEMA, limit = 50)

The issue is with parse_all_files: it goes file by file and parses each one rather slowly.
It’s implemented like this →

def _parse_file(self, schema, file_index):
    # uses uproot and awkward (imported as ak)
    with uproot.open({file_index: "CollectionTree"}) as tree:
        events = {}

        for container_name, fields in schema.items():
            cur_container_name, zip_function = ATLAS_Parser._prepare_container_name(container_name)

            # read only the requested fields, aliased to their AuxDyn branch names
            tree_as_rows = tree.arrays(
                fields,
                aliases={var: f"{cur_container_name}AuxDyn.{var}" for var in fields}
            )

            # re-zip the columns into a single record array for this container
            sep_to_arrays = ak.unzip(tree_as_rows)
            field_names = tree_as_rows.fields
            tree_as_rows = zip_function(dict(zip(field_names, sep_to_arrays)))

            events[container_name] = tree_as_rows

        # combine all containers into one per-event record array
        return ak.zip(events, depth_limit=1)

I have two main questions: is my way of parsing the data effective, and how could I speed it up, maybe using multiprocessing?
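
Something along these lines is what I had in mind for the multiprocessing part (just an untested sketch; it assumes the parser object and schema can be pickled, and file_indices stands in for whatever list of files parse_all_files currently loops over):

import concurrent.futures
import functools

# untested sketch: parse several files at once in separate worker processes
parse_one = functools.partial(atlas_parser._parse_file, schemas.TOPQ_SCHEMA)

with concurrent.futures.ProcessPoolExecutor(max_workers=4) as pool:
    parsed_files = list(pool.map(parse_one, file_indices))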

Thanks again

Hi @Dat_Guy ,

I’d suggest checking out the IRIS-HEP Analysis Grand Challenge pages, where they document their setups with significant parallelism pretty well: