Do you happen to have samples with ttbar → mu + nu + jets at 13.6 TeV? I’m questioning the wisdom of downloading dataset 601229 (PhPy8EG_A14_ttbar_hdamp258p75_SingleLep), for which I would need to request more events than are currently available anyway, only to cut away the two-thirds that don’t contain a muon, versus generating 20M of my own ttbar events.
Thanks Zach. Okay, I guess I’m just going to bite the bullet and ask for the full dataset then. I’m trying to estimate the fake dimuon rate for 300 fb^-1 of luminosity at 13.6 TeV, so good statistics are crucial.
@zmarshal / openEvtGen team, not trying to be pushy, but is it possible to get higher statistics for the ttbar semileptonic dataset with the muon filter, dataset ID 601229, and on what sort of timescale? I’m trying to weigh waiting against generating my own. Thanks!
How many events are you after? Looks like we’ve processed to HepMC (internally) 10M and released them all. We have (internally) something like 1B, but of course releasing that is a lot of disk space (even if it saves you a lot of compute).
Just checking: you’re trying to get after fakes from heavy flavor, or from ‘true’ ttbar events, or from something else?
Putting on my ‘what approximations are probably good enough’ hat: if you’re trying to get after fakes from b fragmentation and decay, I would have no objection to you combining all the single-lepton ttbar samples we have available (e.g. 601398, 601414, 601497, 604468, 604470, 604472, 604474, 604476, 604478, 604480, 604482). Those have slightly different setups for hadronization and fragmentation (maybe that’s even a useful thing), and end up with different random seeds for the fragmentation and decay of the b-quarks, so the events in those samples are sufficiently different for you to use them all. That won’t work if you were instead going to use those samples to set a systematic uncertainty from fragmentation and hadronization, of course, but then I’d worry you’d need higher statistics for lots of samples.
Hi @zach, it looks like 2.5M ttbar events that I generate use 280GB of storage, whereas the same number of MC events downloaded from your dataset requires 1.3TB of storage. I need 20M MC events, and am storage-limited in the short term, so I think that swings it in favour of generating events myself. Any idea what makes your datasets so much larger? Do you think it’s the specific generator (e.g. MGaMC vs Sherpa)? Or the hadronization settings (MPI turned off for me)?
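For what it’s worth, the per-event sizes implied by those numbers work out as follows (a quick back-of-the-envelope check, taking the figures above at face value):

```python
# Per-event sizes implied by the numbers quoted above.
events = 2.5e6                # 2.5M events in each sample

my_bytes = 280e9              # 280 GB, self-generated sample
released_bytes = 1.3e12      # 1.3 TB, released dataset

my_kb = my_bytes / events / 1e3
released_kb = released_bytes / events / 1e3

print(f"self-generated: {my_kb:.0f} kB/event")        # ~112 kB/event
print(f"released:       {released_kb:.0f} kB/event")  # ~520 kB/event
print(f"ratio:          {released_bytes / my_bytes:.1f}x")  # ~4.6x
```

So the released events are more than four times larger apiece, which is more than MPI alone would usually explain.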
That’s a pretty big difference, I agree! It’s possible that the MG5 events are smaller (have fewer intermediate particles) — if you’ve got both in HepMC format, you could just look for how many lines each file has that start with ‘P’ or ‘V’ per event. For one of ours, I get 860 vertices per event and 1498 particles per event. With MPI I’d expect a difference, but I don’t have a good intuitive feeling for whether it’ll be huge or modest. Some smaller contribution could “just” be from the weights in the events, but I’d be a little surprised if that’s causing such a large difference.
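In case it’s useful, here’s a minimal sketch of that line-counting check, assuming the plain-ASCII HepMC layout where each event opens with an ‘E ’ line and vertex/particle records start with ‘V ’ and ‘P ’ (adjust the prefixes if your files differ):

```python
def count_per_event(lines):
    """Average number of vertex ('V') and particle ('P') lines per event
    in HepMC-style ASCII input. Assumes each event opens with an 'E ' line."""
    n_events = n_vertices = n_particles = 0
    for line in lines:
        if line.startswith("E "):
            n_events += 1
        elif line.startswith("V "):
            n_vertices += 1
        elif line.startswith("P "):
            n_particles += 1
    if n_events == 0:
        return 0.0, 0.0
    return n_vertices / n_events, n_particles / n_events

# e.g., on a gunzipped file:
# with open("events.hepmc") as f:
#     v_per_evt, p_per_evt = count_per_event(f)
```

Running that on both samples should show directly whether the MG5 events carry fewer intermediate particles.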
Just to check the other obvious thing: do you also gzip your files? And are they in 10k-event batches? (Bigger files generally zip better, so larger batches could make your files smaller per event.)
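To illustrate the batch-size point (toy records, not real events, so the exact numbers don’t mean anything, but the direction of the effect should hold):

```python
import gzip
import random

random.seed(0)
# Toy 'events': repetitive structure plus some incompressible numbers,
# loosely mimicking HepMC-style particle records.
events = ["".join(f"P {i} 2212 {random.random():.6f}\n" for i in range(50))
          for _ in range(1000)]

# One big gzipped file vs 100 gzipped batches of 10 events each.
one_file = len(gzip.compress("".join(events).encode()))
batches = sum(len(gzip.compress("".join(events[i:i + 10]).encode()))
              for i in range(0, len(events), 10))

print(one_file, batches)  # one big file compresses at least as well
```

Each separate gzip member pays its own header and restarts its dictionary, which is where the per-event savings from bigger batches come from.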
Thanks for checking! I agree with you, that’s MPI plus differences between MG+Py and Sherpa (I don’t want to guess which one is more important…). I think we have some options in the future, but this tells us a few things:
We should seriously consider removing intermediate particles from the record if possible. That might require discussion in an appropriate forum down the line, because we don’t want to block a study by removing the ‘wrong’ thing.
If we go to much longer files, we could save maybe 10% of our disk space — not nothing, but not life-changing.
If we want to get more compression, we will need something more complicated than gzip, and then we’ll want to think through the same questions about what Pythia (and other tools) can read.
We can also try to pay some attention to these things if/when we move to HepMC3, to see if there are other ways to control output size. Unfortunately, I think for these samples modifications are going to be too finicky to happen in the short term, so there’s not a lot we can do to help with the file sizes for now.
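On the gzip point above: as a rough illustration of what a stronger compressor can buy on repetitive record-style text, here’s a toy comparison with xz/LZMA from the Python standard library (toy input, so treat the ratio as indicative only; the real question, as noted, is what Pythia and the other tools can read):

```python
import gzip
import lzma

# Toy, highly repetitive HepMC-like text (not from any real sample).
data = ("V -1 0 0 0 0 0 2 0\n"
        "P 1 2212 0 0 6500 6500 0.938 3\n") * 5000

raw = len(data.encode())
gz = len(gzip.compress(data.encode(), compresslevel=9))
xz = len(lzma.compress(data.encode(), preset=9))

print(raw, gz, xz)  # xz should come out noticeably smaller than gzip here
```

xz’s much larger match window helps on long files full of near-identical records, at the cost of CPU and of needing an extra decompression step for tools that only understand zlib/gzip input.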