Downloading the DELPHI data

My goal is to create a copy of all the DELPHI data for Open Science II. I am using the cernopendata-client to handle the downloads, and I encountered two errors that I believe are not on my side.

First error:

$ cernopendata-client download-files --doi 10.7483/OPENDATA.DELPHI.6LIH.7UJA --verify --protocol xrootd

==> Downloading file 1 of 4

-> File: ./None/hzha03pyth6156_hattbb_206.5_70_90_22432.xsdst

==> ERROR: Please provide at least one of the following arguments: (recid, doi, title)

But I can just use recid, so moving on to the second error:

$ cernopendata-client download-files --recid 93773 --verify --protocol xrootd

==> Downloading file 1 of 4

-> File: ./93773/hzha03pyth6156_hattbb_206.5_70_90_22432.xsdst

==> Verifying file hzha03pyth6156_hattbb_206.5_70_90_22432.xsdst…

-> Expected size 29291520, found 29291520

-> Expected checksum adler32:03d9681c, found adler32:3d9681c

==> ERROR: File checksum does not match.

This error is caused by cernopendata-client/cernopendata_client/verifier.py at fc54c028682d149cb81b30532fddaa0bdef5a3ac · cernopendata/cernopendata-client · GitHub, where the hex function strips the leading zeroes but the server checksum retains them. Statistically, this will cause roughly 1 in 16 files (those whose checksum begins with a zero hex digit) to fail the checksum verification. I am currently uncertain whether this behaviour is desirable for other use cases; if it is not, I would be happy to create a pull request.
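To illustrate the mismatch, here is a minimal Python sketch. It is not the client's actual verifier code; the two helper names are hypothetical, and it only assumes that the checksum is computed as an integer and then rendered with `hex`, as described above. It contrasts the unpadded rendering with one that is zero-padded to 8 hex characters.

```python
import zlib


def adler32_hex_unpadded(data: bytes) -> str:
    """Render the Adler32 value via hex(), which drops leading zeroes."""
    return hex(zlib.adler32(data) & 0xFFFFFFFF)[2:]


def adler32_hex_padded(data: bytes) -> str:
    """Render the Adler32 value zero-padded to 8 hex characters."""
    return format(zlib.adler32(data) & 0xFFFFFFFF, "08x")


# The checksum value from the error message above, as an integer.
value = 0x03D9681C
print(hex(value)[2:])        # 3d9681c  (what the client reported)
print(format(value, "08x"))  # 03d9681c (what the server expects)
```

Any checksum whose leading hex digit is 0 shows the same discrepancy, so comparing the padded form (or comparing the integer values) avoids the false mismatch.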

I wanted to use the CERN Open Data Portal to retrieve all the recids, but the website and the API powering it are limited to 10 000 results. Even a request like https://opendata.cern.ch/api/records/?q=&sort=mostrecent&size=50&from=10000&experiment=DELPHI&type=Dataset returns a status 400. Can you suggest an alternative or a solution for this limitation?

Hi @lukesm

Thank you for the bug report!

The first two issues for the cernopendata-client were fixed in fix(download): use record IDs for all local data paths by tiborsimko · Pull Request #167 · cernopendata/cernopendata-client · GitHub and fix(verifier): zero-pad Adler32 checksums to 8 hex characters (#169) by tiborsimko · Pull Request #169 · cernopendata/cernopendata-client · GitHub. We have just released a new version of the cernopendata-client, 1.0.2, with these fixes. Here are the full release notes: Release v1.0.2 · cernopendata/cernopendata-client · GitHub

As for the CERN Open Data portal pagination observation, I have created an issue here: pagination: allow paginate over more than 10000 results · Issue #276 · cernopendata/cernopendata-portal · GitHub

Please note that you can also find DELPHI open data records in their JSON format in this repository: opendata.cern.ch/data/records at master · cernopendata/opendata.cern.ch · GitHub You could parse “delphi-*.json” files to discover information about record IDs and the corresponding file locations. This should correspond 100% to the status of records on the CERN Open Data portal.
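As a rough sketch of that parsing step, the Python snippet below walks a local clone of the repository and collects record IDs with their file locations. The function name `list_delphi_files` is hypothetical, and the record layout is an assumption: each "delphi-*.json" file is taken to hold a JSON list of records, each with a "recid" field and a "files" list whose entries carry a "uri". Please check a real file and adjust the keys if the layout differs.

```python
import glob
import json
import os


def list_delphi_files(records_dir):
    """Yield (recid, uri) pairs from delphi-*.json files in records_dir.

    Assumed layout (not verified here): each file is a JSON list of
    records with a "recid" and a "files" list of {"uri": ...} entries.
    """
    pattern = os.path.join(records_dir, "delphi-*.json")
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            records = json.load(handle)
        # Tolerate a file that holds a single record instead of a list.
        if isinstance(records, dict):
            records = [records]
        for record in records:
            recid = record.get("recid")
            for entry in record.get("files", []):
                yield recid, entry.get("uri")
```

For example, after cloning the repository, `for recid, uri in list_delphi_files("opendata.cern.ch/data/records"): print(recid, uri)` would print one line per file, and the distinct record IDs could then be passed to `cernopendata-client download-files --recid`.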
