I've written some simple tools that check the number of events in an ntuple and compare it to the DFC. There are three programs in the directory /cdf/data10a/ucntUtils/nevts/. They were meant to be run with version 5.3.3_nt of the CDF software, but should be easily portable to other versions.

The first program is get_evts_file_db.py. It retrieves a list of files and their numbers of events from the DFC. To run it, give it a dataset name and a book in the DFC. For example:

    get_evts_file_db.py bhel0d filecatalog

will return a list of all files in bhel0d (the 5.3.X inclusive electrons) along with the corresponding number of events, and

    get_evts_file_db.py ztop1i cdfptop

will return a list of all files in ztop1i (the Top group's 5.3.X Pythia Z->tau tau MC) along with the corresponding number of events.

If you know the dataset name but don't know the book, execute:

    grep dataset /cdf/data10a/ucntUtils/dataset_book_catalog

where dataset is the dataset name you're looking for. Chances are, it's in there. If it's not, update the file by executing:

    /cdf/data10a/ucntUtils/dumpbooks.sh

You're going to want to save the list created with get_evts_file_db.py to a file. For example:

    get_evts_file_db.py bhel0d filecatalog >&! bhel0d.lis

will work.

Once you have this list saved, you're ready for step 2: a script called get_evts_file_data.tcsh, which should be run from the nevts/ directory. It takes two arguments: a directory containing .root files made from the files in that list, and the list itself. For each .root file in the directory, it prints the database file, the number of events the database claims, the .root file, and the number of events in the .root file. This output is automatically saved to a .evts file.
For example,

    get_evts_file_data.tcsh /cdf/data10a/Datastes/UCNT_533nt_59719x/DATA/ELECTRON/bhel0d/01/ bhel0d.lis

will create the file bhel0d.lis.evts, and the first few lines look like:

    bd02ab94.0001hel0 5488 flat_5.3.3_nt_bhel0d_GJ5856.0_174996_1.root 5488
    bd02abda.012chel0 5226 flat_5.3.3_nt_bhel0d_GJ5856.0_175066_300.root 5226
    bd02abda.0005hel0 5189 flat_5.3.3_nt_bhel0d_GJ5856.0_175066_5.root 5189

Finally, there's a third program called parse_evts_file, which takes this .evts file and prints the number of lines processed in the file, any mismatched files, and the total number of events from the database and from the ntuples. So,

    parse_evts_file bhel0d.lis.evts

will print:

    ***********************************************
    Begin Parsing bhel0d.lis.evts
    Processed 500 files.
    There were 0 mismatched files.
    TOTAL EVENTS IN DB/TOTAL EVENTS IN DATA: 2535312/2535312
    ***********************************************

One thing to be careful about here is the case when an ntuple contains no events. In this case, the line in the .evts file has no event count after the .root file name:

    bd02ab94.0001hel0 5488 flat_5.3.3_nt_bhel0d_GJ5856.0_174996_1.root

and the parsing program will fail on this line without producing any error message. To make sure this did not happen, do a word count on the .evts file:

    wc bhel0d.lis.evts

The first number is the number of lines in the file. It should be the same as the number of files reported in the "Processed" line above. If it's not, a quick look through the .evts file will reveal the problem.

A few more things to note.

The first is that the script get_evts_file_data.tcsh expects the file names of the ntuples to end in run_section.root, where run is the decimal run number and section is the decimal run section number. Any file made with our ntuple script will follow this convention.

The second is that, as of Nov. 12, 2004, the datasets found on http://hep.uchicago.edu/cdf/flatntuple/533_UCNT_DataSet_Info.html and listed as completed have a subdirectory in nevts/ for the .lis and .evts files. So, for bhmu0d, there is a bhmu0d directory with the following contents:

    -rw-rw-rw- 1 cwolfe cdf 25k Nov 12 02:22 bhmu0d.lis
    -rw-rw-rw- 1 cwolfe cdf 9.5k Nov 12 02:22 bhmu0d_3.lis.evts
    -rw-rw-rw- 1 cwolfe cdf 25k Nov 12 02:22 bhmu0d_3.lis
    -rw-rw-rw- 1 cwolfe cdf 36k Nov 12 02:22 bhmu0d_2.lis.evts
    -rw-rw-rw- 1 cwolfe cdf 25k Nov 12 02:22 bhmu0d_2.lis
    -rw-rw-rw- 1 cwolfe cdf 36k Nov 12 02:22 bhmu0d_1.lis.evts
    -rw-rw-rw- 1 cwolfe cdf 25k Nov 12 02:22 bhmu0d_1.lis

A third point is that there are occasionally mismatches between the ntuples and the DFC. As an example:

    parse_evts_file bhmu0d/bhmu0d_2.lis.evts

gives the following output:

    ***********************************************
    Begin Parsing bhmu0d/bhmu0d_2.lis.evts
    bd028e52.0255hmu0 4779 flat_5.3.3_nt_bhmu0d_PU6345.0_167506_597.root 4780
    Processed 500 files.
    There were 1 mismatched files.
    TOTAL EVENTS IN DB/TOTAL EVENTS IN DATA: 2930333/2930334
    ***********************************************

For all the data (not MC), the few mismatched files found so far look like this example: the ntuple has one more event than the DFC claims. The reason, according to Ken Hatakeyama, is: "It seems that it happens when the farm tells us that we have lost events in reprocessing (0c->0d or 0e->0d), but actually we haven't lost events."

Fourth, there's also a simple utility called find_duplicate_evts, which checks for duplicate events in a file or a group of files. Run it like so:

    find_duplicate_evts ntuples

where ntuples is some list of (one or more) ntuples.

Finally, to port these utilities to another version of the CDF software, all that should be needed is to set up the new version and execute:

    make numevts
    make find_duplicate_evts

And everything should work. I won't guarantee it, but it should work.
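Since parse_evts_file stays silent about lines where the ntuple event count is missing, an independent cross-check can be handy. Below is a minimal Python sketch of one. It is not the code in parse_evts_file; it only assumes the four-column .evts layout shown in the examples above (database file, database events, .root file, ntuple events), and the function name check_evts is made up for illustration. Unlike the wc check, it reports which lines are incomplete.

```python
import sys


def check_evts(lines):
    """Summarize .evts lines of the assumed form:
    db_file  db_events  root_file  root_events
    (check_evts is a hypothetical helper, not part of the nevts/ tools).
    """
    processed = 0
    db_total = 0
    nt_total = 0
    mismatched = []  # (line number, text) where both counts exist but differ
    incomplete = []  # (line number, text) where a field is missing or not a number

    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        # An ntuple with no events leaves only three fields on the line.
        if len(fields) != 4 or not (fields[1].isdigit() and fields[3].isdigit()):
            incomplete.append((lineno, line.strip()))
            continue
        db_n = int(fields[1])
        nt_n = int(fields[3])
        processed += 1
        db_total += db_n
        nt_total += nt_n
        if db_n != nt_n:
            mismatched.append((lineno, line.strip()))

    return processed, db_total, nt_total, mismatched, incomplete


# Run as: python this_script.py bhel0d.lis.evts
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as evts_file:
        processed, db_total, nt_total, mismatched, incomplete = check_evts(evts_file)
    print("Complete lines: %d" % processed)
    print("DB events / ntuple events: %d/%d" % (db_total, nt_total))
    for lineno, text in mismatched:
        print("MISMATCH   line %d: %s" % (lineno, text))
    for lineno, text in incomplete:
        print("INCOMPLETE line %d: %s" % (lineno, text))
```

Any line reported as INCOMPLETE is a candidate for the empty-ntuple case and is worth checking by hand.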
If you have any questions, email me (cwolfe@hep). -- Collin