
Commit 356fc1d

add get_files_on_disk to readme
1 parent 18523a8 commit 356fc1d

1 file changed: README.md (40 additions & 0 deletions)

@@ -15,6 +15,7 @@ Table of Contents
* [bind_condor.sh](#bind_condorsh)
* [Usage](#usage-1)
* [Setting up bindings](#setting-up-bindings)
* [get_files_on_disk.py](#get_files_on_diskpy)
* [tunn](#tunn)
* [Detailed usage](#detailed-usage)
* [Web browser usage](#web-browser-usage)
@@ -214,6 +215,45 @@ In this particular case, it is necessary to upgrade `pip` because the Python ver
**NOTE**: These recipes only install the bindings for Python3. (Python2 was still the default in `CMSSW_10_6_X`.)
You will need to make sure any scripts using the bindings are compatible with Python3.

## `get_files_on_disk.py`

This script automates the process of querying Rucio to find only the files in a CMS data or MC sample that are currently hosted on disk.
(The most general form of this functionality is not currently available from other CMS database tools such as `dasgoclient`.)

There are two major use cases for this tool:
1. Finding AOD (or earlier formats such as RECO or RAW) files for testing or development. (AOD samples are not hosted on disk by default, so typically only small subsets of a sample will be transferred to disk for temporary usage.)
2. Obtaining file lists for premixed pileup samples for private MC production. (Premixed pileup input samples are no longer fully hosted on disk because of resource limitations.)

A fraction of each premixed pileup sample is subscribed to disk by the central production team, and the corresponding list of files is synced to cvmfs.
By default, this script will just copy this cached information.
This is the most stable and preferred approach, so only deviate from it if absolutely necessary.

This script should *not* be run in batch jobs, as that can lead to an inadvertent distributed denial of service disruption of the CMS data management system.
The script will actively try to prevent you from running it in batch jobs.
Please run the script locally, before submitting your jobs, and send the resulting information as part of the job input files.
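
The workflow described above (query locally, then ship the result with the job) could be scripted roughly as follows. This is only a sketch: the helper names, the `python3` invocation, the example dataset path, and the `_CONDOR_SCRATCH_DIR` check are assumptions made for illustration, and that check is not necessarily how `get_files_on_disk.py` itself detects batch jobs. Only the `-o` option and the positional `dataset` argument are taken from the script's help output.

```python
import os
import subprocess


def in_batch_job(environ=None):
    """Heuristic (assumption): HTCondor sets _CONDOR_SCRATCH_DIR inside running jobs."""
    environ = os.environ if environ is None else environ
    return "_CONDOR_SCRATCH_DIR" in environ


def build_command(dataset, outfile, script="get_files_on_disk.py"):
    """Assemble the command line using only options listed in the help output."""
    return ["python3", script, "-o", outfile, dataset]


def make_file_list(dataset, outfile, environ=None):
    """Run the query locally; refuse to run from inside a batch job."""
    if in_batch_job(environ):
        raise RuntimeError("run this before submitting jobs, not inside them")
    subprocess.run(build_command(dataset, outfile), check=True)
    return outfile


if __name__ == "__main__":
    # Hypothetical dataset name, used only to show the call shape.
    make_file_list("/Example/Sample-v1/AOD", "files.txt")
```

The resulting `files.txt` can then be listed among the job's input files, for example via `transfer_input_files` in an HTCondor submit description.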

The available options for this script are:
```
usage: get_files_on_disk.py [-h] [-a [ALLOW ...] | -b [BLOCK ...]] [-o OUTFILE] [-u USER] [-v] [--no-cache] dataset

Find all available files (those hosted on disk) for a given dataset

positional arguments:
  dataset               dataset to query

optional arguments:
  -h, --help            show this help message and exit
  -a [ALLOW ...], --allow [ALLOW ...]
                        allow only these sites (default: None)
  -b [BLOCK ...], --block [BLOCK ...]
                        block these sites (default: None)
  -o OUTFILE, --outfile OUTFILE
                        write to this file instead of stdout (default: None)
  -u USER, --user USER  username for rucio (default: [user])
  -v, --verbose         print extra information (site list) (default: False)
  --no-cache            do not use cached file lists from cvmfs (default: False)
```
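
To make the option semantics concrete, here is a minimal `argparse` sketch that would accept the same command line as the help text above. It is a reconstruction from that help output, not the script's actual source: the function name `build_parser`, the `default_user` parameter, and the way the default username is obtained are assumptions, and the site names in the example are illustrative values.

```python
import argparse
import os


def build_parser(default_user=None):
    """Parser mirroring the documented options (reconstructed, not the real source)."""
    if default_user is None:
        # Assumption: fall back to the login name from the environment.
        default_user = os.environ.get("USER", "")
    parser = argparse.ArgumentParser(
        prog="get_files_on_disk.py",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
        description="Find all available files (those hosted on disk) for a given dataset",
    )
    # -a and -b appear as "[-a ... | -b ...]" in the usage line: mutually exclusive.
    sites = parser.add_mutually_exclusive_group()
    sites.add_argument("-a", "--allow", nargs="*", default=None, help="allow only these sites")
    sites.add_argument("-b", "--block", nargs="*", default=None, help="block these sites")
    parser.add_argument("-o", "--outfile", default=None, help="write to this file instead of stdout")
    parser.add_argument("-u", "--user", default=default_user, help="username for rucio")
    parser.add_argument("-v", "--verbose", action="store_true", help="print extra information (site list)")
    parser.add_argument("--no-cache", dest="no_cache", action="store_true",
                        help="do not use cached file lists from cvmfs")
    parser.add_argument("dataset", help="dataset to query")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

If the real parser is built this way, note that `-a` and `-b` accept a variable number of values, so a dataset placed directly after the site list would be read as another site. Putting the dataset first, or placing another option such as `-o` between the site list and the dataset, avoids that.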

## `tunn`

A simple utility to create and manage SSH tunnels.
