Dataset access time in Xrootd. 2016 survey

Xrootd dataset access

A survey of the access time of all files in the Xrootd distributed data storage was done on 2016/05/09. The data were extracted and regrouped by triggersetup name / production name (there are many datasets and combinations, and the graphing became cumbersome at best). Including the stream name would have helped clarify the picture, but this was a quick first attempt to extract some useful information. For example, AuAu_200_production_mid_2014,P15ic and AuAu_200_production_mid_2014,P15ie are not a repeat of the same production: the former relates to st_physics with HFT tracking, the latter to st_mtd stream data only. A later version may present a split by stream, but this would inflate the number of representations to an unmanageable number (again, please take this as guidance and a general lesson learned).

From the "raw" numbers, we counted the number of accesses to each file and summed them per dataset. Every count reported is hence not how many times a dataset was accessed but how many times files from this dataset were requested from Xrootd.
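The per-dataset summation described above can be sketched as follows. This is a minimal illustration, not the script used for the survey: the record layout (triggersetup, production, per-file access count), the function name `sum_by_dataset` and the sample values are all assumptions made for the example.

```python
from collections import defaultdict


def sum_by_dataset(records):
    """Sum per-file access counts into per-dataset totals.

    `records` is an iterable of (triggersetup, production, file_access_count)
    tuples. This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    for triggersetup, production, count in records:
        # The dataset key is "triggersetup,production", as in the table.
        totals[f"{triggersetup},{production}"] += count
    return dict(totals)


# Hypothetical per-file records (values are made up, not survey data).
sample = [
    ("AuAu_200_production_mid_2014", "P15ic", 3),
    ("AuAu_200_production_mid_2014", "P15ic", 2),
    ("AuAu_200_production_mid_2014", "P15ie", 4),
]
totals = sum_by_dataset(sample)
```

Because the key combines the triggersetup and production names only, files from different streams of the same production end up in the same total, which is the limitation noted earlier.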

The access counts are shown below.

Datasets Access count
muonminbias,P06ie 6
ppEmcBackgroundCheck,P06ie 14
pp2pp_VPDMB,P10ic 20
pp500_production_fms_2013,P14ia 28
vernier_scan,P10ic 33
tof_prepost_himult,P10ic 34
tof_prepost_himult,P11id 34
tof_production2009_single,P10ic 35
production2009_500Gev_25,P09ig 37
ppLongTest,P06ie 40
upsilonTest,P06ie 52
vernier_scan,P11id 59
barrelBackground62,P12ia 75
ppProductionTransFPDonly,P06ie 76
pp500_production_2013a,P14ia 93
tof_production2009_single,P11id 128
LowLuminosity_2010,P10ik 139
production_fms_pp200long2_2015,P15ik 148
pp2ppStrawMan,P10ic 149
ppEmcCheck,P06ie 149
low_luminosity2009,P10ic 159
pp200_production_2012_setup,P12id 181
pp500_upc_2013,P14ia 233
production2009_500Gev_b,P09ig 312
production_fms_pp200trans_2015,P15ik 327
pp500_production_fmsonly_2013,P14ia 341
barrelBackground,P06ie 397
production_pp200long3_2015,P15ik 407
low_luminosity2009,P11id 431
production_pAu200_fms_2015,P15il 487
pp500_lowluminosity_2012,P13ib 490
pp500_production_fmsonly_2013,P14ig 521
production2009_500GeV_carl,P09ig 521
commission2009_200Gev_Lo,P10ic 541
pp200_production_fms_2012,P12id 549
ppProductionLongNoEmc,P06ie 616
pp200_production_noemc_2012,P12id 698
AuAu_200_production_mid_2014,P15il 755
commission2009_200Gev_Lo,P11id 831
production2009_200Gev_nocal,P10ic 969
cu62productionMinBias,P13ib 1013
production2009_200Gev_noendcap,P10ic 1025
AuAu_200_production_low_2014,P15il 1059
commission2009_200Gev_Hi,P10ic 1212
pp500_production_fms_2012,P13ib 1497
production62GeV,P04ie 1775
production62GeV,P04id 1840
production2009_200Gev_nocal,P11id 1975
pp2pp_Production2009,P10ic 2027
production2009_200Gev_noendcap,P11id 2058
ppProductionTransNoEMC,P06ie 2093
commission2009_200Gev_Hi,P11id 2483
pp500_production_2013_noendcap,P14ia 2882
pp500_production_2012_noeemc,P13ib 3232
production2009_200Gev_Lo,P10ic 3314
pp500_production_2013_noendcap,P14ig 3332
ppProductionJPsi,P06ie 3343
production2009_500GeV,P09ig 3680
production2009_200Gev_Lo,P11id 4241
zdc_polarimetry,P10ic 5221
pp500_production_2011_noeemc,P11id 5329
zdc_polarimetry,P11id 5574
ppProductionMB62,P12ia 6582
ppProduction62,P12ia 6612
AuAu11_production,P10ih 7824
pp2006MinBias,P06ie 8063
ppProduction,P06ie 10305
production2009_200Gev_Hi,P10ic 13462
production_pAu200_2015,P15il 16121
production2009_500Gev_c,P09ig 18809
AuAu19_production,P11id 19277
pp500_production_2011_long,P11id 22772
AuAu7_production,P10ih 24613
ppProductionLong,P06ie 25085
productionMinBias,P05ic 25469
production2009_200Gev_Hi,P11id 29870
cuAu_production_2012,P15ie 30779
AuAu27_production_2011,P11id 30914
ppProductionTrans,P06ie 31101
cuAu_production_2012,P14ia 32217
production_pp200long2_2015,P15ik 47721
production_pp200trans_2015,P15ik 49772
AuAu_200_production_2014,P15ic 50739
production_15GeV_2014,P14ii 53324
AuAu_200_production_high_2014,P15ic 62430
production2009_200Gev_Single,P10ic 72434
AuAu39_production,P10ik 75555
AuAu62_production,P10ik 76928
ppProduction,P05if 78457
pp500_production_2011,P11id 80153
AuAu_200_production_2014,P15ie 86255
AuAu_200_production_low_2014,P15ic 103837
pp200_production_2012,P13ib 113806
AuAu_200_production_low_2014,P15ie 127728
pp200_production_2012,P12id 142581
production2009_200Gev_Single,P11id 146920
AuAu_200_production_mid_2014,P15ic 154651
pp500_production_2012,P13ib 183427
AuAu_200_production_mid_2014,P15ie 205764
UU_production_2012,P12id 208982
CosmicLocalClock,P11id 216021
AuAu200_production_2011,P11id 231039
pp500_production_2013,P14ia 265296
pp500_production_2013,P14ig 299906
AuAu_200_production_high_2014,P15ie 395588
AuAu200_production,P10ik 438687


A few patterns appeared, as follows:
  • Based on the access count alone and the number of datasets as defined above (106 total), a first finding is that 14% of those datasets were accessed fewer than 100 times, 29% fewer than 500 times, 38% fewer than 1,000 times, 56% fewer than 5,000 times and 62% fewer than 10,000 times over a period of 10 months.
    Observation 1: there is room for some cleanup, although those datasets are unlikely to be large (total size still being sorted out).
     
  • Some datasets, such as AuAu11_production,P10ih, had been on storage for a while without being accessed at all, with only a small usage in the past 2 months (8k accesses). The same holds for AuAu27_production_2011,P11id / AuAu39_production,P10ik / AuAu7_production,P10ih but also AuAu_200_production_2014,P15ie.
    Observation 2: some of those are old, and it is unclear why they kept coming up as data to preserve if they were not being used. From a purely resource perspective, those should have been rotated out and restored whenever needed.
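The cumulative fractions quoted in the first finding can be recomputed from the access counts with a few lines. The sketch below is illustrative only: the function name `fraction_below` is an assumption, and the input is a small excerpt of eight rows from the table, so the resulting fractions are those of the excerpt, not of the full 106 datasets.

```python
def fraction_below(counts, thresholds):
    """Return, for each threshold, the fraction of counts strictly below it."""
    n = len(counts)
    return {t: sum(1 for c in counts if c < t) / n for t in thresholds}


# Excerpt of the access count table (8 of the 106 rows).
excerpt = {
    "muonminbias,P06ie": 6,
    "pp500_production_2013a,P14ia": 93,
    "pp500_lowluminosity_2012,P13ib": 490,
    "production2009_200Gev_nocal,P10ic": 969,
    "production2009_200Gev_Lo,P11id": 4241,
    "pp2006MinBias,P06ie": 8063,
    "ppProduction,P06ie": 10305,
    "AuAu200_production,P10ik": 438687,
}

fractions = fraction_below(list(excerpt.values()), [100, 500, 1000, 5000, 10000])
for threshold, fraction in fractions.items():
    print(f"< {threshold:>6} accesses: {fraction:.0%} of datasets")
```

Feeding the full table into the same function gives the survey-wide fractions.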
     

To guide the eye, below are the access counts by grouping (best effort; some series may be repeated).

Au+Au 200 series

Executive summary: overall, all datasets in this category are being accessed, except for a few that show a clear drop in access activity in the last month (not a single access), which is very visible and identifiable on the graph.



BES, special trigger series

In this category, there is a slew of small sets that are rarely accessed (all squished to the bottom), including zdc_polarimetry, but also sets like p+Au 200 2015 (P15il) or Cu+Au production 2012 (P15ie) that have not been accessed for a month, and Au+Au 62 GeV (P10ik), not accessed for 3+ months now. The U+U production 2012 (P12id) seems to be the most popular, followed by Au+Au 39 (P10ik). The reason is unclear, but the CosmicLocalClock dataset has been regularly accessed across the entire time period.





p+p series

200 GeV

From this graph, we can see that most of the p+p 200 GeV datasets have not been accessed in the last months, with the exception of only a few datasets (3 at most). Everything is below a few thousand accesses.




500 GeV

The 500 GeV datasets show a similar pattern: most datasets have not been accessed in the past month (such as p+p 500 production 2012 P13ib) and a slew of datasets have not been accessed at all in a long time (only 4 are regularly accessed).
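The "not accessed in the past month" statements made for each series could be automated along the following lines. This is a sketch under stated assumptions: it supposes that the most recent access date per dataset is available as a simple mapping, the function name `idle_datasets` and the 30-day cutoff are choices made for the example, and the dataset names and dates in the sample are placeholders, not survey data.

```python
from datetime import date

SURVEY_DATE = date(2016, 5, 9)  # date of the survey, as stated in the text


def idle_datasets(last_access, as_of=SURVEY_DATE, max_idle_days=30):
    """Return the sorted names of datasets idle for more than max_idle_days.

    `last_access` maps a dataset name to the date of its most recent
    access; this input format is an assumption for illustration.
    """
    return sorted(
        name
        for name, last in last_access.items()
        if (as_of - last).days > max_idle_days
    )


# Placeholder names and dates (hypothetical, not taken from the survey).
sample = {
    "dataset_a,PXXxx": date(2016, 5, 1),
    "dataset_b,PXXxx": date(2016, 3, 15),
    "dataset_c,PXXxx": date(2015, 11, 2),
}
idle = idle_datasets(sample)
print(idle)
```

Raising `max_idle_days` to around 90 would correspond to the "3+ months" cases mentioned for some of the older productions.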