Background as of fall 2009
The network layout at the STAR experiment has grown from a base laid over ten years ago, with a number of people adding devices over time with little coordination or standardization. As a result we have, to put it bluntly, a huge mess of a network: a mix of hardware vendors and media, with cables running everywhere, many of them unlabelled and now buried to the point of untraceability. SOHO switches of various brands, ages and capabilities are scattered throughout. (It was only about a year ago that the last hubs were replaced with switches – at least, I have not found any hubs since then.) There are a handful of “managed” switches, but they are generally lower-end models and we have not taken advantage of even their limited monitoring capabilities. (In the case of the LinkSys switches purchased a year ago, I found the management web interface poor: slow, buggy and not very helpful.)
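As an illustration of the kind of “limited monitoring” even low-end managed switches could give us, here is a minimal sketch that polls per-port input error counters over SNMP. It assumes the switch has SNMP v2c enabled with a known read-only community string and that Net-SNMP’s snmpget is installed; the host name, community string and port count below are placeholders, not actual STAR configuration.

```python
#!/usr/bin/env python
# Minimal sketch: poll per-port error counters on a managed switch via SNMP.
# Assumes Net-SNMP's snmpget is available and SNMP v2c is enabled on the switch;
# the host name, community string and port count are placeholders.
import subprocess

SWITCH = "sw-south-platform.example"   # placeholder switch host name
COMMUNITY = "public"                   # placeholder read-only community string
IF_IN_ERRORS = "1.3.6.1.2.1.2.2.1.14"  # IF-MIB::ifInErrors column

def in_errors(port):
    """Return the ifInErrors counter for one switch port (interface index = port)."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv",
         SWITCH, "%s.%d" % (IF_IN_ERRORS, port)])
    return int(out.decode().strip())

if __name__ == "__main__":
    for port in range(1, 25):          # assumes a 24-port switch
        print("port %2d: %d input errors" % (port, in_errors(port)))
```

Run periodically (e.g. from cron), something like this would at least tell us which ports are accumulating errors, instead of learning about problems only when users complain.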
In addition to the general messiness, a big (and growing) concern is that in each of the past several years there have been a handful of periods of instability on the starp network, typically lasting from a few minutes to several hours (or possibly indefinitely in the most recent cases, which were resolved hastily by replacing switch hardware in the middle of RHIC runs). The cause of these instabilities has never been understood. They typically manifest as slow or completely lost communication with devices on the South Platform (historically, most often the VME processors). Speculation has tended to focus on ITD security scanning; while scanning has been shown to be potentially disruptive to some individual devices and services, a broad effect on whole segments of the network has never been conclusively demonstrated, nor has anyone offered a testable, plausible mechanism for such instability.
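One way to make the speculation testable (nothing like this is in place yet) would be to continuously log round-trip times to a few South Platform nodes, so that an episode of slowness leaves a timestamped record that can be compared against, for example, ITD scan schedules. A minimal sketch follows; the host names are placeholders, and it assumes a TCP service (port 23 here, purely as an example) that normally answers on the target nodes.

```python
#!/usr/bin/env python
# Minimal sketch: log TCP connect times to a few South Platform hosts so that
# any episode of slowness leaves a timestamped record for later comparison.
# Host names and the port are placeholders/assumptions, not actual STAR names.
import socket
import time

HOSTS = ["vme-sp-01.example", "vme-sp-02.example"]  # placeholder host names
PORT = 23          # assumes a service that normally answers on these nodes
INTERVAL = 60      # seconds between samples
TIMEOUT = 5        # seconds before a host is declared unreachable

def connect_time(host):
    """Return seconds needed to open a TCP connection, or None on failure."""
    start = time.time()
    try:
        sock = socket.create_connection((host, PORT), TIMEOUT)
        sock.close()
        return time.time() - start
    except (socket.error, socket.timeout):
        return None

if __name__ == "__main__":
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for host in HOSTS:
            rtt = connect_time(host)
            status = "unreachable" if rtt is None else "%.3f s" % rtt
            print("%s %s %s" % (stamp, host, status))
        time.sleep(INTERVAL)
```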
The past year brought the two most significant episodes of instability yet on starp, involving LinkSys SLM 2048 switches that had been purchased as a replacement (plus a spare) for a Catalyst 1900 on the South Platform. After weeks or months of stability, each developed problems that looked similar to the earlier issues, only more severe. When the first showed signs of trouble it was replaced by the second, which failed spectacularly later in the run: it became completely unresponsive to its web interface and to pings, and appeared to be forwarding packets only intermittently. (After all devices were removed and the switch was rebooted, it behaved normally on the lab bench, but it has not been put back into service.)
At that point all devices were moved off the LinkSys switch and onto a pair of unmanaged SOHO switches, each of which uplinks to an old 3Com switch on the first floor. No further instabilities have been noted since, but the change has left a physical cabling mess and a rather awkward network layout. (Adding to the trouble, at least one of the SOHO switches has a history of sensitivity to power fluctuations and occasionally needs to be power-cycled after a power dip or outage.)
In addition, there have been superficially similar episodes of problems on the DAQ/TRG network, which shares no networking hardware with starp. As far as I know, these episodes resolved themselves spontaneously. (Is this true?) Speculation has centered on “odd” networked devices (such as oscilloscopes) generating unusual traffic, but here too there is no conclusive evidence of the cause. With no explanation in hand, it seems likely this behavior will be encountered again.
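Should it recur, one way to move beyond speculation would be to keep a bounded, rotating packet capture running on the suspect segment, so the “unusual traffic” theory can be checked against real packets after the fact. A minimal sketch is below; the interface name, output path and rotation sizes are placeholders, and it assumes tcpdump is installed on a machine that actually sees the traffic of interest (e.g. via a mirrored/SPAN port).

```python
#!/usr/bin/env python
# Minimal sketch: keep a bounded, rotating tcpdump capture running so that
# "unusual traffic" episodes can be examined after the fact.
# Interface, output path and rotation sizes are placeholders; assumes tcpdump
# is installed and this host sees the traffic of interest (e.g. a SPAN port).
import subprocess

INTERFACE = "eth0"                      # placeholder capture interface
OUTFILE = "/data/netwatch/daqtrg.pcap"  # placeholder output path

# -s 0   capture full packets
# -n     no name resolution
# -C 100 rotate output files at roughly 100 MB
# -W 20  keep at most 20 files (ring buffer, oldest overwritten first)
cmd = ["tcpdump", "-i", INTERFACE, "-s", "0", "-n",
       "-C", "100", "-W", "20", "-w", OUTFILE]

subprocess.call(cmd)   # runs until interrupted; files are readable with tcpdump -r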