There's been a tension between super expensive
HPC clusters and on-prem cloud style data centers for large scale computation since the e-Science programme 20+ years ago. (Just noting that, as part of that programme,
we (Cambridge University Computer Lab) developed the
Xen Hypervisor, subsequently used by
Amazon in their Cloud setup for quite a while, so there.)
The High Energy Physicists and
folks with similar types of computation have favoured buying expensive
systems that have astronomical amounts of RAM and a lot of cores very close
to the memory. Not only are these super expensive (because they are
not commodity compute hardware), they are almost always dedicated to
one use and are almost always used flat out by those groups, perfectly
justifiably, since the data they process keeps flowing.
Meanwhile, most people have tasks that can be classified as either
small (they run on a fast laptop these days) or what we call
"embarrassingly parallel", which means they trivially split into lots
of small chunks of data that can be independently processed to (e.g.)
create models which can then be aggregated (or federated). These work
really well on Cloud Computing platforms (AWS, Azure etc).
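As a rough illustration of what "embarrassingly parallel" means in practice, here is a minimal Python sketch. The "model" is just a per-chunk count and mean, and the names (chunked, fit_chunk, aggregate, run) and the chunk size are made up for this example; a real workload would swap in its own per-chunk computation and its own way of combining the results.

```python
from concurrent.futures import ProcessPoolExecutor
from statistics import fmean


def chunked(data, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fit_chunk(chunk):
    """Build a toy 'model' (count, mean) from one chunk, independently of all others."""
    return len(chunk), fmean(chunk)


def aggregate(models):
    """Combine per-chunk models into one global mean, weighted by chunk size."""
    total = sum(n for n, _ in models)
    return sum(n * m for n, m in models) / total


def run(data, chunk_size=250, workers=4):
    """Process every chunk in a separate worker process, then aggregate."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        models = list(pool.map(fit_chunk, chunked(data, chunk_size)))
    return aggregate(models)


if __name__ == "__main__":
    # Hypothetical input: in practice each chunk would be a file or data shard.
    data = [float(x) for x in range(1000)]
    print(run(data))
```

Since no chunk needs to talk to any other chunk, the workers could just as well be separate cloud instances as local processes, which is why this class of task suits commodity cloud hardware.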
However, public cloud is a pay-per-use proposition, which is fine for
a few short term goes, but not great if you have things that run for a
while, or frequently, or if you are a member of a large community
(e.g. UK academics and their friends) who can outright buy and operate
their own cloud platforms in house (aka "on prem", short for on
premises). This is also true for any data intensive organisation
(health, finance etc).
There are operational costs obviously (but these are already in the
price of public pay-per-use clouds) that include energy, real-estate,
and staffing at relatively high levels of expertise.
However, most universities have got more than one such service in
house already. And all are connected to the JANET network (which is
about to upgrade to 800Gbps, and which continues to be super reliable and
just about the fastest operational national network in the world). So
they are sharable. They also often feature state of the art
accelerators (GPUs etc) - these are also coordinated nationally in
terms of getting remote access as part of collaborating projects, so
that sign-on is fairly straightforward to achieve for folks funded from
UKRI - see UKRI facilities for current lists etc.
There are good reasons to continue this federated system of work
including
- better resource utilisation,
- better cost aggregation,
- potentially higher availability (and lower latency and lower power consumption) than nationally centralised systems.
- The other reason that a widely distributed approach is good is that it continues to support teams of people with the requisite state of the art computing skills, who are not distanced from their user communities, and so understand needs and changing demands much better than a remote, specialised and elite, but narrow facility.
Since a principal use of such facilities is around discovery science,
they are unlikely to be successful in that role if based on pre-determined designs based
on 10-20 year project cycles such as the
large scale computational physics community embarks on. This is not,
however, an either/or proposition. We need both. But we need the bulk
of spending to target the place where most new things will happen,
which is within the wider research community.
We have a track record of nearly 4 decades of having a national comms infrastructure
which is pretty much the best in the world - we can quite easily do as well for a compute/storage setup too.
 
