Tuesday, July 25, 2023

Differentially private high dimensional data publication - perhaps a common case

Imagine you have data about 100M people that has around 1000 dimensions,

some binary, some of other types, statistically distributed in various ways - but let's just say roughly uniform random.

So a given person has a pretty clear signature even if it is all binary - 2^1000 is a big space - i.e. a key that is very likely different for each person.
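
As a quick sanity check on that claim (my own back-of-envelope arithmetic, assuming the simplification above that all 1000 binary dimensions are independent fair coin flips), a union bound over all pairs of people shows a collision is essentially impossible:

```python
from math import comb

# Union ("birthday") bound on any two of 100M people sharing the same
# 1000-bit signature, assuming independent uniform bits (the simplification above).
n_people, key_bits = 100_000_000, 1000
collision_bound = comb(n_people, 2) / 2**key_bits
print(f"P(any collision) <= {collision_bound:.1e}")   # roughly 5e-286: effectively zero
```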

But imagine 10 of the dimensions are not binary but (say) Gaussian-distributed values, and the other 990 dimensions are basically 0 for most people, with a 1 (or some small number) for each person, but in a different dimension for different people.

So the 10 dimensions are fairly poor at differentiating between individuals in the 100M population,

but the remaining 990 still work really well - i.e. these are rare things for most people but different for different people, so still a very good signature.

But say we want to publish data that doesn't allow that re-identification, yet retains the distribution in the 990 dimensions.

So what if we just permute those values between all the individuals? We leave the 10 values alone, but swap (at random) the very few 1s between fields and other fields (mostly 0s, a few 1s), for all 100M members of the population.

What's the information loss?

Basically, we're observing that, published unaltered, the data in the higher but sparsely occupied dimensions has very strong identifying power but very poor explanatory power. So messing with it this way massively reduces the identification facet, but shouldn't alter the overall distributions over these dimensions (w.r.t. the densely populated few (10) dimensions).
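
Here is a minimal numpy sketch of one reading of the proposal (my interpretation, with illustrative sizes and rates that are not from the data described above): independently permute each sparse column across individuals, leaving the dense block alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic stand-in for the setup above (illustrative sizes only):
# n individuals, 10 dense Gaussian columns, 990 sparse mostly-zero binary
# columns with roughly 0.2% ones each.
n, n_dense, n_sparse = 20_000, 10, 990
dense = rng.normal(size=(n, n_dense))                      # published untouched
sparse = (rng.random((n, n_sparse)) < 0.002).astype(np.int8)

# One reading of the proposal: shuffle each sparse column independently
# across individuals.
published = sparse.copy()
for j in range(n_sparse):
    published[:, j] = rng.permutation(published[:, j])

# Every per-column marginal (column sum) is preserved exactly...
assert (published.sum(axis=0) == sparse.sum(axis=0)).all()

# ...but the row-level linkage to the dense columns (the re-identifying
# signature) is scrambled: only a small fraction of rows keep their exact
# original 990-bit pattern.
frac_unchanged = (published == sparse).all(axis=1).mean()
print(f"rows with unchanged sparse signature: {frac_unchanged:.3f}")
```

Note that this particular reading also destroys any correlations among the 990 sparse columns themselves; applying a single shared permutation to whole 990-column rows would instead preserve that joint structure while still unlinking it from the 10 dense columns, which may be closer to the intent.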


Does this make any sense to try?

ref: PrivBayes

Another way to think of this is that the low-occupancy dimensions are unlikely to be part of causation, because they mostly have poor correlation with anything else.

Monday, July 17, 2023

National Infrastructure for AI, ML, Data Science

There's been a tension between super expensive HPC clusters and on-prem cloud style data centers for large scale computation since the e-Science programme 20+ years ago. (Just noting that, as part of that, we (Cambridge University Computer Lab) developed the Xen Hypervisor, subsequently used by Amazon in their Cloud setup for quite a while, so there.)

The High Energy Physicists and
folks with similar types of computation have favoured buying expensive
systems that have astronomical size RAM and a lot of cores very close
to the memory. Not only are these super expensive (because they are
not commodity compute hardware) they are almost always dedicated to
one use and are almost always used flat out by those groups, perfectly
justifiably since the data they process keeps flowing.

Meanwhile, most people have tasks that can be classified as either
small (work on a fast laptop these days) or what we call
"embarrassingly parallel", which means they trivially split into lots
of small chunks of data that can be independently processed to (e.g.)
create models which can then be aggregated (or federated). These work
really well in Cloud Computing platforms (AWS, Azure etc).

However, public cloud is a pay-per-use proposition, which is fine for
a few short term goes, but not great if you have things that run for a
while, or frequently. Or if you are a member of a large community
(e.g. UK academics and their friends) who can outright buy and operate
their own cloud platforms in house (aka "on prem" short for on
premises). This is also true for any data intensive organisation
(health, finance etc).
There are operational costs obviously (but these are already in the
price of public pay-per-use clouds) that include energy, real-estate,
and staffing at relatively high levels of expertise.
However, most universities have got more than one such service in
house already. And all are connected to the JANET network (which is
about to upgrade to 800Gbps, and which continues to be super reliable and
just about the fastest operational national network in the world). So
they are sharable. They also often feature state of the art
accelerators (GPUs etc) - these are also coordinated nationally in
terms of getting remote access as part of collaborating projects, so
that sign-on is fairly straightforward to achieve for folks funded from
UKRI - see UKRI facilities for current lists etc.

There are good reasons to continue this federated system of work
including 

  • better resource utilisation,
  • better cost aggregation,
  • potentially higher availability, and
  • lower latency and lower power consumption than nationally centralised systems.

  • The other reason that a widely distributed approach is good is that it continues to support teams of people with requisite state of the art computing skills, who are not distanced from their user communities, so understand needs and changing demands much better than a remote, specialised and elite, but narrow facility.

Since a principal use of such facilities is around discovery science,
it is unlikely to be successful in that role if based on pre-determined
designs tied to 10-20 year project cycles such as the large scale
computational physics community embark on. This is not, however, an
either/or proposition. We need both. But we need the bulk of spending
to target the place where most new things will happen, which is within
the wider research community.

We have a track record of nearly four decades of a national comms infrastructure
which is pretty much the best in the world - we can quite easily do as well for a compute/storage setup too.

Tuesday, July 11, 2023

Why are the design principles of the Internet like Climate Interventions, and like bicycle helmet laws?

  1. For a long time, people argued about whether the Internet should have reliable, flow controlled link layers. In olden times, physical transmission systems were not as good as today, so the residual errors and multiplexing contention led to all sorts of performance problems. There were certainly models that suggested that for some regime of delay/loss, you were better off with a hop-by-hop flow control and retransmission mechanism. As the physical network technologies (access links like WiFi, 4G, Fibre to the home) and switches got faster and more reliable, end-to-end flow control & reliability, plus congestion control, came to look like the better solution (I'm tempted to add security here too!). But here's the key point I want to deliver - if we had built a lot of switches with the additional costs of hop-by-hop mechanisms (just one of many such features), we would have added a lot of latency, the network would have taken a lot longer to reach its current operating point, and a pure end-to-end set of solutions might never have come about - indeed the sunk cost of deploying and maintaining much more complex switches and NICs would lean against the removal of such tech. (A toy delay model sketching this trade-off follows the list below.)
  2. So how is this like climate? Well, people are now sufficiently worried about global heating, the failure to slow our emissions to anything approaching the level needed to keep warming below even 2C, and, worse, the possibility that chain-reaction effects may be imminent, that we are re-visiting arguments for geoengineering, or what I sometimes call re-terraforming the Earth. One such mechanism involves seeding the upper atmosphere so that it reflects a lot more sunlight than it currently does - an affordable approach exists and could mitigate 1-2C of global heating almost right away. Aside from the downsides (for example, you might catastrophically interfere with precipitation, so that things like the Monsoon could move by thousands of kilometers and by months), any such technology would also reduce the effectiveness of actual viable long term solutions like solar power generation. So the short term fix directly messes up the better answer.
  3. And how on earth can this be like bicycle helmet laws? The arguments for wearing bicycle helmets are good - in the event of an accident, they definitely can save your life, or reduce the risk of serious brain injury. No question there. There is a small amount of plausible evidence that cyclists who wear more visible safety gear attract a slightly higher risk from drivers who drive closer, based on an (unconscious bias) perception that the cyclist is less likely to do something random. But that's not the main problem. Statistics from countries that make cycling helmets mandatory conclusively show a large scale reduction in the number of people who cycle, and this leads to a reduction in population health, both from reduced opportunities for exercise and from increased pollution from other modes of transport. Some of those people who stop cycling will, in some sense, die earlier as a result of the helmet law. So the long term solution is to make cycling safer and to remove the need for personal, unsafe cars, or their drivers, who are the root cause of the risk. Autonomous vehicles and segregated bike lanes seem like things one should continue to argue for, rather than forcing a short term solution on people that is counterproductive (i.e. it reduces the inherent, healthy actual demand for cycling).
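
To make the "which regime are you in" point from item 1 concrete, here is a deliberately crude toy model (my own back-of-envelope assumptions, not taken from any reference): it compares the expected per-packet delay of end-to-end retransmission against hop-by-hop retransmission over an n-hop path, charging a fixed per-hop processing overhead to the hop-by-hop design. The numbers are illustrative, not measurements.

```python
# Crude toy model (my own assumptions, illustrative only): expected per-packet
# delay over an n-hop path with independent per-link loss probability `loss`.

def e2e_delay(n_hops: int, loss: float, link_delay: float) -> float:
    """End-to-end reliability: retry the whole path until every link succeeds."""
    attempts = 1.0 / (1.0 - loss) ** n_hops      # expected full-path attempts
    return attempts * n_hops * link_delay

def hop_by_hop_delay(n_hops: int, loss: float, link_delay: float, hop_overhead: float) -> float:
    """Hop-by-hop reliability: each link retries locally, but adds processing overhead."""
    per_hop = link_delay / (1.0 - loss) + hop_overhead
    return n_hops * per_hop

if __name__ == "__main__":
    n_hops, link_delay, hop_overhead = 10, 0.005, 0.002   # 10 hops, 5 ms links, 2 ms per-hop overhead
    for loss in (0.2, 0.05, 0.001):                       # lossy old links -> modern reliable links
        print(f"loss={loss:<6} "
              f"end-to-end={e2e_delay(n_hops, loss, link_delay)*1e3:7.1f} ms  "
              f"hop-by-hop={hop_by_hop_delay(n_hops, loss, link_delay, hop_overhead)*1e3:7.1f} ms")
```

With very lossy links the hop-by-hop column wins by a wide margin; as links become reliable, the fixed per-hop overhead dominates and end-to-end comes out ahead - which is exactly the sunk-cost-in-complex-switches point above.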
So there you have it - the Internet Architecture is like Geoengineering and Helmets - as easy as falling off your bike.

AI everyday life skillz

This extremely useful report from Ada Lovelace et al has lists of "AI" stuff that the public actually encounter - it just predates the hysteria about LLMs, so it might change (a bit) if people were re-surveyed (though I doubt it, as it was well constructed, being about lived experience more than hearsay and fiction).


Nevertheless, it suggests we might want to assess the public's readiness to cope with various new AI tech as it (slowly) deploys...


We can look at it through several lenses. The everyday lens includes smart devices (home, phone, health/fitness) and services (cloud/social/media - recommenders etc), and the workplace (better software that reduces slog on boring tasks and integrates things nicely - especially stupid stuff like travel/expense claims, meeting & document org/sharing, fancy tricks to improve virtual meeting experiences etc); then there are state interventions (in the report above, face recognition, but what about tax surveillance and the like).

Of course, there's the trivial lens - that of your camera phone :-) enhanced by some clever lightfield tricks etc etc...


But if we are thinking longer term (5-50 years), what are the key lessons people should be internalising to reduce future shock?


To be honest, I have no idea, and I think climate is far more important than worrying about an LLM taking your job - unless you are a really bad wordsmith.

Monday, July 10, 2023

Existential threats

 People who like being in headlines are clutching at straws when they talk about existential threats.

The latest in a long line of "we're all doomed" panics was triggered by the hype surrounding a new chatbot, mostly similar to the old chatbot, but with a slightly smoother line of patter. LLMs are not AI, or even AGI; they are giant pattern matchers.

In order of threats to things, my list is quite short

  • LLMs are a threat to journalists, as they reveal how few journalists actually do their job, and that job, therefore, is at risk of being replaced by a script, just like workers in call centers. Threat? Tiny. When? Right now.
  • Nuclear fusion reactors - these actually could save the planet, and the tech is now mere engineering away from being deployable. The main problem is that that engineering is very, very serious - more complex than, say, a 747/Jumbo Jet, which typically has a 20 year lead time. Nevertheless, these are a threat to the fossil fuel industry. Threat: modest. When? 10-20 years off.
  • Quantum computers - these are a threat to some old cryptographic algorithms, for which we already have replacements. However, decoherence and noise are a threat to QC, so these may never happen. Someone clever might solve that, so let's say 5-50 years, or not at all. Threat: minuscule.
  • Climate. catastrophe. already. right now. Threat: total; When: yesterday.
So there's my list. AGIs might happen if we survive all the above, or at least three of them. You choose.
