Monday, July 15, 2024

what is the internet, what is ai, and what is for dinner?

 at an "internet is 50" event at the royal society (is 350) yesterday

it was clear that a lot of people want to claim they invented the internet, and they are not wrong, but there are very different viewpoints which correspond to layers, and, as with a lot of archeology, when you dig through the layers of ancient civilisations, you find historical context (as well as entire slices full of carbon, indicating a rather violent and abrupt end for some).


photonics - 60 years old - clearly was the internet (it wasn't at all for the first 30 years, but who's counting)


radio - 100 years old, and now trending as 6G, despite the fact that most of the internet runs over WiFi, on account of money


ip - whether v4, v5 (st) or v6, this is the echt internet


web (web science etc) - from 92, which is what a lot of people confuse with the internet, despite the fact that zoom and whatsapp aren't web :-)


cloud (compute/data center etc etc) - confusing, since early pictures of the arpanet are clouds, but cloud computing is only about 20 years old, and isn't the internet, despite the fact that some cloud-based services wanna pretend they are.


ai (in search) - originally information retrieval, which is machine learning or stats, and the basis of coding/modulation and of training signals to optimise for bandwidth and noise, and also the basis of search. but not the internet either.



This profusion and confusion of layers also happens with AI, thusly:


AI was stats (info theory, maximum likelihood etc etc)


then it was ML (optimisation, training on signal etc)


then it was AI (only because an artificial neural network includes the word "neural", despite being about as similar to natural neural networks as spiderwebs are).

Tuesday, July 09, 2024

randix

 i'm thinking about a replacement for posix that resists vulnerabilities through large amounts of random behaviour - so thinking about the relevant system call api

we propose and have prototyped 

spoon() which replaces fork(), and has far less precise semantics

and

resurrect() which replaces both exec() and kill(), with the obvious connotations

open(), close(), read(), write(), link(), and seek() are replaced by a single multihead-attention system call

llm() which either entrains or implies, depending on the sense of the first param.

rand(), of course, behaves exactly as before, at least under test, generating the pseudo-random number sequence 1, 2, 3, 4, 5, 6, 7, 8, 9, etc.
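
for concreteness, here's a minimal sketch of what the randix call surface might look like in a C header; only the call names come from the proposal above - the signatures, parameters and comments are purely illustrative guesses, not a spec:

/* randix.h - illustrative sketch only; signatures are guesses, not a spec */
#include <stddef.h>
#include <sys/types.h>

/* replaces fork(), with far less precise semantics about what you get back */
pid_t spoon(void);

/* replaces both exec() and kill(); brings the process back in some form, or not */
int resurrect(pid_t pid, const char *path);

/* the single multihead-attention call replacing open/close/read/write/link/seek;
 * either entrains or implies, depending on the sense of the first parameter */
ssize_t llm(int sense, void *buf, size_t len);

/* behaves exactly as before, at least under test:
 * returns the pseudo-random sequence 1, 2, 3, 4, 5, ... */
int rand(void);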



Saturday, June 22, 2024

Ten Tales of Ross Anderson, mostly tall

While an undergraduate at Trinity College Cambridge, Ross famously accidentally blew up a bedder with his experimental Quantum Bomb in the Anderson Shelter. The bedder wasn't harmed but the experiment showed that the Maths Department had been teaching the wrong type of Quantum Mechanics, with high probability.

This would, of course, come back to haunt Ross later in life.

Then there was the time he nearly succeeded in back-dooring the NSA...unluckily, they chose a lesser (ironically not quantum-proof) algorithm over his, which was a bit of a shame. Seeing all the five-eyes data from inside would have been a bit of a coup.

And of course, several times he saved a large fraction of the western banking system from collapse (again). This was largely down to understanding the inherent contradictions in blockchain, and the toxic nature of proof-of-astonishment, and the resulting potential oscillatory value proposition that this triggered whenever a user suffering from prosopagnosia was encountered.

His undergraduate teaching of Software engineering from a psephological perspective was always incredibly popular. However, students were not sure how valuable it would be in their future careers. They were wrong, unsurprisingly.

And then he got interested in vulnerabilities in Hebridean sea shanties, and prompt engineering LLMs to create new lyrics with ever more powerful earworms.

This was useful in helping the Campaign, led by Ross, to get Cambridge University to stop firing people illegally for being obstreperous. The so-called Employer Justified Brain Drain was why Ross had taken a position at the University of the Outer Hebrides, so that he could continue to be a thorn in the side, as he had for so long been...

Tuesday, June 04, 2024

10,000 maniacs and AI is destroying Computer Science, one topic at a time....

 This year will see approximately 10,000 papers published in the top 3 conferences in AI alone.

What does that even mean? How can anyone have an overview of what is happening in AI?

How is their "community" calibrated on what is original, what constitutes rigour, what he paper is significant in terms of potential impact on the discipline?

But that's not what I came here to say; at least, that's just the starting point.

For a couple of years now, we've seen papers "tossed over the fence" to other conferences (I'm using the conference as an example venue, but I am sure journals, technical press, and bloggers are seeing the same thing). 

A paper on AI and Systems (or Networks, or Databases, or pick your own long-established domain) should bring interesting results in those domains - indeed, it is clear that AI brings challenges for all those domains (mainly of scale, but some with different characteristics that we haven't encountered precisely before). This is not a problem - we (in Systems) welcome a challenge - we really relish it!

But how do we know that the AI part is any good? How do we know it isn't outdated by other papers recently, or disproven to be a good approach, or even if some paper in the AI community has taken the same AI tech and resolved the systems challenge? How does anyone in the AI community know either?

This is not just unsustainable for AI; it is becoming unsustainable across all of Computer Science pretty rapidly. The AI community, driven by a mix of genuine excitement, but also by hype, some ridiculous claims, greed (for academic or commercial fame & wealth), and simply the urge to "join in" the big rush, is polluting the entire landscape of publications. More problematically, it is atomising the community, so that we are rapidly losing coherence, calibration and confidence about what is important, what is a dead end, and what is just good training for another 30,000 PhD students in the dark arts.

I have no idea how to fix this. Back in the day, at the height of Internet madness, the top ACM and related conferences had a few hundred submissions and accepted in the range of 30-100 papers a year. You could attend and meet many of the people doing the work, scan/read or attend most sessions, even get briefings from experts on whole session topics, or have discussions (dare I say even hackathons too).

In that world, we also started to insist quite strongly that papers should be accompanied by code, data, and, ideally, an artefact evaluation by an independent group (an extra Programme Committee) who could do a lot more than just kick the tyres on the system: try out some variations, perhaps with other data, perhaps more adversarial, perhaps a more thorough sensitivity analysis, etc.

Imagine if the top 3 AI conferences did require artefact evaluation for all submissions - that's probably in the region of 40,000 papers in 2024. But imagine how many fewer papers would be submitted because the authors would know they'd not really have a chance of passing that extra barrier to entry (or would be in a lower tier of the conference, at least).

And while using AI to do reviewing is a really bad idea (since that doesn't help train or calibrate the human community at all), AI-assisted artefact evaluation might be entirely reasonable.

So, like the old Netflix recommender challenge, an AI Artefact Evaluation challenge could help.

Maybe they're already doing it, but who has the time to find out, or to know how well it is working across those 10^4 wafer-thin contributions to something that can really no longer claim to be Human Knowledge.

Thursday, May 23, 2024

Cross "Border" Digital Infrastructure

 So again, while at ID4Africa in Cape Town this week, I heard a lot of people talking about Cross Border use of digital identity. Let's talk a bit about infrastructure here, as I'm not sure people are aware of how hard it is to determine, reliably, where a person or device is located, geographically, let alone jurisdictionally.

We (Microsoft Center for Cloud Research) wrote about this a while back when simply considering the impact of GDPR on Cloud Services and the location of personal data.

The infrastructure doesn't tell you where it is - borders are not digital, they are geo-political constructs that only exist in someone's mind. GPS doesn't work indoors, and can be remarkably perverse in cities anyhow. Content providers (e.g. the BBC in the UK) worry about delivery of content (and adverts and charging) because of different business models in different countries and different content ownership (pace Google YouTube, but also OpenAI), and have, as yet, not solved this problem.

Consider that someone in Ireland can be in or out of the EU in a single step. Or that someone might be on a boat or plane outside a national jurisdiction, using a network to process personal data, which is, exactly, where? Data and processing can be replicated or sharded across multiple sites - indeed, most Cloud Services specifically support keeping copies of state machines running far apart so that they survive local outages (power failure, disaster/flood etc) and are still live/available. In some cases, the geographic separation needed to reach a required level of reliability may involve running live programs on live data in multiple jurisdictions/sovereign states. The law does not comprehend this well yet, and designing digital id (services and wallets etc) without understanding it is not going to help much. Of course, we have the concept of "adequacy" between countries (with regard to GDPR - this was also discussed at ID4Africa last year/2023) - it needs some very careful updating.

Also, recent moves in the Internet Standards world are both towards more anonymity (e.g. Oblivious HTTP) and towards providing precise location as a service (e.g. proposals from CloudFlare).

Be careful what you wish for, where?

sustainability of digital wallets for public infrastructure services

One thing occurred to me when listening to people at ID4Africa 24 talk about wallets is that there's a major sustainability problem due specifically to security considerations. 

Any wallet needs to be trusted if it is used for transactions that involve personal data or money.

To implement this trust, the wallet software currently built by major vendors such as Apple, Google and (say) HSBC can use secure enclave (Trusted Execution Environment) support on the device (e.g. TrustZone on ARM processors, or variants as built by various handset vendors).

However, that support varies over time, both because modifications to the hardware come along (e.g. future ARM support for multiple realms and attestation) and simply because software and hardware vulnerabilities arise, some of the latter being mitigated by changes to the software, some not. This is expensive, so vendors tend to time out support on older devices fairly aggressively.

One report from Cambridge shows how short that can be in practice, so your device no longer gets security patches for the OS (or application SDKs). At this point, can you trust things on it? Almost certainly not in this day and age.
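
To make that policy concrete, here is a tiny illustrative sketch, in C, of the kind of check a wallet effectively ends up applying; the function name, parameters and the three-year cut-off are hypothetical, chosen only to mirror the argument above, and not taken from any real wallet SDK:

/* Hypothetical sketch: refuse to trust a device whose OS has not received
 * a security patch recently enough. Names and cut-off are illustrative only. */
#include <stdbool.h>
#include <time.h>

#define MAX_PATCH_AGE_DAYS (3 * 365)  /* assumed support window, per the post */

bool device_trusted(time_t last_security_patch, time_t now) {
    double age_days = difftime(now, last_security_patch) / (60.0 * 60.0 * 24.0);
    return age_days <= MAX_PATCH_AGE_DAYS;
}

Once a device falls outside that window, the only remedies are a software update the vendor no longer ships, or a new phone - which is where the arithmetic below comes from.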

So there are around 750M people in Europe, 450M of them in the EU. If we mandate wallets for Id (or even just make them the only convenient way to access many services), you need to upgrade, typically by replacing, all their phones about every 3 years. Taking just the EU's 450M, that's roughly 150M phones a year. Many of these phones cost at least 100 euro, and upwards of 1000 euro for high-end devices. That's a cost somewhere between 15B and 150B euro a year.

Oops.

While some of the materials can be recycled (including many newer batteries), the rare earths and other materials used in these devices are already pretty unacceptable in supply chain ethics.

Not a sustainable way to do things. Meanwhile, proposing to run a secure cloud-based wallet is viable, but the cost of running a data center holding much of people's personal data, with fully encrypted access and TEE-style processing, is also very high (some large single data centers' energy use is already approaching that of a large city's metro energy use), plus moving the data back and forth between device and cloud is also a non-trivial contribution to running costs, both monetary and energy/carbon-wise.


We are building ourselves into another unacceptable future...


Someone please check my arithmetic...
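
Since the post asks for an arithmetic check, here is a back-of-the-envelope sketch in C; the population, replacement-cycle and price figures are the assumptions stated above, not measured data:

/* Back-of-the-envelope check of the wallet-driven phone replacement cost.
 * All inputs are the assumptions stated in the post, not measured data. */
#include <stdio.h>

int main(void) {
    double eu_population   = 450e6;   /* people in the EU */
    double replacement_yrs = 3.0;     /* assumed device replacement cycle */
    double price_low_eur   = 100.0;   /* cheap handset */
    double price_high_eur  = 1000.0;  /* high-end handset */

    double phones_per_year = eu_population / replacement_yrs;
    printf("phones replaced per year: %.0f million\n", phones_per_year / 1e6);
    printf("annual cost: %.0f to %.0f billion euro\n",
           phones_per_year * price_low_eur / 1e9,
           phones_per_year * price_high_eur / 1e9);
    return 0;
}

Which prints 150 million phones a year, and an annual cost between 15 and 150 billion euro, depending on where the average handset price lands.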

Wednesday, May 22, 2024

DPG #2 or should I say DPPG or possibly DPPI

 We're hearing a lot about DPIs - Digital Public Infrastructure (the Internet, the spectrum for mobile telephony, open banking networks etc)....

So then there's a lot of talk about building new Infrastructures for (e.g.) Digital Identity - and provisioning this through Public Private Partnerships - so really we then have a DPPI - indeed, the Internet and WWW and Cloud serve as an example of just that too.

But then we have Digital Public Goods - for me, this is an extension of the notion of open source - so again the software that runs the Internet is available in open source form, together with documentation, and even much test data (simulators too). 

But new systems have evolved new forms of ownership, so a lot of the digital content on the Internet is a mix of open (free) and open access but not free to re-purpose (e.g. copyright owners want recompense) - this showed up first in music/file-sharing networks, now subsumed by systems like YouTube - which have to navigate the ownership space carefully.

New forms of digital goods now include trained models (AIs - e.g. LLMs) - these derive value from the data they are trained on (supervised, therefore involving human labour too, or unsupervised), so we then have a new form we might call a DPPG, something that has a mix of properties of public goods and private goods. 

This needs careful consideration, since a lot of IP rights are being skated over right now - the old "move fast and break things" is being applied by some unscrupulous (or to be more generous, just careless) organisations.

Is OpenAI just napsterising the stuff in the common crawl that has clear limits on commercial/for profit re-use (code and data)?

A couple more points about the Public/Private Partnership aspect of digital infrastructure (and goods). 

The Internet was public until 1992. Then the US government divested, so the birth of commercial ISPs happened. Later, ISPs got big enough to own transmission infrastructure (fiber, last-mile copper, spectrum etc). Some of the net remained state-provided (from my narrow UK perspective, examples are the UK JANET network for research & education and the NHS spine for health services - there are plenty like that) - there are also community-provided networks (e.g. Guifi in Spain) that are collectively owned and operated. In the process of federating these together, various tools and techniques emerged for "co-opetition" - things like BGP for routing, Certificate Transparency for certificates etc - these are also examples of how to co-exist in a PPP world, and they have (mostly) worked for the 32 years since then.

So there are interfaces between components provided using different models (public, private, community). And these change over time (both the technical and the legal, regulatory, business relationships).

The other thing here is time scales - the Boeing 747 ("Jumbo Jet") has had a product lifetime from 1963 until 2023. Software to model it (from wind tunnel tests, to avionics etc) has to run until the last one stops flying. That's 60 years so far. Any DPPG (software artefact, digital twin etc) being designed today better have a design lifetime of at least 100 years. Yes, that is right. One Hundred Years. Not of solitude.

What sorts of businesses have survived unscathed for these sorts of timescales, and what models do they use (my university is 800+ years old, and then there's the Vatican:) Quite a lot of nation states have not lasted that long.

Tuesday, May 21, 2024

DPGs #1

 


The oldest and best example of a digital public good is the Internet. Why people don't start from this is surprising to me:


Since 1982, the source code of the exemplary implementation from UC Berkeley has been available, and documented in an open-access series of books describing that code and how it works: TCP/IP Illustrated (vol 2)


The key thing here was that everything accepted as an internet standard had at least 2 interoperating implementations, preferably three, one of which was open source (unencumbered by any IP) - for me, this defines digital (code/data), public (there's no barrier to entry due to ownership or restrictive practices) infrastructure (you can run the code, and computers are general-purpose machines, so any computer can run it, subject to resource constraints:-)

Two other reference points - despite the best of intentions and some clever game theory in the design processes, we still suffer from frequent tussles in cyberspace - see Tussle in Cyberspace, from the same people that said this:

"We reject: kings, presidents, and voting. We believe in: rough consensus and running code." David D. Clark 1992.


The standards process in the IETF has open governance, overseen by the non-profit Internet Society, with free access to standards documentation (RFCs) and processes...plus online/remote access to standards meetings for 30 years...go from here: The Internet Society and the Internet Engineering Task Force, which includes hackathons and code sprints as well as writing specs.



For many years, there were also open events for interoperation testing. I remember going to the first Interop Trade Show in Monterey in 1986



The actual operational running of the internet (a mix of private, public and mixed provisioning) has teams of people around the world coordinating - e.g. NANOG, RIPE and AfNOG in the US, Europe and Africa - e.g. see Réseaux IP Européens - and also net information registries, e.g. AfriNIC.


As well as this, the origin of the computer emergency response teams (the "CERTs"), who deal with coordinated response to security incidents, was in coping with attacks on systems and the infrastructure.


Much of the leading-edge research is also covered in open-access academic conferences which also typically feature published code and test data (artefacts) and even reproducibility testing results - e.g. see ACM SIGCOMM for a good list of examples of the state of the art (probably about 5 years ahead of deployment).


A sustainable DPG would include a decentralised grid made of a very large number of microgenerator sources - we have been building something like this on public buildings in the city where I live (London, England), where we crowdfund putting large solar installations on schools, gyms, etc, at the scale of 100kW typical configurations. We are working on getting permission to build a publicly owned grid to re-distribute spare power locally (rather than having to just go through the privately operated centralised grid). Such a system could (with appropriate use of storage, e.g. in batteries in nearby parked EVs) provide a power source for most digital public services.


A whole ecosystem ready built as a way to do all aspects of a DPG!
