imagine you have data about 100M people with around 1000 dimensions,
some binary, some other types statistically distributed in various ways, but let's just say kind of uniform random.
so a given person has a pretty clear signature even if it is all binary: 2^1000 is a big space, i.e. a key that is very likely different for each person.
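to put a rough number on "very likely different": if we assume every bit is independent and uniform (a simplification, matching the "kind of uniform random" hand-wave above), a birthday-style count gives the expected number of pairs of people who share the exact same key. the function name here is just illustrative.

```python
import math

def log2_expected_colliding_pairs(n_people: int, n_bits: int) -> float:
    """log2 of the expected number of pairs sharing an identical n_bits key.

    Assumes each bit is independent and uniformly random, so any two people
    match with probability 2**-n_bits. Working in log2 avoids underflow.
    """
    pairs = n_people * (n_people - 1) // 2
    return math.log2(pairs) - n_bits

# 100M people, 1000 uniform binary dimensions
print(log2_expected_colliding_pairs(10**8, 1000))
# how many bits until we expect about one colliding pair?
print(math.log2(10**8 * (10**8 - 1) // 2))
```

under these assumptions the expected number of collisions is roughly 2^-948, and about 52 uniformly random bits would already be enough to expect only around one colliding pair, so 1000 is massive overkill as a key.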
but imagine 10 of the dimensions are not binary but (say) Gaussian-distributed values, and the other 990 dimensions are basically 0 for most people, but each person has a 1 (or a small number) in some dimension, and a different dimension for different people.
so the 10 dimensions are fairly poor at differentiating between individuals in the 100M population,
but the remaining 990 still work really well, i.e. these are things that are rare for most people but different for different people, so still a very good signature.
but say we want to publish data that doesn't allow that re-identification, but retains the distribution in the 990 dimensions.
so what if we just permute those values between all the individuals? we leave the 10 values alone, but swap (at random) the very few 1s in each field with other people's values for that field (mostly 0s, a few 1s), for all 100M members of the population.
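a minimal sketch of that idea on toy data, reading "permute" as "shuffle each sparse column independently across people" (that reading, the toy sizes, and all the function names are my assumptions, not a fixed recipe). the dense columns are left untouched.

```python
import random

def make_population(n_people, n_dense, n_sparse, rng):
    """Toy data: Gaussian dense columns, plus a single 1 per person
    placed in a randomly chosen sparse column."""
    dense = [[rng.gauss(0.0, 1.0) for _ in range(n_dense)]
             for _ in range(n_people)]
    sparse = [[0] * n_sparse for _ in range(n_people)]
    for row in sparse:
        row[rng.randrange(n_sparse)] = 1
    return dense, sparse

def permute_sparse_columns(sparse, rng):
    """Return a copy where each sparse column is shuffled across people."""
    n_people, n_sparse = len(sparse), len(sparse[0])
    out = [[0] * n_sparse for _ in range(n_people)]
    for j in range(n_sparse):
        column = [row[j] for row in sparse]
        rng.shuffle(column)  # same values, reassigned to random people
        for i in range(n_people):
            out[i][j] = column[i]
    return out

def column_sums(rows):
    """Number of 1s in each column (the per-column marginal)."""
    return [sum(col) for col in zip(*rows)]

rng = random.Random(0)
dense, sparse = make_population(1000, 10, 50, rng)
published = permute_sparse_columns(sparse, rng)

# per-column counts of 1s are unchanged by the shuffle
print(column_sums(published) == column_sums(sparse))
```

note that shuffling columns independently keeps each column's count of 1s, but a published row no longer has to contain exactly one 1; if the per-person count matters, a different permutation scheme would be needed.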
what's the information loss?
basically, we're observing that, published unaltered, the data in the higher but sparsely occupied dimensions has very strong identifying power but very poor explanatory power... so messing with it this way massively reduces the identification facet, but shouldn't alter the overall distributions over these dimensions (w.r.t. the densely populated fewer (10) dimensions)
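one way to probe the information loss is to compare a statistic that ties a sparse dimension to a dense one, before and after the shuffle. the toy below is deliberately the awkward case: a made-up flag set for about 1% of people that shifts a made-up Gaussian value by +3 (both numbers are assumptions for illustration only). the flag's own count survives, but the link to the dense value doesn't.

```python
import random
from statistics import mean

def conditional_mean(dense_col, flag_col):
    """Mean of a dense value over the people whose sparse flag is 1."""
    return mean(x for x, f in zip(dense_col, flag_col) if f == 1)

rng = random.Random(1)
n_people = 20000

# hypothetical sparse flag: set for roughly 1% of people
flag = [1 if rng.random() < 0.01 else 0 for _ in range(n_people)]
# hypothetical dense value: Gaussian, shifted by +3 for flagged people
dense = [rng.gauss(0.0, 1.0) + 3.0 * f for f in flag]

# publish the flag column shuffled across people, dense column untouched
shuffled = flag[:]
rng.shuffle(shuffled)

before = conditional_mean(dense, flag)      # close to the +3 shift
after = conditional_mean(dense, shuffled)   # close to the population mean

print(sum(flag) == sum(shuffled))  # marginal of the flag is preserved
print(before, after)
```

so the loss is exactly whatever dependence the rare dimensions had on the dense ones (and on each other); if that dependence really is negligible, little is lost.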
does this make any sense to try?
ref: PrivBayes
Another way to think of this is that the low occupancy dimensions are unlikely to be part of causation coz they have poor correlation with anything else, mostly

 
