Open Data, Personal Data and Privacy workshop
Welcome to the notes page of the Open Data, Personal Data and Privacy workshop, June 2014
All thoughts welcome!
1. The Chatham House Rule will apply to all discussions taking place at the meeting!
Participants
Fiona Nielsen
Helena Nielsen
Anna Crowe
Malavika Jayaram
John Gibson
Andy Turner
Mark Elliot
Reuben Binns
Dr Helen Wallace
Carl Wiper
Ossi Kuitinnen
Anssi Mikola
Casper Bowden
Christopher Wilson
Steve Benford
Antonio Munoz
Sanne Stevens
Dr Kieron O'Hara
Dr Mark Taylor
Phil Booth
Mark Lizar
Walter van Holst
William Heath
Kumar Sharad
Alex Edwards
Keith Spicer
Simon Burall
Jacqui Taylor
Peter Jones
John Harrison
Emma Prest
Javier Ruiz
Laura James
Dirk Slater
Amelia Andersdotter
Sally Deffor
------------------------------------------------------------------------
Agenda Hacking: Issue identification & ID of key assets that exist, and examination of what's needed:
Tuesday Afternoon session - Casper + others on anonymisation
- Background from Casper:
- EDPR: Great document - describes the state of European law and also the state of computer science. Very unusual to have this in a doc from Europe
- Paper by Vitaly Shmatikov - the Netflix paper, addressing social network data.
- Starting the story: 1995. [will write paper - do let Casper know if you have info]
- European data protection directive came out then.
- The following is not well known:
- DPD went to parliament. Feb 95: 2 articles, one defining personal data and one on depersonalised data.
- when it came out after secret discussions, we got the current directive: 1 article defining personal data, plus recital 26 defining depersonalised data. member states don't have to translate recitals! => so a huge fudge at that stage! Believed the UK did this; the UK didn't like the directive, so made this fudge.
- what's the effect? Recital 26: data should be considered identifiable if it can be identified by the controller or by any other person -- the 'or by any other person' part was missed in the UK law version
- e.g. an ISP allocates an IP address. The ISP knows it, but can others be assumed to know? unsure.
- by 2007-8, IP addresses were considered personal data
- didn't matter for most countries where recital 26 was translated, but in the UK it wasn't, so the data is treated as a different category.
- therefore guidance from the ICO is very mixed up
- 2012: the ICO gives anonymisation guidelines, in which pseudonymous data counts as anonymous and not personal data. Big impact on privacy etc.
- The Article 29 Working Party issued an opinion, meant to be guidance for all EU countries. [ref in footnote 21 of primer]
- 30 pages, comprehensive; says clearly that a cardinal mistake is to regard pseudonymised data as anonymous data.
- they describe 7 methodologies which can be used.
- conclusion bleak! if you want to anonymise you need to hire a CS PhD who is clueful and do a strong analysis of the data and its potential use!
- But there's no general formula, so it's rather negative guidance. "you need lots of help and even then it's hard to say"
- this isn't the usual data protection guidance
- also, confusion about the meaning of pseudonymised data itself. 2 types: (1) record-level data, where you take away names and identifier numbers and replace them with a simple index number (serial or random), so the data is still linked but the index carries no meaning; (2) other types of pseudonymisation. (A sketch of type (1) follows at the end of this section.)
- these are supremely different and any definition that blurs them is problematic
- the question is: does the data controller keep a copy of the index, or is it deleted after pseudonymisation? in most cases, you need the index for some use.
- The Article 29 WP says: if you want to protect privacy you should delete the index. Casper recommends reading the docs; tough computer science concepts, but very good.
- Where does this all lead?
- a new general DP regulation has been in the works for the last 2 years, hung up on various issues.
- worry that the Commission has just sold out on anonymisation; it published the regulation without any pseudonymisation concept, but this time put the definition in an article, where it belongs, so it will be interpreted correctly in each country.
- then the civil liberties committee had amendments and chose to introduce a pseudonymised data concept without a good definition. The Council of Ministers changed this to the process of pseudonymisation, rather than a definition. And sadly the EC went along with that rather than pushing for it.
Casper happy to explain any of the above!
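A minimal sketch of type (1) record-level pseudonymisation, as discussed above (Python; the field names and data are invented for illustration, not from the workshop):

    import secrets

    def pseudonymise(records, id_field):
        # Replace the direct identifier in each record with a random index.
        # Returns the pseudonymised records plus the index table mapping
        # pseudonym -> original identifier. Deleting that table (as the
        # Article 29 WP urges) breaks the link; keeping it is what most
        # real uses (research follow-up, audit) require.
        index_table = {}
        output = []
        for record in records:
            pseudonym = secrets.token_hex(8)  # random index, carries no meaning
            index_table[pseudonym] = record[id_field]
            new_record = dict(record)
            new_record[id_field] = pseudonym
            output.append(new_record)
        return output, index_table

    patients = [{"name": "Alice", "condition": "asthma"},
                {"name": "Bob", "condition": "diabetes"}]
    released, index = pseudonymise(patients, "name")
    # 'released' is still linked, record-level data; 'index' is the re-identification key.

The policy question above reduces to what happens to 'index': deleted, retained under security measures, or accessible only under court order.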
Questions / topic ideas / comments:
* impact for UK if regulations enacted?
* practical understanding of the current situation and implications for those who are releasing pseudonymised or anonymised data, or who are attempting to anonymise data
* what about permissions to access the index - eg you can get data with a court order if you are police but not accessible for most people.
* what about practical measures to make it almost impossible for an ordinary citizen to get access to the data?
* regarding the definition of pseudonymisation as a category, a strange midpoint between anonymous and identifiable. the process of pseudonymisation depends heavily on the measures applied
* would like to go to an anonymity concept in a relative way: identifiable for you but not for the recipient. get away from these absolutes. there's no room for technical or contractual measures for access to the index here
* international transfer of personal data... many implications beyond consent
* what extent does scottish ICO-equivalent fit with UK vs EU?
* is pseudonymised data a meaningful and useful category, even if we understand anonymisation isn't binary? better to stick with the scale of "fiction" we have today with the anonymisation scale, than to introduce this extra category in the middle?? Perhaps the role of pseudonymisation is about data CONTEXT rather than CONTENT? useful to draw attention to context maybe?
Remarks back:
absolute or relative concept of identifiability is central
the relative concept: you have a data controller, and this DC in theory is the only person able to re-identify some data via the index. Around 2007 there was another opinion on the concept of personal data. It said 'if I'm the DC and the only person who can identify, then it's not personal data, as long as you take as much security as you can to ensure the index doesn't escape'. But now it's gone a lot further, to a concept of absolute identifiability. [this is the 261 opinion]
'likely reasonably' / 'reasonably likely' - the new regs say 'reasonably likely to be...'
if everything, every cookie, is identifiable, how do you cope with this in internet context? in the past this was ignored. Now, it's 'reasonably likely' or it's personal data. BUT special exemptions may apply.
e.g. in the new regs wording as of last October, for research purposes: try to anonymise; if that isn't possible you could use pseudonymised data; or if there is an important public interest you might get a special exemption to use personal data.
Research data - research community need to be able to access the index to do the research.
the new WP216 is adamant about index deletion, but this defeats the purpose of the research protocol for, say, cancer registries or long-term (longitudinal) medical studies.
when you talk about reasonably likely identifiable, you mean 2 things: (1) reasonable via data analysis, to get a probabilistic chance of identifying, or (2) where the index is removed and only accessible via, say, a court order.
the new regs are problematic because:
they say pseudonymised data is still personal BUT have taken away all your rights as a data subject
if there's a breach of pseudonymised data, you notify the data controller, not the data subject.
they also nullified the right to access pseudonymised data, which is also the right of the data subject to correct the data.
without the index, you can't CONTACT the data subject to give the subject their rights!
So it's all really confused. either you know who the subject is but they have no rights, or the subject isn't identifiable and so can't have their rights.
so the 3 pillars of the EC are drifting towards this idea that pseudonymised data is personal, but have lost the rights associated with it. It's a way of covering legal embarrassment.
are there times it makes sense to use pseudonymisation?
Pseud data only makes sense if there's prior knowledge of one person
it's a reasonable internal security provision. eg internal accounts in a system, for audit purposes. but it doesn't make sense as a release to third party control or risk limiter.
but actual release is a whole other thing.
Policy makers have been conned into thinking pseudonymisation techniques are a 'new tech' which solves the problems and makes it safe to release data. It's been brewing in the UK for 10 years or so. Wellcome Trust etc.
Now the agenda of the current UK govt: 'if you pseudonymise it, no problem'
ICO knew about this stuff ages ago... [Casper admits he's deeply involved so isn't objective!]
Mark Elliot running a mailing list and a new project spun off from the ICO code, a body of knowledge about pseudonymisation. Kieron proposed a larger scientific research centre of expertise. Govt hasn't chosen to do this substantively
ICO in tough spot now
what are the practical risks of potential future problems?
* data controller database compromised, index gets into public domain
* future tech is better than current and allows reidentification
is pseud still better than releasing personal data?
the 'balance' term suggests there is a spectrum with 2 ends and you pick a point. but in fact it's complex and there's no magic sweet spot
it is a good excuse - path of least resistance to argue there's a nice balance point.
other risks:
* social steganography
* ...
* eg data collected... countries with civil war, factions may use a dataset to identify a social structure around a data point
* shown time and again people can be identified with some success
two notions of risk - a social notion of risk, and a security/tech notion of risk (where if there is a vulnerability, the risk is treated as 100%)
at what stage has one 'done enough' ? may revise judgement in future of course
not just that you can do an attack but it's plausible.
there is a question: does a motivated intruder exist?
of course that will depend on the 'target' individual - some more at risk than others
the level of diligence of a private sector DC vs a public sector DC may differ?
- motivations different
- penalisations of private sector reckless action - none known so far
approaching the uncertainty
what is data? what is personal? bad situation!
very few people understand this.
need to go deeper in discussion
human rights issue - treaty of rome, what UN is doing. Wider discussion of humanity
internet architecture is destiny
tech & policy - 2 worlds -- need more overlap
need global discussion
3 ah-ha moments:
* the legal and tech world intersection
* how the discrepancies arise between UK, EC and other positions - how it all works out - legislation in flux and different in practice. hadn't realised 4 words made so much difference and that this means different countries have different setups
* pseudonymisation seen as a comfort blanket by policy makers, removing the need to think about the complex landscape of anonymisation
WP216 document - very radical! mostly because DP officials often not clueful technically (30 out of 1500 with any CS skills??)
lobbying from US companies trying to kill it. commission trying to set up carrot and stick. simplifying stuff is carrot, but stick is all-embracing personal data, with exemptions to make it acceptable.
what are other implications?
* for data subject as above
* for DC in terms of index removal as above
* "the DC should not have to collect any more data to give effect to regulation" - what does that mean? sounds harmless but is fatal! eg your google clickstream when not logged in. Indexed by cookie and IP address. you go ask google and you provide your cookie. If google get that request, they can't guarantee that cookie is yours, and so they cannot give out your data because there's extra info that is necessary.... unless you can strongly prove the data is yours. Nulifies right to access or correct or delete your data ! they should have said "you should offer the data subject a strong data authentication secret' (just for this purpose)
if you could access your pseudonymised data, that might be useful for you, but it depends on the DC being willing to give it to you. Article 10 removes that.
implications for private data sharing and third parties? none in particular
hashes as indexes
a hash is a number which is a deterministic mashup of some data
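A hedged sketch of why a plain hash makes a weak pseudonym (Python; the identifier format is invented): the hash is deterministic, so it still links records across datasets, and if the space of possible inputs is small it can be reversed by brute force:

    import hashlib

    def hash_index(national_id):
        # Deterministic: the same input always yields the same index,
        # so separately released datasets hashed this way still link up.
        return hashlib.sha256(national_id.encode()).hexdigest()

    target = hash_index("AB123456")

    # Brute-force reversal over a small identifier space:
    for candidate in ("AB123455", "AB123456", "AB123457"):
        if hash_index(candidate) == target:
            print("re-identified:", candidate)

A secret salt or key makes this harder, but then the salt is just another index that someone holds, which is the same policy question again.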
if you care about this please beat up the Ministry of Justice
also lobby your MEPs
the Cabinet Office should have set up a scientific centre of excellence on this -- they didn't, so we have a community group, better than nothing. but they are data consumers, not privacy people. Funded by the ICO, who stepped in with money when no other money was there; they didn't have to, but at least they tried and no one else did
crash course in differential privacy:
- it's hard!
- invented 2007 by Cynthia Dwork, who is a genius
- a national research centre, physically restricting data location, heavy duty data research around this
- anonymisation - classic technique - perturb data whilst keeping statistical properties the same
- usually do that by looking at whole data set and mixing it up from there
- application of Diff Priv:
- imagine formulating your stats query and asking it of this database
- the system looks at your query and automatically works out the optimal distribution of noise to give the best privacy for a given privacy bar whilst giving you the right stats result. it allows for a set EPSILON level of privacy protection
- so when you fire in these queries you get a fixed privacy budget. once you've asked some set of queries you have used up all your privacy, and then you have to throw it away
- this is like alien weirdness to policy people!
- too hard to explain
- paradoxical consequence
- Example in practice: if you make some info available it PREVENTS some other info from being released later! because you only have so much privacy to use up
- if you pseudonymise the data it's hard to predict what will be identifiable, but with Diff Priv, as long as you are within your epsilon, you know it's OK, regardless of what happens later.
- note that interactive scenario is less cheap :)
- the individual guarantee of diff priv: whether you are part of the dataset or not, the info revealed about you is the same! the answer is statistically very close.
- this is NOT an anonymisation technique.
- this limits inference attacks (that's where you ask about broken legs then about appendicitis then about diabetes and the set of data you get overall lets you infer stuff)
- how can you institutionalise this? the epsilon knob needs some risk calibration! v tough.
- helps if you keep some hold out data which you can use later for model validation
in CS now: can you do distributed Diff Priv? can you do diff priv on data streams?
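A minimal sketch of the Laplace mechanism for a counting query (Python with NumPy; the data, epsilon values and budget handling are invented for illustration). A count has sensitivity 1, so adding Laplace(0, 1/epsilon) noise gives epsilon-differential privacy, and each query spends part of a fixed budget:

    import numpy as np

    def private_count(data, predicate, epsilon):
        # Laplace mechanism: a counting query has sensitivity 1 (one person
        # changes the count by at most 1), so noise with scale 1/epsilon
        # gives epsilon-differential privacy for this single query.
        true_count = sum(1 for row in data if predicate(row))
        return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

    patients = [{"age": 72}, {"age": 45}, {"age": 63}, {"age": 58}]
    total_budget = 1.0                  # the fixed privacy budget (total epsilon)
    spent = 0.0
    for eps in (0.5, 0.5):              # each query spends part of the budget
        if spent + eps > total_budget:
            raise RuntimeError("privacy budget exhausted - no more queries")
        print(private_count(patients, lambda p: p["age"] > 60, eps))
        spent += eps

This is the 'alien weirdness' above in code form: once 'spent' reaches the budget, no further answers can be given, regardless of how harmless the next query looks.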
---------------------------------------------------------------------------
Tuesday afternoon - Project design discussion group
Topics for discussion/outputs
- issues around personal data - looking to recruit students to engage with things
- open data evangelists vs privacy fascists - middle ground is hard to design around - the two communities don't talk to each other
- how do we 'bake in' privacy and security into how projects are funded, designed - and giving those in the field the capacity to implement
- let's avoid "privacy by design" as terminology - it can be implemented as a checkbox exercise, without enough thought about how it applies to the work
- open data and development - how to build in data ethics?
System design
- can have naive system designs (e.g. a broker for health insurance, where any company can request information - you can do that differently)
- not enough awareness of potential solutions
- can work with large commercial partners to probe the issues around personal data and privacy - employ multidisciplinary approaches - internet of things - interested to produce prototypes and field them in reasonable numbers and then do ethnographies and then policy work. Finding out how a smart object is used in the home.
- - could this just be commercial validation? Willingness to revisit fundamental design aspects will be pretty low
- - You're not going to fundamentally challenge whether you need something. E.g. smart meters - it turned out the sampling rate was less than one minute; by correlating the data with TV programming, they could show with a reasonable amount of confidence what TV programme people were watching. Power differential - this was surveillance for the benefit of the power company, not transparency of the power company (e.g. saying when you could turn off an appliance to save extra money).
Life cycles - frame for working through the issues
- how to map the lifecycle of a project to know at what points you can intervene to change things
- issues for development sector that are different from the private/government sector - are there things others do well/they do well
- analytics around impacts of programme - how to assess impact on privacy and security
- baselines - independent criteria; e.g. certain kinds of data you don't collect without consent - are there universalities?
- looking at harm stories and failures
- specific technologies that are higher risk
- do we want to think of a vertical approach (where each sector has its own rules) or are there general rules
E.g. baselines for when you work with mobile data, geolocation data.
Types of technologies
- most ebook readers have in their terms and conditions that they can collect your reading data for the purpose of improving the service. Looks at how quickly you read, what pages you read. The data is held by a cloud provider - your favourite passages of e.g. the Quran are held.
- notion of big data - working backwards from original statistical method. We're collecting too much data.
First issue - collect only what you need.
- This is a general principle of data protection.
- blind faith by people collecting the data in anonymisation/scrubbing of databases - just collect it all
- Idea of anonymisation - can just fix everything and worry about it later.
- they feel there are technical tools to fix everything post facto. And they now can collect everything.
Perhaps we should think - worst case scenario
- what would a genocidal dictator do?
- in the design community this is gathering pace - turning to dystopian design fiction around products
- but need to translate it for an ordinary audience, make it user friendly
- often get polluted data because the data is not collected in the right way
There is a question of who makes these decisions when you design a project
- the user is often left out of the story - funders, development agency
- if you ask a user they would often say "take it all", but are there actually things we wouldn't collect, as responsible actors
- successful design is paternalistic
- but it's not an either/or - methods sit together
At the design fiction end, there is a role for digital artists in provoking debates - e.g. Blast Theory - illustrating the dystopia
These methods line up at different parts of the lifecycle
At the level of the individual system, need to help people understand they can achieve things with less data - they can do with less.
- How do we build incentives to say the less you collect, the better? As data gets cheaper, what are the other benefits we can promote?
- Can we put a pricetag on data collection as a minimum?
- Shifting incentives is important.
Summary of the discussion:
- Data minimisation - very important principle - why is it so threatened?
- Collecting data is cheap - the harm is externalised.
- Practical proposals to bring that principle back.
- Using analogies that people understand.
- Using technologies to e.g. bring class actions, bombarding agencies with freedom of information requests
- Right to be forgotten - debate between freedom of expression vs data protection people in civil society - divisive debate
Take-aways
- alternative project design strategies - dystopian design fiction, artists
- collecting best practices, documenting harm - a necessary step
- problem of data minimisation principle not being respected - what incentives are needed (e.g. agreement with funders to have certain baselines)
Tuesday afternoon - Consent
PAX:
- Christopher Wilson
- Peter
- Mark. Barriers
- Helen- tracking genomes (children)
- Helene- “informed”
- Mark-
- Reuben - delegated consent
- John – personal info brokers, permission hubs
- Carl – is consent problematic
- Simon – underlying (technical) architecture
- Phil
Open data definition - about things like bus timetables, but not about people. There's no trouble with impersonal data. But trouble when it comes to individuals, shopping habits, health, etc.
Both gov and private sector.
OECD roundtable paper - four classifications of data by origin: 'provided', 'observed' (publicly visible), 'derived' (the result of data mining), and 'inferred' (ref forthcoming). There's a difference between surreptitiously observed vs consent-based observation.
Data is 'personal' when it relates to a person, but also depends on what other data it could be combined with to identify someone. Depends on the context.
The main paths to open data release are: 1) anonymisation, 2) legislation requiring disclosure, or 3) consent. The latter is hard when you don't know the purpose of the open data use in advance.
E.g. health records: there's been a debate about the release of health records through HSCIC - 'lightly anonymised'. Data collected by GPs, then uploaded to a central database where it is anonymised, then released. You can opt out by writing a paper letter - the GP puts a flag on your record.
Another kind of consent would be where you are given lots of paper certificates (e.g. qualifications, driving licence). Then you decide to give a third party sight of the data. Here the individual has control over the disclosure.
Difficulty in working out what counts as 'incompatible' with the original purpose of the data collection.
There's a piece missing in working this out.
There are different issues with licensing and data protection - but the two are applied in unison. Conceptually different.
One of the main reasons for licensing data is to maintain the status of non-personal.
Is a personal data license possible? what would it look like? Can you really over-ride fundamental DP principles with a license?
Licenses are often put in place to ensure compliance with DP principles.
That would be relying on contract law - providing a proxy protection.
If a gov department shares data under conditions with another, that's not open data. Can think of this as 'front office' / 'back office'.
It was assumed in DP law that 'data controller' is an organisation. But it could be interpreted as 'natural person'. But not thought of as individual controlling their own data.
There's a lack of distinction around someone who could be both data controller and data subject at the same time.
There's some flexibility wrt 'joint data controller' 'data processor'.
Seems like we're talking about the legal mechanisms but not consent itself. Let's take a step back!
Is general consent possible for open data? There's consent to whether personal data flows to a particular organisation (consent of access), then there's consent of use/control (what the organisation do once they have it).
You can give your genome data to the Personal Human Genome Project with consent, but the approval of a particular research use is done by the organisation, not the data subject.
Think through some concrete examples:
- 1. Personal Genomic Data
- 2. Quantified self data is one end of the spectrum.
Think about the use of the data, consent covers that.
The point of collection, use, and origin.
Question 1: Can I consent to open data
- Arguably not (legally) because so broad. Not specific enough.
Question 2: can you consent to 'nice research' and not 'bad research' (e.g. biological warfare research)
Not many people will likely want to upload their genomic data if there are no constraints.
Let's assume anonymisation is a black box which works.
There is legislation which defines these things - people trying to make personal data not personal data using less than adequate anonymisation techniques.
Consent implies permission, but what the genome project case is talking about is personal 'publishing'.
How do you get informed consent to anonymisation when anonymisation techniques are a 'black box' for most people.
To treat in law personal data as just being yours is wrong - e.g. the genomic data example which affects your relatives.
There are certain situations, types of data, which should always remain closed.
Difference between publication and consent.
- Publication: take example from the copyleft people, publishing under a license for re-use, sharealike, etc. These may be useful to constrain further use of data.
At the moment I self-publish data, I give up certain data protection rights?
Depends on the context - e.g. social media if you publish there it doesn't necessarily mean you're giving anyone permission to do anything.
The way to get away from that is if you publish your data on your own website.
So paradoxically I have less control over my own data if I publish it myself.
The question is if someone else takes it from independent blogs, they become a data controller.
They don't necessarily need consent if they have one of the other bases of fair processing.
The 'lock comes down again' every time there is a re-use.
The problem is that if you're a data controller, you need a condition for processing. One of those conditions is consent.
In practice, the other available conditions don't work. But in a big data / open data context it often comes down to consent.
there's a connection between the practical examples and the broader principles.
The problem is around making a broad enough consent for many different purposes, and still being valid as 'specific' consent.
E.g. individuals may not agree with e.g. GSK activity as 'medical research'.
Are the protocols for consent applied to data controllers adequate?
If you define purposes tightly enough, it may be feasible to get informed enough consent for research uses.
General principles for consent.
- 1. There is independent oversight (what does independent mean?)
- 2. A mechanism for feeding back to the individual how it's been used and where. They must be able to find it out. It's not too difficult if you prepare for it in advance prospectively - very hard to do retrospectively. An audit trail should be a requirement
- 3. Proportionate sanctions - what would they look like?
But the UK Biobank records example - they subsequently decided they couldn't delete everybody who wanted to opt-out. They then shared with an unspecified company they didn't originally mention.
One of the reasons often put forward against consent, is that 'people aren't that bothered'. They just want the benefit / end product. People don't want to read privacy notices.
So that's one of the practical challenges - how do you give people the information in a condensed form ?
The MIT media lab clear button initiative is an example of opt-out mechanisms.
All of my data has some bearing on other people - if you exclude all personal data relating to other people, you exclude too much. E.g. household data. It's a slippery slope. There's a spectrum.
E.g. smart meters - always family-level data. But not about specific people, like genomic data.
Anonymisation gets harder the more data you have - so every kind of data becomes relevant.
There's no hard/fast line on this.
There are no stronger restrictions on access, because people couldn't be sure that those with access hadn't used it secretly.
There is a notion (a social contract) that by going to the hospital you have given implied consent to data processing for delivering treatment. But secondary uses are not implied (e.g. invoice reconciliation).
Even if you forget about the information content of a genome, it's still a biometric.
Conclusions:
- Different laws apply
- Consent can't be completely open-ended.
- It's not about just access, but also use.
Tuesday Afternoon - Resources Group
Resources on Anonymisation
1. ICO Code of Anonymisation: broad guide on how to think about anonymisation
2. ONS Guidelines consistent with the ICO Code of Anonymisation
- Guidelines on how to anonymise microdata (individual records)
- Guidelines on how to anonymise tabular data for social survey
- Guidelines on how to anonymise admin data
- Guidelines for health stats
- Guidelines on birth and death stats
3. ONS SDC courses on disclosure control across government which are open to others. Also run bespoke courses on disclosure.
4. ONS writing e-learning materials on anonymisation
5. ONS have a 'helpline'. Questions can be sent to sdc.queries@ons.gov.uk
6. UK Anonymisation Network (ukanon.net) working with ONS, Southampton University, Manchester University, ODI and ICO. They provide consultancies and are collecting case studies, writing a book, holding a symposium.
7. NHS 'Anonymisation standard for Publishing Health and Social Care Data Specification'
http://www.isb.nhs.uk/library/standard/128
8. Statistical Disclosure Control, 2012
“A reference to answer all your statistical confidentiality questions.”
9. Nice example: the Department for Work & Pensions has StatXplore for the public to interrogate data flexibly, with a built-in algorithm to anonymise the data.
10. The Cabinet Office and Tim Berners-Lee have a star rating system for open data.
Don't want too many guidelines, could create confusion and inconsistent advice.
GAPS
Easily digestible resource flagging key things to think about.
Create an overview of what is out there so people know where to look.
Glossary to define terms (do we even have a common language yet?)
Advice for businesses looking to anonymise and release data.
Session on Open Data FOR Privacy – convened by Reuben Binns [Notes emailed by Elizabeth from ORG, who wins the prize for best report!]
Open data can help achieve privacy aims through transparency or by changing business models, e.g. buyers are able to market themselves to sellers. Idea of “self advertising” is that people advertise themselves rather than the other way round. The aim is a positive outcome of innovation as well as protection: “applied privacy”.
Discussion of using the ICO notification register of data controllers to obtain information on how data is processed for e.g. for health, by banks and who the recipients are. The new EU law proposes to get rid of the register as it aims for a light touch and focusses on what organisations are doing internally. It is intended to lead to better data protection outcomes in the longer term.
One tool is freedom of information requests, but only for public authorities. Questions can be asked regarding use of data and data breaches. ICO investigations – third parties do not get access to details. Customers may want open publishing of reports – advocacy would be needed to promote this. The downside is this might have a chilling effect and prevent companies from reporting breaches if it is made public.
We should not ignore the private sector discussions around big data and the role of transparency. For customer and stakeholders it is on the corporate agenda. It is about companies being transparent about what tracking they are doing when people use their websites. Certification schemes could be a good idea. Mozilla has done it recently.
Another barrier is the disruption of business models. Many companies trade on having data in a market where customers are the product sold to advertisers. This could be disrupted.
We need new tools for self-determination. People need to get control of their footprints which requires innovation policies for a human-centric society rather than corporation-centric.
It is not clear whether enough citizens care about the issue. The ease of the current route means many people may not want to create their own world of data. But there is demand within the data protection world. It is a social need. Companies need to find competitive advantage from doing it. The ICO's scheme of red, yellow, green for compliance could be developed. The ICO is launching an accreditation scheme. It will relate to specific practices rather than companies. There is discussion around privacy as a competitive advantage, e.g. Boston Consulting Group. But many people don't care as long as their data is not lost. The EU Data Protection Commissioner produced a report recently. The monetised value of data may outweigh the power of consumers.
There are three advantages:
1) Accountability – companies get found out if they do wrong;
2) Allows customers who care to make rational decisions; and
3) For customers who don't care, transparency in the system may preserve trust in the system in general (like open source software where people don't read the code)
A concern is inappropriately shifting responsibility to the individual. This should not absolve companies of their obligations. Information mediator companies may be needed to get people a good deal. The regulator must oversee it. It could change to be the overseer of the privacy infrastructure. Privacy policies put the responsibility on the individual to read them all. People don't want to make decisions on data everyday. Companies like Google promote transparency for hidden interests. 'Open-washing', like 'green-washing'. Most open data is not demand driven at present.
We need lobbying for 'my data'. The government or regulator should have a role in 'my data' but it may create more work in terms of complaints. We need to see examples of being effective but it is hard when the benefit is intangible such as transparency. We should look at sectors where there is an incentive for individuals to repurpose the information. We could look at the Netherlands in terms of subject access requests to telcos. Customer switching is a measure.
We are looking for a mapping of the space and the barriers. Transparency is a stepping stone but just revealing bad things does not solve the problem.
The 'AHAs':
1) Three uses of transparency:
a) Accountability – companies get found out if they do wrong;
b) Allows customers who care to make rational decisions; and
c) For customers who don't care, transparency in the system may preserve trust in the system in general (like open source software where people don't read the code)
2) Barriers:
a) Chilling effect – companies may become secretive.
b) Transparency itself is not enough, need to do something with the data that is made transparent.
c) Engagement – people have other concerns as well as privacy and we should not place too much responsibility in consumer's hands.
Tuesday afternoon - open data & anonymisation through aggregation
Open data which is about say weather is fine to open up - no personal element
but what about when we have a set of data about people? can we open it up, and if so how?
how does aggregation work? there must be some degree of 'collapsing'
if you have a 2x2 table, you can turn that into a record-based system. there's no real distinction between the two. Being table vs record doesn't matter.
from the theory - there's k-anonymity. you can collapse data until there are enough individuals within each group that sufficient aggregation is achieved.
but what about practice?
k-anonymity works well when the database is low dimensional - few columns, many rows. but if there's lots of columns this fails.
you can collapse some columns, sure, but then we must talk about the distance between the records! the closest neighbour and farthest neighbour may be very close.
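A minimal sketch of the k-anonymity check being described (Python; the quasi-identifiers and records are invented). A release is k-anonymous when every combination of quasi-identifier values is shared by at least k records; each extra column shrinks the groups, which is why high-dimensional data fails:

    from collections import Counter

    def is_k_anonymous(records, quasi_identifiers, k):
        # Count how often each combination of quasi-identifier values occurs;
        # the release passes only if every group has at least k members.
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return all(count >= k for count in groups.values())

    # Records with generalised ('collapsed') quasi-identifiers:
    records = [
        {"age_band": "60-69", "postcode_prefix": "M1", "diagnosis": "flu"},
        {"age_band": "60-69", "postcode_prefix": "M1", "diagnosis": "asthma"},
        {"age_band": "40-49", "postcode_prefix": "M2", "diagnosis": "flu"},
    ]
    print(is_k_anonymous(records, ["age_band", "postcode_prefix"], k=2))  # False

Adding one more column (say, admission date) would split these groups further, forcing yet more collapsing to restore k-anonymity.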
Example: the Dr Foster dataset. it is aggregated, anonymised to the last 2-3 digits of the postcode, and licensed to people who can provide services, authorised orgs.
the threat model is very different here. it's not an active adversary attack; it may be an honest-but-curious person.
individual desire to give data for medical research but not have it totally identifiable
the degree of collapsing renders the use negligible.
run the risk of collapsing the big numbers along with the small ones
no magic k - it's all contextual
if you collapse too much - eg national statistics - you can answer some questions but not others
can you work out the best questions to answer with a dataset and collapse in that way?
but with open data the benefit comes from uses you can't predict - if you knew the best way it would be used you wouldn't need to open it, you could just give the data to the user!
think of the census. huge stats effort to get personal data ready for open release. worth the investment - lots of minds to aggregate appropriately.
changes in census method in the UK coming; a 10% sample release, not 5%, and some removal of fields from that
some licence conditions - but very light touch - on the sample data
reduced set of variables is published - useful as teaching/training dataset
also, record swapping, to create uncertainty & create noise
So two ways: collapse data; or remove fields(columns)
Then we look at how sensitive the data/variables are.
eg abortion data...
need to keep data useful but add perturbations. in the census there's nothing super sensitive; there's health but it's self-assessed (e.g. 'are you a carer')
lots of uncertainty and write-in answers - processing not always perfectly accurate etc, transcription and translation
can try to add records to cover expected abnormalities
record swapping and other forms of perturbation -- when you've done this you release samples of records, and the final stats database
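A hedged sketch of record swapping (Python; the swap fraction and field are illustrative, and real SDC implementations use targeted rather than purely random swaps): a fraction of records exchange one field, typically geography, so any apparent match might be a swapped record:

    import random

    def swap_records(records, field, swap_fraction=0.05, seed=None):
        # Pick random pairs of records and exchange the value of 'field'
        # between them, leaving everything else intact. An attacker can no
        # longer be sure a matched record's geography is genuine.
        rng = random.Random(seed)
        swapped = [dict(r) for r in records]
        n_pairs = int(len(records) * swap_fraction / 2)
        indices = rng.sample(range(len(records)), n_pairs * 2)
        for i, j in zip(indices[::2], indices[1::2]):
            swapped[i][field], swapped[j][field] = swapped[j][field], swapped[i][field]
        return swapped

Swapping only touches some proportion of records, so it creates uncertainty about every match without perturbing most of the data.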
use of netflix database in attacks on other things
the noise was added to netflix but not enough
knowledge of information may be imperfect anyway
-- but even with that you can use eg netflix set to identify people
two different datasets - with various data processes - knowledge is imperfect - that's different from deliberate perturbation (SDC)
these methods can weaken the results but could still leave potential for attack
we haven't seen these datasets released for study by security researchers - we haven't tested the anonymised set
to carry out SDC tests you need real data - can happen inside an organisation
external researchers different
ethics committees are tough -- so it's an internal process to audit as well as to anonymise; no real validation. no way for external experts to assist/audit?
how can you evaluate anonymisation techniques using a meta anonymised data set?
there are guidelines you can use for this depending on risks in data set
how can these methods be tested in a privacy friendly manner?
there are ways to bring in external researchers to evaluate disclosure risk
datasets reproduced after perturbation may have some gaps, some info. strongest test - you should not be able to find the person in the raw dataset and in the processed one?
but record swapping only applies to some proportion of records (?)
a practical attacker might be interested in one neighbourhood, gather data on that, and then get some set of attributes from that... depends on data how a practical attack would proceed
the info gathering based attack has been tested by UKAN
predicated on specific sort of knowledge
'level 1 response knowledge' - you know the person is in the data
attack could target a group rather than an individual
underlying models always depend on whether or not you know for sure if a specific target individual is in the data
what's the context of this data? what other data is in the world or available? this informs attack modelling
you can only simulate one attacker - not the set of several/all attackers...
in primary risk analysis you assume the attacker has exactly the data you have
what threats are considered? 'can i identify an individual and find out some specific info about them'
false positives... for some matches there's greater certainty the match is correct
if you have high priority matches, you then retrain your model and repeat
you can have a theoretical limit with regards to a specific attack model. that's estimatable accurately. but it depends on assumptions about adversary
our original question: if we have personally identifiable info, and you want to release as open data, without (much) potential for identification, what do you have to do?
(assuming it's not about data where you want to be able to identify!)
you must decide your risk appetite
is it different if you ask people 'are you willing to contribute your data'? what if you get a subset of data, just a few folks contributing? that changes the aggregation picture
it's different for every dataset!
data about abortion vs census type data vs what's your favourite colour?
Netflix - you can stop putting more data out but you can NEVER retract data.
netflix - the trouble in communicating this is: you may say, hey, my movie data, no worries, I'll share. but people never thought that it would lead to inference of sexual identity! really hard to explain the risks here
attack audits are about certainty of identification rather than probability of identification
3 ah-has:
* hard to do really good auditing because you need to bring an adversary researcher in house to test anonymisation of data (without access to original you can't really test)
* high levels of aggregation are needed on most data for them to be opened and nondisclosive
* make a licence instead of open data
Wednesday
For practical examples of public record data that contains personal data https://docs.google.com/document/d/1yVnTPbTs_u0KIQ4XEM_VxOMUe0jxsfDnCHNsXbPrMes/edit#
Wednesday exercise on public record information:
We will discuss the following examples excerpted from the above doc (the above doc has links etc in some cases):
-
- Birth dates as part of official declarations of income and assets in Poland (for some groups of citizens inc public servants & MPs & MEPs)
-
- “In Poland all the people who are requested to submit the official Declarations of income and assets need to fill in the space for year of birth. You can see the official template here (Polish only). You can find out more on who is required and whose declarations are disclosed to the public here. The public disclosure means obligation of uploading yearly the declarations on the Public Information Bulletin (BIP) website of the public administration body relevant for the given person. This doesn't mean however that the data is open, because most of the declaration is handwritten (nobody knows why, there is no obligation for it) and then scanned and uploaded as a bitmap (see, for example, MEP Adam Bielan's declaration). Only the declarations of MPs and MEPs were digitised by a NGO Association 61 on their I have the right to know project. For example: this MP website and this digitised declaration from 2009-2011 (XLS) - which includes his date of birth.”
-
- Group3:
- * depends on purpose. this is purpose = detect conflict of interest, and the info is selected based on what's needed for that purpose and no more
- * of course info once released may be used for other purposes :)
-
- Personal information about parliamentary candidates in Canada
-
- “The Parliament of Canada publishes date and place of birth, previous occupation and biographical notes about members of parliament. In jurisdictions with French content, even without an explicit gender field, it is possible to determine gender based on gender agreement of the nouns and adjectives used to describe the person. Many officials at various levels of government use a personal address and phone number as their ‘constituency address’.”
- Group 1
- Politicians and more senior public officials are seen to give up some privacy for accountability, but the degree of privacy they can expect and be entitled to despite being politicians is an open question - differs across jurisdictions
- Interesting cultural difference, where Francophone sources reveal more - but should the gender of a candidate be secret anyway?
- Does it include income declaration? This is the trend.
- Some politicians elsewhere use personal email address to avoid scrutiny and access to communication under freedom of information requests, but in places like the UK under FOI personal addresses are not exempt if used for official purposes
- In some cases friends and families hide assets for corrupt public figures, how far do we go in extending the obligation to open up?
-
- Group3
- * home addresses is a bit much - govts should provide ways to reach candidates other than this
- Full names and dates of births for property owners in the Netherlands
-
- “The Dutch cadastral data includes full name and date of birth of the property owner, next to data on the building and the buying price. My hometown makes all of that available through the city council website. As an explorer web service, not as data (bulk) download though.
-
- It needs to be possible to establish ownership of real estate clearly. There is a law on the cadastral office that specifically makes personal data in cadastral data an exception to data protection requirements. Much as with the company register.
-
- It predates reuse and bulk availability though. I am sure that as cadastral data will become more open (the ministry is e.g. demolishing current revenue models around the data, and full openness is mandated before 2015) there will be more discussion on data protection in the context of reuse. Although I suspect the onus there will be on the reuser, not at the dataholder.”
- Group 3:
- * assuming many people own their own homes, this means home addresses + names published, which is a risk for some sections of population. (discrimination in effect)
- * if the alternative is a register you can access via a lawyer, then you discriminate in access. Rich people get lots of access to info, poor people do not
- * weight against benefits - need to be able to look up who owns piece of land. does that mean we need a public register or the ability to get the info?
- Group 1:
- agree good reasons for transparency, but issues of power and who can benefit most from digitisation
- in some countries there is purpose limitation: you can access cadastral data but not for marketing, e.g.; it should not be fully open
- but how do you prevent this? technical measures? what if it gets mixed with other sources
- database concept is outdated!
- Group 7:
- The function of the cadastral database was briefly discussed. Unlike, for example, England, ownership of property is not established through an uninterrupted string of title deeds, but through registration of title deeds in a (now centralised) cadastral database. It is not the title deed that gives the legal effect of transfer of property, but its registration in the database.
- There was agreement that there are good reasons to include the name of the property owner, the inclusion of the last selling price in the register was not perceived as problematic either. The inclusion of the date of birth of the property owner does not serve any evident purpose other than as a (semi-)unique identifier in case of owners with the same names.
-
- Wealth records of public officials in Indonesia
-
- The Indonesian government publishes information about the personal wealth of over 4000 public officials as part of its anti-corruption programme.
- Group 3:
- * is it necessary and proportionate?
-
- Personal data included in bankruptcy filings, civil and criminal records in Hong Kong
-
- “Public personal data here means civil and criminal litigation records and personal bankruptcy filings. These are data of a personal nature that are freely available to the public if a member of the public goes to the courthouse and looks them up in the paper registry records. The records include name and partial Hong Kong Identity Card number. Some of these records would include other personal details such as address, date of birth, spouse name and so on. In some cases there is a very small fee to get the 'daily record' of cases in court that day.
- Group 3:
- * time limits important here. could one restrict access after a certain time. it's not about deletion but about access.
- * question about sharing and access even within police depts in UK. eg removal from national database but kept in local databases
- * history & science use when it's later vs personal data about a living person
- Group 6:
- * The means of access has a huge role to play in privacy. Even though the data may be the same data, different access methods may require different privacy controls.
- * No clear line on the right to be forgotten. Feel that generally one should be able to be forgotten by the public, but controlled access by certain bodies to history could be OK. There is no clear line as to which bodies are OK and which aren't.
-
-
- Household income and asset declarations from public officials in Georgia
-
- “Georgia has a very extensive requirement for public officials to disclose their assets and income of their whole household, including data on the names and birth dates of their household members (usually spouse and children), going into details like the number plates on the cars the household has or the banks the household has accounts with.
-
- In this regard, the public interest has been valued higher than privacy.
-
- In Georgia, dates of birth and ID numbers are very important, as there are fairly few first and last names, and there are often dozens of people who have the same name, so birthday and/or ID is an important identifier.
-
- The asset declarations are available in English.
-
- There remains some room for improvement, we have asked the agency to add ID numbers of individuals and companies so that we can verify records with other public records, and they are likely to add these details in the coming months.”
- Group 3:
- * tough around family members who may have no link with public office; but equally obvious route for corrupt officials to use family members to launder funds
- * reporting to an intermediary institution, rather than public, doesn't help as the institution could be corrupt. need external third party analysis
- * if you are opting into public life, you are choosing this, you should discuss with family and be aware of public scrutiny
- Group 6:
- * Initially couldn't see the utility of releasing this data; were concerned that people who hadn't made the active decision to work in public office were having their data released, potentially without their consent.
- * Assumed that this was to help combat corruption. Felt that 'Emergency' or specials situations like this could sometimes warrant exceptions to privacy.
- * Discussed example of an infectious disease, where release of peoples location data could help control the spread.
- Personal information in ethics disclosure systems in the US – both intentional and unintentional
-
- “In the US, the Congressional Bioguide might be of interest. We use their identifiers as a hub for a lot of our legislative data work.
-
- There are many, many ethics disclosure systems that collect and redistribute personal information from public officials as well. California's Form 700 is an example.
-
- The real devil is in the unstructured disclosure fields. We've seen this recently in the FCC's political file database, which brought already-public but previously-inconvenient data into electronic form. In this case, that included not only PII but scans of checks, the account and routing numbers from which could be used fraudulently.
-
- You do occasionally see PII in structured fields -- the USASpending.gov datasets leaking SSNs from agencies that unwisely used them as award identifiers for grant recipients is one example -- but in my experience it's the bags of text where problems really crop up. PII concerns are a strong argument for mandating structured disclosure, I think.”
- Group 3:
- * processes of related stuff eg agencies should be reviewed!
-
-
- UK company directors’ dates of birth
-
- “Company directors in the UK has been the classic example of personal information published as part of the public record. My date of birth can be found on public websites, because I'm a company director.”
-
-
- Personal information from UK government employees, GPs, planning applicants
-
- “UK government employees’ names and salary bands are published as part of the UK government’s organogram.
-
- The National Health Service publishes the full names of all GPs, and the dates that they joined practice.
-
- Planning applications also contain the name and address of home owners, and may result in the publication of correspondence including email addresses of the applicant and/or their agent.
-
- Group 3:
- * they should black out email addresses in correspondence
- *
- Group 4:
- Hong Kong
- “Public personal data here means civil and criminal litigation records and personal bankruptcy filings."
- ECJ case - privacy matters over time - relevant journalistic practice in 1997 may not be relevant in 2014.
- different expectation of privacy 10 years after it happened than 2 days after.
- Building in controls over subsequent digitisation. Some kind of appeal process, based on making case for legitimate interest.
- Could define legitimate interest negatively - e.g. you won't do this or that.
- In this case, even with a use restriction, this record shouldn't be made available openly.
- It's important to find a way to communicate other restrictions that may apply even if there is an open license.
- Open data movement needs to communicate data protection restrictions within the context of their main aims (i.e. transparency, empowerment, etc.), and not as a barrier.
- In the context of Sweden, Data protection and privacy being used as an excuse only becomes an issue where public officials are trying to obscure themselves. E.g. they might not say which people were at which meetings, but they choose how this is disclosed. Other than this, you don't have a means of protesting or finding out.
- In Sweden it's been difficult to release data that isn't personal, but the government is selling it. Whereas accessing personal data is relatively easy (in Sweden).
- Maybe a starred warning system to accompany datasets that might invoke DP principles just by downloading them.
- Developing appropriate redress mechanisms, e.g. after releasing data you shouldn't have? So Open community need to engage with this.
- Different levels of re-use? e.g. car owner registry.
- To some degree you need someone else to take on that responsibility; it's too much burden on individuals.
- If you have a question you can ask not just the ICO but some other person (maybe an ombudsman?).
- Very often persons can't protect their data (because they have to give their data to government).
- In the context of aid, there's lots of data out there but no one uses it. Supply-driven, but most people aren't trying to solve development problems with data.
- The word 'privacy' or 'data protection' is used as an excuse. So how to get around that without devaluing privacy?
Notes on Break-out group on Baking Privacy into Funding data4development projects.
Alex, Emma, Malavika ,Chris, Sanne
Some issues we could talk about:
1. Life cycle of development projects when collection, when storage, when etc. Willow just finished a lifecycle as an output of the RDF Oakland – built on that.
2. What is specific about development context? Are there particular things in development that make it different and how do we address those.
- one way to approach that is try to think of what the prototypical challenges might be – stand out more in development.
3. Impact assessment: when you are rolling out a development project, are you thinking about what the impact might be, in the sense of unintended consequences? What are the artefacts and legacies you leave behind that could be repurposed?
4. Are there baselines of any universal conditions and attributes that we can use across every context. Are there red lines? Certain kind of data that we never should collect? Or is it all context dependent?
5. Harm stories and failures; how it went spectacularly wrong?
6. Are there particular technologies or programs that have a higher risk in terms of prioritizing funding.
How do you make sure that is not just a box that you tick off – because often privacy by design ends up a box-ticking thing.
Where would we want funders to be? What we want funders to do? As a matter of policy.
- criteria based policy: projects need to do x-y-z
- carrots; highlighting projects for inclusion of privacy
- separate funds – support to inject privacy/capacity
Mechanisms that could come in, as appropriate, for grantees to work with.
Capacity and familiarity of funders. The Engine Room just interviewed funders. Nobody feels that these questions are raised enough. Broader questions on ethics and privacy are not addressed.
They don’t think of them as data projects, but as a water project etc. But: everything is a data project.
The most progressive donors have the most ‘hands-off’ approach. This is a tension – but you are always being paternalistic in some sense – also if you say leave it up to the people – and they decide everything is fine.
Developing countries do not have the legislation, but what do we do in the UK in order to fulfil the Data Protection Act, and could we bring these practices to funders?
1. The IT system goes through an audit by a security auditor.
2. Data-processing agreements with suppliers that match those of the data centres.
When is conditionality most appropriate, and when is a standing offer appropriate? If donors have a standing fund for responsible data support, then there is money for an audit as well.
Can we make interventions – 'hey, this is a good project, e.g. a safe platform' – and take what we have learned?
General audit of the most popular tools – risks or concerns (FrontlineSMS, Magpi or ODK). 'Here are some questions you should think about before you go further' needs to be integrated into a platform, alongside the functionalities you want.
If donors had a list of the 35 most popular tools with their responsible data risks and concerns – a cheat sheet for popular tools.
Due diligence on the infrastructure. This creates an incentive, and it doesn't require institutions to change policy, but programme officers (POs) can use it.
Open Integrity Index: iilab
Rebecca MacKinnon's Ranking Digital Rights – a mini offshoot could work in the development sector. Tools change so quickly that it needs to be kept up to date. We are interested in granularity. There is potential for something like that, but it requires researchers.
A matrix for Oxfam Novib on tools, plus the MAVC mapping of who uses which tools most. So within a few months we would have the most relevant tools – it might have greater uptake if it says something about functionalities as well.
A community willing and able to review such a tool exists. We could try to get some research funded for the Open Integrity Index to make a draft on the basis of our tools, and could follow up with the group of donors we interviewed.
Best would be an individual consultant for every project to help think it through – not feasible. But a set of questions and ideas would be good, and making sure all grantees/new grantees have a training.
Very few people do training on these issues; it is ad hoc, with no coordination or sharing of methodologies. Only three donors have a module. Huge room for improvement and coordination.
Make the 'training' available. Classic example: The Engine Room, as an OSF grantee, needed help managing our governance policy, and OSF helped us set up assistance with the audit department. Offer the expertise.
Funders themselves don't think about these issues – they don't know what questions to ask. They are required by their policy to send
At the moment privacy risks exist; we need to make people aware of misuse of data. Are there other ways than trainings to do something about it? Is there anything that can be done in the funding cycle to mitigate these threats?
It needs to be genuinely helpful. It's better than nothing to have a baseline that people have to consider, if you at least get people to tick the boxes.
You can't force people to see the relevance of privacy all of a sudden – you can only make people check the boxes. Funders want to be hands-off for reasons of respect and ownership, but sometimes it's a cop-out to avoid spending too much energy on certain projects.
Focus on having tools for funders, for POs who are motivated to use them.
We should have these conversations with other funders around the table and ask them.
3 points:
Awareness: everything is a data project.
Incentives: there are competing and perverse incentives in funder relationships – the box-checking exercise. The most immediate activity is to provide POs with resources: a cheat sheet with the risks and dangers of certain tools, so they can get into a conversation with their grantees.
Sharing of information across organisations, and getting funders together in a safe environment: Funderdome.
Tuesday Morning Group - Open Data and Research Data
Mark Taylor
Andy Turner
Helen Wallace
Amelia Andersdotter
At the moment too often the availability of data depends on who has the economic power to leverage their own particular interests.
Open data as double-edged sword:
a. There is a problem that the “open data” research agenda can be a way of labeling activity, and establishing particular governance, that is in corporate interests rather than public interest. Can reduce individual/ public control.
b. At the same time, “open data” can be a way of leveraging public benefit from commercial research activity. It can be a condition of commercial research access that the results are made public.
Given that open data can be a double-edged sword, how do we ensure that any mandate to open data leverages public rather than private interests?
1. Legislative reform?
There are a number of existing research exemptions – routes through to lawful processing. E.g. rules on keeping data – on the purposes for which data can be used – for subject access and requests to be forgotten…
Do “research exemptions” appropriately recognize different kinds of research, and research of different public – as opposed to commercial – utility?
SUGGESTION: Need to get the public and researchers involved in assessing the appropriateness of allowing "research" to do things that would not otherwise be permissible (?), ensuring the governance operates 'in the public interest'.
2. Make the political aspect of open access to research data more transparent.
SUGGESTION 2: Need to get rationing decisions – e.g. decisions by funders – more transparent and accountable. Work on who is making what decisions, and what conditions they are applying. Ensure that decisions are made in the public and not corporate interest.
This work should inform understanding of improved best practice on ethical training, ethical review and ethical audit of research practices.
Andy Turner's Notes
- Workshop on Personal Data, Open Data and Privacy June 2014
https://docs.google.com/document/d/1_FLykBOzA3-ZhFth2VeexHR9N2ev3N_cJEM2Ouk0zsE/edit
Steve, Sally and Antonio
Dystopian views about what happens when individuals own and control their own personal data.
Future scenarios where individuals controlling their own data can have catastrophic effects – for example, someone accessing and releasing health data about personal challenges such as cancer, and insurance or private health care companies later using this information to determine the cost of access to their services.
1. Character 1 has control and can no longer manage it.
2. Character 2 has the right to give away control, and this has consequences for who accesses the data in the future.
3. Someone's record is accessed and controlled by others, and they are profiled based on this open data; lack of control when there is a problem with wrong processing of the data.
1. My niece will be born within 5 years. According to the registries of the health system she will have a high probability of suffering a heart attack before her 40s. Other institutions with access to the DNA registries of her and her family discover that she will have a very creative mind. Segmenting this and analysing other social aspects of her parents, the home office discovers a tendency to be a social activist, who must be investigated closely. The education system recommends a specific curriculum for her according to inherited aptitudes, with more focus on human studies than on science. Her parents would prefer her to study maths, but the authorities say that would carry an extra cost. So in school and at university she meets some people very close to her way of thinking, around social change. This behaviour is detected by the police through her social network, supported by the analysis made years earlier by the home office. At some point she detects that the police are following her, and she becomes somewhat aggressive in public demonstrations against the government. One morning she suffers a heart attack when she finds out she is being followed by a stranger.
2. Jay is a young man at a university somewhere in Europe. Due to recent legislation which gives him full access to, ownership of and control over his data, plus the emerging discourse around how much personal data management is worth economically, Jay decides to offer all of his data for sale to the highest bidder. This includes all his personal (social media) conversations via text and email, as well as all his health, education, financial, geo, retail and legal records etc. In addition, he signs up to offer his full records perpetually. He obtains a sizeable amount of money and is quite pleased with this arrangement. Fast-forward a few years and Jay is out of school and has just been considered for a new position in another country. Unfortunately for him, the new organisation was able to access all his records for a small fee. In them, they found out that Jay had falsified some of his school records, had previously harboured a fugitive in his university hostel and had defaulted on his student loan payments. Additionally, he was unable to obtain any reasonable insurance because he was perceived as high risk. Needless to say, he did not get the job, or any others.
3. Susan is very protective of her data. So concerned was she about privacy that she manages it all herself. She has total control over every single piece of information collected about her and is responsible for storing and managing it, dispensing it only when necessary. Over time this becomes a huge burden, and she has to increase the capacity of her storage devices to contain all the data. Additionally, she has to take a Masters degree in Computer Science (Information Management) in order to make sense of the ever-increasing data, spending quite a lot of money in the process. Fast-forward a decade and Susan is diagnosed with a terminal disease. All her property (including the data storage) passes to her nephew, who knows nothing about data management and is therefore unable to provide doctors with her medical history, which would have enabled them to find treatment for her condition. Because of this, she dies without a solution being found.
How to communicate or enhance education on Open data and Privacy
Communication on privacy as a concept, and the considerations around privacy in open data systems, needs to happen in a way that engages people, especially when the audience is outside the community. The messages need to be simple and clear, with no jargon, but without losing the key message through over-simplifying. Consider the different audiences (for example, when to use blog posts, and when to use participatory forums such as this); it is not one size fits all. These meetings should be brokered by organisations perceived as neutral. A common understanding of the terms being used is also necessary to enhance clarity.
When there are concerns about data misuse, independent organisations (Open Knowledge?) need to provide the platform to discuss them, not organisations which may have a vested interest. A neutral and open platform should bring a variety of viewpoints together, without letting a few dominate or polarise the discussion.
Case studies and more evidence needed to back advocacy activities. Convince people immersed in data management that there are humans in the datasets.
Engagement plus evidence is therefore likely to be more successful: more public discussion is needed, and the tools needed to communicate in this space are different – for example, videos might not be appropriate, as people need to feel safe.
At the end of day 1, participants called for clarity on the open data brand and the terms; they wanted to go through the open datasets mandated by law even though they contain some personal data, and to see the privacy concerns of each.
They also wanted to explore the other side of the conversation around the obligations placed on people if they had access and control over their own data.
Understanding of Open Data:
It's a new name for an old concept; structured; transparency; no challenge about how it will be used; access for free; interactive, participation, engagement; there is a gap in the definition – is personal data ever released without controls?; data that has a license and allows all types of use; data that is open to all and in a form that uses open standards; freedom; minimum restrictions on the reuse of data; restrictions on open data may be valid but what are they?; reusing government data – but does it get reused?; provenance (open history of use); it is more than government data; vulnerable (in terms of tech/platforms as well as people); are we creating data orphanages (e.g. data.gov)?; facts; confusion with regards to the intersection of open data and big data; online; license; public access to data; democratic knowledge for many, not few; need for suitable anonymization; enabling smaller organisations to benefit (societal value); rebalancing of power; another buzzword; raw material to be explored; available; easily accessed online; about things; open compute; provenance (open history); bad!; government spending, e.g. how much of our taxes are spent on what; data that is beneficial to the public; data that is reusable by everyone; data that everyone has access to; data that is under an open licence; five stars for degree of openness; interpretation has not been done; free to reuse; machine-readable; moving towards making more public data open data; legally free; some data is easy to make open; five-star rating for accessibility, making the most utility of taxpayers' money; look at the data from different angles; making public utilities easier to use; open data gives transparency; difficult to get open data for services; free for anyone, anywhere, for any purpose; holding organisations to account
Understanding of Privacy:
community, health and wellbeing; political, power, relationships, etc; social condition, not a technical issue; boundaries in the relationship with others; information about me that I choose not to share; protection; you have a level of control over who gets access to information and how it is used; ineffective (as a call to action to get things to change); should only be used when necessary and proportionate, in line with human rights law; control and self-determination over your life; privacy depends on capacity; private sphere – about your personal life; more salient/necessary at difficult times; freedom (to be anybody); diversity of personal spaces, interests; protection from trespass/invasion; variety of laws with regards to privacy; intimacy; enables/supports other choices, values and behaviours; open data + privacy are not incompatible (if done well!); who gets access for which purposes?; personal space (closing the door – a zone to be private); protection; I don't want intruders; expectations/preferences; who judges privacy: social norms; social convention; privacy is 'home' in the digital space; states of separation; cultural evolution; political rights; control over access; anonymity, secrecy, autonomy; secret; control; creating your own personal space even in public contexts; privacy within open data is a sliding scale; balancing between freedom and freedom; privacy is for the weak, transparency for the strong; ability to control your different identities in different contexts; 'Entfaltung der Persönlichkeit' (unfolding of the personality; German constitution); protection of the vulnerable; Scott McNealy; personal; naked; confidential; autonomy; putting a boundary around a context; selective disclosure; privacy on a society level; fundamental human right; privacy as a local and social matter (the hardest issues revolve around family and friends, not governments and corporations!)
We (as a group) should:
continue to involve more people in discussions about society, data, online, ethics, power, etc.; improve understanding of legal possibilities regarding open personal data; extend CKAN to allow people to attach 'privacy implications' info to datasets (see the sketch after this list); share our projects, insights etc. more often; we should develop an easily understandable intro to the issues (+ risks/arguments); we should help develop the workshop reader into a good resource; decide if open data means open to everyone – or does it include data open to some people for some purposes; create a 'debunking myths' primer/doc – a light-touch intro to scope what open data is/isn't; we should continue to engage with both privacy + open data communities about the issues surrounding openness as against privacy; good practice platforms for informational self-determination; we should agree that personal data should not be open data; we should do book sprints on anonymisation or privacy by design; we should define standards for provenance/consent in metadata; we should keep in touch as a group and work on areas of common interest (it has been great getting such a diverse group together); recognise that data is inseparable from compute (need to consider openness of process and usage as well as of data)
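One suggestion above – extending CKAN so people can attach 'privacy implications' info to datasets – could be prototyped without changing CKAN at all, since the CKAN Action API already allows arbitrary key/value 'extras' on a dataset. A minimal sketch in Python, assuming a hypothetical CKAN instance and API key; the field names privacy_implications and privacy_risk_stars are our own invention (the latter picking up the 'starred warning system' idea above), not part of any CKAN schema:

import json
import urllib.request

# Hypothetical CKAN instance and API key -- replace with real values.
CKAN_URL = "https://ckan.example.org"
API_KEY = "your-api-key"

def add_privacy_note(dataset_id, note, risk_stars):
    """Attach 'privacy implications' metadata to a CKAN dataset.

    Uses the standard CKAN Action API call package_patch. Note that
    supplying 'extras' replaces the existing extras list, so a real
    deployment should merge with whatever extras are already there.
    """
    payload = {
        "id": dataset_id,
        "extras": [
            {"key": "privacy_implications", "value": note},
            # the 'starred warning' idea: 0 = no concerns, 5 = high risk
            {"key": "privacy_risk_stars", "value": str(risk_stars)},
        ],
    }
    req = urllib.request.Request(
        CKAN_URL + "/api/3/action/package_patch",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": API_KEY},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (hypothetical dataset name):
# add_privacy_note("parking-tickets-2014",
#                  "Contains street-level locations; re-identification "
#                  "possible when linked with other public registers.", 3)

A fuller version would live in a CKAN extension with its own schema and UI, but even plain dataset extras like these would let a portal display a warning before download.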
I (individual) will:
Offer up Horizon prototype apps and services (like the marathon example) as case studies for 'privacy and openess' inspections, (and can even provide funding for it!)- SB
Take the lessons learned from this workshop- issues & process- and incorporate them into our phd training programme. I welcome the involvement from external partners e.g.through distinguished lectures.-SB
Advocate for responsible data within Hivos; also input these insights into future projects; find and share guidelines for data use-SS
I will raise the question of purpose specification for re-use of personal data with the Working Group; advocate for open register of government data sharing activities- RB
Commission several follow-up papers + tools (e.g. Open data Licensing)-JR
Co-run a session on privacy at OK Festival- JR
Javier will co-write the chapter on Privacy for the Open Gov Guide/OGP (with PI); help create a list of resources; help build some prescriptive principles/guide for open data privacy- JR
Pilot mydata in traffic systems in Finland; publish mydata policy recommendations (in Finland)- OK
Think through terminology, framing messaging and narratives regarding open data; upgrade my tech and data skills competency- MJ
feed notes/outcomes into recode (EU FP7 Prospect) policy meeting in June 2014-MT
Include learnings from the workshop in briefings on use of health data + genomes (info, contacts, ideas)- HW
Find all the resources/tools referenced at the workshop and share them (where possible); write a blog post synthesising insights from this workshop; help to develop and test the checklist for open data publishing-SD
Define the limits of data control and consent in Open Data
Attend the open data lobby event in Sweden, which I initially dismissed!
Assist with the development of the checklist for the release of open data/coordinate development of the publishing checklist- CW
Get the support of the Anonymisation Network
Help Test out the RDF 'how to' open data checklist-EP
Contribute to a checklist for open data releases; gain greater understanding around EU directives on data including amendments;
Write a blog post/article on 'everything is a data project'-MJ
Write a blog post on the false dichotomy between open + personal data-WvH
Make use of the WG and UKAn input on upcoming changes to offence, location, anonymisation on data.police.uk-AE
I (individual) would like to:
Stay connected to these discussions and contribute to the RDF checklist for data publishing-AC
Try and establish more local spaces for open data, privacy, power over info debates
Think and write about Open Data & privacy specifically in a development or a developing country context -MJ
Help with a more nuanced debate that doesn't class openness and privacy as two competing and incompatible values-MJ
Scan all the Open Knowledge groups to see where privacy issues have arisen-RB
Discuss further implications of Open Data for Data protection and health data-
Look into linkages between open data + big data, and the common issues between them-CW
Have conversations with ODI leads about the inclusion of 'privacy considerations' and 'destroy by' dates in ODI guidance and certification- AE
See the expertise of the WG used in actual work
Write an article about mydata with Brian Arthur and Shoshana Zuboff; set up an Open Data and privacy workshop in Finland; finalize mydata proposal to the ministry of communication, Finland; set up mydata development community- OK
Get a much more quantified perspective on the benefits + risks of open data
Work with others to develop a research project to look at public attitudes to uses of data in research-especially role of commercial interests-HW
More engagement of this group with the WG; work on producing case studies that evidence the benefits + risks of open data, and also of personal data control by individuals-SD
See a very accessible intro to anonymisation & pseudonymization (and can help)-JR
See a code book of design patterns/code examples/recipes for PbD (a first recipe is sketched just below)-WvH
To have an operational system that offers fine-grained role-based access control, which keeps metrics about the usage (linking) of data for particular purposes, and which opens up data for research uses that are ethically reviewed and considered to be for the common good (see the second sketch below)-AT
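As a candidate first entry for the PbD recipe book wished for above, here is a minimal sketch of record-level pseudonymisation in Python: direct identifiers are replaced with a random index that carries no meaning, and the index-to-identity lookup table is written to a separate file, to be held under stricter controls or destroyed if re-linking will never be needed. Field names are illustrative, and the output is pseudonymous, not anonymous – quasi-identifiers such as dates of birth or postcodes remain and need separate treatment (generalisation, suppression).

import csv
import secrets

def pseudonymise(in_path, out_path, key_path,
                 id_fields=("name", "patient_number")):
    """Record-level pseudonymisation of a CSV file.

    Direct identifiers (id_fields) are replaced by a random index;
    the index-to-identifier table goes to key_path and must be stored
    separately -- or deleted if re-linking is never required.
    """
    with open(in_path, newline="") as f_in, \
         open(out_path, "w", newline="") as f_out, \
         open(key_path, "w", newline="") as f_key:
        reader = csv.DictReader(f_in)
        out_fields = ["index"] + [c for c in reader.fieldnames
                                  if c not in id_fields]
        writer = csv.DictWriter(f_out, fieldnames=out_fields)
        key_writer = csv.DictWriter(f_key,
                                    fieldnames=["index", *id_fields])
        writer.writeheader()
        key_writer.writeheader()
        for row in reader:
            index = secrets.token_hex(8)  # random, carries no meaning
            key_writer.writerow(
                {"index": index, **{k: row[k] for k in id_fields}})
            writer.writerow(
                {"index": index, **{k: v for k, v in row.items()
                                    if k not in id_fields}})

# Example (hypothetical files):
# pseudonymise("patients.csv", "patients_pseud.csv", "lookup_KEEP_SAFE.csv")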
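And for the role-based access control system with usage metrics described in the last item, a very rough sketch of the core decision-and-logging step – the roles, purposes and data classes are all hypothetical, and a real system would enforce this in front of the data store and feed the log back into ethical review:

from datetime import datetime, timezone

# Illustrative policy: (role, data class, declared purpose) -> allowed?
POLICY = {
    ("researcher", "pseudonymised_health", "approved_study"): True,
    ("researcher", "identified_health", "approved_study"): False,
    ("clinician", "identified_health", "direct_care"): True,
}

usage_log = []  # metrics on who accessed/linked what, for what purpose

def request_access(user, role, data_class, purpose):
    """Grant access only if the (role, data class, purpose) triple is
    allowed, and log every decision so that usage -- including linking
    of datasets -- can be audited and reported."""
    allowed = POLICY.get((role, data_class, purpose), False)
    usage_log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "data_class": data_class,
        "purpose": purpose,
        "granted": allowed,
    })
    return allowed

# Example:
# request_access("kim", "researcher", "pseudonymised_health",
#                "approved_study")   # -> True, and one log entry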
Issues in public data being opened though containing personal info
Transparency to discover conflicts of interest; official public biographies for MP candidates; what is the purpose – is all the included data necessary for the intended purpose of release?; many public filings would become digital – faster access; what is the alternative? sometimes it is pay-for-access – discriminatory?; by making data open data, there is no expiry date; willingness to be transparent is a requirement for public officials
Needs:
Agree on an understanding of the terminologies; collate the resources that are available; have a toolkit or glossary of terms and use cases; investigate what should be done with research data