
Open Data, Personal Data and Privacy workshop



Welcome to the notes page of the Open Data, Personal Data and Privacy workshop, June 2014


All thoughts welcome!


1. The Chatham House Rule will apply to all discussions taking place at the meeting!



Participants
Fiona Nielsen
Helena Nielsen
Anna Crowe
Malavika Jayaram
John Gibson
Andy Turner
Mark Elliot
Reuben Binns
Dr Helen Wallace
Carl Wiper
Ossi Kuitinnen
Anssi Mikola
Caspar Bowden
Christopher Wilson
Steve Benford
Antonio Munoz
Sanne Stevens
Dr Kieron O'Hara
Dr Mark Taylor
Phil Booth
Mark Lizar
Walter van Holst
William Heath
Kumar Sharad
Alex Edwards
Keith Spicer
Simon Burall
Jacqui Taylor
Peter Jones
John Harrison
Emma Prest
Javier Ruiz
Laura James
Dirk Slater
Amelia Andersdotter 
Sally Deffor


------------------------------------------------------------------------

Agenda Hacking: Issue Identification & ID of key assets that exist and examination of what's needed:



Tuesday Afternoon session - Caspar + others on anonymisation


Caspar happy to explain any of the above! 

Questions / topic ideas / comments:
    * impact for UK if regulations enacted?
    * practical understanding of current situation and implications for those who are releasing pseud or anon data, or who are attempting to anonymise data
    * what about permissions to access the index - eg you can get data with a court order if you are police, but it's not accessible for most people.
    * what about practical measures to make it almost impossible for an ordinary citizen to get access to the data?
    * regarding the definition of pseudonymisation as a category, a strange midpoint between anon and identifiable. the process of pseudonymisation depends heavily on the measures applied
    * would like to approach the anonymity concept in a relative way - identifiable for you but not for the recipient. get away from these absolutes. there's no room for technical or contractual measures for access to the index here
    * international transfer of personal data... many implications beyond consent
    * to what extent does the scottish ICO-equivalent fit with UK vs EU?
    * is pseud a meaningful and useful category, even if we understand anonymisation isn't binary? better to stick with the scale of "fiction" we have today with the anonymisation scale than to introduce this extra category in the middle?? Perhaps the role of pseud is about data CONTEXT rather than CONTENT? useful to draw attention to context maybe?
    
Remarks back:
    absolute or relative concept of identifiability is central
    relative concept: if you have a data controller, and this DC in theory is the only person able to re-identify some data via the index, then around 2007 there was another opinion on the concept of personal data.  this was 'if I'm the DC and the only person who can identify, then it's not personal data as long as you take as much security as you can to ensure the index doesn't escape'.   but now it's gone a lot further, to a concept of absolute identifiability.  [this is the WP216 opinion]
    
    'likely reasonably' / 'reasonably likely' - new regs say 'reasonably likely to be...'
    
    if everything, every cookie, is identifiable, how do you cope with this in internet context?  in the past this was ignored.  Now, it's 'reasonably likely' or it's personal data.  BUT special exemptions may apply. 
    
    eg in the new regs wording as of last oct, for research purposes: try to use anonymised data; if that wasn't possible you could use pseud; or if there was important public interest you might get a special exemption to use personal data. 

Research data - the research community needs to be able to access the index to do the research.  
    
    new WP216 is adamant about index deletion but this defeats the purpose of the research protocol for, say, cancer registries or long-term medical studies (longitudinal).
    
    when you talk about reasonably likely identifiable, you mean 2 things: (1) reasonable via data analysis, to get a probabilistic chance of identifying, or (2) where the index is removed and only accessible via, say, a court order.
    
    new regs are problematic because:
    
    they say pseud is still personal BUT have taken away all your rights as a data subject
    
    if there's a breach of pseud, you notify the data controller, not the data subject.
    also nullified: the right to access pseud data, which is also the right of the data subject to correct the data.
    
    without the index, you can't CONTACT the data subject to give the subject their rights! 

So it's all really confused. either you know who the subject is but they have no rights, or the subject isn't identifiable and so can't have their rights. 

so the 3 pillars of the EC are drifting towards this idea that pseud data is personal, but has lost the rights associated with it.  It's a way of covering legal embarrassment. 

are there times it makes sense to use pseudonymisation?
Pseud data only makes sense if there's prior knowledge of one person

it's a reasonable internal security provision, eg internal accounts in a system, for audit purposes. but it doesn't make sense as a control or risk limiter for release to a third party. 
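A minimal sketch of the 'index' idea, assuming the data controller keeps a private mapping from real identifiers to random pseudonyms (all names and fields here are made up):

```python
import secrets

# The "index": a private mapping from real identifiers to random pseudonyms.
# Whoever holds only the pseudonymised records cannot re-identify anyone;
# whoever holds the index can (e.g. for internal audit, or to contact a
# data subject about their rights).
index = {}

def pseudonymise(record):
    real_id = record.pop("name")
    if real_id not in index:
        # Random tokens: unlike a bare hash, they cannot be recomputed
        # from the identifier alone.
        index[real_id] = secrets.token_hex(8)
    record["pid"] = index[real_id]
    return record

rows = [{"name": "Alice Smith", "visit": "2014-03-01", "dept": "oncology"}]
released = [pseudonymise(dict(r)) for r in rows]
# Deleting `index` makes re-identification impossible -- but, as noted above,
# it also destroys the link that longitudinal studies depend on.
```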

but  actual release is a whole other thing. 

Policy makers have been conned into thinking pseud techniques are a 'new tech' which solves the problems and makes it safe to release data. it's been brewing in the UK for 10y or so. Wellcome Trust etc. 
Now agenda of current uk govt - 'if you pseudonymise it, no problem' 

ICO knew about this stuff ages ago...   [Caspar admits he's deeply involved so isn't objective!]

Mark Elliot running a mailing list and a new project spun off from the ICO code, a body of knowledge about pseudonymisation. Kieron proposed a larger scientific research centre of expertise. Govt hasn't chosen to do this substantively

ICO in tough spot now


what are the practical risks of potential future problems?

* data controller database compromised, index gets into public domain
* future tech is better than current and allows reidentification

is pseud still better than releasing personal data?
'balance' term suggests there is a spectrum with 2 ends and you pick a point. but in fact it's complex and there's no magic sweet spot
it is a good excuse  - path of least resistance to argue there's a nice balance point. 

other risks:
    * social steganography
    * ...
    * eg data collected...  countries with civil war, factions may use a dataset to identify a social structure around a data point
    * shown time and again people can be identified with some success
    
    
    two notions of risk - a social notion of risk and a security/tech notion of risk (where, if there's a vulnerability, risk is 100%)
    
    at what stage has one 'done enough' ?  may revise judgement in future of course
    
    not just that you can do an attack but it's plausible. 
    is a question: does a motivated intruder exist?
        of course that will depend on the 'target' individual - some more at risk than others
        
        level of diligence of a private sector DC to a public sector DC may differ?

approaching the uncertainty

what is data? what is personal?    bad situation!
very few people understand this. 
need to go deeper in discussion
human rights issue - treaty of rome, what UN is doing. Wider discussion of humanity 


internet architecture is destiny

tech & policy - 2 worlds -- need more overlap
need global discussion

3 ah-ha moments:
    * legal and tech world intersection
    * how the discrepancies arise between UK, EC and other positions - how it all works out - legislation in flux and different in practice. hadn't realised 4 words made so much difference and that this means different countries have different setups
    * pseudonymisation seen as a comfort blanket by policy makers, removing the need to think about the complex landscape of anonymisation 
    
WP216 document - very radical!   mostly because DP officials often not clueful technically (30 out of 1500 with any CS skills??)

lobbying from US companies trying to kill it. commission trying to set up carrot and stick.  simplifying stuff is carrot, but stick is all-embracing personal data, with exemptions to make it acceptable. 

what are other implications?
* for data subject as above
* for DC in terms of index removal as above
* "the DC should not have to collect any more data to give effect to regulation" - what does that mean?     sounds harmless but is fatal!   eg your google clickstream when not logged in. Indexed by cookie and IP address. you go ask google and you provide your cookie.   If google get that request, they can't guarantee that cookie is yours, and so they cannot give out your data because there's extra info that is necessary....  unless you can strongly prove the data is yours. Nulifies right to access or correct or delete your data  !    they should have said "you should offer the data subject a strong data authentication secret' (just for this purpose)

if you could access your pseud data, that might be useful for you, but it depends on the DC being willing to give it to you...  Article 10 removes that. 

implications for private data sharing and third parties?  none in particular

hashes as indexes
a hash is a fixed-size number derived from some data - a 'mashup' of it
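To illustrate (a sketch, with a made-up identifier and key): a bare hash of an identifier looks anonymous but is brute-forceable when the identifier space is small, whereas a keyed hash can only be recomputed by whoever holds the key - so the key plays the role of the index:

```python
import hashlib
import hmac

nhs_number = "943 476 5919"  # made-up, example-format identifier

# Naive pseudonym: anyone can enumerate all plausible identifiers,
# hash each one, and reverse the "anonymisation".
naive_pid = hashlib.sha256(nhs_number.encode()).hexdigest()

# Keyed pseudonym (HMAC): recomputing or linking pseudonyms requires the
# secret key, which effectively becomes the index discussed above.
key = b"secret-held-by-the-data-controller"
keyed_pid = hmac.new(key, nhs_number.encode(), hashlib.sha256).hexdigest()
```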

if you care about this please beat up the ministry of justice
also lobby your MEPs

cabinet office should have set up a scientific centre of excellence on this -- they didn't, so we have a community group, better than nothing. but they are data consumers, not privacy people. Funded by ICO, who stepped in with money when no other money was there; they didn't have to, but at least they tried and no one else did


crash course in differential privacy:

in CS now: can you do distributed Diff Priv? can you do diff priv on data streams?
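For reference, the usual starting point of such a crash course is the Laplace mechanism for a counting query - a minimal sketch (not from the session; values are illustrative):

```python
import numpy as np

def dp_count(true_count, epsilon):
    # A counting query has sensitivity 1: adding or removing one person
    # changes the answer by at most 1. Adding Laplace noise with scale
    # 1/epsilon therefore gives epsilon-differential privacy for this query.
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon = stronger privacy guarantee = noisier answer.
print(dp_count(true_count=1234, epsilon=0.1))
```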

---------------------------------------------------------------------------



Tuesday afternoon - Project design discussion group

Topics for discussion/outputs 
- issues around personal data - looking to recruit students to engage with things
- open data evangelists vs privacy fascists - middle ground is hard to design around - the two communities don't talk to each other
- how do we 'bake in' privacy and security into how projects are funded, designed - and giving those in the field the capacity to implement
- let's avoid "privacy by design" as terminology - it can be implemented as a checkbox exercise, without enough thought about how it applies to the work
- open data and development - how to build in data ethics?

System design
- can have naive system designs (e.g. a broker for health insurance, where any company can request information - you can do that differently)
- not enough awareness of potential solutions
- can work with large commercial partners to probe the issues around personal data and privacy - employ multidisciplinary approaches - internet of things - interested to produce prototypes and field them in reasonable numbers and then do ethnographies and then policy work. Finding out how a smart object is used in the home.
Life cycles - frame for working through the issues
- how to map the lifecycle of a project to know at what points you can intervene to change things
- issues for development sector that are different from the private/government sector - are there things others do well/they do well
- analytics around impacts of programme - how to assess impact on privacy and security
- baselines - independent criteria; e.g. certain kinds of data you don't collect without consent - are there universalities?
- looking at harm stories and failures
- specific technologies that are higher risk
- do we want to think of a vertical approach (where each sector has its own rules) or are there general rules

E.g. baselines for when you work with mobile data, geolocation data.

Types of technologies
- most ebook readers have in their terms and conditions that they can collect your reading data for the purpose of improving the service. Looks at how quickly you read, what pages you read. The data is held by a cloud provider - your favourite passages of e.g. the Quran are held. 
- notion of big data - working backwards from original statistical method. We're collecting too much data.

First issue - collect only what you need. 
- This is a general principle of data protection.
- blind faith by people collecting the data in anonymisation/scrubbing of databases - just collect it all
- Idea of anonymisation - can just fix everything and worry about it later.
- they feel there are technical tools to fix everything post facto. And they now can collect everything.

Perhaps we should think - worst case scenario

There is a question of who makes these decisions when you design a project

At the design fiction end, there is a role for digital artists in provoking debates - e.g. Blast Theory - illustrating the dystopia

These methods line up at different parts of the lifecycle

At the level of the individual system, need to help people understand they can achieve things with less data - they can do with less.

Summary of the discussion:

Take-aways




Tuesday afternoon - Consent 

PAX: 

Open data definition - about things like bus timetables, but not about people. There's no trouble with impersonal data. But trouble when it comes to individuals, shopping habits, health, etc.
Both gov and private sector.
OECD roundtable paper - four classifications, looking at the data origin: provided, observed (publicly visible), 'derived' (result of data mining), and inferred (ref forthcoming). There's a difference between surreptitiously observed vs consent-based observation.

Data is 'personal' when it relates to a person, but also depends on what other data it could be combined with to identify someone. Depends on the context.
The main paths to open data release are: 1) anonymisation, 2) legislation requiring disclosure, or 3) consent. The latter is hard when you don't know the purpose of the open data use in advance.
E.g. health records: there's been a debate about the release of health records through HSCIC - 'lightly anonymised'. Data collected by GPs, then uploaded to a central database where anonymised, then released. You can opt out by writing a paper letter - the GP puts a flag on your record.
Another kind of consent would be where you are given lots of paper certificates (e.g. qualifications, driving licence). Then you decide to give a third party sight of the data. Here the individual has control over the disclosure.
Difficulty in working out what counts as 'incompatible' with the original purpose of the data collection.
There's a missing piece in working this out.
There are different issues with licensing and data protection - but the two are applied in unison. Conceptually different.
One of the main reasons for licensing data is to maintain the status of non-personal.
Is a personal data license possible? what would it look like? Can you really over-ride fundamental DP principles with a license?
Licenses are often put in place to ensure compliance with DP principles.
That would be relying on contract law - providing a proxy protection.
If a gov department shares data under conditions with another, that's not open data. Can think of this as 'front office' / 'back office'.
It was assumed in DP law that 'data controller' is an organisation. But it could be interpreted as 'natural person'. But not thought of as individual controlling their own data.
There's a lack of distinction around someone who could be both data controller and data subject at the same time.
There's some flexibility wrt 'joint data controller' 'data processor'.
Seems like we're talking about the legal mechanisms but not consent itself. Let's take a step back!
Is general consent possible for open data? There's consent to whether personal data flows to a particular organisation (consent of access), then there's consent of use/control (what the organisation do once they have it).
You can give your genome data to the Personal Genome Project with consent, but the approval of particular research is done by the organisation, not the data subject.
Think through some concrete examples:
Think about the use of the data, consent covers that.
The point of collection, use, and origin.
Question 1: Can I consent to open data
Question 2: can you consent to 'nice research' and not 'bad research' (e.g. biological warfare research)

Not many people will likely want to upload their genomic data if there are no constraints.
Let's assume anonymisation is a black box which works.
There is legislation which defines these things - people trying to make personal data not personal data using less than adequate anonymisation techniques.
Consent implies permission, but what the genome project case is talking about is personal 'publishing'.
How do you get informed consent to anonymisation when anonymisation techniques are a 'black box' for most people?
To treat in law personal data as just being yours is wrong - e.g. the genomic data example which affects your relatives.
There are certain situations, types  of data, which should always remain closed.
Difference between publication and consent.

At the moment I self-publish data, I give up certain data protection rights?
Depends on the context - e.g. social media if you publish there it doesn't necessarily mean you're giving anyone permission to do anything.
The way to get away from that is if you publish your data on your own website.
So paradoxically I have less control over my own data if I publish it myself.
The question is if someone else takes it from independent blogs, they become a data controller.
They don't necessarily need consent if they have one of the other bases of fair processing.
The 'lock comes down again' every time there is a re-use.
The problem is if you're a data controller, you need a condition for processing. One of those conditions is consent.
In practice, the other available conditions don't work. But in a big data / open data context it often comes down to consent.
there's a connection between the practical examples and the broader principles.
The problem is around making a broad enough consent for many different purposes, and still being valid as 'specific' consent.
E.g. individuals may not agree with e.g. GSK activity as 'medical research'.
Are the protocols for consent applied to data controllers adequate?
If you define purposes tightly enough, it may be feasible to get informed enough consent for research uses.
General principles for consent.

But the UK Biobank records example - they subsequently decided they couldn't delete everybody who wanted to opt-out. They then shared with an unspecified company they didn't originally mention.
One of the reasons often put forward against consent, is that 'people aren't that bothered'. They just want the benefit / end product. People don't want to read privacy notices.
So that's one of the practical challenges - how do you give people the information in a condensed form ?
The MIT media lab clear button initiative is an example of opt-out mechanisms.

All of my data has some bearing on other people - if you exclude all personal data relating to other people, you exclude too much. E.g. household data. It's a slippery slope. There's a spectrum.
E.g. smart meters - always family-level data. But not about specific people, like genomic data.
Anonymisation gets harder the more data you have - so every kind of data becomes relevant.
There's no hard/fast line on this.
There are no stronger restrictions on access, because people couldn't be sure that those with access hadn't used it secretly.
There is a notion (a social contract) that by going to the hospital you have implied consent to data processing for delivering treatment. But secondary uses are not implied (e.g. invoice reconciliation).
Even if you forget about the information content of a genome, it's still a biometric.

Conclusions:






Tuesday Afternoon - Resources Group

                                   
Resources on Anonymisation

1. ICO Code of Anonymisation: broad guide on how to think about anonymisation

2. ONS Guidelines consistent with the ICO Code of Anonymisation  


3. ONS SDC courses on disclosure control across government which are open to others. Also run bespoke courses on disclosure.

4. ONS writing e-learning materials on anonymisation

5. ONS have a 'helpline'. Questions can be sent to sdc.queries@ons.gov.uk

6. UK Anonymisation Network (ukanon.net) working with ONS, Southampton University, Manchester University, ODI and ICO. They provide consultancies and are collecting case studies, writing a book, holding a symposium.

7. NHS 'Anonymisation standard for Publishing Health and Social Care Data Specification'
http://www.isb.nhs.uk/library/standard/128

8. Statistical Disclosure Control, 2012
“A reference to answer all your statistical confidentiality questions.”

9. Nice example: Department for Work & Pensions have StatXplore for the public to interrogate data flexibly, with a built-in algorithm to anonymise data.

10. Cabinet Office and Tim Berners-Lee have a star rating system for open data.

Don't want too many guidelines, could create confusion and inconsistent advice.

GAPS
Easily digestible resource flagging key things to think about.
Create an overview of what is out there so people know where to look.
Glossary to define terms (do we even have a common language yet?)
Advice for businesses looking to anonymise and release data.

Session on Open Data FOR Privacy – convened by Reuben Binns [Notes emailed by Elizabeth from ORG, who wins the prize for best report!]

Open data can help achieve privacy aims through transparency or by changing business models, e.g. buyers are able to market themselves to sellers. Idea of “self advertising” is that people advertise themselves rather than the other way round. The aim is a positive outcome of innovation as well as protection: “applied privacy”.

Discussion of using the ICO notification register of data controllers to obtain information on how data is processed for e.g. for health, by banks and who the recipients are. The new EU law proposes to get rid of the register as it aims for a light touch and focusses on what organisations are doing internally. It is intended to lead to better data protection outcomes in the longer term.

One tool is freedom of information requests, but only for public authorities. Questions can be asked regarding use of data and data breaches. ICO investigations – third parties do not get access to details. Customers may want open publishing of reports – advocacy would be needed to promote this. The downside is this might have a chilling effect and prevent companies from reporting breaches if it is made public.

We should not ignore the private sector discussions around big data and the role of transparency. For customer and stakeholders it is on the corporate agenda. It is about companies being transparent about what tracking they are doing when people use their websites. Certification schemes could be a good idea. Mozilla has done it recently.

Another barrier is the disruption of business models. Many companies trade on having data in a market where customers are the product sold to advertisers. This could be disrupted.

We need new tools for self-determination. People need to get control of their footprints which requires innovation policies for a human-centric society rather than corporation-centric.

It is not clear whether enough citizens care about the issue. The ease of the current route means many people may not want to create their own world of data. But there is demand within the data protection world. It is a social need. Companies need to find competitive advantage from doing it. The ICO's scheme of red, yellow, green for compliance could be developed. The ICO is launching an accreditation scheme. It will relate to specific practices rather than companies. There is discussion around privacy as a competitive advantage e.g. Boston Consulting Group. But many people don't care as long as their data is not lost. The EU Data Protection Commissioner produced a report recently. Monetised value of data may outweigh the power of consumers.

There are three advantages:
1) Accountability – companies get found out if they do wrong;
2) Allows customers who care to make rational decisions; and
3) For customers who don't care, transparency in the system may preserve trust in the system in general (like open source software where people don't read the code)

A concern is inappropriately shifting responsibility to the individual. This should not absolve companies of their obligations. Information mediator companies may be needed to get people a good deal. The regulator must oversee it. It could change to be the overseer of the privacy infrastructure. Privacy policies put the responsibility on the individual to read them all. People don't want to make decisions on data everyday. Companies like Google promote transparency for hidden interests. 'Open-washing', like 'green-washing'. Most open data is not demand driven at present.
We need lobbying for 'my data'. The government or regulator should have a role in 'my data' but it may create more work in terms of complaints. We need to see examples of being effective but it is hard when the benefit is intangible such as transparency. We should look at sectors where there is an incentive for individuals to repurpose the information. We could look at the Netherlands in terms of subject access requests to telcos. Customer switching is a measure.

We are looking for a mapping of the space and the barriers. Transparency is a stepping stone but just revealing bad things does not solve the problem.

The 'AHAs':

1) Three uses of transparency:
        a) Accountability – companies get found out if they do wrong;
        b) Allows customers who care to make rational decisions; and
        c) For customers who don't care, transparency in the system may preserve trust in the system in general (like open source software where people don't read the code)

2) Barriers:
        a) Chilling effect – companies may become secretive.
        b) Transparency itself is not enough; need to do something with the data that is made transparent.
        c) Engagement – people have other concerns as well as privacy and we should not place too much responsibility in consumers' hands.



Tuesday afternoon - open data & anonymisation through aggregation

Open data which is about, say, weather is fine to open up - no personal element
but what about when we have a set of data about people - can we open it up and if so how?

how does aggregation work?  there must be some degree of 'collapsing' 

if you have a 2x2 table, you can turn that into a record-based system. there's no real distinction between the two.  Being table vs record doesn't matter. 

from the theory - there's k-anonymity. you can collect data until there are enough individuals within each group that sufficient aggregation is achieved. 
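A minimal sketch of checking that property, with made-up records and postcode + age band as the quasi-identifiers:

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    # A table is k-anonymous when every combination of quasi-identifier
    # values is shared by at least k records; this returns that k, i.e.
    # the size of the smallest group.
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

rows = [
    {"postcode": "M13", "age_band": "30-39", "diagnosis": "asthma"},
    {"postcode": "M13", "age_band": "30-39", "diagnosis": "diabetes"},
    {"postcode": "M14", "age_band": "40-49", "diagnosis": "asthma"},
]
print(k_anonymity(rows, ["postcode", "age_band"]))  # -> 1: the M14 row is unique
```

You would then collapse (generalise) values - shorter postcodes, wider age bands - until the smallest group is acceptably large.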

but what about practice?

k-anonymity works well when the database is low-dimensional - few columns, many rows.  but if there's lots of columns this fails. 

you can collapse some columns, sure but then we must talk about the distance between the columns!  closest neighbour and farthest neighbour may be  very close. 

Example: Dr Foster dataset.  it is aggregated, anonymised to the last 2-3 digits of postcode.  licensed to people who can provide services, authorised orgs.
the threat model is very different here.  it's not an active adversary attacking; it may be an honest-but-curious person. 

individual desire to give data for medical research but not have it totally identifiable

the degree of collapsing renders the use negligible. 

run the risk of collapsing the big numbers along with the small ones

no magic k - it's all contextual

if you collapse too much - eg national statistics - you can answer some questions but not others

can you work out the best questions to answer with a dataset and collapse in that way?

but with open data the benefit comes from uses you can't predict - if you knew the best way it would be used you wouldn't need to open it, you could just give the data to the user!

think of the census. huge stats effort to get personal data ready for open release. worth the investment - lots of minds to aggregate appropriately. 

changes to the census method in the UK coming; 10% sample release not 5%, and some removal of fields from that
some licence conditions - but very light touch - on the sample data
reduced set of variables is published - useful as teaching/training dataset
also, record swapping, to create uncertainty & create noise

So two ways:  collapse data; or remove fields(columns)

Then we look at how sensitive the data/variables are. 

eg abortion data... 

need to keep data useful but add perturbations. in census there's nothing super sensitive; there's health but it's self-assessed (eg 'are you a carer' )

lots of uncertainty and write-in answers - processing not always perfectly accurate etc, transcription and translation

can try to add records to cover expected abnormalities

record swapping and other forms of perturbation -- when you've done this you release samples of records, and final stats database
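A sketch of the record-swapping idea (illustrative only; real disclosure control targets at-risk records and swaps within constraints, rather than uniformly at random):

```python
import random

def swap_records(rows, field="postcode", fraction=0.05, seed=0):
    # Swap `field` between random pairs covering ~`fraction` of records.
    # Marginal totals for `field` are unchanged, but any individual record
    # may now carry another record's geography -- the added uncertainty.
    rng = random.Random(seed)
    rows = [dict(r) for r in rows]  # don't mutate the caller's data
    n_pairs = int(len(rows) * fraction) // 2
    chosen = rng.sample(range(len(rows)), n_pairs * 2)
    for a, b in zip(chosen[::2], chosen[1::2]):
        rows[a][field], rows[b][field] = rows[b][field], rows[a][field]
    return rows
```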

use of netflix database in attacks on other things 
the noise was added to netflix but not enough

knowledge of information may be imperfect anyway
-- but even with that you can use eg netflix set to identify people

two different datasets - with various data processes - knowledge is imperfect - that's different from deliberate perturbation (SDC)

these methods can waeken the results but could still leave potential for attack

we haven't seen these datasets released for study by security researchers - we haven't tested the anonymised set 
to carry out SDC tests you need real data - can happen inside an organisation
external researchers different
ethics committee tough -- so it's an internal process to audit as well as to anonymise; no real validation. no way for external experts to assist /audit?

how can you evaluate anonymisation techniques using a meta anonymised data set?
there are guidelines you can use for this depending on risks in data set
how can these methods be tested in a privacy friendly manner?
there are ways to bring in external researchers to evaluate disclosure risk

datasets reproduced after perturbation may have some gaps, some info.  strongest test - you should not be able to find the person in both the raw dataset and the processed one?
but record swapping only applies to some proportion (?)

a practical attacker might be interested in one neighbourhood, gather data on that, and then get some set of attributes from that...  depends on data how a practical attack would proceed

the info gathering based attack has been tested by UKAN
predicated on specific sort of knowledge
'level 1 response knowledge' - you know the person is in the data

attack could target a group rather than an individual

underlying models always depend on whether or not you know for sure if a specific target individual is in the data

what's the context of this data? what other data is in the world or available?  this informs attack modelling

you can only simulate one attacker - not the set of several/all attackers... 

in primary risk analysis you assume the attacker has exactly the data you have

what threats are considered?  'can i identify an individual and find out some specific info about them'

false positives...   for some matches there's greater certainty the match is correct
if you have high priority matches, you then retrain your model and repeat 

you can have a theoretical limit with regards to a specific attack model.  that's estimatable accurately.  but it depends on assumptions about adversary
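A sketch of the matching step such an audit simulates (exact matching only; real attacks use probabilistic matching and then, as noted above, retrain and repeat):

```python
from collections import defaultdict

def linkage_attack(released, auxiliary, keys):
    # Bucket released rows by their quasi-identifier values, then look up
    # each identified auxiliary record. A unique match is a claimed
    # re-identification; several matches only narrow the target to a group
    # (and any claim may still be a false positive if the target isn't
    # actually in the released data).
    buckets = defaultdict(list)
    for row in released:
        buckets[tuple(row[k] for k in keys)].append(row)
    return {
        aux["name"]: buckets[tuple(aux[k] for k in keys)]
        for aux in auxiliary
    }
```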

our original question:  if we have personally identifiable info, and you want to release as open data, without (much) potential for identification, what do you have to do?

(assuming it's not about data where you want to be able to identify!)

you must decide your risk appetite

is it different if you ask people 'are you willing to contribute your dataset'? what if you get a subset of data, just a few folks contribute? changes the aggregation picture 

it's different for every dataset!   
data about abortion vs census type data vs what's your favourite colour?

Netflix - you can stop putting more data out but you can NEVER retract data. 

netflix - the trouble in communicating this is: you may say, hey, my movie data, no worries, i'll share. but people never thought that it would lead to inference of sexual identity!  really hard to explain the risks here

attack audits are about certainty of identification rather than probability of identification


 3 ah-has:
      *  hard to do really good auditing because you need to bring an adversary researcher in-house to test anonymisation of data (without access to the original you can't really test)
      * high levels of aggregation are needed on most data for it to be opened and non-disclosive
      * use a licence instead of open data
      
      


Wednesday

For practical examples of public record data that contains personal data https://docs.google.com/document/d/1yVnTPbTs_u0KIQ4XEM_VxOMUe0jxsfDnCHNsXbPrMes/edit#

Wednesday exercise on public record information:  
    
    We will discuss the following examples excerpted from the above doc (the above doc has links etc in some cases):


Notes on Break-out group on Baking Privacy into Funding data4development projects. 
 
Alex, Emma, Malavika, Chris, Sanne

Some issues we could talk about: 
 
 
1.     Life cycle of development projects when collection, when storage, when etc. Willow just finished a lifecycle as an output of the RDF Oakland – built on that. 
 
2.     What is specific about development context? Are there particular things in development that make it different and how do we address those. 
- one way to approach that is try to think of what the prototypical challenges might be – stand out more in development. 
 
3.     Impact assessment; when you are rolling out a development project, are you thinking about what the impact might be in the sense of intended consequences? What are the artefacts and legacies you leave behind that could be repurposed. 
 
4.     Are there baselines of any universal conditions and attributes that we can use across every context. Are there red lines? Certain kind of data that we never should collect? Or is it all context dependent?
 
5.     Harm stories and failures; how it went spectacularly wrong?
 
6.     Are there particular technologies or programs that have a higher risk in terms of prioritizing funding. 
 
 
How do you make sure that is not just a box that you tick off –  because often privacy by design ends up a box-ticking thing. 
 
 
Where would we want funders to be? What do we want funders to do? As a matter of policy.
-       criteria based policy: projects need to do x-y-z
-       carrots; highlighting projects for inclusion of privacy
-       separate funds – support to inject privacy/capacity 
 
Mechanisms that could come in, as appropriate, for grantees to work with. 
 
Capacity and familiarity of funders. The Engine Room just interviewed funders. Nobody feels that these questions are raised enough. Broader questions on ethics and privacy are not addressed. 
 
 
They don't think of them as data projects, but as a water project etc. But: everything is a data project. 
 
The most progressive donors have the most 'hands-off' approach. This is a tension – but you are always being paternalistic in some sense – also if you say leave it up to the people – and they decide everything is fine. 
 
Developing countries do not have the legislation, but what do we do in the UK in order to fulfil the Data Protection Act, and could we bring these practices to funders?
1.     IT system goes through an audit by a security auditor
2.     Data-processing agreements with suppliers that are matching the datacentres. 
 
When conditionality is most appropriate and when a standing offer is appropriate. If donors have a standing fund for responsible data support then there is money for an audit as well. 
 
Can we make interventions: hey – this is a good project, a safe platform eg and take what we have learned. 
General audit of most popular tools – risks or concerns (FrontlineSMS, Magpi or ODK). A 'here are some questions you should think about before you go further' needs to be integrated into a platform assessment, alongside the functionalities you want. 
 
If donors had a list of the most popular 35 tools with the responsible data risks and concerns – a cheat sheet for popular tools. 
Due diligence on the infrastructure. This creates an incentive, and it doesn't require institutions to change policy, but POs can use it. 
 
Open integrity Index:  iilab
Rebecca MacKinnon's Ranking Digital Rights. Mini offshoot in the development sector. Tools change so quickly – it needs to be kept up to date. We are interested in granularity. Potential for something like that – requires researchers.
 
Matrix for Oxfam Novib on tools and MAVC mapping of who uses most tools. So within a few months we have the most relevant tools – might have greater uptake if it has something about functionalities as well. 
 
Community to review such a tool is willing and able. We could try to get funded some research for the Open Integrity Index to make a draft on the basis of our tools and could follow up with the group of donors we interviewed.  
 
Best would be an individual consultant for every project to help think it through – not feasible. But a set of questions and ideas would be good. All your grantees/new grantees have a training.
There appear to be very few people doing training on these issues; it is ad hoc, with no coordination or sharing of methodologies. Only 3 donors have a module. Huge room for improvement and coordination.
 
Make it available – the 'training'. Classic example: The Engine Room was an OSF grantee and needed help managing our governance policy, and they helped us set up assistance with the audit department. Offer the expertise. 
 
Funders themselves don’t think about these issues – they don’t know what questions to asks. They are required by their policy to send 
 
At the moment the privacy risks exist – we need to make people aware of misuse of data; are there other ways than trainings to do something about it? Is there anything that can be done in the funding cycle to mitigate these threats?
 
It needs to be genuinely helpful. It’s better than nothing to have a baseline that people have to consider. If you at least get people to tick the boxes. 
 
You can't force people to see the relevance of privacy all of a sudden - but you can make people check the boxes. Funders want to be hands-off for reasons of respect and ownership – but sometimes it's a cop-out not to have to spend too much energy on certain projects. 
 
Focus on having tools for funders and for POs who are motivated to use them. 
 
 
We should have these conversations with other funders round the table and ask them. 
 
3 points:
Awareness: everything is a data project. 
Incentives: competing and perverse incentives in funder relationships – box-checking exercise. The most immediate activity is to provide POs with resources: a cheat-sheet with the risks and dangers of certain tools, so they can get into a conversation with their grantees. 
Sharing of information across organisations and get funders together in a safe environment: Funderdome.


Tuesday Morning Group -  Open Data and Research Data
Mark Taylor
Andy Turner
Helen Wallace
Amelia Andersdotter
 
At the moment too often the availability of data depends on who has the economic power to leverage their own particular interests. 
 
Open data as double-edged sword:
 
a. There is a problem that the “open data” research agenda can be a way of labeling activity, and establishing particular governance, that is in corporate interests rather than public interest. Can reduce individual/ public control.
 
b. At the same time, “open data” can be a way of leveraging public benefit from commercial research activity. It can be a condition of commercial research access that the results are made public.
 
Given that open data can be a double-edged sword, how do we ensure that any mandate to open data leverages public rather than private interests?
 
1. Legislative reform?
 
There are a number of existing research exemptions – routes through to lawful processing. E.g. rules on keeping data – on the purposes for which data can be used – for subject access and requests to be forgotten…
 
Do “research exemptions” appropriately recognize different kinds of research, and research of different public – as opposed to commercial – utility?
 
SUGGESTION: Need to get public and researchers involved in the appropriateness of allowing “research” to do things that would not otherwise be permissible (?) Ensuring the governance operates ‘in the public interest’
 
2. Make the political aspect of open access to research data more transparent.
 
SUGGESTION 2: Need to get rationing decisions – e.g. decisions by funders – more transparent and accountable. Work on who is making what decisions, and what conditions they are applying. Ensure that decisions are made in the public and not corporate interest.
 
This work should inform understanding of improved best practice on ethical training, ethical review and ethical audit of research practices. 

Andy Turner's Notes
https://docs.google.com/document/d/1_FLykBOzA3-ZhFth2VeexHR9N2ev3N_cJEM2Ouk0zsE/edit 


Steve, Sally and Antonio
dystopian views about when individuals own and control their own personal data

future scenarios where individuals controlling their own data can have catastrophic effects: for example, accessing and releasing health data about personal challenges such as cancer, and insurance or private health care companies in the future using this information to determine the cost of accessing services


1. Character has control and could no longer manage it
2. Character 2 has right to give away control and this has consequences for who access it in the future
3. Access and control of someone's record, and they are profiled based on this open data; lack of control when there is a problem with wrong processing of data

1. My niece will be born within 5 years. According to the registries of the health system she will have a big probability of suffering a heart attack before her 40s. Other institutions which have access to the DNA registries of her and her family discover that she will have a very creative mind. Segmenting this and analyzing other social aspects of her parents, the home office discovers a tendency to be a social activist, who must be investigated closely. The education system recommends a specific curriculum for her according to inherited attitudes, more focused on human studies than on science. Her parents prefer her studying Maths, but the authorities said that it would have an extra cost. So in the school and the University she met some people very close to her way of thinking, around a social change. This behavior was detected by the police through her social network and supported by the previous analysis made a lot of years ago by the home office. At some moment in time she detected that the police were following her and she became a bit aggressive in some public demonstrations against the government. One morning she suffered a heart attack when she found out she was being followed by a stranger.

2. Jay is a young man at a University somewhere in Europe. Due to recent legislation which gives him full access to, ownership of and control over his data, plus the emerging discourses around how much personal data management is worth economically, Jay decides to offer all of his data for sale to the highest bidder. This included all his personal (social media) conversation via text and emails, as well as all his health, education, financial, geo data, retail, legal records etc. In addition to this, he signs up to offer his full records perpetually. He obtains a sizable amount of money and is quite pleased with this arrangement. Fast-forward a few years and Jay is out of school and has just been considered for a new position in another country. Unfortunately for him, the new organisation was able to access all his records for a small fee. In them, they were able to find out that Jay falsified some of his school records, had previously harbored a fugitive in his university hostel and had defaulted on his student loan payments. Additionally, he was unable to obtain any reasonable insurance due to the fact that he was perceived as a high risk. Needless to say, he did not get the job or any others.

3. Susan is very protective of her data. So concerned was she about privacy that she manages it all herself. She has total control over each and every single piece of information collected about her and is responsible for storing and managing it, dispensing it only when necessary. Over time, this becomes a huge burden and she had to increase the capacity of her storage devices in order to contain all the data. Additionally she had to take a Masters degree in Computer Science (Information Management) in order to effectively make sense of the ever-increasing data, spending quite a lot of money in the process. Fast-forward a decade and Susan is diagnosed with a terminal disease. All her property (including the data storage) is passed on to her nephew, who knows nothing about data management and was therefore unable to provide doctors with her medical history, which would have enabled them to find treatment for her condition. Due to this, she died without a solution being found.


How to communicate or enhance education on Open data and Privacy

communication on privacy as a concept, and the considerations around privacy in open data systems, needs to happen in a way that engages people, especially when the audience are people outside the community; the messages need to be simple and clear with no use of jargon, but without losing the key message through oversimplifying.  Need to consider the different audiences (for example, when to use blog posts, and when to use participatory forums such as this).  Not one size fits all. These meetings should be brokered by organisations perceived as neutral.  Common understanding of the terms being used is also necessary to enhance clarity. 
 
When there are concerns about data misuse, independent organisations (like Open Knowledge?) need to provide the platform to discuss these, not organisations which may have a vested interest. A neutral and open platform to bring a variety of viewpoints together, not letting a few dominate or polarise the discussion.
 
Case studies and more evidence needed to back advocacy activities. Convince people immersed in data management that there are humans in the datasets.
Engagement plus evidence is therefore likely to be more successful: more public discussion; the tools needed to communicate in this space are different. for example, videos might not be appropriate as people need to feel safe. 
 
At the end of day 1, participants called for clarity on the open data brand and the terms; they wanted to go through the open data sets mandated by law even though they contained some personal data, and to see the privacy concerns of each.
They  also wanted to explore the other side of the conversation around the obligations placed on people if they had access and control over their own data.
 
 
 Understanding of Open Data:
It's a new name for an old concept; structured; transparency; no challenge about how it will be used; access for free; interactive, participation, engagement; there is a gap in the definition - is personal data ever released without controls?; data that has a license and allows all types of use; data that is open to all and in a form that uses open standards; freedom; minimum restrictions on the reuse of data; restrictions on open data may be valid but what are they?; reusing government data - but does it get reused?; provenance (open history of use); it is more than government data; vulnerable (in terms of tech/platforms as well as people); are we creating data orphanages (eg. data.gov)?; facts; confusion with regards to the intersection of open data and big data; online; license; public access to data; democratic knowledge for many, not few; need for suitable anonymization; enabling smaller organisations to benefit (societal value); rebalancing of power; another buzzword; raw material to be explored; available; easily accessed online; about things; open compute; provenance (open history); bad!; government spending, e.g. how much of our taxes are spent on what; data that is beneficial to the public; data that is reusable by everyone; data that everyone has access to; data that is under an open licence; five stars for degree of open; interpretation has not been done; free to reuse; machine-readable; moving towards making more public data open data; legally free; some data is easy to make open; five star rating for accessibility making the most utility of taxpayers' money; look at the data from different angles; making public utilities easier to use; open data gives transparency; difficult to get open data for services; free for anyone, anywhere, for any purpose; holding organisations to account

understanding of privacy

community, health and wellbeing; political, power, relationships, etc; social condition, not a technical issue; boundaries in the relationship with others; information about me that I choose not to share; protection; you have a level of control over who gets access to information and how it is used; ineffective (as a call to action to get things to change); should only be used when necessary plus proportionate in line with Human Rights law; control and self-determination over your life; privacy depends on capacity; private sphere - about your personal life; more salient/necessary at difficult times; freedom (to be anybody); diversity of personal spaces, interests; protection from trespass/invasion; variety of laws with regards to privacy; intimacy; enables/supports other choices, plus values plus behaviours; open data + privacy not incompatible (if done well!); who gets access for which purposes?; personal space (closing the door - zone to be private); protection; I don't want intruders; expectations/preferences; who judges privacy: social norms; social convention; privacy is 'home' in the digital space; states of separation; cultural evolution; political rights; control over access; anonymity, secrecy, autonomy; secret; control; creating your own personal space even in public contexts; privacy within open data is a sliding scale; balancing between freedom and freedom; privacy is for the weak, transparency for the strong; ability to control your different identities in different contexts; 'Entfaltung der Persönlichkeit' (unfolding of the personality, German Constitution); protection of the vulnerable; Scott McNealy; personal; naked; confidential; autonomy; putting a boundary around a context; selective disclosure; privacy on society level; fundamental human right; privacy as a local and social matter (the hardest issues revolve around family and friends, not governments and corporations!)


We (as a group) should: 
continue to involve more people in discussions about society, data, online, ethics, power, etc.; improve understanding of legal possibilities regarding open personal data; extend CKAN to allow people to attach 'privacy implications' info to datasets; share our projects, insights etc more often; we should develop an easily understandable intro to issues (+ risks/arguments); we should help develop the workshop reader into a good resource; decide if open data means open to everyone - or does it include data open to some people for some purposes; create a 'debunking myths' primer/doc - a light touch intro to scope what open data is/isn't; we should continue to engage with both privacy + open data communities about the issues surrounding openness as against privacy; good practice platforms for info self-determination; we should agree that personal data should not be open data; we should do book sprints on anonymisation or privacy by design; we should define standards for provenance/consent in metadata; we should keep in touch as a group and work on areas of common interest (it has been great getting such a diverse group together); recognise that data is inseparable from compute (need to consider openness of process and usage as well as of data) 

I (individual) will:
Offer up Horizon prototype apps and services (like the marathon example) as case studies for 'privacy and openness' inspections (and can even provide funding for it!) - SB
Take the lessons learned from this workshop - issues & process - and incorporate them into our PhD training programme. I welcome the involvement of external partners, e.g. through distinguished lectures. - SB
Advocate for responsible data within Hivos; and also input these insights into future projects; find and share guidelines for data use - SS
I will raise the question of purpose specification for re-use of personal data with the Working Group; advocate for open register of government data sharing activities- RB
Commission several follow-up papers + tools (e.g. Open data Licensing)-JR
Co-run a session on privacy at OK Festival- JR
Javier will co-write the chapter on Privacy for the Open Gov Guide/OGP (with PI); help create a list of resources; help build some prescriptive principles/guide for open data privacy - JR
Pilot mydata in traffic systems in Finland; publish mydata policy recommendations (in Finland)- OK
Think through terminology, framing messaging and narratives regarding open data; upgrade my tech and data skills competency- MJ
feed notes/outcomes into recode (EU FP7 Prospect) policy meeting in June 2014-MT
Include learnings from workshop in briefings on use of health data + genomes (info, contacts, ideas) - HW
Find all the resources/tools referenced at the workshop and share them (where possible); write a blogpost on synthesis of insights from this workshop; help to develop and test the checklist for open data publishing-SD
Define the limits of data control and consent in Open Data
Attend the open data lobby event in Sweden, which I initially dismissed!
Assist with the development of the checklist for the release of open data/coordinate development of the publishing checklist- CW
Get the support of the Anonymisation Network
Help Test out the RDF 'how to' open data checklist-EP
Contribute to a checklist for open data releases; gain greater understanding around EU directives on data including amendments; 
Write a blog post/article on 'everything is a data project'-MJ
Write a blog post on the false dichotomy between open + personal data-WvH
Make use of the WG and UKAN input on upcoming changes to offence, location, anonymisation on data.police.uk - AE

I (individual) would like to:
Stay connected to these discussions and contribute to the RDF checklist for data publishing-AC
Try and establish more local spaces for open data, privacy, power over info debates
Think and write about Open Data & privacy specifically in a development or a developing country context - MJ
Help with a more nuanced debate that doesn't class openness and privacy as two competing and incompatible values - MJ
Scan all the Open Knowledge groups to see where privacy issues have arisen-RB
Discuss further implications of Open Data for Data protection and health data-
Look into linkages between open data + big data, and the common issues between them-CW
Have conversations with ODI leads about the inclusion of 'privacy considerations' and 'destroy by' dates in ODI guidance and certification- AE
See the expertise of the WG used in actual work 
Write an article about mydata with Brian Arthur and Shoshana Zuboff; set up an Open Data and privacy workshop in Finland; finalize mydata proposal to ministry of communication, Finland; set up mydata development community - OK
Get a much-more quantified perspective on benefits + risks of open data
Work with others to develop a research project to look at public attitudes to uses of data in research-especially role of commercial interests-HW
More engagement of this group with the WG; work on producing case studies that evidence the benefits + risks of open data, and also of personal data control by individuals - SD
See a very accessible intro to anonymisation & pseudonymization (and can help)-JR
See a code book of design patterns/code examples/recipes for PbD - WvH
To have an operational system that offers fine-grained role-based access control, which keeps metrics about the usage (linking) of data for particular purposes, and which opens up data for research uses that are ethically reviewed and considered to be for the common good - AT 


Issues in public data being opened though containing personal info
Transparency to discover conflicts of interest; official public biographies for MP candidates; what is the purpose - is all the included data necessary for the intended purpose of release?; many public filings would become digital - faster access; what is the alternative? sometimes it is pay-for-access, discriminatory?; by making data open data, there is no expiry date; willingness to be transparent is a requirement for public officials; 




Needs:
Agree on understanding of the terminologies; collate resources that are available; have a toolkit or glossary of terms and use cases; investigate what should be done with research data