content-mining-workshop
Purpose of document:
To plan practical exercises for content mining to teach relevant software tools and explore possibilities
a) For the Oxford Open Science meeting on 27 Nov 2013 http://science.okfn.org/community/local-groups/oxford-open-science/
b) More generally for a package of content mining training material that can be used online and offline
Contributors:
Jenny Molloy
Meetings (add name to list if you can make that date/time):
Meeting #1
- 30 Oct 19:30 (Jenny Molloy)
Timeline
5 mins - Content mining: Intro and what you are allowed to do
15 mins - Iain Emsley presenting on text mining with Python - conference tweets and books
10 mins - The power of content mining (mining graphs)
80 mins - Hands-on
10 mins - Demos and wrap-up
Hands-on Sessions
Twitter mining and visualisation with Iain
https://github.com/austgate/openscience
Systematic mining of the literature using AMI
https://bitbucket.org/petermr/ami/wiki/Oxford_Launch
Pre-installation:
- Enthought Canopy (Python)
- AMI (PMR - Jenny and Ross to test by Sunday evening) ok
- Jenny to send installation instructions on Monday.
Documentation
- Jenny to bring together links and help make package
Moving On
- Iain and Jenny to discuss potential next steps at Oxford Open Science 2014 Planning Meeting
### Possible exercises/problems:
PMR's Idea:
Quick overviews:
* http://chemicaltagger.ch.cam.ac.uk/
* use BMC as corpus (primarily HTML) and choose bioscience where everyone can feel comfortable (e.g. species)
* get people to preload simple tools (we'll use wget, grep, etc.). Linux has these already. Windows will need Cygwin or, better, Enthought Canopy / Bash / GitHub. I don't grok Mac, but people managed it. BTW, people are thinking of having a Software Carpentry (SWCarpentry) in Oxford, so there could be some useful contacts there.
* use wget to download several papers
* use grep to extract italic sections with species names in them
...
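For people who would rather stay in Python (e.g. under Enthought Canopy) than use grep, the italic-extraction step could be sketched roughly as below. This is only a minimal sketch: it assumes the papers have already been downloaded as HTML, that species names sit inside `<i>` or `<em>` tags, and that a crude "Genus species" pattern is good enough for a first pass. The names `ItalicCollector`, `species_candidates` and `BINOMIAL`, and the sample snippet, are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Rough pattern for a Latin binomial such as "Homo sapiens" or "E. coli".
# ASSUMPTION: a heuristic for the exercise, not a taxonomic validator.
BINOMIAL = re.compile(r"^[A-Z](?:[a-z]+|\.) [a-z]+$")


class ItalicCollector(HTMLParser):
    """Collect the text found inside <i> and <em> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many italic tags are currently open
        self.buffer = []    # text pieces of the current italic span
        self.spans = []     # completed italic spans, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("i", "em"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("i", "em") and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join the pieces and normalise whitespace.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.spans.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def species_candidates(html):
    """Return italic spans that look like species names."""
    parser = ItalicCollector()
    parser.feed(html)
    parser.close()
    return [s for s in parser.spans if BINOMIAL.match(s)]


# Hypothetical snippet standing in for a downloaded paper.
sample = (
    "<p>We sampled <i>Drosophila melanogaster</i> and <em>E. coli</em> "
    "<i>in vitro</i>, as described <i>et al.</i></p>"
)
print(species_candidates(sample))
```

The pattern will let through other two-word italic phrases that happen to start with a capital letter, so the output is a list of candidates to check by eye rather than a clean species list.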
Then move on to advanced studies
* Tabula - I am working with these people. It's a nice tool for analyzing PDF tables
* AMI - Ross and I will provide. We'll do phylo trees and choose 2-3 which work, and then get people to find others
### Content Mining Problems - what would you use it for?
Mat Todd's OSM Idea:
https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/99
IPCC Report stuff - how many references are open access?
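One rough way to start on the open access question, sketched in Python: pull DOIs out of the reference list text and count how many carry a registrant prefix belonging to a fully open access publisher. Everything here is an assumption made for illustration: the function names, the choice of prefixes (10.1371 for PLOS, 10.1186 for BioMed Central), and the idea that a prefix is a usable proxy at all. It will miss open access articles in hybrid journals and any reference without a DOI.

```python
import re

# DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

# ASSUMPTION: prefixes treated as belonging to fully open access publishers.
# A real survey would need a curated list or a per-article licence check.
OA_PREFIXES = {"10.1371", "10.1186"}


def extract_dois(text):
    """Return the unique DOIs found in text, in order of first appearance."""
    seen = []
    for match in DOI.findall(text):
        doi = match.rstrip(".,;)")  # drop trailing sentence punctuation
        if doi not in seen:
            seen.append(doi)
    return seen


def open_access_share(dois, oa_prefixes=OA_PREFIXES):
    """Return (open_count, total) using the prefix heuristic."""
    open_count = sum(1 for d in dois if d.split("/", 1)[0] in oa_prefixes)
    return open_count, len(dois)


# Placeholder reference list with made-up DOIs, not real citations.
references = """
Author A. Some title. doi:10.1371/journal.pone.0000000.
Author B. Another title. doi:10.1186/0000-0000-0-0
Author C. A third title. doi:10.1000/example.123.
"""

dois = extract_dois(references)
print(open_access_share(dois))
```

The result is only a lower-bound estimate under the stated assumptions; the interesting part of the exercise is deciding how to check the references the heuristic cannot classify.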