content-mining-workshop


Purpose of document:

    To plan practical exercises for content mining to teach relevant software tools and explore possibilities
    a) For the Oxford Open Science meeting on 27 Nov 2013 http://science.okfn.org/community/local-groups/oxford-open-science/
    b) More generally for a package of content mining training material that can be used online and offline

Contributors:
    Jenny Molloy


Meetings (add name to list if you can make that date/time):

Meeting #1

30 Oct 19:30 (Jenny Molloy)

Timeline

5 mins - Content mining: Intro and what you are allowed to do
15 mins - Iain Emsley presenting on text mining with Python - conference tweets and books
10 mins - The power of content mining (mining graphs)
80 mins - Hands-on
10 mins - Demos and wrap-up

Hands-on Sessions

Twitter mining and visualisation with Iain
https://github.com/austgate/openscience

Systematic mining of the literature using AMI
https://bitbucket.org/petermr/ami/wiki/Oxford_Launch

Pre-installation:

Enthought Canopy (Python)
AMI (PMR - Jenny and Ross to test by Sunday evening) ok
Jenny to send installation instructions on Monday.

Documentation

Jenny to bring together links and help make package

Moving On

Iain and Jenny to discuss potential next steps at Oxford Open Science 2014 Planning Meeting

### Possible exercises/problems:

PMR's Idea:

    Quick overviews:
         * http://chemicaltagger.ch.cam.ac.uk/

* use BMC as corpus (primarily HTML) and choose bioscience where everyone can feel comfortable (e.g. species)
* get people to preload simple tools (we'll use wget, grep, etc.). Linux has these already. Windows will need Cygwin or, better, Enthought Canopy / Bash / GitHub. I don't grok Mac but people managed it. BTW ppl are thinking of having a SWCarpentry in Ox so there could be some useful contacts there.
* use wget to download several papers
* use grep to extract italic sections with species names in them
...
Then move on to advanced studies
* Tabula - I am working with these people. It's a nice tool for analyzing PDF tables
* AMI - Ross and I will provide this. We'll do phylo trees and choose 2-3 which work and then get ppl to find others
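
For anyone who cannot get wget and grep running, the download-and-extract steps above could be roughed out in Python instead. This is only a sketch under assumptions: the function names (`fetch`, `italic_sections`, `candidate_species`) are made up for illustration, the "Genus species" pattern is a crude guess at what a species name looks like, and a regex is not a real HTML parser, so expect misses and false hits on real BMC pages.

```python
import re
from html import unescape
from urllib.request import urlopen

# Contents of <i>...</i> or <em>...</em>, the usual markup for species names.
ITALIC_RE = re.compile(r"<(i|em)\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)
# Any remaining tag nested inside an italic section.
TAG_RE = re.compile(r"<[^>]+>")
# Assumed heuristic: capitalised genus (or abbreviation such as "E.")
# followed by a lowercase epithet.
BINOMIAL_RE = re.compile(r"^[A-Z](?:[a-z]+|\.)\s+[a-z-]+$")


def fetch(url):
    """Download one page as text (the wget step). The caller supplies the URL."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def italic_sections(html):
    """Return the text of every italic section (the grep step)."""
    sections = []
    for match in ITALIC_RE.finditer(html):
        text = unescape(TAG_RE.sub("", match.group(2)))
        text = " ".join(text.split())  # collapse line breaks and repeated spaces
        if text:
            sections.append(text)
    return sections


def candidate_species(html):
    """Keep only italic sections that look like a two-part species name."""
    return [s for s in italic_sections(html) if BINOMIAL_RE.match(s)]
```

To try it on a paper, pass the result of `fetch(some_article_url)` to `candidate_species`; italic phrases such as "in vitro" are dropped because they do not start with a capital letter.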

### Content Mining Problems - what would you use it for?

Mat Todd's OSM Idea:
    https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/99
IPCC Report stuff - how many references are open access?