
Tools, tools and more tools: building the data pipeline (1 hour, outside)

Inspired by the Data Handbook (http://schoolofdata.org/handbook/ ), this session will collect our communal knowledge about working with data to build a toolbox at the festival, and to teach and learn about the tools other people are using, from collecting to publishing and visualizing data. The session will be divided across two days: the first to collect the tools, and the second to learn about them.

Collection and collaboration tools
indaba.io 
ethercalc.org

    
An introduction to the data pipeline
- See more at: http://schoolofdata.org/handbook/



Acquisition
Three tracks:
- People-based: crowdsourcing, Mechanical Turk, OpenStreetMap, satellite images, questionnaires, surveys, illegal activities
- Government-based: journalism agencies, space agencies, governmental websites providing open data, public TV stations
- Machine-generated: scraping public websites, machine learning, analysing publicly available log files from web servers, databases, transit data, sensor networks, GPS tracks




Extraction


Many command-line tools, or scripting languages like R or Perl

QuickScrape - a tool built on Node.js; it gets data out of PDFs, and metadata is stored in JSON

Most data we extract needs further processing

How do we automate extraction processes?
-> Scrapers
-> Continuous integration
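A minimal sketch of such an automated scraper in Python; the URL and the table id are invented placeholders, and requests/BeautifulSoup are assumed here rather than anything named in the session:

    # Hedged sketch: automate extraction by scraping a hypothetical page.
    # The URL and the "data-table" id are invented placeholders.
    import csv

    import requests
    from bs4 import BeautifulSoup

    def scrape(url="https://example.org/stats"):
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        rows = []
        for tr in soup.select("table#data-table tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
            if cells:
                rows.append(cells)
        return rows

    if __name__ == "__main__":
        # Run on a schedule (cron / CI) to keep the extract fresh.
        with open("extracted.csv", "w", newline="") as f:
            csv.writer(f).writerows(scrape())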

Challenge: proprietary vs. non-proprietary formats.

The main challenge is finding open tools if you want to work with audio / video / images, e.g. medical images or brain images, where you want to track temporally at a certain resolution. MATLAB, R et al. have libraries for this.

Another challenge for open tools is tracking live data, e.g. from sensor instruments

Use case at SLUB Dresden, Germany: scraping data from personal webpages on academic websites. Uses Apache Nutch http://nutch.apache.org/ . The project started one week ago.
Method 1: Start with a list of persons and a list of websites. The crawler searches for a given name on a particular site; if it finds a webpage, it starts to crawl data about that person: publications, department, research expertise, interests in specific topics, ... (everything that is typically provided on CVs).
Method 2: Crawl all academic websites and search for unknown persons by keywords that are typical for that academic field (e.g. art history).
Research question: how to differentiate the academic field (e.g. art history)? Approach: find keywords that are specific to this field and differentiate it from other fields. We are going to use the library catalogue to find these keywords.
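A rough Python sketch of Method 1 (the real project uses Apache Nutch; the persons, sites, and CV keywords here are invented placeholders):

    # Hedged sketch of Method 1: look for known names on known sites.
    # The real project uses Apache Nutch; the persons, sites, and CV
    # keywords below are invented placeholders.
    import requests

    PERSONS = ["Ada Lovelace", "Alan Turing"]
    SITES = ["https://example-university.example/staff"]
    CV_KEYWORDS = ["publications", "department", "research interests"]

    for site in SITES:
        page = requests.get(site, timeout=30).text.lower()
        for person in PERSONS:
            if person.lower() in page:
                # Candidate page found: check for typical CV sections.
                sections = [kw for kw in CV_KEYWORDS if kw in page]
                print(f"{person} @ {site}: CV-like sections: {sections}")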


Data types:
geo data, text files, pdf, multimedia data, audio, static images, tabular data, CSV, video, 

What are you extracting?
face recognition, video clips, place, unique values

If the data is unstructured, another step is needed to structure it first

ScraperWiki - how to scrape data from PDFs and websites. Based in the UK. Has built tooling around extracting data.

Talk
- text - entity extraction, regexes, etc.
- what is raw geo data?
- OCR - we know little - mostly behind commercial tools
- PDF - Tabula always mentioned - very popular!
- multimedia - hard, and fewer tools - audio to text, or detecting when ads are in your radio, or a scene change in your video - very specialised
- websites - tables: import.io, scrapers
- microtasking - CrowData


GEO DATA - what do you want to extract, and how?
    Is an address an unextracted piece of geo data? What is raw geo data? Place names in a text document; a newspaper article that includes place names. A tool for this would be GATE - it scans documents based on what you tell it to look out for. It can recognize entities (nouns, verbs) and then you can start working out which are people, etc. Free and open-source software, UK-based.

    What else is raw geo data? A shop might say that they have shops in 100 locations - is there a tool for extracting this data? A Python script. Would you write all of these things in Java? Regular expressions are an excellent tool not to be overlooked.
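To illustrate that last point, a hedged regex sketch that pulls UK-style postcodes (one kind of raw geo data) out of free text; the pattern is deliberately simplified and will miss some valid codes:

    # Hedged sketch: extract UK-style postcodes from free text with a regex.
    # The pattern is deliberately simplified and will miss some valid codes.
    import re

    text = "We have shops at 12 High St, Leeds LS1 4AP and in Oxford OX1 2JD."
    postcode = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")
    print(postcode.findall(text))  # ['LS1 4AP', 'OX1 2JD']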
    
    
    AUDIO 
Group 5 

- OpenRefine - it's a little heavyweight
- Microsoft Excel (the elephant in the room!!)
- R
- pandas (for Python)
- scikit-learn - easy-to-use machine learning library


    


Group 6
Clean and transform: reshaping, creating a unified format. Are there unreasonable data points, and how do you code them? Localization formats; a program may only read certain file formats.
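A minimal pandas sketch of those steps, with invented column names and thresholds (unify a localized decimal format, then code unreasonable points as missing):

    # Hedged sketch: unify a localized decimal format, then code
    # unreasonable data points as missing. Column names and the
    # 60-degree threshold are invented for illustration.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"city": ["Berlin", "Paris"], "temp": ["21,5", "999"]})
    df["temp"] = df["temp"].str.replace(",", ".").astype(float)
    df.loc[df["temp"] > 60, "temp"] = np.nan  # impossible reading -> missing
    print(df)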

Fusion Tables - like an Excel spreadsheet with easy georeferencing options; can visualise on a map (lat, lon), can run SQL-like queries (Structured Query Language), can handle millions of records, can merge tables
OpenRefine - transform data in a couple of steps; the actions are recorded (as JSON), so you can reapply them afterwards.
Google docs
Python pandas - not just for analysing; also for cleaning and transforming
XML transforms
Excel spreadsheet
Pentaho - has an open-source version; Kettle - extract, transform
DataWrangler - alternative to OpenRefine
plyr - perform group-wise transformations
SAS - doesn't need to read data into working memory, so it can work with large datasets. Expensive and closed
R
Stata - social scientists use it a lot.
PSPP - open-source version of SPSS

How large can these datasets become? If they get very large it becomes difficult - you might have to use SAS for, e.g., 10 GB of data; it's proprietary and expensive, but powerful.
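One open workaround for the memory problem is chunked processing, sketched here with pandas; the file name and column are invented:

    # Hedged sketch: aggregate a CSV too big for memory by reading it
    # in chunks with pandas. "big.csv" and "amount" are invented names.
    import pandas as pd

    total = 0.0
    for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
        total += chunk["amount"].sum()
    print(total)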


Group 7 Analyze
Different levels of sophistication... 

Spreadsheets
Good for smaller datasets. Community who uses it: Everyone
* Libre Office
* Google Sheets
* Google Fusion Tables
* Excel

Interactive Environments - allow you to interrogate your data 
* IBM SPSS
* Python Notebook
* Mathematica
* Wolfram Language (https://www.youtube.com/watch?v=_P9HqHVPeik)
* MATLAB
* Orange
* Weka

Programming Languages / Libraries
* R
* scikit-learn - a framework that lets you build entire pipelines (see the sketch after this list)
* NLTK
* ElasticSearch
* SQL
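As an illustration of the pipeline idea mentioned above, a minimal scikit-learn sketch on its bundled iris data:

    # Hedged sketch: a scikit-learn pipeline on its bundled iris data -
    # scaling and classification chained into one interrogatable object.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
    pipe.fit(X_train, y_train)
    print(pipe.score(X_test, y_test))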

Services 


Big Data
Hadoop // Pig // Hive
MapReduce
Apache Storm

Visualisation
TAPOR.CA - Library of text analysis tools. Community who uses it: Mostly academics 

Text Services
Alchemy - sentiment analysis  on your text 
WordNet
dandelion dataTXT API - https://dandelion.eu/products/datatxt/ - Named Entity Extraction / Classification on custom categories / text similarity ...
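On the entity-extraction side of these text services, a hedged NLTK sketch (the example sentence is invented; NLTK needs its standard data packages downloaded once, as noted in the comments):

    # Hedged sketch: named-entity extraction with NLTK. Needs one-off
    # downloads first: nltk.download("punkt"), "averaged_perceptron_tagger",
    # "maxent_ne_chunker", "words". The example sentence is invented.
    import nltk

    text = "Tim Berners-Lee spoke in Berlin about the Open Knowledge Foundation."
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    for subtree in tree:
        if hasattr(subtree, "label"):  # Tree nodes are entities
            entity = " ".join(word for word, tag in subtree.leaves())
            print(subtree.label(), entity)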
  

Group 9 
---
keynote

 *timeline.js*
  road data wrappers
  visualisation

  no great graphing solutions, lots of ad-hoc solutions
  *tableau*, hacky things in *python*

  *d3.js* is awfully complicated
  
   "many people do things before or after visualisation"
   
    WE DON'T DO VISUALIATION: 50% ETL, (Extract Transform Load)  discussion of what ETL is. 
    
    *Yahoo Pipes* -> RSS/XML - lacking in the L of ETL. 
    HTML/LXML/SQLite  Discussed summarise automatically tool (for SQL)  [ScraperWiki]
    Corporations have specific, well specified datastructures, which, with the customers, determine visulisation required. 
    Different formats, different visualisation needs for different users [wages vs. discipliniary record], who is your audience?  
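Since ETL kept coming up, here is a toy end-to-end run sketched with Python's standard library (the CSV name and schema are invented placeholders, not anything discussed in the session):

    # Hedged toy ETL: Extract a CSV, Transform rows, Load into SQLite.
    # "source.csv" and its columns are invented placeholders.
    import csv
    import sqlite3

    with open("source.csv", newline="") as f:                              # Extract
        rows = [(r["name"], float(r["wage"])) for r in csv.DictReader(f)]  # Transform

    con = sqlite3.connect("warehouse.db")                                  # Load
    con.execute("CREATE TABLE IF NOT EXISTS wages (name TEXT, wage REAL)")
    con.executemany("INSERT INTO wages VALUES (?, ?)", rows)
    con.commit()
    con.close()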

"We have it easy because our customers have definate needs, vs. just show us cool stuff"  "Scientists ... we just do graphs, we use commercial software, special tools for Physicists"  

*Igor Pro*, *R*, *matplotlib*, *Origin* (2D plots), *Processing* (the programming language)

*SVG* - amazing, can animate. Visualise the quality of solar cells on a substrate: numbers from the database go directly into SVG.
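In that spirit, a hedged sketch of writing numbers straight into SVG rectangles, like the solar-cell example; the values and colour scale are invented:

    # Hedged sketch: write numbers from a query straight into SVG rects,
    # like the solar-cell quality map described above. Values invented.
    values = [0.91, 0.42, 0.77, 0.15, 0.64]  # assumed per-cell measurements

    rects = []
    for i, v in enumerate(values):
        shade = int(255 * (1 - v))  # darker blue = higher quality
        rects.append(f'<rect x="{i * 22}" y="0" width="20" height="20" '
                     f'fill="rgb({shade},{shade},255)"/>')
    svg = (f'<svg xmlns="http://www.w3.org/2000/svg" '
           f'width="{len(values) * 22}" height="20">{"".join(rects)}</svg>')
    with open("cells.svg", "w") as f:
        f.write(svg)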

*iPython Notebook* - 'most amazing info vis tool...' - can change the visualisation at any time. (Reminds us of Mathematica) - 'really looks like a notebook', 'code snippets', 'really sweet'. Great for presenting something on the web. Excellent for documenting thought processes - 'then I did this'. Executable documentation. Grew out of the scientific community, open source version of Mathematica.  

Non-expert users: *OpenDataSoft* - a big Excel-like sheet with expressions like 'select a portion of the dataset'

Google libraries for visualisation - a few snippets of JS: *Google Charts*

*d3.js* 


---
From visualisation --> processing - ETL, extract, transform, load 

Yahoo Pipes - an ETL tool; it's lacking customisation in the 'L' stage - if you want to customise it, it has only limited capability (in terms of what input it wants)
Easy tools like OpenDataSoft (?)
Wordle - text clouds
JavaScript libraries - D3.js - can give output in different formats, like SVG

Group 10

- amMap - a commercial mapping tool, free to use
- matplotlib - can turn a chart into xkcd style
Many Eyes (proprietary, IBM)
Bokeh - like d3.js but in Python (minimal sketch below)

Questions: what is a good alternative to Access?
Maybe SQLite with a good interface
CartoDB, and MapBox are great for mapping 
Plot.ly 
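Since Bokeh came up as "d3.js but in Python", a minimal hedged example with invented data points:

    # Hedged sketch: a minimal Bokeh line plot ("like d3.js but in Python").
    # The data points are invented; output is a standalone HTML file.
    from bokeh.plotting import figure, output_file, show

    output_file("lines.html")
    p = figure(title="Toy example", x_axis_label="x", y_axis_label="y")
    p.line([1, 2, 3, 4], [4, 7, 2, 5], line_width=2)
    show(p)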


-------
Crazy tools list:
tapor.ca, alchemy, tabula, pdf2text, csvkit, wolfram language, d3.js, morph.io, scraperwiki, datawrapper, google charts, excel, crowdata, python, upton (scraper), leaflet, timeline.js, openstreetmap, many eyes, "raw" app by density design, regular expressions, fusion tables, mapbox, R, csv2html, tilemill, bokeh, ammap, matplotlib, cartodb, sas, nltk, datawrangler, igor pro, origin, gnuplot, charts/viz, orange, weka, openrefine, google fusion tables, sql, scikit-learn, rstudio, datapipes, matlab, mathematica, ibm spss, import.io, hadoop, pig / hive / mapreduce, highcharts.js, analice.me, elasticsearch, spreadsheets, libreoffice, telescope, talking to people, surveys, questions, sensor networks, stata, using metadata, indaba, scrapy, pandas, ipython notebook, openpetition, open data portals, machine learning, logs from servers, transit data processing, gov pages, alaveteli, nitro pdf, documentcloud, ocr (tools), foia machine, securedrop, control name, classification, ushahidi, social media tools, whistleblowers,
mass media, social media
machine-generated sources
how to automate processes, how to capture live data
opendatasoft, wordle, svg, d3.js, matplotlib, tableau