Tools, tools and more tools: building the data pipeline : 1 hour outside
Inspired by the Data Handbook (http://schoolofdata.org/handbook/), this session will collect our communal knowledge about working with data to build a toolbox at the festival, and teach and learn about the tools other people are using, from collecting to publishing and visualising data. The session is split over two days: the first to collect the tools, the second to learn about them.
collection and collaboration tools
An introduction to the data pipeline
- Acquisition describes gaining access to data, either through any of the methods mentioned above or by generating fresh data, e.g. through a survey or observations.
- In the extraction stage, data is converted from whatever input format has been acquired (e.g. XLS files, PDFs or even plain text documents) into a form that can be used for further processing and analysis. This often involves loading data into a database system, such as MySQL or PostgreSQL.
- Cleaning and transforming the data often involves removing invalid records and translating all the columns to use a sane set of values. You may also combine two different datasets into a single table, remove duplicate entries or apply any number of other normalisations. As you acquire data, you will notice that it often has many inconsistencies: names are used inconsistently, amounts are stated as badly formatted numbers, and some data may not be usable at all due to file corruption. In short: data always needs to be cleaned and processed. In fact, processing, augmenting and cleaning the data is very likely to be the most time- and labour-intensive aspect of your project.
- We will not describe in detail the analysis of data to answer particular questions in the following chapters of this book; we presume that you are already the expert in working with your data and in using e.g. economic models to answer your questions. The aspects of analysis we do hope to cover here are automated and large-scale analysis, showing tips and tricks for getting and using data and having a machine do a lot of the work, for example network analysis or natural language processing.
- Presentation of data only has impact when it is packaged in an appropriate way for the audiences it is aimed at.
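The pipeline stages above can be sketched as a chain of plain functions. This is a minimal illustration, not a real tool: the CSV text, column names, and cleaning rules are all made up for the example.

```python
# Sketch of the pipeline stages: acquire -> extract -> clean -> analyse.
import csv
import io

def acquire():
    # stand-in for downloading or receiving raw data
    return "name,amount\nAlice,10\nalice ,10\nBob,not_a_number\n"

def extract(raw):
    # parse the raw input format into records
    return list(csv.DictReader(io.StringIO(raw)))

def clean(records):
    # normalise names, drop invalid amounts, remove duplicates
    seen, out = set(), []
    for r in records:
        name = r["name"].strip().title()
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # unusable record, as discussed above
        if name not in seen:
            seen.add(name)
            out.append({"name": name, "amount": amount})
    return out

def analyse(records):
    return sum(r["amount"] for r in records)

data = clean(extract(acquire()))
print(data)           # [{'name': 'Alice', 'amount': 10.0}]
print(analyse(data))  # 10.0
```

Note how the messy duplicate ("alice ") and the broken amount ("not_a_number") disappear in the cleaning stage — in practice that stage is where most of the time goes.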
Three tracks - people based: crowdsourcing, Mechanical Turk, OpenStreetMap, satellite images, questionnaires, surveys, illegal activities
- Acquisition -- existing classifications, controlled names,
- Various repositories - identity repositories of data
- metadata - access to e.g. the images on Flickr, but not looking at the metadata
- Sensor Networks
- Public / Private
- Source control system
- Primary source collection - generating your own, but going to those existing collections, rather than working with the library
- Mass media
- Social networks
- Translation and interpretation at the point of acquisition
- Mathematical models, algorithms, teacher / experts/ mentors
- Acquisition - what possible data sources are there
Government based - journalism agencies, space agencies, governmental websites providing open data, public TV stations,
Machine generated - scraping public websites, machine learning, analysing publicly available log files from web servers, from databases, transit data, sensor networks, GPS tracks
Many command line tools or scripting languages like R or Perl
QuickScrape - a node.js tool that gets data out of PDFs; metadata is stored in JSON
Most data we extract needs further processing
How to automate extraction processes?
- > Scraper
-> Continuous Integration
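A scraper can be as small as the standard-library sketch below: parse an HTML page and pull out all link targets. The HTML here is inline for illustration; in practice you would fetch the page with urllib.request and run the script on a schedule (the continuous-integration idea above).

```python
# Minimal scraper sketch: collect all href values from a page.
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # record the target of every anchor tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page = '<html><body><a href="/data.csv">data</a><a href="/about">about</a></body></html>'
scraper = LinkScraper()
scraper.feed(page)
print(scraper.links)  # ['/data.csv', '/about']
```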
Challenge: proprietary / non-proprietary formats.
The main challenge is to find open tools if you want to work with audio / video / images, e.g. medical images, brain images. You want to track them temporally at a certain resolution. MATLAB / R et al. have libraries for this.
Another challenge for open tools is tracking live data, e.g. sensor instruments.
Use case at SLUB Dresden, Germany: scraping data from personal webpages on academic websites. Uses Apache Nutch (http://nutch.apache.org/). Project started 1 week ago.
Method 1: Start with a list of persons and a list of websites. The crawler searches for a given name on particular websites; if you find a webpage, you start to crawl data about that person: publications, department, research expertise, interests in specific topics, ... (everything that is typically provided on CVs).
Method 2: Crawl all academic websites and search for unknown persons by keywords that are typical for that academic field (e.g. art history).
Research question: how to differentiate the academic field (e.g. art history)? approach: find keywords that are specific for this field, that differentiate it from other fields. We are going to use the library catalogue to find these keywords.
geo data, text files, pdf, multimedia data, audio, static images, tabular data, CSV, video,
What are you extracting?
face recognition, video clips, place, unique values
If the data is unstructured, another step is needed to structure first
ScraperWiki - how to scrape data from PDFs and websites. Based in the UK. Built tooling around extracting data.
- text - entity extraction, regex etc
- what is raw geo data?
- ocr - know little - behind commercial tools
- pdf - Tabula always mentioned - very popular!
- multimedia - hard and fewer tools - audio to text, or detecting when ads are in your radio, or a scene change in your video - very specialised
- websites - tables import.io, scrapers
- microtasking - crowdata
GEO DATA - what do you want to extract, and how:
Is an address an unextracted piece of geo data? What is raw geo data? Place names in a text document; a newspaper article that includes place names. A tool for this would be GATE - it scans documents based on what you tell it to look out for, can recognise entities (nouns, verbs), and then you can start working out which are people, etc. Free and open source software, UK based.
What else is raw geo data? A shop might say that it has shops in 100 locations - is there a tool for extracting this data? A Python script. Would you write all of these things in Java? Regular expressions are an excellent tool not to be overlooked.
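As a quick illustration of the regular-expressions point: pulling UK-style postcodes out of free text. The pattern below is a rough sketch, not a full postcode validator, and the addresses are invented.

```python
# Extract UK-style postcodes from unstructured text with a regex.
import re

text = ("Our shops: 12 High Street, Cambridge CB2 1TN and "
        "3 Market Square, Oxford OX1 3HA.")

# one or two letters, a digit, optional letter/digit, then digit + two letters
postcodes = re.findall(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b", text)
print(postcodes)  # ['CB2 1TN', 'OX1 3HA']
```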
- - extract text - is there a website?
- - recognise song - Shazam is closed - any others?
- - Tabula - tables
- - PDF2 Text - text
- - Nitro PDF
- - evernote
- -import.io - extracting tables
- VIDEO/IMAGES/TEXT - crowd sourcing and microtasking to have individuals sort through massive data sets to make basic observations, flag and categorize
- csvkit; Pentaho - for ETL extraction, open source but also a commercial version.
- DocumentCloud - online repository of PDF document files; can upload your own for analysis. (Have to persuade journalists as it is built for them, but you could also possibly set up your own. It is open source.)
- Alchemy API - for named entity extraction, natural language processing.
- UPTOM (web scraper)
- analice.me - An investigative journalism web platform for data extraction, semantic analysis and structuring information to show relationships in visual ways
- Open Refine - it's a little heavyweight -
- Microsoft Excel -(elephant in the room!!)
- Pandas (for python)
- SciKit learn - easy to use machine learning library
Clean and transform: reshaping, creating a unified format. Are there unreasonable data points, and how do you code them? Localisation formats; a program may only read certain file formats.
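The localisation point is a classic cleaning problem: the same amount can arrive as "1,234.56" (English) or "1.234,56" (German). A small normaliser sketch, assuming the last separator in the string is the decimal mark:

```python
# Normalise differently localised number strings to a float.
def parse_localised(value):
    value = value.strip()
    last_dot, last_comma = value.rfind("."), value.rfind(",")
    if last_comma > last_dot:
        # comma is the decimal mark: drop thousands dots, swap comma for dot
        value = value.replace(".", "").replace(",", ".")
    else:
        # dot is the decimal mark (or no decimals): drop thousands commas
        value = value.replace(",", "")
    return float(value)

print(parse_localised("1,234.56"))  # 1234.56
print(parse_localised("1.234,56"))  # 1234.56
```

The heuristic fails on ambiguous inputs like "1.234" (is it one-point-two or one thousand?), which is exactly why cleaning needs human judgement about the source.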
Fusion Tables - like an Excel spreadsheet with easy georeferencing options; can visualise on a map (lat, lon), can run SQL (Structured Query Language) queries, can handle millions of records, can merge tables
Python pandas - not just analysing,
Pentaho - has an open source version, Kettle - extract, transform
DataWrangler - alternative to Open Refine
plyr - perform group-wise transformations
SAS - doesn't need to read data into working memory, so can work with large datasets. Expensive and closed.
Stata - social scientists use it a lot.
PSPP - open source version of SPSS
How large can these datasets become? If they get very large it becomes difficult - you might have to use SAS, e.g. for 10 GB of data, but it's proprietary and expensive, though powerful
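The group-wise transformations mentioned for plyr can be sketched in plain Python too. A toy example with invented spending records: compute per-group totals, then add each record's share of its group total.

```python
# Group-wise transformation: per-group totals and shares.
from collections import defaultdict

records = [
    {"dept": "health", "amount": 100.0},
    {"dept": "health", "amount": 300.0},
    {"dept": "roads", "amount": 200.0},
]

# summarise: total per group
totals = defaultdict(float)
for r in records:
    totals[r["dept"]] += r["amount"]

# transform: annotate each record with its share of the group total
for r in records:
    r["share"] = r["amount"] / totals[r["dept"]]

print(dict(totals))                     # {'health': 400.0, 'roads': 200.0}
print([r["share"] for r in records])    # [0.25, 0.75, 1.0]
```

In pandas the same thing is a one-liner with groupby/transform, and in R it is what plyr (and now dplyr) were built for.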
Group 7 Analyze
Different levels of sophistication...
Spreadsheets - good for smaller datasets. Community who uses them: everyone
* Libre Office
* Google Sheets
* Google Fusion Tables
Interactive Environments - allow you to interrogate your data
* IBM SPSS
* Python Notebook
* Wolfram Language (https://www.youtube.com/watch?v=_P9HqHVPeik)
* MATLAB
Programming Languages / Libraries
* SciKit Learn - framework which allows you to build entire pipelines -
Hadoop // Pig // Hive
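To make the scikit-learn "entire pipelines" point concrete, here is a tiny sketch: a preprocessing step and a model chained into one object that is fit and used as a unit. The dataset is invented and trivially separable; this requires scikit-learn to be installed.

```python
# A minimal scikit-learn pipeline: scaling + classifier as one object.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = [[0.0], [1.0], [10.0], [11.0]]  # one feature
y = [0, 0, 1, 1]                    # two well-separated classes

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)
print(list(pipe.predict([[0.5], [10.5]])))  # [0, 1]
```

The pipeline guarantees the same scaling is applied at predict time as at fit time, which is the main reason to use it over separate steps.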
TAPOR.CA - Library of text analysis tools. Community who uses it: Mostly academics
Alchemy - sentiment analysis on your text
dandelion dataTXT API - https://dandelion.eu/products/datatxt/ - Named Entity Extraction / Classification on custom categories / text similarity ...
road data wrappers
no great graphing solutions; lots of ad-hoc solutions
*tableau* hacky things in *python*
*d3.js* is awfully complicated
"many people do things before or after visualisation"
WE DON'T DO VISUALISATION: 50% ETL (Extract, Transform, Load); discussion of what ETL is.
*Yahoo Pipes* -> RSS/XML - lacking in the L of ETL.
HTML/LXML/SQLite Discussed summarise automatically tool (for SQL) [ScraperWiki]
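The "summarise automatically" idea for SQL can be sketched with sqlite3 from the Python standard library: load rows, then ask the database itself for per-column summaries. Table and values are made up for the example.

```python
# Load toy records into an in-memory SQLite database and summarise.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE payments (dept TEXT, amount REAL)")
con.executemany("INSERT INTO payments VALUES (?, ?)",
                [("health", 100.0), ("health", 300.0), ("roads", 200.0)])

# the database does the summary work: count, min, max, mean
row = con.execute(
    "SELECT COUNT(*), MIN(amount), MAX(amount), AVG(amount) FROM payments"
).fetchone()
print(row)  # (3, 100.0, 300.0, 200.0)
```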
Corporations have specific, well-specified data structures which, with the customers, determine the visualisation required.
Different formats, different visualisation needs for different users [wages vs. disciplinary record] - who is your audience?
"We have it easy because our customers have definite needs, vs. just show us cool stuff" "Scientists ... we just do graphs, we use commercial software, special tools for physicists"
*Igor Pro*,*R*, *Matplotlib*, *origin* (2d plots), *processing* (the programming language)
*SVG* - amazing, can animate. visualise quality of solar cells on a substrate: numbers from database directly into SVG.
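Writing numbers straight into SVG, as described above, needs no plotting library at all — just string formatting. A hand-rolled bar chart sketch with invented values:

```python
# Generate a tiny SVG bar chart directly from a list of numbers.
values = [3, 7, 5]  # e.g. measurements pulled from a database

bars = "".join(
    f'<rect x="{i * 30}" y="{100 - v * 10}" '
    f'width="25" height="{v * 10}" fill="steelblue"/>'
    for i, v in enumerate(values)
)
svg = (f'<svg xmlns="http://www.w3.org/2000/svg" '
       f'width="120" height="100">{bars}</svg>')

print(svg.count("<rect"))  # 3, one bar per value
```

Save the string to a .svg file and any browser will render it; because it is text, it can also be animated or styled with CSS.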
*iPython Notebook* - 'most amazing info vis tool...' - can change the visualisation at any time. (Reminds us of Mathematica) - 'really looks like a notebook', 'code snippets', 'really sweet'. Great for presenting something on the web. Excellent for documenting thought processes - 'then I did this'. Executable documentation. Grew out of the scientific community, open source version of Mathematica.
Non-expert users: *opendatasoft* - big excel sheet with expressions like 'select a portion of the dataset'
google libraries for visualisation - few snippets of js : *google graphs*
From visualisation --> processing - ETL, extract, transform, load
Yahoo Pipes - an ETL tool; it's lacking customisation in the 'L' stage - if you want to customise it, it only has limited capability (in terms of what input they want)
Easy tools like Open Data soft (?)
Wordle- text clouds
- amMap - a commercial mapping tool, free to use
- matplotlib - can turn a chart into xkcd style
Many Eyes (proprietary, IBM)
Bokeh - like d3.js but in python
Questions: what is a good alternative for Access?
Maybe SQLite with a good interface
CartoDB, and MapBox are great for mapping
Crazy tools list:
tapor.ca, alchemy, tabula, pdf2text, csvkit, wolfram language, d3.js, morph.io, scraperwiki, datawrapper, google charts, excel, crowdata, python, uptom (scraper), leaflet, timeline.js, openstreetmap, many eyes, "raw" app by density design, regular expressions, fusion tables, mapbox, R, csv2html, tilemill, bokeh, am map, matplotlib, cartodb, sas, nltk, data wrangler, igor pro, origin, gnuplot, charts/viz, orange, weka, open refine, google fusion tables, sqp, scikit-learn, rstudio, datapipes, matlab, mathematica, ibm spss, import.io, hadoop, pig / hive / mr, high charts.js, analice.me, elastic search, spreadsheets, libre office, telescope, talking to people, surveys, questions, sensor networks, stata analysis, using metadata, indaba, scrapy, pandas, ipython notebook, open petition, open data portals, machine learning, logs from servers, transit data processing, gob pages, alaveteli, nitro pdf, document cloud, ocr (tools), foia machine, secure drop, controlled names, classification, ushahidi, social media tools, whistleblowers,
mass media, social media
machine generated sources,
how to automate process, how to capture live data
opendatasoft, wordle, svg, d3.js, matplotlib, tableau,