pro-iBiosphere_pilots_20131008
pro-iBiosphere workshop Berlin
October 8, 2013
http://wiki.pro-ibiosphere.eu/wiki/Workshop_Berlin_2_(8_October_2013)_-_Promote_and_foster_the_development_%26_adoption_of_common_mark-up_standards_and_interoperability_between_schemas
M4.2 Pilots: mark-up issues (pilots)
Notes
09:35 - 9:40 Jeremy Miller: Spiders (pilot)
- World Spider Catalogue
- http://research.amnh.org/iz/spiders/catalog/
- Background: a spider catalogue that includes references to various elements (figures, etc.)
- use case to show a databased catalogue with links
- Idea: combine data from different treatments into one concept (if they represent the same thing)
Figures are all uploaded to MorphBank
- if all the links are in one place, they can be easily redistributed.
- how to go from here: tackle the most important journals (i.e. Zootaxa) and offer tools (e.g. GoldenGATE) to convert the literature
- ca. 12,000 taxonomic papers have been published on spiders, 75% of them in 223 journals
- But is journal selection a useful part of the workflow? What are the advantages of selecting papers from a single journal? In my experience, a lot of lay-out and formatting issues are not under editorial control. Not even in Blumea, where I am Editor-in-Chief...
- Should we find metrics to assess and describe the heterogeneity of layouts?
- In my experience, around 80-90% of formats are approximately (! the devil is in the details) the same; the problem is that the remaining 10-20% is so heterogeneous that it takes a lot of work.
- With regards to the details, this can be circumvented by using a gradual approach that starts off with the formats that are most common, and then builds on those formats by taking care of more and more divergences. Unfortunately it is impossible to get a perfect result...
- Granted all that, it is still not a given that selecting papers by journal is the most efficient way to assemble sets of papers with similar mark-up issues. Maybe a quick-pass initial analysis could be developed that identifies issues?
09:40 - 9:45 Sylvia Mota de Oliveira: Bryophyta - Campylopus pilot
Suriname is part of the Guianas, and data from Flora of Suriname and other literature sources for the region can be used as a platform for the completion of Flora of the Guianas. Mark-up and importation of data into CDM must guarantee that Flora of the Guianas supersedes the older Flora of Suriname.
- use the flora as platform offered to scientists to add more results
- improve markup
- links to treatments already in place
- structure of CDM a problem: how to supersede earlier treatments (Suriname vs Guianas)
- use scripts instead of GoldenGATE to mark up the entire corpus
- How much configuration of GoldenGate was required? How much modification of FM PERL scripts was necessary?
- We still have to make the modifications to the scripts, but all in all I estimate it's less than a week's work and testing. Apparently the use of a specific script (such as the FlorML Perl scripts) works for a very standard and consistent type of literature, such as some Floras (the main problem is that Flora of the Guianas uses a different format for citations, which get atomised, and that has to happen properly - so testing is important). It also helps with the large number of pages to be marked up. GoldenGATE is efficient in a different context, where literature sources have heterogeneous formats and are spread over different sources - such as the example given by Jeremy Miller.
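The regex-driven script approach discussed above can be sketched minimally. This is a hypothetical illustration only: the heading pattern and XML element names below are invented for the example, and the real FlorML scripts contain far more rules (citation atomisation, synonymy, etc.).

```python
import re

# Hypothetical flora heading: "1. Campylopus flexuosus (Hedw.) Brid."
# (number, binomial, author citation). Real Floras need many more rules.
HEADING = re.compile(r"^(\d+)\.\s+([A-Z][a-z]+\s[a-z-]+)\s+(.+)$")

def mark_up(lines):
    """Wrap heading lines in <taxon> tags and all other lines in <description>."""
    out = []
    for line in lines:
        m = HEADING.match(line)
        if m:
            number, name, author = m.groups()
            out.append(f'<taxon number="{number}">'
                       f'<name>{name}</name><author>{author}</author></taxon>')
        else:
            out.append(f"<description>{line}</description>")
    return out

sample = ["1. Campylopus flexuosus (Hedw.) Brid.",
          "Stems 1-4 cm, leaves erect-spreading."]
for tagged in mark_up(sample):
    print(tagged)
```

The point of the sketch is the trade-off noted in the discussion: such rules are cheap to run over thousands of consistent pages, but every deviating format needs its own rule and testing.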
09:45 - 9:50 Teodor Georgiev: Eupolybothrus Chilopoda
- about 25 valid species
- 36 taxon names
- around 280 treatments, 130 of which are already marked up (mostly through GoldenGate)
- goal is to produce a cybertaxonomic checklist that accumulates all knowledge about each taxon
- Markup workflow is fine for simple documents with treatments
- Need Release Notes / a "what's new?" brief description with each new version of GoldenGATE
- I support that suggestion, adding that good documentation is also a priority!
- Suggestion to redesign the user interface of GoldenGATE.
09:50 - 9:55 Donat Agosti: Ants
- Goal is to link treatment citations to the cited treatments
- tool: stable HTTP URIs
- Character data are parsed from descriptions
- pilot group
- Anochetus from Madagascar
- cost/benefit of legacy conversion relative to encoding/enhancement of new literature
- Are morphological legacy descriptions uniform enough to justify the efforts towards parsing them? Maybe not - we need a stopping rule.
- This depends on the legacy work. Some are very good, some are absolutely awful (it depends on editorial policies and the author of the work).
- Yes - that's why we need the stopping rule: when to stop trying to parse the awful ones. And I suspect that even good descriptions can be difficult to unify in one scheme - but I'm speaking from experience in ferns, not ants.
- I'm not sure it's possible to unify them in one scheme. Bob Morris can probably explain that very well.
- Sorry - by scheme I merely meant the same set of characters and character states, not the kind of schemas (I think) that Bob is dealing with. The other scheme (the one you mean) is also a big problem... :(
- That's what I said...
- See GeoLocate for best practices in representing imperfect locality data: http://www.museum.tulane.edu/geolocate/
- Better text capture needed; perhaps from professional services: best quality OCR requires tuning of OCR engine for the specific document, for example
- Professional services are not necessarily better. What is more important is whether the special symbols used in taxonomic literature are properly recognised.
- If the special symbols are not in the specified character set, then they cannot be recognised by the OCR engine. For example, specifying English will omit the male and female symbols and the Latin ligatures from the valid symbol set.
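A simple post-OCR check along these lines can flag pages where an English-only character set would have dropped taxonomic symbols. This is a sketch, not part of any pilot's workflow; the ASCII-only character set is just an assumed stand-in for a narrow OCR configuration.

```python
import unicodedata

# Printable ASCII, as a stand-in for a narrowly configured OCR character set.
ASCII_ONLY = set(chr(c) for c in range(32, 127))

def at_risk_symbols(text, charset=ASCII_ONLY):
    """Return (sorted) characters in `text` that fall outside `charset`,
    i.e. symbols an OCR engine restricted to that set could not emit."""
    return sorted({c for c in text if c not in charset and not c.isspace()})

# Typical taxonomic text: sex symbols and a Latin ligature.
sample = "Lectotype ♂, paralectotype ♀; cf. Linnæus"
for c in at_risk_symbols(sample):
    print(f"U+{ord(c):04X} {unicodedata.name(c, '?')} {c!r}")
```

Running such a scan over the original page images' expected symbol inventory, versus the OCR output, is one way to make the "are special symbols properly recognised?" question measurable.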
09:55 - 10:00 Quentin Groom: Chenopodium (phytogeographic records from literature)
- Distribution of vascular plants
- What is the value of extracting georef information from legacy literature
- Example: Chenopodium vulvaria L.
- biogeographic information is very biased, temporally, spatially, taxonomically
- use literature to understand distribution patterns and compare the literature against herbarium specimen records: literature is better for old records, less (relatively) important for modern records. One reason is that old records are less well documented on the herbarium sheets but better in the literature (?!)
- What is the tool being used to mark up? How far along is the pilot? - Almost finished.
- I have not been able to notice much difference in quality in distribution patterns belonging to old and new specimens mentioned in literature. It can be very good, it can be very bad. It seems to be better the closer to civilization the collectors were.
- Just an example of how bad it can be: in Flore du Gabon we have specimens with half a set of coordinates as the only locality information... that's a circle around the earth.
- Literature accounts for a larger percentage of older records, but these are smaller in absolute terms - how skewed are they relative to their own time periods?
- Focus on prioritization before you start the work.
- This is an unusual case with a lot of data; most of us have very few records and thus no clue about collecting bias: do a gap analysis within EU-BON
10:00 - 10:05 Peter Hovenkamp: Nephrolepis - a revision of ferns containing treatments with varying detail and lots of synonyms
- increase outreach: the world would be a better place if they would recognize what I wrote about the world - well, just about the ferns would do.
- Nephrolepis is an economically important group
- Goal: mark-up of one article. Tool: GoldenGATE; very large number of taxonomic names to be parsed. Pilot finished, doc uploaded to Plazi, but with a number of issues due to the single-path approach of GoldenGATE.
- hybrids and suspected hybrids
- "problem Doc"
10:05 - 10:10 Tony Walduck: families Loranthaceae and Viscaceae
- floras of Tropical Africa
- make Kew flora data available via Pl
- Plazi and CDM
- Share structured data with the TRY database
10:10 - 10:15 Don Kirkup: Flora of Tropical East Africa, etc.
- 200 pages in one week; 20 pages/50 species in 7 hours using GG
- scalability
- what does a user want to know? Do we really know? Does the user want simple questions answered? This defines the level-one markup and establishes priorities
- What are the characters that lead to identification?
- Relationships within/between treatments: eaten by, parasitized by
- Link to grey literature
- USER FIRST: We need to constrain.
- Quantification of user requests by logging user requirements: names are very important, followed by occurrence, operational data and habitat (medium importance)
- granularity
- Think about pre-processing documents to mark up document parts (e.g. treatments, paragraphs) before feeding that into GoldenGate
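The pre-processing idea above - detect treatment boundaries first, then hand the wrapped blocks to GoldenGATE - could look like the following. The heading pattern ("Genus species Author" on its own line) is hypothetical and would need tuning per publication.

```python
import re

# Hypothetical treatment heading: capitalized genus, lowercase epithet,
# capitalized author surname, alone on a line. Real documents need
# publication-specific patterns.
BOUNDARY = re.compile(r"^[A-Z][a-z]+ [a-z-]+ [A-Z][a-z]+$", re.M)

def split_treatments(text):
    """Cut the document at each heading and wrap the blocks in <treatment>."""
    starts = [m.start() for m in BOUNDARY.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [f"<treatment>{text[s:e].rstrip()}</treatment>"
            for s, e in zip(starts, ends)]

doc = ("Aus bus Smith\nDescription of the first species.\n"
       "Aus cus Jones\nDescription of the second species.\n")
for t in split_treatments(doc):
    print(t)
```

Even this coarse segmentation would let the fine-grained (and attention-demanding) mark-up inside GoldenGATE start from per-treatment units instead of whole volumes.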
10:15 - 10:20 Thomas Hamann: Mark-up of Flora Malesiana and Flore du Gabon using FlorML
- https://github.com/thoha/FlorML
- Production, not pilot any more. Started during EDIT
- very high degree of atomisation; mark-up script using Perl, specifically written for these Floras using regular expressions
- process as much as possible, down to every single element
- 9 volumes in 2 months, 2200 pages at a high level of granularity - 9 months elapsed time? How many person-months of effort?
- I meant 9 volumes (2200 pages total) in 2 months. In 9 months I can do a lot more... ;-)
- Thank you for the clarification.
- So that's 1 person doing the work, including script development, excluding the OCR (which was outsourced).
- figures processed separately (photoshop)
- Flora Gabon, FM
10:20 - 10:25 Hong Cui: CharaParser
- Generate taxon-character matrices from character data gained by using CharaParser to mark up a set of ant publications
- inform revision efforts, publish through Pensoft
- OCR errors and variation in mark-up adequacy are always issues to be dealt with
- Is the development of a parser for diagnosis paragraphs worth the effort, considering that diagnoses mostly list only a few characters, which are relatively easily transcribed by humans?
- it needs continuous interaction between the person doing the markup and the biologist
- How much can be combined - the problem of white spots.
10:25 - 10:45 General Discussion
prioritization of literature
POINTS TO BE CONSIDERED
markup strategy
- how to decide what to convert first: take the top journals (Miller) by number of articles on the target topic
- do one journal, which makes mark-up easier; only works if the journal style is consistent
- Journal families by a single editor might also work.
- A complementary approach is to facilitate a distributed network of users to mark up content from sources where taxonomic articles are not concentrated, to fill in the gaps
- And what level of detail is required for what purposes? Is it in all cases necessary to identify all occurrences of a taxonomic name, or to parse all citations in the synonymy of a taxon to their reference?
- Is it interesting to have different strategies for different levels of data consistency in the literature?
- I believe so. You could even have complementary strategies where low-granularity mark-up is done according to one strategy and high-granularity mark-up extends it.
markup Costs
- how much does it cost to mark up?
- Who does the mark-up? If post-doc, will it help their career, ie can they publish anything to help them if their task is only to mark-up a document not contribute new knowledge?
- Cost and strategy: given a sufficient incentive, mark-up could develop into one of the tools used during a literature search. Taxonomists would then, as a matter of routine, mark up protologues and other important treatments they need to index anyway in order to be able to cite them. What direct benefits could mark-up offer to make itself indispensable as a tool?
Scaling
- Do we have timings per page / per mark-up item / etc. from the pilots?
- Not for trained operators - most of us were beginners when we did this, and acquired experience as we went along.
- Presumably getting faster with experience? Again, timings could be useful, as well as the time taken to become a proficient operator.
- Comparing costs needs to consider the information content per page (FM has 4x the content of a usual flora)
- Just noting that # of taxa is not a good metric, because the length of taxon descriptions can be very different depending on the publication. In some publications they take half a page, in others two pages or more.
Nomenclatural issues
Post-processing
- Best strategies to fix problems identified in existing markup (e.g. if georeference does not fit into known locality of collection/ observation)
- Potentially leads to version control of mark-up if having raw and fixed, possibly several fixed, versions.
Schemas used
- can we have a look at Thomas' (FM, FdG) schema?
- Kew
- See the Taxon paper on the Kew workflow
- See the ZooKeys paper on comparing literature schemas
GoldenGATE
- versioning: will documents still be readable in new versions of GoldenGATE?
- redesign the GoldenGATE interface, professionally and tested (fewer looks under the hood needed)
- see Hovenkamp's "problem doc"
- This has been taken care of in a number of updates; the initial problem was a PDF that would not be read.
- steps depend on each other in the markup
- alertness needed for long stretches (you can't save intermediary steps)
- dependence on proper markup in previous stages
- alertness: like selecting apples in markets - no more than 4 hours in a stretch; EU simultaneous translators work only 20-30 minutes at a stretch
- I can manage that.
- more error tolerance
- GoldenGATE should learn from error corrections made by user when marking up a document and automatically apply that correction to rest of document
- error correction should be possible at all the different steps
- more flexibility in formats (make it more obvious how to write analyzers that are specific); learning feature in GG
- nesting issues of items
- want to know what happens under the hood, so as to understand and write analyzers
- too much user intervention required
- get a combination of various
- Decoupling GoldenGATE, e.g. get treatment boundaries from preprocessing
- Better-described formatting (documentation is a general problem)
- GoldenGATE Web Services
- GG modules are available as individual modules callable as web services through OBOE (Oxford Batch Operation Engine, https://oboe.oerc.ox.ac.uk/)
- Services need to be called from scripts, but this liberates you from the formal GG workflow while you still benefit from GG automation
- Creation of wiki markup via XSLT is challenging
Not yet mentioned:
Quality control of documents: how do we ensure that the marked-up content corresponds with the original source?
How to handle errors in the original source document, for example "Tachgs" for "Tachys"? This example error is described in an accompanying erratum for the source document. Do we always want the mark-up to match the source, or use it as an opportunity to correct it ;-) ?
- For errata that are known prior to mark-up, adding them in prior to mark up is handy.
Quality control: help user to recognize/remember what has already been done, what he still should do, what would be nice to markup. This could be listed in a small area, with 3 columns (TO DO / DONE / NICE TO DO)
Also, on saving the document, GG could display a pop-up like "hey, there is no nomenclatural tag in your treatment, you should improve that" or "you didn't mark up any citation or reference - aren't there any?"
Use of encoded documents at other end of interchange
Adequacy and completeness of markup for consuming application
What is the measure of success? How is it measured?
Prioritization
- What makes sense to mark up?
- Can we create stopping rules?
- Do we understand the costs of markup?
- See Kew's approach: ask users
- hints: nomenclature, geographical information, keys, traits
Markup: how far to go?
- If there is no markup: who is going to do the georeferencing? Where shall it occur - in the markup process? (see "Who is doing what?")
- What to mark up: terminology
Ontologies
- Need to be able to talk between projects
Who is doing what?
- Granularity
- Georeferencing
- Language and translation
- Nomenclature
- Incomplete data: no dates on distribution records is often the case.
Workflow
- Can we define a markup workflow?
- Get experts involved for the respective steps: get the right experts in place, e.g. for preprocessing
- FM workflow
- Interaction with scientists is OK
Tools
- User Interfaces are extremely important, not for us, but for the Users.
OCR
- needs to be taken care of - but at what stage?
Implementation of workflow
- it needs both, software engineers and biologists (and probably some more)
Documentation as an issue
- XML is not for taxonomists - make it nice and smooth
input literature
- terminology can be very idiosyncratic
- very loose editorial control is problematic
Next steps
PILOT
Interoperability
Define workflow
Use OCR that recognizes all characters etc., i.e. accurate text capture -> farm this step out to professional vendors
- FM: done in India, but the results were not homogeneous. It was a company, so no quality control: QC needs to be part of the conditions.
- Get the right accuracy
- Build up a relationship with the vendor
- Needs a quality control by return
- Share experience of specifications
- Share the same vendor
- Generic markup: find out how much can be done by vendor
- Do not do anything in isolation!
- This is for large scale operation
Crowdsourcing for OCR
- markup should be advantageous: the immediate benefit should be more clear, and the process should be relatively easy to tip the scale for the individual taxonomist.
- Flora of Northumberland: use wiki,
- Lego tactic
- Taxonomists don't want to do anything that doesn't last
Propagate workflows in the public
- both for new digitization projects and for floras and faunas currently being produced
markup through simple MS Word macros
increase the incentives to do the markup
Granularity
increased mark-up
13:15 - 14:45 M4.2 interoperability of mark-up schemas A105
This section will present a brief overview and characterization of the exchange of data between our systems (Pensoft, Plazi, EOL, EDIT-CDM, GBIF, Antweb, HNS, KEW, Naturalis) and an assessment of pros and cons and where an emphasis should be given to enhance the exchanges and hopefully indicate a best practice.
Notes will be taken by all the participants using Etherpad http://new.okfnpad.org/p/pro-iBiosphere_integration_20131008
13:15 - 13:20 Introduction
13:20 - 13:25 Patricia Kelbert: [http://wiki.pro-ibiosphere.eu/wiki/Pilots#Pilot_3]
13:25 - 13:30 Guido Sautter: [1]
13:30 - 13:35 Thomas Hamann: Flora Malesiana-CDM
13:35 - 13:40 Don Kirkup: Flora of Tropical East Africa, etc.-CDM
13:40 - 13:45 Hong Cui: CharaParser
13:45 - 13:50 Markus Doering: published Materials Citation import into GBIF
13:55 - 14:00 Katja Schulz: Plazi / Pensoft - EOL (Darwin Core Archive)
14:00 - 14:05 Jordan Biserkov: Common query/response model
14:45 - 15:00 coffee break
15:00 - 16:30 M4.2 interoperability of mark-up schemas (ctd.: Solutions and steps forward)
This section will be used to discuss content, granularity and quality control issues: Who on which side will be responsible for what. Elements of best practices will be developed.