pro-iBiosphere_pilots_20131008
pro-iBiosphere workshop Berlin
October 8, 2013
http://wiki.pro-ibiosphere.eu/wiki/Workshop_Berlin_2_(8_October_2013)_-_Promote_and_foster_the_development_%26_adoption_of_common_mark-up_standards_and_interoperability_between_schemas
M4.2 Pilots: mark-up issues (pilots)
Notes
09:35 - 9:40 Jeremy Miller: Spiders (pilot)
- World Spider Catalogue
- http://research.amnh.org/iz/spiders/catalog/
- Background: a spider catalogue that includes references to various elements (figures, etc.)
- use case to show a databased catalogue with links
- Idea: combine data from different treatments into one concept (if they represent the same thing)
Figures are all uploaded to MorphBank
- if all the links are in one place, they can be easily redistributed.
- how to go from here: tackle the most important journals (i.e. Zootaxa) and offer tools (e.g. GoldenGATE) to convert the literature
- ca. 12,000 taxonomic papers have been published on spiders, 75% of them in 223 journals
- But is journal selection a useful part of the workflow? What are the advantages of selecting papers from a single journal? In my experience, a lot of lay-out and formatting issues are not under editorial control. Not even in Blumea, where I am Editor-in-Chief...
- Should we find metrics to assess and describe the heterogeneity of layouts?
- In my experience, around 80-90% of formats are approximately (! the devil is in the details) the same; the problem is that the remaining 10-20% is so heterogeneous that it takes a lot of work.
- With regards to the details, this can be circumvented by using a gradual approach that starts off with the formats that are most common, and then builds on those formats by taking care of more and more divergences. Unfortunately it is impossible to get a perfect result...
- Granted all that, it is still not a given that selecting papers by journal is the most efficient way to assemble sets of papers with similar mark-up issues. Maybe a quick-pass initial analysis could be developed that identifies issues?
09:40 - 9:45 Sylvia Mota de Oliveira: Bryophyta - Campylopus pilot
Suriname is part of the Guianas, and data from Flora of Suriname and other literature sources for the region can be used as a platform for the completion of Flora of the Guianas. Mark-up and importation of data into CDM must guarantee that Flora of the Guianas supersedes the older Flora of Suriname.
- use the flora as platform offered to scientists to add more results
- improve markup
- links to treatments already in place
- structure of CDM a problem: how to supersede earlier treatments (Suriname vs Guianas)
- use scripts instead of GoldenGATE to mark up the entire corpus
- How much configuration of GoldenGate was required? How much modification of FM PERL scripts was necessary?
- We still have to make the modifications to the scripts, but all in all I estimate it's less than a week's work and testing. Apparently the use of a specific script (such as the FlorML Perl scripts) works for a very standard and consistent type of literature, such as some Floras (the main problem is that Flora of the Guianas uses a different format for citations, which get atomised, and that has to happen properly - so testing is important). It also helps with the large number of pages to be marked up. GoldenGATE is efficient in a different context, where literature sources have heterogeneous formats and are spread over different sources - such as the example given by Jeremy Miller.
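The regex-driven script approach discussed above can be sketched minimally. This is a hypothetical illustration only: the heading pattern and XML element names below are invented for the example, and the real FlorML scripts contain far more rules (citation atomisation, synonymy, etc.).

```python
import re

# Hypothetical flora heading: "1. Campylopus flexuosus (Hedw.) Brid."
# (number, binomial, author citation). Real Floras need many more rules.
HEADING = re.compile(r"^(\d+)\.\s+([A-Z][a-z]+\s[a-z-]+)\s+(.+)$")

def mark_up(lines):
    """Wrap heading lines in <taxon> tags and all other lines in <description>."""
    out = []
    for line in lines:
        m = HEADING.match(line)
        if m:
            number, name, author = m.groups()
            out.append(f'<taxon number="{number}">'
                       f'<name>{name}</name><author>{author}</author></taxon>')
        else:
            out.append(f"<description>{line}</description>")
    return out

sample = ["1. Campylopus flexuosus (Hedw.) Brid.",
          "Stems 1-4 cm, leaves erect-spreading."]
for tagged in mark_up(sample):
    print(tagged)
```

The point of the sketch is the trade-off noted in the discussion: such rules are cheap to run over thousands of consistent pages, but every deviating format needs its own rule and testing.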
09:45 - 9:50 Teodor Georgiev: Eupolybothrus Chilopoda
- about 25 valid species
- 36 taxon names
- around 280 treatments, 130 of which are already marked up (mostly through GoldenGate)
- goal is to produce a cybertaxonomic checklist that accumulates all knowledge about each taxon
- Markup workflow is fine for simple documents with treatments
- Need Release Notes / a "what's new?" brief description with each new version of GoldenGATE
- I support that suggestion, adding that good documentation is also a priority!
- Suggestion to redesign the user interface of GoldenGATE.
09:50 - 9:55 Donat Agosti: Ants
- Goal is to link treatment citations to the cited treatments
- tool: stable HTTP URIs
- Character data are parsed from descriptions
- pilot group
- Anochetus from Madagascar
- cost/benefit of legacy conversion relative to encoding/enhancement of new literature
- Are morphological legacy descriptions uniform enough to justify the efforts towards parsing them? Maybe not - we need a stopping rule.
- This depends on the legacy work. Some are very good, some are absolutely awful (it depends on editorial policies and the author of the work).
- Yes - that's why we need the stopping rule: when to stop trying to parse the awful ones. And I suspect that even good descriptions can be difficult to unify in one scheme - but I'm speaking from experience in ferns, not ants.
- I'm not sure it's possible to unify them in one scheme. Bob Morris can probably explain that very well.
- Sorry - by scheme I merely meant the same set of characters and character states, not the kind of schemas (I think) that Bob is dealing with. The other scheme (the one you mean) is also a big problem... :(
- That's what I said...
- See GeoLocate for best practices in representing imperfect locality data: http://www.museum.tulane.edu/geolocate/
- Better text capture needed; perhaps from professional services: best quality OCR requires tuning of OCR engine for the specific document, for example
- Professional services are not necessarily better. What is more important is whether the special symbols used in taxonomic literature are properly recognised.
- If the special symbols are not in the specified character set, then they cannot be recognised by the OCR engine. For example, specifying English will omit the male and female symbols and the Latin ligatures from the valid symbol set.
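A simple post-OCR check along these lines can flag pages where an English-only character set would have dropped taxonomic symbols. This is a sketch, not part of any pilot's workflow; the ASCII-only character set is just an assumed stand-in for a narrow OCR configuration.

```python
import unicodedata

# Printable ASCII, as a stand-in for a narrowly configured OCR character set.
ASCII_ONLY = set(chr(c) for c in range(32, 127))

def at_risk_symbols(text, charset=ASCII_ONLY):
    """Return (sorted) characters in `text` that fall outside `charset`,
    i.e. symbols an OCR engine restricted to that set could not emit."""
    return sorted({c for c in text if c not in charset and not c.isspace()})

# Typical taxonomic text: sex symbols and a Latin ligature.
sample = "Lectotype ♂, paralectotype ♀; cf. Linnæus"
for c in at_risk_symbols(sample):
    print(f"U+{ord(c):04X} {unicodedata.name(c, '?')} {c!r}")
```

Running such a scan over the original page images' expected symbol inventory, versus the OCR output, is one way to make the "are special symbols properly recognised?" question measurable.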
09:55 - 10:00 Quentin Groom: Chenopodium (phytogeographic records from literature)
- Distribution of vascular plants
- What is the value of extracting georef information from legacy literature
- Example: Chenopodium vulvaria L.
- biogeographic information is very biased, temporally, spatially, taxonomically
- use literature to understand distribution patterns and compare the literature against herbarium specimen records: literature is better for old records, less (relatively) important for modern records. One reason is that old records are less well documented on the herbarium sheets but better in the literature (?!)
- What is the tool being used to mark up? How far along is the pilot? - Almost finished.
- I have not been able to notice much difference in quality in distribution patterns belonging to old and new specimens mentioned in literature. It can be very good, it can be very bad. It seems to be better the closer to civilization the collectors were.
- Just an example of how bad it can be: in Flore du Gabon we have specimens with half a set of coordinates as the only locality information... that's a circle around the earth.
- Literature accounts for a larger percentage of older records, but these are smaller in absolute terms - how skewed are they relative to their own time periods?
- Focus on prioritization before you start the work.
- This is an unusual case with a lot of data; most of us have very few records and thus no clue about collecting bias: do a gap analysis within EU-BON
10:00 - 10:05 Peter Hovenkamp: Nephrolepis - a revision of ferns containing treatments with varying detail and lots of synonyms
- increase outreach: the world would be a better place if they would recognize what I wrote about the world - well, just about the ferns would do.
- Nephrolepis is an economically important group
- Goal: mark-up of one article. Tool: GoldenGATE; very large number of taxonomic names to be parsed. Pilot finished, doc uploaded to Plazi, but with a number of issues due to the single-path approach of GoldenGATE.
- hybrids and suspected hybrids
- "problem Doc"
10:05 - 10:10 Tony Walduck: families Loranthaceae and Viscaceae
- floras of Tropical Africa
- make Kew flora data available via Pl
- Plazi and CDM
- Share structured data with the TRY database
10:10 - 10:15 Don Kirkup: Flora of Tropical East Africa, etc.
- 200 pages in one week; 20 pages/50 species in 7 hours using GG
- scalability
- what does a user want to know? Do we really know? Does the user want simple questions answered? This defines the level-one markup and establishes priorities
- What are the characters that lead to identification?
- Relationships within/between treatments: eaten by, parasitized by
- Link to grey literature
- USER FIRST: We need to constrain.
- Quantification of user requests by logging user requirements: names are very important, followed by occurrence, operational data and habitat (medium importance)
- granularity
- Think about pre-processing documents to mark up document parts (e.g. treatments, paragraphs) before feeding that into GoldenGate
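The pre-processing idea above - detect treatment boundaries first, then hand the wrapped blocks to GoldenGATE - could look like the following. The heading pattern ("Genus species Author" on its own line) is hypothetical and would need tuning per publication.

```python
import re

# Hypothetical treatment heading: capitalized genus, lowercase epithet,
# capitalized author surname, alone on a line. Real documents need
# publication-specific patterns.
BOUNDARY = re.compile(r"^[A-Z][a-z]+ [a-z-]+ [A-Z][a-z]+$", re.M)

def split_treatments(text):
    """Cut the document at each heading and wrap the blocks in <treatment>."""
    starts = [m.start() for m in BOUNDARY.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [f"<treatment>{text[s:e].rstrip()}</treatment>"
            for s, e in zip(starts, ends)]

doc = ("Aus bus Smith\nDescription of the first species.\n"
       "Aus cus Jones\nDescription of the second species.\n")
for t in split_treatments(doc):
    print(t)
```

Even this coarse segmentation would let the fine-grained (and attention-demanding) mark-up inside GoldenGATE start from per-treatment units instead of whole volumes.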
10:15 - 10:20 Thomas Hamann: Mark-up of Flora Malesiana and Flore du Gabon using FlorML
- https://github.com/thoha/FlorML
- Production, not pilot any more. Started during EDIT
- very high degree of atomisation; mark-up script using Perl, specifically written for these Floras using regular expressions
- process as much as possible, down to every single element
- 9 volumes in 2 months, 2200 pages at a high level of granularity - 9 months elapsed time? How many person-months of effort?
- I meant 9 volumes (2200 pages total) in 2 months. In 9 months I can do a lot more... ;-)
- Thank you for the clarification.
- So that's 1 person doing the work, including script development, excluding the OCR (which was outsourced).
- figures processed separately (photoshop)
- Flora Gabon, FM
10:20 - 10:25 Hong Cui: CharaParser
- Generate taxon-character matrices from character data gained by using CharaParser to mark up a set of ant publications
- inform revision efforts, publish through Pensoft
- OCR errors and variation in mark-up adequacy are always issues to be dealt with
- Is the development of a parser for diagnosis paragraphs worth the effort, considering that diagnoses mostly list only a few characters, which are relatively easily transcribed by humans?
- it needs continuous interaction between the person doing the markup and the biologist
- How much can be combined - the problem of white spots.
10:25 - 10:45 General Discussion
prioritization of literature
POINTS TO BE CONSIDERED
markup strategy
- how to decide what to convert first: take the top journals (Miller) by number of articles on the target topic
- do one journal, which makes mark-up easier; only works if the journal style is consistent
- Journal families by a single editor might also work.
- A complementary approach is to facilitate a distributed network of users to mark up content from sources where taxonomic articles are not concentrated, to fill in the gaps
- And what level of detail is required for what purposes? Is it in all cases necessary to identify all occurrences of a taxonomic name, or to parse all citations in the synonymy of a taxon to their reference?
- Is it interesting to have different strategies for different levels of data consistency in the literature?
- I believe so. You could even have complementary strategies where low-granularity mark-up is done according to one strategy and high-granularity mark-up extends it.
markup Costs
- how much does it cost to mark up?
- Who does the mark-up? If post-doc, will it help their career, ie can they publish anything to help them if their task is only to mark-up a document not contribute new knowledge?
- Cost and strategy: given a sufficient incentive, mark-up could develop into one of the tools used during a literature search. Taxonomists would then, as a matter of routine, mark up protologues and other important treatments they need to index anyway in order to be able to cite them. What direct benefits could mark-up offer to make itself indispensable as a tool?
Scaling
- Do we have timings per page / per mark-up item / etc. from the pilots?
- Not for trained operators - most of us were beginners when we did this, and acquired experience as we went along.
- Presumably getting faster with experience? Again, timings could be useful, as well as the time taken to become a proficient operator.
- Comparing costs needs to consider the information content per page (FM has 4x the content of a usual flora)
- Just noting that # of taxa is not a good metric, because the length of taxon descriptions can be very different depending on the publication. In some publications they take half a page, in others two pages or more.
Nomenclatural issues
Post-processing
- Best strategies to fix problems identified in existing markup (e.g. if georeference does not fit into known locality of collection/ observation)
- Potentially leads to version control of mark-up if having raw and fixed, possibly several fixed, versions.
Schemas used
- can we have a look at Thomas' (FM, FdG) schema?
- Kew
- See the Taxon paper on the Kew workflow
- See the ZooKeys paper on comparing literature schemas
GoldenGATE
- versioning: will documents still be readable in new versions of GoldenGATE?
- redesign the GoldenGATE interface, professionally and tested (fewer looks under the hood needed)
- see Hovenkamp's "problem doc"
- This has been taken care of in a number of updates; the initial problem was a PDF that would not be read.
- steps depend on each other in the markup
- alertness needed for long stretches (you can't save intermediary steps)
- dependence on proper markup in previous stages
- alertness: like selecting apples in markets - no more than 4 hours in a stretch; EU simultaneous translators work only 20-30 minutes at a stretch
- I can manage that.
- more error tolerance
- GoldenGATE should learn from error corrections made by user when marking up a document and automatically apply that correction to rest of document
- error correction should be possible at all the different steps
- more flexibility in formats (make it more obvious how to write analyzers that are specific); learning feature in GG
- nesting issues of items
- want to know what happens under the hood, so as to understand and write analyzers
- too much user intervention required
- get a combination of various
- Decoupling GoldenGATE, e.g. get treatment boundaries from preprocessing
- Better-described formatting (documentation is a general problem)
- GoldenGATE Web Services
- GG modules are available as individual modules callable as web services through OBOE (Oxford Batch Operation Engine, https://oboe.oerc.ox.ac.uk/)
- Services need to be called from scripts, but this liberates you from the formal GG workflow while you still benefit from GG automation
- Creation of wiki markup via XSLT is challenging
Not yet mentioned:
Quality control of documents: how do we ensure that the marked-up content corresponds with the original source?
How to handle errors in the original source document, for example "Tachgs" for "Tachys"? This example error is described in an accompanying erratum for the source document. Do we always want the mark-up to match the source, or use it as an opportunity to correct it ;-) ?
- For errata that are known prior to mark-up, adding them in prior to mark up is handy.
Quality control: help user to recognize/remember what has already been done, what he still should do, what would be nice to markup. This could be listed in a small area, with 3 columns (TO DO / DONE / NICE TO DO)
Also, on saving the document, GG could display a pop-up like "hey, there is no nomenclatural tag in your treatment, you should improve that" or "you didn't mark up any citation or reference - aren't there any?"
Use of encoded documents at other end of interchange
Adequacy and completeness of markup for consuming application
What is the measure of success? How is it measured?
Prioritization
- What makes sense to mark up?
- Can we create stopping rules?
- Do we understand the costs of markup?
- See Kew's approach: ask users
- hints: nomenclature, geographical information, keys, traits
Markup: how far to go?
- If there is no markup: who is going to do the georeferencing? Where shall it occur - in the markup process? (see "Who is doing what?")
- What to mark up: terminology
Ontologies
- Need to be able to talk between projects
Who is doing what?
- Granularity
- Georeferencing
- Language and translation
- Nomenclature
- Incomplete data: no dates on distribution records is often the case.
Workflow
- Can we define a markup workflow?
- Get experts involved for the respective steps: get the right experts in place, e.g. for preprocessing
- FM workflow
- Interaction with scientists is OK
Tools
- User Interfaces are extremely important, not for us, but for the Users.
OCR
- needs to be taken care of - but at what stage?
Implementation of workflow
- it needs both, software engineers and biologists (and probably some more)
Documentation as an issue
- XML is not for taxonomists - make it nice and smooth
input literature
- terminology can be very idiosyncratic
- very loose editorial control is problematic
Next steps
PILOT
Interoperability
Define workflow
Use OCR that recognizes all characters etc., i.e. accurate text capture -> farm this step out to professional vendors
- FM: done in India, but the results were not homogeneous. It was a company, so no quality control: QC needs to be part of the conditions.
- Get the right accuracy
- Build up a relationship with the vendor
- Needs a quality control by return
- Share experience of specifications
- Share the same vendor
- Generic markup: find out how much can be done by vendor
- Do not do anything in isolation!
- This is for large scale operation
Crowdsourcing for OCR
- markup should be advantageous: the immediate benefit should be more clear, and the process should be relatively easy to tip the scale for the individual taxonomist.
- Flora of Northumberland: use wiki,
- Lego tactic
- Taxonomists don't want to do anything that doesn't last
Propagate workflows in the public
- both for new digitization projects and for floras and faunas currently being produced
markup through simple MS Word macros
increase the incentives to do the markup
Granularity
increased mark-up
13:15 - 14:45 M4.2 interoperability of mark-up schemas A105
This section will present a brief overview and characterization of the exchange of data between our systems (Pensoft, Plazi, EOL, EDIT-CDM, GBIF, Antweb, HNS, KEW, Naturalis) and an assessment of pros and cons and where an emphasis should be given to enhance the exchanges and hopefully indicate a best practice.
Notes will be taken by all the participants using Etherpad http://new.okfnpad.org/p/pro-iBiosphere_integration_20131008
13:15 - 13:20 Introduction
13:20 - 13:25 Patricia Kelbert: [http://wiki.pro-ibiosphere.eu/wiki/Pilots#Pilot_3]
13:25 - 13:30 Guido Sautter: [1]
13:30 - 13:35 Thomas Hamann: Flora Malesiana-CDM
13:35 - 13:40 Don Kirkup: Flora of Tropical East Africa, etc.-CDM
13:40 - 13:45 Hong Cui: CharaParser
13:45 - 13:50 Markus Doering: published Materials Citation import into GBIF
13:55 - 14:00 Katja Schulz: Plazi / Pensoft - EOL (Darwin Core Archive)
14:00 - 14:05 Jordan Biserkov: Common query/response model
14:45 - 15:00 coffee break
15:00 - 16:30 M4.2 interoperability of mark-up schemas (ctd.: Solutions and steps forward)
This section will be used to discuss content, granularity and quality control issues: Who on which side will be responsible for what. Elements of best practices will be developed.