Journal Metadata Federation -- EADH AO Forum
Skype call with Daniel Zöllner (Würzburg / Bibsonomy), 2017-08-22
English summary:
I have talked with the computer scientists in Würzburg and we have come up with a new idea for a light-weight technical solution. They are running the "Bibsonomy" bibliographical data project, which we may be able to use. See it here: https://www.bibsonomy.org/
Importing: You can add publication data manually, via a DOI or as a bulk upload of Bibtex files. They have so-called scrapers, which collect bibliographical data from websites, and we could pay a student to scrape or otherwise collect metadata on a one-off basis for the journals which are no longer active. For the active journals, the idea would be to analyze their RSS feeds regularly, identify new articles and then scrape the metadata for import. This would work in a semi-automatic fashion.
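To make the semi-automatic workflow for the active journals more concrete, here is a minimal sketch of the RSS step, assuming the Python feedparser library; the feed URL and the state file are hypothetical placeholders, and the actual scraping of article pages and the import into Bibsonomy would follow as separate steps.

# Minimal sketch: poll a journal's RSS feed and report entries not seen before.
# Assumptions: the feedparser library is installed; feed URL and state file
# are hypothetical placeholders.
import json
import pathlib
import feedparser

FEED_URL = "https://zfdg.de/rss.xml"        # hypothetical feed URL
SEEN_FILE = pathlib.Path("seen_ids.json")   # remembers already processed entries

def load_seen():
    if SEEN_FILE.exists():
        return set(json.loads(SEEN_FILE.read_text()))
    return set()

def poll_feed():
    seen = load_seen()
    feed = feedparser.parse(FEED_URL)
    new_entries = []
    for entry in feed.entries:
        # Fall back to the link if the feed does not provide stable ids.
        entry_id = entry.get("id") or entry.get("link")
        if entry_id and entry_id not in seen:
            new_entries.append(entry)
            seen.add(entry_id)
    SEEN_FILE.write_text(json.dumps(sorted(seen)))
    return new_entries

if __name__ == "__main__":
    for entry in poll_feed():
        # Here the article page would be scraped for full metadata and the
        # result posted to the Bibsonomy group (not shown in this sketch).
        print(entry.get("title"), entry.get("link"))

Run regularly (e.g. via cron on the virtual machine mentioned in the notes below), this would keep the import up to date without manual checks.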
Metadata: They offer fields for almost all the metadata we need (considerably more than plain Dublin Core, as used in a standard RDF representation). And you can define custom fields, which we could use for the abstract in the original language, for example. TaDiRAH keywords, if they exist, could be added using the normal keyword scheme. The back-end format is BibTeX.
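As an illustration of the back-end format, a hypothetical BibTeX record (modelled on the XML example further down in this pad) might look roughly like this; the abstract_orig field and the tadirah: keyword prefix are assumptions about how custom fields and the normal keyword scheme could be used, not conventions prescribed by Bibsonomy.

@article{baasner1999hypertext,
  author        = {Baasner, Rainer},
  title         = {Digitalisierung - Geisteswissenschaften - Medienwechsel? Hypertext als fachgerechte Publikationsform},
  journal       = {Jahrbuch für Computerphilologie},
  volume        = {1},
  year          = {1999},
  url           = {http://computerphilologie.digital-humanities.de/jahrbuch/jb1/baasner.html},
  abstract      = {This would be the English-language abstract.},
  abstract_orig = {Das hier wäre der deutsche Abstract.},
  keywords      = {Digitalisierung, tadirah:creation}
}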
Retrieval: They have an API (but it is read-write; there is no read-only option, which is a bit risky) as well as several nice plugins (for TYPO3 and WordPress at least) which allow you to run a query on Bibsonomy and display a nicely formatted bibliography on a website. That uses the Citation Style Language (CSL) for formatting, so another standard. That could be a way of letting journals display the metadata.
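A read query could look roughly like the following sketch; the /api/posts endpoint with HTTP Basic authentication (user name plus API key) follows the public Bibsonomy REST API, but the parameter names should be checked against the current documentation, and the credentials, user and tag below are placeholders.

# Minimal sketch of a read query against the Bibsonomy REST API.
# Assumptions: /api/posts endpoint, HTTP Basic auth with user name + API key,
# and the parameter names below; all concrete values are placeholders.
import requests

API_URL = "https://www.bibsonomy.org/api/posts"
AUTH = ("some_user", "api-key-from-settings")   # placeholder credentials

params = {
    "resourcetype": "bibtex",   # publication posts rather than bookmarks
    "user": "some_user",        # later perhaps a shared group for the federation
    "tags": "zfdg",             # e.g. restrict the query to one journal's entries
}

response = requests.get(API_URL, params=params, auth=AUTH, timeout=30)
response.raise_for_status()
# The response is XML by default; a journal website (or the TYPO3/WordPress
# plugins mentioned above) would turn it into a formatted bibliography.
print(response.text[:500])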
This is not the professional RDF version, but it is a lot more manageable. And I know the people behind Bibsonomy well; they will certainly be willing to help us or even add a feature if we need it.
Notes
- Question: Could Bibsonomy fulfil our requirements?
- Metadata requirements: notably, abstracts in two languages; TaDiRAH keywords
- API requirements: read-only API access, structured queries
- Ingest/scraping capabilities: Forum Computerphilologie (one-off; partly automatic, partly manual?), Humanistisk Data (one-off), ZfdG (DHd => the exposed metadata could be improved! Could they publish a dump every month? The RSS feed has odd dates; if the entries were dated correctly, the RSS feed could be used to find new articles), specific journals on revues.org (Humanistica; scrape the existing holdings once, then pick up new entries via the RSS feed: http://jtei.revues.org/backend?format=rssdocuments), OJS in general (AIUCD: https://umanisticadigitale.unibo.it/index)
- ADHO? or DARIAH? Infrastructure on which the RSS tool runs: a virtual machine with 1 core and a bit of RAM, run some Java code, store which RSS feed has already been checked;
---------------------
Call for applications: contract for work (Werkvertrag) for a student
For the "European Digital Humanities Journal Metadata Federation" (EDHJMF), we are looking for a computer science student with an interest in web crawlers, metadata and Bibsonomy. The tasks in detail:
For the existing holdings of three journals:
- Crawl the websites of several online journals and identify the pages that correspond to journal articles
- Extract the article metadata from the respective pages as completely as possible
- Post the metadata to a Bibsonomy group
For one of the three and one further, new journal:
- Develop a script that regularly reads the RSS feeds, identifies new entries and posts the metadata to the Bibsonomy group
- Prepare the script so that it can run autonomously on a virtual machine
The tasks will be assigned as a contract for work (Werkvertrag). Questions about further details, the procedure and the scope of the contract can be clarified in a personal conversation. If you are interested, please contact Daniel Zoller (zoller@informatik.uni-wuerzburg.de) and/or Christof Schöch (c.schoech@gmail.com).
----------------------------
TODOs
- Find a student who would like to work on this project
Tasks for the student
- Based on our wish-list of metadata items, find a suitable way to represent this in an OAI-compatible format (e.g. qualified Dublin Core expressed in XML)
- Create a schema to validate according to our precise requirements
- Manually gather the metadata from Forum Computerphilologie (or get it from Fotis Jannidis after he had someone gather it)
- Gather the metadata from Humanistisk Data
- Transform the incoming metadata into the standard XML and validate it (see the validation sketch after this list)
- Discuss with DARIAH-DE people in Göttingen how to get this into the DARIAH-DE Repository; or find an alternative solution.
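For the transformation and validation step, a minimal sketch using Python and lxml might look as follows; the schema and record file names are hypothetical, and whether the schema ends up as XSD, RELAX NG or something else is still open.

# Minimal sketch: validate one metadata record against the (yet to be written)
# XML schema. File names are hypothetical placeholders; the library is lxml.
from lxml import etree

schema = etree.XMLSchema(etree.parse("journal-metadata.xsd"))
record = etree.parse("baasner1999.xml")

if schema.validate(record):
    print("record is valid")
else:
    for error in schema.error_log:
        print(error.line, error.message)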
List of metadata items
Basic metadata
- author first name and surname (mandatory, repeatable)
- article title (mandatory)
- journal name (mandatory)
- journal volume (optional)
- journal issue (optional)
- publication year (format: yyyy; mandatory)
- article pages (format: 123-456; optional)
- article language (optional)
- article URL (optional)
- article DOI (optional)
Further metadata
- Keywords (optional)
- TaDiRAH keywords: activity (optional)
- TaDiRAH keywords: object (optional)
- Abstract in English (optional)
- Abstract in original language of the article (optional)
=> It is actually not that straightforward to represent all of this in Dublin Core! Journal name, page numbers, DOI, TaDiRAH keywords and explicit language versions are all missing out of the box.
See this for a critique: https://reprog.wordpress.com/2010/09/03/bibliographic-data-part-2-dublin-cores-dirty-little-secret/
Journal article metadata in XML (hypothetical, non-valid example!!)
<?xml version="1.0" encoding="UTF-8"?>
<metadata
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/">
<dc:title>Digitalisierung - Geisteswissenschaften - Medienwechsel? Hypertext als fachgerechte Publikationsform</dc:title>
<dc:title xml:lang="eng">Digitisation - Humanities - Transformation? Hypertext as an appropriate form of publication</dc:title>
<dc:creator>Baasner, Rainer</dc:creator>
<dc:creator>Buchsner, Raimund</dc:creator>
<dc:subject type="free">Digitalisierung</dc:subject>
<dc:subject type="TaDiRAH_goal">creation</dc:subject>
<dcterms:abstract xml:lang="ger">das hier wäre der deutsche Abstract.</dcterms:abstract>
<dcterms:abstract xml:lang="eng">This would be the English-language abstract.</dcterms:abstract>
<dc:publisher>mentis</dc:publisher>
<dcterms:issued>1999</dcterms:issued>
<dc:type>Text</dc:type>
<dc:format>HTML</dc:format>
<dc:identifier>http://computerphilologie.digital-humanities.de/jahrbuch/jb1/baasner.html</dc:identifier>
<dc:identifier>doi</dc:identifier>
<dcterms:bibliographicCitation>Jahrbuch für Computerphilologie 1</dcterms:bibliographicCitation>
<dc:language xsi:type="dcterms:ISO639-3">ger</dc:language>
<dc:rights>not specified</dc:rights>
</metadata>
Skype-Meeting, Christof and Max from SUB Göttingen (June 27, 2017)
Goal: get more information on practical implementation of collecting, storing and providing OAI-compatible metadata records about journals
Questions
- What server-side software could we use to become an OAI-compatible metadata provider? (A minimal harvesting request against such a provider is sketched after this list.)
- Will a simple, Dublin Core-based data model (header + DC data) be sufficient for our requirements? We want to be able to represent: basic metadata (author, title, journal title, volume and issue, year, pages, language, URL, DOI), TaDiRAH keywords, free keywords, abstract in English, abstract in original language, abstract in further language(s).
- Can we create the initial set of OAI entries manually, before we start using a harvester? How do we best come up with a decent XML schema for this?
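On the protocol side, harvesting from such a provider uses the standard OAI-PMH verbs; a minimal ListRecords request might look like the sketch below, with a hypothetical base URL standing in for whichever repository we end up using.

# Minimal sketch of an OAI-PMH ListRecords request (standard protocol verbs);
# the base URL is a hypothetical placeholder.
import requests
from xml.etree import ElementTree

BASE_URL = "https://repository.example.org/oai"   # hypothetical OAI-PMH endpoint

response = requests.get(
    BASE_URL,
    params={"verb": "ListRecords", "metadataPrefix": "oai_dc"},
    timeout=30,
)
response.raise_for_status()

root = ElementTree.fromstring(response.content)
# Count the returned records; a real harvester would also follow resumptionTokens.
ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
print(len(root.findall(".//oai:record", ns)), "records harvested")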
Outcomes
- DARIAH offers to host the metadata in the DARIAH repository, where an OAI-PMH-compatible API is already available and where long-term access will be guaranteed
- There is the possibility to either manually enter metadata or to bulk ingest metadata
- A separate question is how to collect the metadata, either using a harvester (fully automated), or by having the journals send metadata to us for semi-automatic ingestion into the DARIAH repository
- It looks like most of our required metadata should be easy to implement using Dublin Core, which is at the heart of OAI
- It looks like the requirement to further specify the language of an entry (e.g. English vs. Italian abstract) can be met when using qualified DC in an XML-based format (e.g. RDF/XML)
- We still need to clarify whether the fact that a keyword comes from TaDiRAH can be specified.
- One first step will be to design an XML schema for the OAI/DC metadata
- Another first step will be to collect sample metadata from the "historical" journals and try to represent them in a valid way using the schema
Skype-Meeting 16 June 2017 (13:00-13:45)
News from Human IT (editor: Mats Dahlström)
- Rather an information science / library and archive journal than a digital humanities journal
- But: DHN conference 2017 in Gothenburg will publish some papers there; maybe an opening towards more DH there
Reasons for delay in realising the Journal Metadata Federation
- Among the three European journals scheduled to join the JMF, only one is currently fully active
- The Italian journal will come out with the first issue soon; the French journal is not online yet, will probably launch in early 2018; Human IT is not directly linked to DH (at least yet)
- Practical issue: capacity of the journals to provide abstracts in more than one language (edit: and to provide TaDiRAH keywords)
Other thoughts
- What about other, now discontinued journals? E.g. "Humanistisk data"; or "Forum Computerphilologie"; with similar strategy of collecting metadata.
- the journal Humanistisk Data (Norwegian) is fully digitized (according to Espen S. Ore); Annika has investigated the issue and the following will happen:
- (1) If the digitized issues and useful metadata are available via the former Computational Humanities Unit at Uni Bergen, this will be used by us and hosted by Uni Bergen;
- (2) If the digitized issues are not in a well-structured, easily accessible format, the Unit for Editing and Documentation (EDD) at Uni Oslo offers to host the entire run with (fresh) metadata and full content for us to use
- as a preliminary pointer, there is a link to some digitized issues from the 1990s: http://www.hd.uib.no/humdata/index.html
- the journal Human IT (Swedish/English) is online in its entirety here: http://etjanst.hb.se/bhs/ith/humanit.htm; the current editor-in-chief is Mats Dahlström / Lars Höglund. The website says about the goals of the journal: "Tidskriftens målgrupp återfinns hittills främst inom den akademiska världen, både inom och utom Norden, men Human IT hoppas också nå en bredare läsekrets intresserad av en avancerad och mer eftertänksam diskussion om nya medier och människans villkor i nätverkssamhället." (The target audience of the journal is first and foremost to be found in the academic sphere, both inside and outside of the Nordic countries. However, Human IT hopes that it will also reach a broader readership of people interested in a thought-through discussion about new media and mankind's agency in the digital society; translated by Annika.) I think this counts as DH more than it counts as regular library and information science; thus, it should be included in our collection!
- The journal "Forum Computerphilologie" is available online
- Should we start small and then build up? Or should we simply postpone this activity? Use the money for something else now or just keep it for later?
Solution so that we can start even before several of the current journals are online
- Get a Virtual Machine from ADHO infrastructure committee
- Collect metadata from currently existing journals and from legacy journals in an OAI-PMH-compatible format and store them on the ADHO server
- Ask ZfdG to provide their metadata in OAI-PMH to us
- Ask ZfdG to pull the metadata they want from our OAI repository and show it on a page there
- Annika to ask "Humanistisk data" about metadata from them (DONE); Christof to ask Fotis for metadata about "Forum Computerphilologie" (DONE)
- Christof tries to find someone to do this now; if he can't find someone, he will let Fabio know
- What do we collect? Standard metadata, TaDiRAH keywords (activities and objects), abstract in original and in English
- It is also a good way of testing the idea in a first step
Future of the disbursement model via the EADH AO forum
- We see the need to somehow perpetuate the one-off arrangement in a more regular way - @Elisabeth: any concrete proposal?
- Basic idea could be to keep the current model of splitting the money in two parts: one of which is then split among the AOs, the other is used for some specific, joint purpose
- For the time being, this joint purpose could also be the JMF once more, if there is need for more funds; but other ideas are of course welcome
@Elisabeth: as far as I can see you have not spent any of the money of the first arrangement yet. If I am not mistaken, we are already in a second phase (year). Could you draw up a budget and calculate expenditure?
Next meeting: in Montréal with Annika via Skype.
Description
Build a loose federation of interconnected journals rather than one large multi-lingual journal. So, set up a small technical infrastructure to share metadata about current journal issues among the various European journals of DH, so that each individual journal website can have a section where it displays the current tables of contents of the other journals. This could be expanded to international DH journals, of course. Why is it useful: Apart from the existing German journal "Zeitschrift für digitale Geisteswissenschaften" (zfdg.de), the French Humanistica is in the process of starting a journal on "Humanités numériques". So it appears there is a wish to have journals in a local language, something which gives visibility and recognition to DH in a local context and keeps a strong connection with the respective linguistic communities. But of course, it would be a pity if the linguistic communities were separated.
Automatically connecting all journals would give international visibility to all journals and demonstrate the (linguistic and otherwise) diversity of the DH network. Many people who know one or two DH journals may not know the others, and this "journal federation" could counter that. The more journals join, the higher the positive network effect. The main counter-argument I see right now is that this kind of sharing of metadata could also be done manually, by just sending the table of contents around to the other journals. After all, compared to getting an issue ready, this is not a big effort. An automatic process would ensure this exchange happens regularly and quickly, but has a higher technological overhead. Note that the German journal would be very interested in this and the French (planned) journal as well. The DHN doesn't have a designated journal yet. Discussions of the prior DHN board came to the conclusion that if need be, the (Swedish) journal Human IT could be asked to 'host' DHN content. However, this year's conference and discussions of the new board have shown that there is little interest in creating a Nordic (linguistically speaking) journal. One reason for this is the publication reward system of Norway, where new journals have to undergo a national evaluation process to be regarded as scholarly and then be voted for by a national committee to get 'high impact' status. Since DH is not a discipline in this system, there is no way to have a designated Nordic DH journal ranked high enough to be an interesting publication outlet. The Swedish, Danish, and Finnish (and Icelandic) systems are somewhat different; however, most prefer publication in English anyway and people tend to publish with international, already established journals. (Annika)
How much would it cost? It all depends on how the journals manage their metadata in the back-end. If they have well-structured metadata available via an API anyway, then it shouldn't be too difficult. However, OJS, for example, does not appear to be perfect in this respect, at least not out of the box (http://forum.pkp.sfu.ca/t/rest-api-for-metadata/7949/3). I'll ask the German journal how difficult they believe this to be. (The French may or may not use revues.org for their journal, with OJS / Lodel in the background.) The second issue is how to distribute the metadata: the solution with the least conversion effort is probably to collect it all in a shared format and make it available for reuse by other journals from there. That means running a small server that does this. Then, each journal can pull the other journals' metadata into a page in whatever way they see fit.
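To illustrate the pull side from a journal's point of view, here is a minimal sketch that reads a shared feed (hypothetical URL; whether the exchange format will be an API, OAI-PMH or RSS is still to be decided) and renders a small HTML table-of-contents block, again assuming the Python feedparser library.

# Minimal sketch of the display side: pull the federation's shared feed and
# render an HTML block listing the other journals' current articles.
# The feed URL is a hypothetical placeholder.
import html
import feedparser

FEDERATION_FEED = "https://example.org/jmf/latest.rss"   # hypothetical shared feed

def toc_html(max_items=10):
    feed = feedparser.parse(FEDERATION_FEED)
    items = []
    for entry in feed.entries[:max_items]:
        title = html.escape(entry.get("title", "untitled"))
        link = html.escape(entry.get("link", "#"))
        items.append('<li><a href="{}">{}</a></li>'.format(link, title))
    return '<ul class="jmf-toc">\n' + "\n".join(items) + "\n</ul>"

if __name__ == "__main__":
    print(toc_html())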
Journals to involve
(Strategy: First make it work with several European journals, then expand beyond, e.g. DHQ.)
- ZfdG - Zeitschrift für digitale Geisteswissenschaften. http://zfdg.de/ Status: journal is online and the editors are very interested.
- Italy. Status: In the process of being started. Will probably have first issue out in autumn 2017.
- France/Humanistica. Status: In the process of being started. Will probably have first issue out in early 2018. Will use revues.org (hence, OJS/Lodel)
- DHN: ? Human IT (Sweden) -- https://humanit.hb.se/about (See notes above)
- Studia Digitalia (Romania): http://digihubb.centre.ubbcluj.ro/journal/index.php/digitalia
Functional requirements
- Collect article-level metadata from each participating journal
- Collect abstracts in more than one language (including English)
- Store metadata and abstracts somewhere.
- Make metadata and abstracts available via an API or RSS feed.
- Help journals display metadata and abstracts on their website
Organisational details
- Christof to find an IT person / student to do the work
- Billed directly to EADH if possible?