Workshops, Meetings

DINA Workshop, Stockholm-Edinburgh 29.-31. July 2014

Background Discussions

2014-07-29

Summary: Rob and Elspeth described BG-BASE with respect to database structure, data entry and import procedures. They described RBGE digitization procedures and results as well as RBGE methods for establishing digitization priorities. Discussion points emphasized system constraints, related issues of system and data management, and alternative strategies for import of Herbarium data to Specify 6. The issues discussed remain important for ongoing migration and digitization at institutions supporting the activities of most DINA consortium members.

Participants:

  • Elspeth Haston, Assistant Curator, Digitisation, RBGE
  • Robert Cubey, Plant Records Officer, RBGE
  • Marios Theodoropoulos, IT Specialist, RBGE
  • Karin Karlsson, Bioinformatics Group Leader, Swedish Museum of Natural History (NRM)
  • Kevin Holston, Bioinformatics Systems Analyst, Swedish Museum of Natural History (NRM)

General Discussion

RBGE digitization overview and migration status. BG-BASE V6.8, produced by BG-BASE Inc. & BG-BASE UK Ltd., is currently the collections management database at the RBGE. BG-BASE software sits on the database management system Open Insight v8.0.8. The software uses multi-value fields to handle record characteristics such as common names, synonyms, data sources, flower color, etc. [similar to the repeating fields in several FileMaker Pro databases managed at the NRM, including the Botany department]. Forms in the software cannot easily be modified for users or data workflows, and users have to navigate through all fields and subforms to arrive at target fields, i.e., many mouse clicks or (shortcut) keyboard strokes.

Daily exports from BG-BASE synchronize changes to a MySQL database that is used for the RBGE web database presentation portal. The purpose of the MySQL database is to make data available for the online catalogue and to provide a more flexible system for attaching accessory datasets required for mapping and higher-level database management. The MySQL database structure differs from BG-BASE by including, for example, tables for managing multi-value field data. Each multi-value is stored as a single record in the SQL database. The current structure would need to be improved in order to adapt it to other purposes. Some features of the SQL database may be useful in preparing BG-BASE collection object datasets for import to Specify, especially the BG-BASE data stored in multi-value fields.
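The idea of storing each multi-value as its own record can be sketched as follows. This is only an illustration, not the RBGE schema: the delimiter, the table name `taxon_common_name`, and its columns are hypothetical, and SQLite stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical delimiter: the source does not say how multi-values are exported.
MV_DELIMITER = "|"


def split_multivalue(raw):
    """Split one exported multi-value string into clean, non-empty values."""
    if raw is None:
        return []
    return [value.strip() for value in raw.split(MV_DELIMITER) if value.strip()]


def load_common_names(conn, records):
    """Store every value of a multi-value field as a separate child row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS taxon_common_name "
        "(taxon_id TEXT, position INTEGER, common_name TEXT)"
    )
    for taxon_id, raw in records:
        for position, value in enumerate(split_multivalue(raw), start=1):
            conn.execute(
                "INSERT INTO taxon_common_name VALUES (?, ?, ?)",
                (taxon_id, position, value),
            )
    conn.commit()


# Invented example rows; SQLite is a stand-in for the MySQL database.
conn = sqlite3.connect(":memory:")
load_common_names(conn, [("T1", "silver birch | warty birch"), ("T2", None)])
rows = conn.execute(
    "SELECT taxon_id, position, common_name FROM taxon_common_name "
    "ORDER BY taxon_id, position"
).fetchall()
```

Keeping a `position` column preserves the order of values within the original field, which may matter if the first value is treated as preferred.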

Approximately 670 000 (22 %) of the 3 million herbarium sheets at RBGE (E) have a record registered in the database. A total of 250 000 images of RBGE herbarium sheets are available online. These images are provided in tiled format (layers for providing an image at different resolutions); the original files are saved in .RAW & .tif formats.

RBGE priorities for defining digitization sub-projects have focused on type collections, taxonomic research at the RBGE, and geographic strengths. All of the approximately 50 000 type specimens have been digitized. An example of taxon-based priority is digitization of the RBGE ginger family collection, motivated by staff research interests. Similarly, ongoing research and exceptional floristic representation for Chile, Argentina, Southwest Asia, and Australia motivated digitization of specimens collected from these geographic regions. Standards for geographic units categorizing the world flora, proposed by Kew Gardens, have been modified at the RBGE with respect to collecting emphasis. For example, RBGE recognizes “12a”, “12b”, and “12c” rather than Kew unit “12”. Another important prioritization factor is the fluctuating availability of resources. Funding initiatives that cover digitization can focus on particular taxonomic groups, regions, dataset compilations, IT solutions, etc. Funding for images, for example, can sometimes be directed towards groups with partially completed collection object datasets from previous efforts; these new images are “retro-linked” to collection object records, which may be further improved during the digitization project.

Identifying collection subsets for digitization therefore poses challenges that are related to collections management practices and organization. It can be much more difficult to focus on geographic digitization than on taxonomic digitization; when herbarium sheets are arranged primarily by taxonomic group, it may be difficult to gather all exemplars collected from a country or region. Digitization as part of RBGE loan procedures involves additional prioritization methods. Because it is difficult to tell at a glance whether one sheet in a curated batch is imaged, RBGE staff may digitize a batch, the whole cabinet, or a section of a family in response to loan requests for single specimens. Some digitization has proceeded in efforts to update determinations and rearrange specimens. For example, approximately 40 000 specimens were digitized during recent re-curation of RBGE Compositae.

Workshop Discussion

RBGE Specify evaluations are intended to document the ability of Specify 6 to handle RBGE Herbarium collections management, including tasks handled poorly by BG-BASE. For example, because of BG-BASE import limitations, the Kew World Plant Index taxonomic names have not been integrated with BG-BASE data during the years that the Kew Index has been available at the RBGE. This integration is therefore an important test.

Further discussions on database management for the living collections are ongoing, both at the RBGE and within the DINA consortium. The general strategy is to focus on the RBGE Herbarium datasets and potentially incorporate the Living Collections at a later stage, defined primarily by the scope of available resources. In particular, this includes the availability of RBGE staff to participate in evaluating workflows, testing new procedures, and contributing constructive feedback for system development.

The collection management evaluation is relevant to the RBGE IT team, Bioinformatics staff, and curators in the RBGE herbarium. Further discussions and system implementation are likely to include some of these staff, especially in the context of data sharing with GBIF and other institutions with botanical datasets.

2014-07-30

Summary: Workshop participants met with Prof. Pete Hollingsworth, Director of Science. Dave, Rob and Elspeth summarized the objectives of the visit in the context of RBGE interests, which center on improving Herbarium collections management through migration to Specify. Karin and Kevin described the DINA project, progress in system development and implementation, and the scope and formal structure of this open-source collaboration. Discussion points emphasized the scope of Specify 6 schema as well as system sustainability, security, and interoperability.

Participants:

  • Pete Hollingsworth, Director of Science, RBGE
  • David Harris, Herbarium Curator and Deputy Director of Science, RBGE
  • Elspeth Haston, Robert Cubey, Marios Theodoropoulos, Karin Karlsson, Kevin Holston

Workshop discussion

Summary: Discussions resulted in a working list of technical recommendations. Export source data as UTF-8 rather than manually replacing each incorrect character discovered in the dataset. Verify source (authority) indexes for taxonomic names when interpreting labels. Shift the method for creating .csv files from a union (merge) of BG-BASE and Kew Index names to MySQL JOIN procedures, selecting Kew Index record data as preferred in cases of source dataset overlap (note: homonyms must be considered when matching between the two source datasets, many of which are identified in the RBGE database).
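The JOIN-based recommendation can be sketched as follows. This is a minimal illustration under stated assumptions, not the RBGE procedure: the table and column names (`kew`, `bgbase`, `kew_id`, `bg_id`) are hypothetical, SQLite stands in for MySQL, and the example rows are invented. Matching on name plus author is one simple way to keep homonyms apart; real data may need a more careful match.

```python
import sqlite3


def merge_names(kew_rows, bgbase_rows):
    """Return all Kew records plus only those BG-BASE records without a Kew match."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kew (kew_id TEXT, name TEXT, author TEXT)")
    conn.execute("CREATE TABLE bgbase (bg_id TEXT, name TEXT, author TEXT)")
    conn.executemany("INSERT INTO kew VALUES (?, ?, ?)", kew_rows)
    conn.executemany("INSERT INTO bgbase VALUES (?, ?, ?)", bgbase_rows)
    query = """
        SELECT k.name AS name, k.author AS author,
               'KEW' AS source, k.kew_id AS source_id
        FROM kew AS k
        UNION ALL
        SELECT b.name, b.author, 'BGBASE', b.bg_id
        FROM bgbase AS b
        LEFT JOIN kew AS k
               ON lower(k.name) = lower(b.name)
              AND lower(k.author) = lower(b.author)
        WHERE k.kew_id IS NULL      -- keep BG-BASE rows that Kew does not cover
        ORDER BY 1, 2
    """
    merged = conn.execute(query).fetchall()
    conn.close()
    return merged


# Invented example rows, including one homonym with a different author string.
kew = [("K1", "Betula pendula", "Roth"), ("K2", "Betula alba", "L.")]
bgbase = [
    ("B1", "BETULA PENDULA", "Roth"),   # overlaps with K1, so Kew is preferred
    ("B2", "Betula pendula", "Hort."),  # homonym: same name, different author
    ("B3", "Alnus exampla", "Smith"),   # only in BG-BASE
]
merged = merge_names(kew, bgbase)
```

Because overlapping BG-BASE rows are filtered out before the file is written, the duplicate names never reach the Workbench.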

Although the Specify UI can be used to merge taxonomic names (i.e., to re-associate child taxa and collection objects with one name and delete the duplicate name from the database), pre-import handling of duplicate names is preferred. A final note on synonymies: Workbench imports of collection objects are more complicated when synonyms are present in the database, requiring the user to manually select which name (the taxon name or its preferred synonym) is used for a determination.

A name/name characteristic that requires more discussion among RBGE staff is “cultivar”, used in BG-BASE determinations. Standardization of this taxonomic name category was not addressed previously because BG-BASE names and author citations below genus rank are contained in a single free-form text field. Names at several taxonomic ranks (e.g., species, subspecies) have been distinguished by an additional cultivar name, and it is unclear how this name information should be recorded to support data entry procedures and database queries for determined collection objects. Further discussion may include DINA consortium members working with herbaria (James Macklin and Heather McNairn, Ottawa) as well as Specify developers in Kansas. One possible solution that avoids establishing artificial ranks in the taxonomic tree is to treat “cultivars” in the Specify database schema as common names, associated with an available scientific name.

Pre-Workshop Activities

Summary: Previous conference calls and related discussions at the RBGE resulted in a decision to concentrate on Betulaceae records for import tests. A preliminary version of the Betulaceae dataset was modified using Specify Workbench tools and imported to a local database installation (Marios's laptop). Queries (Excel spreadsheet searches) of the import file to find exact name matches resulted in 37 duplicates; including author field strings resulted in over 200 duplicate names. Attempts to upload were unsuccessful due to errors. Further troubleshooting was addressed during the workshop.
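The spreadsheet searches for duplicate names could also be scripted. The sketch below counts rows that share the same key, first on the name alone and then on name plus author. The field names `name` and `author` and the example rows are hypothetical; the comparison is case-insensitive, which is an assumption about what should count as a match.

```python
from collections import Counter


def duplicate_keys(rows, key_fields):
    """Return {key: count} for every key that occurs more than once."""
    counts = Counter(
        tuple(row[field].strip().lower() for field in key_fields) for row in rows
    )
    return {key: n for key, n in counts.items() if n > 1}


# Invented example rows.
names = [
    {"name": "Betula pendula", "author": "Roth"},
    {"name": "BETULA PENDULA", "author": "Roth"},
    {"name": "Betula pendula", "author": "Hort."},
    {"name": "Alnus exampla", "author": "Smith"},
]
by_name = duplicate_keys(names, ["name"])
by_name_author = duplicate_keys(names, ["name", "author"])
```

Comparing the two results shows which repeated names are true duplicates and which are distinguished only by their author strings.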

The migration strategy is to import taxonomic names in BG-BASE and the Kew Index followed by an import of BG-BASE collection object records. Imports follow standard Specify Workbench procedures which include field and record validation protocols. SQL scripts will be used to update taxonomic name records in the Specify database to reflect synonymies recorded in the source datasets. The scripts will be made available on the DINA-wiki once this subproject is completed.

Participants:

  • Elspeth Haston, Robert Cubey, Marios Theodoropoulos

MySQL and Specify 6 were installed on a new server, and the Specify Wizard was used to set up a new database instance. The authority taxonomic names list for Poaceae was imported during collection setup, and the server database was accessed with Specify 6 user authentication procedures, i.e. via the Specify UI.

Betulaceae name records from BG-BASE and the Kew World Plant Names Index were exported to .csv format and merged. This file was a union of source records that included duplicate records for names below the rank of genus. There were discrepancies in style or format between source file data, e.g., all caps for higher-level BG-BASE name spellings, abbreviations for authors and punctuation to distinguish among series of authors. Each row includes a limited classification above genus rank, and record content from source files, such as ID references to senior synonyms, was not exported. In BG-BASE, all name epithets below genus are stored in a single column (a multi-value field), which includes infraspecific taxa. Infraspecific taxa below subspecies were mapped to a single field.

Because it is a world list, the Kew Index includes a more extensive list of names per taxonomic group than the RBGE, but there are some RBGE Herbarium names that are absent from the Kew Index. Other taxonomic groups with a larger proportion of the world flora represented in the RBGE Herbarium will have a greater overlap with Kew Index names than Betulaceae. The final exports of Kew Index records will include fields to cover additional nomenclatural (e.g., status and name use) or taxon (e.g., distribution) data.

RBGE Migration Workshop

2014-07-29

Summary: Following a series of test imports, a new server was installed prior to the workshop. The test datasets were restricted to Betulaceae records, imported using the Specify Workbench. Discussion points concerning practical issues involved field selection for import tests, consequences of alternative mapping to Workbench fields, and strategies for handling content errors and duplicate name records. Additional Workbench functionality was tested, and more detailed procedures for imports were identified. Strategies for import were examined in greater detail, allowing tasks for follow-up investigation to be identified.

Workshop tasks were 1) to test Specify Workbench import of taxonomic names .csv file (merged names from BG-BASE and Kew World Names Index) and 2) to troubleshoot import file and import strategy, which involved the identification of fields and values required to run SQL update scripts. Import results informed decisions on additional fields required for import files, procedures for handling duplicates, and strategies for synonym imports.

Participants:

  • Elspeth Haston, Robert Cubey, Marios Theodoropoulos, Karin Karlsson, Kevin Holston

Workshop Session, Task 1.

The Betulaceae names file was saved to the Specify database on the new server as a Workbench record set. Field inclusion and mapping were identical to the previous import attempt, with Workbench tools used for final editing (e.g., global deletion of default string values “NULL”). “Cultivar” names were mapped to the rank “subforma”; this rank was added to the Specify taxon tree definition.
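The global deletion of “NULL” placeholder strings was done with Workbench tools, but it could equally be applied to the .csv file before upload. The sketch below is one way to do that. The function name is hypothetical, and it assumes that a cell consisting only of the placeholder should become empty.

```python
import csv
import io


def blank_placeholders(csv_text, placeholder="NULL"):
    """Return the CSV text with cells equal to the placeholder emptied."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Only whole-cell matches are blanked; "NULLIFY" would be left alone.
        writer.writerow(
            ["" if cell.strip() == placeholder else cell for cell in row]
        )
    return out.getvalue()


# Invented example content.
cleaned = blank_placeholders("Genus,Species,Author\nBetula,NULL,L.\n")
```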

There were no errors when saving field modifications, although removing rows using the Workbench delete tool resulted in a system failure (the dataset could not be saved). The rows targeted for manual deletion were records for names above the species rank; these had mapping errors, so the Workbench did not validate the dataset for upload. These rows were therefore not removed; the dataset was saved successfully after all data mapped to species fields was manually removed.

Using MySQL Workbench tools, the Specify database was saved as a self-contained dump file for restoring the database to pre-import status. The Workbench record set was uploaded and viewed without committing the records to the database.

Workshop Session, Task 2.

Synonyms can be established after import of taxonomic names using Specify tree editing tools, but this is impractical considering the size and scope of nomenclatural data managed in the current RBGE database and the amount of data coming from the Kew checklists. Using source database synonym relationship data for post-import SQL UPDATE statements has not been attempted during previous DINA migrations. Establishing procedures for such updates will be, however, an important component of upcoming DINA imports for NRM Botany and Zoology collections. Import of name IDs from source datasets was investigated, therefore, to preserve RBGE synonymies and implement scripts to update records with new Specify database IDs.

Relationships among names in the source datasets can be imported effectively because synonyms in the source databases are managed as record associations, which are based on database IDs. Synonymy in Specify is also managed using record IDs to link name records and their senior synonyms, but the IDs for Specify 6 records are available only after the record imports are completed. Importing the source database record IDs to Specify, mapped to user-defined text fields in the Taxon table, would preserve these synonym relationships.
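Because the post-import update depends on every senior-synonym ID pointing at a record that is actually in the import file, a pre-import check may be useful. The sketch below lists senior-synonym IDs that have no matching record ID. The field names `source_id` and `senior_id` are hypothetical stand-ins for the columns later mapped to the user-defined text fields.

```python
def unresolved_senior_ids(rows, id_field="source_id", senior_field="senior_id"):
    """Return sorted senior-synonym IDs that match no record ID in the file."""
    known = {row[id_field] for row in rows if row.get(id_field)}
    missing = {
        row[senior_field]
        for row in rows
        if row.get(senior_field) and row[senior_field] not in known
    }
    return sorted(missing)


# Invented example rows: S3 points at S9, which is not in the file.
records = [
    {"source_id": "S1", "senior_id": ""},
    {"source_id": "S2", "senior_id": "S1"},
    {"source_id": "S3", "senior_id": "S9"},
]
problems = unresolved_senior_ids(records)
```

Any ID reported here would leave its synonym unlinked after import, so it is cheaper to resolve before upload.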

The Workbench Schema Configuration does not include these user-defined text fields, so a further task was defined during this session: to determine how to accomplish this and implement the proposed post-import strategy for recreating RBGE synonymies. After the session, email responses (quick and thorough!) from Tim Noble and Andy Bentley from the Specify development group at Lawrence, Kansas, provided detailed instructions for modifying the Specify Workbench XML to include these fields. Implementing these XML changes was scheduled for the second day of the workshop.

2014-07-30

Summary: The Betulaceae dataset (2 017 records) was validated and successfully uploaded to the Specify database with required fields available following Specify Workbench schema XML modifications. Duplicate species records were created due to the mapping of new test values to species rank fields, such as source ID and synonym status. Various migration issues were discussed, including procedures for handling character sets during dataset imports and managing taxonomic names.

Participants:

  • Elspeth Haston, Robert Cubey, Marios Theodoropoulos, Karin Karlsson, Kevin Holston

A new Betulaceae .csv file was prepared following procedures from the previous workshop session, with additional fields for source database record IDs for each record and its senior synonym as well as the Kew Index synonym status values. The XML changes added the user-defined fields “Text1” and “Text2” to the Workbench schema and mapped them to all ranks below family.

The instructions from the Specify group (Tim Noble) for making these changes are:

Find the section of the file config/specify_workbench_datamodel.xml that starts with
<table classname="edu.ku.brc.specify.datamodel.Taxon" table="taxononly" tableid="4000">
and, for each rank that you want to upload text1/text2 add definitions for those fields, for example, for species:
<field column="Species Text1" name="speciesText1" type="java.lang.String" length="32"/>
<field column="Species Text2" name="speciesText2" type="java.lang.String" length="32"/>
so the species fields would be
<field column="Species" name="species" type="java.lang.String" length="64"/>
<field column="Species Author" name="speciesAuthor" type="java.lang.String" length="128"/>
<field column="Species CommonName" name="speciesCommonName" type="java.lang.String" length="128"/>
<field column="Species Source" name="speciesSource" type="java.lang.String" length="64"/>
<field column="Species GUID" name="speciesGUID" type="java.lang.String" length="128"/>
<field column="Species Text1" name="speciesText1" type="java.lang.String" length="32"/>
<field column="Species Text2" name="speciesText2" type="java.lang.String" length="32"/>
Then, in the file specify_workbench_upload_def.xml add mappings for those fields:
<field table="Taxon" name="species" treename="taxon" maptofield="name" rankid="220"/>
<field table="Taxon" name="speciesAuthor" treename="taxon" maptofield="author" rankid="220"/>
<field table="Taxon" name="speciesCommonName" treename="taxon" maptofield="commonName" rankid="220"/>
<field table="Taxon" name="speciesSource" treename="taxon" maptofield="source" rankid="220"/>
<field table="Taxon" name="speciesGUID" treename="taxon" maptofield="guid" rankid="220"/>
<field table="Taxon" name="speciesText1" treename="taxon" maptofield="text1" rankid="220"/>
<field table="Taxon" name="speciesText2" treename="taxon" maptofield="text2" rankid="220"/>
The pattern is the same for other ranks.
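Since the pattern repeats for each rank, the XML lines can be generated rather than typed by hand. The sketch below reproduces the species lines shown above. Only the species rank ID (220) comes from the instructions; the rank name and rank ID for any other rank are inputs that must be taken from the local taxon tree definition, and the assumption that other ranks follow the same column and field naming is not confirmed by the source.

```python
from xml.sax.saxutils import quoteattr


def text_field_xml(rank, rankid, length=32):
    """Return (datamodel_lines, upload_def_lines) for Text1/Text2 of one rank."""
    prefix = rank[0].lower() + rank[1:]  # "Species" -> "species"
    datamodel, upload_def = [], []
    for n in (1, 2):
        column = f"{rank} Text{n}"
        name = f"{prefix}Text{n}"
        datamodel.append(
            f"<field column={quoteattr(column)} name={quoteattr(name)} "
            f'type="java.lang.String" length="{length}"/>'
        )
        upload_def.append(
            f'<field table="Taxon" name={quoteattr(name)} treename="taxon" '
            f'maptofield="text{n}" rankid="{rankid}"/>'
        )
    return datamodel, upload_def


# Species is the only rank whose ID (220) is given in the instructions above.
datamodel_lines, upload_lines = text_field_xml("Species", 220)
```

The generated lines still have to be pasted into the two configuration files at the positions described in the instructions.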


Kew Index synonym status was mapped to the species “source” field. Non-UTF-8 characters exported from source datasets were not replaced for this file version, and all source database IDs were mapped to species rank fields. Parsing and reformatting scripts used to prepare the .csv file did not separate infraspecific name strings or assign source database IDs to the fields of the corresponding infraspecific ranks. Preparing a dataset that contained source database IDs, so that the XML changes and Workbench validation could be tested, was a higher priority for this workshop.
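Where a source system cannot export UTF-8 directly, the file can be re-encoded in one step instead of correcting characters one by one. The sketch below is a minimal example. The source encoding of the BG-BASE export is not stated in these notes, so the `cp1252` default is purely an assumption and must be replaced with the real encoding.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes already decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode legacy bytes as UTF-8; bytes that are valid UTF-8 are untouched."""
    if is_valid_utf8(data):
        return data
    # Assumption: the export uses source_encoding. A wrong guess yields wrong text.
    return data.decode(source_encoding).encode("utf-8")


# Invented example: an author name with an umlaut in a legacy encoding.
legacy = "Müller".encode("cp1252")
converted = to_utf8(legacy)
```

Checking for valid UTF-8 first avoids double-encoding a file that was already exported correctly.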

Several genus names were uploaded with all capital letters although they appeared with only the first letter in upper case after previous imports. Genus names were not duplicated, however, so Workbench recognition of new name strings was case-insensitive. This import discrepancy may reflect the import sequence of new Specify name records, where the BG-BASE genus name (in caps) is accepted before the Kew Index record.
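One way to make the result independent of import sequence is to normalize all-caps names in the .csv file before upload. The sketch below is only a suggestion and was not part of the workshop procedure. It changes a string only when it is entirely upper case, so names that are already correctly cased pass through unchanged.

```python
def normalize_caps(name):
    """Convert an all-caps name to initial-capital form; leave others unchanged."""
    stripped = name.strip()
    if stripped.isupper():
        return stripped.capitalize()
    return stripped


# Invented example values.
examples = [normalize_caps(n) for n in ("BETULA", "Alnus", " BETULACEAE ")]
```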

2014-07-31

Summary: Further discussion on import fields, export procedures for source databases, and the SQL UPDATE synonyms script. Documentation for RBGE meetings and case studies will be posted on the DINA Biowikifarm, probably under “Botany” or “herbaria”. Synonymies were successfully established (post-workshop) in the Specify taxon table using an SQL UPDATE script, which recovers these important taxonomic name relationships after importing names through the Specify Workbench.

Participants:

  • Elspeth Haston, Robert Cubey, Marios Theodoropoulos, Karin Karlsson, Kevin Holston

The SQL script that will be used to update records to establish synonymies is being tested. The current iteration, which successfully updates the imported RBGE taxon records, is:

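-- Link each imported synonym to its accepted name.
-- Text1 holds the source database ID of the record;
-- Text2 holds the source database ID of its senior synonym.
UPDATE `dina_nrm`.`taxon` AS t
    INNER JOIN
    (SELECT
        t1.`Text1` AS t1SOURCE_TAXON_ID,
        t1.`Text2` AS t1SOURCE_SYNONYM_ID,
        t1.`TaxonID` AS t1ID,
        t1.`FullName` AS t1TAXON_FullName,
        t1.`IsAccepted` AS t1VALID_STATUS,
        t1.`AcceptedID` AS TARGET_SYNONYM_ID,
        t2.`TaxonID` AS t2TARGET_TAXON_ID,
        t2.`FullName` AS t2SYNONYM_FullName
    FROM
        `dina_nrm`.`taxon` AS t1
    -- Self-join: find the record whose source ID equals this record's senior-synonym ID.
    -- With a LEFT JOIN, a Text2 value without a match gives a NULL target ID.
    LEFT JOIN `dina_nrm`.`taxon` AS t2 ON t1.`Text2` = t2.`Text1`
    WHERE
        t1.`Text1` IS NOT NULL) AS temp ON t.`TaxonID` = temp.t1ID
SET
    t.`AcceptedID` = temp.t2TARGET_TAXON_ID,
    t.`IsAccepted` = 0
-- Only update records that have a senior-synonym ID and are not yet linked.
WHERE
    t.`AcceptedID` IS NULL
        AND NOT (t.`Text2` IS NULL);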
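The logic of the update can be tried out on a small table before running it against a production database. The sketch below is not the MySQL script: SQLite stands in for MySQL (and does not support the UPDATE ... JOIN syntax, so a correlated subquery is used), the rows are invented, and two extra guards are added as assumptions. A record is skipped when its senior-synonym ID matches no record, and when it points at itself.

```python
import sqlite3


def link_synonyms(conn):
    """Set AcceptedID and IsAccepted for rows whose Text2 matches another row's Text1."""
    cur = conn.execute(
        """
        UPDATE taxon
        SET AcceptedID = (SELECT s.TaxonID FROM taxon AS s
                          WHERE s.Text1 = taxon.Text2),
            IsAccepted = 0
        WHERE AcceptedID IS NULL
          AND Text2 IS NOT NULL
          AND Text2 <> Text1            -- assumption: ignore self-references
          AND EXISTS (SELECT 1 FROM taxon AS s
                      WHERE s.Text1 = taxon.Text2)  -- assumption: skip unmatched IDs
        """
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon (TaxonID INTEGER PRIMARY KEY, FullName TEXT, "
    "Text1 TEXT, Text2 TEXT, IsAccepted INTEGER, AcceptedID INTEGER)"
)
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Accepted name", "S1", None, 1, None),
        (2, "Synonym name", "S2", "S1", 1, None),
        (3, "Orphan synonym", "S3", "S9", 1, None),  # S9 is not in the table
    ],
)
updated = link_synonyms(conn)
result = conn.execute(
    "SELECT TaxonID, IsAccepted, AcceptedID FROM taxon ORDER BY TaxonID"
).fetchall()
```

Running the update twice changes nothing the second time, because linked rows no longer have a NULL `AcceptedID`.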

This page was last modified on 6 July 2015, at 11:30. Content is available under Attribution-Share Alike Non-commercial 2.5 or later, Unported unless otherwise noted.