October 26, 2011

Prioritizing Data Management Goals

Question: If the UMBS Information Manager could have an intern/assistant for at least a semester, what would that person do? Which goal is the most pressing and would lead to the greatest payoff?

Potential Goals:

Data Forensics Project - identify completed research projects with unarchived, high-quality data sets and begin placing them in the Research Gateway

Target IGERT-BART Data - In some ways a subset of the Data Forensics Project but with a greatly restricted target population and timespan.

Research Gateway Rampage - Crank away on improvements (e.g., the New Table Wizard module, the Drupal 7 migration)

Current Research Metadata Entry - Work to increase contributions from ongoing research projects

Housework - Monitor incoming housing applications, update database, search for and add new publications

Great exercise. Now let us order by priority:

1. Target IGERT-BART Data - In some ways a subset of the Data Forensics Project but with a greatly restricted target population and timespan.

2. Current Research Metadata Entry - Work to increase contributions from ongoing research projects

3. Housework - Monitor incoming housing applications, update database, search for and add new publications

4. Research Gateway Rampage - Crank away on improvements (e.g., the New Table Wizard module, the Drupal 7 migration)

5. Data Forensics Project - identify completed research projects with unarchived, high-quality data sets and begin placing them in the Research Gateway

Items 1-3 could conceivably be part of a single internship, although 1 warrants more than a semester while 2 and 3 are ongoing.

Posted by kkwaiser at 01:12 PM | Comments (0) | TrackBack

April 05, 2011

On DOIs and Data

I am going to waste a great title on a boring post. One of my pie-in-the-sky hopes is to advance to the point where datasets within the Research Gateway are assigned Digital Object Identifiers (DOIs). I know this would help establish legitimacy in the eyes of our researchers but, honestly, I do not know much about the underpinnings of the DOI system. Here are a few pointers:

There is an official DOI website, but you are better off starting at the Wikipedia page.

DataCite is a DOI service specifically for datasets (Wikipedia page).

If a DOI is a unique identifier then, as I understand it, the Handle System is what makes sure the DOI resolves to the right place (Wikipedia page).
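Since a DOI is just "10." plus a numeric registrant prefix, a slash, and a publisher-chosen suffix, a first sanity check on candidate identifiers can be scripted. This is a sketch only: the pattern follows a common heuristic and checks shape, not whether the DOI actually resolves.

```python
import re

# A DOI looks like "10.<registrant>/<suffix>". This is a syntax-shape
# check only; passing it does not mean the DOI resolves via the Handle
# System.
DOI_PATTERN = re.compile(r'^10\.\d{4,9}/\S+$')

def looks_like_doi(identifier: str) -> bool:
    """Return True if the string is syntactically shaped like a DOI."""
    return bool(DOI_PATTERN.match(identifier.strip()))
```

Something like this could screen dataset records before any registration step, whatever form that eventually takes.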

Questions to answer:

Q: Is the University of Michigan associated with DataCite at all? Is DeepBlue?

Yes, ICPSR is a member of DataCite and contributed datasets receive a DOI. I don't think DeepBlue has this capability but I asked them anyway. They should.

Q: How complicated and expensive would an automated DOI registration system be?

Q: How complicated and expensive would a manual DOI registration system be?

Posted by kkwaiser at 01:29 PM | Comments (0) | TrackBack

January 31, 2011

Cleaning up the data

Now that the housing application is "done" I am going to catch up on data stuffs. Things that need to happen:

- Switch back to DataSet/DataFile approach
- Update Panel template and Views to pull info from DataFile
- Clean up data from REUs and Frontiers students
- Load data from Bob
- Load bibliography entries from Bob
- Break for lunch...

Posted by kkwaiser at 04:37 PM | Comments (0)

November 11, 2010

Notes on NSF Programs related to bio-research collections

Advancing Digitization of Biological Collections (ADBC) - "This program seeks to create a national resource of digital data documenting existing biological collections...The national resource will be structured at three levels: a national hub, thematic networks based on collaborative groups of collections, and the physical collections."

Home Uniting Biocollections (HUB) - "forming the coordinating scientific team...oversee implementation of standards and best practices for the collections, plan for the long-term sustainability of the national resource, facilitate communication and standards for training, and assure that results are disseminated to the scientific community utilizing collections, the collections community, and other similar efforts internationally."

Thematic Collections Networks (TCN) - "will conduct the digitization effort at a number of collections...justified by a research theme...integrate with other ongoing digitization activities, such as the new collaborative networks funded under the Improvements to Biological Research Collections Program (BRC)...BRC-funded collaborative projects will be expected to become part of this national resource"

Improvements to Biological Research Collections (BRC) - Next due date: July 22, 2011

Posted by kkwaiser at 01:07 PM | Comments (0)

September 09, 2010

Images directly into an access database

The proposed workflow for the digitization of our vascular plants collection currently looks like this:

1) Volunteer takes photos of specimen label
2) Volunteer places barcode onto specimen sheet
3) Put both barcode and photo into same row of database
4) Send database to UM herbarium, where they will sync the labels/barcodes with the records they have already digitized.

Problems:

Is there a way to get the photos to feed directly into a database field? The barcode scanner can feed directly in. This plan falls apart if the image (or a pointer to it) is not automagically inserted into the database.
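Setting the Access/OLE question aside, the core of step 3 is just pairing a scanned barcode with a pointer to the photo in one row. Here is a minimal sketch using sqlite3 as a stand-in for Access, and assuming we store a file path rather than the image itself (the barcode value and path are hypothetical):

```python
import sqlite3

# Stand-in for the Access database: one row pairs a specimen barcode with
# a pointer (file path) to the label photo. Storing the path rather than
# an embedded image keeps the database small and the photos manageable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE specimens (barcode TEXT PRIMARY KEY, photo_path TEXT)"
)

def record_specimen(barcode: str, photo_path: str) -> None:
    """Insert one barcode/photo pair, as captured by the volunteer."""
    conn.execute(
        "INSERT INTO specimens (barcode, photo_path) VALUES (?, ?)",
        (barcode, photo_path),
    )

# Hypothetical example row.
record_specimen("UMBS0000123", "photos/UMBS0000123.jpg")
row = conn.execute(
    "SELECT photo_path FROM specimens WHERE barcode = ?", ("UMBS0000123",)
).fetchone()
```

The open question remains whether the camera can be made to write the path into the row as automatically as the barcode scanner does.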

References:

This article explains how to store images in the database, link to the images, and use VB to display the images.

This old post (Access 2000?) claims you can feed an image from a webcam directly into an OLE field. Doesn't appear to translate directly to Access 2007.

The comment from Razorking in this post indicates that it is simple to feed barcodes into databases; why not photos?

Another post, not sure how useful it is. Talks about feeding webcam images into Access.

Some VB code examples here and here.

Posted by kkwaiser at 08:57 AM | Comments (0)

June 23, 2010

Limnology course data

Troy Keller, 2010's Limnology instructor, stopped by with the idea of collaborating to archive data collected by his students this summer. Here are some notes:

Lakes:
Douglas
Burt
Long
Black
1 other, I believe

Variables:
secchi depth
chlorophyll
plankton (zoo)
physico-chemical - TP, TN (pH, DO, temp profile)
benthic grab sample (dredge bottom for inverts)

Classes meet on Tuesday and Thursday. Data collection will commence on Tuesday, June 29th, on Douglas Lake. GPS locations will be taken at each point, several points per lake.

One option is to create a project for the class and put their work (abstract, methods, data, personnel) under it. Will need to figure out how to organize the data (e.g., by lake) and how/whether to keep it separate from non-class data.

Posted by kkwaiser at 05:12 PM | Comments (0)

June 09, 2010

Building a "Sensor" Content Type

I've been toying with the idea of creating a "Sensor" content type for the Drupal site. As of now, the fields would mimic what is generally found on a sensor spec sheet but the Alliance for Coastal Technologies also has an implementation worth copying.

Once a sensor is created, then research projects, research sites and data files (or data sets?) that use a sensor could (node) reference it. There are several reasons to pursue this idea:

- A GMap mash-up could easily show where we have sensors deployed
- A dynamic list of the sensors, with their particulars, could be easily generated.
- At some point, the sensor metadata could be included in the overall metadata sheet that accompanies a data set and data file.

A use case example would be a researcher who wants to know if we are collecting a particular variable. A search of deployed sensors for that variable could indicate the general location, data sets and contact personnel.
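That use case could be sketched as a plain data structure before touching Drupal. The field names below are illustrative stand-ins, not the actual content type fields, and the deployed sensors are invented examples:

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stand-in for the proposed "Sensor" content type.
@dataclass
class Sensor:
    model: str
    variables: list               # variables the sensor measures
    location: str                 # research site where it is deployed
    datasets: list = field(default_factory=list)
    contact: str = ""

def find_sensors(sensors, variable):
    """Return deployed sensors that measure the given variable."""
    return [s for s in sensors if variable in s.variables]

# Invented example deployments.
deployed = [
    Sensor("thermistor string", ["temperature"], "Douglas Lake",
           ["lake_temp_profile"], "K. Kwaiser"),
    Sensor("gas analyzer", ["carbon dioxide"], "forest tower"),
]
hits = find_sensors(deployed, "temperature")
```

In Drupal the same query would presumably fall out of a View filtered on the node-referenced sensor's variable field.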

Posted by kkwaiser at 04:24 PM | Comments (0)


March 08, 2010

Digitizing biological collections

I received an email from our Resident Biologist, Bob Vande Kopple, pointing me to this website.

This spurred me to hash out a quick plan for digitizing our biological collections. I posted it at the above site as a comment but am reposting it here for posterity's sake:

Great timing. Our biological station has a small collection (~20,000 floral and faunal specimens) which we are just beginning to digitize. Given staffing constraints and variable confidence in taxonomic identifications we are using the following, low-overhead approach.

We are beginning with our largest and highest quality collections - vascular plants and bryophytes. We are collaborating with a larger institution (the University of Michigan Herbarium) from which we are receiving a database schema, hardware and software recommendations, and training for the digitization and QA/QC process.

In exchange, the UM Herbarium will receive unique specimen records and more precise location information for known collection sites in our region. Once this phase is complete, we will be better able to tackle our smaller and more complicated collections. We will also use this time to improve the quality of the identifications in those collections.

My thoughts for the project outlined on this website are as follows:

For the digitization process, I believe a regional (i.e., dispersed) effort that pairs complementary institutions is best. A mentor-mentee relationship, if you will.

However, how to get our collection online, integrated with multi-institutional databases and how to leverage these databases is an outstanding question. For this phase of the project I think the ability to consult with technical experts located at a centralized institute would be best. I think this is the phase where uniformity should be enforced. I could foresee help in mapping our database schema to whichever data standard is adopted as well as how to best serve up and leverage data from an IT perspective.

Kyle Kwaiser, Information Manager
University of Michigan Biological Station

Posted by kkwaiser at 10:11 AM | Comments (0)

February 22, 2010

Mapping EndNote to RIS

I exported the same bibliographic entries from EndNote into the EndNote tagged format and the RIS tagged format. I then mapped them 1:1 and noted inconsistencies as they arose.

Part of the reason I did this is that when you import one of these tagged documents into different bibliographic software (e.g., EndNote, Refworks, ProCite) the outcome is different. Some fields are smashed together and others seem to be lost!

It's worth noting that this is how EndNote maps to RIS and not how ProCite or Refworks would map to RIS.


Endnote to RIS
%0 to TY -> Indicates Reference Type
%A to AU -> Indicates Author
%E to A2 -> Indicates Editor
%D to PY -> Indicates Year
%T to TI -> Indicates Title
%! to ST -> Indicates Short Title
%J to T2 -> Indicates Journal
%B to T2 -> Indicates Book Title
%V to VL -> Indicates Volume, Degree
%N to M1 -> Indicates Issue (should be IS in RIS)
%P to SP -> Indicates page number
%8 to DA -> Indicates Date
%@ to SN -> Indicates ISSN
%R to DO -> Indicates DOI
%M to AN -> Indicates Accession Number
%F to LB -> Indicates Label
%K to KW -> Indicates Keyword
%X to AB -> Indicates Abstract (note: may also be N2 in RIS specs)
%Z to N1 -> Indicates Notes
%< to RN -> Indicates Research Notes (doesn't appear to be defined in RIS specifications?)
%U to UR -> Indicates URL
NaN to ID -> Indicates an ID number, not included in Endnote format
%I to PB -> Indicates Publisher
%C to CI -> Indicates City
%& to SE -> Indicates Chapter
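The mapping above is mechanical enough to capture in a small script. This sketch converts one EndNote-tagged line at a time, assuming the conventional RIS line layout of tag, two spaces, hyphen, space, value:

```python
# Mapping taken from the table above (EndNote tag -> RIS tag).
ENDNOTE_TO_RIS = {
    "%0": "TY", "%A": "AU", "%E": "A2", "%D": "PY", "%T": "TI",
    "%!": "ST", "%J": "T2", "%B": "T2", "%V": "VL", "%N": "M1",
    "%P": "SP", "%8": "DA", "%@": "SN", "%R": "DO", "%M": "AN",
    "%F": "LB", "%K": "KW", "%X": "AB", "%Z": "N1", "%<": "RN",
    "%U": "UR", "%I": "PB", "%C": "CI", "%&": "SE",
}

def endnote_line_to_ris(line: str) -> str:
    """Convert one EndNote-tagged line (e.g. '%A Smith, J.') to RIS form.

    Lines with unmapped tags are returned unchanged so nothing is
    silently dropped.
    """
    tag, _, value = line.partition(" ")
    ris_tag = ENDNOTE_TO_RIS.get(tag)
    if ris_tag is None:
        return line
    return f"{ris_tag}  - {value}"
```

Note that the %J/%B collision (both map to T2) means a round trip cannot recover whether the source field was Journal or Book Title, which is exactly the kind of field-smashing mentioned above.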

References:
http://www.refman.com/support/docs/ReferenceManager12.pdf
http://www.refman.com/support/risformat_tags_01.asp

This website is a good resource. It lists and defines the EndNote tagging scheme:

http://www.cardiff.ac.uk/insrv/educationandtraining/guides/endnote/endnote_codes.html

Posted by kkwaiser at 08:08 AM | Comments (0)

February 16, 2010

Tackling the Gazetteer

Our ability to avoid formatting the Gazetteer for the website has run out now that I have figured out how to bulk load data into Drupal via a combination of Drupal modules and direct database insertions (there will be a blog post on this later).


My approach to formalizing the Gazetteer is as follows:

1) Synchronize our spelling and lat/long information with the USGS Geographic Names Information System. It appears that 306 of 530 sites are listed by the GNIS, meaning we immediately have location and other information we can leverage.

2) Incorporate information already collated by Bob. This includes site descriptions, synonyms and location information. For the latter, I re-projected his GAZ.shp file to NAD83, and used ArcMap's ADD XY tool to get coordinates.

The approach has changed:

1) Identify the final list of names. This involves reconciling synonyms and making minor edits to current names. Bob will help finalize this.

2) Add references to the appropriate GNIS feature classes. To find this list, google "gnis feature classes". I added one custom class - Road - to accommodate the UMBS' needs. This categorization should help others explore our research sites.

3) Get lats and longs for the research sites. Bob is working on finalizing this. Re-project the lat/longs to match up with Google Maps projection.

4) Collate all pertinent information into one spreadsheet. This should become our 'working' copy of the gazetteer.

5) Make appropriate changes to the bibliography. This should only involve updating research site names and looking for typos.

6) Adjust the Research Site content type in Drupal to minimize information loss upon uploading the research sites spreadsheet. This will involve adding township/range, county, township and other categories.

7) Once the Research sites and the bibliography are uploaded then create the node references. This is a process unto itself. May need to upload lat/long data separately.
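Step 4, collating everything into one working spreadsheet, can be sketched with the csv module. The inputs below are toy stand-ins for the finalized name list and the coordinates Bob is assembling; column names and values are hypothetical:

```python
import csv
import io

# Toy stand-ins for the real inputs. In practice these would be files on
# disk; io.StringIO keeps the sketch self-contained.
sites_csv = io.StringIO(
    "name,feature_class\nDouglas Lake,Lake\nGrapevine Point,Cape\n"
)
# name -> (lat, lon); illustrative values, not authoritative coordinates.
coords = {"Douglas Lake": ("45.5634", "-84.6731")}

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "feature_class", "lat", "lon"])
writer.writeheader()
for row in csv.DictReader(sites_csv):
    # Leave lat/lon blank for sites not yet located.
    lat, lon = coords.get(row["name"], ("", ""))
    writer.writerow({**row, "lat": lat, "lon": lon})

working_copy = out.getvalue()
```

Blank coordinate cells then double as a to-do list for step 3.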

Posted by kkwaiser at 09:17 AM | Comments (0)

January 14, 2010

Formatting the Gazetteer

Among the list of fundamental resources that must be worked into shape prior to being placed on a new website is the list of research sites studied by Biological Station researchers. The list was compiled by our Resident Biologist and is closely tied to the Biological Station bibliography.

Here is a list of steps that need to be taken:

- Deal with synonyms

Many of the gaz sites are referred to with different names (synonyms) in the Biological Station publications. An extra column (or field) will be added for synonyms, allowing duplicate rows to be removed.
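Collapsing duplicates into one row per canonical name with a synonyms column is straightforward to script. The raw rows below are invented examples of (name as used in a publication, canonical name) pairs:

```python
from collections import defaultdict

# Hypothetical raw rows: (name as it appears in publications, canonical name).
raw_rows = [
    ("Douglas Lake", "Douglas Lake"),
    ("Douglas L.", "Douglas Lake"),
    ("Grapevine Pt.", "Grapevine Point"),
]

# Gather synonyms under each canonical name, then emit one row per site
# with the variants packed into the extra synonyms column.
synonyms = defaultdict(list)
for used_name, canonical in raw_rows:
    if used_name != canonical:
        synonyms[canonical].append(used_name)

gaz = [
    {"name": canonical, "synonyms": "; ".join(synonyms.get(canonical, []))}
    for canonical in sorted({c for _, c in raw_rows})
]
```

The hard part, of course, is building the used-name-to-canonical-name pairs in the first place; that is a manual reconciliation job.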

- Synchronize the gaz with the USGS Geographic Names Information System list. The GNIS appears to be the authoritative list of places for the United States.

The GNIS database for Michigan is available in two places. Getting it directly from the USGS yields a text file with 51,060 rows (i.e., places). The datum is NAD 83. To set this in ArcGIS: ArcCatalog > Define Projection Tool > Coordinate System > Geographic Coordinate Systems > North America > North American Datum 1983.prj.

The GNIS for Michigan is also available at the Michigan Geographic Data Library; however, I do not recommend this version because the shape file appears to list only 32,022 places. The coordinate system is NAD 1983 Michigan GeoRef (Meters), although the metadata on the MiGDL does not explicitly tell you this. In ArcGIS it is under Projected Coordinate Systems > State Systems > NAD 1983 Michigan GeoRef (Meters).prj.


Any changes to the gaz as it currently stands will require updates to the bibliography so the naming schemes are consistent.

More steps are undoubtedly needed but I think my brain needs to stew on it before the most efficient path comes to mind.

Posted by kkwaiser at 03:44 PM | Comments (0)

December 14, 2009

Pros and cons of using a controlled (keyword) vocabulary

One of the many information resources that we need to standardize at the Biological Station is the list of index terms (i.e. keywords). In contemplating the creation of a controlled vocabulary for documenting data sets, bibliographic entries, study sites and research projects many decision points have arisen. Here is a summary.

Pros of building a controlled vocabulary for indexing:

- Consistency within terminology can improve searchability. E.g., use of "forests" instead of "forest", "forests" instead of "trees", "carbon dioxide" instead of "CO2"

- Consistency across information resources. The same terminology will be used to describe data sets, bibliographic entries, study sites and research projects

- Incorporating external controlled vocabulary (LTER keywords) will facilitate integration of UMBS data resources with network-scale databases.

- Use of keyword auto-complete when creating metadata may yield use of more descriptive terms as compared to "top-of-the-head" categorization

- UMBS can make a contribution to the creation of a controlled vocabulary for use by other field stations and for ecology in general

Cons:

- Building a controlled vocabulary can be time consuming

- Potential that keyword lists will not adequately represent new research directions

- No guarantee that anonymous users will use correct terms

Posted by kkwaiser at 01:00 PM | Comments (0)

Plan to create a UMBS controlled (keyword) vocabulary

Here is an outline of what will need to happen for the UMBS to have an established list of keywords.

1. Extract keywords from the UMBS bibliography, this will be the starting point.
- This list is not comma-delimited, meaning multi-term keywords will need to be manually identified and computer scripts will need to be used to reformat the terms and add commas

2. Parse the raw keyword list into 3 parts:
- a) keywords redundant with the LTER list (including synonyms and lexical variants)
- b) taxonomic descriptors (latin names and species-specific common names?)
- c) candidate-keywords for a UMBS keyword list.

3. Build UMBS keyword list using the candidate-keyword list:
- Identify how to treat hyphens, spaces and plurals
- Declare as equivalent lexical variants (e.g. analyze vs analyse)
- Identify synonyms
- Remove candidate-keywords that require context to make sense (e.g. "change", "description")
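The partition in step 2 can be prototyped once the raw list is comma-delimited. The lists below are toy stand-ins (the real inputs are the bibliography export and the LTER vocabulary), and the Latin-name lookup is a hypothetical stand-in for whatever taxonomic matching is eventually used:

```python
# Toy stand-ins for the real inputs.
lter_keywords = {"forests", "nitrogen", "succession"}
latin_binomials = {"populus tremuloides", "acer rubrum"}

raw_keywords = ["forests", "Populus tremuloides", "stream chemistry", "nitrogen"]

redundant, taxonomic, candidates = [], [], []
for kw in raw_keywords:
    normalized = kw.strip().lower()
    if normalized in lter_keywords:
        redundant.append(normalized)    # 2a: already covered by the LTER list
    elif normalized in latin_binomials:
        taxonomic.append(normalized)    # 2b: taxonomic descriptor
    else:
        candidates.append(normalized)   # 2c: candidate UMBS keyword
```

Synonyms and lexical variants (step 3) would need a richer lookup than exact matching, but this shape of pass is where the hyphen/plural rules would plug in.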



References

Savoy, J. (2005). Bibliographic database access using free-text and controlled vocabulary: An evaluation. Information Processing & Management, 41(4), 873-890.

Svenonius, E. (1986). Unanswered questions in the design of controlled vocabularies. Journal of the American Society for Information Science, 37(5), 331-340.

Svenonius, E. (2003). Design of controlled vocabularies. Taylor & Francis.

Posted by kkwaiser at 10:59 AM | Comments (0)

December 11, 2009

Summary of initial keyword situation

An interesting question has arisen while formatting the Biological Station bibliography. What to do with all of those keywords? A quick and dirty analysis shows the keyword list from the bibliography has the following attributes:

3158 - number of keywords in the UMBS bibliography
2562 - number of keywords in the UMBS bibliography used 5 times or less
639 - number of keywords in the LTER Keyword list (v.0.9)
283 - number of overlapping keywords between UMBS and LTER lists
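Numbers like these fall out of a few lines with collections.Counter. The inputs below are toy stand-ins for the full bibliography export and the LTER list:

```python
from collections import Counter

# Toy keyword occurrences; the real input is the full bibliography export.
occurrences = ["forests", "forests", "snails", "snails", "snails",
               "forests", "nitrogen"]
lter_list = {"forests", "nitrogen", "succession"}

counts = Counter(occurrences)                          # distinct keywords + frequencies
rarely_used = [kw for kw, n in counts.items() if n <= 5]  # used 5 times or less
overlap = set(counts) & lter_list                      # shared with the LTER list
```

Run over the real export, `len(counts)`, `len(rarely_used)`, and `len(overlap)` would reproduce the 3158 / 2562 / 283 figures above.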


[Keyword summary graphic - right-click > View Image for the full-size version]

Here is a list of the 100 most used keywords in the UMBS Bibliography:

invertebrates 488
parasites 355
insects 317
aquatic 277
birds 272
distribution 270
water 267
species 260
behavior 252
plants 234
trematodes 231
description 229
forest 213
history 192
vascular 176
carbon 148
fungi 148
life 144
chemistry 141
snails 141
algae 132
succession 127
fishes 124
nutrients 122
vegetation 119
communities 118
breeding 114
growth 105
nesting 104
temperature 100
populus 99
change 97
nitrogen 94
bryophytes 92
dioxide 87
protozoans 86
mosses 85
reproductive 84
artificial 82
climate 81
success 80
development 76
diatoms 75
limnology 75
substrates 75
chemical 74
morphology 74
coleoptera 71
diptera 71
global 71
quality 71
soils 71
variation 70
aspen 69
biology 69
ecology 68
colonization 67
flora 67
vertebrates 67
benthic 66
predation 65
taxonomy 64
infection 63
population 63
biomass 62
production 62
trees 62
schistosomes 61
crustaceans 60
beetles 58
herbivory 58
size 58
larus 57
gulls 55
range 55
atmospheric 54
composition 54
deposition 54
feeding 54
periphyton 54
photosynthesis 51
acer 50
amphibians 50
lepidoptera 50
snakes 50
habitat 49
organic 49
wetlands 49
key 48
streams 48
leaf 47
structure 47
competition 46
light 46
mammals 46
molluscs 46
stagnicola 46
larvae 45
rana 45


Posted by kkwaiser at 12:52 PM | Comments (0)

November 25, 2009

Creating a final version of the UMBS bibliography

Here's a laundry list of things that need to happen in order to get the UMBS bibliography into a final version that will be meticulously maintained and updated:


- Meet with Melissa Gomis/ Laurie Sutch of the UM library to address the following:
http://www.lib.umich.edu/knowledge-navigation-center

- What are the pros and cons of different software options? EndNote, Refworks, a relational database built from scratch.

- How to guard against loss of information when transferring between tagging formats (e.g., EndNote to RIS)?

- Other best practices to implement to protect the integrity of the bibliography as it grows?

Other points to address:
- Need to develop a final version of Gazetteer where the exact spelling is noted.
- Then adjust the Gaz locations in the Bibliography to match exactly.
- Gaz locations in Bibliography will be moved to Research notes (Field %< in Endnote, RN in RIS)

- Remove reference to any Procite Fields
- What is "ProCite Field [11]"? Usually has an "e" or "f" and is called "Title" inside ProCite software.

- Add Accession Numbers with pattern umbs.x (where 'x' is a unique, incremented number: umbs.1, umbs.2, umbs.3)

- Remove Short Title (%! in EndNote, ST in RIS) if it exactly matches Title (%T/TI)
- Remove Date (%8/DA) if it exactly matches Year (%D/PY)
- Add any articles Bob has added to his version of the Bibliography
- Place a copy of the Final version of the Bibliography onto DeepBlue
- Maintain ONE working copy, store it on the department space so it is accessible to Bob and Kyle

- Go to UMBS to meet with Bob and hammer out the approximately 50 entries that are still unresolved. Also add any entries that Bob has added to his version since August.
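Two of the mechanical rules above, dropping redundant Short Title/Date fields and stamping umbs.x accession numbers, can be sketched against records represented as dicts of RIS-style tags (tag names per the "Mapping EndNote to RIS" post; the example record is invented):

```python
def clean_record(record: dict, accession_number: int) -> dict:
    """Apply the redundancy rules and add a umbs.x accession number."""
    cleaned = dict(record)
    # Remove Short Title (ST) if it exactly matches Title (TI).
    if "ST" in cleaned and cleaned["ST"] == cleaned.get("TI"):
        del cleaned["ST"]
    # Remove Date (DA) if it exactly matches Year (PY).
    if "DA" in cleaned and cleaned["DA"] == cleaned.get("PY"):
        del cleaned["DA"]
    # Accession Number (AN) pattern: umbs.1, umbs.2, umbs.3, ...
    cleaned["AN"] = f"umbs.{accession_number}"
    return cleaned

# Invented example record.
records = [{"TI": "Lakes of Cheboygan County",
            "ST": "Lakes of Cheboygan County",
            "PY": "1954", "DA": "1954"}]
cleaned = [clean_record(r, i) for i, r in enumerate(records, start=1)]
```

Running something like this over the whole export would knock out the bulk cleanup, leaving only the judgment calls (the ~50 unresolved entries) for the meeting with Bob.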

Posted by kkwaiser at 10:57 AM | Comments (0)

September 08, 2009

NAS committee on data preservation

A committee organized by the National Academy of Sciences recently published a book entitled "Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age." The UM's Provost Teresa Sullivan was a member of the committee.

The gist of the (free) Executive Summary is that the preservation of data and documentation behind peer-reviewed literature is the responsibility of the authors and the affiliated institutions. Many of the challenges the Summary broaches are also faced by the Biological Station, and it raises the question of how far-reaching our data management practices should be.

For example, the Flathead Lake Biological Station archives everything behind the research process (data, field notebooks, computer code, paper drafts, etc) whereas the current UMBS modus operandi is to archive the underlying data and the methods behind the data. In other words, our responsibilities extend to the data collected at the Biological Station, not to the analysis and interpretation of data that rounds out a peer-reviewed publication.

Do we have a responsibility to go further? Do researchers have a responsibility to pursue data preservation just as they pursue funding to conduct studies? Or, is it the responsibility of peer-reviewed journals to implement much of this process?

Posted by kkwaiser at 11:33 AM | Comments (0)