October 26, 2011
Prioritizing Data Management Goals
Question: If the UMBS Information Manager could have an intern/assistant for at least a semester, what would that person do? Which goal is the most pressing and would lead to the greatest payoff?
Potential Goals:
Data Forensics project - identify completed research projects with unarchived, high quality data sets and begin placing them into the Research Gateway
Target IGERT-BART Data - In some ways a subset of the Data Forensics Project but with a greatly restricted target population and timespan.
Research Gateway Rampage - Crank away on improvements (e.g., New Table Wizard module, Drupal 7 migration)
Current Research Metadata Entry - Work to increase contributions from ongoing research projects
Housework - Monitor incoming housing applications, update database, search for and add new publications
Great exercise. Now let us order by priority:
1. Target IGERT-BART Data - In some ways a subset of the Data Forensics Project but with a greatly restricted target population and timespan.
2. Current Research Metadata Entry - Work to increase contributions from ongoing research projects
3. Housework - Monitor incoming housing applications, update database, search for and add new publications
4. Research Gateway Rampage - Crank away on improvements (e.g., New Table Wizard module, Drupal 7 migration)
5. Data Forensics project - identify completed research projects with unarchived, high quality data sets and begin placing them into the Research Gateway
Items 1-3 could conceivably be part of a single internship, although 1 warrants more than a semester while 2 and 3 are ongoing.
Posted by kkwaiser at 01:12 PM | Comments (0) | TrackBack
April 05, 2011
On DOI's and Data
I am going to waste a great title on a boring post. One of my pie-in-the-sky hopes is to advance to the point where datasets within the Research Gateway are assigned Digital Object Identifiers (DOI). I know this would help to establish legitimacy in the eyes of our researchers but, honestly, I do not know much about the underpinnings of the DOI system. Here are a few pointers:
There is a DOI website, but you are better off starting at the Wikipedia page.
DataCite is a DOI service specifically for datasets (Wikipedia page).
If a DOI is a unique identifier, then my understanding is that the Handle System makes sure the DOI points to the right place (Wikipedia page).
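To make the identifier/resolver split concrete, here is a minimal Python sketch of how a DOI string turns into a link that a resolver can redirect to wherever the dataset currently lives. The function name `doi_to_url` and the example DOI `10.1234/umbs.42` are made up for illustration; the only thing assumed about the real system is the public resolver at doi.org and the general "10.prefix/suffix" shape of a DOI.

```python
from urllib.parse import quote

# Public DOI resolver; it looks the DOI up and redirects to the registered location.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Return the resolver URL for a DOI such as '10.1234/umbs.42' (hypothetical)."""
    doi = doi.strip()
    prefix, sep, suffix = doi.partition("/")
    # A DOI has a prefix beginning with '10.' and a non-empty suffix after the slash.
    if not (prefix.startswith("10.") and sep and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path; keep the slash.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.1234/umbs.42"))
```

The point of the indirection is that the DOI never changes even if the dataset moves; only the record behind the resolver is updated.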
Questions to answer:
Q: Is the University of Michigan associated with DataCite at all? Is DeepBlue?
Yes, ICPSR is a member of DataCite and contributed datasets receive a DOI. I don't think DeepBlue has this capability but I asked them anyways. They should.
How complicated and expensive would an automated DOI registration system be?
How complicated and expensive would a manual DOI registration system be?
Posted by kkwaiser at 01:29 PM | Comments (0) | TrackBack
January 31, 2011
Cleaning up the data
Now that the housing application is "done" I am going to catch up on data stuffs. Things that need to happen:
- Switch back to DataSet/DataFile approach
- Update Panel template and Views to pull info from DataFile
- Clean up data from REUs and Frontiers students
- Load data from Bob
- Load bibliography entries from Bob
- Break for lunch...
Posted by kkwaiser at 04:37 PM | Comments (0)
November 11, 2010
Notes on NSF Programs related to bio-research collections
Advancing Digitization of Biological Collections (ADBC) - "This program seeks to create a national resource of digital data documenting existing biological collections...The national resource will be structured at three levels: a national hub, thematic networks based on collaborative groups of collections, and the physical collections."
Home Uniting Biocollections (HUB) - forming the coordinating scientific team...oversee implementation of standards and best practices for the collections, plan for the long-term sustainability of the national resource, facilitate communication and standards for training, and assure that results are disseminated to the scientific community utilizing collections, the collections community, and other similar efforts internationally.
Thematic Collections Networks (TCN) - "will conduct the digitization effort at a number of collections...justified by a research theme...integrate with other ongoing digitization activities, such as the new collaborative networks funded under the Improvements to Biological Research Collections Program (BRC)...BRC-funded collaborative projects will be expected to become part of this national resource"
Improvements to Biological Research Collections (BRC) - Next due date: July 22, 2011
Posted by kkwaiser at 01:07 PM | Comments (0)
September 09, 2010
Images directly into an access database
The proposed workflow for the digitization of our vascular plants collection currently looks like this:
1) Volunteer takes photos of specimen label
2) Volunteer places barcode onto specimen sheet
3) Put both barcode and photo into same row of database
4) Send database to UM herbarium, where they will sync the labels/barcodes with the records they have already digitized.
Problems:
Is there a way to get the photos to feed directly into a database field? The barcode scanner can feed directly in. This plan falls apart if the image (or a pointer to) is not automagically inserted into the database.
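As a rough illustration of step 3, here is a minimal sketch that stores the barcode and a pointer to the photo in the same row. It uses Python's built-in sqlite3 module as a stand-in for Access, which is an assumption made only to keep the example self-contained; the table name, column names, barcode value and file path are all hypothetical. It stores a path to the image rather than the image itself.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the specimen table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS specimen ("
        " barcode TEXT PRIMARY KEY,"
        " image_path TEXT NOT NULL)"
    )
    return conn


def record_specimen(conn, barcode, image_path):
    """Store the scanned barcode and a pointer to its label photo in one row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO specimen (barcode, image_path) VALUES (?, ?)",
            # Scanner input may carry a trailing newline, so strip it.
            (barcode.strip(), str(image_path)),
        )


conn = open_db()
record_specimen(conn, "UMBS0000001\n", "photos/IMG_0001.jpg")
print(conn.execute("SELECT barcode, image_path FROM specimen").fetchall())
```

Making the barcode the primary key means a second photo scanned against the same barcode is rejected rather than silently creating a duplicate row.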
References:
This article explains how to store images in the database, link to the images, and use VB to display the images.
This old post (Access 2000?) claims you can feed an image from a webcam directly into an OLE field. Doesn't appear to translate directly to Access 2007.
The comment from Razorking in this post indicates that it is simple to feed barcodes into databases, why not photos?
Another post, not sure how useful it is. Talks about feeding webcam images into Access.
Some VB code examples here and here.
Posted by kkwaiser at 08:57 AM | Comments (0)
June 23, 2010
Limnology course data
Troy Keller, 2010's Limnology instructor, stopped by with the idea of collaborating to archive data collected by his students this summer. Here are some notes:
Lakes:
Douglas
Burt
Long
Black
1 other, I believe
Variables:
secchi depth
chlorophyll
plankton (zoo)
physio-chemical - tp, tn, (pH, DO, temp profile)
benthic grab sample (dredge bottom for inverts)
Classes on Tues and Thursday. Data collection will commence on Tues, June 29th on Douglas Lake. GPS locations will be taken at each point, several points per lake.
One option is to create a project for the class and put their work (abstract, methods, data, personnel) under there. Will need to figure out how to organize the data (e.g., by lake) and how/whether to keep it separate from non-class data.
Posted by kkwaiser at 05:12 PM | Comments (0)
June 09, 2010
Building a "Sensor" Content Type
I've been toying with the idea of creating a "Sensor" content type for the Drupal site. As of now, the fields would mimic what is generally found on a sensor spec sheet but the Alliance for Coastal Technologies also has an implementation worth copying.
Once a sensor is created, then research projects, research sites and data files (or data sets?) that use a sensor could (node) reference it. There are several reasons to pursue this idea:
- A GMap mash-up could easily show where we have sensors deployed
- A dynamic list of the sensors, with their particulars, could be easily generated.
- At some point, the sensor metadata could be included in the overall metadata sheet that accompanies a data set and data file.
A use case example would be a researcher who wants to know if we are collecting a particular variable. A search of deployed sensors for that variable could indicate the general location, data sets and contact personnel.
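The use case can be sketched outside of Drupal as a plain data structure plus a search. This is only an illustration of the idea, not the content type itself: the `Sensor` fields, the example sensors and the contact address are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Sensor:
    """A hypothetical subset of what a sensor spec sheet might record."""
    name: str
    variables: list          # what the sensor measures
    site: str                # research site where it is deployed
    contact: str             # who to ask about it
    data_sets: list = field(default_factory=list)


def sensors_measuring(sensors, variable):
    """Return the sensors that record the given variable (case-insensitive)."""
    wanted = variable.strip().lower()
    return [s for s in sensors if wanted in (v.lower() for v in s.variables)]


sensors = [
    Sensor("Hypothetical buoy thermistor", ["Temperature"], "Douglas Lake",
           "contact@example.org", ["buoy-temps"]),
    Sensor("Hypothetical tower gas analyzer", ["carbon dioxide", "water vapor"],
           "Tower site", "contact@example.org"),
]

for s in sensors_measuring(sensors, "temperature"):
    print(s.site, s.data_sets, s.contact)
```

In Drupal the `site` and `data_sets` fields would presumably be node references rather than plain strings.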
Posted by kkwaiser at 04:24 PM | Comments (0)
March 08, 2010
Digitizing biological collections
I received an email from our Resident Biologist, Bob Vande Kopple, pointing me to this website.
This spurred me to hash out a quick plan for digitizing our biological collections. I posted it at the above site as a comment but am reposting it here for posterity's sake:
Great timing. Our biological station has a small collection (~20,000 floral and faunal specimens) which we are just beginning to digitize. Given staffing constraints and variable confidence in taxonomic identifications we are using the following, low-overhead approach.
We are beginning with our largest and highest quality collections - vascular plants and bryophytes. We are collaborating with a larger institution (the University of Michigan Herbarium) from which we are receiving a database schema, hardware and software recommendations, and training for the digitization and QA/QC process.
In exchange, the UM Herbarium will receive unique specimen records and more precise location information for known collection sites in our region. Once this phase is complete, we will be better able to tackle our smaller and more complicated collections. We will also use this time to improve the quality of the identifications in those collections.
My thoughts for the project outlined on this website are as follows:
For the digitization process, I believe a regional (i.e., dispersed) effort that pairs complementary institutions is best. A mentor-mentee relationship, if you will.
However, how to get our collection online, integrated with multi-institutional databases and how to leverage these databases is an outstanding question. For this phase of the project I think the ability to consult with technical experts located at a centralized institute would be best. I think this is the phase where uniformity should be enforced. I could foresee help in mapping our database schema to whichever data standard is adopted as well as how to best serve up and leverage data from an IT perspective.
Kyle Kwaiser, Information Manager
University of Michigan Biological Station
Posted by kkwaiser at 10:11 AM | Comments (0)
February 22, 2010
Mapping EndNote to RIS
I exported the same bibliographic entries from EndNote into the EndNote tagged format and the RIS tagged format. I then mapped them 1:1 and noted inconsistencies as they arose.
Part of the reason I did this is because when you import one of these tagged documents into different bibliographic software (e.g., EndNote, Refworks, ProCite) the outcome is different. Some fields are smashed together and others seem to be lost!
It's worth noting that this is how EndNote maps to RIS and not how ProCite or Refworks would map to RIS.
Endnote to RIS
%0 to TY -> Indicates Reference Type
%A to AU -> Indicates Author
%E to A2 -> Indicates Editor
%D to PY -> Indicates Year
%T to TI -> Indicates Title
%! to ST -> Indicates Short Title
%J to T2 -> Indicates Journal
%B to T2 -> Indicates Book Title
%V to VL -> Indicates Volume, Degree
%N to M1 -> Indicates Issue (should be IS in RIS)
%P to SP -> Indicates page number
%8 to DA -> Indicates Date
%@ to SN -> Indicates ISSN
%R to DO -> Indicates DOI
%M to AN -> Indicates Accession Number
%F to LB -> Indicates Label
%K to KW -> Indicates Keyword
%X to AB -> Indicates Abstract (note: may also be N2 in RIS specs)
%Z to N1 -> Indicates Notes
%< to RN -> Indicates Research Notes (doesn't appear to be defined in RIS specifications?)
%U to UR -> Indicates URL
NaN to ID -> Indicates an ID number, not included in Endnote format
%I to PB -> Indicates Publisher
%C to CI -> Indicates City
%& to SE -> Indicates Chapter
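The mapping above can be written down as a lookup table. The sketch below is a minimal Python version that converts one EndNote tagged line into an RIS-style line. The function name is hypothetical, and it follows the mapping exactly as observed above (including %N to M1), not the RIS specification. It only swaps tags: it does not translate reference type values or write the end-of-record line a complete RIS file needs.

```python
# EndNote tag -> RIS tag, as observed in the export comparison above.
ENDNOTE_TO_RIS = {
    "%0": "TY", "%A": "AU", "%E": "A2", "%D": "PY", "%T": "TI",
    "%!": "ST", "%J": "T2", "%B": "T2", "%V": "VL", "%N": "M1",
    "%P": "SP", "%8": "DA", "%@": "SN", "%R": "DO", "%M": "AN",
    "%F": "LB", "%K": "KW", "%X": "AB", "%Z": "N1", "%<": "RN",
    "%U": "UR", "%I": "PB", "%C": "CI", "%&": "SE",
}


def endnote_line_to_ris(line):
    """Convert '%A Some Author' to 'AU  - Some Author'; None if the tag is unknown."""
    tag, _, value = line.partition(" ")
    ris_tag = ENDNOTE_TO_RIS.get(tag)
    if ris_tag is None:
        return None  # surface unmapped tags instead of silently dropping them
    # RIS lines are: two-character tag, two spaces, hyphen, space, value.
    return f"{ris_tag}  - {value.strip()}"


for line in ["%A Kwaiser, K.", "%D 2010", "%N 4", "%Q unmapped"]:
    print(line, "->", endnote_line_to_ris(line))
```

Returning None for unknown tags makes it easy to list exactly which fields would be lost in a conversion. Note that %J and %B both land on T2, so the round trip back to EndNote is ambiguous for those two.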
References:
http://www.refman.com/support/docs/ReferenceManager12.pdf
http://www.refman.com/support/risformat_tags_01.asp
This website is a good resource. It lists and defines the EndNote tagging scheme:
http://www.cardiff.ac.uk/insrv/educationandtraining/guides/endnote/endnote_codes.html
Posted by kkwaiser at 08:08 AM | Comments (0)
February 16, 2010
Tackling the Gazetteer
Our ability to avoid formatting the Gazetteer for the website has been truncated since I figured out how to bulk load data into Drupal via a combination of Drupal modules and direct database insertions (there will be a blog post on this later.)
My approach to formalizing the Gazetteer is as follows:
1) Synchronize our spelling and lat/long information with the USGS Geographic Names Information System. It appears that 306 of 530 sites are listed by the GNIS meaning we immediately have location and other information we can leverage.
2) Incorporate information already collated by Bob. This includes site descriptions, synonyms and location information. For the latter, I re-projected his GAZ.shp file to NAD83, and used ArcMap's ADD XY tool to get coordinates.
The approach has changed:
1) Identify the final list of names. This involves reconciling synonyms and making minor edits to current names. Bob will help finalize this.
2) Add references to the appropriate GNIS feature classes. To find this list, google "gnis feature classes". I added one custom class - Road - to accommodate the UMBS' needs. This categorization should help others explore our research sites.
3) Get lats and longs for the research sites. Bob is working on finalizing this. Re-project the lat/longs to match up with Google Maps projection.
4) Collate all pertinent information into one spreadsheet. This should become our 'working' copy of the gazetteer.
5) Make appropriate changes to the bibliography. This should only involve updating research site names and looking for typos.
6) Adjust the Research Site content type in Drupal to minimize information loss upon uploading the research sites spreadsheet. This will involve adding township/range, county, township and other categories.
7) Once the Research sites and the bibliography are uploaded then create the node references. This is a process unto itself. May need to upload lat/long data separately.
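For step 3, here is a minimal sketch of the re-projection, assuming "Google Maps projection" means the spherical Web Mercator used for its map tiles. That is an assumption on my part: if the mapping module accepts plain latitude/longitude, this step may not be needed at all. The sketch also ignores the small datum difference between NAD83 and the datum Google uses. The function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator


def to_web_mercator(lat_deg, lon_deg):
    """Project latitude/longitude in degrees to Web Mercator x, y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


# Placeholder coordinates roughly in northern Michigan, not a surveyed site.
print(to_web_mercator(45.5, -84.7))
```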
Posted by kkwaiser at 09:17 AM | Comments (0)
January 14, 2010
Formatting the Gazetteer
Among the list of fundamental resources that must be worked into shape prior to being placed on a new website is the list of research sites studied by Biological Station researchers. The list was compiled by our Resident Biologist and is closely tied to the Biological Station bibliography.
Here is a list of steps that need to be taken:
- Deal with synonyms
Many of the gaz sites are referred to by different names (synonyms) in the Biological Station publications. An extra column (or field) will be used for synonyms, allowing duplicate rows to be removed.
- Synchronize the gaz with the USGS Geographic Names Information System list. The GNIS appears to be the authoritative list of places for the United States.
The GNIS database for Michigan is available in two places. Getting it directly from the USGS yields a text file with 51060 rows (i.e., places.) The datum is NAD 83. To set this in ArcGIS: ArcCatalog > Define Projection Tool > Coordinate System > Geographic Coordinate Systems > North America > North American Datum 1983.prj.
The GNIS for Michigan is also available at the Michigan Geographic Data Library; however, I do not recommend this version because the shapefile appears to list only 32022 places. The coordinate system is NAD 1983 Michigan GeoRef (Meters), although the metadata on the MiGDL does not explicitly tell you this. In ArcGIS it is under Projected Coordinate Systems > State Systems > NAD 1983 Michigan GeoRef (Meters).prj.
Any changes to the gaz as it currently stands will require updates to the bibliography so the naming schemes are consistent.
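The synchronization step can be sketched as a name lookup against the GNIS text file. This is a minimal Python example under several assumptions: that the file is pipe-delimited with a header row, and that the columns are named FEATURE_NAME, PRIM_LAT_DEC and PRIM_LONG_DEC (check the actual download before relying on this). The sample rows and coordinates below are placeholders, not real GNIS records.

```python
import csv
import io


def load_gnis(handle):
    """Index GNIS rows by lower-cased feature name (assumed column names)."""
    reader = csv.DictReader(handle, delimiter="|")
    return {row["FEATURE_NAME"].strip().lower(): row for row in reader}


def match_gazetteer(site_names, gnis):
    """Split gazetteer names into those found in GNIS (with coordinates) and the rest."""
    matched, unmatched = {}, []
    for name in site_names:
        row = gnis.get(name.strip().lower())
        if row is None:
            unmatched.append(name)
        else:
            matched[name] = (float(row["PRIM_LAT_DEC"]), float(row["PRIM_LONG_DEC"]))
    return matched, unmatched


# Placeholder data standing in for the real Michigan file.
SAMPLE = (
    "FEATURE_ID|FEATURE_NAME|FEATURE_CLASS|PRIM_LAT_DEC|PRIM_LONG_DEC\n"
    "1|Douglas Lake|Lake|45.5|-84.7\n"
    "2|Burt Lake|Lake|45.4|-84.6\n"
)
gnis = load_gnis(io.StringIO(SAMPLE))
matched, unmatched = match_gazetteer(["douglas lake ", "Hypothetical Bog"], gnis)
print(matched, unmatched)
```

One caveat: place names are not unique within a state, so a real run would need to match on name plus county (or similar) rather than name alone. This sketch simply keeps the last row seen for each name.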
More steps are undoubtedly needed but I think my brain needs to stew on it before the most efficient path comes to mind.
Posted by kkwaiser at 03:44 PM | Comments (0)
December 14, 2009
Pro's and con's of using a controlled (keyword) vocabulary
One of the many information resources that we need to standardize at the Biological Station is the list of index terms (i.e. keywords). In contemplating the creation of a controlled vocabulary for documenting data sets, bibliographic entries, study sites and research projects many decision points have arisen. Here is a summary.
Pro's of building a controlled vocabulary for indexing:
- Consistency within terminology can improve search ability. E.g., use of "forests" instead of "forest", "forests" instead of "trees", "carbon dioxide" instead of CO2
- Consistency across information resources. The same terminology will be used to describe data sets, bibliographic entries, study sites and research projects
- Incorporating external controlled vocabulary (LTER keywords) will facilitate integration of UMBS data resources with network-scale databases.
- Use of keyword auto-complete when creating metadata may yield use of more descriptive terms as compared to "top-of-the-head" categorization
- UMBS can make a contribution to the creation of a controlled vocabulary for use by other field stations and for ecology in general
Cons:
- Building a controlled vocabulary can be time consuming
- Potential that keyword lists will not adequately represent new research directions
- No guarantee that anonymous users will use correct terms
Posted by kkwaiser at 01:00 PM | Comments (0)
Plan to create a UMBS controlled (keyword) vocabulary
Here is an outline of what will need to happen for the UMBS to have an established list of keywords.
1. Extract keywords from the UMBS bibliography, this will be the starting point.
- This list is not comma-delimited, meaning multi-term keywords will need to be manually identified and computer scripts will need to be used to reformat the terms and add commas
2. Parse the raw keyword list into 3 parts:
- a) keywords redundant with the LTER list (including synonyms and lexical variants)
- b) taxonomic descriptors (latin names and species-specific common names?)
- c) candidate-keywords for a UMBS keyword list.
3. Build UMBS keyword list using the candidate-keyword list:
- Identify how to treat hyphens, spaces and plurals
- Declare as equivalent lexical variants (e.g. analyze vs analyse)
- Identify synonyms
- Remove candidate-keywords that require context to make sense (e.g. "change", "description")
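Step 2 of the outline can be sketched as a small script. This Python version makes choices the outline leaves open, so treat them as assumptions: hyphens are treated as spaces, everything is lower-cased, and lexical variants are handled with a hand-built lookup table. The function names and the example terms are hypothetical, and plural handling and synonym detection are left out.

```python
def normalize(keyword, variants):
    """Lower-case, treat hyphens as spaces, collapse whitespace, map known variants."""
    term = " ".join(keyword.lower().replace("-", " ").split())
    return variants.get(term, term)


def partition_keywords(raw, lter_terms, taxa, variants=None):
    """Split raw keywords into LTER overlap, taxonomic terms and UMBS candidates."""
    variants = variants or {}
    lter = {normalize(t, variants) for t in lter_terms}
    taxa = {normalize(t, variants) for t in taxa}
    parts = {"lter": set(), "taxonomic": set(), "candidate": set()}
    for keyword in raw:
        term = normalize(keyword, variants)
        if term in lter:
            parts["lter"].add(term)
        elif term in taxa:
            parts["taxonomic"].add(term)
        else:
            parts["candidate"].add(term)
    return parts


parts = partition_keywords(
    raw=["Carbon-dioxide", "analyse", "Populus", "trematodes"],
    lter_terms=["carbon dioxide"],
    taxa=["populus"],
    variants={"analyse": "analyze"},
)
print(parts)
```

The candidate set is what step 3 would then work through by hand.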
References
Savoy, J. (2005). Bibliographic database access using free-text and controlled vocabulary: An evaluation. Information Processing & Management, 41(4), 873-890.
Svenonius, E. (1986). Unanswered questions in the design of controlled vocabularies. Journal of the American Society for Information Science, 37(5), 331-430.
Svenonius, E. (2003). Design of controlled vocabularies. Taylor & Francis.
Posted by kkwaiser at 10:59 AM | Comments (0)
December 11, 2009
Summary of initial keyword situation
An interesting question has arisen while formatting the Biological Station bibliography. What to do with all of those keywords? A quick and dirty analysis shows the keyword list from the bibliography has the following attributes:
3158 - number of keywords in the UMBS bibliography
2562 - number of keywords in the UMBS bibliography used 5 times or less
639 - number of keywords in the LTER Keyword list (v.0.9)
283 - number of overlapping keywords between UMBS and LTER lists
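For anyone wanting to reproduce numbers like these, here is a minimal Python sketch of the counting. It assumes the figures are counts of distinct keywords and that the keywords have already been split into one term per list item. The function name and the tiny sample lists are hypothetical, not the real bibliography.

```python
from collections import Counter


def keyword_summary(keywords, lter_terms, rare_cutoff=5):
    """Count distinct keywords, rarely used ones, and overlap with the LTER list."""
    counts = Counter(k.strip().lower() for k in keywords if k.strip())
    lter = {t.strip().lower() for t in lter_terms}
    return {
        "distinct": len(counts),
        "rare": sum(1 for n in counts.values() if n <= rare_cutoff),
        "overlap": len(counts.keys() & lter),
        "top": counts.most_common(3),
    }


# Made-up sample: "forest" used six times, "birds" twice, "aspen" once.
sample = ["forest"] * 6 + ["birds", "birds", "Aspen"]
print(keyword_summary(sample, ["forest", "aspen", "soil"]))
```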

Here is a list of the 100 most used keywords in the UMBS Bibliography:
| invertebrates | 488 |
| parasites | 355 |
| insects | 317 |
| aquatic | 277 |
| birds | 272 |
| distribution | 270 |
| water | 267 |
| species | 260 |
| behavior | 252 |
| plants | 234 |
| trematodes | 231 |
| description | 229 |
| forest | 213 |
| history | 192 |
| vascular | 176 |
| carbon | 148 |
| fungi | 148 |
| life | 144 |
| chemistry | 141 |
| snails | 141 |
| algae | 132 |
| succession | 127 |
| fishes | 124 |
| nutrients | 122 |
| vegetation | 119 |
| communities | 118 |
| breeding | 114 |
| growth | 105 |
| nesting | 104 |
| temperature | 100 |
| populus | 99 |
| change | 97 |
| nitrogen | 94 |
| bryophytes | 92 |
| dioxide | 87 |
| protozoans | 86 |
| mosses | 85 |
| reproductive | 84 |
| artificial | 82 |
| climate | 81 |
| success | 80 |
| development | 76 |
| diatoms | 75 |
| limnology | 75 |
| substrates | 75 |
| chemical | 74 |
| morphology | 74 |
| coleoptera | 71 |
| diptera | 71 |
| global | 71 |
| quality | 71 |
| soils | 71 |
| variation | 70 |
| aspen | 69 |
| biology | 69 |
| ecology | 68 |
| colonization | 67 |
| flora | 67 |
| vertebrates | 67 |
| benthic | 66 |
| predation | 65 |
| taxonomy | 64 |
| infection | 63 |
| population | 63 |
| biomass | 62 |
| production | 62 |
| trees | 62 |
| schistosomes | 61 |
| crustaceans | 60 |
| beetles | 58 |
| herbivory | 58 |
| size | 58 |
| larus | 57 |
| gulls | 55 |
| range | 55 |
| atmospheric | 54 |
| composition | 54 |
| deposition | 54 |
| feeding | 54 |
| periphyton | 54 |
| photosynthesis | 51 |
| acer | 50 |
| amphibians | 50 |
| lepidoptera | 50 |
| snakes | 50 |
| habitat | 49 |
| organic | 49 |
| wetlands | 49 |
| key | 48 |
| streams | 48 |
| leaf | 47 |
| structure | 47 |
| competition | 46 |
| light | 46 |
| mammals | 46 |
| molluscs | 46 |
| stagnicola | 46 |
| larvae | 45 |
| rana | 45 |
Posted by kkwaiser at 12:52 PM | Comments (0)
November 25, 2009
Creating a final version of the UMBS bibliography
Here's a laundry list of things that need to happen in order to get the UMBS bibliography into a final version that will be meticulously maintained and updated:
- Meet with Melissa Gomis/ Laurie Sutch of the UM library to address the following:
http://www.lib.umich.edu/knowledge-navigation-center
- What are the pro's and con's of different software options? EndNote, Refworks, a relational database built from scratch.
- How to guard against loss of information when transferring between tagging formats (e.g., EndNote to RIS)?
- Other best practices to implement to protect the integrity of the bibliography as it grows?
Other points to address:
- Need to develop a final version of Gazetteer where the exact spelling is noted.
- Then adjust the Gaz locations in the Bibliography to match exactly.
- Gaz locations in Bibliography will be moved to Research notes (Field %< in Endnote, RN in RIS)
- Remove reference to any Procite Fields
- What is "ProCite Field [11]"? Usually has an "e" or "f" and is called "Title" inside ProCite software.
- Add Accession Numbers with pattern umbs.x (where 'x' is a unique, incremented number: umbs.1, umbs.2, umbs.3)
- Remove Short Title (%! in EndNote, ST in RIS) if it exactly matches Title (%T/TI)
- Remove Date (%8/DA) if it exactly matches Year (%D/PY)
- Add any articles Bob has added to his version of the Bibliography
- Place a copy of the Final version of the Bibliography onto DeepBlue
- Maintain ONE working copy, store it on the department space so it is accessible to Bob and Kyle
- Go to UMBS to meet with Bob and hammer out the approximately 50 entries that are still unresolved. Also add any entries that Bob has added to his version since August.
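Three of the cleanup items above (accession numbers, redundant Short Title, redundant Date) are mechanical enough to script. Here is a minimal Python sketch, assuming each entry has already been parsed into a dictionary keyed by RIS tag with one value per tag. The function name and the sample record are hypothetical, and note that this version overwrites any accession number already present.

```python
def clean_records(records, start=1):
    """Return cleaned copies: add umbs.N accession numbers, drop redundant ST and DA."""
    cleaned = []
    for number, record in enumerate(records, start=start):
        rec = dict(record)  # copy so the original entry is left untouched
        rec["AN"] = f"umbs.{number}"
        # Drop Short Title when it exactly matches Title.
        if rec.get("ST") is not None and rec.get("ST") == rec.get("TI"):
            del rec["ST"]
        # Drop Date when it exactly matches Year.
        if rec.get("DA") is not None and rec.get("DA") == rec.get("PY"):
            del rec["DA"]
        cleaned.append(rec)
    return cleaned


sample = [
    {"TI": "A hypothetical title", "ST": "A hypothetical title", "PY": "1975", "DA": "1975"},
    {"TI": "Another title", "ST": "Another", "PY": "1980", "DA": "June 1980"},
]
for rec in clean_records(sample):
    print(rec)
```

Because accession numbers are assigned by position, this should be run once on the final, ordered list and never again on a re-sorted copy.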
Posted by kkwaiser at 10:57 AM | Comments (0)
September 08, 2009
NAS committee on data preservation
A committee organized by the National Academy of Sciences recently published a book entitled "Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age." The UM's Provost Teresa Sullivan was a member of the committee.
The gist of the (free) Executive Summary is that the preservation of data and documentation behind peer-reviewed literature is the responsibility of the authors and the affiliated institutions. Many of the challenges the Summary broaches are faced by the Biological Station and it brings up the question of how far-reaching our data management practices should be.
For example, the Flathead Lake Biological Station archives everything behind the research process (data, field notebooks, computer code, paper drafts, etc) whereas the current UMBS modus operandi is to archive the underlying data and the methods behind the data. In other words, our responsibilities extend to the data collected at the Biological Station, not to the analysis and interpretation of data that rounds out a peer-reviewed publication.
Do we have a responsibility to go further? Do researchers have a responsibility to pursue data preservation just as they pursue funding to conduct studies? Or, is it the responsibility of peer-reviewed journals to implement much of this process?
Posted by kkwaiser at 11:33 AM | Comments (0)