February 11, 2013

Anonymizing medical data

Just a few thoughts copped from a recent email exchange on the Research-Dataman list:

It's one thing to be aware of the risks - it's another to decide how to
manage them. Refusing to disclose *any* data except under very carefully
controlled circumstances is one approach, and it's probably valid for data
where the reuse potential is likely to be limited to a few instances at most.
For data with greater reuse potential, techniques adopted for some government
datasets may be appropriate. These include perturbation of some of the numbers
or suppression of some numbers in cases that might lead to disclosure even in
aggregated data. Both need expert statistical advice to ensure that the
resultant data can still be used to do something useful but isn't disclosive.

Examples of perturbation include varying a subject's age by a few years in
either direction. An example of suppression I am aware of comes from the Schools
Census - in any school where the number of pupils receiving free school meals
is below 5, the exact total is redacted from the published data.
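To make the two techniques concrete, here is a minimal Python sketch of each. The function names, the plus-or-minus 3 year window, and the sample records are all assumptions made for illustration; only the "below 5" threshold comes from the Schools Census example above. Real perturbation schemes need the expert statistical advice mentioned earlier, so treat this as a toy and not a disclosure-control method.

```python
import random


def perturb_age(age, max_shift=3, rng=random):
    """Shift an age by a random whole number of years in either direction.

    max_shift is an assumed window; the source only says "a few years".
    """
    return max(0, age + rng.randint(-max_shift, max_shift))


def suppress_small_counts(counts, threshold=5, marker=None):
    """Replace any count below the threshold with a redaction marker."""
    return {key: (value if value >= threshold else marker)
            for key, value in counts.items()}


# Hypothetical data, seeded so the run is repeatable.
rng = random.Random(42)
ages = [34, 51, 67]
perturbed = [perturb_age(a, rng=rng) for a in ages]

free_meals = {"School A": 12, "School B": 3, "School C": 5}
published = suppress_small_counts(free_meals)

print(perturbed)
print(published)
```

Note that a count of exactly 5 is kept, since the rule as described only redacts totals below 5.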

Ultimately, the only way to prevent identification of individuals by combining datasets (i.e., datasets that include data items sensitive enough to permit identification, but no actual confidential or identifiable data) is through the Data Sharing or Re-Use Agreements between data controllers and data processors.

Websites that were mentioned:

Association of Research Ethics Committees

The Ethox Centre

Anonymisation of data from UK Data Archive

Posted by kkwaiser at 10:48 AM | Comments (0)

August 06, 2012

List of data archiving services

From the Research-Dataman email list:

There are a number of current and emerging endeavours to list such data archives:

DataCite: http://datacite.org/repolist
DataBib: http://databib.org/
Re3Data: http://www.re3data.org/

Posted by kkwaiser at 08:49 AM | Comments (0)

June 14, 2012

Digital Curation Bibliography

In a rapidly changing technological environment, the difficult task of ensuring long-term access to digital information is increasingly important. The Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works presents over 650 English-language articles, books, and technical reports that are useful in understanding digital curation and preservation. This selective bibliography covers digital curation and preservation copyright issues, digital formats (e.g., data, media, and e-journals), metadata, models and policies, national and international efforts, projects and institutional implementations, research studies, services, strategies, and digital repository concerns. Most sources have been published from 2000 through 2011; however, a limited number of key sources published prior to 2000 are also included. The bibliography includes links to freely available versions of included works, such as e-prints and open access articles.

Posted by kkwaiser at 08:57 AM | Comments (0)

May 15, 2012

British data software


Other new features in v3.0 include:

- Ability to share plans, and to edit them jointly with colleagues
- Simultaneous viewing of multiple custom guidance notes
- More flexible project stages (phases) for templates
- User maintainable profile/login details
- XLSX output

Over the coming months we will be rolling out a number of additional features, and further announcements will flag their release:

- A facility for boilerplate text to be included within templates
- Display of funder constraints on output (e.g. number of pages, word count etc)
- Increased institutional customisability, including a new ‘administrator’ user type
- Support for non-English Language versions of the tool


DataStage is a secure personalized 'local' file management environment for use at the research group level, appearing as a mapped drive on the end-user's computer.

It can be deployed on a local server, or on an institutional or commercial cloud. Once the software has been installed on the server, there is no additional software for the end-user to install.


DataBank is a scalable data repository designed for institutional deployment.

DataBank will provide a definitive, sustainable, referenceable location for (potentially large) research datasets and allow researchers to store, reference, manage and discover datasets.

Posted by kkwaiser at 08:41 AM | Comments (0)

May 10, 2012

EPA Data Mandate?

A colleague forwarded this document from the EPA in which they review all federal data management policies in preparation for creation of an over-arching EPA SDM (scientific data management) policy. FYI, ORD is an office within the EPA.

This review demonstrates that, in general, federal agencies have not yet developed comprehensive policies and approaches for managing the burgeoning amount of scientific data that they create. Nevertheless, this compilation of resources provides a solid base of information for beginning to develop a set of ORD SDM policies and guidance.

The introduction to this report laid out a general, long-term approach for two broad goals: (1) developing a SDM policy framework and (2) developing policies, guidance, and tools that fit within this framework.

Posted by kkwaiser at 10:41 AM | Comments (0)

May 09, 2012

Further reading?

American Journal of Economics and Business Administration 3 (1): 112-119, 2011
ISSN 1945-5488
© 2010 Science Publications

Using Metadata Analysis and Base Analysis Techniques
in Data Qualities Framework for Data Warehouses

Azwa Abdul Aziz, Md Yazid Mohd Saman and Mohd Pouzi Hamzah
Faculty of Informatics, Department of Computer Science,
University Sultan Zainal Abidin (UniSZA),
21030, Gong Badak, Terengganu Malaysia

the framework will use Metadata
Analysis to gain the target qualities value and Base
Analysis Techniques to view actual values in data
sources. A gap analysis technique will provide the
strategies to reduce the gap between the target and
actual values. This study also proposes a DQ matrix
strategy in DW design.

Posted by kkwaiser at 04:15 PM | Comments (0)

February 29, 2012

Persistent Identifier Resources

I have previously identified my ignorance regarding persistent URLs. Here are a few resources that may alleviate the condition:

New Drupal module called Persistent Identifiers? Yes, please.

Implementing Persistent Identifiers - a useful PDF

Posted by kkwaiser at 02:02 PM | Comments (0)

January 25, 2012

Citing Data Sets

[First Author Last Name], [First Author Initials], [2nd Author Initials] [2nd Author Last Name], [3rd Author Initials] [3rd Author Last Name]. [Year of Data Set Completion or Publication]. [Data Set Title]. [Publisher Location]: [Data set publisher]. Accessed on [Date data set retrieved YYYY-MM-DD] at [Website URL].

Welch, P.S., F.E. Eggleton. 2010. South Fishtail Bay Profiles 1913-1950. Pellston, MI USA: University of Michigan Biological Station Research Gateway. Accessed on 2012-01-25 at http://umbs.lsa.umich.edu.
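The template above is regular enough to generate automatically. Below is a minimal Python sketch that rebuilds the Welch and Eggleton example from its parts. The function name and argument layout are hypothetical, not part of the Research Gateway; the only thing taken from the template is the ordering and punctuation.

```python
def format_dataset_citation(authors, year, title, location, publisher,
                            accessed, url):
    """Build a data set citation following the template above.

    authors is a list of (last_name, initials) tuples. The first author
    is written "Last, Initials"; later authors are "Initials Last".
    accessed is a YYYY-MM-DD string.
    """
    first_last, first_initials = authors[0]
    names = [f"{first_last}, {first_initials}"]
    names += [f"{initials} {last}" for last, initials in authors[1:]]
    return (f"{', '.join(names)}. {year}. {title}. "
            f"{location}: {publisher}. Accessed on {accessed} at {url}.")


citation = format_dataset_citation(
    authors=[("Welch", "P.S."), ("Eggleton", "F.E.")],
    year=2010,
    title="South Fishtail Bay Profiles 1913-1950",
    location="Pellston, MI USA",
    publisher="University of Michigan Biological Station Research Gateway",
    accessed="2012-01-25",
    url="http://umbs.lsa.umich.edu",
)
print(citation)
```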


Posted by kkwaiser at 09:30 AM | Comments (0) | TrackBack

January 13, 2012

ToDo: EML feeds for metadata

Alter LTER bit:

uid=UMBS, o=lter, dc=ecoinformatics, dc=org

This block automatically adds "_ref" to the end of node reference names. The problem is that I don't always append "_ref" to my field names, so those references don't end up in the metadata. From eml_variables.php (line 134):

foreach ($dataset_reference_names as $dataset_reference_name) {
  $ref_nodes = Array();
  $field_name = "field_" . $dataset_reference_name . "_ref";

Here's the new version of this:

      //  refs  
      $dataset_reference_names = array(


foreach ($dataset_reference_names as $dataset_reference_name) {
  $ref_nodes = Array();
  // Use the reference name as given; "_ref" is no longer appended here.
  $field_name = "field_" . $dataset_reference_name;
  if (isset($node->$field_name)) {
    $ref_nid_array = $node->$field_name;
    if ($dataset_reference_name == 'dataset_site_ref' &&
        $node->field_dataset_site_ref[0]['nid']) {
      $ref_nodes = eml_get_site_information($ref_nid_array);
    }
    else {
      foreach ($ref_nid_array as $v) {
        foreach ($v as $ref_nid) {
          $ref_nodes[] = node_load($ref_nid);
        }
      }
    }
    // ... excerpt ends here

In eml_config/eml_config_form.inc change the maxLength of acronym to 4

$form['acronym'] = array(
  '#type' => 'textfield',
  '#title' => t('Site name acronym'),
  // '#required' => TRUE,
  '#size' => 4,
  '#maxlength' => 4,
  '#default_value' => variable_get('acronym', $last_settings['last_acronym']),
  // '#description' => t('Site name acronym'),
);

in views-bonus-eml-export-eml.tpl.php remove closing tags at end of file:

eml_indent(1); eml_close_tag('eml:eml'); ');

In views-bonus-eml-export-eml.tpl.php add a test to the check for a code-definition variable. I'm not sure why, but without this test any variables that lack units or dates end up going into this branch.

} elseif ($var->code_definition[0]['value'] != NULL) {

In views-bonus-eml-export-eml.tpl.php change knb to UMBS:


//TODO: access tag group - from config file, or from site variable, or... here is my take !!!
if ($acr) {
$access_string = "uid=$acr, o=umich, dc=ecoinformatics, dc=org";

Posted by kkwaiser at 03:25 PM | Comments (0) | TrackBack

November 18, 2011

UMBS Tidas Buoy Metadata

Just a few emails to document. But, remember this group as one that could potentially teach others about metadata standards.

Hi Susan,

I'll take a stab at this.

DP - The Depth of the corresponding Temperature Profile Node (in meters)
TP - Temperature Profile Node
TP001 - 14.34 ft
TP002 - 23.26 ft
TP003 - 32.19 ft
TP004 - 41.11 ft
TP005 - 50.00 ft
TP006 - 58.92 ft
TP007 - 67.85 ft
TP008 - 76.77 ft
Note: Previous communication with Heidi indicates that the TP006, TP007, and TP008 should be used with caution as they may have been in the muck.
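Since the DP columns are described in meters while the node depths above are listed in feet, a conversion helps when comparing the two. This is a minimal Python sketch using the exact factor 1 ft = 0.3048 m. The dictionary and variable names are made up for illustration, and whether the DP columns in the data file actually hold these converted values is an assumption that should be checked against the file itself.

```python
FEET_TO_METERS = 0.3048  # exact, by definition of the international foot

# Node depths in feet, copied from the list above.
node_depth_ft = {
    "TP001": 14.34,
    "TP002": 23.26,
    "TP003": 32.19,
    "TP004": 41.11,
    "TP005": 50.00,
    "TP006": 58.92,
    "TP007": 67.85,
    "TP008": 76.77,
}

# Nodes flagged in the note above as possibly sitting in the muck.
use_with_caution = {"TP006", "TP007", "TP008"}

node_depth_m = {node: round(ft * FEET_TO_METERS, 2)
                for node, ft in node_depth_ft.items()}

for node, meters in node_depth_m.items():
    flag = " (use with caution)" if node in use_with_caution else ""
    print(f"{node}: {meters:.2f} m{flag}")
```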

FM - Appears to be a metadata standard code. Here's what I dug up via Google. I've attached a document located here: ftp://compsweb.marine.usf.edu/pub/misc/eriks/TESAC%20XML%20tags.rt

For all platforms that measure salinity and/or water temperature and/or currents, include data with these XML tags:

<fm64iii> set to 820 if temperature and/or currents are measured
          set to 830 if salinity is also measured

<fm64xx> set to 99

<fm64k1> set to 7, indicates measurements are at fixed depths

<fm64k2> 0, indicates salinity is not measured
         1, indicates salinity accuracy > 0.02 ppt.
         2, indicates salinity accuracy < 0.02 ppt.


Kyle Kwaiser, Information Manager
University of Michigan Biological Station
2541 Chemistry Bldg.
930 North University Ave.
Ann Arbor, MI, 48109-1055 USA
Ph: 734-615-5005

Quoting Susan Hendricks:

Hi Kyle:

Hope you are doing well.

I have been working with this year's DL buoy data. There are some
parameters however that I can't quite figure out such as the ones listed
below. I sent this to Heidi, but she's so busy, I was wondering if you
could decipher them? Particularly the FMs, DPs, TPs; the YSIs are
straightforward as were the meteorological data.



From: Susan Hendricks [mailto:shendricks@murraystate.edu]
Sent: Tuesday, November 15, 2011 2:06 PM
To: 'Purcell, Heidi'
Subject: column/parameter codes

Hi again Heidi,

I was able to download the data and separate 2011 from 2010. I do, however,
need to know what the following column headings represent. Some of them are
obviously depths and others are YSI parameters...Are all the YSI readings
(except for temp) collected at 1 m depth?

FM64III ()
FM64XX ()
FM64K1 ()
FM64K2 ()
DP001 ()
TP001 ()
DP002 ()
TP002 ()
DP003 ()
TP003 ()
DP004 ()
TP004 ()
DP005 ()
TP005 ()
DP006 ()
TP006 ()
DP007 ()
TP007 ()
DP008 ()
TP008 ()

Posted by kkwaiser at 09:50 AM | Comments (0) | TrackBack

October 31, 2011

Image Transfer Fun

Time to move images from here to here. This here is a list (and a recursive link!):

Douglas Lake Region

- The data set Property Boundaries of UMBS - incorporates this image
- The data set Soils of UMBS - incorporates this image

Not transferred (yet):
- the UMBS Campus or siteuse images
- Gorge trails
- Mapped trees of UMBS
- NAIP orthophoto - must boost file upload size!!

Not going to transfer:

- the satellite images of UMBS because of copyrights. This topo map is also copyrighted but it must apply only to the digital version used here.
- DEM of all of N. Michigan - need to get one subset to UMBS property to accompany our DEM data set.

Sugar Island

- The Sugar Island Research Site received the following Osborn and Public Lands and Land Use

Not transferring

- Copyrighted topo map of SI

Ecosystem and Cover Type Maps

- The data set Major and Minor Landforms of UMBS - incorporates this
- The data set Vegetative Cover Types of UMBS - incorporates this UMBS image and a similar Colonial Point image.
- The data set Ecosystems of UMBS - incorporates this UMBS image and the Colonial Point image.

Not transferred (yet):
- aerial photo of Colonial Point

USGS and Historical Maps

- Not transferring these as of now.

Posted by kkwaiser at 01:32 PM | Comments (0) | TrackBack

October 26, 2011

Information Specialist Internship

Position overview: Work on projects central to University of Michigan Biological Station's data management goals. Specifically, this position will work closely with the UMBS Information Manager and affiliated researchers to identify and archive completed data sets.

Responsibilities: Communicate with and receive metadata and data from researchers, harvest metadata from peer-reviewed literature and dissertations, enter metadata into a Drupal-based information management system, run quality control on incoming data sets.

Required skill sets: Knowledge of metadata standards, familiarity with data quality assurance and quality control practices, ability to communicate efficiently with scientists on a wide variety of research topics, exceptional organizing skills, ability to balance need for detail with overarching program goals.

Desired skill sets: Experience with the Drupal content management system, knowledge of a scripting language (e.g., PHP, python, R).

Posted by kkwaiser at 02:36 PM | Comments (0) | TrackBack

IGERT-BART Data Project

IGERT-BART Data Legacy Project - UMBS received two successive rounds of IGERT funding for graduate student researchers. That means a lot of data sets were created and, lo and behold, where are they? Let's map out how we might find out:

1. Collate list of BART fellows, dates at UMBS, topics researched, publications produced, contact information, likely data sets, etc.

2. Build a "most wanted data" target list by prioritizing the list from step 1.

3. Contact BART fellows to request data submission.

4. Conduct data interviews with BART fellows to establish number and scope of potential data sets.

5. Metadata entry - harvested from publications/dissertations and correspondence with the researcher.

6. Data quality control - iterate with the researcher to develop a final, archivable version of the dataset.

Posted by kkwaiser at 02:00 PM | Comments (0) | TrackBack

August 15, 2011

Data Intensive Science Request for Ideas

Maybe, someday, I will have time to read over this interesting request for ideas.

Posted by kkwaiser at 11:24 AM | Comments (0) | TrackBack

July 20, 2011

Archiving audio files

Just making a record of an email. We have a researcher collecting bird song data and the question is how best to archive the data. The considerations: open source vs. proprietary, compressed vs. uncompressed, lossless vs. lossy, and the ubiquity of the format.


Thanks for taking the initiative on this. Here are notes on a small bit of research:

Quick Primer on Audio File Formats:

- WAV = Microsoft proprietary, no compression, lossless
- MP3 = patented, compression, lossy

Lossless indicates no data (sound waves in this case?) are lost. Lossy is the opposite. You can have a format that compresses the data but is lossless.

Notes on the file you sent:

Original size: 7.3 MB, 1 min 15 secs (.wav format)
Zipped size: 6.1 MB (.wav format)

Using Audacity, I exported to mp3: 1.3 MB
Zipped size: 1.2 MB (.mp3 format)

Obvious question: is the quality of the mp3 sufficient for scientific analysis of the data? Are you aware of best practices guides out of the ornithology world?
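For the record, the size figures above can be turned into space savings and rough average bitrates. This is a minimal Python sketch under two assumptions: that "MB" means 10^6 bytes, and that the mp3 export has the same 1 min 15 sec duration as the original. The variable and function names are mine, and the numbers say nothing about whether the lossy version is good enough for analysis.

```python
DURATION_S = 75  # 1 min 15 secs, from the notes above

# File sizes in MB, copied from the notes above.
sizes_mb = {
    "wav": 7.3,
    "wav (zipped)": 6.1,
    "mp3": 1.3,
    "mp3 (zipped)": 1.2,
}


def saving(size_mb, original_mb):
    """Fraction of space saved relative to the original file."""
    return 1 - size_mb / original_mb


def bitrate_kbps(size_mb, duration_s):
    """Average bitrate in kbit/s, assuming 1 MB = 10**6 bytes."""
    return size_mb * 1_000_000 * 8 / duration_s / 1000


original = sizes_mb["wav"]
for label, size in sizes_mb.items():
    print(f"{label}: {size} MB, "
          f"{saving(size, original):.1%} smaller than the wav, "
          f"about {bitrate_kbps(size, DURATION_S):.0f} kbit/s")
```

Zipping the wav saves far less than the mp3 export does, which is the usual trade: generic lossless compression keeps every sample but recovers little space on audio.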

Posted by kkwaiser at 12:07 PM | Comments (0) | TrackBack

June 06, 2011

Summer Activities?

Time to think aloud. One of my goals for the summer is to engage students and researchers more. Get them thinking data and metadata early on. Here are a few ideas:

Data Management Presentation - Introduce basics of data mgmt. Potential targets = Courses with research project foci, Frontiers, REU, grad students
Data Entry overview - Introduce the Research Gateway to groups (Frontiers, REU) likely to contribute data.
GIS for mapmakers - Basic intro to ArcGIS with focus on local datasets and creating map layouts to describe research locale
GPS Scavenger Hunt - pre-allocate lats/longs for multiple groups and have them terminate at the same location. First group to the final location wins...something. Real prizes of somewhat-real value.

Mandatory IM Meeting - Discuss accomplishments, needed improvements, future goals, data collection, data policy

EML Module - deploy on Research Gateway

NSF DMP Language - draft for M. Hunter and others.

Posted by kkwaiser at 11:46 AM | Comments (0) | TrackBack

May 03, 2011

Data Turbine Notes

Background research:

Interesting video: Automated installation and operation of sensors in an IP network - "Just plug it in and it works." HA.

Mark Mail Archive of mailing list

Deploying DT Software - pretty sparse if you ask me.

Posted by kkwaiser at 08:55 AM | Comments (0) | TrackBack

April 05, 2011

Dryad and DSpace

It seems everyone has their own data management solution these days. Dryad is an example of a larger system designed to serve as a repository for data from peer-reviewed articles. It is built on an open source platform called DSpace. Sure would be nice if they had a sandbox available.





Sure, there's the DSpace demo site.

See below the fold for more of my IRC chat.

< kbk1> Hi there. Is DSpace completely custom code or is it built on top of existing code? I just watched a preview video and the theme of the demo site recalled wordpress.
< kbk1> From what I am seeing so far, it looks as if it is built from the ground-up.
< tdonohue> Hi kbk1. DSpace is completely custom code. It was initially built by MIT and Hewlett Packard back in 2002, and since then was open sourced and is community maintained code.
< td> (so, it actually pre-dates WordPress, by about a year, I believe)
< kbk1> I don't suppose there is a sandbox available? I've worked with a few field stations to build an information management system on Drupal and am curious about what other solutions look like.
< td> There is a sandbox/demo site. It's at http://demo.dspace.org
< kompewter> [ DSpace 1.7.0 Demonstration Repository ] - http://demo.dspace.org
< kbk1> Awesome!
< td> From there, you'd want to visit *either* the XMLUI (XML-based UI) or JSPUI (JSP-based UI). Those are the two offered UIs for DSpace.
< td> If you visit one of those UIs, there are actual sample logins provided on the homepage (e.g. read the intro text of the XMLUI, which provides you with sample logins to the demo system: http://demo.dspace.org/xmlui )
< kompewter> [ Community List ] - http://demo.dspace.org/xmlui
< kbk1> I will take a look. I found out about DSpace while reading a paper which mentioned Dryad - which is built on DSpace. Dryad claims the ability to assign DOI's to datasets. Is this functionality within DSpace?
< td> kbk1: DSpace does not specifically assign DOIs by default. But, it will assign Handles (http://handle.net/), which are a part of the DOI system -- see: http://en.wikipedia.org/wiki/Digital_object_identifier
< kompewter> [ Digital object identifier - Wikipedia, the free encyclopedia ] - http://en.wikipedia.org/wiki/Digital_object_identifier
< td> PeterDietz -- That's fine. A GSoC project need not cover all UIs, to be honest. It could be specific to one UI. As of 1.7, we no longer have complete "UI parity", so it doesn't matter if we scoped around XMLUI (as long as the project didn't do anything that would "break" another UI)
< td> PeterDietz -- that being said, if we still had concerns about the project, or no interested mentors, we could 'pull' the project and suggest the student look at other GSoC projects.
< td> kbk1: also if you have specific questions about the Dryad project, one of it's repository programmers, ks, is currently "lurking" in this chat channel. So you might want to ask him, if he's got time.
< ksclarke> yep, I'm around
< ks> DOI assignment takes place through a module Dryad has developed and we're working to make it work within the identity services module atmire is developing (maybe moving our code into that, eventually, if it sees uptake in the dspace community)
< ks> our DOIs have a particular form (meaning embedded in them, unfortunately (imo), related to our modeling of data packages and data files) so are not completely generic like a regular dspace module should be
< kbk1> Thanks ks. If I could add one feature to our IMS it would be DOI assignment. I think it would really encourage researcher buy-in.
< kbk1> I like the workflow for item submission on the demo site.
< ks> I did notice recently that the ezid service (that CDL provides and that we use to register) allows minting now in addition to registering... it might not be much work to build something over that for data-centric DOI registration
< ks> yes, we chose DOIs over other schemes because we though the buy-in would be more significant for that reason
< ks> we thought
< kbk1> Did you think correctly?
< ks> we are seeing uptake in our submissions; we're also working with journals though who are now requiring deposition in a data repository like dryad -- so we can't (I don't think) tease apart what's contributing to our growth
< ks> and I'm not front lines (interacting with folks) so I don't have much ancedotal evidence
< ks> I could ask our curator though so see if she's had feedback about the assignment of DOIs
< ks> I know our workflow passes the DOI back to the submittor so that it can be included in their article
< ks> so we're definitely presenting it as a selling point ("here is your DOI for your data package so people can reference you")
< kbk1> Right, the paper by Vision (BioScience 2010) was the only one I've found thus far that explicitly advocated a DOI for datasets. I work with a lot graduate students and would like to tell them to add Contributed Datasets to their CV's.
< ks> Yeah, that would be great! My believe is having a DOI will do more to encourage that... I'll ask our curator
< ks> my belief
< ks> arg, can't type today
< kbk1> But I can't tell them to do that if it is not a more widely applied practice.
< ks> Yes, I understand... we're definitely hoping to encourage people in that direction but it's not a widely applied practice yet
< kbk1> Either way, if we do move in this direction it may be helpful to talk with someone who has been there. Is IRC your preferred forum?
< ks> IRC works for me but I'll only be with the project for about another month (moving on to something else); you could join the Dryad mailing list... it's low volume and questions, etc. are welcome there: https://lists.nescent.org/mailman/listinfo/wg-digitaldata
< kompewter> [ Wg-digitaldata Info Page ] - https://lists.nescent.org/mailman/listinfo/wg-digitaldata
< ks> I believe that's the open list
< kbk1> Thanks. I just joined. If you are interested, here is the Drupal-based IMS I have built: http://umbs.lsa.umich.edu/research/
< kompewter> [ University of Michigan Biological Station ] - http://umbs.lsa.umich.edu/research/
< ks> there is also a dryad-dev list that's intended to be for dev-focused discussion: http://groups.google.com/group/dryad-dev?pli=1 (also low volume... though we're trying to use it more)
< kompewter> [ dryad-dev | Google Groups ] - http://groups.google.com/group/dryad-dev?pli=1
* ks goes to take a look
< ks> nice, my last place of work was moving towards drupal
< ks> this looks like a nice example of what can be done with it
< ks> btw, just had a colleague tell me the dryad-dev list is intended to be the open public list for discussions, etc.
< kbk1> So favor that over the nescent list?
< ks> yes, I guess so
< ks> I'm surprised you were able to join if the other is not intended to be open but perhaps they were hoping for security through obscurity
< ks> which I've now foiled
< kbk1> Oops. Let the spamming commence.
< kbk1> I'm off to lunch but I appreciate the chance to chat. Looks like you guys are up to good.

Posted by kkwaiser at 10:47 AM | Comments (0) | TrackBack

March 14, 2011

Dataset todo's

Nothing but a boring todo list to make sure things don't fall between the cracks.

Received but not processed:

Done but up as an xlsx file:

lindsfp and lindpell
dailyppt.xlsx - not loaded
Cheb Climate Records - xlsx format
Mackinaw Climate Records - xlsx format
PELLPPT.xlsx - xlsx format
PELLTEMP.xlsx - xlsx format
solar.xlsx - xlsx format
umbstemp.xlsx -
Secchi Depth Readings

Douglas Lake levels.xlsx -
umbsppt.xlsx -

Posted by kkwaiser at 04:33 PM | Comments (0)

March 06, 2011

Cloud based GIS server

Possible solution to our need for a GIS-based data management solution.

Original ESRI press release.

More on ESRI's relationship with Amazon.

Suggested read from SpatiallyAdjusted.

PostGIS on a Windows 2008 Server installed on the cloud.
What about running Postgres on Amazon's EC2?

Email exchange on Amazon EC2 and geoservers.

Posted by kkwaiser at 01:29 PM | Comments (0)

March 04, 2011

Good best-practices resource for data management

Best Practices for Preparing Environmental Data Sets to Share and Archive

Posted by kkwaiser at 03:59 PM | Comments (0)