September 30, 2011
DataOne Workshop - Investigator ToolKit
Investigator Toolkit Overview - Chris Jones
...getting tired. quality of notes falling rapidly...
irc.ecoinformatics.org
#dataone
DataONE drive - mount dataone network on desktop
Visualization Software
FragStats
GRASS/ESRI
EstimateS - http://viceroy.eeb.uconn.edu/estimates
Posted by kkwaiser at 07:05 PM
DataOne Workshop - Member Nodes
Current institutions interested in membership
Posted by kkwaiser at 06:22 PM
DataOne Workshop - Installation Instructions
Installation notes
------------------
data - example data set
d1_common_python - types and service methods
d1_libclient_python - library of utility methods for calling d1 common
d1_client_cli - command line client (the d1 command used below)
Prerequisites
=============
0. Ubuntu 10.04 stock with patches (installed from the Ubuntu Server CD, with OpenSSH Server selected at the Software Selection screen)
1. Java
-- Add the line "deb http://archive.canonical.com/ lucid partner" to /etc/apt/sources.list
$ sudo aptitude update
$ sudo aptitude install sun-java6-jdk
2. Install certificates
-- Copy certificate files to /etc/ssl/certs
$ sudo cp dataone*.crt /etc/ssl/certs
$ sudo cp test_dataone_org.crt /etc/ssl/certs
$ sudo cp cilogon-*pem /etc/ssl/certs
$ sudo c_rehash /etc/ssl/certs
-- Copy private key to /etc/ssl/private
$ sudo cp test_dataone_org.nopassword.key /etc/ssl/private
-- Add certs to Java keystore
$ cd /usr/lib/jvm/java-6-sun/jre/lib/security
$ sudo keytool -import -alias DataOneCA -keystore ./cacerts -file \
/etc/ssl/certs/dataone-ca.crt
$ sudo keytool -import -alias DataOneTestCA -keystore ./cacerts -file \
/etc/ssl/certs/dataone-test-ca.crt
$ sudo keytool -import -alias CILogonSilver -keystore ./cacerts -file \
/etc/ssl/certs/cilogon-silver.pem
$ sudo keytool -import -alias CILogonBasic -keystore ./cacerts -file \
/etc/ssl/certs/cilogon-basic.pem
$ sudo keytool -import -alias CILogonOpenID -keystore ./cacerts -file \
/etc/ssl/certs/cilogon-openid.pem
3. Tomcat 6
$ sudo aptitude install tomcat6
-- Edit /etc/tomcat6/server.xml to enable the AJP connector on 8009
$ sudo /etc/init.d/tomcat6 restart
4. Apache
$ sudo aptitude install apache2 libapache2-mod-jk
Modify metacat workers.properties to point at Java and Tomcat, then:
$ sudo cp -i debian/jk.conf /etc/apache2/mods-available/
$ sudo cp -i debian/workers.properties /etc/apache2/
$ sudo a2dismod jk
$ sudo a2enmod jk
$ sudo a2enmod rewrite
$ sudo a2enmod ssl
$ sudo cp -i debian/knb-ssl /etc/apache2/sites-available/
$ sudo a2dissite 000-default
-- Modify knb and knb-ssl to fit the local host
$ sudo a2ensite knb
$ sudo a2ensite knb-ssl
$ sudo /etc/init.d/apache2 restart
5. Subversion
$ sudo aptitude install subversion
6. Set up user account
$ sudo adduser demo
7. Install ant
$ sudo apt-get install --no-install-recommends ant
8. Install maven2
$ sudo aptitude install maven2
9. Postgres
$ sudo aptitude install postgresql
Add "host metacat metacat 127.0.0.1/32 password" to pg_hba.conf
10. Create LDAP account
Via KNB web site, username = d1demo
11. Curl
$ sudo aptitude install curl
12. Python libraries
$ sudo aptitude install python-setuptools
$ sudo aptitude install python-dateutil
$ sudo aptitude install python-lxml
$ sudo easy_install PyXB
$ sudo easy_install minixsv
$ sudo aptitude install python-argparse python-argparse-doc
-- Also install the DataONE Python client libraries
$ cd d1_common_python
$ sudo python setup.py develop
$ cd ../d1_libclient_python
$ sudo python setup.py develop
$ echo "alias d1=~/d1_client_cli/src/d1_client_cli/dataone.py" >> ~/.bashrc
13. R system
$ sudo aptitude install r-base-core
$ sudo R CMD javareconf
$ R
> install.packages("rJava")
> q()
$
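Before moving on to the Metacat install it can help to confirm that the prerequisite steps actually left their tools on the PATH. The following is only a rough sketch in present-day Python 3 (not the Python 2 that Ubuntu 10.04 ships); the executable names are my assumptions about what each package installs, so adjust them for your machine.

```python
import shutil

# Assumed executable names for the packages installed in the steps above.
REQUIRED_TOOLS = {
    "java": "1. Java",
    "keytool": "2. Install certificates",
    "svn": "5. Subversion",
    "ant": "7. Install ant",
    "mvn": "8. Install maven2",
    "psql": "9. Postgres",
    "curl": "11. Curl",
    "R": "13. R system",
}


def missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    for name in missing_tools(REQUIRED_TOOLS):
        print(f"missing: {name} (see step {REQUIRED_TOOLS[name]})")
```

An empty result only means the executables were found; it says nothing about whether Tomcat, Apache or Postgres are configured correctly.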
Metacat install
----------------
0. Set up postgres
$ sudo -s
# su - postgres ##switch to postgres user
$ createdb metacat ##empty postgres database
$ psql metacat ##login to metacat database
## create db user
> CREATE USER metacat WITH UNENCRYPTED PASSWORD 'metacat';
> \q
$ exit
# /etc/init.d/postgresql-8.4 restart
# exit
1. Create metacat storage directory
$ sudo mkdir -p /var/metacat/
$ sudo chown -R tomcat6 /var/metacat ##recursively give ownership of the directory to the tomcat user
2. Servlet installation
$ cd metacat-1.10.0-snapshot10 ##dev snapshot of metacat
$ sudo cp knb.war /var/lib/tomcat6/webapps/ ##java web files to web server
$ sudo cp geoserver.war /var/lib/tomcat6/webapps/
$ sudo /etc/init.d/tomcat6 restart
3. Configure metacat
-- Open Metacat site in browser
-- https://demoX.test.dataone.org/knb/
-- admin is: uid=d1demo,o=unaffiliated,dc=ecoinformatics,dc=org
- "Metacat Administrator"
## To create own admin - set up own LDAP server or use KNB
-- Global properties
-- Set Database user/pw to metacat/metacat
-- set Context to knb
-- DataONE section
Node name: Demonstration Node 1
Node ID: DEMOX -- for example, 'DEMO1', ##unique and persistent identifier
Node Subject: CN=DEMOX, DC=dataone, DC=org
-- Note that this automatically registers as a MN
## Node account representing the MN during authentication actions
## True = CNs should approach MNs
-- Restart tomcat
$ sudo /etc/init.d/tomcat6 restart
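The X in demoX, DEMOX and CN=DEMOX is the same workshop machine number everywhere. A small sketch of that substitution, assuming the naming pattern used in these notes holds for every demo machine (the function name is mine, not part of any DataONE library):

```python
def node_identity(demo_number):
    """Derive the Node ID, Node Subject and host name for demo machine N."""
    node_id = f"DEMO{demo_number}"
    return {
        "node_id": node_id,
        "node_subject": f"CN={node_id}, DC=dataone, DC=org",
        "host": f"demo{demo_number}.test.dataone.org",
    }


if __name__ == "__main__":
    for key, value in node_identity(1).items():
        print(f"{key}: {value}")
```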
####################################
####################################
####################################
Run a few DataONE Web services
#refer to http://mule1.dataone.org/ArchitectureDocs-current/ for API
------------------------------
https://demoX.test.dataone.org/knb/d1/mn/v1/node ##our MN
https://cn-dev.dataone.org/cn/v1/node ##all nodes
https://demoX.test.dataone.org/knb/d1/mn/v1/object
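The same Member Node URLs can be built and fetched from a script instead of a browser. This is a sketch using only the Python 3 standard library; the helper names and the demo1 host are placeholders of mine, and fetch() needs network access plus a certificate chain your Python trusts.

```python
from urllib.parse import quote
from urllib.request import urlopen


def mn_url(host, *path):
    """Build a Member Node REST URL; each path segment is percent-encoded."""
    base = f"https://{host}/knb/d1/mn/v1"
    return "/".join([base] + [quote(str(part), safe="") for part in path])


def fetch(url, timeout=30):
    """Return the raw response body (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    host = "demo1.test.dataone.org"  # placeholder: substitute your own demoX host
    print(mn_url(host, "node"))
    print(mn_url(host, "object"))
```

Encoding each segment matters once identifiers contain characters such as a slash, which would otherwise be read as a path separator.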
Logon to CILogon
----------------
1. Visit: https://cilogon.org/?skin=DataONE
2. Choose your provider and log in
-- Likely: LTER, Google, or Protect Network
2a. If you don't have an account, create one
-- Either Google or Protect Network
3. Note the name of the certificate file downloaded to your machine
4. Note: there is a preinstalled certificate on your demo machine
Insert data files and metadata files
------------------------------------
First set some defaults for client operation:
$ cd ~
$ d1 \
--mn-url https://demo2.test.dataone.org/knb/d1/mn/v1 \
--cn-url https://cn-dev.dataone.org/cn/v1 \
--dataone-url https://cn-dev.dataone.org/cn/v1 \
--sysmeta-submitter "CN=DEMO2,DC=dataone,DC=org" \
--sysmeta-rightsholder "CN=DEMO2,DC=dataone,DC=org" \
--sysmeta-origin-member-node DEMO2 \
--sysmeta-authoritative-member-node DEMO2 \
--sysmeta-access-policy-public \
--cert-path /etc/dataone/client/certs/DEMO2.pem \
--key-path /etc/dataone/client/certs/DEMO2.pem \
--fields "pid,origin_mn,datemodified,size,objectformat,title" \
--query "*:*" \
--store-config
Now add one data object:
$ d1 \
--sysmeta-object-format text/csv \
--sysmeta-access-policy-public \
create foo.1.1 data/data-sites.csv
List objects on the node now
----------------------------
https://demoX.test.dataone.org/knb/d1/mn/v1/object
$ d1 --mn-url https://demoX.test.dataone.org/knb/d1/mn/v1 list
View system metadata
--------------------
https://demoX.test.dataone.org/knb/d1/mn/v1/meta/foo.1.1
$ d1 \
--dataone-url https://demoX.test.dataone.org/knb/d1/mn/v1 \
meta foo.1.1
Get the object from the MN
--------------------------
https://demoX.test.dataone.org/knb/d1/mn/v1/object/foo.1.1
$ d1 \
--dataone-url https://demoX.test.dataone.org/knb/d1/mn/v1 \
get foo.1.1
Insert two more objects -- data and EML
---------------------------------------
$ d1 \
--sysmeta-object-format text/csv \
--sysmeta-access-policy-public \
create foo.2.1 data/data-samples.csv
$ d1 \
--sysmeta-object-format eml://ecoinformatics.org/eml-2.0.1 \
--sysmeta-access-policy-public \
create foo.3.1 data/eml-metadata.xml
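The create calls all follow one pattern, so they can be generated from a small table. A sketch that only prints the commands (it does not run them); the identifiers, formats and paths are the ones from these notes, while the function name is mine. Note that d1 is a shell alias here, so running the commands from Python would need the real path to dataone.py passed as cli.

```python
import shlex

# (identifier, object format, file) as used in the create commands in these notes.
OBJECTS = [
    ("foo.1.1", "text/csv", "data/data-sites.csv"),
    ("foo.2.1", "text/csv", "data/data-samples.csv"),
    ("foo.3.1", "eml://ecoinformatics.org/eml-2.0.1", "data/eml-metadata.xml"),
]


def create_argv(pid, object_format, path, cli="d1"):
    """Return the argument list for one 'create' call of the d1 client."""
    return [
        cli,
        "--sysmeta-object-format", object_format,
        "--sysmeta-access-policy-public",
        "create", pid, path,
    ]


if __name__ == "__main__":
    for pid, object_format, path in OBJECTS:
        print(shlex.join(create_argv(pid, object_format, path)))
```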
Test that the node passes tests
--------------------------------
1. Visit: http://mncheck.test.dataone.org:8080/MNWebTester
2. Enter MN Base URL: https://demoX.test.dataone.org/knb/d1/mn
List the objects on the node
----------------------------
https://demoX.test.dataone.org/knb/d1/mn/v1/object
$ d1 --mn-url https://demoX.test.dataone.org/knb/d1/mn/v1 list
Show synchronization has happened
---------------------------------
https://cn-dev.dataone.org/cn/v1/object
https://cn-dev.dataone.org/cn/v1/resolve/foo.3.1
https://cn-dev.dataone.org/cn/v1/meta/foo.3.1
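Synchronization is not instant, so instead of reloading these pages by hand one could poll the CN. A sketch under two assumptions of mine: that the CN answers with an HTTP error until the object has been synchronized, and that ten tries a minute apart is a sensible default. Neither was stated at the workshop.

```python
import time
from urllib.error import URLError
from urllib.parse import quote
from urllib.request import urlopen


def cn_has_object(pid, cn_base="https://cn-dev.dataone.org/cn/v1", opener=urlopen):
    """True if the CN answers the /meta/<pid> request without an error."""
    try:
        with opener(f"{cn_base}/meta/{quote(pid, safe='')}", timeout=30):
            return True
    except URLError:  # HTTPError (e.g. 404) is a subclass of URLError
        return False


def wait_for(check, attempts=10, delay=60.0, sleep=time.sleep):
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Usage (needs network access):
#   wait_for(lambda: cn_has_object("foo.3.1"))
```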
Search for data locally on Metacat
----------------------------------
Visit: https://demoX.test.dataone.org/knb/
Search for: %
Search for data on the D1 Index
-------------------------------
Using the CLI:
$ d1 --query "origin_mn:DEMOX" search
List fields available for searching:
## fields indexed out of any metadata standard
$ d1 fields
A couple more searches:
$ d1 --query "barnacle" search
$ d1 --query "origin_mn:DEMOX AND objectformat:text/csv" search
Through a web interface:
http://cn-dev.dataone.org/solr/search.html
Through the Mercury search interface (operating on older CN deployment):
http://cn.dataone.org/mercury3/
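The CLI queries above are plain field:value pairs joined with AND. A sketch for assembling them in a script; the function names are mine, and no escaping is done, so values containing spaces or query-syntax characters would need quoting (I am assuming Solr-style syntax because the web interface lives under /solr/).

```python
def and_query(**terms):
    """Join field:value pairs with AND, in the order given."""
    return " AND ".join(f"{field}:{value}" for field, value in terms.items())


def search_argv(query, cli="d1"):
    """Return the argument list for a d1 search with the given query string."""
    return [cli, "--query", query, "search"]


if __name__ == "__main__":
    query = and_query(origin_mn="DEMOX", objectformat="text/csv")
    print(" ".join(search_argv(query)))
```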
Posted by kkwaiser at 02:33 PM
DataOne Workshop - Morning
Technical considerations to make entry smoother
Resource dedication on UMBS behalf
Policies and practices
Advantages for researchers and UMBS
DataONE - overview - Amber Budden
- Remove inefficiencies related to data handling for researchers
Three components
Member Nodes, Coordinating Nodes, Investigator ToolKit
1) Cyberinfrastructure
- Member nodes that implement DataONE's software stack
- Coordinating nodes - catalog/indexer of member node content
- Investigator toolkit - end-user tools to work with data
Data wouldn't be submitted until after first use so efficiency gain is minimal for data originator
Tools planned for all stages of the data life cycle
Feedback mechanisms
- Scientist survey published in PLoS ONE
Progress to date
- Draft of architecture document
- Past prototype stage
- UCSB, U New Mexico, Oak Ridge - Coordinating Nodes
- Member Nodes - Dryad, KNB, Oak Ridge (aim to double over years)
- Investigator Toolkit - Morpho
DataOne Users Group - stakeholder inclusion (meetings co-located with ESIP)
Infrastructure Overview - Dave Vieglais
Data Model
- 3 granules of management
1) Data object (e.g., file)
2) Metadata
3) Resource Map - OAI-ORE document expressed in RDF (binds 1 and 2)
System metadata attached to each, from Member Node (e.g., file size, access rules)
- used for system checks
Data Package - objects 1-3 together, possibly hierarchical arrangement of data packages
System Metadata (some properties set by MNs (member nodes))
Identifier - MNs need to provide unique identifiers for each granule
fmtid - file description
size - file size
submitter - institution
Once data and metadata are submitted they cannot be changed; they can be deprecated (but are retained)
System metadata (Access rules) are alterable
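To keep the pieces straight, here is how I picture the data model as Python dataclasses. This is my own sketch, not DataONE's schema: the class names and the deprecated flag are made up, and only identifier, fmtid, size and submitter come from the talk. Stored objects are frozen (content cannot change) while the system metadata attached to them stays editable.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from typing import List


@dataclass
class SystemMetadata:
    identifier: str  # unique, assigned by the MN
    fmtid: str       # file description
    size: int        # file size in bytes
    submitter: str
    access_rules: List[str] = field(default_factory=list)  # alterable
    deprecated: bool = False  # hypothetical flag: deprecated but retained


@dataclass(frozen=True)
class StoredObject:
    content: bytes  # cannot be reassigned after submission
    sysmeta: SystemMetadata


@dataclass(frozen=True)
class DataPackage:
    data: StoredObject
    metadata: StoredObject
    resource_map: StoredObject  # binds data and metadata together


if __name__ == "__main__":
    sysmeta = SystemMetadata("foo.1.1", "text/csv", 3, "CN=DEMO1,DC=dataone,DC=org")
    obj = StoredObject(b"a,b", sysmeta)
    obj.sysmeta.access_rules.append("public")  # allowed: sysmeta is mutable
    try:
        obj.content = b"changed"  # rejected: the object itself is frozen
    except FrozenInstanceError:
        print("content is immutable")
```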
Functionality
Identifying Objects - fairly unrestricted, assigned by MNs
Identifying People - client side certificates for authentication
- CILogon - select own identity provider (e.g., home institution, google)
- Access defined by MNs (i.e., who can access which content)
- Data objects replicated among MNs, directed by CNs
- MN replication action is varied, resource requirements will vary
Content Discovery
- Through CNs
Coordinating Nodes
- Object tracking/replication mgmt
- Java J2EE web services on Tomcat
- Metacat and Mercury, Hazelcast
Member Node Implementation
Process options
1. Implement the APIs in the existing repository software (e.g., in Metacat)
2. Deploy gateway service
3. Deploy independent member node (synchronize with own repository)
Member Node Tiers
1. Read only, public content
2. Read only, access control
3. Read/Write use APIs
4. Operate as a replication target
Authentication and Authorization
3 identity types
Individual Subject
Group Subject
Special Subject - Public use, authenticated user, verified user
Users register with DataONE, CILogon identity registered on first use
CN - expose API for identity jobs
CILogon -> gives authentication certificate for access to data objects
InCommon, Protect Network, OpenID
SSL used for communication
Client -> Authenticate CILogon -> CILogon communicates to CN -> certificate to client -> Client request to MN (with certificate) -> MN fulfills request
Authentication requires release of real name and email
Access policies set by contributors
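A toy version of how an MN might evaluate a contributor's access policy against the three identity types. This is a simplified sketch of my own, not DataONE's implementation: the constant names are placeholders for the special subjects, and I assume every caller counts as public and every logged-in caller as an authenticated user.

```python
# Placeholder names for the special subjects mentioned in the talk.
PUBLIC = "public"
AUTHENTICATED = "authenticatedUser"
VERIFIED = "verifiedUser"


def effective_subjects(subject=None, groups=(), verified=False):
    """All subjects a caller acts as: special, individual and group subjects."""
    subjects = {PUBLIC}
    if subject is not None:
        subjects.add(AUTHENTICATED)
        subjects.add(subject)
        subjects.update(groups)
        if verified:
            subjects.add(VERIFIED)
    return subjects


def can_read(allowed_subjects, subject=None, groups=(), verified=False):
    """True if any subject the caller acts as appears in the access policy."""
    return bool(set(allowed_subjects) & effective_subjects(subject, groups, verified))


if __name__ == "__main__":
    policy = {"CN=lab-group"}  # hypothetical group subject
    print(can_read(policy))
    print(can_read(policy, subject="CN=bob", groups=["CN=lab-group"]))
```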
Details about pieces - Operation docs
Source code repository
Posted by kkwaiser at 11:32 AM
September 29, 2011
EIM 2011 - Sensors and Workflows Demonstration
Automating Data Processing and Quality Control using Workflow Software: Converting Sensor Data to Usable Environmental Information - W. Sheldon and J. Porter
Kepler
- Use of Kepler to split an R data-processing script into pieces. Advantages are graphical display of workflow and obvious specification of input parameters
- Actors as canned functions for internal processing or connecting to external software programs for processing.
GCE Data Toolbox
- Suite of custom scripts/GUIs generated for data management
- Metadata and provenance are kept with data during processing
- MATLAB-based
- Create and import metadata templates
Aside: with the potential addition of research scientists to the UMBS personnel we should explore the possibility of centralizing data management activities for their research (i.e., adopt an LTER-esque model.) Business-as-usual will see each scientist developing their own data management and QC routines which will make final deployment of metadata/data more time consuming for all parties.
Real-time event detection with Data Turbine
ESPER - Event Stream and Complex Event Processing for Java
RDV - Realtime Data Viewer
Sensors into DT which standardizes streams -> ESPER for event identification -> DT -> Applications
Discussion
Training
- Inter-LTER training planned at GCE
- No Kepler trainings planned
- Workshop at end of October (LTER) for sensor streaming
Posted by kkwaiser at 11:37 AM
September 28, 2011
EIM 2011 - Keynote
Building Communities, Partnerships, Tools, and Services in Order to Thrive in a Dynamic Information Landscape - Patricia Cruse
UC Curation Center - partnership of UC libraries and California Digital Library
Technical Approach
- no more monolithic systems
- Store data with SDSC (San Diego Supercomputer Center)
- Micro-Services - flexible, small and simple
Services
DMP Tool - Launch Sep 29th, 2011
- Connect researchers with resources
- streamline DMP creation
- Code *will* be opensource
- Directorate specific guidance (how often update this)
- Wizard or template
- 2 parts: Intellectual content and actual code will be released (didn't say when or what type of license, pretty much dodged the question)
EZID - long-term identifiers made easy
- create and manage persistent identifiers
- Credit data originators
- Link pubs to data
- Service available for purchase
Merritt (~DeepBlue)
- Data management system
- Modes of use:
- 1) Dark: preservation without access
- 2) Bright: preservation and end-user access
- 3) Back-end preservation - replication
- 4) Distributed data grid
WAS - Web Archiving Service
- Archiving websites and events
Digital Curation for Excel
- Create Add-on
- Summer 2012 first version
- Versioning, consistent headers
Posted by kkwaiser at 02:05 PM
EIM 2011 - Sensors and Workflows
Sensors and Workflows Session
Lifemapper, VisTrails and EML: Documented, Re-executable Species Distribution Models - CJ Grady et al
- archive of species distribution models
- opensource
- REST and OGC (web mapping) services.
- Use IPCC and GBIF data for inputs
- Process metadata -
- Use EML for describing models
- Two papers:
Repeatability and transparency in ecological research
Analytic webs support the synthesis of ecological data sets
Sensor lifecycle management using scientific workflows - Barseghian et al.
- Kepler software - schedule workflows, sensor data analysis
- Opensource, Java
- versioning and documentation tool
- 3 components
- Field (SPAN, Data logger), Server (Data Turbine, Kepler, MetaCat), Desktop (Kepler)
- Sensor configuration from desktop to logger/sensor
This was linked at the end but I'm not sure why:
- REAP (Realtime Environment for Analytical Processing) "is an NSF-funded cyberinfrastructure development project, focused on creating technology in which scientific workflows tools can be used to access, monitor, analyze and present information"
Archiving Sensor Data - Applied to Dam Safety Information - Barateiro et al.
SHAMAN project - information management system (workflow system)
Obligatory random website reference: TIMBUS - manage acquisition dependencies
Acronym checks
- OAIS reference model
- Business Process Execution Language (BPEL)
Coral sensor network at Racha Island, Thailand - Jaroensutasinee et al.
- Impetus was large-scale bleaching in 2010
- CREON project
- Met Station, Conductivity, Under water camera -> Data Turbine
- Historical NOAA data for the area show extreme peaks in surface water temperature
- No bleaching events post-monitoring
- Acronym check:
AIMS (Australian Institute for Marine Sciences)
Moving from Custom Scripts with Extensive Instructions to a Workflow System: Use of the Kepler Workflow Engine in Environmental Information Management - Gries and Porter
- Use of workflow system for basic information management application (i.e., data manipulation, QA/QC)
1. Scripting language (e.g., python, php)
2. Programs with scripting (e.g., R)
3. Workflow systems: Kepler, Taverna, Triana, VisTrails, Pegasus, BPEL
Chose Kepler because of packaged code/functions, documentation
Workflow at NTL -
Data file -> Format conversions and QA/QC -> parse into database
Sensor Data -> Data Turbine -> QA/QC ->
- Efficiently link scripts from different sources
- Large learning curve, very large data sets
Provenance and Quality Control in Sensor Networks - Lerner et al
Problem - sensor data has time gaps so use modeled values, need to identify modeled data
Use Little JIL for provenance tracking
- Exception handling is strong point
- Stage vs workflow oriented - either better explanation or more intuitive
- Data Derivation Graph (DDG) - articulates different QC procedures embarked upon based on original data (i.e., in range, out of range, NA)
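To make the idea concrete for myself: a sketch that sorts readings into the three QC branches (in range, out of range, NA) and fills gaps while flagging which values are modeled. This is not Little-JIL and not the authors' method; linear interpolation is just my stand-in for whatever model they use, and readings are assumed to be evenly spaced.

```python
import math


def classify(value, low, high):
    """Sort a reading into one of the three QC branches."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "NA"
    return "in range" if low <= value <= high else "out of range"


def fill_gaps(readings):
    """Fill None gaps by linear interpolation between the nearest neighbours.

    Returns (value, source) pairs; source is 'observed', 'modeled', or
    'missing' for gaps at either end that cannot be interpolated.
    """
    known = [i for i, v in enumerate(readings) if v is not None]
    filled = []
    for i, value in enumerate(readings):
        if value is not None:
            filled.append((value, "observed"))
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if not before or not after:
            filled.append((None, "missing"))
            continue
        a, b = before[-1], after[0]
        fraction = (i - a) / (b - a)
        estimate = readings[a] + fraction * (readings[b] - readings[a])
        filled.append((estimate, "modeled"))
    return filled


if __name__ == "__main__":
    for value, source in fill_gaps([1.0, None, 3.0, None]):
        print(value, source)
```

Keeping the source flag next to each value is the point: downstream users can always tell observed data from modeled data.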
Stream gage case study
- 15 min averages of stream characteristics, taken off data logger manually
- Move towards wireless, telemetry, automated QA/QC
Posted by kkwaiser at 11:26 AM
April 19, 2011
Potential DEIMS conference paper ideas
A few of us DEIMS types are discussing putting together a paper for an upcoming IM Conference. It's looking like time is a major constraint but I sent out a few ideas anyways. Here they are:
Beyond data management:
- Performance benchmarking - DEIMS performance before and after performance tuning with freely available tools (e.g., http://drupal.org/project/boost)
- Usability benchmarking - DEIMS usability for anonymous users, data contributors and admins before and after tuning with freely available tools (e.g., http://drupal.org/project/modalframe)
- Administrative tasking - extending an information management system to include tools for everyday administrative operations. A path toward a better-resourced, unified and sustainable system?
- Geo CMS - tools, challenges and opportunities for building an opensource GIS data management system within the DEIMS framework.
- The same but different - a survey of Drupal related data management activities around the globe with an identification of potential synergies. LTER, OBFS, EDIT, USGS, NCEAS. This is only possible when you start from a general purpose, opensource software package. Custom and/or proprietary code need not apply. Bam.
Posted by kkwaiser at 11:08 AM
March 14, 2011
Potential OBFS Birds of a Feather session
Title: Identifying challenges, solutions and collaborative opportunities for information management at OBFS sites
Information management at OBFS sites is a challenging endeavor for many reasons. This session will be an open-forum discussion of how information managers at OBFS sites are responding to these challenges with an eye toward sharing solutions and identifying future collaborative channels. Potential topics:
- Funding (or lack thereof) for information management
- Avenues for collaboration and skill-sharing among IMs
- Data diversity and data management challenges
- Facilitating researcher involvement in data management activities
- Establishing and enforcing Data Management and Data Access policies
- Building and deploying information management systems
- Leveraging external data management resources
- Meeting new NSF Data Management requirements
Posted by kkwaiser at 02:21 PM