
September 30, 2011

DataOne Workshop - Investigator ToolKit

Investigator Toolkit Overview - Chris Jones

...getting tired. quality of notes falling rapidly...

irc.ecoinformatics.org
#dataone

DataONE drive - mount dataone network on desktop


Visualization Software

FragStats
GRASS/ESRI
EstimateS - http://viceroy.eeb.uconn.edu/estimates

Posted by kkwaiser at 07:05 PM | Comments (0) | TrackBack

DataOne Workshop - Member Nodes

Member Node Information

Current institutions interested in membership

Posted by kkwaiser at 06:22 PM | Comments (0) | TrackBack

DataOne Workshop - Installation Instructions

Installation notes
------------------

data - example data set
d1_common_python - types and service methods
d1_libclient_python - library of utility methods for calling d1 common
d1_client_cli - command line client (provides the d1 command used below)


Prerequisites
=============

0. Ubuntu 10.04 stock with patches (installed from the Ubuntu Server CD, with OpenSSH Server selected at the Software Selection screen)
1. Java
-- add deb "http://archive.canonical.com/ lucid partner" to
/etc/apt/sources.list
$ sudo aptitude update
$ sudo aptitude install sun-java6-jdk
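-- A quick sanity check that the Sun JDK is the default (if OpenJDK is also installed, update-java-alternatives may be needed); expect the output to mention Java 1.6:
$ java -version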

2. Install certificates
-- Copy certificate files to /etc/ssl/certs
$ sudo cp dataone*.crt /etc/ssl/certs
$ sudo cp test_dataone_org.crt /etc/ssl/certs
$ sudo cp cilogon-*pem /etc/ssl/certs
$ sudo c_rehash /etc/ssl/certs
-- Copy private key to /etc/ssl/private
$ sudo cp test_dataone_org.nopassword.key /etc/ssl/private
-- Add certs to Java keystore
$ cd /usr/lib/jvm/java-6-sun/jre/lib/security
$ sudo keytool -import -alias DataOneCA -keystore ./cacerts -file /etc/ssl/certs/dataone-ca.crt
$ sudo keytool -import -alias DataOneTestCA -keystore ./cacerts -file /etc/ssl/certs/dataone-test-ca.crt
$ sudo keytool -import -alias CILogonSilver -keystore ./cacerts -file /etc/ssl/certs/cilogon-silver.pem
$ sudo keytool -import -alias CILogonBasic -keystore ./cacerts -file /etc/ssl/certs/cilogon-basic.pem
$ sudo keytool -import -alias CILogonOpenID -keystore ./cacerts -file /etc/ssl/certs/cilogon-openid.pem
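-- To confirm the imports took, keytool can list the keystore (a quick check, assuming the default Java keystore password "changeit"):
$ sudo keytool -list -keystore ./cacerts -storepass changeit | grep -i -e dataone -e cilogon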

3. Tomcat 6
$ sudo aptitude install tomcat6
-- Edit /etc/tomcat6/server.xml to enable the AJP connector on 8009
$ sudo /etc/init.d/tomcat6 restart
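For reference, the AJP connector in the stock Tomcat 6 server.xml ships commented out; the element to enable looks roughly like this (attributes may differ slightly by install):
  <Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />
$ grep -n 'AJP/1.3' /etc/tomcat6/server.xml   ## locate the connector element in the file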

4. Apache
$ sudo aptitude install apache2 libapache2-mod-jk
Modify metacat workers.properties to point at Java and Tomcat, then:
$ sudo cp -i debian/jk.conf /etc/apache2/mods-available/
$ sudo cp -i debian/workers.properties /etc/apache2/
$ sudo a2dismod jk
$ sudo a2enmod jk
$ sudo a2enmod rewrite
$ sudo a2enmod ssl
$ sudo cp -i debian/knb-ssl /etc/apache2/sites-available/
$ sudo a2dissite 000-default
-- Modify knb and knb-ssl to fit the local host
$ sudo a2ensite knb
$ sudo a2ensite knb-ssl
$ sudo /etc/init.d/apache2 restart

5. Subversion
$ sudo aptitude install subversion

6. Set up user account
$ sudo adduser demo

7. Install ant
$ sudo apt-get install --no-install-recommends ant

8. Install maven2
$ sudo aptitude install maven2

9. Postgres
$ sudo aptitude install postgresql
Add "host metacat metacat 127.0.0.1/32 password" to pg_hba.conf

10. Create LDAP account
Via KNB web site, username = d1demo

11. Curl
$ sudo aptitude install curl

12. Python libraries
$ sudo aptitude install python-setuptools
$ sudo aptitude install python-dateutil
$ sudo aptitude install python-lxml
$ sudo easy_install PyXB
$ sudo easy_install minixsv
$ sudo aptitude install python-argparse python-argparse-doc
-- Also install the DataONE Python client libraries
$ cd d1_common_python
$ sudo python setup.py develop
$ cd ../d1_libclient_python
$ sudo python setup.py develop
$ echo "alias d1=~/d1_client_cli/src/d1_client_cli/dataone.py" >> ~/.bashrc

13. R system
$ sudo aptitude install r-base-core
$ sudo R CMD javareconf
$ R
> install.packages("rJava")
> q()
$


Metacat install
----------------
0. Set up postgres
$ sudo -s
# su - postgres ##switch to postgres user
$ createdb metacat ##empty postgres database
$ psql metacat ##login to metacat database
## create db user
> CREATE USER metacat WITH UNENCRYPTED PASSWORD 'metacat';
> \q
$ exit
# /etc/init.d/postgresql-8.4 restart
# exit

1. Create metacat storage directory
$ sudo mkdir -p /var/metacat/
$ sudo chown -R tomcat6 /var/metacat ##recursively give the tomcat user ownership of the directory

2. Servlet installation
$ cd metacat-1.10.0-snapshot10 ##dev snapshot of metacat
$ sudo cp knb.war /var/lib/tomcat6/webapps/ ##java web files to web server
$ sudo cp geoserver.war /var/lib/tomcat6/webapps/
$ sudo /etc/init.d/tomcat6 restart
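A quick check that Tomcat unpacked the war (8080 is the default Tomcat HTTP port; expect a 200 or a redirect once deployment finishes):
$ curl -sI http://localhost:8080/knb/ | head -n 1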

3. Configure metacat

-- Open Metacat site in browser
-- https://demoX.test.dataone.org/knb/
-- admin is: uid=d1demo,o=unaffiliated,dc=ecoinformatics,dc=org
- "Metacat Administrator"
## To create own admin - set up own LDAP server or use KNB
-- Global properties
-- Set Database user/pw to metacat/metacat
-- set Context to knb
-- DataONE section
Node name: Demonstration Node 1
Node ID: DEMOX -- for example, 'DEMO1', ##unique and persistent identifier
Node Subject: CN=DEMOX, DC=dataone, DC=org
-- Note that this automatically registers as a MN
## Node account representing the MN during authentication actions
## True = CNs should approach MNs

-- Restart tomcat
$ sudo /etc/init.d/tomcat6 restart

####################################
####################################
####################################

Run a few DataONE Web services

#refer to http://mule1.dataone.org/ArchitectureDocs-current/ for API
------------------------------
https://demoX.test.dataone.org/knb/d1/mn/v1/node ##our MN
https://cn-dev.dataone.org/cn/v1/node ##all nodes
https://demoX.test.dataone.org/knb/d1/mn/v1/object
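The same endpoints can also be hit with curl from the shell; a couple of examples (-k skips certificate verification, which may be needed with the test certificates):
$ curl -k https://demoX.test.dataone.org/knb/d1/mn/v1/node     ## node document for our MN
$ curl -k https://demoX.test.dataone.org/knb/d1/mn/v1/object   ## list of objects on the MN
$ curl -k https://cn-dev.dataone.org/cn/v1/node                ## all registered nodes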

Logon to CILogon
----------------
1. Visit: https://cilogon.org/?skin=DataONE
2. Choose your provider and log in
-- Likely: LTER, Google, or Protect Network
2a. If you don't have an account, create one
-- Either Google or Protect network
3. Note the name of the certificate file downloaded to your machine
4. Note: there is a preinstalled certificate on your demo machine

Insert data files and metadata files
------------------------------------
First set some defaults for client operation:

$ cd ~
$ d1 \
--mn-url https://demo2.test.dataone.org/knb/d1/mn/v1 \
--cn-url https://cn-dev.dataone.org/cn/v1 \
--dataone-url https://cn-dev.dataone.org/cn/v1 \
--sysmeta-submitter "CN=DEMO2,DC=dataone,DC=org" \
--sysmeta-rightsholder "CN=DEMO2,DC=dataone,DC=org" \
--sysmeta-origin-member-node DEMO2 \
--sysmeta-authoritative-member-node DEMO2 \
--sysmeta-access-policy-public \
--cert-path /etc/dataone/client/certs/DEMO2.pem \
--key-path /etc/dataone/client/certs/DEMO2.pem \
--fields "pid,origin_mn,datemodified,size,objectformat,title" \
--query "*:*" \
--store-config


Now add one data object:

$ d1 \
--sysmeta-object-format text/csv \
--sysmeta-access-policy-public \
create foo.1.1 data/data-sites.csv

List objects on the node now
----------------------------
https://demoX.test.dataone.org/knb/d1/mn/v1/object
$ d1 --mn-url https://demoX.test.dataone.org/knb/d1/mn/v1 list


View system metadata
--------------------
https://demoX.test.dataone.org/knb/d1/mn/v1/meta/foo.1.1
$ d1 \
--dataone-url https://demoX.test.dataone.org/knb/d1/mn/v1 \
meta foo.1.1

Get the object from the MN
--------------------------
https://demoX.test.dataone.org/knb/d1/mn/v1/object/foo.1.1
$ d1 \
--dataone-url https://demoX.test.dataone.org/knb/d1/mn/v1 \
get foo.1.1


Insert two more objects -- data and EML
---------------------------------------
$ d1 \
--sysmeta-object-format text/csv \
--sysmeta-access-policy-public \
create foo.2.1 data/data-samples.csv

$ d1 \
--sysmeta-object-format eml://ecoinformatics.org/eml-2.0.1 \
--sysmeta-access-policy-public \
create foo.3.1 data/eml-metadata.xml

Test that the node passes tests
--------------------------------
1. Visit: http://mncheck.test.dataone.org:8080/MNWebTester
2. Enter MN Base URL: https://demoX.test.dataone.org/knb/d1/mn

List the objects on the node
----------------------------
https://demoX.test.dataone.org/knb/d1/mn/v1/object
$ d1 --mn-url https://demoX.test.dataone.org/knb/d1/mn/v1 list

Show synchronization has happened
---------------------------------
https://cn-dev.dataone.org/cn/v1/object
https://cn-dev.dataone.org/cn/v1/resolve/foo.3.1
https://cn-dev.dataone.org/cn/v1/meta/foo.3.1


Search for data locally on Metacat
----------------------------------
Visit: https://demoX.test.dataone.org/knb/
Search for: %

Search for data on the D1 Index
-------------------------------
Using the CLI:

$ d1 --query "origin_mn:DEMOX" search

List fields available for searching:
## fields indexed out of any metadata standard
$ d1 fields

A couple more searches:

$ d1 --query "barnacle" search

$ d1 --query "origin_mn:DEMOX AND objectformat:text/csv" search

Through a web interface:
http://cn-dev.dataone.org/solr/search.html

Through the Mercury search interface (operating on older CN deployment):
http://cn.dataone.org/mercury3/

Posted by kkwaiser at 02:33 PM | Comments (0) | TrackBack

DataOne Workshop - Morning

Technical considerations to make entry smoother
Resource dedication on UMBS behalf
Policies and practices
Advantages for researchers and UMBS

DataONE - overview - Amber Budden

- Remove inefficiencies related to data handling for researchers

Three components
Member Nodes, Coordinating Nodes, Investigator ToolKit

1) Cyberinfrastructure
- Member nodes that implement DataONE's software stack
- Coordinating nodes - catalog/indexer of member node content
- Investigator toolkit - end-user tools to work with data

Data wouldn't be submitted until after first use, so the efficiency gain is minimal for the data originator

Tools planned for all stages of the data life cycle

Feedback mechanisms
- Scientist survey published in PLoS ONE

Progress to date
- Draft of architecture document
- Past prototype stage
- UCSB, U New Mexico, Oak Ridge - Coordinating Nodes
- Member Nodes - Dryad, KNB, Oak Ridge (aim to double over years)
- Investigator Toolkit - morpho

DataOne Users Group - stakeholder inclusion (meetings co-located with ESIP)

Infrastructure Overview - Dave Vieglais

Data Model

- 3 granules of management
1) Data object (e.g., file)
2) Metadata
3) Resource Map - OAI-ORE, expressed in RDF (binds 1 and 2)

System metadata attached to each, from Member Node (e.g., file size, access rules)
- used for system checks

Data Package - objects 1-3 together, possibly hierarchical arrangement of data packages

System Metadata (some properties set by MNs (member nodes))

Identifier -- MNs need to provide unique identifiers for each granule
fmtid - object format identifier
size - file size
submitter - institution

Once data and metadata are submitted they cannot be changed; they can be deprecated (but retained)
System metadata (Access rules) are alterable

Functionality

Identifying Objects - fairly unrestricted, assigned by MNs
Identifying People - client side certificates for authentication
- CILogon - select own identity provider (e.g., home institution, google)
- Access defined by MNs (i.e., who can access which content)
- Data objects replicated among MNs, directed by CNs
- MN replication action is varied, resource requirements will vary

Content Discovery
- Through CNs

Coordinating Nodes
- Object tracking/replication mgmt
- Java J2EE web services on Tomcat
- Metacat and Mercury, Hazelcast

Member Node Implementation
Process options
1. Implement the APIs within an existing repository implementation (e.g., in Metacat)
2. Deploy gateway service
3. Deploy independent member node (synchronize with own repository)

Member Node Tiers
1. Read only, public content
2. Read only, access control
3. Read/write via the APIs
4. Operate as a replication target

Authentication and Authorization

3 identity types
Individual Subject
Group Subject
Special Subject - Public use, authenticated user, verified user

Users register with DataONE; the CILogon identity is registered on first use

CN - expose API for identity jobs
CILogon -> gives authentication certificate for access to data objects

InCommon, Protect Network, OpenID

SSL used for communication

Client -> Authenticate CILogon -> CILogon communicates to CN -> certificate to client -> Client request to MN (with certificate) -> MN fulfills request
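The last hop of that flow, sketched with curl (the certificate path is just an example of where CILogon-style proxy certificates often land; substitute the file actually downloaded, and note it must contain both the certificate and key):
$ curl --cert /tmp/x509up_u1000 https://demoX.test.dataone.org/knb/d1/mn/v1/object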

Authentication requires release of real name and email

Access policies set by contributors

Details about pieces - Operation docs

ArchitectureDocs

Source code repository

Posted by kkwaiser at 11:32 AM | Comments (0) | TrackBack

September 29, 2011

EIM 2011 - EML Checking

EML Congruency Checker - O'Brien and Servilla

IM Working Group WebPage

Services and Libraries w/checker
Future directions

List of 35 features that a "checker" should evaluate and report on
- More to come

0.1 checks:
1. Data URL is valid (pass/fail)
2. Display data from the URL
3. Database table can be generated (from attribute list, pass/fail)
4. Load data to table with SQL (pass/fail)
5. Compare number of rows loaded to number specified in metadata
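A rough shell sketch of what checks 1, 3, 4, and 5 amount to, assuming a CSV data file, a scratch Postgres database, and a row count recorded in the metadata (the DATA_URL variable, database, table, and column names here are made up for illustration):
$ curl -s -f -o data.csv "$DATA_URL" || echo "check 1 FAIL: data URL not reachable"
$ psql eml_check -c "CREATE TABLE scratch (site text, value numeric);"   ## check 3: table built from the attribute list
$ psql eml_check -c "\copy scratch FROM 'data.csv' WITH CSV HEADER"      ## check 4: load the data
$ psql eml_check -t -c "SELECT count(*) FROM scratch;"                   ## check 5: compare to the row count in the EML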

Future
- Enhancements to benefit non-LTER sites
- Configurability of checks/flags/actions

EML Parser through KNB

John Porter has a TFRI (??) and an R script for checking EML quality

EML Best Practices - Resource to check out for suggestions

Sven's script to check congruency of his EML docs against the EML Congruency Checker

TFRI EML validity checker

John Porter
EML Document to a statistical program: Web service

Create a Statistical Program from an Ecological Metadata Language (EML) document

Posted by kkwaiser at 04:33 PM | Comments (0) | TrackBack

EIM 2011 - Sensors and Workflows Demonstration

Automating Data Processing and Quality Control using Workflow Software: Converting Sensor Data to Usable Environmental Information - W. Sheldon and J. Porter

Kepler
- Use of Kepler to split an R data-processing script into pieces. Advantages are graphical display of the workflow and obvious specification of input parameters

- Actors as canned functions for internal processing or connecting to external software programs for processing.

GCE Data Toolbox

- Suite of custom scripts/gui's generated for data management
- Metadata and provenance are kept with data during processing
- MatLab-based
- Create and import metadata templates

Aside: with the potential addition of research scientists to the UMBS personnel we should explore the possibility of centralizing data management activities for their research (i.e., adopt an LTER-esque model.) Business-as-usual will see each scientist developing their own data management and QC routines which will make final deployment of metadata/data more time consuming for all parties.

Real-time event detection with Data Turbine

ESPER - Event Stream and Complex Event Processing for Java

RDV - Realtime Data Viewer

Sensors feed into DT, which standardizes streams -> ESPER for event identification -> DT -> Applications

Discussion

Training
- Inter-LTER training planned at GCE
- No kepler trainings planned
- Workshop at end of October (LTER) for sensor streaming

Posted by kkwaiser at 11:37 AM | Comments (0) | TrackBack

September 28, 2011

EIM 2011 - Semantics and Data Management

CI-Server Framework: Cyber-Infrastructure Over the Semantic Web - Gandara et al.

- Built on Drupal
- Cyber-Share website

CI-Server Framework has three primary goals:

1. to enable information sharing by providing tools that scientists can use within their scientific research to process data, publish and share artifacts
2. to build community by providing tools that support building and viewing discussions between scientists about artifacts used or created through scientific processes
3. to leverage the knowledge collected within the artifacts and scientific collaborations to support scientific discoveries.

Using Semantic Metadata for Discovery and Integration of Heterogeneous Ecological Data - Leinfelder et al.

- term extension (synonyms)
- semantically annotated data packages

OBOE Data Model - used to describe observational data

A Semantically-Enabled Provenance-Aware Water Quality Portal - McGuinness et al

Harvest water quality data from USGS/EPA

- Virtuoso
- PML - Proof Markup Language

Geospatial Data Management for Ecological Research Organizations - Valentine, Skibbe and Hollingsworth

- A lot of people are using a file-based approach for management; geodatabases (PostGIS) are second.

- Trout Lake runs a PostGIS with GeoServer

GeoServer, OpenLayers

Simple GEO

There don't seem to be any viable third-party spatial data management solutions available. Someone mentioned something about Corvallis.

Posted by kkwaiser at 04:42 PM | Comments (0) | TrackBack

EIM 2011 - Keynote

Building Communities, Partnerships, Tools, and Services in Order to Thrive in a Dynamic Information Landscape - Patricia Cruse

UC Curation Center - partnership of the UC libraries and the California Digital Library

Technical Approach
- no more monolithic systems
- Store data with SDSC (San Diego SuperComputer Center)
- Micro-Services - flexible, small and simple

Services

DMP Tool - Launch Sep 29th, 2011
- Connect researchers with resources
- streamline DMP creation
- Code *will* be opensource
- Directorate-specific guidance (how often is this updated?)
- Wizard or template

- 2 parts: Intellectual content and actual code will be released (didn't say when or what type of license, pretty much dodged the question)

EZID - long-term identifiers made easy
- create and manage persistent identifiers
- Credit data originators
- Link pubs to data
- Service available for purchase

Merritt (~DeepBlue)
- Data management system
- Modes of use:
- 1) Dark: preservation without access
- 2) Bright: preservation and end-user access
- 3) Back-end preservation - replication
- 4) Distributed data grid

WAS - Web Archiving Service
- Archiving websites and events

Digital Curation for Excel
- Create Add-on
- Summer 2012 first version
- Versioning, consistent headers

UC3 Webinars

Posted by kkwaiser at 02:05 PM | Comments (0) | TrackBack

EIM 2011 - Sensors and Workflows

Sensors and Workflows Session

Lifemapper, VisTrails and EML: Documented, Re-executable Species Distribution Models - CJ Grady et al
- archive of species distribution models
- opensource
- REST and OGC (web mapping) services.

- Use IPCC and GBIF data for inputs
- Process metadata -
- Use EML for describing models
- Two papers:

Repeatability and transparency in ecological research
Analytic webs support the synthesis of ecological data sets

Sensor lifecycle management using scientific workflows - Barseghian et al.

- Kepler software - schedule workflows, sensor data analysis
- Opensource, Java
- versioning and documentation tool

- 3 components
- Field (SPAN, Data logger), Server (Data Turbine, Kepler, MetaCat), Desktop (Kepler)
- Sensor configuration from desktop to logger/sensor

This was linked at the end but I'm not sure why:
- REAP (Realtime Environment for Analytical Processing) "is an NSF-funded cyberinfrastructure development project, focused on creating technology in which scientific workflows tools can be used to access, monitor, analyze and present information"

Archiving Sensor Data - Applied to Dam Safety Information - Barateiro et al.

SHAMAN project - information management system (workflow system)

Obligatory random website reference: TIMBUS - manage acquisition dependencies

Acronym checks
- OAIS reference model
- Business Process Execution Language (BPEL)

Coral sensor network at Racha Island, Thailand - Jaroensutasinee et al.

- Impetus was large-scale bleaching in 2010
- CREON project
- Met Station, Conductivity, Under water camera -> Data Turbine
- Historical NOAA data for the area show surface water temperatures with extreme peaks
- No bleaching events post-monitoring

- Acronym check:
AIMS (Australian Institute of Marine Science)

Moving from Custom Scripts with Extensive Instructions to a Workflow System: Use of the Kepler Workflow Engine in Environmental Information Management - Gries and Porter

- Use of workflow system for basic information management application (i.e., data manipulation, QA/QC)

1. Scripting language (e.g., python, php)
2. Programs with scripting (e.g., R)
3. Workflow systems: Kepler, Taverna, Triana, VisTrails, Pegasus, BPEL

Chose Kepler because of packaged code/functions, documentation
Workflow at NTL -
Data file -> Format conversions and QA/QC -> parse into database
Sensor Data -> Data Turbine -> QA/QC ->

- Efficiently link scripts from different sources
- Large learning curve, very large data sets

Provenance and Quality Control in Sensor Networks - Lerner et al

Problem - sensor data has time gaps, so modeled values are used to fill them; need to identify modeled data

Use Little JIL for provenance tracking
- Exception handling is strong point
- Stage vs workflow oriented - either better explanation or more intuitive
- Data Derivation Graph (DDG) - articulates different QC procedures embarked upon based on original data (i.e., in range, out of range, NA)

Stream gage case study
- 15 min averages of stream characteristics, taken off data logger manually
- Move towards wireless, telemetry, automated QA/QC

Posted by kkwaiser at 11:26 AM | Comments (0) | TrackBack

September 23, 2011

Logging and communicating file downloads

I had considered a feature freeze for the Research Gateway to keep the upgrade path to D7 as clear as possible, but this may not hold.

Here is one I am thinking to add:

When a datafile is downloaded, email the data originator to let them know about it, possibly including email and purpose information from the downloader.

The collection of downloader information (e.g., email, institution, purpose) can be done in a number of ways.

1) Have them create a full account
2) The first time they go to view a data set or list of data sets, collect the needed information and store it in a cookie. Anytime they download a file, email the data originator. Q: What happens if cookies are disabled?
3) Upon clicking a file link, interrupt with a popup screen asking for downloader information. Potentially store this in a cookie or ask for it every time a file is downloaded. Either way, auto-email the information to the originator.

Relevant modules:

Scant findings thus far.

Email Download - not compatible with CCK fields, otherwise it does everything needed.

And that's it - although it may work to come up with a custom solution using CCK and Rules.

Posted by kkwaiser at 04:22 PM | Comments (0) | TrackBack

September 19, 2011

Multi-step conditional forms in Drupal

Collection of notes and links relating to a potential project of unspecified character.

Web forms in Drupal - Rules/CCK vs Webform module

Multistep registration form in Drupal 6 - lacks the term "conditional"

Multistep Module - "Multistep adds multiple-step functionality to content type editing forms." May not play well with Conditional Fields module.

Pageroute Module - similar to Multistep?

Multistep in D7 -


To summarize, the options appear to be multiple content types with rules that direct the user between forms OR the webform module - I'm not seeing examples of a multi-step conditional content type (node/add) form.

Posted by kkwaiser at 12:52 PM | Comments (0) | TrackBack