
April 27, 2011

Ubuntu and Webalizer

This is why I continue to add to this blog (my work journal). A month or more ago I installed Webalizer on my development machine because random IP addresses from Romania, France, Poland, Russia, China and every country in between were hitting my machine. I went through the normal process (i.e., google, google, google) and, in no time, I had installed software that analyzes the server logs: Webalizer. I went back to my development machine today, decided to rerun a few of the reports and forgot how. In fact, for a time I even forgot which program I had installed. Was it Webalizer or LogWatch?

It was Webalizer. And here are the commands I need to remember:

The most basic report:
$ sudo webalizer -o stats /var/log/apache2/access.log

This breaks:
$ sudo webalizer -o stats /var/log/apache2/error.log

$ sudo webalizer -F CLF -o stats /var/log/apache2/error.log

$ sudo webalizer -F IIS -o stats /var/log/apache2/error.log





Comparing the March and April reports, I have gone from 16 to 5 IPs hitting my dev site. Of the five, I know three. I'm blocking the other two via the firewall:

$ sudo ufw deny from [ip-address]

This may be an amateurish way of securing a server but it seems to work as anomalies in my log files have dropped considerably. Bam.
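For a quick look at who is hitting the server without rerunning the full Webalizer report, a small shell helper along these lines might do. It is only a sketch: `top_ips` is a made-up name, and it assumes the Apache common/combined log format, where the client IP is the first whitespace-separated field on each line.

```shell
#!/bin/sh
# Sketch: count requests per client IP in an Apache access log.
# Assumption: common/combined log format (client IP is the first field).
# top_ips is a hypothetical helper name, not part of Webalizer or Apache.
top_ips() {
  awk '{print $1}' "$1" |     # keep only the client IP column
    sort |                    # group identical IPs together
    uniq -c |                 # prefix each IP with its request count
    sort -rn |                # busiest IPs first
    awk '{print $2, $1}'      # print as "ip count"
}

# Example (reading the log may need sudo, as with the webalizer commands):
#   top_ips /var/log/apache2/access.log | head -n 5
```

Any unfamiliar address near the top of that list would be a candidate for the `ufw deny` rule above.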

Posted by kkwaiser at 08:45 AM

April 21, 2011

Drupal SEO Followup

Over a month ago I installed PathAuto and XML Sitemap on the Research Gateway with the goal of increasing search engine referrals. Here are a few graphs which imply a positive correlation. A few bullet points:

- The trend before installation of the SEO modules was already toward more search engine referrals, so it is possible my efforts only expedited this trend.
- I spent about a day installing and configuring these modules.
- The number of URLs indexed does not indicate *which* URLs are indexed. I believe Google discards indexed URLs if they have low information content, and many of our pages probably fit that description. I don't expect our entire site (~11,000 URLs) to be indexed, ever. The count appears to have hit a plateau but I expect it to continue fluctuating.
- I never got the .htaccess rewrite rule (www.umbs.lsa.umich.edu -> umbs.lsa.umich.edu) to work properly. I hate .htaccess rule syntax.

(right-click > view image to embiggen)


Posted by kkwaiser at 11:23 AM

April 19, 2011

Potential DEIMS conference paper ideas

A few of us DEIMS types are discussing putting together a paper for an upcoming IM Conference. It's looking like time is a major constraint but I sent out a few ideas anyways. Here they are:

Beyond data management:
- Performance benchmarking - DEIMS performance before and after performance tuning with freely available tools (e.g., http://drupal.org/project/boost)
- Usability benchmarking - DEIMS usability for anonymous users, data contributors and admins before and after tuning with freely available tools (e.g., http://drupal.org/project/modalframe)
- Administrative tasking - extending an information management system to include tools for everyday administrative operations. A path toward a better-resourced, unified and sustainable system?

- Geo CMS - tools, challenges and opportunities for building an open source GIS data management system within the DEIMS framework.
- The same but different - a survey of Drupal-related data management activities around the globe with an identification of potential synergies. LTER, OBFS, EDIT, USGS, NCEAS. This is only possible when you start from a general-purpose, open source software package. Custom and/or proprietary code need not apply. Bam.

Posted by kkwaiser at 11:08 AM

April 05, 2011

On DOIs and Data

I am going to waste a great title on a boring post. One of my pie-in-the-sky hopes is to advance to the point where datasets within the Research Gateway are assigned Digital Object Identifiers (DOI). I know this would help to establish legitimacy in the eyes of our researchers but, honestly, I do not know much about the underpinnings of the DOI system. Here are a few pointers:

There is a DOI website, but you are better off starting at the Wikipedia page.

DataCite is a DOI service specifically for datasets (Wikipedia page).

If a DOI is a unique identifier, then my understanding is that the Handle System makes sure the DOI points to the right place (Wikipedia page).
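To make the "points to the right place" part concrete: a DOI is a prefix and a suffix separated by a slash, and a resolver redirects from a URL built on that string to wherever the object currently lives. Here is a rough sketch of that idea. The helper names `doi_to_url` and `doi_prefix` are made up, https://doi.org/ is the public resolver, and 10.1000/182 is used purely as an illustrative DOI.

```shell
#!/bin/sh
# Sketch: build the resolver URL for a DOI and pull out its prefix.
# doi_to_url and doi_prefix are hypothetical helper names for illustration.

# Build the URL that the public resolver redirects from.
doi_to_url() {
  doi=${1#doi:}                          # tolerate an optional "doi:" label
  printf 'https://doi.org/%s\n' "$doi"
}

# Everything before the first slash is the prefix.
doi_prefix() {
  doi=${1#doi:}
  printf '%s\n' "${doi%%/*}"
}

# Example:
#   doi_to_url 10.1000/182    prints https://doi.org/10.1000/182
#   doi_prefix 10.1000/182    prints 10.1000
```

The point is that the identifier stays fixed while the target of the redirect can be updated, which is what would make a DOI safer to cite than a plain URL.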

Questions to answer:

Q: Is the University of Michigan associated with DataCite at all? Is DeepBlue?

A: Yes, ICPSR is a member of DataCite and contributed datasets receive a DOI. I don't think DeepBlue has this capability but I asked them anyways. They should.

Q: How complicated and expensive would an automated DOI registration system be?

Q: How complicated and expensive would a manual DOI registration system be?

Posted by kkwaiser at 01:29 PM

Dryad and DSpace

It seems everyone has their own data management solution these days. Dryad is an example of a larger system designed to serve as a repository for data from peer-reviewed articles. It is built on an open source platform called DSpace. Sure would be nice if they had a sandbox available.





Sure, there's the DSpace demo site.

See below the fold for more of my IRC chat.

< kbk1> Hi there. Is DSpace completely custom code or is it built on top of existing code? I just watched a preview video and the theme of the demo site recalled wordpress.
< kbk1> From what I am seeing so far, it looks as if it is built from the ground-up.
< tdonohue> Hi kbk1. DSpace is completely custom code. It was initially built by MIT and Hewlett Packard back in 2002, and since then was open sourced and is community maintained code.
< td> (so, it actually pre-dates WordPress, by about a year, I believe)
< kbk1> I don't suppose there is a sandbox available? I've worked with a few field stations to build an information management system on Drupal and am curious about what other solutions look like.
< td> There is a sandbox/demo site. It's at http://demo.dspace.org
< kompewter> [ DSpace 1.7.0 Demonstration Repository ] - http://demo.dspace.org
< kbk1> Awesome!
< td> From there, you'd want to visit *either* the XMLUI (XML-based UI) or JSPUI (JSP-based UI). Those are the two offered UIs for DSpace.
< td> If you visit one of those UIs, there are actual sample logins provided on the homepage (e.g. read the intro text of the XMLUI, which provides you with sample logins to the demo system: http://demo.dspace.org/xmlui )
< kompewter> [ Community List ] - http://demo.dspace.org/xmlui
< kbk1> I will take a look. I found out about DSpace while reading a paper which mentioned Dryad - which is built on DSpace. Dryad claims the ability to assign DOI's to datasets. Is this functionality within DSpace?
< td> kbk1: DSpace does not specifically assign DOIs by default. But, it will assign Handles (http://handle.net/), which are a part of the DOI system -- see: http://en.wikipedia.org/wiki/Digital_object_identifier
< kompewter> [ Digital object identifier - Wikipedia, the free encyclopedia ] - http://en.wikipedia.org/wiki/Digital_object_identifier
< td> PeterDietz -- That's fine. A GSoC project need not cover all UIs, to be honest. It could be specific to one UI. As of 1.7, we no longer have complete "UI parity", so it doesn't matter if we scoped around XMLUI (as long as the project didn't do anything that would "break" another UI)
< td> PeterDietz -- that being said, if we still had concerns about the project, or no interested mentors, we could 'pull' the project and suggest the student look at other GSoC projects.
< td> kbk1: also if you have specific questions about the Dryad project, one of it's repository programmers, ks, is currently "lurking" in this chat channel. So you might want to ask him, if he's got time.
< ksclarke> yep, I'm around
< ks> DOI assignment takes place through a module Dryad has developed and we're working to make it work within the identity services module atmire is developing (maybe moving our code into that, eventually, if it sees uptake in the dspace community)
< ks> our DOIs have a particular form (meaning embedded in them, unfortunately (imo), related to our modeling of data packages and data files) so are not completely generic like a regular dspace module should be
< kbk1> Thanks ks. If I could add one feature to our IMS it would be DOI assignment. I think it would really encourage researcher buy-in.
< kbk1> I like the workflow for item submission on the demo site.
< ks> I did notice recently that the ezid service (that CDL provides and that we use to register) allows minting now in addition to registering... it might not be much work to build something over that for data-centric DOI registration
< ks> yes, we chose DOIs over other schemes because we though the buy-in would be more significant for that reason
< ks> we thought
< kbk1> Did you think correctly?
< ks> we are seeing uptake in our submissions; we're also working with journals though who are now requiring deposition in a data repository like dryad -- so we can't (I don't think) tease apart what's contributing to our growth
< ks> and I'm not front lines (interacting with folks) so I don't have much ancedotal evidence
< ks> I could ask our curator though so see if she's had feedback about the assignment of DOIs
< ks> I know our workflow passes the DOI back to the submittor so that it can be included in their article
< ks> so we're definitely presenting it as a selling point ("here is your DOI for your data package so people can reference you")
< kbk1> Right, the paper by Vision (BioScience 2010) was the only one I've found thus far that explicitly advocated a DOI for datasets. I work with a lot graduate students and would like to tell them to add Contributed Datasets to their CV's.
< ks> Yeah, that would be great! My believe is having a DOI will do more to encourage that... I'll ask our curator
< ks> my belief
< ks> arg, can't type today
< kbk1> But I can't tell them to do that if it is not a more widely applied practice.
< ks> Yes, I understand... we're definitely hoping to encourage people in that direction but it's not a widely applied practice yet
< kbk1> Either way, if we do move in this direction it may be helpful to talk with someone who has been there. Is IRC your preferred forum?
< ks> IRC works for me but I'll only be with the project for about another month (moving on to something else); you could join the Dryad mailing list... it's low volume and questions, etc. are welcome there: https://lists.nescent.org/mailman/listinfo/wg-digitaldata
< kompewter> [ Wg-digitaldata Info Page ] - https://lists.nescent.org/mailman/listinfo/wg-digitaldata
< ks> I believe that's the open list
< kbk1> Thanks. I just joined. If you are interested, here is the Drupal-based IMS I have built: http://umbs.lsa.umich.edu/research/
< kompewter> [ University of Michigan Biological Station ] - http://umbs.lsa.umich.edu/research/
< ks> there is also a dryad-dev list that's intended to be for dev-focused discussion: http://groups.google.com/group/dryad-dev?pli=1 (also low volume... though we're trying to use it more)
< kompewter> [ dryad-dev | Google Groups ] - http://groups.google.com/group/dryad-dev?pli=1
* ks goes to take a look
< ks> nice, my last place of work was moving towards drupal
< ks> this looks like a nice example of what can be done with it
< ks> btw, just had a colleague tell me the dryad-dev list is intended to be the open public list for discussions, etc.
< kbk1> So favor that over the nescent list?
< ks> yes, I guess so
< ks> I'm surprised you were able to join if the other is not intended to be open but perhaps they were hoping for security through obscurity
< ks> which I've now foiled
< kbk1> Oops. Let the spamming commence.
< kbk1> I'm off to lunch but I appreciate the chance to chat. Looks like you guys are up to good.

Posted by kkwaiser at 10:47 AM

April 04, 2011

Notes on Scratchpads

The proposal idea I sent to the UM Herbarium involves a natural history application of Drupal called Scratchpads. I am really not big on the name but here are a few notes from a paper they published:

- Sandbox site with login

- List of all modules used by scratchpads.

- Paper: Scratchpads: a data-publishing framework to build, share and manage information on the diversity of life


Left And Right, a module for revamping large taxonomies. You must replace the "taxonomy.module"

Can take in taxonomic vocabs from EOL. "This service supplies terms and associated metadata (authority, rank and synonymy) in an RDF representation of the Taxonomic Concept Schema." See below for screenshots.


Character project - didn't expect this, don't know what it is. Associate matrix data with taxa nodes? Here's a better writeup and a screenshot:

iSpecies Cache - custom module (?) for caching data from web services

Location and Specimen -> we've known about these for a while. DarwinCore implementation for collections.

Phylogenetic tree - Display widget for improved viewing of large taxonomic trees

Auto Tagging - uses the drupa module

EOL's classification module - "Written by EOL, this improves the management of taxonomic classifications."


Screenshots of the taxonomic import screen from Scratchpads Sandbox site:
- I've requested information on the module dependencies for this functionality.
- Once you log in with the test information, create an empty vocabulary and go to Import.

(right-click > view image to embiggen)

And here's the product:

(right-click > view image to embiggen)

Once you populate a taxonomy, the Scratchpad will create a page of aggregated information for each term. For example, here is the page for Amorpha canescens. Note that I created a specimen record for Amorpha canescens (a plant I am familiar with from my bee-studying days).

(right-click > view image to embiggen)

Posted by kkwaiser at 12:53 PM

Combining PDFs with pdftk on Ubuntu

O'Reilly nails this one:

To combine pages into one document, invoke pdftk like so:

pdftk <input PDF files> cat [<input PDF pages>] output <output PDF filename>

A couple of quick examples give you the flavor of it. Here is an example of combining the first page of in2.pdf, the even pages in in1.pdf, and then the odd pages of in1.pdf to create a new PDF named out.pdf:

pdftk A=in1.pdf B=in2.pdf cat B1 A1-endeven A1-endodd output out.pdf

My usage:

$ pdftk A=Desktop/ebb-flow-march-2011-receipt.pdf B=payhub-claim-2010_page3.pdf cat A1 B1 output Desktop/darling_rocks.pdf

Posted by kkwaiser at 12:28 PM

Something to look forward to...

When your Monday begins by finding an inexplicable bug that previously didn't exist, you worry. When updating to the most recent version of your modules solves the problem, you rejoice.

Posted by kkwaiser at 10:29 AM