Open Software for Restricted Data: an Environmental Epidemiology example

Table of Contents

1 Copyright

This work is licensed under the GNU General Public License version 2, WITHOUT ANY WARRANTY. Ivan C Hanigan <ivan.hanigan@gmail.com>

2 Abstract

Ivan Hanigan 1, Steven McEachern 2, David Fisher 3.

  • 1 National Centre for Epidemiology and Population Health, Australian National University
  • 2 Australian Data Archive, Australian National University
  • 3 Information Technology Services, Australian National University

Last Updated 2nd December 2015

Increasing concerns over privacy in Australia and globally, combined with the risks of hacking and accidental release of large-scale data sets, are leading to increased restrictions on the use of confidential, highly sensitive health data. This coincides with increased statistical and computational power, which has the potential to glean many new insights from already-collected data. Unfortunately, however, the two trends risk cancelling each other out. It is thus imperative that universities and other institutions who have access to large data sets manage them in ways that maintain organisational and public confidence in their integrity.

This paper presents the design and development of a Virtual Laboratory for analysing restricted data using open software. These tools were assembled with the aim of allowing users to access restricted data in an appropriate and safe manner, whilst allowing the use of open software to enhance reproducibility and accessibility. The system implementation is described specifically for the Australian National Research Cloud http://www.nectar.org.au/research-cloud/.

We present a case study of an application from Environmental Epidemiology using confidential health records, which was a motivating reason for us to develop this system. In the example presented here, we provide a simple analysis of the distribution of suicides in relation to drought across NSW, and also in relation to votes for the conservative parties, both of which have previously been found to increase the risk of suicide in NSW. The paper then concludes with a reflection on the implications of applying these open software tools to restricted-access data such as the Australian Deaths dataset.


3 Introduction /introduction.html

We present an environment for analysing restricted data using open
software. The system is described using an analysis of the historical
association of suicides with drought in Australia, which is then
extrapolated under climate change and adaptation scenarios. These
tools were assembled to allow users to access restricted data in a
manner that protects the confidentiality of sensitive data, whilst
also allowing the use of open software for reproducibility.

Recently, restrictions on access to confidential health records have
increased, especially for sensitive data such as the suicide records
used in our case study. Previous solutions to this challenge make
access so restricted that usability is compromised. We aimed to build
a collection of tools for the conduct of many types of health and
social science research. The starting point for users is the data
catalogue, which lets them discover the unrestricted and restricted
data available in the store for approved use. Once data are
discovered, the researcher can manipulate the datasets on the secure
server: the PostgreSQL database integrates the data, Geoserver
visualises them, and statistical tools are available through the
RStudio Server browser interface.

Such analytical tools will enhance the ability of adaptive management
practitioners to assess the potential influence of adaptations. Use
of the system shows the ease with which multiple data sources (some
restricted) can be analysed securely using open software. This will
build capacity to answer complex research questions and to compare
multiple climate change scenarios or adaptation assumptions, giving a
simultaneous view of potential future outcomes from different
standpoints.

4 Deploying Virtual Machines

4.1 Description

We want to deploy the service as a symbiotic pair of virtual images that together provide integrated data storage and analysis services in the cloud. Our recent review of the prototype system suggested a need for two servers, to provide a more robust and powerful combination of structural and functional aspects.

The services hosted by the pair of servers will be:

  • The Brains: a Statistical Analysis engine (running the R environment) integrated with a metadata registry, and
  • The Brawn: a PostgreSQL Geographical Information System (GIS) Database server.

4.2 Accessing Virtual Machines on the Australian Research Cloud

4.2.1 Launch an instance

  • Log on to the research cloud using your Australian University details http://www.nectar.org.au/research-cloud/.
  • Create two security groups for brawn and brains; these will hold the special firewall settings for each server.
  • Add ssh port 22 (add http(s) 80/443, postgres 5432 and icmp -1 as needed).
  • Under images and snapshots, launch the CentOS 6.2 amd64 image.
  • Give your server a name and add its security group.
  • Specify an ssh keypair http://support.rc.nectar.org.au/demos/launching_123.html
  • Launch, and note the IP address.
  • Go to the access and security section and edit the rules (e.g. for http add a rule to allow access to ports 8080 to 9000 from 0.0.0.0/0 (CIDR)).
  • Note that this allows access from the whole world; we will think about securing the server later.
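The dashboard steps above could, as an assumption, also be scripted with the legacy nova CLI that the Research Cloud supported at the time; the group names and ports mirror the list, and the commands are only written out to a script for review, not executed here:

```shell
# sketch: write the equivalent nova CLI commands to a scratch script for review
cat > /tmp/secgroups.sh <<'EOF'
nova secgroup-create brains "firewall rules for the analysis server"
nova secgroup-create brawn  "firewall rules for the database server"
nova secgroup-add-rule brawn tcp 22 22 0.0.0.0/0
nova secgroup-add-rule brawn tcp 5432 5432 0.0.0.0/0
EOF
cat /tmp/secgroups.sh
```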


4.2.3 requesting help

To: rc-support@nectar.org.au Subject: Request about snapshot status

Hi,

My AAF email I login to the cloud with is: ivan.hanigan@anu.edu.au

What is currently happening: I took a snapshot of one VM and it said success but now in the status column it has 'shutoff' and a spinning wheel.

What should happen: It should return to status 'Active'.

Cheers, Ivan Hanigan Data Management Officer. Thursday PhD scholar (ANU/CSIRO). National Centre for Epidemiology and Population Health. College of Medicine, Biology and Environment. Australian National University Canberra, ACT, 0200. Ph: +61 2 6125 7767. Fax: +61 2 6125 0740. Mob: 0428 265 976. CRICOS provider #00120C.

4.2.4 connect using ssh

If using Windows, make sure you have Git (with Git Bash) or WinSCP installed. Opening a bash shell from these will enable you to connect to your server using ssh. The contents of the following orgmode chunks can then be evaluated on the remote server. If you're on Linux, orgmode can execute these chunks with C-c C-c.

whoami

In the next chunk, insert the relevant IP address; you may have to answer yes to the question about adding the RSA fingerprint to your list of known hosts.

cd ~/.ssh
ssh -i keypairname root@your.new.ip.address
# it is prudent to set a hideously long password for root
# passwd root

4.3 Setting up the basic server framework

This is done in a similar way for both the Brains and the Brawn servers.

4.3.1 Install any updates using the yum package manager

It is recommended to do this weekly to keep the server in good condition, especially with regard to security updates.

yum update 
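The weekly routine could be automated with a cron entry; a sketch, assuming the /etc/cron.d layout on CentOS (the schedule, log path and file names are examples):

```shell
# write a weekly yum-update cron entry to a scratch file for review
cat > /tmp/yum-update.cron <<'EOF'
# run yum update every Sunday at 03:00
0 3 * * 0 root /usr/bin/yum -y update >> /var/log/yum-cron.log 2>&1
EOF
cat /tmp/yum-update.cron
# once reviewed, install it as /etc/cron.d/yum-update
```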

4.4 Security measures

The following section sets some restrictions on the server. I would like to know how important it is to restrict root login, and whether we can permit login via ssh on port 22 if it is only open to the NCEPH VPN range. If so, do I just leave the settings below as yes/yes/yes? I got some advice from a sysadmin at work about the following, but I am sure it is still fairly insecure. There are various TODOs hidden in the source code document.

################################################################
# name:security
# visit
/etc/ssh/sshd_config
# under Authentication, make sure these are enabled (uncomment if needed):
#RSAAuthentication yes
#PubkeyAuthentication yes
#AuthorizedKeysFile     .ssh/authorized_keys

4.4.1 TODO add a ssh key

  • get the private/public key generated
  • insert the public key
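A minimal sketch of those two TODO steps; the key path, comment, and the scratch directory (standing in for ~/.ssh on the server) are all examples:

```shell
# generate a keypair (example path and key comment)
rm -f /tmp/brawn_key /tmp/brawn_key.pub
ssh-keygen -t rsa -b 2048 -f /tmp/brawn_key -N "" -q -C "brawn-access"
# insert the public key into authorized_keys (scratch dir stands in for ~/.ssh)
install -d -m 700 /tmp/server_home/.ssh
cat /tmp/brawn_key.pub >> /tmp/server_home/.ssh/authorized_keys
chmod 600 /tmp/server_home/.ssh/authorized_keys
```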

4.4.2 Restrict ssh access to a specific ip address/range
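
One way to do this, as a sketch, is with iptables rules that accept ssh only from a given range; the range below is an example, and the rules are written to a scratch file for review before merging into /etc/sysconfig/iptables:

```shell
# example iptables fragment: allow ssh only from a VPN range, drop the rest
cat > /tmp/ssh-rules <<'EOF'
-A INPUT -m state --state NEW -m tcp -p tcp -s 150.203.0.0/16 --dport 22 -j ACCEPT
-A INPUT -p tcp --dport 22 -j DROP
EOF
cat /tmp/ssh-rules
# after merging into /etc/sysconfig/iptables: service iptables restart
```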

4.5 Setting the host.* TCP Wrappers
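
A sketch of TCP-wrapper rules restricting sshd, written to scratch copies of the hosts.allow/hosts.deny files (the address range is an example):

```shell
# allow sshd only from an example range, deny everything else
cat > /tmp/hosts.allow <<'EOF'
sshd : 150.203.0.0/255.255.0.0
EOF
cat > /tmp/hosts.deny <<'EOF'
sshd : ALL
EOF
# once reviewed, copy into /etc/hosts.allow and /etc/hosts.deny
```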


4.5.2 Security Enhanced Linux (selinux)

To run the database and the RStudio server it is best to disable SELinux.

  • TODO Find out if this is necessary for PostgreSQL as well as Rstudio.
    # selinux config
    vi /etc/selinux/config
    # This file controls the state of SELinux on the system.
    # SELINUX= can take one of these three values:
    #     enforcing - SELinux security policy is enforced.
    #     permissive - SELinux prints warnings instead of enforcing.
    #     disabled - No SELinux policy is loaded.
    # change SELINUX=enforcing to disabled:
    SELINUX=disabled
    
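The edit described in the comments above can also be applied with sed; a sketch, run here against a scratch copy rather than the real /etc/selinux/config:

```shell
# scratch copy illustrating the change (the real file is /etc/selinux/config)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > /tmp/selinux.config
# flip enforcing to disabled
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /tmp/selinux.config
grep '^SELINUX=' /tmp/selinux.config
```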

4.5.3 Install some base packages

There are a few commonly used packages recommended for both Brawn and Brains.

yum install gcc-gfortran gcc-c++ readline-devel libpng-devel libX11-devel libXt-devel texinfo-tex.x86_64 tetex-dvips docbook-utils-pdf cairo-devel java-1.6.0-openjdk-devel libxml2-devel make unzip hdparm

4.6 Hardware set-up

Now we will note the size and number of the disc partitions.

df -h

4.6.1 Swap space /swapon.html

--- 
name: swapon
layout: default
title: Swap space
---

For some reason the Research Cloud CentOS VMs do not have any swap space.
I added a one GB swapfile, but was advised to enable about the same amount as there is RAM.
I will come back and review this; also, I have not yet added it to the boot configuration, so for now swapon must be run after each reboot.
#### Code
    free -m | grep Swap
Add swap file with this:
#### Code
    # Create an empty file called /swapfile. With bs=1024000, count=3000 gives
    # about 3 GB; for just 1 GB use count=1024. (Over 2 GB is required for
    # Oracle; need to come back and review.)
    dd if=/dev/zero of=/swapfile bs=1024000 count=3000
    #Format the new file to make it a swap file
    mkswap /swapfile
    #Enable the new swapfile. 
    swapon /swapfile
    free -m | grep Swap
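To make the swapfile survive reboots (the open question above), an fstab entry like the following would do it; sketched here against a scratch file rather than the real /etc/fstab:

```shell
# append a swap entry to a scratch copy of fstab for review
printf '' > /tmp/fstab.new                     # scratch stand-in for /etc/fstab
echo "/swapfile  swap  swap  defaults  0 0" >> /tmp/fstab.new
cat /tmp/fstab.new
# once reviewed, add the same line to /etc/fstab
```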

4.6.2 Persistent Net Rules Should Be Avoided On Centos

The current "CentOS 6.2 amd64" image (ID 21401), if launched and snapshotted, cannot be connected to when launched from that snapshot, due to a networking issue. (davidjb, Mon Mar 26, 2012 support.rc.nectar.org.au/forum).

################################################################
# name:netrules
# Just remove the /lib/udev/write_net_rules file (and
# /etc/udev/rules.d/70-persistent-net.rules for good measure), and then
# any further instances will always have their networking adapter as
# eth0. There's probably a better way to do this, but that's working for
# me now
rm /lib/udev/write_net_rules
rm /etc/udev/rules.d/70-persistent-net.rules

4.6.3 Disk Storage

Every VM on the Research Cloud has a 10GB primary disk which is used for the image you launch. The primary disk is copied in a snapshot, so anything on it can be backed up via snapshots. In addition, every virtual machine gets secondary storage. This secondary disk is not copied or backed up via snapshots. If you reboot your virtual machine, the secondary disk data remains intact; if you shut down your VM, the data disappears (it is not persistent).

######################################################################
# Mount the drive to /home eg.
mount -t ext3 /dev/vdb /home
# df -h
# once successful edit /etc/fstab to mount the drive at boot e.g.
# /dev/vdb    /home    ext3     defaults    0 0
# Reboot the server to ensure that the drive mounts on boot.
# check performance
hdparm -tT /dev/vdb
  • TODO find out recommended practice for backups, and purpose of 'object storage'

    import boto.s3.connection

    connection = boto.s3.connection.S3Connection(
        aws_access_key_id='7b7a0a6f71994e42a07c8e9b0de0f8ca',
        aws_secret_access_key='2d54f6a679cd4b06820550459451cb50',
        port=8888,
        host='swift.rc.nectar.org.au',
        is_secure=True,
        calling_format=boto.s3.connection.OrdinaryCallingFormat())

    buckets = connection.get_all_buckets()
    print buckets

    • downside is you have to use python to work with the data?

5 The Brawn

The Brawn is a PostgreSQL Geographical Information System (GIS) Database server.

5.1 OpenGeo Suite (Postgres and friends)

5.1.1 Restricted OpenGeo Suite /opengeosuite-restricted.html

  • The OpenGeo Suite Installer
    --- 
    name: opengeosuite-restricted
    layout: default
    title: Restricted OpenGeo Suite
    ---
    
      <li><a href="#sec-5-3-8">TODO Previous: </a></li>
      <li><a href="/opengeosuite-upgrade-tomcat6.html">Next: Disable Tomcat</a></li>
    <p></p>
    
    The OpenGeo Suite provides a complete spatial data server software stack.  I use this for the batch install and configuration (TODO look into the individual components offered by OpenGeo). I don't want the Brawn server to run the tomcat etc.  [Installation instructions are here](http://suite.opengeo.org/opengeo-docs/installation/linux/centos/suite.html) but beware that the Redhat version requires the optional channel to be enabled. 
    
    
    #### Code Centos
        # find out your os
        cat /etc/issue
        uname -r
        # if centos
        cd /etc/yum.repos.d
        wget http://yum.opengeo.org/suite/v3/centos/6/x86_64/OpenGeo.repo
        yum install opengeo-suite
    
    #### Code Redhat
        cd /etc/yum.repos.d
        wget http://yum.opengeo.org/suite/v3/rhel/6/x86_64/OpenGeo.repo
        yum install opengeo-suite
    
  • Upgrade tomcat6
    #### Code
        service tomcat6 stop
        chkconfig tomcat6 off
        cd /usr/share            
        wget http://mirror.overthewire.com.au/pub/apache/tomcat/tomcat-6/v6.0.37/bin/apache-tomcat-6.0.37.tar.gz
    
        tar -xzf apache-tomcat-6.0.37.tar.gz
        #To start Tomcat, go to the bin folder of Tomcat installation and then run the startup.sh script,
        cd /usr/share/apache-tomcat-6.0.37
        vi conf/server.xml
        # change the Connector port from 8080 to 8081: <Connector port="8081" .../>
        # use vi /etc/sysconfig/iptables to check this port is open in firewall
    
        vi conf/tomcat-users.xml
        #if you want to use the manager, I did not for security reasons
        #add a user with roles of admin,manager-gui like this
        #<role rolename="manager-gui"/>
        #<user username="tomcat" password="s3cret" roles="manager-gui"/>
        
        #To start Tomcat, go to the bin folder of Tomcat installation and then run the startup.sh script,
        ./bin/startup.sh     
        ./bin/shutdown.sh
    
        # make a script for starting
        cd /etc/init.d  
        vi tomcat
        # add this  
        # to find java_home "update-alternatives --display java" and remove bin/java
        ####
        #!/bin/bash  
        # description: Tomcat Start Stop Restart  
        # processname: tomcat  
        # chkconfig: 234 20 80  
        JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk.x86_64
        export JAVA_HOME  
        PATH=$JAVA_HOME/bin:$PATH  
        export PATH  
        CATALINA_HOME=/usr/share/apache-tomcat-6.0.37  
          
        case $1 in  
        start)  
        sh $CATALINA_HOME/bin/startup.sh  
        ;;   
        stop)     
        sh $CATALINA_HOME/bin/shutdown.sh  
        ;;   
        restart)  
        sh $CATALINA_HOME/bin/shutdown.sh  
        sh $CATALINA_HOME/bin/startup.sh  
        ;;   
        esac      
        exit 0
        ####
    
        # now
        chmod 755 tomcat  
        chkconfig --add tomcat  
        chkconfig --level 234 tomcat on  
        # verify
        chkconfig --list tomcat  
        service tomcat start  
        # test stop, start, restart
        more /usr/share/apache-tomcat-6.0.37/logs/catalina.out  
        # to check for errors, though the log output is hard to interpret
        # so assuming ddiindex was previously running on tomcat6
        service tomcat stop
    
    
        cd /usr/share/tomcat6/webapps
        cp -r dashboard /usr/share/apache-tomcat-6.0.37/webapps
        cp -r geowebcache /usr/share/apache-tomcat-6.0.37/webapps
        cp -r geoserver /usr/share/apache-tomcat-6.0.37/webapps    
        cp -r opengeo-docs /usr/share/apache-tomcat-6.0.37/webapps        
        cp -r geoexplorer /usr/share/apache-tomcat-6.0.37/webapps            
        cp -r recipes /usr/share/apache-tomcat-6.0.37/webapps                
        cd /usr/share/apache-tomcat-6.0.37
        service tomcat start
    
        #now you have installed the suite go into the geoserver website  
        # http://ip.address.of.server:8081/dashboard
        # then go to the geoserver and log in as admin with password geoserver
        # follow instructions there to enhance the security
    
        # for The default user/group service should use digest password encoding see
        #  http://suite.opengeo.org/docs/geoserver/security/tutorials/digest/index.html
    
        # Configure the Digest authentication filter
        # 
        # Start GeoServer and login to the web admin interface as the admin user.
        # Click the Authentication link located under the Security section of the navigation sidebar.
        # Scroll down to the Authentication Filters panel and click the Add new link.
        # Click the Digest link.
        # Fill in the fields of the settings form as follows:
        # Set Name to “digest”
        # Set User group service to “default”
        # Save.
        # Back on the authentication page scroll down to the Filter Chains panel.
        # Select “Default” from the Request type drop down.
        # Unselect the basic filter and select the digest filter. Position the digest filter before the anonymous filter.
        # Save
        # 
        # now go to users, groups, roles
        # under User Group Services, click default
        # under Password encryption, change to digest
        # save
    
  • Secure tomcat
    • We have a service called Nessus that reviewed the VM tomcat port
    • found some minor issues
    • Apache Tomcat servlet/JSP container default files
      Synopsis: The remote web server contains example files. Description: Example JSPs and Servlets are installed in the remote Apache Tomcat servlet/JSP container. These files should be removed as they may help an attacker uncover information about the remote Tomcat install or host itself, or may themselves contain vulnerabilities such as cross-site scripting issues. Solution: Review the files and delete those that are not needed. Risk Factor: Medium. CVSS Base Score: 6.8 (CVSS2#AV:N/AC:M/Au:N/C:P/I:P/A:P). Plugin Information: Publication date: 2004/03/02, Modification date: 2012/01/20. Ports: tcp/8081

      The following default files were found :

      /examples/servlets/index.html /examples/jsp/snp/snoop.jsp /examples/jsp/index.html

      Just remove this dir?
          cd /usr/share/apache-tomcat-6.0.37/webapps
          rm -R -f examples

    • CGI Generic Cross-Site Scripting (persistent, 3rd Pass)
      Synopsis: A CGI application hosted on the remote web server is potentially prone to cross-site scripting attacks. Description: The remote web server hosts one or more CGI scripts that fail to adequately sanitize request strings containing malicious JavaScript. By leveraging this issue, an attacker may be able to cause arbitrary HTML and script code to be executed in a user's browser within the security context of the affected site. This script identified patterns that were injected to test 'reflected' (aka 'non-persistent') XSS; the issues are likely to be 'persistent' (or 'stored') after all. Solution: Restrict access to the vulnerable application and contact the vendor for a patch or upgrade.

5.2 Postgres standalone

5.2.1 PostgreSQL /postgresql.html

## Install PostgreSQL 9.2 
PostgreSQL is an Open Source database that can be extended with GIS functionality using the PostGIS tools. The latest version is 9.2. Please see [this link](http://people.planetpostgresql.org/devrim/index.php?/archives/70-How-to-install-PostgreSQL-9.2-on-RHELCentOSScientific-Linux-5-and-6.html) for the original documentation I used to install this on CentOS or RedHat 6.4. Check the correct download from [this link](http://yum.postgresql.org/repopackages.php#pg92). Please note that this didn't work for me on Ubuntu 12.04 LTS in early 2013 because PostGIS 2.0 was not included in their repositories; I ended up rolling back to PostgreSQL 9.1 on that machine.

#### Code
    # install the PostgreSQL 9.2 repo package
    rpm -ivh http://yum.postgresql.org/9.2/redhat/rhel-6-x86_64/pgdg-centos92-9.2-6.noarch.rpm
    # install PostgreSQL 9.2 with single command:
    yum groupinstall "PostgreSQL Database Server PGDG"
    # This will install PostgreSQL 9.2 server, along with -contrib subpackage.
    # Once it is installed, first initialize the cluster:
    service postgresql-9.2 initdb
    # Now, you can start PostgreSQL 9.2:
    service postgresql-9.2 start
    # If you want PostgreSQL 9.2 to start everytime on boot, run this:
    chkconfig postgresql-9.2 on
  • Configure PostgreSQL connection settings
    #### Code
        #edit your pg_hba.conf file under /var/lib/pgsql/9.2/data
        #I added a super user from my ip address and allowed all the local ip addresses access
        host    all             postgres        my.desk.ip.address/32       md5
        # if you want to allow data sharing on a specific database then create a public user
        # host    dbname        publicdata      0.0.0.0/0                   md5
        # if you want people to access from a subnet at your work
        # host    dbname        username        ip.address.range.0/24        md5
    
    #### Code
        #connect to psql
        #Set postgres Password
        su - postgres
        psql postgres postgres
        alter user postgres with password 'password';
        select pg_reload_conf();
        # close the psql using \q
        # change back to root
        exit
    
    Now make the server listen for any connection requests from anywhere in the world.
    
    #### Code
        # First locate the postgresql.conf file under /var/lib/pgsql/9.2/data.
        # uncomment and change from localhost
        # listen_addresses = '*'
        # then restart the server
        sudo service postgresql-9.2 restart
        #then reboot and confirm postgres started
    
  • Allow connection to postgres through the firewall
    #### Code: setting-ports
        vi /etc/sysconfig/iptables 
        # and add the line
        -A INPUT -m state --state NEW -m tcp -p tcp --dport 5432 -j ACCEPT
        service iptables restart
    

5.2.2 PostGIS 2.0 /postgis.html

  • Postgis
    --- 
    name: postgis
    layout: default
    title: PostGIS
    ---
    
    <li><a href="/postgresql.html">Previous: PostgreSQL</a></li>
    <li><a href="/postgres-migrate.html">Next: PostgreSQL Migrate</a></li>
    
    
    ## Install PostGIS 2.0
    The PostGIS suite enables a PostgreSQL database with spatial data types and analysis functions.
    
    
    #### References   
     [http://www.davidghedini.com/pg/entry/postgis_2_0_on_centos](http://www.davidghedini.com/pg/entry/postgis_2_0_on_centos)
     [http://people.planetpostgresql.org/devrim/index.php?/archives/64-PostGIS-2.0.0,-RPMs-and-so..html](http://people.planetpostgresql.org/devrim/index.php?/archives/64-PostGIS-2.0.0,-RPMs-and-so..html)
     [http://people.planetpostgresql.org/devrim/index.php?/archives/65-Installing-PostGIS-2.0.X-on-RHELCentOSScientific-Linux-5-and-6-Fedora-That-is-easy!.html](http://people.planetpostgresql.org/devrim/index.php?/archives/65-Installing-PostGIS-2.0.X-on-RHELCentOSScientific-Linux-5-and-6-Fedora-That-is-easy!.html)
    
    #### code: install-postgis2
        yum list postgis*
        yum install postgis2_92.x86_64 
        yum install postgis2_92-devel.x86_64
      
      
    
  • GDAL, PROJ and GEOS
    #### Code
        sudo rpm -Uvh http://elgis.argeo.org/repos/6/elgis/x86_64/elgis-release-6-6_0.noarch.rpm
        sudo rpm -Uvh http://mirror.as24220.net/pub/epel/6/i386/epel-release-6-7.noarch.rpm
        # yum list gdal*
        yum install gdal-devel.x86_64 
        yum install proj-devel.x86_64
        yum install proj-nad.x86_64
        yum install proj-epsg.x86_64 
        yum install geos-devel.x86_64
        #to update
        yum remove gdal*
        yum install gdal-devel.x86_64 
    
  • Create Database
    #### Code: Create Database
        su - postgres 
        createdb mydb
        psql -d mydb -U postgres  
        CREATE EXTENSION postgis;  
        CREATE EXTENSION postgis_topology;  
     
    
  • Create a GIS user and a group
    #### Code: Create a GIS user and a group
        CREATE ROLE public_group;
        CREATE ROLE ivan_hanigan LOGIN PASSWORD 'password';
        GRANT public_group TO ivan_hanigan;
    
        grant usage on schema public to public_group;
        GRANT select ON ALL TABLES IN SCHEMA public TO public_group;
        grant execute on all functions in schema public to public_group;
        grant select on all sequences in schema public to public_group;
        grant select on table geometry_columns to public_group;
        grant select on table spatial_ref_sys to public_group;
        grant select on table geography_columns to public_group;
        grant select on table raster_columns to public_group;
        grant select on table raster_overviews to public_group;
        \q
        exit
    
  • Specific transformations grid for Australian projections AGD66 to GDA94
    #### Additional Reprojection File
    A special transformations grid file is required to be added to the PROJ.4 files for reprojecting the Australian projections AGD66 to GDA94.
    
    Thanks to [Joe Guillaume](https://github.com/josephguillaume) and [Francis Markham](http://stackoverflow.com/users/103225/fmark) for providing this solution.
    
    #### Code: transformations grid for Australian projections
        cd /usr/share/proj
        # the original was moved
        # wget http://www.icsm.gov.au/icsm/gda/gdatm/national66.zip
        wget http://www.icsm.gov.au/gda/gdatm/national66.zip
        # if it moves again a version of it is included with this repo
        # from your local scp to your server
         
        yum install unzip
        unzip national66.zip
        mv "A66 National (13.09.01).gsb" aust_national_agd66_13.09.01.gsb
    
        su - postgres 
        psql -d mydb
    
        UPDATE spatial_ref_sys SET
        proj4text='+proj=longlat +ellps=aust_SA +nadgrids=aust_national_agd66_13.09.01.gsb +wktext'
        where srid=4202;
        \q
        exit
    
  • Test transform
    #### Code: test transformation from AGD66 to GDA94
        select geocode, geoname, st_transform(geom, 4283) as the_geom
        into schema.gda94_table
        from  schema.agd66_table;
    
    #### FROM YOUR R SERVER OR LOCAL ####
    #R
    require(devtools)
    install_github("swishdbtools", "swish-climate-impact-assessment")
    require(swishdbtools)
    pwd <- getPassword(remote=F)
    ch <- connect2postgres2("django")
    #("ip.address", "dbname", "postgres", p = pwd)
    ## dbSendQuery(su,
    ## "UPDATE spatial_ref_sys SET
    ## proj4text='+proj=longlat +ellps=aust_SA +nadgrids=aust_national_agd66_13.09.01.gsb +wktext'
    ## where srid=4202;
    ## ")
    # now you can go ahead and convert AGD66 (4202) to GDA94 (4283)
    sql <- sql_subset_into(ch, "schema.agd66_table", select =
               'geocode, geoname, st_transform(geom, 4283) as the_geom',
               into_schema = "schema", into_table = "gda94_table", eval = F, check = F)
    cat(sql)
    
    

5.2.3 PostgreSQL Migration /postgres-migrate.html

  • Migrate the data
    ---
    name: postgres-migrate
    layout: default
    title: PostgreSQL migration from default location
    ---
    
    <li><a href="/postgis.html">Previous: PostGIS</a></li>
    <li><a href="/sharedmemory.html">Next: Important shared memory settings</a></li>
    
    To take advantage of the extra storage on the secondary disk that we mounted in the initial configuration, do the following.
    
    #### Code: Migrate PostgreSQL Data
        service postgresql-9.2 stop
        #     Copy the pgsql directory from /var/lib (or customer install directory) location to another location
        cp -r /var/lib/pgsql /home/pgsql
        chown -R postgres:postgres /home/pgsql
    
        #     Edit the start script 'postgresql'
        vi /etc/init.d/postgresql-9.2
        #     Search for parameter PGDATA which would be entered as "PGDATA=/var/lib/pgsql"
        #     Edit the line such that PGDATA points to the new location. For e.g. "PGDATA=/newloc/pgsql"
        #     ALSO DO PGLOG, and PGUPLOG
        #     Save and exit the file
        #     Start PostgreSQL Service 
        service postgresql-9.2 start
        # tidy up, but not too much: remove just the old data directory
        # once the migration is confirmed
        rm -r -f /var/lib/pgsql/9.2/data
    
    
  • Set up backups
    Back up from here to a secure location at my work.

5.3 Tuning Postgres Performance

5.3.1 Important shared memory settings /sharedmemory.html

--- 
name: sharedmemory
layout: default
title: Important Shared Memory Settings
---

## Managing memory settings
The default settings in PostgreSQL are usually pretty good, but these memory settings are conservative to start with and often need modification.

## References

[Kernel memory limitations](http://michael.otacoo.com/postgresql-2/take-care-of-kernel-memory-limitation-for-postgresql-shared-buffers/)

#### Code: sharedmemory
    vi /home/pgsql/9.2/data/postgresql.conf 
    # shared_buffers
    
    # PostgreSQL has a default shared_buffers value of 32MB, which is
    # enough for small configurations, but it is said that this
    # parameter should be set at 25% of the system’s RAM. This allows
    # your system to keep good performance in parallel with the
    # database server.  So in the case of a machine with 4GB of RAM,
    # you should set shared_buffers at 1GB.
    shared_buffers = 2048MB
     
    # also look at max_locks_per_transaction; I tried setting it to 1000
     
    ################################################################
     
    # http://www.postgresql.org/docs/9.2/static/kernel-resources.html
    # Linux
     
    #     The default maximum segment size is 32 MB, which is only
    #     adequate for very small PostgreSQL installations. The
    #     default maximum total size is 2097152 pages. A page is
    #     almost always 4096 bytes except in unusual kernel
    #     configurations with "huge pages" (use getconf PAGE_SIZE to
    #     verify). That makes a default limit of 8 GB, which is often
    #     enough, but not always.
     
    #     The shared memory size settings can be changed via the
    #     sysctl interface. For example, to allow 16 GB:
     
    sysctl -w kernel.shmmax=17179869184
    sysctl -w kernel.shmall=4194304
     
    #     In addition these settings can be preserved between reboots
    #     in the file /etc/sysctl.conf. Doing that is highly
    #     recommended.
     
     
    #    The remaining defaults are quite generously sized, and
    # usually do not require changes.  also
    # http://www.linux.com/learn/tutorials/394523-configuring-postgresql-for-pretty-good-performance
    # work mem 4MB
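
The 25%-of-RAM rule of thumb quoted in the comments above can be sanity-checked with quick shell arithmetic (the RAM figure is an example):

```shell
# rule of thumb from the comments above: shared_buffers ~ 25% of RAM
ram_mb=4096                          # example: a machine with 4 GB of RAM
shared_buffers_mb=$((ram_mb / 4))
echo "shared_buffers = ${shared_buffers_mb}MB"
```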

5.3.2 TODO Postgres conf

I noticed when installing OpenGeo on RHEL and Ubuntu that the installer changes more parameters in the Ubuntu conf file. Compare the two with diff and update RHEL to reflect the Ubuntu settings.

5.3.3 Test loading some shapefiles

On your Ubuntu desktop, install PostGIS and GDAL (see above):

################################################################
sudo apt-get install postgis

Then let's demo the Tasmanian SLAs:

  • download the shapefiles
      ################################################################
      # name:tassla06
      # ABS spatial units are available at http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1259.0.30.0022006?OpenDocument
      setwd('..')
      dir.create('data')
      setwd('data')
      dir.create('abs_sla')
      setwd('abs_sla')
      
      download.file('http://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&1259030002_sla06aaust_shape.zip&1259.0.30.002&Data%20Cubes&18E90A962EFD4D7ECA25795D00244F5A&0&2006&06.12.2011&Previous',
                    'SLA06.zip', mode = 'wb')
      unzip('SLA06.zip',junkpaths=T)
      
      sink('readme.txt')
        cat(paste('Australian Bureau of Statistics Statistical Local Areas 2006
        downloaded on', Sys.Date(),
        '
        from http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1259.0.30.0022006?OpenDocument')
        )
      sink()
      
      # and load spatial data (sd)
      install.packages('rgdal')
      require(rgdal)
      sd <- readOGR('SLA06aAUST.shp', layer = 'SLA06aAUST')
      # might take a while
      head(sd@data)
      plot(sd)
      dev.off()
      save.image('aussd.Rdata')
      
      ######################
      # tas
      sd2 <-  sd[ sd@data$STATE_CODE == 6,]
       plot(sd2)
       axis(1);axis(2); box()
      # plot(sd, add = T)
       names(sd2@data)
       writeOGR(sd2,'tassla06.shp','tassla06','ESRI Shapefile')
       test <- readOGR(dsn = 'tassla06.shp', layer = 'tassla06')
       plot(test, col = 'grey')
       rm(sd)
      # save.image('tassd.Rdata')
      
    
  • upload with shp2pgsql
    # psql -d ewedb -U postgres -h 115.146.94.209
    # CREATE SCHEMA abs_sla;
    # grant ALL on schema abs_sla to gislibrary;
    cd data
    shp2pgsql -s 4283 -D tassla06.shp public.tassla06 > tassla06.sql
    # psql -d ewedb -U gislibrary -W -h 115.146.94.209 -f tassla06.sql
    # warning terminal not fully functional?  ran from normal terminal
    # now on the remote server run
    # psql ewedb postgres
    # GRANT select ON ALL TABLES IN SCHEMA public TO gislibrary;
    

6 The Brains

The Brains is a Statistical Analysis engine (running the R environment) integrated with a metadata registry.

6.1 R Server

6.1.1 R

#rpm -Uvh http://mirror.as24220.net/pub/epel/6/i386/epel-release-6-7.noarch.rpm
rpm -Uvh http://mirror.overthewire.com.au/pub/epel/6/i386/epel-release-6-7.noarch.rpm
yum install R R-devel

#rpm -Uvh http://mirror.as24220.net/pub/epel/6/i386/epel-release-6-8.noarch.rpm

To update R as ‘root’ on your system, simply type:

yum update R

6.1.2 TODO package management and R updates

Credit to http://zvfak.blogspot.com.au/2012/06/updating-r-but-keeping-your-installed.html The problem is that when you update R, you usually need to re-install your libraries or change .libPaths() to point to a location that has your previous libraries.

The solution below will work for unix-like operating systems including Mac OS X.

###########################################################################
# newnode: packageManagement

#First, we need a location to install all our packages from now
#on. This can be any directory, and location of this directory should
#be indicated in ~/.Renviron file. Let's create that directory now:

mkdir ~/Rlibs

#We created Rlibs directory in our home directory. Now, create the
#.Renviron file in your home directory and enter the following line
#and save the .Renviron file:

R_LIBS=~/Rlibs

# We can now start R and install any library. The libraries will be
# installed to ~/Rlibs, and when we update R, R will still look for
# libraries in ~/Rlibs directory so we don't need to re-install the
# libraries. However, we will need to update the libraries in ~/Rlibs
# directory to their most recent versions. All we need to do is to run
update.packages() 
# in R console, and the libraries will be updated.

6.1.3 Rstudio

Check out the RStudio Server versions at: http://rstudio.org/download/server

# on RHEL6 ran into dependencies: libcrypto.so.6()(64bit) is needed by rstudio-server-0.96.331-1.x86_64
#       libgfortran.so.1()(64bit) is needed by rstudio-server-0.96.331-1.x86_64
#       libssl.so.6()(64bit) is needed by rstudio-server-0.96.331-1.x86_64
# http://www.linuxquestions.org/questions/linux-software-2/need-libcrypto-so-6-64bit-and-libssl-so-6-64bit-for-redhat-6-a-873068/
# yum list openssl098e*
# yum install openssl098e.x86_64
# yum list compat-libgfortran*
# yum install compat-libgfortran-41.x86_64 
wget http://download2.rstudio.org/rstudio-server-0.97.551-x86_64.rpm
sudo yum install --nogpgcheck rstudio-server-0.97.551-x86_64.rpm

wget http://download2.rstudio.org/rstudio-server-rhel-0.99.451-x86_64.rpm
sudo yum install --nogpgcheck rstudio-server-rhel-0.99.451-x86_64.rpm
rstudio-server verify-installation

6.1.4 firewall access

# kudos2 http://slinsmeier.wordpress.com/2012/05/19/creating-a-lab-environment-with-rstudio/
# It is necessary to open the firewall port to allow the browser
# access to RStudio: edit the 
vi /etc/sysconfig/iptables 
# file and add the line
# -A INPUT -m state --state NEW -m tcp -p tcp --dport 8787 -j ACCEPT
# directly after the opening of the ssh port 22 (or copy that line and change the port 22 to 8787).
# reminder that you need to have the security group set up on the research cloud to allow tcp from 8787 to 8787 cidr 0.0.0.0/0
service iptables restart

Check that you can log on via port 8787. Note that this is an insecure R server and the following steps are required to make it a secure server. We will need to block port 8787 later on.
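A quick way to confirm the port is answering before moving on. This is a sketch: the 127.0.0.1 address is a placeholder for your server's IP, and it uses bash's built-in /dev/tcp so it does not assume curl is installed:

```shell
# returns success if a TCP connection to host $1 port $2 can be opened
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}
if port_open 127.0.0.1 8787; then
  echo "rstudio reachable"
else
  echo "rstudio not reachable"
fi
```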

6.1.5 SSL/HTTPS and running a proxy server

sudo yum install httpd.x86_64

# run the following interactively
sudo openssl genrsa -out /etc/pki/tls/private/rstudio.ivan.com.key 1024
# the next is one line
sudo openssl req -new -key /etc/pki/tls/private/rstudio.ivan.com.key -x509 -out /etc/pki/tls/certs/rstudio.ivan.com.crt -days 365

sudo yum install mod_ssl.x86_64 
#vi /etc/httpd/conf.d/ssl.conf 
# Change the paths to match where the Key file is stored. 
SSLCertificateFile /etc/pki/tls/certs/rstudio.ivan.com.crt
# Then set the correct path for the Certificate Key File a few lines below. 
SSLCertificateKeyFile /etc/pki/tls/private/rstudio.ivan.com.key

mkdir /etc/httpd/sites

# vi /etc/httpd/conf/httpd.conf 
# and add 
Include /etc/httpd/sites/
# as the last line.

# vi /etc/httpd/sites/rstudio-ivan.com

# insert
<VirtualHost *:80>

  ServerName rstudio.ivan.com
  RedirectMatch ^(.*)$ https://rstudio.ivan.com$1

</VirtualHost>
# goodo
# vi /etc/httpd/conf.d/ssl.conf
# add

  <Proxy *>
    Allow from localhost
  </Proxy>

  ProxyPass        / http://localhost:8787/
  ProxyPassReverse / http://localhost:8787/


# before </VirtualHost>


/etc/init.d/httpd restart

# weird error? ignore?
# Stopping httpd: [60G[FAILED]
# Starting httpd: httpd: apr_sockaddr_info_get() failed for i-00002979
# httpd: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName
# [60G[  OK  ]

sudo chkconfig httpd on
# sudo vi /etc/sysconfig/iptables 
# to the previously added line for 8787 modify to 
# -A INPUT -m state --state NEW -m tcp -p tcp --dport 443 -j ACCEPT
# change the research cloud firewall rules to reflect this change

sudo service iptables restart

# sudo vi /etc/rstudio/rserver.conf
 www-address=127.0.0.1


sudo /etc/init.d/rstudio-server restart

# now going to https://your.new.ip.address/
# should ask you to add an exception
# can also try sudo reboot?

6.1.6 git

yum install git
# restart R studio

6.1.7 ssh for github

  • in rstudio
  • tools / options / version control
  • create rsa key, ok, ok
  • view pub key, copy, paste to your github account
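If you prefer the shell to the RStudio dialog, the equivalent key generation is as follows. A sketch: the comment string is a placeholder, and the demo writes to a temporary path where you would normally use ~/.ssh/id_rsa:

```shell
# generate an RSA keypair non-interactively (empty passphrase for the demo)
KEYFILE=$(mktemp -u)    # demo path; normally $HOME/.ssh/id_rsa
ssh-keygen -q -t rsa -b 4096 -C "you@example.com" -f "$KEYFILE" -N ""
cat "${KEYFILE}.pub"    # paste this public key into your GitHub account
# then test the connection:
# ssh -T git@github.com
```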

6.1.8 gdal

sudo rpm -Uvh http://elgis.argeo.org/repos/6/elgis/x86_64/elgis-release-6-6_0.noarch.rpm
# yum list gdal*
yum install gdal-devel.x86_64 proj-devel.x86_64

6.1.9 geos

yum install geos-devel.x86_64

6.1.10 or under ubuntu

sudo apt-get update
sudo apt-get install libgdal1-dev
sudo apt-get install libproj-dev
# OR
# You need the development packages of GDAL and proj4. Probably easier to
#install from repository than from source. Try:

# sudo apt-get install libgdal1-dev libproj-dev
# sudo R
# install.packages("rgdal")

6.1.11 test readOGR

NB only possible once the PostGIS install is complete, see below.

################################################################
# name:readOGR2

readOGR2 <- function(hostip = NA, user = NA, db = NA, layer = NA, p = NA) {
  # NOTES
  # only works on Linux OS
  # returns an uninformative error due to either a bad connection or a
  # missing record in the geometry columns table; can check for a
  # connection problem using a test connect?
  # TODO add a prompt for each connection arg if is.na
  if (!require(rgdal)) install.packages('rgdal', repos = 'http://cran.csiro.au'); require(rgdal)
  if (is.na(p)) {
    pwd <- readline('enter password (ctrl-L will clear the console after): ')
  } else {
    pwd <- p
  }
  shp <- readOGR(sprintf('PG:host=%s
                          user=%s
                          dbname=%s
                          password=%s
                          port=5432', hostip, user, db, pwd),
                 layer = layer)
  # clean up
  rm(pwd)
  return(shp)
}

tassla06 <- readOGR2(hostip='115.146.95.82',user='gislibrary',db='ewedb', layer='tassla06')

                                        # func
if (!require(rgdal)) install.packages('rgdal'); require(rgdal)
if(!require(ggmap)) install.packages('ggmap'); require(ggmap)
epsg <- make_EPSG()
# load
latlong <- read.table(tc <- textConnection(
  "ID  POINT_Y   POINT_X
  1  150.5556 -35.09305
  2  150.6851 -35.01535
  3  150.6710 -35.06412
  4  150.6534 -35.08666
  "), header = TRUE); close(tc)
# do
for(i in 1:nrow(latlong)){
  coords <- as.numeric(latlong[i,c('POINT_Y', 'POINT_X')])
  e <- as.data.frame(cbind(i, t(coords), revgeocode(coords)))
  write.table(e, "test.csv", sep = ',', append = i > 1, col.names = i == 1, row.names = F)
}
d <- read.csv('test.csv')
head(d)
## Treat data frame as spatial points
pts <- SpatialPointsDataFrame(cbind(d$V2,d$V3),d,
                              proj4string=CRS(epsg$prj4[epsg$code %in% '4283']))
writeOGR(pts, 'test.shp', 'test', driver='ESRI Shapefile')

6.1.12 rgraphviz

wget http://www.graphviz.org/graphviz-rhel.repo
mv graphviz-rhel.repo /etc/yum.repos.d/ 
yum list available 'graphviz*'
yum install 'graphviz*'

# as root
R
source('http://bioconductor.org/biocLite.R')
biocLite("Rgraphviz")
q()

6.1.13 test

###########################################################################
# newnode: test-rgraphviz
try newnode_test from
git@github.com:ivanhanigan/disentangle.git

6.1.14 install just the postgres bits required for RPostgreSQL package

###########################################################################
# newnode: rpostgresql
vi /etc/yum.repos.d/CentOS-Base.repo
# append: exclude=postgresql* to [base] and [updates] sections
curl -O http://yum.postgresql.org/9.1/redhat/rhel-6-x86_64/pgdg-centos91-9.1-4.noarch.rpm
rpm -ivh pgdg-centos91-9.1-4.noarch.rpm
# kudos2 http://www.davidghedini.com/pg/entry/install_postgresql_9_on_centos
# Many, if not most, third party software packages and modules are still set to look for PostgreSQL's conf file and data directory under their old (pre-version 9) locations.
# You can address this, and make life easier for yourself, by creating a few symlinks from the new locations to the old.
# Symlink 1: Symlink for the binary directory. This is particularly useful as this is the location of the pg_config file
# so install the basic packages for a database
# install a basic PostgreSQL 9.1 server:
yum install postgresql91-server postgresql91 postgresql91-devel postgresql91-libs postgresql91-contrib
# THIS LINE HERE
ln -s /usr/pgsql-9.1/bin/pg_config /usr/bin  
# now check
R
install.packages('RPostgreSQL')
# works?

6.1.15 postgis utilities

################################################################
# name:postgisutils
yum list postgis2*
yum install postgis2_91.x86_64 postgis2_91-devel.x86_64 
# try 
/usr/pgsql-9.1/bin/raster2pgsql

# to use this and psql from the command line, create a ~/.pgpass file (here written using R)
su - user
R
sink('~/.pgpass')
cat('hostname:port:database:username:password')
sink()
q()
exit
chmod 0600 /home/user/.pgpass
# see http://www.postgresql.org/docs/current/static/libpq-pgpass.html
# and other methods
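The same file can be written directly from the shell without starting R. The field values below are placeholders; note that libpq ignores the file unless its mode is 0600:

```shell
# write a .pgpass entry: host:port:database:username:password
PGPASSFILE="${PGPASSFILE:-$HOME/.pgpass}"
echo 'hostname:5432:database:username:password' > "$PGPASSFILE"
chmod 0600 "$PGPASSFILE"     # required, or libpq silently skips the file
ls -l "$PGPASSFILE"
```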

6.1.16 TODO unixODBC

################################################################
# name:unixODBC
yum list unixODBC*
yum install unixODBC.x86_64  unixODBC-devel.x86_64 
# unixODBC-kde.x86_64 
R
install.packages('RODBC')
require(RODBC)

6.2 Test the Backups of this Minimal R Server.

To test that this will restore successfully if there is a crash or the VM becomes corrupted, take a snapshot from the ResearchCloud dashboard website and launch it as a new VM. Note that this will only restore the 10GB primary disc, which has our software, but not the user data, which is on the secondary disc; so try mounting that as well and check that the new VM will be a good replacement for the old one.
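A hedged sketch of checking the secondary disc on the replacement VM; the device name /dev/vdb and the mount point are assumptions (confirm the device with lsblk or fdisk -l on your VM):

```shell
DEV=/dev/vdb          # assumed device name of the secondary disc
MNT=/mnt/userdata     # arbitrary mount point for the check
# as root on the new VM:
#   mkdir -p "$MNT"
#   mount "$DEV" "$MNT"
#   ls "$MNT"         # the user home directories should be visible
echo "would mount $DEV at $MNT"
```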

6.2.1 Backup the 2nd Disc

################################################################
# name:2ndDisc
# on your secure local machine
mkdir /projects/OpenSoftware-RestrictedData/Archives
# on your remote server

ssh root@ip.address.of.server
cd /
mkdir backups
cd backups
zip -r homebackup_yyyymmdd /home
#scp homebackup_yyyymmdd.zip root@ip.address.of.server:/Archives
#or use the R studio download file function
# upload to the new R server
exit
scp /Archives/homebackup.zip root@ipaddress:/home
# and unzip (TODO need to mv these up to the /home/ level after unzip)
ssh root@ip.address.of.server
cd home
unzip homebackup.zip
rm homebackup.zip

6.2.2 Launch from this snapshot and test the R server and 2nd Disc

6.2.3 TODO the permissions of the user on their home directory cause issues for logging in.

I've just used userdel -r username, then useradd etc., and used their own user account to scp the backup files across. One solution is to not offer backup services; users would then have to undertake to load all data again from a fresh server if it failed. Data could be backed up on Brawn only?

6.3 Oracle XE Permissions and Users System

Installing Oracle XE. This is used as a data registry, and I have an HTML-DB application available from XXXXXX to be used. It is called OraPUS because the Oracle XE licence restricts any references to Oracle in the name of the application. There are other restrictions on this licence, such as XXXXX.

6.3.1 backup local ubuntu version

I have an Ubuntu PC as my main desktop (and also for backups). To create Oracle on this I needed a CentOS virtual machine; use the instructions at http://extr3metech.wordpress.com/2012/10/25/centos-6-3-installation-in-virtual-box-with-screenshots/ and also the comments, i.e.: @Gunwant, one way of configuring the network is to open your VirtualBox settings, go to the “Network” tab and set the “Attached to” option to “Bridged Adapter”.

Then boot your centos virtual machine, and type the following in your terminal:

And then change the file to the following:

DEVICE="eth0" BOOTPROTO="dhcp" ONBOOT="yes"

And save the file and exit. After that you need to restart the network service and you can do that by typing the following in the terminal:

Now, you can check your command

Now, it should work.
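The commands themselves did not survive in the quoted comment. The sketch below is a reconstruction following CentOS 6 conventions (file path and service name are assumptions, not the original post's exact text):

```shell
# the interface definition file /etc/sysconfig/network-scripts/ifcfg-eth0
# should contain the lines quoted above:
CFG='DEVICE="eth0"
BOOTPROTO="dhcp"
ONBOOT="yes"'
echo "$CFG"
# then, as root, restart networking and confirm the interface has an address:
#   service network restart
#   ifconfig eth0
```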

6.3.2 INIT

################################################################
# name:init
yum update
# SElinux config
# vi /etc/selinux/config
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=enforcing
# Change SELINUX=enforcing to SELINUX=disabled; you must reboot the server after applying the change.
# vi /etc/hosts
127.0.0.1  localhost.localdomain localhost ADD-VMNAME 

6.3.3 SWAP

MAKE SURE YOU HAVE 2GB SWAP. This is required for the install, but I have deleted it afterward and things still work (http://bvirtual.nl/2009/08/add-more-swap-space-to-running-linux.html). I set it to 3000 MB and didn't set it to start at boot, so I have to run swapon /swapfile each time.

################################################################
# name:swap
#Create an empty file called /swapfile from about 2GB
dd if=/dev/zero of=/swapfile bs=1024000 count=3000
#Format the new file to make it a swap file
mkswap /swapfile
#Enable the new swapfile. Only the swapon command is needed, but with
#the free command you can clearly see the swap space is made available
#to the system.
free -m | grep Swap
swapon /swapfile
free -m | grep Swap
# TOO LAZY TO ADD TO BOOT SO JUST DO SWAPON EVERY TIME?
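To avoid running swapon by hand after every reboot, the swap file can be listed in /etc/fstab. A sketch: the line is shown rather than written for you, so back up /etc/fstab and append it as root:

```shell
# fstab entry for the swap file created above
FSTAB_LINE='/swapfile  none  swap  sw  0  0'
echo "$FSTAB_LINE"
# as root, append it and activate everything listed in fstab:
#   echo "$FSTAB_LINE" >> /etc/fstab
#   swapon -a
#   free -m | grep Swap
```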

6.3.4 DOWNLOAD AND SCP

create free oracle account. download to local and then upload to server.

scp oracle-xe-11.2.0-1.0.x86_64.rpm.zip user@ip.address.of.server:/home/

6.3.5 Install the database

I followed these online guides: http://youtu.be/DRZg8tfmldA (this needs VNC or other) http://www.davidghedini.com/pg/entry/install_oracle_11g_xe_on

yum install libaio bc flex unzip 
cd /home
mkdir OracleXE
unzip oracle-xe-11.2.0-1.0.x86_64.rpm.zip -d OracleXE 
cd OracleXE/Disk1
rpm -ivh oracle-xe-11.2.0-1.0.x86_64.rpm  
/etc/init.d/oracle-xe configure
# accept defaults
#set up the etc/group file so that oracle is in dba (and oinstall? not there?)
id -a oracle
#log on as oracle and then 
su - oracle
# set the environment
cd /u01/app/oracle/product/11.2.0/xe/bin  
. ./oracle_env.sh 
# NOT DONE, which users needs this?  which dot file?  both?  To set
# the environment permanently for users, add the following to the
# .bashrc or .bash_profile of the users you want to access the
# environment:
     . /u01/app/oracle/product/11.2.0/xe/bin/oracle_env.sh 
sqlplus /nolog
connect sys/password as sysdba
#To allow remote access to Oracle 11g XE GUI (as well as Application Express GUI) issue the following from SQL*Plus
EXEC DBMS_XDB.SETLISTENERLOCALACCESS(FALSE);  
exit
su - root
# modify firewall
# vi /etc/sysconfig/iptables
# copy -A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
# twice and modify for 8080 and 1521
# restrict 1521 to your administration IP address
# will also need the port 8080 enabled on the research cloud firewall
service iptables restart

Now we can check the database service is running by pointing a browser at http://ip.address.of.server:8080/apex

Although Apex is already installed, we still need to set the Internal Admin password.

# To do so, run the apxchpwd.sql located under /u01/app/oracle/product/11.2.0/xe/apex:
su - oracle
. /u01/app/oracle/product/11.2.0/xe/bin/oracle_env.sh
sqlplus /nolog
connect sys/password as sysdba
@/u01/app/oracle/product/11.2.0/xe/apex/apxchpwd.sql 
#Note: pick something simple like Password123! as you will be prompted to change it on first log in anyway.
# access the Application Express GUI at:
http://ip.address.of.server:8080/apex/f?p=4550:1

Now log on:

  • Workspace: Internal
  • User Name: admin
  • Password: (whatever you selected above).
  • log in, change password
  • log in again

Now create Workspace:

  • Workspace Name = DDIINDEXDB
  • next page say existing schema "no" and write DDIINDEXDB into the schema Name
  • schema password same as the user you want
  • 100MB
  • admin user IVAN

6.3.6 import application and set Security

import the application 102 then set authentication I followed the advice from the oracle documentation http://docs.oracle.com/cd/E17556_01/doc/user.40/e15517/sec.htm#BCGICEAH Changing the Authentication Scheme Associated with an Application. To change the authentication scheme for an application:

Navigate to the Authentication Schemes:

  • On the Workspace home page, click the Application Builder icon.
  • Select an application.
  • On the Application home page, click Shared Components.
  • The Shared Components page appears.
  • Under Security, select Authentication Schemes.
  • Click the Change Current tab at the top of the page.
  • From Available Authentication Schemes, select a new authentication scheme and click Next.
  • (Application Express is good)
  • Click Make Current.

6.3.7 TODO explain the table creation script

now go to lib/setuporacleofdelphe.r to build the tables

6.3.8 set up R, RJDBC and ROracle

Using RJDBC is what I recommend. I found ROracle suboptimal.

su root
#make sure env is set
export LD_LIBRARY_PATH=/u01/app/oracle/product/11.2.0/xe/lib:$LD_LIBRARY_PATH

R
connect2oracle <- function() {
  if (!require(RJDBC)) install.packages('RJDBC'); require(RJDBC)
  drv <- JDBC("oracle.jdbc.driver.OracleDriver",
              '/u01/app/oracle/product/11.2.0/xe/jdbc/lib/ojdbc6.jar')
  p <- readline('enter password: ')
  h <- readline('enter target ip address: ')
  d <- readline('enter database name: ')
  ch <- dbConnect(drv, paste("jdbc:oracle:thin:@", h, ":1521", sep = ''), d, p)
  return(ch)
}
ch <- connect2oracle()


# OR
#install.packages('ROracle')

6.3.9 TODO maintenance

6.3.10 unlock

6.3.11 unlock header

---
name: unlock
layout: default
title: unlock
---

http://riaschissl.blogspot.com.au/2010/07/oracle-unexpire-and-unlock-accounts.html

Some of our customers' applications are built around Oracle, so we have to fight the beast from time to time. Unfortunately, some of the surprises the beast has to offer are quite random and rare, and because of this we tend to simply forget how we fixed and/or circumvented the issues previously.

As usual, google is your friend and one of the most valuable resources on the net we've found is www.orafaq.com and most notably its security FAQs [1]

So this is just an attempt at a small cheat sheet to help our overloaded brains :-)

expired and locked accounts - the basics
expired and locked accounts - the basics
Now, as of version 11g, Oracle has enabled account expiration per default for many vital accounts (such as SYSMAN, SYS, ...). Quite a weird idea in my view, but who knows what hyper security things some wise engineer had in mind when doing so.

Beware that we run Oracle under various Linux flavours, so things might be different for you.

become database admin
First log into your host running oracle and become the oracle user.

#### Code:unlock
    sqlplus /nolog
    connect / as SYSDBA



find out which accounts are expired

    select username, account_status from dba_users where ACCOUNT_STATUS LIKE '%EXPIRED%';

unexpire an account
once an account has been expired, it can only be revived by assigning it a new password:

#### code
    ALTER USER scott IDENTIFIED BY password;

unlock an account

    ALTER USER scott ACCOUNT UNLOCK;

disable default password expiry [2]
this all depends on the profile a user belongs to; to disable password expiry for all users assigned the default user profile do this:

    ALTER PROFILE DEFAULT LIMIT PASSWORD_LIFE_TIME UNLIMITED;

6.3.12 extending/revising the database to track servers and files

Aim:

  • the db is good for files, and even books etc.
  • I think it might also be good for tracking server maintenance tasks
  • I will experiment by modifying the names of the data entry fields ie from file to server
  • and the tables stay the same. then I can also modify the app.
  • change the names and fields
    • go to page, edit page, remove fields one at a time
    • keep the pages to the bare minimum
    • go to edit page and change their names

    ##### stdydscr becomes Projects

    • idno becomes projectid
    • titl = projecttitle
    • authenty = primaryinvestigator
    • distrbtr = contactaddress
    • abstract = businessrequirements

    ##### filedscr becomes servers

    • filename = vmname
    • Filelocation = ipaddress
    • Filedscr = Description
    • reqid = authorisedaccessgroupid
    • notes = Server function

    ##### datadscrs becomes tasks

    • LABL = task
  • SAVE VERSIONS
    export application to dropbox/Catalogue/BACKups

6.4 DDIindex

The DDIindex is a Java-based catalogue that was written by the Australian Data Archive at the ANU (formerly the ASSDA, http://assda.anu.edu.au/ddiindex.html). The original application source is still available from the ASSDA site, but I am using the one I have archived because it has some configurations that I don't remember how to make again at the moment.

6.4.1 TOMCAT

http://coastalrocket.blogspot.com.au/2012/08/installing-geostack-on-centos-6-64bit.html

###########################################################################
# newnode: install-tomcat
yum -y install tomcat6 tomcat6-webapps
# not tomcat6-admin-webapps 
vi /etc/tomcat6/tomcat-users.xml
#add a user with roles of admin,manager like this
<user username="tomcat" password="password" roles="admin,manager" />
###########################################################################
# newnode: for-restarts
chkconfig tomcat6 on

changing the Tomcat Port

  • Locate server.xml in {Tomcat installation folder}\ conf \
  • change <Connector port="8080"/> to <Connector port="8181"/>
  • and the connector executor too? only if it is not commented out (it was commented out on CentOS, not on RHEL)
  • change iptables and restart
service tomcat6 restart

6.4.2 UPLOAD THE DDIINDEX

6.4.3 MySQL

http://www.if-not-true-then-false.com/2010/install-mysql-on-fedora-centos-red-hat-rhel/

rpm -Uvh http://rpms.famillecollet.com/enterprise/remi-release-6.rpm
yum --enablerepo=remi,remi-test list mysql mysql-server
yum --enablerepo=remi,remi-test install mysql mysql-server
/etc/init.d/mysqld start ## use restart after update
## OR ##
service mysqld start ## use restart after update

chkconfig --levels 235 mysqld on

mysqladmin -u root password [your_password_here]
mysql -h localhost -u root -p
then create database, 
database user and
database password 
## CREATE DATABASE ##
mysql> CREATE DATABASE security;

## CREATE USER ##
mysql> CREATE USER 'ddiindex'@'localhost' IDENTIFIED BY 'BlahBlahBlah';

## GRANT PERMISSIONS ##
mysql> GRANT ALL ON security.* TO 'ddiindex'@'localhost';

##  FLUSH PRIVILEGES, Tell the server TO reload the GRANT TABLES  ##
mysql> FLUSH PRIVILEGES;

# and then run the next command at the command line:

mysql -h localhost -u root -p security < security_2010-04-27-1259.sql; 
# Enter password: 
# deprecated: let port 3306 through iptables?

6.4.4 make sure ddiindex can connect to mysql as ddiindex user

# now go to /usr/share/tomcat/webapps/ddiindex/WEB-INF/classes/index-conf.xml
# dburl change to 
    <dbUrl>jdbc:mysql://localhost:3306/security</dbUrl>
    <dbUserName>ddiindex</dbUserName>
    <dbPassword>BlahBlahBlah</dbPassword>
# other personalisations here such as <dataServer>
mkdir /opt/ddiindex/ewedb
mkdir /opt/ddiindex/ewedb_s

From home.jsp, login.jsp and search.jsp, removed the line <%@taglib uri="http://assda.anu.edu.au/ddiindex" prefix="ddiindex"%>

6.4.5 TODO Check the security implications of allowing write permissions here

I found I need to allow write access to the xmldata folder for uploads of the xml metadata files.

adduser ddiindex
chown -R ddiindex /xmldata
chmod 0777 /xmldata
# bad for security, try 0750 instead
# chmod 0777 /opt/ddiindex/nceph
# or this works
chmod -R 777 /opt/ddiindex/nceph
chmod 0777 /opt/ddiindex/nceph_s
chmod -R 777 /opt/ddiindex/ewedb
chmod 0777 /opt/ddiindex/ewedb_s
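As the comments above note, 0777 is bad for security. A less permissive sketch, assuming the ddiindex account is the one Tomcat writes as, demonstrated here on a throwaway directory:

```shell
# demo directory standing in for /xmldata
DEMO=$(mktemp -d)/xmldata
mkdir -p "$DEMO"
# on the real server, as root:
#   chown -R ddiindex /xmldata
chmod 0750 "$DEMO"            # owner rwx, group r-x, others none
stat -c '%a %n' "$DEMO"
```

This keeps write access for the owning service account only, rather than for every user on the system.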

6.4.6 Running the Indexer

Log on as an admin user (set in the MySQL UserEJB table) and you can run indexer.jsp. Before doing this the first time, I did the following:

  • went to nceph and ncephs deleted the recent created files
  • went to xmldata, deleted all
  • restart TOMCAT
  • go to indexer, upload xml
  • hit the 'add to index' button.

6.4.7 personalise the ddiindex

i.e. in home.jsp: <div class="intro">The Graduate Information Literacy Program at ANU now offers resources for researchers and graduate students on data management planning: a comprehensive Manual and a Data Management Planning Template. <a href="http://ilp.anu.edu.au/dm/" target="blank">http://ilp.anu.edu.au/dm</a></div>

6.5 Set up a private git server for data and code

6.6 True-Crypt encrypted volumes

The servers are secure, but we will sometimes want local copies, so we should use TrueCrypt or another option. I am using Ubuntu as my desktop, so here are the instructions.

################################################################
#http://www.liberiangeek.net/2012/07/install-truecrypt-via-ppa-in-ubuntu-12-04-precise-pangolin/
sudo add-apt-repository ppa:michael-astrapi/ppa
sudo apt-get update && sudo apt-get install truecrypt

#### FAILED TO INSTALL ON REDHAT ####
# or http://ubuntupop.blogspot.com.au/2012/10/how-to-install-truecrypt-on-centos-63.html
# 1. Download Truecrypt package. Or use wget instead
# wget http://www.truecrypt.org/download/truecrypt-7.1a-linux-x64.tar.gz
# 2.  Now extract the package
# tar xzvf truecrypt-7.1a-linux-x64.tar.gz
# 3. It will produce a file called truecrypt-7.1a-setup-x64
# chmod +x truecrypt-7.1a-setup-x64 
# ./truecrypt-7.1a-setup-x64
# installer ran but then 
# truecrypt -h 
# truecrypt: error while loading shared libraries: libfuse.so.2: cannot open shared object file: No such file or directory
# yum install libfuse.so.2 didn't help

# DIDN'T TRY  http://vinnn.blogspot.com.au/2009/01/tryecrypt-6-on-centos-5.html

6.7 TODO ResearchData storage

################################################################
# name:ResearchData
cd /home
mkdir ResearchData
# and add them to a particular group
groupadd researchdata
usermod -a -G researchdata username
chown -R root:researchdata /home/ResearchData
# TODO check what the security numeric codes do
chmod -R 0777 /home/ResearchData
chmod -R 0750 /home/ResearchData
service rstudio-server restart
# user may need to q() from R session and sign out of rstudio?
# for a user to have this link in their home go
su username
ln -s /home/ResearchData /home/username/ResearchData

6.8 TODO example workflow to dev a local data package and publish to shared

  • work in personal workspace then as root

[root@i-00002e43 ResearchData]# mkdir GARNAUTCLIMATECHANGEREVIEW
[root@i-00002e43 ResearchData]# chown -R root:researchdata /home/ResearchData
[root@i-00002e43 ResearchData]# chmod -R 0750 /home/ResearchData
[root@i-00002e43 ResearchData]# ls -l
[root@i-00002e43 ResearchData]# cd GARNAUTCLIMATECHANGEREVIEW/
[root@i-00002e43 GARNAUTCLIMATECHANGEREVIEW]# cp -R home/ivanhanigan/projects/GARNAUTCLIMATECHANGEREVIEW/* .
[root@i-00002e43 GARNAUTCLIMATECHANGEREVIEW]# rm GARNAUTCLIMATECHANGEREVIEW.Rproj
rm: remove regular file `GARNAUTCLIMATECHANGEREVIEW.Rproj'? y

6.9 TODO Share with windows network

7 Procedures for adding users to the system

7.1 Description of the access procedure

7.1.1 Getting Access

The "Getting Access" procedure to help users apply for and gain access to mortality data is shown in Figure 1, and a more detailed description of the process is shown in the Appendix, Figure 4. The process is a set of formally defined steps that are designed to move the User through two general stages:

  • Requesting data: guiding the researcher through the process of (a) gaining Ethics Approval from a Human Research Ethics Committee, and (b) Project Level Approval from the Registrar of Births, Deaths and Marriages.
  • Providing data: provision of a confidentialised (often aggregated) dataset in an appropriately secured manner such as access to remote secure servers or encrypted archives accessed on local disk media, with security determined by (a) the nature of the data and (b) any project management related criteria.

7.1.2 Managing Access

Procedures for managing access are shown in Figure 2 (and in the Appendix, Figure 5). These activities are intended to maintain information on the current state of projects using the mortality data, and to report any changes in situation to the State Registries. The User Administrator is responsible for conduct of the "Managing Access" procedures.

The process is initiated by running a query on the ANU-User-DB to make a list of all Projects and Users, and then each Project is sent a reminder to report any changes in Project Status (sent annually to coincide with a similar reminder sent by the ANU Human Research Ethics Committee). The purpose of these reminders is to ensure that Project management plans continue to consider data security as a primary concern, even during long multi-year projects where many project management and staffing issues inevitably arise.

Once a response is received, the User Administrator then enters the relevant information into the ANU-User-DB, and if the project has been concluded will then initiate the final "Ending Access" process.

7.1.3 Ending Access

The procedure for ending access aims to ensure that data are both securely and sustainably stored. It is very important that files used for authorised projects are never re-used in un-authorised projects, but that future researchers may have the opportunity to create an authorised project and potentially replicate historical analyses. This is an important part of reproducible research and the robust practice of scientific enquiry.

7.2 The process user administrators go through to set up users

When adding users, go to the oraphi database and track their permissions there.

7.2.1 Lodge the request in the user database

  • Enter the new study into the NCEPH ddiindexdb.
  • Create a new file record with filelocation set to the EWEDB schema and working directory on brawn.
  • In the primary database record (not the project record), add the access requestor; under accessors, add the accessor and the access end date.
  • Finalise the approval.

7.2.2 r-nceph

Production changes go through David Fisher; the testbed is administered by me.

7.3 brains

7.3.1 linux user

adduser userxxx
passwd userxxx 
# (I use R to randomly generate passwords by mixing words, letters,
# numbers and symbols) 
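The password generation mentioned in the comment above is done in R by the author. As a sketch only (this is not the author's actual script), an equivalent shell approach draws random characters from /dev/urandom:

```shell
# Generate a 12-character password mixing letters, numbers and symbols.
# LC_ALL=C keeps tr byte-oriented so it is safe in any locale.
pw=$(LC_ALL=C tr -dc 'A-Za-z0-9!@#%_' < /dev/urandom | head -c 12)
echo "$pw"
```

The generated password can then be supplied to passwd for the new account.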

7.3.2 mysql (check ddiindex)

mysql -u root -p security
# then, at the mysql prompt:
select * from UserEJB;
insert into UserEJB (id, password, admin) values ('userxxx', 'passwd', 0);
7.3.3 oracle (oraphi)

  • Log in as an admin user to the ddiindexdb workspace and go to Administration.
  • On the right there is a Tasks pane with a "Create User" task.

7.3.4 github

  • Downstream users can simply download zip archives from GitHub.
  • Collaborators need to sign up for a GitHub account.
  • A walk-through will be created for them.

7.4 brawn

7.4.1 TODO postgres

TODO: find out whether allowing connections from 0.0.0.0/0 in pg_hba.conf is an unacceptable security risk.

# First edit the pg_hba.conf file (here /home/pgsql/9.2/data/pg_hba.conf):
# add a superuser entry for my own IP address and allow all the www IP
# addresses access as the gislibrary user
host    all             postgres        my.desk.ip.address/32       md5
host    all             gislibrary      0.0.0.0/0                   md5
# save, then create the new user:
su - postgres
psql postgres postgres
CREATE ROLE userxxx LOGIN PASSWORD 'xxxxx';
GRANT gislibrary TO userxxx;
CREATE SCHEMA userxxx;
GRANT ALL ON SCHEMA userxxx TO userxxx;
GRANT ALL ON ALL TABLES IN SCHEMA userxxx TO userxxx;
GRANT ALL ON ALL FUNCTIONS IN SCHEMA userxxx TO userxxx;
GRANT ALL ON ALL SEQUENCES IN SCHEMA userxxx TO userxxx;
# might want to set an expiry date for the role. Now reload the configuration:
SELECT pg_reload_conf();
# or, if not connected, just: service postgresql-9.2 restart
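After creating the role, it is worth checking that the new account can actually connect. A sketch only; 'brawn.ip.address' is a placeholder and this check is not part of the documented procedure:

```shell
# Build the verification command for a newly created account.
# Replace brawn.ip.address with the real server address before running.
PGHOST=brawn.ip.address
CMD="psql -h $PGHOST -U userxxx -d ewedb -c 'SELECT 1;'"
# Printed rather than executed here; run it from the user's machine.
echo "$CMD"
```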

7.4.2 geoserver

  • General users do not need to log in.
  • Admin users can log on with the admin account (to be confirmed).

7.4.3 alliance wiki

The Alliance wiki at https://alliance.anu.edu.au/welcome/ authenticates users with their ANU password.

7.4.4 email details without password

Dear user,

The new EWEDB server is now ready. You can configure your Kepler environment by following the tutorial at http://swish-climate-impact-assessment.github.io/2013/05/hello-ewedb/

The server details are:

database = ewedb
host = ipaddress
user = username

Your password will be sent to you as an SMS to your designated mobile phone number.

NB Please note that the connection information will be saved in plain text to your user profile. As this presents a security risk we request that you only use this service from a secure computer.

Thank you, Ivan Hanigan Data Management Officer. National Centre for Epidemiology and Population Health. College of Medicine, Biology and Environment. Australian National University Canberra, ACT, 0200. Ph: +61 2 6125 7767. Fax: +61 2 6125 0740. Mob: 0428 265 976. CRICOS provider #00120C.

7.4.5 text message the set password

This solution is adequate. It may be worth investing in a sacrificial SIM card to avoid receiving too many phone calls on a personal number.

7.5 edits to oraphi via sql

-- record the user's phone number against their accessor record
update accessors
set othermat = 'phone number'
where accessornames = 'username';

7.5.1 revocation

userdel -r ivan_hanigan
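Removing the Linux account alone leaves the database accounts created earlier in place. The following sketch lists the companion steps, assuming the account names used above; the commands are printed rather than executed because they are destructive, and they are not part of the formally documented procedure:

```shell
# Revocation checklist for one user; drop the leading 'echo' to actually run.
USER=userxxx
echo userdel -r "$USER"     # Linux account and home directory
echo mysql -u root -p security -e "DELETE FROM UserEJB WHERE id = '$USER';"
echo su - postgres -c "psql -c \"DROP OWNED BY $USER; DROP ROLE $USER;\""
```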

8 Backups

8.1 General

8.1.1 maintenance

8.1.2 nearline for potential restore

8.1.3 archive and remove

8.2 General concerns

8.3 Brains

8.3.1 Backup oraphi

This needs to be done from Brains because it requires the Oracle client to be set up.

  ###########################################################################
  # newnode: backupOraphi
  # Using an emacs ssh buffer, log on to the target server, start R and
  # start ESS-remote. Do this as root and write the output into /backups.
  if (!require(RJDBC)) install.packages('RJDBC'); require(RJDBC)
  drv <- JDBC("oracle.jdbc.driver.OracleDriver",
              '/u01/app/oracle/product/11.2.0/xe/jdbc/lib/ojdbc6.jar')
  ch <- dbConnect(drv, "jdbc:oracle:thin:@xx.yy.zz.ww:1521", "DDIINDEXDB", "password")
  # create a dated directory for this backup run
  dir.create(paste('ddiindexdb-backups/csvs/', Sys.Date(), sep = ''), recursive = TRUE)
  # export each metadata table to CSV
  for (tbl in c('STDYDSCR', 'DATADSCR', 'FILEDSCR', 'KEYWORDS',
                'ACCESSORS', 'ACCESSSTDY', 'OTHRSTDYMAT')) {
    dataout <- dbReadTable(ch, tbl)
    write.table(dataout,
                paste('ddiindexdb-backups/csvs/', Sys.Date(), '/', tbl, '.csv', sep = ''),
                na = "", row.names = FALSE, sep = ',', quote = TRUE)
  }
# download these CSVs to a backup location for storage
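The download step can be done with scp from the secure backup machine; a sketch, with the server address and destination path as placeholders:

```shell
# Build the scp command for today's dated CSV backup directory.
# brains.ip.address and /secure/backup/location are placeholders.
SRC="root@brains.ip.address:/backups/ddiindexdb-backups/csvs/$(date +%F)"
DEST=/secure/backup/location
# Printed rather than executed here; run it from the backup machine.
echo scp -r "$SRC" "$DEST"
```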

8.3.2 sync the home directory

8.4 Brawn

8.4.1 Find out how big it is

AKA brawn-dbsize
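One way to answer this is with PostgreSQL's built-in size functions (pg_database_size and pg_size_pretty are standard PostgreSQL functions); a sketch, to be run on brawn as the postgres user:

```shell
# Report the on-disk size of the ewedb database in human-readable form.
SQL="SELECT pg_size_pretty(pg_database_size('ewedb'));"
# Printed rather than executed here; run it on the database server.
echo su - postgres -c "psql -c \"$SQL\""
```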

8.4.2 Dump and download it to a secure computer

  • file system backup (backup-brawn-filesystem)
  • pgdump
    Back up the pg_dump output to your workplace or another VM on the Research Cloud.

    First, set up the backup directory and run the dump:

    mkdir /home/backup
    cd /home/backup
    /usr/pgsql-9.2/bin/pg_dump -U postgres ewedb | gzip > ewedb_$(date +"%F-%H%M").gz
    # see http://www.postgresql.org/docs/9.2/static/app-pgdump.html
    # NB tried -Fc compressed version but took too long to restore
    # see http://www.postgresql.org/docs/9.2/static/backup-dump.html
    # 24.1.3. Handling Large Databases
    

    Optionally you could set up a scheduled cron job

    crontab -e
    # 15 past midnight every day
    15 0 * * * /home/ewedbbackup/backup_script.sh
    

    Then add this script.
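The script itself is not shown in the original notes; a minimal sketch, reconstructed from the pg_dump command above (the path /home/ewedbbackup/backup_script.sh comes from the cron entry, and everything else is an assumption):

```shell
#!/bin/sh
# Hypothetical backup_script.sh: dump the ewedb database to a timestamped
# gzip file, using the same pg_dump path and directory as the manual steps.
PGDUMP=/usr/pgsql-9.2/bin/pg_dump
BACKUP_DIR=/home/backup
STAMP=$(date +"%F-%H%M")
OUT="$BACKUP_DIR/ewedb_${STAMP}.gz"
if [ -x "$PGDUMP" ]; then
    mkdir -p "$BACKUP_DIR"
    "$PGDUMP" -U postgres ewedb | gzip > "$OUT"
else
    # Allows the script to be sanity-checked on machines without PostgreSQL.
    echo "pg_dump not found at $PGDUMP; no dump written" >&2
fi
```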

8.4.3 Or: a simple home directory backup

First set up passwordless SSH (see http://www.linuxproblem.org/art_9.html). On the backup machine, edit /etc/sysconfig/iptables and /etc/hosts.allow to allow connections from the production server. Then copy the production server's public key to the backup user's home directory on the backup machine:

scp ~/.ssh/id_rsa.pub root@115.146.92.162:

Append the public key to that user's list of authorised keys:

mkdir -p .ssh
cat id_rsa.pub >> .ssh/authorized_keys

Then lock down the permissions on .ssh and its contents, since the SSH server is fussy about this:

chmod 700 .ssh
chmod 400 .ssh/authorized_keys

That's it; passwordless SSH is now set up. Log out of the server, then log back in to confirm.

Finally, set up the production server to send its home directory to the backup server at a regular time:

crontab -e
# 15 past midnight every day
15 0 * * * rsync -arvR --delete /home root@my.server.ip:/home/brainsbackup/

8.4.4 Restore the database dump into a new database on another machine
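A sketch of the restore, assuming the gzipped dump produced by the pg_dump step above and standard PostgreSQL client tools on the new machine (the dump file name is a placeholder):

```shell
# Restore a gzipped plain-SQL dump into a freshly created database.
DUMP=ewedb_YYYY-MM-DD-HHMM.gz   # placeholder: use the real dump file name
if command -v createdb >/dev/null 2>&1 && [ -f "$DUMP" ]; then
    createdb -U postgres ewedb
    gunzip -c "$DUMP" | psql -U postgres ewedb
else
    # Allows a dry sanity check on machines without the dump or the tools.
    echo "set DUMP and run this on the target database server" >&2
fi
```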

8.4.5 TODO launch a new Nectar VM from the snapshot image

8.4.6 TODO mount the 2nd disc and load/restore the postgres db and data into it

8.5 Disaster Recovery Plan

8.5.1 test a snapshot

Snapshots should be tested every three months (frequency to be confirmed).

  • cleaning up links on github
    In the case of a completely failed VM, launching a fresh VM from the snapshot will give a different IP address, so GitHub links, other links, and users' bookmarks will need to be revised.

8.5.2 TODO 60GB disk is not being saved in snapshots

We will also need to investigate the restore procedure (and security) for the secondary disk.

8.5.3 TODO Restore ORAPHI

This should restore with data intact from the last snapshot. If not, first set up Oracle and the tables, get the most recent backup of the tables, and load them into the new ORAPHI. In the event of an upgrade to a tested replacement, the contents of the running VM being retired may be more current than those of the replacement server, so the two will need to be synchronised.

9 Examples

9.1 Pumilio for Bioacoustics