Workspaces:1:Working Notes

From IKMEmergent

Jump to:navigation, search

These notes are from the early stages of the Young Lives Linked Data Demonstrator and are maintained here as an archive only.

Contents

Young Lives Data Demonstrator

Working with data from the Young Lives project we are aiming to identify:


Development Environment

Development Environment

Exploring the data

Data from the Young Lives project is available in SPSS files through the UK Data Archive.

The free and open-source R statistical software can read SPSS files using the 'foreign' package (alternatively, PSPP is an open source clone of SPSS). With a young lives data file in the current working directory, the following commands can be used to convert the SPSS file to CSV which is easier to work with in most data conversion processes. 

install.packages("foreign")
install.packages("Hmisc")
library(foreign)
library(hmisc)
fulldata<-spss.get("DataMashup_Peru_FullData.sav",use.value.labels=TRUE)
write.csv(fulldata,file="datafile.csv")

Once you have a CSV file you can explore the data in spreadsheet software or a tool such as Freebase Gridworks. Within R the following commands may be useful:

> library(Hmisc)
> describe(fulldata) #Show the questions asked for each variable;
> summary(fulldata) #Show a summary of the available data;

Modelling the data as RDF involves:

Initial modelling decisions

The below gives an example of a very rough-and-ready attempt to model a question from a Young Lives dataset in RDF:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix definitions: <http://younglives.practicalparticipation.co.uk/ontowiki/younglives/defn#> .
@prefix questions: <http://younglives.practicalparticipation.co.uk/ontowiki/younglives/questions/> .
@prefix answers: <http://younglives.practicalparticipation.co.uk/ontowiki/younglives/answers/> .
@prefix younglives: <http://younglives.practicalparticipation.co.uk/ontowiki/younglives/items/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

<younglives:individual/2>
    a younglives:individual ;
    definitions:gender <dbpedia:Female> ;
    definitions:age "15" ;
    questions:ableToSpeakAboutViewsAndFeelingsToParentsOrGuardians <answer:CertainlyTrue> ;
    questions:mostlyParentsTreatFairlyWhenDoneWrong <answer:NA> .

<question:ableToSpeakAboutViewsAndFeelingsToParentsOrGuardians>
    rdfs:label "You feel able to speak about your views and feelings with your parents/guardians" ;
    definitions:relatedQuestions "younglives:questions/mostlyParentsTreatFairlyWhenDoneWrong" ;
    definitions:hasAnswer <answer:CertainlyTrue>, <answer:NA>, <answer:NotTrue>, <answer:AnswerRefused> .

<answer:CertainlyTrue> rdfs:label "A little true for you" .

<answer:NA> rdfs:label "NA" .

<answer:NotTrue> rdfs:label "Not true for you" .

<answer:AnswerRefused> rdfs:label "Refused to answer" .

The graph below (geneated with [RDF Gravity) gives a visual representation of the RDF statements above. You can see the individual respondent and their demographics to the top-left of the diagram, their question answers towards the middle, and the question to which those answers may belong on the right.

RDF-Graph-Version1.png

Questions

Further notes on modelling

There are a number of vocabularies available for statistical data.

The RDF Data Cube vocabulary provides a way of representing measures, dimensions and atributes.

The Data Cube vocabulary also has a concept of slices (grouping together a set of observations along a particular set of dimensions to attach further meta-data to these). For example, in Young Lives, all the observations from one country may constitute a slice against which specific meta-data for that country can be attached.

The data cube vocabulary uses SKOS to represent concepts.


DDI is a standard for micro-data (e.g. Survey data) rather than aggregate statistical data (which is the preserve of SDMX). The spssDDI packaged in R can be used to read SPSS files and write DDI files. The following example shows how to generate an DDI file from an SPSS file using R. The DDI XML may possibly be useful in generating Data Cube RDF

library(spssDDI)
data<-readSpssSav("data.sav") 
sink("dataddi.xml") 
writeDDI(essSample,"essSample.sav") 
sink()

SDMX

Drawing on the namespaces here [1] we can use SDMX concepts for describing dimensions in our data cube which match to those used in SDMX standards. This should support comparison between summary statistics in our dataset and any from SDMX using systems.


A list of SDMX concepts can be retrieved from our SPARQL endpoint with this query [2]

Meeting Notes - 8th September 2010

Tim Davies, Anne Solon, Caroline Knowles present (TD's notes)

Data structure

We can annotate questions, and potentially set the scope of those annotations by country.

Even demographic data can have caveats. E.g. in Ethiopia respondents may not be clear on their age or date of birth: may give the same age two years in a row (more than 12 months apart).

Other datasets worth looking at

The Birth to 20 project in South Africa: potentially using many of the same questions as the Young Lives project.

Also need to look for taxonomies of topics that may be relevant for categorising questions. Could look at journal key-words, with a focus on the following journals:

Linking against a journal taxonomy could be useful in creating connections between question meta-data and wider data on research.

Notes on dataset structure

The following NA/missing data codes are used:

The ChildID code is made up of multiple components:

It's not possible to release the exact locations of each site for any site with < 40,000 people. However, each site does have the administrative geography hierarchy recorded for it - so we could work out what level of administrative geography gives anonymisation to all sites (e.g. County; Region) and code all sites against geography at that level of detail.

Focus for interface prototypes

Interface prototypes should focus on:

Licensing

We need to think carefully about licensing. The terms and conditions of the dataset from the UK Data Archive may not be easy to render into RDF and communicate - particularly when data is being mashed up.

Tim will look into this more and prepare a brief options note.


Meeting Notes: 12th October 2010

Present: Tim Davies, Anne Solon, Caroline Knowles, Inka Barnett

We discussed data comparisons and looked at how comparisons can be made across statistics with the same topic and subject but from different datasets: for example, displaying the statistics for smoking prevalence in a Young Lives dataset, with those in a WHO dataset for a country, or in other micro-data.

To do this we will need to provide rendering of those datasets (or at least, summary statistics) in our data store.

This means we need to:

Develop an interface that will display side-by-side any summary statistics that:

E.g.

@prefix yls: <http://data.younglives.org.uk/statistics/>.

yls:refAges  a rdf:Property, qb:DimensionProperty;
    rdfs:label "Reference Ages"@en;
    rdfs:subPropertyOf sdmx-dimension:refPeriod; # <-- Find correct SDMX way of representing age...
    rdfs:range interval:Interval;
    qb:concept sdmx-concept:refPeriod .

yls:refPeriod  a rdf:Property, qb:DimensionProperty;
    rdfs:label "reference period"@en;
    rdfs:subPropertyOf sdmx-dimension:refPeriod;
    rdfs:range interval:Interval;
    qb:concept sdmx-concept:refPeriod .

yls:refArea  a rdf:Property, qb:DimensionProperty;
    rdfs:label "Area statistic refers to"@en;
    rdfs:subPropertyOf sdmx-dimension:refArea;
    rdfs:range admingeo:UnitaryAuthority; #<---- Find correct Admin Geo...
    qb:concept sdmx-concept:refArea .

yls:smokingPrevalence a rdf:Property, qb:MeasureProperty;
	rdfs:label "Smoking Prevalence"@en;
	rdfs:subPropertyOf sdmx-measure:obsValue;
	rdfs:range xsd:decimal .
	#Need to also specific that measure is a percentage... perhaps express on scale of 0 - 1. 

 eg:dsd-le a qb:DataStructureDefinition;
      qb:component [qb:dimension eg:refArea;         qb:order 1];
      qb:component [qb:dimension eg:refAge; 	     qb:order 2];
      qb:component [qb:dimension eg:refPeriod; 	     qb:order 3];
      qb:component [qb:dimension sdmx-dimension:sex; qb:order 4];

      # The measure(s)
      qb:component [qb:measure yls:smokingPrevelance];
      # The attributes
      qb:component [qb:attribute sdmx-attribute:unitMeasure; qb:componentAttachment qb:DataSet;] .

yls:health-2010-statistics-smoking a qb:DataSet;
	qb:structure eg:dsd-le .

yl:smoking-m-15 a qb:Observation;
	    qb:dataSet yls:health-2010-statistics-smoking;
		yls:refArea                 admingeo:peru ;                  
		yls:refPeriod               <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
		yls:refAge 	"15";
		sdmx-dimension:sex         sdmx-code:sex-M ;
		yls:smokingPrevelance	"0.15".

yl:smoking-f-15 a qb:Observation;
	    qb:dataSet yls:health-2010-statistics-smoking;
		yls:refArea                 admingeo:peru ;                  
		yls:refPeriod               <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
		yls:refAge 	"15";
		sdmx-dimension:sex         sdmx-code:sex-F ;
		yls:smokingPrevelance	"0.16".


Navigation
Browse By
Languages
Toolbox