Basic usage¶
This section details a comprehensive example to illustrate usage of OpenTabulate.
For example’s sake, let’s say you are interested in tabulating hospital location data, where the data of interest is latitude-longitude coordinates, hospital name and regional health authority. The data collected generally does not share the same schema (data attributes) but is structured enough to extract this information. How do we go about using OpenTabulate to achieve our goal of organizing and tabulating this data?
First, an output schema is formalized. Say that we want to output CSVs to have the column names
name, health_authority, longitude, latitude
We must specify this in the OpenTabulate configuration, which will be made use of in a source file. Source files will be discussed in a bit, for now simply take each of them to be metadata and configuration for a specific dataset.
The OpenTabulate configuration file is ~/.config/opentabulate.conf
, which should have be in place after following the installation instructions. Its structure is simple, consisting of sections and key-value pairs.
[section]
key = value
key = value
...
[section]
...
As provided, there should be exactly two sections, [general]
and [labels]
. The labels section is where the output schema is specified. The key is the group name, which only serves for readability. The value is a Python tuple of strings. Based on the desired output schema, we edit the configuration to be
[labels]
hospital = ('name', 'health_authority')
location = ('longitude', 'latitude')
The next, and probably most involved step, is to write a source file for each dataset to process. Let’s say we have two datasets hospitals-bc.csv
HOSPITAL_NAME,RG_NAME,X,Y
Burnaby Hospital,Fraser Health Authority,-123.016837,49.248776
Fort St. John Hospital,Northern Health Authority,-120.817122,56.256892
"UBC Hospital, Purdy Pavilion",Vancouver Coastal Health,-123.245945,49.263388
and hospitals-ab.xml
<?xml version="1.0" encoding="UTF-8"?>
<HospitalData>
<Hospital id="1">
<Name>Canmore General Hospital</Name>
<City>Calgary</City>
<Geocoordinates>
<Longitude>-115.35172</Longitude>
<Latitude>51.092455</Latitude>
</Geocoordinates>
</Hospital>
<Hospital id="2">
<Name>Alberta Children's Hospital</Name>
<City>Calgary</City>
<Geocoordinates>
<Longitude></Longitude>
<Latitude></Latitude>
</Geocoordinates>
</Hospital>
</HospitalData>
First we want to place this data where OpenTabulate will read them. They will be located in the directory:
$root_directory/data/input
where $root_directory
refers to the path in the root_directory
variable in the configuration file. In the example provided in the installation section, $root_directory
is /home/bob/opentabulate
.
To direct OpenTabulate on how to tabulate this data into our schema, we need to write a source file for each dataset. A source file is written in JSON format, and is usually very brief. Minimal examples for our example datasets are shown below.
For hospitals-bc.csv
, our source file is named src-hospitals-bc.json
and contains
{
"localfile": "hospitals-bc.csv",
"schema_groups": ["hospital", "location"],
"format": {
"type": "csv",
"delimiter": ",",
"quote": "\""
},
"schema": {
"hospital" : {
"name" : "HOSPITAL_NAME",
"health_authority": "RG_NAME"
},
"location" : {
"longitude": "X",
"latitude": "Y"
}
}
}
and for hospitals-ab.xml
we have src-hospitals-ab.json
which contains
{
"localfile": "hospitals-ab.xml",
"schema_groups": ["hospital", "location"],
"format": {
"type": "xml",
"header": "Hospital"
},
"schema": {
"name": "Name",
"location" : {
"longitude": "X",
"latitude": "Y"
}
}
}
Each JSON key has a specific meaning in the source file:
localfile
specifies the filename of the dataset in the input directoryschema_groups
refer to which group names (those configured in the[labels]
section ofopentabulate.conf
) are allowed to be usedformat
detail the input file format parametersschema
describes the schema mapping that the tabulation will use
In particular, the contents schema``of the form ``"output_column" : "input_attribute"
tell OpenTabulate how to precisely map the input data to the output. The use of group names in schema
are for organizational purposes.
Where should source files be stored? They can be stored anywhere, but by our convention we organize them here:
$root_directory/sources
Now we are ready to process! Simply run
$ opentab src-hospitals-ab.json src-hospitals-bc.json
Replace the source file names with paths to them as needed. The resulting output files should be
$root_directory/data/output/hospitals-bc.csv
$root_directory/data/output/hospitals-ab.csv
with output (up to permutation of the columns)
name,health_authority,longitude,latitude
Burnaby Hospital,Fraser Health Authority,-123.016837,49.248776
Fort St. John Hospital,Northern Health Authority,-120.817122,56.256892
"UBC Hospital, Purdy Pavilion",Vancouver Coastal Health,-123.245945,49.263388
and
name,health_authority,longitude,latitude
Canmore General Hospital,,-115.35172,51.092455
Alberta Children's Hospital,,,
respectively.