Source files

A fundamental component of OpenTabulate is the source file: a metadata and configuration file in JSON format that directs how OpenTabulate should process a specific dataset. The contents include the input data format, the mapping of input data attributes to the output schema, the filename of the dataset stored on disk, and so on.

Source files follow a specific format which OpenTabulate parses and validates. Below is an example source file for British Columbia hospital data:

{
  "localfile": "bc-hospitals.csv",
  "schema_groups": ["health", "geocoordinates", "address"],
  "source": "https://catalogue.data.gov.bc.ca/dataset/hospitals-in-bc",
  "licence": "https://www2.gov.bc.ca/gov/content/data/open-data/open-government-licence-bc",
  "provider": "Province of British Columbia",
  "encoding": "cp1252",
  "format": {
      "type": "csv",
      "delimiter": ",",
      "quote": "\""
  },
  "schema": {
      "name" : "SV_NAME",
      "health_authority": "RG_NAME",
      "address_str": ["STREET_NUMBER", "CITY", "PROVINCE", "POSTAL_CODE"],
      "geocoordinates" : {
          "longitude": "geo_lon",
          "latitude": "geo_lat"
      }
  }
}

For brevity, we refer to

  • the JSON keys that appear in (the top level of) a pair of enclosing curly braces as a layer
  • a JSON key-value pair as a tag

In the above example, the keys found in the first layer are localfile, schema_groups, source, licence, provider, encoding, format and schema. Besides schema, most of these keys specify metadata and data reading parameters. In more detail:

  • localfile is the filename of a dataset stored in the input directory (to be processed)
  • schema_groups is a list of group names specifying which column names in the output schema are allowed to be used; the group names are defined in opentabulate.conf
  • source, licence and provider are metadata for organizational purposes; the provider value is stored in a column in the output format
  • encoding refers to the input data’s character encoding
  • format provides parameters to parse the input data
  • schema gives the input data attribute mapping to the output schema

How to read the tag documentation

The documented source file tags are organized to match the structure of the example above. The tag information is structured as

Key : JSON Type : Requirements
Description…
  • Key : Key name.
  • JSON Type : The supported JSON type for the key’s value.
  • Requirements : Specifies if the tag must be included and any dependencies it has, if any.
  • Description : A brief description of what the key represents.

Metadata and configuration tags

The tags presented refer to the first layer of the source file, i.e. they must appear in the outermost pair of curly braces {...}.

localfile : string : Required.
The filename of the input data to process, stored in $root_directory/data/input.
schema_groups : string, list of strings : Required.
Groups referring to output column names this source file is allowed to use. The group names are specified in the [labels] section of the configuration file opentabulate.conf.
format : object : Required.
Input data format specification.
encoding : string : Optional.
Dataset character encoding, which currently can only be “utf-8” or “cp1252”. If not specified, the encoding is guessed from these options.
schema : object : Required.
The mapping of input data attributes to output columns. This describes how the attributes of the input dataset should be tabulated. The output column names must appear in the group names provided in the schema_groups tag.
filter : object : Optional.
Regular expression filter rules for choosing which entries to process. filter contains keys which are the input data attributes to filter on. The value of each key is a regular expression string. For more details, see Filtering with regular expressions.
provider : string : Optional.
Data provider name to auto-fill the column provider created during tabulation.
source : string : Optional.
Data source string. Our convention is that this is a URL. This is included purely for organizational purposes.
licence : string : Optional.
Data source licence string. Our convention is that this is a URL. This is included purely for organizational purposes.
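
For reference, below is a minimal sketch of a source file containing only the required first-layer tags. The filename example-data.csv and the input attribute FULL_ADDRESS are hypothetical; the address group and the address_str column follow the British Columbia example above.

{
  "localfile": "example-data.csv",
  "schema_groups": ["address"],
  "format": {
      "type": "csv",
      "delimiter": ",",
      "quote": "\""
  },
  "schema": {
      "address_str": "FULL_ADDRESS"
  }
}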

Format tags

The format tag is a JSON object specifying the input data format, telling OpenTabulate how to parse the input data.

type : string : Required.
The dataset format. Currently supports the values “csv” and “xml”.
header : string : Required if "type": "xml".
XML tag name identifying a single data point. Each occurrence of this tag should contain the information referenced in the schema.
delimiter : string : Required if "type": "csv", must be one character.
The delimiting character in the raw CSV dataset.
quote : string : Required if "type": "csv", must be one character.
The quote character in the raw CSV dataset.
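
For example, a hypothetical XML dataset in which each data point is wrapped in a <record> element would use a format tag along the lines of this sketch:

"format": {
    "type": "xml",
    "header": "record"
}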

Schema tags

The tag schema is defined as a JSON object. It is special in the sense that its contents are largely customized by the user. The formatting of the contents of the schema tag is important and follows strict rules. First, recall that the output schema is specified in the [labels] section of the configuration file opentabulate.conf. We might have something like

...
[labels]
geocoordinates = ('longitude', 'latitude')
facility = ('name',)

Each tag in schema must be either a data attribute mapping tag

# example: "name": "Facility_Name"
"output_column" : "input_attribute"

or a group tag, whose key is a group name and whose value is a JSON object containing column names from that group

# example:
# "geocoordinates" : {
#     "longitude" : "LON",
#     "latitude" : "LAT"
# }
"group_name" : {
    "output_column" : "input_attribute",
    ...
}

Group tags are not necessary and do not change OpenTabulate’s output; they are primarily there for readability.
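
For example, under the [labels] configuration above, the following two schema tags are sketches that should produce the same output (the input attribute names Facility_Name, LON and LAT are hypothetical):

# with a group tag
"schema": {
    "name": "Facility_Name",
    "geocoordinates": {
        "longitude": "LON",
        "latitude": "LAT"
    }
}

# without a group tag
"schema": {
    "name": "Facility_Name",
    "longitude": "LON",
    "latitude": "LAT"
}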

Additional features

Extracting information from data to be tabulated sometimes requires methods beyond simply mapping data attributes, such as regular expression filtering and entry concatenation. To support such methods, OpenTabulate adds special tags and interprets specific syntax in key values. Below is a complete list of these features.

Concatenating entries

When writing schema tags, all data attribute mapping tags support having JSON lists of strings as a value instead of just a string. This has the effect of concatenating data entries. For example, if we have

"col": ["attr_1", "attr_2", "attr_3"],

OpenTabulate interprets this as “for each entity in the data, concatenate (separating by spaces) the contents of the attributes attr_1, attr_2, and attr_3 (in that order), and then proceed to process this and output to the column col”.
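
To illustrate, suppose a hypothetical entity has the entries shown below; the concatenated value is then written to col:

# hypothetical input entries:
#   attr_1 = "100", attr_2 = "Main St", attr_3 = "Victoria"
# tabulated value written to col:
#   "100 Main St Victoria"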

Manually inject or fill data

Any data attribute mapping tag in schema whose string value contains the prefix force: is interpreted differently. Instead of searching for a data attribute in the input, the substring after the prefix is injected into the data entry for the corresponding output column. This can be used as a single-value data filler. For example, if we have

"a" : "force:b"

then OpenTabulate’s output will have a column named a filled with the character b. One can do something more complicated in combination with entry concatenation:

"address" : ["STREET_ADDRESS", "CITY", "force:Canada"]

This has the effect of taking every entity’s street address and city name, concatenating them and appending Canada at the end of the concatenated string. The string then undergoes further processing and is output to the address column.

Filtering with regular expressions

As seen earlier, source files support a filter tag in the first layer. filter contains tags whose values are strings forming regular expressions. Each key is an input data attribute to apply the regular expression to. A data entity is kept if and only if, for every key k, the regular expression matches the entity’s data entry for the attribute k.
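
For example, assuming the input data has a hypothetical PROVINCE attribute, the following filter sketch keeps only entities whose province entry matches BC or British Columbia:

"filter": {
    "PROVINCE": "BC|British Columbia"
}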

Debugging syntax errors

OpenTabulate checks the syntax of your source file and will warn you if something is off, but this validation does not cover all situations. Moreover, it cannot catch semantic errors; those surface only during processing or when you inspect the output.

Try your best to adhere to the standards and conventions as shown in the documentation. If you discover any bugs or have any further questions, please post an issue in our GitHub repository.