Data processor

This is a backend service that ingests customer data, converts it into valid import data for the DB, and inserts that data into the DB.

Terms

  • DataProcessor The working name of this component of the HView software stack.
  • Modules A general name for code written to handle different types of data (e.g. a hydrastor system report is handled by the hydrastor system report module).
  • Object Store The current upload storage target for large files. Currently, the object store is an AWS S3 compliant open source product called Zenko Cloud Server.

Architecture

The module is currently built with two primary ingest mechanisms in mind, plus a REST API for job management.

  1. Email ingest: Some data is only receivable via email.
  2. File upload: Other data can be uploaded to the platform for ingest, typically for data sets that are too large to be emailed. This is made automatable by using a combination of cloud technologies (namely, S3 & REST).
  3. REST API: This is used to track & manage processing jobs as they occur in the DataProcessor.

Email

The email module is currently built around Postfix. An email is sent to the data processor server, is collected by the Postfix service, and is forwarded directly to the DataProcessor software for processing.

Data Flow

EMAIL --> [1] POSTFIX --> [2] DSC-MAILHANDLER --> [3] RELEVANT DSC PARSING MODULE --> [4] DB INSERT

An email is sent to the server from an automated system or any other email source.

  1. The email is received by Postfix and immediately pushed into the mailhandler.
  2. The mailhandler converts the attachments and mail body into local files, categorizes the mail, and then triggers the appropriate parsing module.
  3. Each email type must have a parsing script written in order to retrieve the relevant data.
  4. The data is then inserted into the DB or actioned according to its needs.
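
The classify-and-dispatch step above could be sketched roughly as follows. This is an illustrative assumption only: the real mailhandler's module names, subject-matching rules, and function signatures are not documented here.

```python
# Illustrative sketch only: the subject-to-module mapping and the
# function shape are assumptions, not the actual mailhandler code.
import email
from email import policy

SUBJECT_TO_MODULE = {
    "system report": "hydrastor_system_report",
    "alert": "hydrastor_alert",
}

def classify(raw_bytes: bytes):
    """Parse a raw message; return (module_name, [(filename, payload), ...])."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    subject = (msg["Subject"] or "").lower()
    module = next(
        (mod for key, mod in SUBJECT_TO_MODULE.items() if key in subject),
        None,
    )
    if module is None:
        # Matches the documented behaviour: mail with no relevant
        # subject is dropped.
        raise ValueError("no relevant subject found; mail dropped")
    attachments = [
        (part.get_filename() or "unnamed", part.get_payload(decode=True))
        for part in msg.walk()
        if part.get_content_disposition() == "attachment"
    ]
    return module, attachments
```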

File Upload

The file upload has two primary components: an object store and an API (as this is managed behind the frontend).

The object store is currently built with an open source AWS S3 compliant package called Zenko Cloud Server. The API is a home-grown REST compliant API, which handles and triggers jobs upon upload completion or at user request.

The S3 object store is hosted as a separate docker container, and is maintained primarily by the software vendor (Zenko) themselves. (Please see the Hview-Build project for further details)

Data Flow

FILE UPLOAD INITIATED --> [1] AWS S3 OBJECT STORE --> [2] API INFORMED --> [3] RELEVANT DSC PARSING MODULE --> [4] DB INSERT

An upload of a large (compressed) file is initiated.

  1. An AWS-compliant object push to the relevant S3 bucket is processed.
  2. The DataProcessor's REST API is informed that the upload is complete.
  3. The API initiates the relevant parsing module.
  4. The data is inserted into the DB or actioned appropriately.

REST API

The REST API is built to initiate jobs on uploaded files, as well as to manage jobs that have occurred on the platform (from any source).

REST API GUIDE - Documentation needs to be built.

Directory Structure

The app has been built to run as a docker container and will be deployed accordingly.

When on an HVIEW host, the following command can be used to gain access to the DataProcessor container.

docker exec -it hview.data bash

Application Files

The root path of the application within the container is /app/.

- api/                  The REST API module is contained here (this is where the API is defined)
- mailhandler/          The mail handling module is contained here (this is what is called by Postfix)
- shared_modules/       These are generic, globally shared modules for the entire project (code reuse)
- modules/              These are the specific components that are written to handle each data source.
- tools/                These are useful tools when working in the container
- config.json           This is used to configure global settings to the application.
- main.py               This is the main entry point of the API application, and is run by the docker container.
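
As an illustration, config.json might be loaded along these lines. The keys shown (archive_enabled, log_dir) and the defaults are hypothetical, since the actual config schema is not documented in this README.

```python
# Hypothetical sketch of loading global settings from config.json.
# The keys and defaults here are illustrative assumptions only.
import json
from pathlib import Path

DEFAULTS = {"archive_enabled": True, "log_dir": "/var/log/hview"}

def load_config(path: str = "/app/config.json") -> dict:
    """Return DEFAULTS overlaid with any settings found in config.json."""
    cfg = dict(DEFAULTS)
    p = Path(path)
    if p.exists():
        cfg.update(json.loads(p.read_text()))
    return cfg
```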

Logging

All logs are pushed to /var/log/hview/dataprocessor_*

Data files PATH PREFIX

The customer data is contained under the path /home/mailbox/ due to the requirement of a non-root user for Postfix. All other data is handled/processed there for consistency. These are the 3 PATH PREFIXES found in the DataProcessor:

  • /home/mailbox/inbox - This is where all files are saved/downloaded to before being classified and copied to ./tmp for processing. *FILES HERE SHOULD NEVER BE MODIFIED*
  • /home/mailbox/tmp - This is where object store data as well as email data is processed (and upserted into the DB if required). *Check the DB prefix for its equivalent path.*
  • /home/mailbox/archive - This is where a copy of the original source data is stored, if required (configurable), for future reference. Mounted to the local filesystem, which can be redirected to slow/archive storage.

Data path structure

Within each PATH PREFIX, there is a sub path that is designed to create as unique a path as possible, whilst still being standardized. The full data path will take the format <PATH-PREFIX>/<customer>/<category>/<type>/<group>/<datetime>/*

  • customer - The highest-level grouping is at the customer level (for customer data separation) (i.e. bobs_burgers)
  • category - Category will typically be the product name (i.e. hydrastor)
  • type - This refers to the module that will parse the data (i.e. system_report, alert, veritas2hview...)
  • group - Group is a sub-category, to further differentiate sources (to avoid conflicts) (i.e. a group for the type alert will be nodename OR a group for the type system_report will be gridname)
  • datetime - This value is autogenerated on receiving the data. This is the receive time, not the send time.
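
The path scheme above can be expressed as a small helper. The function name and signature are illustrative, not the DataProcessor's actual API; only the prefix constants and path layout come from this document.

```python
# Sketch of building the standardized data path described above.
# The three PATH PREFIXES and the path layout come from this README;
# the helper itself is illustrative.
from datetime import datetime
from pathlib import Path

PREFIXES = ("inbox", "tmp", "archive")

def data_path(prefix: str, customer: str, category: str,
              type_: str, group: str, received: datetime) -> Path:
    """<PATH-PREFIX>/<customer>/<category>/<type>/<group>/<datetime>"""
    if prefix not in PREFIXES:
        raise ValueError(f"unknown prefix: {prefix}")
    stamp = received.strftime("%Y%m%d_%H%M%S")  # receive time, not send time
    return Path("/home/mailbox", prefix, customer, category, type_, group, stamp)
```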

Example

A customer "Bobs Burgers" has a hydrastor that sends a daily system_report at 12:00 on the 13th of Jan 2023 via email, for their grid icecream_machine. The attachments/email body will be saved in the following locations, before being processed.

Inbox

/home/mailbox/inbox/bobs_burgers/hydrastor/system_report/icecream_machine/20230113_120000/*

Temp (Processing Area)

/home/mailbox/tmp/bobs_burgers/hydrastor/system_report/icecream_machine/20230113_120000/*

Archive

/home/mailbox/archive/bobs_burgers/hydrastor/system_report/icecream_machine/20230113_120000/*

Modules

Hydrastor - System Report (Email)

This module handles emailed hydrastor system reports.

Hydrastor - Alert (Email)

This module handles the emailed hydrastor alerts.

Hydrastor - Veritas2Hview (Upload)

This module handles the uploaded Veritas export data for correlation to the hydrastor grids.

Example usage (CLI)

STEP 1 Upload a file to the veritas2hview bucket. The example uses the AWS CLI package for demonstration purposes.

aws s3 cp <dump>.tar.gz s3://veritas2hview/<customer>/<grid>/<dump>.tar.gz --endpoint-url=https://<my.hosted.hview.com>:25025
STEP 2 Trigger the parsing via the DataProcessor's REST API. uid is a unique customer uid (from hview IAM).
curl --location --request POST 'http://<object.hview.com>:25025/api/1.0/veritas2hview/update' \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--header 'X-Api-Key: my-super-secret-api-key' \
--data-raw '{
  "uid" : "<user-uid>",
  "customer" : "<customer-name>",
  "grid" : "<grid-name>",
  "s3_path" : "<path-in-object-store-bucket>"
}'
Please note that the path-in-object-store-bucket in the request body does not need the bucket name in it, e.g. "s3_path" : "bobs_burgers/ice_cream_machine/big_ol_dump.tar.gz"
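
The same trigger call can also be built from Python with only the standard library. The helper below mirrors the curl example above; the host, port, API key, and field values remain placeholders, and the helper itself is a sketch, not part of the DataProcessor.

```python
# Sketch of the REST trigger call from Python (stdlib only), mirroring
# the curl example in this README. All values are placeholders.
import json
from urllib import request

def build_trigger(host: str, api_key: str, uid: str, customer: str,
                  grid: str, s3_path: str) -> request.Request:
    """Build the POST that triggers veritas2hview parsing."""
    body = json.dumps({
        "uid": uid,
        "customer": customer,
        "grid": grid,
        "s3_path": s3_path,  # path inside the bucket, no bucket name
    }).encode()
    return request.Request(
        f"http://{host}:25025/api/1.0/veritas2hview/update",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "X-Api-Key": api_key,
        },
    )

# Sending it is then a network call:
# with request.urlopen(build_trigger(...)) as resp:
#     print(resp.status)
```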

Notes

Object Store

Any AWS S3 compliant method can be used to push files to the object store. See the S3 CLI section in the Object Store README.

REST API

Any method of connecting to REST is supported. A simple test of the API can be carried out using curl on the docker host, as in the example below.

curl --location --request GET 'http://localhost:25025/api/1.0/jobs' --header 'X-Api-Key: some-clever-api-key'
Change some-clever-api-key to the correct API key for your installation.

Caveats & Limitations

Email handler

  1. [SECURITY] Any source sender is allowed. Mail is dropped if no relevant subject is found.

File Upload

  1. [LIMITATION] Currently only .tar.gz and .zip files are supported. A feature enhancement has been added to the next sprint so that customers will be able to upload from the frontend.

CURRENT SPRINT

Enhancements

  1. [x] GENERAL-FILE_MANAGEMENT - Need to standardise and implement file moving as part of the process. (Going to be implemented in the job handler)
  2. [x] EMAIL-HANDLER-JOBS - Implement the jobs module into the email handler
  3. [x] JOBS-SYSTEM_REPORT - Need to implement detailed job handling
  4. [x] JOBS-ALERT - Need to implement detailed job handling
  5. [x] API-DOWNLOAD - Update API download to use inbox, not tmp
  6. [x] API-JOBS-GETALL - Add a default last-24-hours-only flag [WILL NOT BE IMPLEMENTED UNLESS REQUIRED]
  7. [x] JOBS-TIMEOUT - Add a job to error out jobs over a set timeout duration (will add an additional column into the DB for job timeouts)

Known Issues

  1. [x] HYDRASTOR-SYSTEM_REPORT-UPSERT - PostgreSQL does not get sufficient privileges to upsert the files - TEST
  2. [x] HYDRASTOR-SYSTEM_REPORT - Need to handle each parse use case
  3. [x] HYDRASTOR-SYSTEM_REPORT - File extraction needs to move to the master caller

Changelog

Versioning

The container image versioning is structured as follows:

docker-image:[RELEASE VERSION].[MAJOR VERSION].[MINOR VERSION]

e.g. devops.iohub.cloud:5000/iohub/hview/dataprocessor:1.6.1

Definitions
  • MINOR VERSION - Changes that do NOT impact any other code in the codebase (e.g. certain new features, or bugfixes where the edited code is not reused anywhere else)
  • MAJOR VERSION - Features/bug fixes where the existing codebase has been modified or will interact with that change (e.g. fixing bugs that required work on any existing shared code, or new features that impact/change the way other functions/code/services operate)
  • RELEASE - Typically reserved for dramatic changes to the way the customer interfaces with the platform, the platform's aesthetic, or API changes.
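
The scheme above can be sketched as a small parser, assuming tags always have exactly three numeric parts; the function name is illustrative and not part of any tooling described here.

```python
# Sketch: split an image reference into its RELEASE.MAJOR.MINOR parts,
# assuming the tag always has exactly three numeric components.
def parse_version(image_ref: str) -> tuple:
    """'…/dataprocessor:1.6.1' -> (release=1, major=6, minor=1)."""
    tag = image_ref.rsplit(":", 1)[-1]  # text after the last colon
    release, major, minor = (int(part) for part in tag.split("."))
    return release, major, minor
```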

Changelog File