In this article we lay out the mechanisms we used internally to accelerate our database engineering. These tricks are relevant for anyone looking to import data tables easily and successfully.
Date fields: we identify two fields throughout the system as the important date-tracking fields in most data objects, "ts_Creation" and "ts_Change". ts_Creation is normally set to auto-fill at row insertion time through our engine, and ts_Change is normally assigned an ON UPDATE trigger directly in the database to insert the current timestamp (perhaps a MySQL/MariaDB specificity).
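A minimal sketch of the convention in MariaDB terms (the table itself is hypothetical):

    CREATE TABLE IF NOT EXISTS article (
        article_ID  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        title_en    VARCHAR(255),
        -- Normally auto-filled by our engine at row insertion time:
        ts_Creation TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
        -- Refreshed by the database itself on every update of the row:
        ts_Change   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
    );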
ID numeric fields: typically we name such fields tablename_ID, and they should be a bigint or wider integer type, with the auto_increment attribute set (in MariaDB parlance) or a sequence assigned in the table creation script (for the rest of the world). At import time, our engine should take care of generating the necessary sequence if it detects such an ID field. We recommend (and enforce internally at table creation time) unsigned integer types. This can always be overridden if you want to get eccentric with your data structure and use negative ID values. ;) Our engine doesn't implement any trickery based on negative IDs, though we never thought to program them into the defense mechanisms either; we generally expect positive IDs.
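Roughly, the two equivalent forms look like this (the table name is hypothetical; the sequence variant is written in MariaDB 10.3+ syntax, other engines have their own nextval-style defaults):

    -- MariaDB-style auto-incremented ID field:
    CREATE TABLE IF NOT EXISTS customer (
        customer_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name_en     VARCHAR(255)
    );

    -- Sequence-based equivalent "for the rest of the world":
    CREATE SEQUENCE IF NOT EXISTS customer_ID_seq;
    CREATE TABLE IF NOT EXISTS customer (
        customer_ID BIGINT UNSIGNED NOT NULL DEFAULT NEXTVAL(customer_ID_seq) PRIMARY KEY,
        name_en     VARCHAR(255)
    );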
When importing tables with numeric auto_increment fields, if the imported data structure also includes the table creation statement with the current auto_increment value, or the sequences with their time-of-export values, so much the easier. Otherwise, the engine should auto-detect the highest value at the end of an import and adjust the related sequence or auto_increment setting accordingly.
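When the export doesn't carry that counter, the adjustment amounts to something like the following (table and column names are hypothetical):

    -- Find the highest imported ID; suppose this returns 1042.
    SELECT MAX(customer_ID) FROM customer;

    -- Bump the counter just past it (MariaDB syntax):
    ALTER TABLE customer AUTO_INCREMENT = 1043;

    -- Or, on a sequence-based setup:
    ALTER SEQUENCE customer_ID_seq RESTART WITH 1043;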
Language-specific fields: in our system we process language fields in what we believe is the most efficient method for web-based content; after years of experimenting with different structures, the proposed mechanism is the most convenient and the easiest to import no matter the original structure. The system works by naming fields with an ISO two-character language code, such as fieldname_en, fieldname_fr, fieldname_es and so on. This lets our engine handle the different language aspects related to the same item efficiently, and present them in a "sane" manner to the content administrator. The first real benefit of this methodology lies in the new Edit view renderings we're currently tying in with the Content-Tools library, which let you toggle between languages (as on an Internet site) and which we plan to integrate with an auto-translator tool. The second benefit is the clarity of the language code at the database-manipulation level, which quickly allows programmers, and our engine, to recognize the appropriate language and stay selective when building queries for optimization purposes.
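A sketch of the naming scheme and of the selectivity it buys in queries (table and field names are invented for the example):

    -- Hypothetical content table carrying per-language copies of the same field:
    CREATE TABLE IF NOT EXISTS page (
        page_ID  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        title_en VARCHAR(255),
        title_fr VARCHAR(255),
        title_es VARCHAR(255)
    );

    -- A French rendering only needs to pull the _fr columns:
    SELECT page_ID, title_fr FROM page WHERE page_ID = 42;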
Encrypted, ciphered and hashed fields in the data sets: we've stuck with a small internal convention of suffixing our crypto-centric fields in the same way as the language and ID fields. There lies a little rub for data imports, however, concerning encrypted and hashed field contents: invariably, a decipher-and-recipher step is needed to adapt to the new hosted key set.
For quick imports, though, we suggest using the naming convention so that the engine can auto-detect and tune the storage facilities on the servers. For long-term storage of data, we implement a private-key encryption routine (the same key is used to encrypt and decrypt) on fields suffixed with _crypted, for example email_crypted. The routine is implemented using Sodium crypto boxes for maximum long-term security, with a different IV for each field. For password hashes (if the passwords were used through the local UX engine), we normally use the field name Password_hash; the hashing routine we use internally for passwords differs from the generic fieldname_hash routine, which is a cryptographically seeded hash normally built from a referenced field*. The password hashing mechanism, on the other hand, is based on Argon2 internally and doesn't rely on a field reference; the two usages differ greatly as well. We do not currently implement password verification from user-imported data, as this presents too many security issues.

You might export tables containing _crypted fields to set up a new system or the like, but the keys should and must differ for security purposes, and thus this data cannot simply be exported and re-imported. The engine should instead decipher what it can (passwords and hashes being an exception) when the user does a data export. The exported file can then be ciphered in its entirety for safe-keeping or transport. It is also not possible to mix hashing and ciphering measures on the same field (for obvious technical reasons). Hashing fields can be useful for building quick hash indexes that are made public afterwards as keying material; one needs to be careful to hash only immutable data fields in that case.
During the import process your data should be deciphered, and detected hash fields will be reset to NULL by default.
* If using a cryptographic hash, you must configure the referenced field by hand after a fresh table import. This field reference cannot be reverse-engineered from existing data, and hash values generated outside of the local system should be discarded and re-computed. (A feature we'll have to tie in during our import development.)
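To illustrate the suffix convention only (the column types here are assumptions, not a prescription; ciphered material is simply whatever the Sodium routine produces):

    -- Hypothetical user table showing the crypto-centric suffixes:
    CREATE TABLE IF NOT EXISTS user_account (
        user_account_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name_en         VARCHAR(255),
        -- Symmetric-key ciphered content (Sodium); deciphered on export where possible:
        email_crypted   VARBINARY(512),
        -- Argon2 digest produced by the local UX engine; reset rather than re-imported:
        Password_hash   VARCHAR(255),
        -- Cryptographically seeded hash built from a referenced field (see the note above):
        email_hash      CHAR(64)
    );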
By far the best import format is a proper SQL script that includes the table, indexes, foreign keys, primary keys and sequences (if you use them). Furthermore, we recommend writing the script so that your tables can be recreated in an idempotent fashion (without erasing them!), for example with queries such as "CREATE TABLE IF NOT EXISTS".
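A minimal sketch of such a script, with names that are purely illustrative:

    CREATE TABLE IF NOT EXISTS order_line (
        order_line_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        order_ID      BIGINT UNSIGNED NOT NULL,
        label_en      VARCHAR(255),
        ts_Creation   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
        ts_Change     TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
        INDEX idx_order_line_order (order_ID),
        CONSTRAINT fk_order_line_order FOREIGN KEY (order_ID)
            REFERENCES customer (customer_ID)
    );

Because the creation is guarded by IF NOT EXISTS, re-running the script leaves an existing table untouched, which is exactly the behaviour the import engine relies on.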
Our engine should be smart enough not to bark too loudly if you're attempting to re-import the same table. By default it should skip the data import upon discovering that the table already exists. If you wish to merge data, you'll have to say so explicitly by using the Merge Data override in the data import interfaces. Caution should be exercised if you've carried out important modifications to your existing data set, such as renaming fields or changing their purposes.
Our import interface should allow the import of more than one file at a time. To benefit from this feature, we recommend creating one file per data table, with the table creation script at its head. The engine should then be able to grant the data developer a reset switch that reloads the table from the cached import file (for a limited time after the initial import).
Our engine is built around a Bot Garden which allows application and data administrators to schedule regular data imports from external sources. We're still working out the finer details of how to manage security during these imports, but for the moment it looks like we'll be providing Bots that can import records from recognized REST API sources such as PayPal, Amazon, and so on. It is possible to configure a Bot against a self-expanding table (one that can expand its field set as new fields are encountered), but such tables can be dangerous on many levels (exponential growth following a mistaken data taxonomy, for instance).
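To give an idea of what a self-expanding table implies under the hood, the Bot would issue something along these lines when it meets a field it hasn't seen before (names are hypothetical):

    -- MariaDB lets the new column be added idempotently as it appears in the feed:
    ALTER TABLE paypal_transactions
        ADD COLUMN IF NOT EXISTS dispute_reason_en VARCHAR(255);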
We're also considering a wide range of data storage facilities from the outset, with RRD databases that can be fed automatically, Redis, Elastic Cache and memory-based databases for streaming inputs, as well as our own internal streaming packet chute engine from which Bots can consume data in real time.
Using the features found under the table administration interfaces, you can easily set up a data table for REST or JSON connectivity. This allows you to script or configure your custom applications to feed or read data from the current system. Typically, such interfaces should provide a limited field view (which can be administered in the same interfaces, on a per-field level) and be restricted in some form to your local network environment if you consider the data "sensitive".
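One way to picture the limited field view (purely illustrative; in practice the restriction is administered per field in the interfaces rather than hand-written) is a view that exposes only the permitted columns to the REST/JSON layer:

    -- Illustrative only: a reduced column set for external consumption,
    -- keeping the _crypted and _hash fields out of the picture.
    CREATE OR REPLACE VIEW user_account_api AS
        SELECT user_account_ID, name_en
        FROM user_account;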
In our SaaS offerings the engine should take care of protecting users against sensitive data leaks by making it extra difficult to map encrypted fields. (Hint, hint!)