This article describes news easy algorithm for data versioning in databases. It is similar to Log trigger, but it is more universal and reliable. If this method has already been invented, I have not found it in the public domain. How it all began?In one project, we were tasked with managing and distributing many directories, sometimes over unstable communication channels and with a chance to roll back time on servers. I'll be glad to share my experience in this area with you.For technical and political reasons, technical and political reasons we didn't use an existing software program. There are no analyses of existing systems in this article. The project was made using language C# and MS SQL Server. We used architecture "Master-Slave". One Master system contained a reference sample of directories, and enabled our client to edit the directories using the GUI or file import. Slave systems pulled the Master system for updates periodically. Data exchange was performed with WebAPI and JSON, because clients ranged from Windows Server to Android. There were about 10,000 Slave systems. They pulled the Master system about once per hour. It's not a resource-intensive process; nonetheless we also used file-caches and a system of patches to run the Master system on very weak servers. We had different directories:
A special kind of directory is the Registry (e.g. a registry of cash machines):
We developed a common method of storage and distribution for directories and registries. It's lightweight, and you can use it even if your data structures are simpler. ProblemsThe first question: what is version number? There are only a few variants: date and time of the directory's editing and an integer number (auto-increment). "DateTime" is a very popular variant, but it has some drawbacks:
Accordingly, we chose an integer number for the versioning process, as a more reliable variant. The second question: how to store several versions in a database with quick access and without data duplication? There are more variants:
SolutionA table with information about versions:
A table with data:
This structure can be used in both the master and the slave systems. Master and slave systems store a different number of directory versions. A record in a “Data” table belonged to several versions; that is why the method was called, "virtual versioning". Filling directoryWhen using this method, you must create a new “version” in the table "Versions" if you want to edit the directory. After that you can edit your "Data” tables:
For example:
The record "Kate" was created in version 1 and it exists in all other versions. The record "Lisa" was created in version 1 and it was deleted in version 2. (i.e. in version 2 this record doesn't exist anymore). The record "Tom" was created in version 1 and then it was edited in version 3 (in ver. 3 it is "female" already). One record in the “data” table belonged to several versions: from “addInVersion” (inclusively) to “deleteInVersion-1”. The unique identifier for every record in the data table is “ID + addInVersion” (composite key). If a new version of a directory is created over a long period (of time) with a lot of editing of the same record, then the algorithm becomes as follows:
You can always safely remove the version with status "editing": simply delete all records with “addInVersion = N” and set “deleteInVersion” as “NULL” for all records, where “deleteInVersion” set as “N”. The drawback of this method is that the history of interim changes isn’t saved, but you can use an additional journal of user actions instead. The advantage of the method is that interim changes won't be delivered to the slave systems. Several related tablesIf your directory consists of several tables and you want shared versioning for them, add the fields “addInVersion” and “deleteInVersion” to all of the “data” tables and then designate a new table as the controlling "Versions" table. Example: Table Users:
Table Sex:
In this example, two tables were edited in version “3”. Data consistency won't be broken on thea slave system side. IMPORTANT: the field “SexID” in the table “Sex” isn't unique - it's not “Primary Key”. In this case the “Primary Key” is a composite: “SexID + addInVersion”. In the table “Users”, the field “SexID” isn't the “Foreign Key”, (i.e. DBMS doesn't control referential integrity for data with versioning). It's impermissible to use the composite foreign key “SexID + addInVersion” in the table “Users”: if a record in the table “Sex” is edited, the composite foreign key in the table “Users” will continue to relate to the record from an older / previous version. A surrogate key isn't the solution either. When creating new version, your software must control referential integrity. The process of adding and editing records remains the same as described above: when editing link-fields, it's necessary to check that the existing related record contains “deleteInVersion = NULL” (i.e. confirm that it was not deleted from the directory earlier). Deleting records: when a record is marked as deleted (i.e. setting “deleteInVersion” as "Version number"), it's necessary to check other records which could refer to this record. An example with an error: Table Users:
Table Sex:
Version 4 was created; the record containing "other" was deleted. The DBMS won’t return an error because the field “SexID” continues to relate to “SexID=3”, but your program will crash. An example without error:Table Users:
Table Sex:
The record "Tom" was deleted together with the record "other". Optional method: the record "Tom" could be edited in version 4 by assigning a new value to the field “SexID”. But apparently Tom did not have time to decide. :) If you have several independent tables in your project and you want to use versioning for them, you can add several tables with version numbers, each of which is related to one or several data tables. There are pitfalls with related tables: We had some problems with related tables: for example, we decided to edit the table Sex, and this process takes a long time, as it requires coordination with many people. Create Version 4; set “Status” as "approving"; edit the table “Sex”; wait for completion. Suddenly comes a command to edit table “Users”. Version 4 was / has been created already; just make an edit in the table “Users” (“addInVersion = 4”). But slave systems can't receive these changes until full version 4 is approved, and that can take a very long time. Solutions:
Cleaning of Outdated VersionsOutdated versions can be quickly deleted from the master and slave databases.
You can delete only the oldest version (i.e. it is strongly discouraged to delete version 2, if version 1 exists). The retention period of versions can be different for the master and slave systems. In our project the master system stored directories for about 5 years; the slave systems, for about a year. You can also store only the latest version on the slave system, if necessary. Selecting Data from DirectoryAll requests are quite simple and don't require complex calculations. First, select the version number you want to work on from the table "Versions". It can depend on the current date and time, or it can be set to force, according to your specific business process. Search records in a directory with version X: SELECT Name, Sex FROM Users WHERE Name = 'Kate' AND (addInVersion <= X OR
Request all records from a directory with version X: SELECT Name, Sex FROM
Request all records from a directory with version X or earlier: SELECT * FROM
You can use this request for a slave system initialization: If you need to create a diff-file for versions between X and Y (Y = X + 1), then make two requests: 1) Request all records which must be added: SELECT
2) Request all records which must be marked as deleted: SELECT * FROM
Now you can save this data in a file (for example, JSON). Data Transfer from Master to Slave SystemsYou can use any method for data transfer. We used the following:
That's all, folks! Thank you for your attention! 2015.01.06, Andrey Abramov |
Articles >