revise and expand docs about storage / taskdb / replica

Dustin J. Mitchell 2020-11-22 17:28:28 -05:00
parent ffbf272afc
commit 39a0dfe798
10 changed files with 274 additions and 155 deletions


@ -2,7 +2,9 @@
- [Installation](./installation.md)
- [Usage](./usage.md)
- [Internal Details](./internals.md)
- [Data Model](./data-model.md)
---
- [Development Notes](./development-notes.md)
- [Data Model](./data-model.md)
- [Replica Storage](./storage.md)
- [Task Database](./taskdb.md)
- [Synchronization](./sync.md)
- [Planned Functionality](./plans.md)


@ -1,42 +1,5 @@
# Data Model
A client manages a single offline instance of a single user's task list.
The data model is only seen from the clients' perspective.
## Task Database
The task database is composed of an un-ordered collection of tasks, each keyed by a UUID.
Each task in the database has an arbitrary-sized set of key/value properties, with string values.
Tasks are only created and modified; "deleted" tasks continue to stick around and can be modified and even un-deleted.
Tasks have an expiration time, after which they may be purged from the database.
## Task Fields
Each task can have any of the following fields.
Timestamps are stored as UNIX epoch timestamps, in the form of an integer expressed in decimal notation.
Note that it is possible for any field to be omitted.
NOTE: This structure is based on https://taskwarrior.org/docs/design/task.html, but will diverge from that
model over time.
* `status` - one of `Pending`, `Completed`, `Deleted`, `Recurring`, or `Waiting`
* `entry` (timestamp) - time that the task was created
* `description` - the one-line summary of the task
* `start` (timestamp) - if set, the task is active and this field gives the time the task was started
* `end` (timestamp) - the time at which the task was deleted or completed
* `due` (timestamp) - the time at which the task is due
* `until` (timestamp) - the time after which recurrent child tasks should not be created
* `wait` (timestamp) - the time before which this task is considered waiting and should not be shown
* `modified` (timestamp) - time that the task was last modified
* `scheduled` (timestamp) - time that the task is available to start
* `recur` - recurrence frequency
* `mask` - recurrence history
* `imask` - for children of recurring tasks, the index into the `mask` property on the parent
* `parent` - for children of recurring tasks, the uuid of the parent task
* `project` - the task's project (usually a short identifier)
* `priority` - the task's priority, one of `L`, `M`, or `H`.
* `depends` - a comma (`,`) separated list of uuids of tasks on which this task depends
* `tags` - a comma (`,`) separated list of tags for this task
* `annotation_<timestamp>` - an annotation for this task, with the timestamp as part of the key
* `udas` - user-defined attributes
A client manages a single offline instance of a single user's task list, called a replica.
This section covers the structure of that data.
Note that this data model is visible only on the client; the server does not have access to client data.


@ -1,93 +0,0 @@
Goals:
* Reasonable privacy: user's task details not visible on server
* Reliable concurrency - clients do not diverge
* Storage O(n) with n number of tasks
# Operations
Every change to the task database is captured as an operation.
Each operation has one of the forms
* `Create(uuid)`
* `Delete(uuid)`
* `Update(uuid, property, value, timestamp)`
The Create form creates a new task.
It is invalid to create a task that already exists.
Similarly, the Delete form deletes an existing task.
It is invalid to delete a task that does not exist.
The Update form updates the given property of the given task, where property and value are both strings.
Value can also be `None` to indicate deletion of a property.
It is invalid to update a task that does not exist.
The timestamp on updates serves as additional metadata and is used to resolve conflicts.
Operations act as deltas between database states.
## Versions and Synchronization
Occasionally, database states are named with an integer, called a version.
The system as a whole (server and clients) constructs a monotonic sequence of versions and the operations that separate each version from the next.
No gaps are allowed in the version numbering.
Version 0 is implicitly the empty database.
The server stores the operations for each version, and provides them as needed to clients.
Clients use this information to update their local task databases, and to generate new versions to send to the server.
Clients generate a new version to transmit changes made locally to the server.
The changes are represented as a sequence of operations with the final operation being tagged as the version.
In order to keep the gap-free monotonic numbering, the server will only accept a proposed version from a client if its number is one greater than the latest version on the server.
When this is not the case, the client must "rebase" the local changes onto the latest version from the server and try again.
This operation is performed using operational transformation (OT).
The result of this transformation is a sequence of operations based on the latest version, and a sequence of operations the client can apply to its local task database to "catch up" to the version on the server.
## Snapshots
As designed, storage required on the server would grow with time, as would the time required for new clients to update to the latest version.
As an optimization, the server also stores "snapshots" containing a full copy of the task database at a given version.
Based on configurable heuristics, it may delete older operations and snapshots, as long as enough data remains for active clients to synchronize and for new clients to initialize.
Since snapshots must be computed by clients, the server may "request" a snapshot when providing the latest version to a client.
This request comes with a number indicating how much it "wants" the snapshot.
Clients that can easily generate and transmit a snapshot should be generous to the server, while clients with more limited resources can wait until the server's requests are more desperate.
The intent is, where possible, to request snapshots created on well-connected desktop clients over mobile and low-power clients.
## Encryption and Signing
From the server's perspective, all data except for version numbers are opaque binary blobs.
Clients encrypt and sign these blobs using a symmetric key known only to the clients.
This secures the data at-rest on the server.
Note that privacy is not complete, as the server still has some information about users, including source and frequency of synchronization transactions and size of those transactions.
## Backups
In this design, the server is little more than an authenticated storage for encrypted blobs provided by the client.
To allow for failure or data loss on the server, clients are expected to cache these blobs locally for a short time (a week), along with a server-provided HMAC signature.
When data loss is detected -- such as when a client expects the server to have a version N or higher, and the server only has N-1 -- the client can send those blobs to the server.
The server can validate the HMAC and, if successful, add the blobs to its datastore.
## Expiration
TBD
.. conditions on flushing to allow consistent handling
# Implementation Notes
## Client / Server Protocol
TBD
.. using HTTP
.. user auth
.. user setup process
## Batching Operations
TBD
## Recurrence
TBD


@ -1,4 +0,0 @@
# Internal Details
This section describes some of the internal details of TaskChampion.
While this section is not required to use TaskChampion, understanding some of these details may help to understand how TaskChampion behaves.

docs/src/plans.md Normal file

@ -0,0 +1,35 @@
# Planned Functionality
This section is a bit of a to-do list for additional functionality to add to the synchronization system.
Each feature has some discussion of how it might be implemented.
## Snapshots
As designed, storage required on the server would grow with time, as would the time required for new clients to update to the latest version.
As an optimization, the server also stores "snapshots" containing a full copy of the task database at a given version.
Based on configurable heuristics, it may delete older operations and snapshots, as long as enough data remains for active clients to synchronize and for new clients to initialize.
Since snapshots must be computed by clients, the server may "request" a snapshot when providing the latest version to a client.
This request comes with a number indicating how much it "wants" the snapshot.
Clients that can easily generate and transmit a snapshot should be generous to the server, while clients with more limited resources can wait until the server's requests are more desperate.
The intent is, where possible, to request snapshots created on well-connected desktop clients over mobile and low-power clients.
## Encryption and Signing
From the server's perspective, all data except for version numbers are opaque binary blobs.
Clients encrypt and sign these blobs using a symmetric key known only to the clients.
This secures the data at-rest on the server.
Note that privacy is not complete, as the server still has some information about users, including source and frequency of synchronization transactions and size of those transactions.
## Backups
In this design, the server is little more than an authenticated storage for encrypted blobs provided by the client.
To allow for failure or data loss on the server, clients are expected to cache these blobs locally for a short time (a week), along with a server-provided HMAC signature.
When data loss is detected -- such as when a client expects the server to have a version N or higher, and the server only has N-1 -- the client can send those blobs to the server.
The server can validate the HMAC and, if successful, add the blobs to its datastore.
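The signature scheme is not pinned down here; as one possibility, server-side verification might look like the following sketch, using the `hmac` and `sha2` crates (all names are illustrative):
```rust
use hmac::{Hmac, Mac};
use sha2::Sha256;

type HmacSha256 = Hmac<Sha256>;

/// Check that a blob re-uploaded by a client carries a tag the server
/// itself produced earlier. A sketch only; key management is elided.
fn blob_is_authentic(server_key: &[u8], blob: &[u8], tag: &[u8]) -> bool {
    let mut mac = HmacSha256::new_from_slice(server_key)
        .expect("HMAC accepts keys of any size");
    mac.update(blob);
    mac.verify_slice(tag).is_ok()
}
```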
## Expiration
Deleted tasks remain in the task database, and are simply hidden in most views.
All tasks have an expiration time after which they may be flushed, preventing unbounded increase in task database size.
However, purging of a task does not satisfy the necessary OT guarantees, so some further formal design work is required before this is implemented.

docs/src/storage.md Normal file

@ -0,0 +1,73 @@
# Replica Storage
Each replica has a storage backend.
The interface for this backend is given in `crate::taskstorage::TaskStorage` and `TaskStorageTxn`.
The storage is transaction-protected, with the expectation of a serializable isolation level.
The storage contains the following information:
- `tasks`: a set of tasks, indexed by UUID
- `base_version`: the number of the last version sync'd from the server
- `operations`: all operations performed since base_version
- `working_set`: a mapping from integer -> UUID, used to keep stable small-integer indexes into the tasks for users' convenience. This data is not synchronized with the server and does not affect any consistency guarantees.
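As a rough sketch, the transaction interface might look like the following; the method names here are illustrative assumptions, not the crate's exact API:
```rust
use std::collections::HashMap;
use uuid::Uuid;

/// A task is simply a map of string properties (illustrative alias).
type TaskMap = HashMap<String, String>;

/// Hypothetical shape of a storage transaction; the real trait is
/// `crate::taskstorage::TaskStorageTxn` and may differ in detail.
trait TaskStorageTxn {
    fn get_task(&mut self, uuid: &Uuid) -> Option<TaskMap>;
    fn set_task(&mut self, uuid: Uuid, task: TaskMap);
    fn base_version(&mut self) -> u64;
    fn get_working_set(&mut self) -> Vec<Option<Uuid>>;
    /// Make this transaction's changes durable; dropping a transaction
    /// without committing rolls it back.
    fn commit(self: Box<Self>);
}

/// Hypothetical shape of the storage backend itself.
trait TaskStorage {
    fn txn(&mut self) -> Box<dyn TaskStorageTxn>;
}
```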
## Tasks
The tasks are stored as an un-ordered collection, keyed by task UUID.
Each task in the database has an arbitrary-sized set of key/value properties, with string values.
Tasks are only created and modified; "deleted" tasks continue to stick around and can be modified and even un-deleted.
Tasks have an expiration time, after which they may be purged from the database.
### Task Fields
Each task can have any of the following fields.
Timestamps are stored as UNIX epoch timestamps, in the form of an integer expressed in decimal notation.
Note that it is possible, in task storage, for any field to be omitted.
NOTE: This structure is based on https://taskwarrior.org/docs/design/task.html, but will diverge from that
model over time.
* `status` - one of `Pending`, `Completed`, `Deleted`, `Recurring`, or `Waiting`
* `entry` (timestamp) - time that the task was created
* `description` - the one-line summary of the task
* `start` (timestamp) - if set, the task is active and this field gives the time the task was started
* `end` (timestamp) - the time at which the task was deleted or completed
* `due` (timestamp) - the time at which the task is due
* `until` (timestamp) - the time after which recurrent child tasks should not be created
* `wait` (timestamp) - the time before which this task is considered waiting and should not be shown
* `modified` (timestamp) - time that the task was last modified
* `scheduled` (timestamp) - time that the task is available to start
* `recur` - recurrence frequency
* `mask` - recurrence history
* `imask` - for children of recurring tasks, the index into the `mask` property on the parent
* `parent` - for children of recurring tasks, the uuid of the parent task
* `project` - the task's project (usually a short identifier)
* `priority` - the task's priority, one of `L`, `M`, or `H`.
* `depends` - a comma (`,`) separated list of uuids of tasks on which this task depends
* `tags` - a comma (`,`) separated list of tags for this task
* `annotation_<timestamp>` - an annotation for this task, with the timestamp as part of the key
* `udas` - user-defined attributes
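For concreteness, a single task might appear in storage like this (UUID and values are invented for illustration):
```text
uuid (storage key): 3f80e660-6a0f-4a4c-b3a6-86e9e0f0e7e3
  status:      Pending
  description: pick up milk
  entry:       1606068508
  modified:    1606069010
  tags:        errand,shopping
```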
## Operations
Every change to the task database is captured as an operation.
In other words, operations act as deltas between database states.
Operations are crucial to synchronization of replicas, using a technique known as operational transformation.
Each operation has one of the forms
* `Create(uuid)`
* `Delete(uuid)`
* `Update(uuid, property, value, timestamp)`
The Create form creates a new task.
It is invalid to create a task that already exists.
Similarly, the Delete form deletes an existing task.
It is invalid to delete a task that does not exist.
The Update form updates the given property of the given task, where property and value are both strings.
Value can also be `None` to indicate deletion of a property.
It is invalid to update a task that does not exist.
The timestamp on updates serves as additional metadata and is used to resolve conflicts.
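In Rust terms, these forms might be modeled along these lines (a sketch; the crate's actual definition may differ in detail):
```rust
use chrono::{DateTime, Utc};
use uuid::Uuid;

/// The three operation forms, acting as deltas between database states.
pub enum Operation {
    /// Create a new, empty task. Invalid if the task already exists.
    Create { uuid: Uuid },
    /// Delete an existing task. Invalid if the task does not exist.
    Delete { uuid: Uuid },
    /// Set (`Some`) or remove (`None`) one property of an existing task.
    Update {
        uuid: Uuid,
        property: String,
        value: Option<String>,
        timestamp: DateTime<Utc>,
    },
}
```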

docs/src/sync.md Normal file

@ -0,0 +1,120 @@
# Synchronization
The [task database](./taskdb.md) also implements synchronization.
Synchronization occurs between disconnected replicas, mediated by a server.
The replicas never communicate directly with one another.
The server does not have access to the task data; it sees only opaque blobs of data with a small amount of metadata.
The synchronization process is a critical part of the task database's functionality, and the task database cannot function efficiently without occasional synchronization operations.
## Operational Transformations
Synchronization is based on [operational transformation](https://en.wikipedia.org/wiki/Operational_transformation).
This section will assume some familiarity with the concept.
## State and Operations
At a given time, the set of tasks in a replica's storage is the essential "state" of that replica.
All modifications to that state occur via operations, as defined in [Replica Storage](./storage.md).
We can draw a network, or graph, with the nodes representing states and the edges representing operations.
For example:
```text
o -- State: {abc-d123: 'get groceries', priority L}
|
| -- Operation: set abc-d123 priority to H
|
o -- State: {abc-d123: 'get groceries', priority H}
```
For those familiar with distributed version control systems, a state is analogous to a revision, while an operation is analogous to a commit.
Fundamentally, synchronization involves all replicas agreeing on a single, linear sequence of operations and the state that those operations create.
Since the replicas are not connected, each may have additional operations that have been applied locally, but which have not yet been agreed on.
The synchronization process uses operational transformation to "linearize" those operations.
This process is analogous (vaguely) to rebasing a sequence of Git commits.
### Versions
Occasionally, database states are named with an integer, called a version.
The system as a whole (all replicas) constructs a monotonic sequence of versions and the operations that separate each version from the next.
No gaps are allowed in the version numbering.
Version 0 is implicitly the empty database.
The server stores the operations to change a state from a version N to a version N+1, and provides that information as needed to replicas.
Replicas use this information to update their local task databases, and to generate new versions to send to the server.
Replicas generate a new version to transmit changes made locally to the server.
The changes are represented as a sequence of operations, with the state resulting from the final operation corresponding to the new version.
In order to keep the gap-free monotonic numbering, the server will only accept a proposed version from a replica if its number is one greater than the latest version on the server.
In the non-conflict case (such as with a single replica), then, a replica's synchronization process involves gathering up the operations it has accumulated since its last synchronization; bundling those operations into version N+1; and sending that version to the server.
### Transformation
When the latest version on the server contains operations that are not present in the replica, then the states have diverged.
For example (with lower-case letters designating operations):
```text
o -- version N
w|\a
o o
x| \b
o o
y| \c
o o -- replica's local state
z|
o -- version N+1
```
In this situation, the replica must "rebase" the local operations onto the latest version from the server and try again.
This process is performed using operational transformation (OT).
The result of this transformation is a sequence of operations based on the latest version, and a sequence of operations the replica can apply to its local task database to reach the same state.
Continuing the example above, the resulting operations are shown with `'`:
```text
o -- version N
w|\a
o o
x| \b
o o
y| \c
o o -- replica's intermediate local state
z| |w'
o-N+1 o
a'\ |x'
o o
b'\ |y'
o o
c'\|z'
o -- version N+2
```
The replica applies w' through z' locally, and sends a' through c' to the server as the operations to generate version N+2.
Either path through this graph, a-b-c-w'-x'-y'-z' or w-x-y-z-a'-b'-c', must generate *precisely* the same final state at version N+2.
Careful selection of the operations and of the transformation function ensures this.
See the comments in the source code for the details of how this transformation process is implemented.
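For orientation, here is a much-simplified sketch of such a transform function, reusing the `Operation` type sketched in [Replica Storage](./storage.md); it shows only the conflicting-update rule and is not the crate's implementation:
```rust
/// Transform op1 and op2, two operations applied to the same starting
/// state, into forms that can each be applied after the other.
/// `None` means the conflict resolution discards that operation.
/// Simplified sketch: Create/Delete conflicts are not handled here.
fn transform(op1: Operation, op2: Operation) -> (Option<Operation>, Option<Operation>) {
    use Operation::*;
    match (&op1, &op2) {
        // Two updates to the same property of the same task conflict:
        // let the later timestamp win and drop the other operation.
        (
            Update { uuid: u1, property: p1, timestamp: t1, .. },
            Update { uuid: u2, property: p2, timestamp: t2, .. },
        ) if u1 == u2 && p1 == p2 => {
            if t1 < t2 {
                (None, Some(op2))
            } else {
                (Some(op1), None)
            }
        }
        // Operations on different tasks or properties commute unchanged.
        _ => (Some(op1), Some(op2)),
    }
}
```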
## Replica Implementation
The replica's [storage](./storage.md) contains the current state in `tasks`, the as-yet un-synchronized operations in `operations`, and the last version at which synchronization occurred in `base_version`.
To perform a synchronization, the replica first requests any versions greater than `base_version` from the server, and rebases any local operations on top of those new versions, updating `base_version`.
If there are no un-synchronized local operations, the process is complete.
Otherwise, the replica creates a new version containing those local operations and uploads that to the server.
In most cases, this will succeed, but if another replica has created a new version in the interim, then the new version will conflict with that other replica's new version.
In this case, the process repeats.
The replica's un-synchronized operations are already reflected in `tasks`, so the following invariant holds:
> Applying `operations` to the set of tasks at `base_version` gives a set of tasks identical
> to `tasks`.
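A sketch of this loop, with assumed interfaces standing in for the real server and task-database types:
```rust
/// Illustrative interfaces only; the crate's real types differ.
struct Version { operations: Vec<Operation> }

enum AddResult { Accepted, VersionMismatch }

trait SyncServer {
    fn get_version(&mut self, number: u64) -> Option<Version>;
    fn add_version(&mut self, number: u64, ops: Vec<Operation>) -> AddResult;
}

trait Replica {
    fn base_version(&mut self) -> u64;
    /// Apply a server version, rebasing any local operations onto it
    /// and advancing base_version.
    fn apply_and_rebase(&mut self, version: Version);
    fn unsynced_operations(&mut self) -> Vec<Operation>;
    fn mark_synced(&mut self, new_base: u64);
}

fn sync(server: &mut dyn SyncServer, replica: &mut dyn Replica) {
    loop {
        // Catch up: fetch and rebase onto any versions past base_version.
        while let Some(v) = server.get_version(replica.base_version() + 1) {
            replica.apply_and_rebase(v);
        }
        let ops = replica.unsynced_operations();
        if ops.is_empty() {
            return; // nothing local to send; fully synchronized
        }
        let proposed = replica.base_version() + 1;
        match server.add_version(proposed, ops) {
            AddResult::Accepted => {
                replica.mark_synced(proposed);
                return;
            }
            // Another replica created this version first: loop, fetch
            // its operations, rebase, and try again.
            AddResult::VersionMismatch => continue,
        }
    }
}
```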
## Server Implementation
The server implementation is simple.
It supports fetching versions keyed by number, and adding a new version.
In adding a new version, the version number must be one greater than the greatest existing version.
Critically, the server operates on nothing more than numbered, opaque blobs of data.
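A toy in-memory version of such a server, illustrating the acceptance rule (a real server persists data and authenticates clients; all names here are invented):
```rust
use std::collections::HashMap;

/// Versions are numbered, opaque, client-encrypted blobs.
struct ToyServer {
    latest: u64,
    versions: HashMap<u64, Vec<u8>>,
}

impl ToyServer {
    fn get_version(&self, number: u64) -> Option<&[u8]> {
        self.versions.get(&number).map(|b| b.as_slice())
    }

    /// Accept a proposed version only if it is exactly latest + 1;
    /// otherwise the client must rebase and retry.
    fn add_version(&mut self, number: u64, blob: Vec<u8>) -> Result<(), u64> {
        if number != self.latest + 1 {
            return Err(self.latest);
        }
        self.versions.insert(number, blob);
        self.latest = number;
        Ok(())
    }
}
```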

docs/src/taskdb.md Normal file

@ -0,0 +1,28 @@
# Task Database
The task database is a layer of abstraction above the replica storage layer, responsible for maintaining some important invariants.
While the storage is pluggable, there is only one implementation of the task database.
## Reading Data
The task database provides read access to the data in the replica's storage through a variety of methods on the struct.
Each read operation is executed in a transaction, so data may not be consistent between read operations.
In practice, this is not an issue for TaskChampion's purposes.
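As a sketch, reusing the hypothetical `TaskStorage` trait and types from [Replica Storage](./storage.md), a read method might look like:
```rust
use uuid::Uuid;

struct TaskDb {
    storage: Box<dyn TaskStorage>,
}

impl TaskDb {
    /// Each read opens its own transaction, so two consecutive reads
    /// may observe different states. (Method names are assumptions.)
    fn get_task(&mut self, uuid: &Uuid) -> Option<TaskMap> {
        let mut txn = self.storage.txn();
        txn.get_task(uuid) // dropped without commit: a pure read
    }
}
```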
## Working Set
The task database maintains the working set.
The working set maps small integers to current tasks, for easy reference by command-line users.
This is done in such a way that the task numbers remain stable until the working set is rebuilt, at which point gaps in the numbering, such as for completed tasks, are removed by shifting all higher-numbered tasks downward.
The working set is not replicated, and is not considered a part of any consistency guarantees in the task database.
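A sketch of the renumbering step described above (the pending-ness test and function name are assumptions):
```rust
use uuid::Uuid;

/// Drop non-pending tasks from the working set and close the gaps,
/// shifting higher-numbered tasks downward. A simplification: a full
/// rebuild would also add pending tasks not yet in the set. Slot 0 is
/// left unused so that task numbers start at 1.
fn rebuild_working_set(
    old: &[Option<Uuid>],
    is_pending: impl Fn(&Uuid) -> bool,
) -> Vec<Option<Uuid>> {
    let mut new = vec![None]; // slot 0 is never used
    for slot in old.iter().skip(1) {
        if let Some(uuid) = slot {
            if is_pending(uuid) {
                new.push(Some(*uuid));
            }
        }
    }
    new
}
```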
## Modifying Data
Modifications to the data set are made by applying operations.
Operations are described in [Replica Storage](./storage.md).
Each operation is added to the list of operations in the storage, and simultaneously applied to the tasks in that storage.
Operations are checked for validity as they are applied.
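A sketch of those validity checks, reusing the `Operation` type sketched in [Replica Storage](./storage.md):
```rust
use std::collections::HashMap;
use uuid::Uuid;

type TaskMap = HashMap<String, String>;

/// Apply one operation to the task set, enforcing the validity rules:
/// create only new tasks, delete and update only existing ones.
fn apply(tasks: &mut HashMap<Uuid, TaskMap>, op: &Operation) -> Result<(), String> {
    match op {
        Operation::Create { uuid } => {
            if tasks.contains_key(uuid) {
                return Err(format!("task {} already exists", uuid));
            }
            tasks.insert(*uuid, TaskMap::new());
            Ok(())
        }
        Operation::Delete { uuid } => match tasks.remove(uuid) {
            Some(_) => Ok(()),
            None => Err(format!("task {} does not exist", uuid)),
        },
        Operation::Update { uuid, property, value, .. } => {
            let task = tasks
                .get_mut(uuid)
                .ok_or_else(|| format!("task {} does not exist", uuid))?;
            match value {
                Some(v) => task.insert(property.clone(), v.clone()),
                None => task.remove(property),
            };
            Ok(())
        }
    }
}
```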