Small, but with transaction isolation: writing Jepsen tests for Antietcd
2026-01-22
Since version 1.7.0, Vitastor has a built-in etcd replacement — Antietcd.
It’s implemented in Node.js and is very simple: just a couple thousand lines of code. It doesn’t implement every feature of etcd, but it’s absolutely sufficient for a fully functional Vitastor cluster: all the essential features are present, and in some ways it’s even better than etcd. For example, Antietcd makes it possible to avoid storing “temporary” data on disk.
However, until recently, there was no answer to the question: can it really be used in production? Does it work correctly?
Below is the story of the search for an answer. A story with a happy ending :)
Table of contents
- Jepsen
- TinyRaft
- Antietcd
- Running Jepsen
- Test logic
- Transaction serializability
- Operation execution errors
- Generator freezes
- wget cache
- First errors
- Watcher tests
- G1a Aborted Read
- Transaction isolation
- Analysis of another anomaly
- Summary
- Links
Jepsen
A good way to check correctness is chaos testing using the Jepsen framework.
What is it?
Once upon a time a guy named Kyle Kingsbury (nicknamed Aphyr) decided that it would be fun to break distributed DBMSs and see if they were really as cool as their creators claimed. Do they really not lose data if processes crash, network flaps, disks fail, and the system clock fluctuates?
And he wrote Jepsen. It’s a framework that you feed a database, a client implementation, a random operation generator, a checker, and a breaker (the “Nemesis”). All of this runs in a controlled environment (usually on virtual machines), performs random operations, and Jepsen monitors their execution and checks for inconsistencies.
Jepsen began a long time ago; the first commit to the Jepsen repository dates back to 2013, and it was already used for testing in 2015. I was still writing PHP back then :D, and I was first told about it around 2018. Since then, Aphyr pops up about every six months and humiliates database after database, usually showing that everything is bad. During this time, Jepsen has become a practically universally accepted tool for testing distributed databases, and Aphyr has become a serious guy, registered a company, and, it seems, even stopped posting naughty pictures on Twitter. :-)
There are built-in generators, checkers, and nemeses, plus you can add your own. The “nasty things” are called phenomena (“Doctor, I’m a phenomenon, my balls are ringing — you’re not a phenomenon, you’re a phony”), or anomalies. They’re conveniently described on his website, and there’s a whole bunch of them.
The very first checkers were simple and checked only one transaction isolation level, the strictest one, but formulated only in terms of a single object — linearizability. Then the author churned out some scientific papers, hooked up with a few other savvy folks, and together they wrote a smarter checker — Elle — that can pinpoint all anomalies and tell you which isolation level you actually implemented — serializable or maybe… unserializable… It also provides text descriptions of anomalies and even draws dependency graphs. The terminology of dependencies (write-write, read-write, write-read, process, realtime) is also described on his website; I especially liked the “traSNACKtion” pun.
Well, Jepsen is great at everything except one thing: it’s written in Clojure, and for a long time I couldn’t bring myself to try this perversion. Clojure is a “modern Lisp dialect for the JVM”, that is, a functional programming language with lots of parentheses. Having tried writing tests in it, I can say it’s not all that bad; it’s quite expressive.
So, the decision is final: we really do need Jepsen and Clojure, and we move on to Antietcd… no, first to TinyRaft.
TinyRaft
Antietcd actually started out largely as a model for testing TinyRaft, which is Raft without log replication, with only leader election left in place.
I created TinyRaft in an attempt to simplify Raft even more.
Actually, I always thought log replication in Raft was a bit of a heavy solution. Why replicate the log? It makes you write every change twice — first to the log, and then to the database itself. Moreover, the log must be written to disk, and all changes must run through the log, otherwise the algorithm’s correctness guarantees are broken. That’s why Raft databases are usually designed only for small objects.
At the same time, situations where the entire database is copied from the leader to another node by restoring a full dump are still quite common in Raft. Therefore, Raft databases are generally intended only for small data sets. In etcd, for example, the default database size limit is only 2 gigabytes, and it can be raised to a maximum of 8 gigabytes.
But even with these limitations, Raft logs are far from lightweight, especially considering that typically 10-100 thousand of the most recent log entries are persistently stored for the algorithm to function properly. I especially remember a shitty built-in Raft implementation in OpenNebula, which stored logs as records in a MySQL table (up to 100000 entries). Log handling was constantly slow, resulting in Raft often running with 2 leaders out of 3 nodes or 0 leaders out of 3 nodes. Don’t use Raft in OpenNebula; install something like Galera instead and live in peace.
Another real-life example is etcd itself, which once consumed 23 GB of RAM in my single-node (!) test cluster. That happened precisely because it couldn’t keep up with Raft log compaction, so the logs accumulated almost endlessly. But even when etcd has no problems with compaction, it still eats up about 6 GB of RAM with its default settings, even though Vitastor stores only a couple of megabytes of data in it! So, if you pay attention to the etcd options in Vitastor’s make-etcd script, you may notice --snapshot-count 10000, which means “take a snapshot after 10000 committed transactions” (instead of the default 100000). This reduces etcd’s memory consumption to about 1.5-2 gigabytes.
Libraries with ready-made Raft implementations (there are a lot of them) have also always seemed heavy to me. Often, in an effort to provide the most out-of-the-box experience, they give you almost a complete etcd implementation, with a pre-built non-replaceable network layer and a finished log storage implementation based on some embedded K/V database engine. The log storage can usually be replaced, but there’s no benefit to doing so, since the semantics don’t allow for any significant deviations from the default logic. All this machinery naturally takes at least 10000 lines of code; in real libraries, it’s closer to 20000 or more. At least they let you plug in your own database (called a “state machine” — yes, that’s what it is).
So an idea came to my mind at some point: maybe we should just remove log replication from Raft?
That’s how TinyRaft came to be. It’s literally 300 lines long, can be trivially rewritten in any programming language, and solves exactly one problem: proper leader election. It knows nothing about the network; you just feed it messages via function calls. It knows nothing about synchronization either, so you can use any synchronization algorithm you want — the one from standard Raft with logs, or another one; I immediately came up with a couple while writing the README for it.
I think it’s actually pretty awesome, because it’s exactly what something like Patroni (a Postgres clustering tool) needs — leader election without replication. Replication is left to Postgres itself.
Antietcd
So, how does replication work in Antietcd if it’s based on TinyRaft? It’s very simple:
- Replication is synchronous. Changes are made only on the leader, which sends them via websocket to its replicas (it knows the list of them from TinyRaft), and only then confirms the successful write to the client. If replication fails, the leader simply initiates reelection in TinyRaft.
- When a leader is successfully elected, a simple initial synchronization occurs: the new leader requests full database dumps from its replicas, selects dumps with the maximum Raft term, merges them into a single database, loads it as the reference, and copies it back to all replicas. Then the leader begins accepting write/read requests.
In essence, this is the simplest replication algorithm possible.
The entire database is stored in memory and written to disk (and fsynced!) in one piece, as a single JSON file. This is perfectly acceptable for use in Vitastor, as the actual data size in Vitastor’s etcd rarely exceeds a couple of megabytes, and the only frequent changes are “temporary” keys containing statistics-type data that don’t need to be stored on disk at all.
Moreover, Antietcd, like any other Node.js application, is single-threaded. So there are no atomicity issues — logically, all changes are applied synchronously and in memory. The etcd optimistic transaction API (txn with compare, success, failure) is very simple to implement.
Antietcd is quite nicely divided into modules, of which there are only a few:
- etctree.js — an implementation of an etcd-like in-memory database.
- antipersistence.js — persistence (on-disk data storage).
- anticluster.js — replication and synchronization.
- antietcd.js — the main module that glues it all together.
Running Jepsen
I could just go the copy-paste route: take Jepsen tests for etcd implemented by Aphyr himself, replace etcd with antietcd and enjoy.
However, Jepsen has a step-by-step tutorial, and to better understand what’s going on, it’s best to go through it first and repeat all the steps before starting to borrow the etcd tests. There’s also a Clojure tutorial, but I skipped it, even though I had never coded in Clojure, Lisp, or any other functional programming language before. I just figured it out as I went.
The difference between the tutorial and the real etcd tests is that the tutorial was apparently written back in the days of etcd 2.x, and it uses the author’s own client library called Verschlimmbesserung, a scary name about slums or oil (Schlumberger or something).
Well, it doesn’t matter to us; we’re replacing all of this with plain HTTP requests anyway, which we’ll make through http-kit, because Jepsen itself already uses it, so that’s one less dependency.
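For illustration, here’s roughly what such a request looks like from Clojure. This is only a sketch, not the actual test code: it assumes cheshire for JSON, uses a placeholder URL and port, and follows the same relaxed JSON format for /v3/kv/txn that you’ll see in the Antietcd logs later in this post (real etcd additionally base64-encodes keys and values in its JSON API):

(require '[org.httpkit.client :as http]
         '[cheshire.core :as json])

;; A CAS append expressed as an etcd-style transaction: if key "4" has not been
;; modified since revision 46, write the new value; otherwise just read it back.
(defn cas-txn! [node]
  (let [body {:compare [{:key "4" :target "MOD" :result "LESS" :mod_revision 46}]
              :success [{:request_put {:key "4" :value [3]}}]
              :failure [{:request_range {:key "4"}}]}
        resp @(http/post (str "http://" node ":2379/v3/kv/txn")
                         {:headers {"Content-Type" "application/json"}
                          :body    (json/generate-string body)})]
    (json/parse-string (:body resp) true)))

;; => {:header {:revision 47}, :succeeded true, :responses [{:response_put {}}]}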
Jepsen requires five (by default) SSH-accessible virtual machines, named simply n1-n5. I set up these machines locally using plain qemu — it’s trivial: download the official Debian netinst image, install it in one virtual machine, then create five disk clones using qemu-img, and launch five virtual machines using a script like this:
#!/bin/bash
# Enable NAT so the VMs (10.0.2.0/24) can reach the outside world via wlan0
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -C POSTROUTING -o wlan0 -s 10.0.2.0/24 -j MASQUERADE || \
    iptables -t nat -A POSTROUTING -o wlan0 -s 10.0.2.0/24 -j MASQUERADE
# Bridge that will connect the VM tap interfaces
brctl addbr br0
# Launch 5 VMs, each with its own tap interface and a unique MAC address
for i in {1..5}; do
    TAP=tap$((i-1))
    sudo -E kvm -m 2048 \
        -drive file=debian13_n$i.qcow2,if=virtio \
        -cpu host \
        -netdev tap,ifname=$TAP,script=no,id=n0 \
        -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:5$i &
done
sleep 1
# Bring up the bridge and attach the tap interfaces created by qemu
ip l set br0 up
for i in {1..5}; do
    TAP=tap$((i-1))
    ip l set $TAP up
    brctl addif br0 $TAP
done
# Give the host an address on the bridge and allow forwarding for VM traffic
ip a a 10.0.2.2/24 dev br0
iptables -C FORWARD -i br0 -j ACCEPT || \
    iptables -I FORWARD 1 -i br0 -j ACCEPT
iptables -C FORWARD -o br0 -m state --state RELATED,ESTABLISHED -j ACCEPT || \
    iptables -I FORWARD 1 -o br0 -m state --state RELATED,ESTABLISHED -j ACCEPT
# dnsmasq provides DHCP/DNS for the VMs
service dnsmasq start
# Stay in the foreground until all VMs exit
wait
Well, if you’re not as red-eyed as me, you can repeat the same procedure with some virtualization UI.
As of January 2026, Jepsen wasn’t compatible with Debian 13 and tried to install a couple of packages that weren’t there anymore, so I had to patch it slightly: I removed libzip4 from the list of installed packages and replaced ntpdate with ntpsec-ntpdate. After that, I ran lein install — the jar was built and installed in the usual local Java Maven dumpster with -SNAPSHOT version, making it usable as a dependency for our tests.
Test logic
Now the interesting part finally begins — the logic of how the tests work.
The tutorial suggests implementing two tests:
- register. A register test with random reads, writes, and CAS updates. A write simply writes a random value from 1 to 5 to the key, a CAS write attempts to update it from one random value to another, and a read simply reads the key. The results are validated by Knossos, the first Jepsen linearizability checker.
- set. A test for set completeness. Clients only append records to a list treated as a set, and then, at the very end, a check is made to ensure the set is complete, meaning that no records were lost in the process.
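The operation generators for these two tests are tiny; here is a sketch in the spirit of the tutorial (the exact helpers and value ranges may differ from the real test code):

;; register test: random reads, writes and CAS updates of a single key
(defn r   [_ _] {:type :invoke, :f :read,  :value nil})
(defn w   [_ _] {:type :invoke, :f :write, :value (inc (rand-int 5))})
(defn cas [_ _] {:type :invoke, :f :cas,   :value [(inc (rand-int 5)) (inc (rand-int 5))]})

;; set test: clients keep adding unique elements; the final read checks
;; that none of them were lost
(defn adds [] (map (fn [x] {:type :invoke, :f :add, :value x}) (range)))

These then get mixed and throttled with the usual gen/mix, gen/stagger and gen/time-limit combinators.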
The same two tests are also a part of the Jepsen etcd test suite. However, it also includes more powerful ones:
- append (Elle list-append). The most powerful test, based on the “smart” Elle checker, which can catch anomalies and determine isolation levels by building a dependency graph between transactions. Elle both generates transactions and checks the results; you only need to interpret and execute them (an example operation is shown right after this list). The transactions themselves contain one or more operations on keys, of only two types: reading the entire list and appending an element to the end of the list. Not to the middle — the order of operations is verified by checking the element’s position!
- wr (Elle rw-register). This is also an Elle-based test, but a simpler one — it’s similar to the register test, but also generates and verifies transactions over multiple keys. However, it’s weaker than the list test, since lists let you observe the entire sequence of writes, not just the last value.
- watch. Tests the etcd watch API. It modifies keys and verifies that all watchers receive these changes correctly. Antietcd also has watchers, so we’ll need such a test as well.
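To make the append test more concrete, here’s the shape of an Elle list-append operation — the same format that will show up later in the anomaly reports (the values here are made up for illustration):

;; What Elle asks the client to execute (reads are nil before execution):
{:type :invoke, :f :txn, :value [[:append 4 19] [:r 4 nil]]}

;; What the client returns after executing it as a single transaction:
{:type :ok, :f :txn, :value [[:append 4 19] [:r 4 [3 19]]]}

;; The wr (rw-register) test looks the same, but with [:w k v] writes and
;; reads of single values instead of appends and list reads.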
Transaction serializability
The transaction isolation level checked in the etcd tests is strict serializable, which is also the level claimed by the etcd authors themselves. It can be downgraded to plain serializable (not strict) by adding the query parameter ?serializable=true. The parameter name is dumb, because it makes it sound like it enables serializability that isn’t there by default, while in fact the default strict serializable is serializable as well. A more accurate name for the parameter would be stale=true or something similar, because it simply allows reading potentially stale data from a replica’s local database without talking to the leader. This gives a slight performance improvement at the cost of consistency.
Antietcd is a direct replacement for etcd. Furthermore, it’s single-threaded and works with in-memory data, so there’s no reason to use other isolation levels. We’ll also use strict serializable, or just serializable if the stale_read option is enabled in the Antietcd configuration, following the same principle as etcd. Stale reads are actually only possible when the network goes down and the replica doesn’t yet realize it’s no longer receiving replication. In Antietcd stale_read is enabled by default; if you disable it, the replica makes a trip to the leader before each read, checking that it’s still available.
Operation execution errors
The first important question I ran into while writing tests was what the client should do with unsuccessful operations. Should it retry them, return an error (an operation’s result can be :ok or :fail), or do something else?
Correct answer:
- An error (:fail) should only be returned if the operation is definitely unsuccessful. That is, if the client knows that the operation could not be applied to the database. For example, if a CAS update failed due to a mismatch in the original value.
- If the client is unsure of the write result — that is, if the operation may or may not have been applied, for example, if the request timed out — an exception should be thrown. This lets Jepsen know that the client crashed and the operation’s outcome is unknown.
- In some tests, however, failed requests have to be retried indefinitely. For example, that’s the main point of the set test: it checks that the set is complete, which means that every addition has to succeed in any case.
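In Clojure this boils down to an invoke! that looks roughly like the sketch below (the record, the helpers read-kv!, write-kv! and cas-kv!, and the key name are hypothetical, not the actual test code):

(require '[jepsen.client :as client])

(defrecord HttpClient [node]
  client/Client
  (open!  [this test node] (assoc this :node node))
  (setup! [this test] this)
  (invoke! [this test op]
    (case (:f op)
      :read  (assoc op :type :ok, :value (read-kv! node "reg"))
      :write (do (write-kv! node "reg" (:value op))
                 (assoc op :type :ok))
      :cas   (let [[old-v new-v] (:value op)]
               (if (:succeeded (cas-kv! node "reg" old-v new-v))
                 (assoc op :type :ok)
                 ;; definitely not applied, so :fail is safe to report
                 (assoc op :type :fail, :error :cas-mismatch)))))
  (teardown! [this test] this)
  (close! [this test] this))

;; Timeouts and connection errors are simply allowed to throw: Jepsen then
;; records the operation as :info, i.e. "outcome unknown".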
Generator freezes
Usually you can also repeat reads, since they don’t change the database. However, it’s not always a good idea, as it can cause the operation generator to hang.
This was the next problem I encountered while writing tests.
And here’s why this happens. The logical generator design for the test is a sequential generator (gen/phases from the jepsen.generator library) with two phases:
- The first phase is a combined generator for clients and Nemesis (gen/nemesis) wrapped in a time limit (gen/time-limit).
- The second phase is cluster recovery — the final generator from nemesis, which undoes all cluster failures, and then probably a gen/sleep, simply waiting for recovery.
However, gen/phases waits for all operations from the previous phase to complete (via synchronize) before moving to the next phase. And some operations can never complete, because each operation is tied to a specific cluster node, and some nodes have been broken by the nemesis.
The result is a freeze… It’s actually quite easy to fix — just put gen/phases inside gen/nemesis, not the other way around. Then you can test operations with endless retries.
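A rough sketch of the two shapes (generator names are placeholders, not the actual test code). The first variant hangs because gen/phases can’t leave phase one while clients keep retrying operations against broken nodes; the second phases only the nemesis, so clients never have to “finish” anything:

;; Hangs: gen/phases wraps both the clients and the nemesis
(gen/phases
  (->> client-ops
       (gen/nemesis nemesis-faults)
       (gen/time-limit 60))
  (gen/nemesis nemesis-heal)
  (gen/sleep 10))

;; Works: gen/phases only inside the nemesis part
(gen/nemesis
  (gen/phases (gen/time-limit 60 nemesis-faults)
              nemesis-heal
              (gen/sleep 10))
  (gen/time-limit 75 client-ops))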
wget cache
The next funny moment was when I made some changes, reran the test, and it ran again with the old version of the code. The old version was cached somewhere.
But where? It turns out install-archive! from jepsen.control.util caches downloaded files via cached-wget, and the cache is located on the nodes in the /tmp/jepsen directory. So you can clear the cache like this:
for i in {1..5}; do ssh root@n$i 'rm -rf /tmp/jepsen'; done
First errors
The first tests I wrote (register, set) revealed almost no bugs. That is, all the problems I encountered with them were test problems, not Antietcd bugs. There were a few minor issues:
- Antietcd returned HTTP 200 on request forwarding errors instead of an error status;
- Sometimes Antietcd tried to ping websockets that were not connected yet and crashed with an exception;
- If Antietcd was passed a non-string key, queries failed with an exception because of an attempt to work with a non-string value (such as a number) as a string.
But these, of course, were trivially corrected and the tests started working.
Next, I moved on to porting Elle’s append test. Surprisingly, with a parallelism level of 100, it also passed. It even worked with stale_read and the isolation level changed to plain serializable. It took a bit of time, of course: I observed a lot of rather crazy “found” anomalies, but they were caused by write retries, by adding an element into the middle of the list, by skipping some keys in a read transaction… Elle dutifully tried to turn all of this into anomaly reports like, “Hey, dude, your database didn’t return the created key, it returned nil, which means the transaction executed before the previous one…” But in reality, these were all test bugs.
Then I ported the Elle rw-register test in the same way. It passed too; as it turned out later, the only reason the real problems weren’t caught here was low parallelism: with a parallelism of 200 they would have shown up. But I moved on to watcher tests before trying 200 threads.
Watcher tests
Okay, let’s move on to watcher tests. I replaced the etcd client with hato websockets, adapted the rest, and tried running the test.
It failed. Why? The reason was that the test verified that all watchers receive the same sequence of events, even during constant reconnections. In etcd this works, because etcd stores the entire change history and sends the missed part of it to reconnecting watchers. But in Antietcd it doesn’t work, because Antietcd doesn’t store the history.
Who needs the history anyway? It’s not Kafka! It’s hard to imagine a use case for etcd that relies on the immutability of historical events. Vitastor definitely doesn’t need the full history — and probably neither does Kubernetes (etcd’s most famous user). Both only need final changes to be delivered at least at some point.
Moreover, even in etcd the delivery of the full history doesn’t always work: it doesn’t store the history forever, and if you try to start watching from a revision older than the last compaction, you get a message with canceled: true and compact_revision filled in.
So we need to check that all watchers receive a correct subsequence of database states. The client may not receive every change individually, but what it does receive should lead it to some intermediate, consistent database state.
Well, let’s modify the test to use new verification logic. This isn’t super easy, by the way: the correct approach is to take all events received by all clients, extract the individual key changes, and reassemble them into a new sequence of events, each containing the changes for only one revision. Using the results of the executed write requests to build the reference sequence would be incorrect, because some of them time out, and then it’s unclear whether they were applied or not.
The reference sequence can then be compared with the events seen by each client: we reassemble them into full database states at each revision number and check whether what the client received matches what actually happened.
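A condensed sketch of the reference-building step (the event shape here is simplified to one record per key change; the real test also has to deal with reconnects and batched notifications):

;; Assume every extracted key change looks like {:revision 47, :key "4", :value [3]}.
;; Merge changes from all clients into a reference map revision -> {key value},
;; making sure no two clients saw different values for the same key at the
;; same revision.
(defn build-reference [changes]
  (reduce (fn [reference {rev :revision, k :key, v :value}]
            (let [seen (get-in reference [rev k] ::none)]
              (assert (or (= seen ::none) (= seen v))
                      (str "conflicting values for key " k " at revision " rev))
              (assoc-in reference [rev k] v)))
          (sorted-map)
          changes))

;; Each client's own event stream is then replayed on top of this reference:
;; fold it into successive database states and check that every state the
;; client went through matches some state from the reference sequence.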
G1a Aborted Read
I ran the fixed watcher test… And finally, I found the real problem! I was pretty much aware of it during the initial Antietcd development, but I didn’t know if I should treat it as important.
The problem manifests itself in a simple way: some watchers see changes which don’t exist. In the test, only one key changes in each revision and some watchers see it correctly, but some others see two keys change in the same revision.
Essentially, this is equivalent to the G1a Aborted Read anomaly in Jepsen’s terminology (from Adya’s thesis). G1a is when a reader sees changes made by someone else’s aborted transaction.
In Antietcd, this could happen during a cluster failure if a change was only replicated to a node that immediately crashed and fell out of the quorum. The remaining nodes would then re-elect a leader and continue working without the changed key. The same read could even happen on the leader when it saved the change to its own database, but then fell out of the quorum just like a follower. From a write guarantee perspective, this is normal — the change hasn’t yet been confirmed to the user and can be lost. But from a reader’s perspective…
By the way, it could even happen in a cluster consisting of a single node: if the change was applied in memory but failed to be flushed to disk, and then Antietcd restarted.
Watchers are so good at catching this because they receive changes very quickly: as soon as Antietcd applies a change, it immediately sends notifications to watchers. But the issue should also have been reproducible in regular read/write tests; it just required a parallelism of at least 200.
Okay, but is G1a in Antietcd bad for Vitastor? Most likely, yes: it could potentially lead to some components getting stuck in an incorrect state, or maybe even to incorrect updates. The revision number in the aborted change is also incremented, and if we read a key, see an aborted version, and perform a CAS transaction based on that read, we may overwrite someone else’s change to the same key made in the meantime.
Transaction isolation
And here comes the most interesting part: how to fix this Aborted Read?
For some reason, the first idea that came to my mind was: maybe we should store two copies of the database, one “clean” for readers and one “dirty” for writers? This would be something like “read committed”. CAS would work correctly — client transactions based on the old version wouldn’t go through, but that’s okay — they would wait, retry, and everything would be fine. Hmm, but what if a transaction both writes and reads? Which copy should it work with? Probably the “dirty” one, but then it would potentially see aborted reads again.
Maybe we shouldn’t apply the change to the database at all until it’s replicated to the other nodes? No, that’s not correct either: in that case, transactions writing to the database wouldn’t see the new version at all and would simply overwrite it.
So what should we do? The correct answer is to implement transaction isolation based on key-level locks. It doesn’t matter what kind of locks — pessimistic or optimistic.
When a transaction changes a key, the key must be locked, and all other transactions must not read or write it until Antietcd saves the change to disk and replicates it to all other nodes in the current quorum. If the locks are optimistic, the request can simply be rejected with a “try again later” status. If the locks are pessimistic, the request must be queued until it can take the lock.
Locks should also apply to watchers:
- First, if a client starts watching from an initial revision (start_revision), the watch also works like a read — it returns the keys changed in the meantime.
- Second, until the change is committed to the entire cluster, watchers shouldn’t receive notifications about it.
It’s interesting that without locks, the problem apparently can’t be solved at all. It also doesn’t depend on the moment changes are applied to the database or on the consensus algorithm used — the same problem would persist with log replication. And, by implementing locks, we inevitably delve a bit into the world of transaction isolation, despite the extreme simplicity of our database. “Row-level locking”, huh?
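Just to pin down that rule, here’s a toy model of the optimistic variant in Clojure. Antietcd’s real implementation is asynchronous Node.js code with per-key queues for the pessimistic case; this is only the shape of the idea, with rollback and error handling omitted:

;; Keys touched by in-flight transactions: applied in memory, but not yet
;; persisted and replicated. Nobody else may read or write them until commit.
(def in-flight (atom #{}))

(defn txn! [ks apply! persist-and-replicate! notify-watchers!]
  (if (some @in-flight ks)
    {:error :try-again-later}          ; optimistic lock: the caller retries
    (do (swap! in-flight into ks)
        (try
          (let [result (apply!)]       ; change the in-memory database
            (persist-and-replicate!)   ; fsync + push to every node in the quorum
            (notify-watchers!)         ; watchers learn about the change only now
            result)
          (finally
            (swap! in-flight #(reduce disj % ks)))))))

;; (Antietcd itself is single-threaded, so the check-then-lock above is not a
;; race condition there.)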
Analysis of another anomaly
Okay, now Antietcd has locks, we retest our watchers, and the test finally passes correctly! Hooray!
All that’s left is to go back and check that we didn’t break anything in the other tests. We run append again, this time with 200 parallelism… hmm. It tells us we have five types of anomalies: G-nonadjacent-item-realtime, G-single-item, G0-realtime, G1c, and incompatible-order. Interesting. Did locks break everything that badly? We roll back to a version without locks and recheck — no, it’s still the same, random anomalies keep popping up.
Okay, let’s look at the anomaly descriptions. Elle puts them in the store/current/elle/ directory. Here’s the beginning of the G-single-item.txt file from there:
G-single-item #0
Let:
T1 = {:index 276, :time 14732767003, :type :ok, :process 133, :f :txn, :value [[:r 4 [19]] [:append 0 26] [:r 4 [19]]]}
T2 = {:index 264, :time 14709137549, :type :ok, :process 93, :f :txn, :value [[:append 4 3] [:append 0 15]]}
Then:
- T1 < T2, because T1 did not observe T2's append of 3 to 4.
- However, T2 < T1, because T1 appended 26 after T2 appended 15 to 0: a contradiction!
Other descriptions are scarier; some of them involve as many as 12 transactions. This one is the simplest — it only has two transactions, so we’ll focus on it. Elle also shows us a graph, but overall the text description is clearer.
Let’s dig into history.edn / jepsen.log. We see that:
- Yes, T2 at :index 264 seems to have added value 3 to key 4.
- But T1 at :index 276 didn’t see this 3, yet it saw 4 = [19]. And where did 19 come from?
- It seems 19 was added by another transaction at index 273: {:index 273, :time 14720717184, :type :ok, :process 118, :f :txn, :value [[:append 4 19]]}
- Sooo, T1 saw the changes from index 273, but didn’t see 264?
- Oh, and at index 272 someone else also read key 4 and saw [3]: {:index 272, :time 14719563878, :type :ok, :process 128, :f :txn, :value [[:r 4 [3]]]}
But we add elements to lists using CAS transactions, right? How did [3] get overwritten by [19]?
Fortunately, we ran the test with Antietcd access logs enabled (by default the test should be run without them, as some bugs may not reproduce because logging adds extra serialization). We dig into the log and find these successful write requests: 4 = [3] and 4 = [19]. The requests themselves look fine; only the 3 is missing from the second one:
2026-01-18T09:04:03.709Z ::ffff:10.0.2.2:60946 POST /v3/kv/txn 200
{"compare":[{"key":"4","target":"MOD","result":"LESS","mod_revision":46},{"key":"0","target":"MOD","result":"LESS","mod_revision":46}],"success":[{"request_put":{"key":4,"value":[3]}},{"request_put":{"key":0,"value":[1,2,3,4,5,6,7,8,9,10,11,15]}}]}
{"header":{"revision":47},"succeeded":true,"responses":[{"response_put":{}},{"response_put":{}}]}
2026-01-18T09:04:03.721Z ::ffff:10.0.2.2:60914 POST /v3/kv/txn 200
{"compare":[{"key":"4","target":"MOD","result":"LESS","mod_revision":48}],"success":[{"request_put":{"key":4,"value":[19]}}]}
{"header":{"revision":48},"succeeded":true,"responses":[{"response_put":{}}]}
Let’s look for the read requests used for CAS comparisons. Here is one:
2026-01-18T09:04:03.708Z ::ffff:10.0.2.2:60990 POST /v3/kv/txn 200
{"success":[{"request_range":{"key":"4"}}]}
{"header":{"revision":47},"succeeded":true,"responses":[{"response_range":{"kvs":[]}}]}
That’s a strange response. Revision 47, but 4 = [3] is missing. Oh, and there’s also this entry nearby:
2026-01-18T09:04:03.709Z ::ffff:10.0.2.2:60930 POST /v3/kv/txn 200
{"compare":[{"key":"3","target":"MOD","result":"LESS","mod_revision":46},{"key":"1","target":"MOD","result":"LESS","mod_revision":46}],"success":[{"request_put":{"key":1,"value":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,25]}},{"request_put":{"key":3,"value":[1,2,3,4,5,6,7,9,10,11,12,13,14,32]}},{"request_range":{"key":3}}]}
{"header":{"revision":47},"succeeded":true,"responses":[{"response_put":{}},{"response_put":{}},{"response_range":{"kvs":[{"key":"3","value":[1,2,3,4,5,6,7,9,10,11,12,13,14,32],"mod_revision":46}]}}]}
Uh? How? Different changes with the same revision — 47?
It turned out that everything was simple — after applying the change, Antietcd responded with the current revision of the database at the time of the response, not at the time the change was applied.
I fix it, repeat the test, and all the anomalies disappear. The bug is defeated, everything looks good! ヽ(‘ー`)ノ
Summary
My long-read has finally come to an end.
For some reason, I feel like I’ve created an excellent training sample for practicing distributed database logic.
The Jepsen tests took up approximately 1500 lines of code. Now both of my reinvented wheels (the TinyRaft leader election algorithm and the Antietcd consensus system) are verified, provide genuine STRICT SERIALIZABLE isolation, and, as of Antietcd 1.2.0, can be used in production, for example, in Vitastor.
At the same time, Antietcd hasn’t become complicated: 3000 lines of code (plus 500 for the locks) are still trivially rewritable in any language. By the way, the ALT Linux guys have already rewritten TinyRaft in Rust. On the other hand, Antietcd still has room for future extension. Perhaps, for example, it could be turned into a full-fledged K/V database by adding the ability to store large amounts of data?
And I had a great time during the New Year holidays figuring out the logic of Jepsen tests. Now all these G0, G1a, wr and rw-dependencies don’t scare me so much. :-)
Thank you all for your time and see you in production!
Links
- Antietcd
- TinyRaft
- Jepsen
- Jepsen phenomena descriptions
- G1a Aborted Read phenomenon
- Jepsen transaction dependencies
- Strict serializable
- Jepsen tutorial
- Elle transaction checker
- Paper about Elle
- Elle list-append
- Elle rw-register
- Jepsen tests for etcd
- etcd consistency guarantees
- Raft algorithm description
- Comparison of embedded K/V DB engines for Go
- Hato HTTP client for Clojure
- Atul Adya thesis — Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions
- TinyRaft, rewritten in Rust