CoCalc Blog

Collaborative Editing

Collaborative Editing in CoCalc: OT, CRDT, or something else?

This paper about collaborative editing is on Hacker News today. I also recently talked with Chris Colbert about his new plans to use a CRDT approach to collaborative editing of Jupyter notebooks. This has caused me to be very curious again about how CoCalc’s collaborative editing is related to the many algorithms and research around the problem in the literature.

The Collaborative Editing Problem

Protocols for collaborative editing are a venerable problem in computer science, and there are probably over a hundred published research papers on it. The basic setup, going back three decades, is that sync algorithms are supposed to have three properties, which I’ve stated in simplified plain language below:

CoCalc has 1, of course; without that you’ve got nothing.

CoCalc has 2, when people’s clocks are synced, because all patches you’ve applied have timestamp less than now (=time when making the patch).

CoCalc does NOT have 3, for some meaning of 3. Patches are applied on a “best effort basis”. So instead of our changes being “insert the word ‘foo’ at position 7”, they are more vague, e.g., apply this patch with this context using these parameters to determine Levenshtein distance between strings. With intention preservation, if the operation is “insert word ‘foo’ at position 7”, definitely that’s exactly what happens whenever anybody does it (‘foo’ will appear in the document) – it does not depend at all on context. With diffmatchpatch patches (which we use in CoCalc), the effect of the patch depends very much on the document you’re applying the patch to. If there is insufficient context, then ‘foo’ might not get inserted at all.

Similar remark apply to how I designed the structured object sync in CoCalc, which is used, e.g., for CoCalc Jupyter Notebooks; it also applies patches on a best effort basis.

OT = operational transforms

This is a protocol that in theory has all of 1-3. Of course there are many, many specific versions of OT. The hard part is ensuring 3, and it can be complicated. The problem to be solved makes sense, and it can be done. The details (and implementing them) are certainly nontrivial to think about conceptually… There’s many academic research papers on OT, and it’s implemented (well) in many production systems.

In OT, the data structure that defines the document is simple (e.g., just a text string), and the operations are simple, but applying them in a meaningful way is very hard. This paper on HN that I mentioned above argues that OT is much more popular in production systems than CRDT.

CRDT = commutative replicated data type

This also does 1-3. It sets everything up so the data structure that defines the document is very complicated (and verbose), but it’s always possible to merge documents in a consistent way. What is difficult gets pushed to different places in the protocol than OT, but it’s still quite hard, and there are subtle issues involved with any non-toy implementation.

What about CoCalc’s approach…?

CoCalc’s text editing does synchronization as follows. Each user periodically computes a timestamped patch, then broadcasts it to everybody else editing the same file. When patches arrive, each user computes the current state of the document as the result of applying all patches in timestamp order. If everybody stops editing, then they all agree on the same document.

This protocol satisfies 1 and 2, but not 3. The reason is that patches are applied on a best-effort basis using the diff-match-patch algorithm. For example, a patch made from deleting a single letter in a document can, when applied to a different document end up deleting multiple letters (or none). Basically, CoCalc replaces all the very hard work needed for 3 that OT and CRDT’s have with a notion of applying patches on a “best effort” basis. The behavior is well defined (because of the timestamps), but may be surprising when multiple people do simultaneous nearby edits in a document.

The paper says:

“There are two basic ways to propagate local edits: one is to propagate the edits as operations [12,38,50,51,73]; the other is to propagate the edits as states [13]. Most real-time co-editors, including those based on OT and CRDT, have adopted the operation approach for propagation for communication efficiency, among others. The operation approach is assumed for all editors discussed in the rest of this paper”.

Here [13] is N. Fraser’s paper on Differential Sync. This was the sync algorithm in the first version of CoCalc, and was the inspiration for what CoCalc currently does.

In CoCalc, the data structure that defines the document is simple (just a text string, say), and the operations are less simple (computing diffs, defining patches), and applying them in a meaningful way is somewhat difficult (it’s what the diffmatchpatch library does). This approach is very easy to think about and generalize, since it is self contained and a local problem. After all, I mostly described the algorithm in a single paragraph above!

In CoCalc, we compute diffs of arbitrary documents periodically, much like how React.js DOM updates work. This seems to not be needed in OT or CRDT, which instead track the actual operations performed by users (i.e., type a character, delete something). Computing diffs has bad complexity in general, but very good complexity in many cases that matter in practice (that’s the trick behind React). Diffs involve observing state periodically, rather than tracking changes.

OT and CRDT really are solving a much harder problem than we solve. This is similar to how git uses the trick of “assume sha1 hashes don’t collide” to solve a much easier problem than the much harder problems other revision control systems like Darcs solve.

An Example in which CoCalc violates the intention preservation requirement

There is a nice example to illustrate how CoCalc fails for this third “user intention” requirement. This is called “the TP2 puzzle”. You can try the following in both CoCalc and Overleaf (which probably does some OT algorithm):

  1. Type in some blank lines, then “abcd”, then blank lines
  2. Open three windows on the doc you’re editing.
  3. Disconnect your Internet
  4. In each of the three window, make these changes, in order:
    • abcxd (put x after c)
    • abycd (put y before c)
    • acd (delete b)
  5. Reconnect and watch. The experts agree that the “correct” intention preserving convergent state is “aycxd” (which overleaf produces), but CoCalc will produce “acxd”.

I do NOT consider this a bug in CoCalc – it’s doing exactly what is implemented, and what I as the author of the realtime sync system intended. The issue is that the patch to delete “b” has “a” and “cd” as surrounding context, and if you look at how diffmatchpatch patch application works, this is a case where it just deletes everything inside the context.

Evidently, Google Wave also had issues with TP2 because fully implementing OT is…

“… hard! In fact, almost all published algorithms that claim to satisfy TP2 have been shown to be flawed.”

More details…

The “famous” TP2 puzzle for CoCalc ends up like this (in at least 1 of the 6 possibilities!).

Start with

abc

then add an x and a y on either side of b, and delete b.

In one order, end up with

acxd

The patches are:

 [[[[0,"--\n\n\nabc"],[1,"x"],[0,"d\n\n\n\n\n"]],5291,5291,14,15]]
 [[[[0,"---\n\n\nab"],[1,"y"],[0,"cd\n\n\n\n\n"]],5290,5290,15,16]]
 [[[[0,"\n---\n\n\na"],[-1,"b"],[0,"cd\n\n\n\n\n"]],5289,5289,16,15]]

Applying the “delete b” patch, also deletes the y:

apply_patch([[[[0,"abc"],[1,"x"],[0,"d"]],5291,5291,14,15]], 'abcd')
(2) ["abcxd", true]
apply_patch([[[[0,"ab"],[1,"y"],[0,"cd"]],5290,5290,15,16]], "abcxd")
(2) ["abycxd", true]
apply_patch([[[[0,"a"],[-1,"b"],[0,"cd"]],5289,5289,16,15]], "abycxd")
(2) ["acxd", true]

Looking at the source code of diffmatchpatch, this is just what DMP does. If there is a lot more badness and the strings are bigger, it’ll refuse to delete. It really is a sort of “best effort application of patches” with parameters and heuristics; no magic there.

Where does CoCalc come from?

William Stein and Hal Snyder • • cocalc

Meet the team and company that provides CoCalc.

CoCalc Origins

Prior to CoCalc, William Stein spent 15 years teaching and doing research using mathematical software at Berkeley, Harvard, UCSD, and Univ of Washington. Based on this experience, he launched the CoCalc web application in April 2013, under the name SageMathCloud, with the mission to make it very easy to collaboratively use free open source mathematics and data science software in classes and research.

After over 5 years of extremely active development, CoCalc is now a modern web application that provides collaborative access to most free open source technical software, including LaTeX, Jupyter Notebooks, the Python numerical ecosystem, the R statistics software, and SageMath. Thus CoCalc brings together the work of thousands of contributors to open source software under one roof, which is easily accessible from your web browser.

Who Is Cocalc

Dash with CoCalc

Hal Snyder • • cocalc and python

Create interactive data visualizations for collaborators in your CoCalc projects using Dash.

Dash is an open-source framework to create web applications with Python. With CoCalc’s HTTPWebserver capability, you can run a Dash application from inside a CoCalc project.


Dash application running in a CoCalc project

Use CoCalc to Learn How to Program

Hal Snyder • • cocalc, python, and r

If you are new to coding, CoCalc makes it easy to get started. All you need is a web browser and an Internet connection. You don’t have to install software on your computer to start learning Python, R, Julia, and other leading open-source languages.


learning R by example

Embedding CoCalc in Your Application

Harald Schilly and Hal Snyder • • cocalc

Add scientific computing to any online training platform by embedding CoCalc.

Embedding CoCalc into an online learning platform or learning management system (LMS) adds:

Examples Assistant

Harald Schilly • • cocalc

CoCalc wants you to fully accomplish your computational work online. To archive this goal, CoCals has to provide a reliable service, offer and maintain the software you need, and package this in a powerful interface.

The new “Assistant” is one of the latest additions to the interface. Its goal is to help you by offering a curated set of annotated code snippets.

Is KaTeX ready for Prime Time? You be the judge.

Hal Snyder • • latex

CoCalc now offers an option to render LaTeX using KaTeX rather than MathJax. At the moment, KaTeX is an experimental feature which is turned off by default. To enable it, open Account / Preferences, and under Other Settings, check the box next to “KaTeX: render using KaTeX when possible, instead of MathJax”.


enabling KaTeX in Account Preferences

KaTeX is often over 100 times faster than MathJax, but it doesn’t handle all expressions covered by MathJax (or LaTeX). In these cases, CoCalc with KaTeX enabled will still fall back to MathJax. The selection happens for individual expressions, so one expression in a markdown file or a notebook cell might be rendered with KaTeX, while another would be rendered with MathJax.

Juno and CoCalc: Bringing Jupyter Notebooks to the iPad

Alex Staravoitau • • jupyter

About the author: Alex Staravoitau is a software engineer and machine learning enthusiast, the creator and main contributor of Juno, which is a Jupyter Notebook client for iOS.

Juno

At Juno, we all have been huge fans of Jupyter for awhile, and most importantly of the flexibility it offers: we strongly believe that the fact that you only need a screen and network connection to get access to pretty much unlimited computational resources has enormous potential. Naturally, we thought that Jupyter could use a proper client iOS application with a native interface, that would let you connect to a remote backend and work with Jupyter on your iPad. Now, after months of making and beta testing our app, Juno has made it to the AppStore!


screen capture of Juno running on iPad

Introducing the CoCalc Library

Hal Snyder • • cocalc

Get a Running Start on the Learning Curve

The CoCalc Library is a great way to get up to speed with new material.

Getting started with a new programming language or software package can be a slow process. It takes time to find your way around new terminology, development workflow, and error messages. Reading documentation and tutorials is helpful, but for most of us, the process of learning speeds up dramatically when we start writing code and actively experimenting.

The CoCalc Library makes it easier to get started with a new toolset or a new topic. It gives you access to a curated collection of open source documents and code from educators and software developers. With a single click, copy a selected topic into a CoCalc project. Then you can review, modify, and run files as desired.

Using the Library

Here’s an example. Suppose you want to start working with linear algebra in Sage.

1. Open the Files view in the directory where you want example documents to be placed. In the screenshot, that directory is ‘EXAMPLES’.

2. Click (+)New, and scroll to the middle of the browser tab to see the section with a label of “Library” on the left.