CAREER: Building Reliable Distributed Systems with Refinement Types

composition.al 2022-02-01

Friends, I’m very happy to announce that I received an NSF CAREER Award to support my research at the intersection of distributed systems, programming languages, and software verification!!

Here’s the award abstract (which I was asked to write in a rather specific two-paragraph style, with sentences beginning “The project’s novelties…” and “The project’s impacts…” in the first paragraph):

Today’s most important computer systems are distributed systems: those that consist of multiple machines that communicate by sending messages over a network, and where individual machines or network connections may fail independently. Programming such systems is notoriously difficult and error-prone due to messages being reordered or delayed and the possibility of machines and network connections failing. Widely-used protocols are meant to ensure, for instance, a given message delivery order or a given data consistency policy, but machine-checked proofs ensuring the correctness of executable implementations of these protocols are rare. The goal of this project is to mechanically verify real, executable implementations of distributed systems, using powerful tools built into a general-purpose programming language. The project’s novelties are in its focus on immediately executable implementations, and in its modular verification approach that separates lower-level message delivery concerns from higher-level application semantics. The project’s impacts will be in improving the overall trustworthiness and reliability of distributed software systems. The project further aims to create approachable entry points to research, and to develop students’ scientific communication skills, by integrating the creation of zines and videos about distributed systems verification into the investigator and project team’s teaching and research practices, and making the resulting materials freely available online.

The project’s approach to language-level distributed systems verification centers around refinement types: data types that let programmers specify logical predicates that restrict the set of values described by a type, and that can be checked at compile time by an SMT solver. The project uses Liquid Haskell, which extends the Haskell programming language with support for refinement types, and which can further be used for extrinsic verification, that is, defining functions to state and prove theorems. The investigator and team will use Liquid Haskell to develop mechanically verified, modular libraries for implementing distributed systems. A novel verified message delivery library will form the foundation for additional verified components, such as libraries providing replicated data types and distributed data stores. Furthermore, the project investigates combining the strengths of Liquid Haskell with those of interactive proof assistants such as Agda. These investigations culminate in a comprehensive survey of such hybrid automated/interactive verification tools, and provide a holistic scientific understanding of the design space.

I’m making the project description I submitted public, for two reasons. First, as I was working on my proposal, it was extremely helpful to look at examples that other CAREER awardees were willing to share with me when I asked for them. It’s a particular genre of writing, and looking at other successful proposals helped me understand the genre conventions, so I want to pay that forward to somebody else, without the added friction of them having to ask. Second, my ulterior motive in sharing the project description is to get you interested in my research agenda, so that you (yes, you personally) will collaborate with me and we can do something cool. A lot of grant proposals never get read by anyone other than a handful of reviewers, but this document is the clearest and most detailed articulation I have so far of this part of my research plan, so why shouldn’t I share it?

Some history that didn’t make it into the project description

This research agenda has been taking shape for me for a very long time. Back when I was a lowly grad student working on LVars, I was always a little dissatisfied with our LVish framework implemented in Haskell, because the Haskell type system wasn’t expressive enough to enforce properties like “updates are inflationary with respect to a given partial order”. Instead, users of LVar data structures just had to trust the data structure implementor to implement something that really did behave like an LVar. Ugh! I wished there was some kind of way to supercharge the expressivity of Haskell types, without having to leave behind the familiar Haskell ecosystem (and the decidability of type checking and type inference). At the time, Liquid Haskell was beginning to emerge as just such an approach. By then, though, I had moved on to a new job at Intel Labs and new projects like ParallelAccelerator, and wasn’t pushing things forward with LVish, or even thinking much about Haskell at all.¹

Then, a couple of things happened: first, I started getting interested in SMT-based software verification, an interest that would eventually lead me back around to Liquid Haskell and other solver-aided languages. Second, I began to feel that it was time for me, a PL person at heart, to really learn all that distributed systems stuff that I had spent years blithely claiming to already know something about. I started catching up on the distributed systems literature that I had missed during my Ph.D. years. (The feeling of needing to catch up only intensified when, against all odds, I was hired for a “distributed systems” faculty job despite really only having PL pubs.)

Not too long after I knew enough to pass as knowledgeable about distributed systems (at least when talking to undergrads or PL people), James Parker and Mike Hicks asked me to pitch in on a CRDT verification project that they were working on in Liquid Haskell in collaboration with Yiyun Liu and Niki Vazou. It couldn’t have been a more timely or fitting combination of my interests. We had a great meeting at POPL ‘20 — the last in-person conference I attended before the pandemic pushed everything online — and I quickly recruited my soon-to-be-grad-student Patrick Redmond to help.

With our powers combined, we used Liquid Haskell to implement a library of replicated data structures that are verified to be strongly convergent, meaning that replicas that have received and applied the same (unordered) set of updates are guaranteed to have equivalent state. Thanks to Liquid Haskell’s brand-new support (spearheaded by Yiyun and James) for stating and proving properties of Haskell type classes, we could express and prove convergence at the level of a generic interface in Liquid Haskell (that is, a type class), then plug in a data-structure-specific commutativity proof for each data structure we wanted to verify.

Despite the modularity and proof reuse we get from type classes, though, those data-structure-specific proofs still took a lot of effort — on the order of thousands of lines of Liquid Haskell code (and hours of solver time) for the more sophisticated data structures.² The strenuous verification effort suggests that a different approach is called for — one that separates lower-level message delivery concerns from higher-level data structure semantics. (This different approach would not have occurred to me had I not spent time reading those dusty old distributed systems papers!) Moreover, our experience carrying out large-scale proofs in Liquid Haskell highlighted the need for usable hybrid verification tools that integrate solver automation with visibility into the in-progress proof state.

So, my CAREER project aims to address both of those needs in two interrelated research thrusts. In the first thrust of the project, we’ll aim to develop a stack of verified libraries for building distributed applications in Liquid Haskell, starting with a messaging layer and building up from there. In the second, complementary thrust, we’ll focus on improving the verification tools themselves, and we’re excited to continue working closely with Niki and the other Liquid Haskell developers. For more details, please check out the project description!

Going all in on weird outreach

I’ve done a lot of stuff that I collectively term “weird outreach”: teaching on Twitch, organizing !!Con and !!Con West, this blog itself, and so on. In a CAREER proposal, you’re supposed to propose a plan for how you will integrate your research activities with education and outreach activities. Often, the done thing is to propose teaching a class that ties in with your research, maybe organizing an academic workshop that ties in with your research, that kind of thing. I was originally going to go with something along those lines, but every time I tried to write that part of my proposal, I got bored. Who wants another workshop or class that’s just going to serve more of the same old people who are currently being served, in the same old way they’re already being served? So instead, I decided to go all in on “weird”.

As many readers of this blog probably already know, there’s a rich tradition of using zines as a medium for scientific and technical communication. (You might know my friend and celebrated computing zine creator Julia Evans’s awesome work, but hers are far from the only computing zines out there. Check out these quantum computing zines, for instance, or the work of the Computery Zine Fest exhibitors!) I experimented with zine creation as an optional assignment in my distributed systems course back in 2019 — an idea I originally got from Cynthia Taylor, who did something similar in her operating systems classes — and it went well. So for this project, I proposed integrating zine creation into my team’s research process. My plan is to bring back the optional zine project in my undergrad distributed systems class, then use it as a starting point for recruiting interested and capable undergrads into paid summer research positions on my team. The undergrads on the team will focus on creating zines about the team’s research. The resulting student-created zines will be freely available online and will serve as an approachable and fun introduction to our work.

More important than the zines themselves, though, is the act of creating them, and the participatory science literacy that it fosters. It’s very hard to really engage in research as an undergrad, particularly if it’s research on a topic like distributed systems verification that requires a great deal of background to even get started. As Andrew S. Yang writes:

In a world where scientific knowledge is increasingly complex and technical, the participatory literacy of zines can foster a sense of ownership that is often lacking for those who don’t have the chance to study science formally or at an advanced level, or who had a bad experience with science learning in their educational past. By giving students a chance to be media producers and educators, a zine project can make their role and responsibility in the ecology of scientific information more apparent and potentially more rewarding. As a form of active learning, it also extends what they learn beyond the confines of the lab report, research paper, or test and into the broader community.

I’m particularly proud that my grant includes funding for Julia to consult with our team and help us make our zines the best they can be. (I’m also pretty sure that mine is the first-ever NSF proposal, let alone the first funded NSF proposal, to include a letter of collaboration from Julia.) I was nervous about requesting budget for something “weird” like this, but the proposal review panel liked the idea. Here’s hoping that in a few years, one won’t encounter so many ignoramuses on Hacker News who find it “jarring” to see zines about advanced computing topics!

Meanwhile, my Ph.D. advisor Ryan Newton collaborated with Niki Vazou and several other cool people on the POPL ‘18 paper that introduced Liquid Haskell’s refinement reflection mechanism, and one of the case studies in that paper involves verifying properties of “LVar-style concurrent sets”. They used Liquid Haskell to define a VerifiedOrd type class with instances that are guaranteed by LH to be totally ordered (which all instances of Ord are supposed to be, but there’s nothing about vanilla Haskell’s Ord that actually enforces total ordering – which is why we need LH). My understanding, then, is that what they did isn’t anything actually specific to LVars, or even specific to concurrent sets – regular old (sequential) sets are also supposed to be Ord in Haskell. The paper does also demonstrate how to use LH to prove commutativity, associativity, and idempotence properties of some functions, but not of operations on LVars in particular. The bottom line is that I think there’s still a “verifying that LVar implementations really act like LVars, using Liquid Haskell” paper to be written – so, if you want to work on that paper, please get in touch with me! ↩
If you look at Table 3 in the paper, it’s interesting to observe that for each of the CRDTs we verified, it was either really easy (a one-line Liquid Haskell proof and a second or so of solver time) or really hard (hundreds or thousands of lines of proofs and up to hours of solver time), and there doesn’t seem to be much of an in-between. The easy-to-verify ones are those where message ordering dosen’t matter for convergence. The hard-to-verify ones are those that really could have benefited from stronger guarantees at the messaging layer or at least that’s my hypothesis. ↩