A Year Long Rewrite
A year ago, I wrote a prototype for an IDE-like text editor. It was buggy, hard to add features to, and needed significant changes. I tried to explore as much as I could before the bugs prevented further progress. I thought rewriting it all would be quick, but now, a year later, I still haven't reached feature parity (for good reasons). I want to write a little about what happens when you rewrite a product and why I estimated so poorly.
Throwing Out Code
When you rewrite a product, you're likely replacing something with years of work behind it. You're probably not rewriting just to fix bugs, and an API can be extended incrementally, so that's probably not the reason either. Chances are, you and the team want to completely change both the API and the internals. Given that the internal data structures are different, very little internal code can be reused. Given that the API is completely different, everything calling it needs to change too. For some apps (like the text editor I've been writing), that means just about everything gets thrown away. Sure, there were some pieces I was able to keep, like font loading, text search, and gdb-related code. But those add up to less than 10% of the prototype and 5% of the new codebase. It didn't really make a dent.
If you're doing a rewrite, you really need to consider that everything will be thrown away. I only use two libraries (SDL and FreeType), and even those needed a change (I went from SDL 2 to 3, which was an extremely easy upgrade, but required a source change). With the GUI being thrown out and my original rendering code not being that good (I didn't try to make it good for a prototype), I didn't port anything; everything related to rendering and windowing was thrown out.
I thought it'd be quick
Everyone knows that once you understand the problem and the solution, writing the code is relatively quick. Sometimes typos happen, or a person overlooks something, but it's straightforward to fix. Some of us have accidentally (or intentionally) deleted 50-300 lines of code written across multiple sittings, only to rewrite them in a single sitting. It always takes a fraction of the time, and the result is always clearer and more maintainable. I found it's not unusual for the redo to take 1/5th of the time. My reasoning went: I spent 12 months on the prototype; a generous 1/3rd of that is 4 months; a year is three times as long, so it's a no-brainer: of course it can be done within a year.
Well... the problem is... I didn't want to rewrite the code I already understood. I wanted to write the new pieces, the larger parts that might affect the architecture. I wanted to write the complicated code I hadn't implemented yet, so I could understand the problem space before writing the code around it.
As an example, I didn't completely understand how I should implement async, or what my requirements were. In the prototype, I used a pipe as a poor man's message queue, and a mutex the few times I needed synchronization. I thought I'd have very little code I'd want to run asynchronously. One piece I didn't think needed to be async was searching text in the current document. The code ran so quickly that I assumed people wouldn't notice a 50ms delay. When I tried it, the keystroke delay was absolutely noticeable, especially when everything else responds in 16ms. When a file is large enough, searching text, even with SIMD, can take 100+ms. Anyone who deals with 6+ GB log files knows a search isn't keystroke-fast. The delay in rendering the input felt amateurish.
I needed to replace my poor man's pipe with a proper thread-safe queue, and come up with a design where implementing a message would be both easy and hard to get wrong. It should be easy to send an open-file message to the IO queue, but it shouldn't be easy to accidentally send "find text" to that same queue; it should go to the general worker queue, which a worker thread consumes when it has nothing else to do. The results shouldn't be easy to send to another general worker when they're meant for the main thread. There were cases where I wanted a message to go to an IO thread, then a general worker, then back to the main thread, in that order, and I wanted the async system to make that pattern painless to implement. The async-related code took over a month, which includes a mini IO lib to deal with OSes blocking on open.
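The queue-per-consumer idea can be sketched roughly like this. This is a minimal sketch assuming POSIX threads; the names (`msg_queue`, `post`, the message kinds, the three queues) are all hypothetical, not the editor's actual API. The point is the single `post()` routing function: call sites never name a queue, so "find text" can't accidentally land on the IO queue.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical message kinds; a real editor has many more. */
enum msg_kind { MSG_OPEN_FILE, MSG_FIND_TEXT, MSG_REDRAW };

typedef struct msg {
    enum msg_kind kind;
    void *payload;
    struct msg *next;
} msg;

typedef struct {
    msg *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} msg_queue;

/* One queue per consumer: IO thread, worker pool, main thread. */
static msg_queue io_queue     = { NULL, NULL, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };
static msg_queue worker_queue = { NULL, NULL, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };
static msg_queue main_queue   = { NULL, NULL, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

static void queue_push(msg_queue *q, msg *m) {
    m->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Blocks until a message arrives. */
static msg *queue_pop(msg_queue *q) {
    pthread_mutex_lock(&q->lock);
    while (!q->head) pthread_cond_wait(&q->nonempty, &q->lock);
    msg *m = q->head;
    q->head = m->next;
    if (!q->head) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return m;
}

/* Route by kind: the destination queue is decided in exactly one place,
   so a call site can't send a message to the wrong thread. */
static void post(msg *m) {
    switch (m->kind) {
    case MSG_OPEN_FILE: queue_push(&io_queue, m);     break;
    case MSG_FIND_TEXT: queue_push(&worker_queue, m); break;
    default:            queue_push(&main_queue, m);   break;
    }
}
```

The IO-then-worker-then-main hop falls out of the same idea: a handler finishes its stage, rewrites the message's kind (or a hypothetical stage field), and calls `post()` again.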
To contrast this, rendering was quick, taking maybe 1/4th of the time it took in the prototype, despite the GUI (data structure), rendering API, and implementation all changing. In the original implementation, syntax highlighting was a separate function. That didn't seem to make anything simpler, so this time I wrote it as part of the rendering function. It went fine, and I prefer this version.
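Folding highlighting into rendering might look something like this sketch: classify each character as it's emitted instead of running a separate pass first. All names here are made up (`draw_fn` stands in for the real glyph emitter), and only `//`-style comments are handled, just to show the shape.

```c
#include <stddef.h>

enum color { COL_TEXT, COL_COMMENT };

/* Stand-in for the real glyph emitter. */
typedef void (*draw_fn)(char c, enum color col, void *ctx);

/* Classify while drawing: once "//" is seen,
   the rest of the line is a comment. */
void render_line(const char *s, draw_fn draw, void *ctx) {
    int in_comment = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (!in_comment && s[i] == '/' && s[i + 1] == '/')
            in_comment = 1;
        draw(s[i], in_comment ? COL_COMMENT : COL_TEXT, ctx);
    }
}
```

The win is that there's no intermediate "list of highlighted ranges" to build, store, and keep consistent with the text being drawn.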
Goals of a rewrite
My goal was to write a fast, bug-free text editor. That means having a complete implementation, not returning a not-implemented error when the user hits a corner case. The rewrite is 150% of the size of the prototype, yet has significantly fewer features. As an example, the prototype used an LSP for syntax highlighting. The information an LSP provides is the starting and ending line and column, plus the highlight kind. Typing a new line would push that information out of sync, and LSPs can take a moment to send a new highlight message. In the prototype, typing a comment looked strange because it wouldn't be highlighted as a comment for another 2 seconds or so (it largely depends on the LSP and how large the source file is). The rewrite accounts for newlines, insertions, and deletions, and has a supplementary highlighter so keywords and comments are highlighted instantly.
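Keeping the LSP's spans in sync with edits amounts to shifting line ranges as the buffer changes. Here's a hedged sketch for the newline case, assuming spans store 0-based line ranges; `span` and `spans_on_insert_line` are made-up names, not the editor's.

```c
typedef struct {
    int start_line, end_line; /* 0-based, inclusive */
    int token_kind;           /* e.g. comment, keyword, string */
} span;

/* After a new line is inserted at `at`: spans at or below the insertion
   point move down one line; a span straddling it grows by one line. */
void spans_on_insert_line(span *spans, int count, int at) {
    for (int i = 0; i < count; i++) {
        if (spans[i].start_line >= at) {
            spans[i].start_line++;
            spans[i].end_line++;
        } else if (spans[i].end_line >= at) {
            spans[i].end_line++;
        }
    }
}
```

Deletions and in-line insertions need the symmetric adjustments (plus column handling), and the supplementary keyword/comment highlighter covers the window until the LSP sends fresh data.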
Having a more complete implementation accidentally increased the scope, which threw off my estimation. I plan to only support a handful of LSP events (some are for notebooks, which is out of scope, at least for now). For DAP, I was considering doing the same. There's an official JSON file that describes all the DAP events and data structures. I wrote code to generate structs and serialization code. It seems fairly reasonable to support most messages now, which wasn't what I intended.
As an aside, I also did the same for the LSP JSON after looking over the file. I completely regretted it. The Language Server Protocol is completely insane. Even though I can generate a significant amount of the data structures and serialization code, I wouldn't want to use it. It would pollute my codebase with if statements and corner cases, because the LSP people couldn't decide how to represent things. There's at least one message whose data can be represented in 5 different ways, while many others allow 3. The worst part is that many of these variants carry the same data, differing only in whether a field lives in a child node or the parent.
The biggest issue with my estimate is that I didn't account for how many todos there were. Even if the code takes 1/3rd of the time to write, there would be more than twice as much of it, so it'd be closer to 2/3rds of the original time. My estimate was 6-8 months for the complicated new code, and 4-6 months for the parts I did understand. After the todos, the latter ended up being more like 8 months, and I still need at least 6 months for the complicated new code. My estimate ended up pretty far off. I may have code for a debugger, but there isn't a UI to interact with it.
Estimating Next Time
I would rather not estimate, but if I had to, I'd write out the scope of the project to remind myself how large it might be, and I'd assume none of the existing code is reusable. I would schedule extra time to reread library and OS documentation, because more than once I hit a surprising behaviour or limitation I'd never run into before, even in something I use regularly (in my case, it was a system call). It's very easy to overlook something in a rewrite, and I'm not sure how to solve that. I didn't account for the time I'd spend on tooling (scripting more static analysis/linters, coverage, etc.). I didn't think about CI or backing files up, and I completely forgot to include building a new website in my estimate.
Usually, I try to have as few dependencies as possible. Many workplaces prefer using a library over an in-house solution, which sometimes ends up being more work. I'm not good at estimating libraries, but if you're rewriting a product that uses many of them, plan for a handful not exposing information you need, or not handling a use case you have. You should also consider how much smaller the team will be for the rewrite, and whether you're losing a person with key knowledge. Finally, consider what happens when you have nothing to show until late in the project, which matters if you need feedback from third parties or want to translate/localize the product.
If you're reading this because you need to estimate, good luck.