What a classic rant, can't believe I hadn't seen this before. (I hadn't seen the original Crockford DEC64 page either. But the rant is at least amusing, while the DEC64 proposal didn't really have many redeeming properties).
A lovely walk through optimizing a single Opus Magnum solution, step by step.
A tale of a surprisingly long-running positive-return lottery syndicate. Or actually, two of them. Who then end up eating into each other's profits, and start a dirty fight on both sides. (Ok, this last bit isn't a big part of the story. But it's what I chuckled at the most.)
This story did not go where I expected it to from the title.
> Dataflow and data dependencies can be viewed as the fundamental expression of the structure of a particular computation, whether it’s done on a small sequential machine, a larger superscalar out-of-order CPU, a GPU, or in hardware (be it a hand-soldered digital circuit, a FPGA, or an ASIC). Dataflow and keeping track of the shape of data dependencies is an organizing principle of both the machines themselves and the compilers that target them.
How to think about optimization.
>The other thing I’ll say is that even though I’ve been talking about adding cycle estimates for compute-bound loops here, this technique works and is useful at pretty much any scale. It’s applicable in any system where work is started and then processed asynchronously, with the results arriving some time later
Paul Khuong on one of the hardest bugs he has had to debug.
Debugging story. A Windows service causing an invisible 40GB kernel memory leak due to some kind of race condition where it occasionally fails to properly release process handles.
Some interesting thoughts on the value of writing blog posts vs. living documents that work as long-term resources.
Early versions of Windows did bit-blitting by JITting a routine specialized for the actual parameters. This is the evolution of those routines.
Thoughts from Martin Cracauer on what the GC implications would be of using LLVM as an SBCL code generator.
What goes into implementing enough of the OS/2 ABI to get an absolutely minimal graphical application working?
> There are lots of plausible ways to pack bits into bytes, and all have their strengths and weaknesses that I’ll go into later. For now, let’s just cover the differences.
Also, part 2
Writing an asynchronous (when allowed by the semantics) D3D to OpenGL shim.
Writing a SQL parser in Haskell isn't very interesting. The good part is everything else about this. All the way from the genesis of the tool (need to figure out what all the relations in the system really are, for a hellish schema transition), to where the system actually ended up and what other use cases naturally appeared.
The typical HN second-guessing comments feel even more depressing than usual. Why didn't they just read the documentation of the tables to figure out the details? Why not use a Python SQL parser instead of writing a new one? Why did they want this schema transition anyway? It's like there's zero empathy for other people's problems being more complicated than can be explained in the setup of a blog post.
A deep dive into reverse-engineering an ultra-obfuscated piece of malware, with multiple layers of custom virtual machines. Really awesome.
A network debugging war story involving IPv6, fragmentation, and QUIC.
I'm probably going to disagree a bit on the moral of the story. The authors' takeaway here is that routers should not be reordering packets.
What I see here is yet another instance of full transport layer header encryption making it impossible to do the right thing. Why does the server need to MTU-probe with a massive packet? Because there's no way for the path to give a signal about the packet size (the way MSS clamping can in TCP). Why does the receiver end up blocking the queue on fragmentation? Because there's no way for it to know what the intended order was, since the packet numbers are encrypted. So it has to assume the arrival order is the intended delivery order.
But look Ma! No ossification!
> I do not believe in objectivity in rankings. This is not to say I think being objective with regards to rankings is impossible, nor do I think "objective" tools serve no purpose (the tools I've written have already proven highly useful in generating baselines for seeding tournaments). No, more specifically I want to stress that "objective" ranking systems are much less objective than they actually seem, and the word "algorithmic" or "empirical" might be better.
Rating systems, once again. I don't think I agree with much of this article (e.g. the reasoning for Elo not working for double-elimination seems totally nonsensical). But the core idea of not having tournament seeding be purely algorithmic? Sure.
Blizzard going above and beyond on remastering an old game. There was apparently a large number of user-made StarCraft maps that relied on buffer overflows to read/modify game internals (all the way to basically rewriting some of the game logic). How do you not break these maps when the game is completely rewritten? By building an elaborate buffer overflow emulation layer.
Just a crazy level of dedication.
"How do you write a minesweeper puzzle generator that always generates a level that can be won without guessing" is a boring question. That kind of level generation sucks. For a moment it looks like it's where this article is going. It's not, though. The core idea here is a lot more clever.
"Dear ImGUI" in the browser with WebGL and webasm. I've been wanting to do something like this for a couple of small browser-based games.
(Something a bit odd going on with the keyboard handling though).
Using the Rust type system to make access to the global register state of embedded devices safe. (And some thoughts on API design).
A reasonably general-purpose system for fuzzing servers all the way from the main event loop, not just at some arbitrary "this is where we can feed the system a continuous block of bytes" boundary.
(Also: Remind me to write about fuzzing the TCP stack itself, at some point).
Distributing video encoding with more granularity than by keyframe.
Absolutely lovely networking war stories (yay!) about early HFT (blech). All of the hacks are lovely. The one that resonated the most with me was figuring out a way of finding an application-level use case for out-of-order TCP transmission (sending the header and footer of an order early, and once you've decided on a trade to make, sending out the price/count/stock id as a tiny packet that fills in the gap between the header and footer).
Sparklines of the values of individual memory locations. Works because early game consoles had so little RAM. Such a simple idea, such cool output.
> It was said that ‘‘you really can’t appreciate troff (and runoff and scribe) unless you do all of your document preparation on a fixed width font 24 line by 80 column terminal’’. ‘‘Challenge accepted’’ I said to myself.
But the title is a bit misleading: this was a modern v7 port with some extra amenities. It's hard to appreciate just how primitive these early systems were without using them in their original form.
Speaking of which, here's some people attempting to get the original PDP-7 Unix (i.e. pre-v1) running again. There aren't scans for all of the source code though. Most importantly, the shell is missing. So they had to rewrite one themselves.
A story full of "I didn't realize it was impossible, so I went ahead and did it" moments.
A bit like a very compressed "Soul of a New Machine". I've been reading a lot of old timesharing papers, and most of them are dreadfully boring even for me. (Don't ask why I've been reading that stuff...) But this particular kind of personal story of the creation of influential but totally forgotten technology is like catnip.
Designing a timeout system for a Python IO API. (Feels like a very Common Lisp-y solution to me, with hidden global state that has an enforced dynamic extent).
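A minimal sketch of that pattern, mine rather than the article's actual API (all the names here are made up): a context manager pushes a deadline onto hidden thread-local state, and blocking calls inside its extent consult that state instead of taking their own timeout arguments.

```python
# Sketch of a deadline with dynamic extent. Not any real library's API.
import threading
import time
from contextlib import contextmanager

_state = threading.local()  # the hidden global state, one stack per thread

@contextmanager
def deadline_after(seconds):
    """Everything inside this block must finish within `seconds`."""
    stack = getattr(_state, "deadlines", None)
    if stack is None:
        stack = _state.deadlines = []
    stack.append(time.monotonic() + seconds)
    try:
        yield
    finally:
        stack.pop()  # dynamic extent: the deadline dies with the block

def remaining_time():
    """Time allowed by the tightest enclosing deadline, or None if none."""
    stack = getattr(_state, "deadlines", [])
    if not stack:
        return None
    return max(0.0, min(stack) - time.monotonic())

def recv_some(sock):
    """A blocking call with no timeout parameter of its own; it respects
    whatever deadline happens to enclose it."""
    sock.settimeout(remaining_time())
    return sock.recv(4096)

# Usage: one deadline covers the whole operation, however many individual
# socket calls it decomposes into.
#
# with deadline_after(10.0):
#     header = recv_some(sock)
#     body = recv_some(sock)
```

Taking min(stack) also gives the nesting behaviour for free: an inner block can only tighten the effective deadline, never extend the outer one.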
> One tiny, ugly bug. Fifteen years. Full system compromise.
A step-by-step walk through an OS X local vulnerability (and it's a lot of steps). Another of those writeups that make you wonder how anyone ever manages to get from concept to an actual exploit.
Oh, ok... So that's what the Linux page table changes discussed a couple of weeks ago were about. This looks really bad. It seems amazing that nobody found it before now, but on the other hand at least the exploits for Spectre look really hard to pull off (needing to e.g. reverse engineer the branch predictor, so that it can be trained to expose one bit of data...). So maybe a lot of people had tried, and just nobody succeeded.
Now that's a much more approachable speculative execution bug!
I like this paper. It doesn't just naively measure TCP vs QUIC, but also tries to map the results to the underlying mechanisms. But "ouch" on the fairness tests.
The dangers of just throwing garbage data at a machine learning model. (And in a medical setting, even!)
What happens if you switch the example code of a difficult systems programming course from gnarly '80s style C to modern C? Apparently the students are able to implement much more complex memory allocator features.
(They also changed the malloc test suite at the same time. But it seems hard to believe that the tests would have a major effect here).
> The correct solution to the “integer printing is too slow” problem is simple: don’t do that.
...
> However, once you find yourself in this bad spot, it’s trivial to do better than generic libc conversion code. This makes it a dangerously fun problem in a way… especially given that the data distribution can matter so much.
Paul Khuong on fast integer -> decimal string conversions.
Reporting on some mysterious ongoing Linux development. Just how bad a security bug does this have to be, if they're willing to take a 5% system-wide performance hit to work around it?
Insane retro-hardware maintenance.
A funny story about extreme capitalism in an MMO setting, the kind you'd expect to be a digression in a good Neal Stephenson novel. I have no idea whether this is actually true or not, but it probably doesn't matter either way :)
Input-to-display latency measurements for 40 years of computers.
How the voxel physics engine of Roblox works.
And then for physics of a different kind... Basically modeling crowds of people as a fluid dynamics system.
> Developing and testing a virtual version of Unix on OS/32 has practical advantages. There was no need for exclusive use of the machine; [...]. And the OS/32 interactive debugger was available for breakpointing and single-stepping through the Unix kernel just like any other program.
A port of Unix v6, from before it was really meant to be portable. A lovely systems programming story.
A very amusing system programmer's lament.
Ignore the title. It's not actually a rant about Skype sucking, but a really cool article series on someone writing their own codec + packet-loss tolerant UDP networking for a prototype video conferencing app.
Micro-optimizing lockless message passing between threads.
Then use this to replace locks on data structures. Instead of data structures being shared, they're owned by a specific server process. If a client needs to operate on a data structure, it asks the server to do it instead. Assuming heavy contention, this'll be much faster since fewer cache coherency roundtrips are required.
(Obviously not widely applicable, due to the scheme requiring busylooping to work well.)
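A toy sketch of the ownership pattern, mine rather than the article's (and using queue.Queue instead of lockless rings and busy-polling, so it shows only the structure, not the performance):

```python
import queue
import threading

class OwnedCounter:
    """State owned by a single server thread. Clients never touch the
    value directly and never take a lock; they send requests instead."""

    def __init__(self):
        self._requests = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        value = 0  # lives only in the owner thread: no sharing, no locks
        while True:
            op, reply = self._requests.get()
            if op == "incr":
                value += 1
            reply.put(value)  # both "incr" and "read" reply with the value

    def _request(self, op):
        reply = queue.Queue(maxsize=1)
        self._requests.put((op, reply))
        return reply.get()

    def incr(self):
        return self._request("incr")

    def read(self):
        return self._request("read")

counter = OwnedCounter()
counter.incr()
counter.incr()
print(counter.read())  # -> 2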
This'll go into the hall of fame of great debugging stories.
> The PDP-11 was designed to be a small computer, yet its design has been successfully extended to high-performance models. This paper recollects the experience of designing the PDP-11, commenting on its success from the point of view of its goals, its use of technology, and on the people who designed, built and marketed it.
A lovely mid-life postmortem for the PDP-11.
(Via Dave Cheney; a useful companion piece putting the paper in the historical context, but not a replacement for reading the original.)
Could you replace B-Tree/hash/bloom filter database indexes with machine learning models? The depressing answer appears to be that it's viable. I thought the systems programmer was going to be the last job in the world!
But assuming this is the state of the art (rather than a more typical "this is what we were deploying 5 years ago" Google paper), it's not quite practical yet. CPUs aren't efficient enough, and the communication overhead to GPUs/TPUs is too large. But that's an architecture problem that will get solved.
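The core idea, as a rough sketch of my own rather than the paper's staged models: learn a mapping from key to position in the sorted data, remember the worst-case prediction error, and only do a conventional search inside that error window.

```python
# A toy "learned index": one least-squares line instead of a B-tree.
import bisect

class LearnedIndex:
    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        # "Model": position ~= a * key + b, fit by least squares.
        mean_k = sum(sorted_keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in sorted_keys) or 1.0
        self.a = sum((k - mean_k) * (i - mean_p)
                     for i, k in enumerate(sorted_keys)) / var
        self.b = mean_p - self.a * mean_k
        # Worst-case prediction error bounds the correction search.
        self.err = max(abs(self._predict(k) - i)
                       for i, k in enumerate(sorted_keys))

    def _predict(self, key):
        return int(self.a * key + self.b)

    def lookup(self, key):
        guess = self._predict(key)
        lo = max(0, guess - self.err)
        hi = min(len(self.keys), guess + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None

idx = LearnedIndex(list(range(0, 1000, 3)))
print(idx.lookup(999))  # -> 333
print(idx.lookup(500))  # -> None (not in the key set)
```

The correction step only searches within the recorded error bound, which is where the space and speed win over a B-tree is supposed to come from when the key distribution is learnable.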