A while back, I got a bug report for json-to-multicsv. The user was getting the following error for any input file, including the one used as an example in the documentation:
, or } expected while parsing object/hash, at character offset 2 (before "n")
The full facts of the matter were:
- The JSON parser was failing on the third character of the file.
- That was also the end of the first line in the file. (I.e. the first line of the JSON file contained just the opening bracket).
- The user was running it on Windows.
- The same input file worked fine for me on Linux.
Now, there's an obvious root cause here. It's almost impossible not to blame this on Windows using CR-LF line endings, where Unix uses just LF. The pattern match is irresistible: works on Linux, fails on Windows, fails at the end of the first line. And I almost answered the email based on this assumption.
Except... Something feels off with that theory. What would be the root cause here? "Wow, I can't believe that the JSON spec missed specifying the CR as whitespace"? No, that makes no sense, nobody would define a text-based file format that sloppily.
How about: "Wow, I can't believe the JSON module of a major programming language has a bug making it fail on all inputs on a major operating system, and it took a decade for anyone to notice". That doesn't seem plausible either.
So I tried to reproduce the problem, by making a file with DOS line endings and running it through the script on Linux. That worked fine. Hm. Put in some invalid garbage, and you get a parser error as expected. Double-hm. But the error message I got was very different from that in the bug report. Could it be that it's using a totally different JSON module altogether?
Turns out that's basically what was going on. Perl's JSON module doesn't actually do any parsing itself. It's mostly a shim layer; the actual work is done by one of several different parser modules. On Linux, I'd been getting JSON::XS as the backend (XS is Perl-talk for "native code"). In cases where JSON::XS is not available, the shim module would use a pure Perl fallback, e.g. JSON::PP.
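As an aside, the shim can be told which backend to pick. A minimal sketch, assuming the PERL_JSON_BACKEND environment variable that the JSON module consults at load time (the script name here is made up):
# Force the pure Perl backend, either from the shell...
#   PERL_JSON_BACKEND=JSON::PP perl the-script.pl
# ...or from inside the program, before the JSON module is loaded:
BEGIN { $ENV{PERL_JSON_BACKEND} = 'JSON::PP' }
use JSON;
print JSON->backend, "\n";    # should now report JSON::PP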
Ok, so force the JSON module to dispatch to JSON::PP. Success! Problem reproduced. Guess it really was a buggy parser after all. Remove the DOS line endings, just to be sure... And it's still failing. WTF?
A bit more digging revealed that the error message was actually a lie. The problem wasn't with the whitespace, but with there being an end of file right after said whitespace. The input to JSON::PP contained just a single line, not the whole file! At that point, the actual problem becomes obvious and the fix trivial:
- my $json = decode_json read_file $file;
+ my $json = decode_json scalar read_file $file;
I was using the read_file function from File::Slurp to read the contents of the file. Unfortunately that function behaves differently in scalar and list contexts. In scalar context, it returns the contents of the file as a single string. In list context, an array of strings, one per line. What had to be happening was that the context was changing based on the backend.
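To make the difference concrete, here's roughly how the two contexts behave (a quick sketch; the file name is made up):
use File::Slurp qw(read_file);

my $text  = read_file('input.json');   # scalar context: the whole file as one string
my @lines = read_file('input.json');   # list context: one string per line
# With the list version, decode_json ends up looking at only the first line ("{").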
And just why would changing the parser backend change the context for that read_file call? As it happens, the JSON module does not actually define decode_json itself, but directly aliases it to the matching function in the backend. For example:
*{"JSON::decode_json"} = &{"JSON::XS::decode_json"};
JSON::XS declares the function with a $ prototype, forcing the argument to be evaluated in scalar context. JSON::PP uses no prototype, so the arguments default to being evaluated in list context.
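The effect is easy to demonstrate in isolation (a toy sketch, not the actual code from either module):
use strict;
use warnings;

# Report which context the sub was called in.
sub which_context { return wantarray ? 'list' : 'scalar' }

sub with_prototype ($) { return $_[0] }   # $ prototype: the argument gets scalar context
sub without_prototype  { return $_[0] }   # no prototype: arguments get list context

print with_prototype(which_context()), "\n";      # prints "scalar"
print without_prototype(which_context()), "\n";   # prints "list"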
The blame game
So, that's the bug. But what was the real culprit? I could come up with the following suspects.
- Me, for using File::Slurp for this in the first place. "Oh, I just always pass a file-handle to decode_json" said one coworker when I described this bug. And that would indeed have side-stepped the problem, and read_file is just saving a couple of lines of code. But it's exactly the couple of lines of code I don't want to be writing: pairing up file opens/closes, and boilerplate error handling (there's a sketch of exactly that right after this list).
- Me, for not realizing that the code was only working by accident. I knew read_file works differently in scalar and list contexts. I also knew this case needed scalar context, and had no special reason to believe that decode_json would provide it. The default assumption should have been for this code not to work. When it did, I should not have accepted it, but figured out why it worked and whether it was guaranteed to work in the future.
- The JSON module, for not explicitly documenting the inconsistent prototypes as part of the interface. I don't know that anyone would actually notice that in the documentation though. It might end up as just cover-your-ass documentation.
- The JSON module, for directly exposing the backend functions with aliasing, for a minimal performance gain. It's a shim: isn't the whole point to hide away the implementation differences from the user?
- The File::Slurp module, for using wantarray to switch the behavior of read_file based on the context.
- Perl, for having the concept of different contexts in the first place.
- Perl, for allowing random library code to detect different contexts via wantarray.
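For reference, this is roughly the boilerplate that read_file was saving me (a sketch, with deliberately minimal error handling):
use JSON qw(decode_json);

my $file = 'input.json';   # whatever path the script was given
open my $fh, '<', $file or die "Can't open $file: $!";
my $contents = do { local $/; <$fh> };   # undef $/ = slurp the whole file
close $fh or die "Can't close $file: $!";
my $json = decode_json($contents);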
The thing that really sticks out to me here is the overloading of File::Slurp::read_file based on the context. Returning a file as a single string vs. an array of lines are very different operations. There is absolutely no reason for them to share a name. It'd be simpler to implement, simpler to use, and simpler to document. It's even already in a library, so it's not like there would be any kind of namespace pollution from using different names. (Unlike for the uses of context-sensitive overloading in core Perl. Sure, count would probably make more sense than scalar grep. But it would be a new name in the global namespace.)
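Just to make the point concrete, here's the kind of interface I'd rather see: two explicitly named functions instead of one context-sensitive read_file. (Hypothetical names and code, not File::Slurp's actual implementation.)
sub read_file_as_string {
    my ($path) = @_;
    open my $fh, '<', $path or die "open $path: $!";
    local $/;                  # slurp mode: read everything in one go
    return scalar <$fh>;
}

sub read_file_as_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "open $path: $!";
    my @lines = <$fh>;         # always reads all the lines, no wantarray games
    return @lines;
}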
What about wantarray? It's what's enabling this bogus overloading in the first place. I've been using Perl for 20 years, writing some pretty hairy stuff. As far as I can remember, I haven't used wantarray once. And what's more, I don't remember ever using a library that used it to good effect. The reason context-sensitivity works in core Perl is the limited set of operations. One can reasonably learn the entire set of context-sensitive operations, and their (sometimes surprising) behavior. It's a lot less reasonable to expect people to learn this for arbitrary amounts of user code.
It's a bit unfortunate that function aliasing can cause action at a distance like this. But at least that's a feature with solid use cases.
So I think that's where I fall on this. It's all because of a horrible and mostly unnecessary language feature, used for particularly bad effect in a library. It feels like avoiding this kind of problem on the consumer side is almost impossible; it'd just require superhuman levels of attention to detail. Avoiding it on the producer side is really easy. wantarray: just say no.
Yeah, APIs should only use wantarray to emulate one of Perl’s regular behaviours, never to invent exciting new semantics. Even if a new semantic seems sorta intuitive, like in this example, it just never ends well. (E.g. there are some modules that will give you a list in list context but an array ref in scalar context… ugh.)