To Learn a New Language, Read Its Standard Library
If I was learning to read English as a foreign language,
I would need something simple to get started.
(from The Remarkable Story of Chicken Little, 1840)
The best way to learn a new programming language, just like a human language, is from example. To learn how to write code you first need to read someone else’s code. But who is the best person to learn from? Which code should we read? Where should we look to find it?
This year in my spare time I was learning about Crystal. I had played around with some simple scripts, but I wanted to learn more. Then I stumbled on to Crystal’s standard library. I was relieved to see that Crystal’s core classes are implemented using Crystal itself!
Crystal’s standard library is clear, simple, concise and well documented. Reading Crystal’s internal implementation of Array or Hash is like reading a fairy tale in a children’s book. Anyone can understand it, even people without a Ph.D. in Computer Science or systems programming experience.
Update: There was a long discussion on Hacker News about whether reading the standard library really is a good idea for various different languages.
At First Glance, Crystal Is Ruby
At first glance, when I read Crystal’s Array implementation, I thought I was reading a Ruby program:
class Array(T) include Indexable::Mutable(T) include Comparable(Array) # Size of an Array that we consider small to do linear scans or other optimizations. private SMALL_ARRAY_SIZE = 16 # The size of this array. @size : Int32 # The capacity of `@buffer`. # Note that, because `@buffer` moves on shift, the actual # capacity (the allocated memory) starts at `@buffer - @offset_to_buffer`. # The actual capacity is also given by the `remaining_capacity` internal method. @capacity : Int32 # Offset to the buffer that was originally allocated, and which needs to # be reallocated on resize. On shift this value gets increased, together with # `@buffer`. To reach the root buffer you have to do `@buffer - @offset_to_buffer`, # and this is also provided by the `root_buffer` internal method. @offset_to_buffer : Int32 = 0 # The buffer where elements start. @buffer : Pointer(T)
There are lots of familiar keywords, like class
, include
and private
. I also
see Ruby’s @
character indicating an instance variable. This code is about 100x
easier to read vs. Ruby’s own C implementation of
Array.
Along with the familiar Ruby-like syntax, notice the helpful comments. Even
though I’ve just started reading I can already make an educated guess at how
Crystal arrays function internally. I can see there’s a pointer to memory which
holds the array elements, and that the code keeps track of the capacity of this
memory along with the actual size of the array. Finally, reading the comment for
offset_to_buffer
I can imagine there are some optimizations related to adding
and removing elements. The comment is both helpful and intriguing.
But I’m not reading Ruby code. There are important differences here: generic
type syntax and most importantly each of the instance variables is declared
with a static type known at compile time. How do I use static types in Crystal?
What types are available? What about the generic type parameter T
? Should I
use that in my own Crystal code? What other syntax differences vs. Ruby are
there?
The best way to learn how to write Crystal code is simply to scroll down and read one of the Array methods.
Array#uniq
Here’s how Crystal finds the unique elements of an array:
def uniq if size <= 1 return dup end # Heuristic: for a small array it's faster to do a linear scan # than creating a Hash to find out duplicates. if size <= SMALL_ARRAY_SIZE ary = Array(T).new each do |elem| ary << elem unless ary.includes?(elem) end return ary end # Convert the Array into a Hash and then ask for its values to_lookup_hash.values end
The first three lines handle the trivial case of when an array is empty or contains only one element:
if size <= 1 return dup end
Obviously in this case, there are no duplicate elements and Array#uniq
should
simply return the original array. One important detail: Crystal uses dup
to
return a copy of the array. This reminds me that in Ruby uniq
returns a copy
of the receiver, while uniq!
mutates the receiver. My guess is that Crystal
implements Array methods in the same way…
The second passage is an optimization:
# Heuristic: for a small array it's faster to do a linear scan # than creating a Hash to find out duplicates. if size <= SMALL_ARRAY_SIZE ary = Array(T).new each do |elem| ary << elem unless ary.includes?(elem) end return ary end
For small arrays (16 or fewer elements) Crystal iterates over them and removes duplicates using a simple algorithm. I’ll take a look at how that works in a moment.
The final line of code handles arrays with 17 or more elements:
# Convert the Array into a Hash and then ask for its values to_lookup_hash.values
As you might guess, Crystal removes duplicate values from larger arrays using a hash. I'll dive into the details about how this works in my next post.
Arrays With 16 Or Fewer Elements
But first, let’s take a closer look at case #2 from above, when the array contains 16 or fewer elements. First, Crystal creates a new, empty array called ary:
ary = Array(T).new
Note the generic type syntax Array(T).new
. This tells the Crystal compiler
that the new array, what will become the return value from Array#uniq
, will
only contain elements of the same type as the original array.
Ruby developers will find the rest of this code easy to follow…
each do |elem| ary << elem unless ary.includes?(elem) end
Crystal calls each
to iterate over all the elements in the receiver, the
array we are calling uniq
on. Then using <<
, Crystal appends each of the
original array’s elements to the new array, unless the new array already
contains a given element.
Like Ruby, Crystal implements the includes?
method inside the Enumerable
module. Crystal arrays are enumerable because of the include Indexable::Mutable(T)
statement we read above. (Indexable::Mutable
includes
Indexable
which includes Enumerable
). You can find Crystal’s implementation
of includes?
(not include?
as in Ruby) in
enumerable.cr:
def includes?(obj) : Bool any? { |e| e == obj } end
Here the any?
method calls the given block once for each element in the
array, and returns true if the block returns true for any of the elements. In
other words, this code searches the array in a linear fashion, one element at a
time. Crystal’s development team has decided that it’s faster to filter out
repeated elements from small arrays by repeatedly searching the array using
linear scans. Since there are never more than 16 elements, those scans won’t
take too much time.
Simple and Concise
You might be thinking: This is an incredibly simple algorithm; anyone could have written this code! Why bother writing a blog post about this?
That’s exactly my point: This is simple and concise code. I could have written it - you could have also. There’s nothing superfluous, not an extra word here. Just enough code to get the job done. And there’s no noise… no macros, no odd C memory tricks, no weird bitwise mask operations. This is the kind of code I need to read now when I’m learning how to use Crystal. As a side benefit, I also get to learn how Crystal works internally.
But what happens for longer arrays, with 100s or 1000s of elements? How does Crystal remove duplicates from longer arrays efficiently? I'll take a look at how that works in my next post.