A few hints about Scala sequences

This blog post is inspired by a cool talk “The Dark Side of Scala” given by Tomek Nurkiewicz on Scalar conference. I’m going to focus on one particular problem he mentioned – confusing Scala sequence types like Seq, IndexedSeq, Traversable, List, Vector and others.
As non-sequential types like Map or Set are pretty straighforward to use, let’s put them aside for now.
All the information gathered in this article are mostly a summary of different discussions I found on the Internet, especially these two threads:

http://stackoverflow.com/questions/11702798/scala-guidelines-on-return-type-when-prefer-seq-iterable-traversable
http://stackoverflow.com/questions/6928327/when-should-i-choose-vector-in-scala

My point is to put it all in a form of simple Q&A which can serve as a cheat sheet for most non-exotic cases.

Q: What type should my API accept as input?
Answer: As general as possible. In most cases this will be Traversable <- Seq <- List.
Explanation: We want our API consumers to be able to call our code without needing them to convert types. If our function takes a Traversable, the caller can put almost any type of collection. This is usually sufficient if we just map(), fold(), head(), tail(), drop(), find(), filter() or groupBy(). In case you want to use length(), make it a Seq. If you need to efficiently prepend with ::, use a List.

Q: What type should my API return?
Answer: As specific as possible. Typically you want to return a List, Vector, Stack or Queue.
Explanation: This will not put any constraints on your API consumers but will allow them to eventually process returned data in optimal way and worry less about conversions.

Q: Should I use List or Vector?
Answer: You most probably want a Vector, unless your algorithm can be expressed only using ::, head and tail, then use a List.
Explanation: Some people compare List vs Vector in Scala to LinkedList vs ArrayList in Java. This is partially OK, because:

  • Scala Vector is a collection with good random access (like java.util.ArrayList)
  • Scala List is indeed a linked list with very fast append-first operation (::), but the links are unidirectional (for bi-directional use DoubleLinkedList).

However, Scala Vector has a very effective iteration implementation and, comparing to List is faster in traversing linearly (which is weird?), so unless you are planning to do quite some appending, Vector is a better choice. Additionally, Lists don’t work well with parallel algorithms, it’s hard to break them down into pieces and put together in an efficient way.

Q: What about other traits like IndexedSeq, LinearSeq, GenTraversable, TraversableOnce, Iterable or IterableLike?
Answer
: In many cases you don’t need to refer specifically to these types.
Explanation: Most of these types reveal some additional information about underlying implementation which may be important if your code is really performance-critical. Iterable may be familiar from Java world and usable when you really need to use an iterator with state (which is not really a functional apporach). I encourage you to not dig into other types unless you are not satisfied with your current performance and want to squeeze out some more.

 

Advertisements

2 thoughts on “A few hints about Scala sequences

  1. Maciej

    My 2¢
    – I’d rather return IndexedSeq (or even Seq if I don’t expect the client to use random access) instead of Vector. I don’t see why I should tie myself to concrete class impl.
    – note that Java LinkedList are double linked. AFAIR Scala’s List is faster than LinkedList. It also, of course, consumes much 1/3 less memory (sans the size of objects it refers to)
    – access the .par() collections wisely. I’d rather not design for using them unless there is a very specific need to do so.
    – ListBuilder, SetBuilder et al are also something you might want to consider for performance optimization

    Reply
  2. Dzmitry Lazerka

    > Vector comparing to List is faster in traversing linearly (which is weird?)
    It’s not weird at all, same way why Java ArrayList is faster to traverse than LinkedList. The trick is that OS usually loads _pages_ of RAM into CPU cache, not just one address value. That way if we traverse over an array with consecutive memory addresses, most of the time CPU finds it in cache.
    > Q: What type should my API return? Answer: As specific as possible.
    It depends. If your API implementation is not going to change, then yes. But for most live projects, implementation will be changed some day. For example, you accepted Traversable originally, and now you want to add a logging message of how much data caller has given, so you now need .size(). Or you’re optimizing algorithm, and it will be possible only by knowing .size(). I’m cannot give an answer here, except that “it depends”, so I’d suggest to use your own judgment every time.

    Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s