Must access to scala.collection.immutable.List and

2020-02-23 06:36发布

问题:

I'm going through Learning Concurrent Programming in Scala, and encountered the following:

In current versions of Scala, however, certain collections that are deemed immutable, such as List and Vector, cannot be shared without synchronization. Although their external API does not allow you to modify them, they contain non-final fields.

Tip: Even if an object seems immutable, always use proper synchronization to share any object between the threads.

From Learning Concurrent Programming in Scala by Aleksandar Prokopec, end of Chapter 2 (p.58), Packt Publishing, Nov 2014.

Can that be right?

My working assumption has always been that any internal mutability (to implement laziness, caching, whatever) in Scala library data structures described as immutable would be idempotent, such that the worst that might happen in a bad race is work would be unnecessarily duplicated. This author seems to suggest correctness may be imperiled by concurrent access to immutable structures. Is that true? Do we really need to synchronize access to Lists?

Much of my transition to an immutable-heavy style has been motivated by a desire to avoid synchronization and the potential contention overhead it entails. It would be an unhappy big deal to learn that synchronization cannot be eschewed for Scala's core "immutable" data structures. Is this author simply overconservative?

Scala's documentation of collections includes the following:

A collection in package scala.collection.immutable is guaranteed to be immutable for everyone. Such a collection will never change after it is created. Therefore, you can rely on the fact that accessing the same collection value repeatedly at different points in time will always yield a collection with the same elements.

That doesn't quite say that they are safe for concurrent access by multiple threads. Does anyone know of an authoritative statement that they are (or aren't)?

回答1:

It depends on where you share them:

  • it's not safe to share them inside scala-library
  • it's not safe to share them with Java-code, reflection

Simply saying, these collections are less protected than objects with only final fields. Regardless that they're same on JVM level (without optimization like ldc) - both may be fields with some mutable address, so you can change them with putfield bytecode command. Anyway, var is still less protected by the compiler, in comparision with java's final, scala's final val and val.

However, it's still fine to use them in most cases as their behaviour is logically immutable - all mutable operations are encapsulated (for Scala-code). Let's look at the Vector. It requires mutable fields to implement appending algorithm:

private var dirty = false

//from VectorPointer
private[immutable] var depth: Int = _
private[immutable] var display0: Array[AnyRef] = _
private[immutable] var display1: Array[AnyRef] = _
private[immutable] var display2: Array[AnyRef] = _
private[immutable] var display3: Array[AnyRef] = _
private[immutable] var display4: Array[AnyRef] = _
private[immutable] var display5: Array[AnyRef] = _

which is implemented like:

val s = new Vector(startIndex, endIndex + 1, blockIndex)
s.initFrom(this) //uses displayN and depth
s.gotoPos(startIndex, startIndex ^ focus) //uses displayN
s.gotoPosWritable //uses dirty
...
s.dirty = dirty

And s comes to the user only after method returned it. So it's not even concern of happens-before guarantees - all mutable operations are performed in the same thread (thread where you call :+, +: or updated), it's just kind of initialization. The only problem here is that private[somePackage] is accessible directly from Java code and from scala-library itself, so if you pass it to some Java's method it could modify them.

I don't think you should worry about thread-safety of let's say cons operator. It also has mutable fields:

final case class ::[B](override val head: B, private[scala] var tl: List[B]) extends List[B] {
  override def tail : List[B] = tl
  override def isEmpty: Boolean = false
}

But they used only inside library methods (inside one-thread) without any explicit sharing or thread creation, and they always return a new collection, let's consider take as an example:

override def take(n: Int): List[A] = if (isEmpty || n <= 0) Nil else {

    val h = new ::(head, Nil)
    var t = h
    var rest = tail
    var i = 1
    while ({if (rest.isEmpty) return this; i < n}) {
      i += 1
      val nx = new ::(rest.head, Nil)
      t.tl = nx //here is mutation of t's filed 
      t = nx
      rest = rest.tail
    }
    h
}

So here t.tl = nx is not much differ from t = nx in the meaning of thread-safety. They both are reffered only from the single stack (take's stack). Althrought, if I add let's say someActor ! t (or any other async operation), someField = t or someFunctionWithExternalSideEffect(t) right inside the while loop - I could break this contract.


A little addtion here about relations with JSR-133:

1) new ::(head, Nil) creates new object in the heap and puts its address (lets say 0x100500) into the stack(val h =)

2) as long as this address is in the stack, it's known only to the current thread

3) Other threads could be involved only after sharing this address by putting it into some field; in case of take it has to flush any caches (to restore the stack and registers) before calling areturn (return h), so returned object will be consistent.

So all operations on 0x100500's object are out of scope of JSR-133 as long as 0x100500 is a part of stack only (not heap, not other's stacks). However, some fields of 0x100500's object may point to some shared objects (which might be in scope JSR-133), but it's not the case here (as these objects are immutable for outside).


I think (hope) the author meant logical synchronization guarantees for library's developers - you still need to be careful with these things if you're developing scala-library, as these vars are private[scala], private[immutable] so, it's possible to write some code to mutate them from different threads. From scala-library developer's perspective, it usually means that all mutations on single instance should be applied in single thread and only on collection that invisible to a user (at the moment). Or, simply saying - don't open mutable fields for outer users in any way.

P.S. Scala had several unexpected issues with synchronization, which caused some parts of the library to be surprisely not thread-safe, so I wouldn't wonder if something may be wrong (and this is a bug then), but in let's say 99% cases for 99% methods immutable collections are thread safe. In worst case you might be pushed from usage of some broken method or just (it might be not just "just" for some cases) need to clone the collection for every thread.

Anyway, immutability is still a good way for thread-safety.

P.S.2 Exotic case which might break immutable collections' thread-safety is using reflection to access their non-final fields.


A little addition about another exotic but really terrifying way, as it pointed out in comments with @Steve Waldman and @axel22 (the author). If you share immutable collection as member of some object shared netween threads && if collection's constructor becomes physically (by JIT) inlined (it's not logically inlined by default) && if your JIT-implementation allows to rearrange inlined code with normal one - then you have to synchronize it (usually is enough to have @volatile). However, IMHO, I don't believe that last condition is a correct behaviour - but for now, can't neither prove nor disprove that.



回答2:

In your question you are asking for an authoritative statement. I found the following in "Programming in Scala" from Martin Odersky et al: "Third, there is no way for two threads concurrently accessing an immutable to corrupt its state once it has been properbly constructed, because no thread can change the state of an immutable"

If you look for example at the implementation you see that this is followed in the implementation, see below.

There are some fields inside vector which are not final and could lead to data races. But since they are only changed inside a method creating a new instance and since you need an Synchronization action to access the newly created instance in different threads anyway everyting is fine.

The pattern used here is to create and modify an object. Than make it visible to other threads, for example by assigning this instance to a volatile static or static final. And after that make sure that it is not changed anymore.

As an Example the creation of two vectors:

  val vector = Vector(4,5,5)
  val vector2 =  vector.updated(1, 2);

The method updated uses the var field dirty inside:

private[immutable] def updateAt[B >: A](index: Int, elem: B): Vector[B] = {
    val idx = checkRangeConvert(index)
    val s = new Vector[B](startIndex, endIndex, idx)
    s.initFrom(this)
    s.dirty = dirty
    s.gotoPosWritable(focus, idx, focus ^ idx)  // if dirty commit changes; go to new pos and prepare for writing
    s.display0(idx & 0x1f) = elem.asInstanceOf[AnyRef]
    s
  }

but since after creation of vector2 it is assigned to a final variable: Bytecode of variable declaration:

private final scala.collection.immutable.Vector vector2;

Byte code of constructor:

61  invokevirtual scala.collection.immutable.Vector.updated(int, java.lang.Object, scala.collection.generic.CanBuildFrom) : java.lang.Object [52]
64  checkcast scala.collection.immutable.Vector [48]
67  putfield trace.agent.test.scala.TestVector$.vector2 : scala.collection.immutable.Vector [22]

Everything is o.k.