Making shapeless practical
Recently, I’ve started using shapeless, first time for a practical task. During that I’ve met a series of obstacles, which required a lot of consultancy, source code browsing, wit and luck, hence I considered worth describing those and the way through them. I know that it is always useful to have a beginner’s feedback as it may guide the next beginners and suggest developers how to improve experience for a newcomers as with their knowledge it might be not so evident.
Implementing a simple clustering with open-source library Weka, I bumped into the need to initialize a set of Attribute
objects, set it up, cluster, and then to process after clustering. That was way painful to add or modify features, as there was four places in code that needed modification, leading to a reduced productivity and possible mistakes.
With the growth of the project, bringing more entities with its own sets of features, it menaced to be just unmanageable. Then I realised that it must be a perfect case for shapeless as it is exactly supposed to facilitate wider generalization with Scala.
I tried to express all the details to provide as much context as possible and to explain all the decisions made and possibilities considered. The post happened to be quite a longread, but I hope that I hit the base for providing a valuable piece of feedback.
Definitions
First, I want to define some helper functions, first to use newly created objects in the more functional way and the second is for basic resource management:
def make[T](p: => T)(s: T => Unit) = { val v = p; s(v); v }
def using[T <: { def close() }, R](p: => T)(s: T => R): R = {
val r = s(p); Try(p.close()); r
}
And as I was about to generalize clustering with weka.clusterers.SimpleKMeans
over any object and its features, I also defined cluster and its unique id, as follows:
case class Id[+T](value: String)
case class Cluster[T](id: Id[Cluster[T]], objects: Array[T], typicalObject: T)
where typicalObject
is the centroid (object with feature values most typical for a cluster, e.g. centre point), it is not one of the objects passed initially for clustering, but rather an interpolation generated in runtime.
As Weka return the “iterable” stuff as java.util.Enumeration
and receives weka.core.FastVector
, in order to use Scala collections combinators I introduced the following conversions:
object Weka {
// Makes Weka `FastVector` from Scala `Iterable` implicitly
implicit def iterableToFastVector[T](iterable: Iterable[T]): FastVector =
make(new FastVector(iterable.size)) { vector =>
iterable.foreach(vector.addElement)
}
// Makes Scala `Iterator` from `java.util.Enumeration`
def enumerate[T](enumeration: java.util.Enumeration[T]) = {
Iterator.continually {
enumeration.nextElement()
} takeWhile { _ =>
enumeration.hasMoreElements
}
}
}
Specialized algorithm
Initially my specialized algorithm, lets say on Example
class instances,
case class Example(
feature1: Option[Long],
feature2: Option[Double],
feature3: Option[String],
feature4: Option[Long],
feature5: Option[String],
feature6: Option[Boolean])
had method signature pretty much like that,
def kMeans(objectsList: List[Example]): Seq[Cluster[Example]]
Which returned Nil
for empty objectsList
and for non-empty did the following:
Collected some statistics of the data input to determine the number of clusters and possible values of the discrete features (in this case — of types String
and Boolean
),
val feature3Values = objects.flatMap(_.feature3).distinct
val feature5Values = objects.flatMap(_.feature5).distinct
val feature6Values = objects.flatMap(_.feature6).distinct.map(_.toString)
val cardinality = feature3Values.size + feature5Values.size + feature6Values.size
val clusterGrowthFactor = 12
val clusterAmount = math.min(cardinality/clusterGrowthFactor, objects.size-1)
After that, created the feature vector:
val features = Seq(
new Attribute("feature1", Seq()),
new Attribute("feature2", Seq()),
new Attribute("feature3", feature3Values),
new Attribute("feature4", Seq()),
new Attribute("feature5", feature5Values),
new Attribute("feature6", feature6Values)
)
Next, initialised the algorithm: create SimpleKMeans
object, fill in the instances that will be clustered with the actual features values with the specific conversions for each feature type, and provide options (e.g. clusterAmount
etc):
val kmeans = make(new SimpleKMeans()) { clusterer =>
val instances = make(new Instances("objects",
iterableToFastVector(features), objects.size)) { instances =>
objects foreach { obj =>
instances.add(make(new Instance(features.size)) { example =>
if(obj.feature1.isDefined)
example.setValue(features(0), obj.feature1.get.toDouble)
else example.setMissing(features(0))
if(obj.feature2.isDefined)
example.setValue(features(1), obj.feature2.get)
else example.setMissing(features(1))
if(obj.feature3.isDefined)
example.setValue(features(2), obj.feature3.get)
else example.setMissing(features(2))
if(obj.feature4.isDefined)
example.setValue(features(3), obj.feature4.get.toDouble)
else example.setMissing(features(3))
if(obj.feature5.isDefined)
example.setValue(features(4), obj.feature5.get)
else example.setMissing(features(4))
if(obj.feature6.isDefined)
example.setValue(features(5), obj.feature6.get.toString)
else example.setMissing(features(5))
})
}
}
clusterer.setNumClusters(clusterAmount)
clusterer.setPreserveInstancesOrder(true)
clusterer.buildClusterer(instances)
}
Then, we are launching an algorithm and grouping objects by cluster:
val centroids = enumerate(kmeans.getClusterCentroids.enumerateInstances())
val objectsByCluster = kmeans.getAssignments.zip(objects).groupBy(_._1)
Eventually, Cluster
objects are created and centroids are extracted:
val clusters = centroids.zipWithIndex map {
case (instance, index) =>
Cluster[Example](
Id(index.toString),
objectsByCluster.getOrElse(index, Array[(Int, Example)]()).map(_._2),
typicalObject = Example(
Option(instance.asInstanceOf[Instance].value(0).toLong),
Option(instance.asInstanceOf[Instance].value(1)),
Option(instance.asInstanceOf[Instance].stringValue(3)),
Option(instance.asInstanceOf[Instance].value(4).toLong),
Option(instance.asInstanceOf[Instance].stringValue(5)),
Option("true".equals(instance.asInstanceOf[Instance].stringValue(6))))
)
} toSeq
Option(clusters).filterNot(_.isEmpty && objects.isEmpty) getOrElse {
List(Cluster[Example](Id("0"), objects.toArray, objects.head))
}
With the fallback in the very end, we are done. As it might be noticed above, there are four places where features are enumerated explicitly with some type dependent treatment, those are:
- Taking the distinct values;
Attribute
s creation;- Filling in the
Instance
s; - Centroid extraction.
It is clear that it must be painful for different types of input objects and non-trivial how to generalize it in a way to preserve type safety (unlike reflection and dynamic typing).
General algorithm
Well, seems like shapeless is to the rescue! To preserve type-safety while iterating over the case class fields HList
is definitely the right abstraction to work with and Generic
to convert case classes to HList
, back and forth. Actually, I took LabelledGeneric
to convert case classes to “records”, which are nothing more that HList
s with values tagged with field names (read further to some intuitive explanation of that). I wanted to use my case class field names as the Weka Attribute
names, instead of generated ones for the better debugging capabilities.
So, I created generic and converted all of my objects to the records,
val objectGeneric = new LabelledGeneric[Example]
val generics = objects.map(objectGeneric.to)
After that I needed to generalize to exclude direct Example
class reference, but let’s omit this for simplicity for a while.
Well, it is time to calculate statistics for the data. The idea is to make an HList
of List
with at most one value (when it presented in the input object) for each record and the to fold
those records, calculating union for its feature values. Soo, I was needed a few combinators for that. First obstacle was that there is no combinators defined at shapeless.HList
and I found no reference docs of which combinators are present. Hm, after some search through the source code and googling I found that combinators for records, hlists, nat etc are provided implicitly (for HList
s with shapeless.HListOps
) and are placed at shapeless.ops._
. Thus, importing,
import shapeless.ops.hlist._
Well, I checked that and found that there are map
and zip
presented. Fine, the algorithm might look like,
val distinctValues = generics.map(_.map(getDistinctValues))
.reduce((dv1, dv2) => (dv1 zip dv2).map(combineValues))
Ok, but how to define function that maps over the HList
? HList
might contain values of different types, e.g. [String, Long, Boolean]
. Therefore, function to preserve type-safety must have different implementations for different types of values returning yet different types. Scala functions definitely don’t provide this. That is why polymorphic functions are helpful. I can’t remember where I’ve seen this picture, but it explains how they work pretty well:
To define a Poly
there should be an object created with different functions for different argument types, where the best match for arguments is found with implicit search. Thus, for getDistinctValues
we are having those four options for Double
, Boolean
, String
, Long
for fields of different types:
object getDistinctValues extends Poly1 {
implicit def caseFieldDouble [K] = at[FieldType[K, Option[Double]]] { field => … }
implicit def caseFieldBoolean[K] = at[FieldType[K, Option[Boolean]]] { field => … }
implicit def caseFieldString [K] = at[FieldType[K, Option[String]]] { field => … }
implicit def caseFieldLong [K] = at[FieldType[K, Option[Long]]] { field => … }
}
The next issue for me was that the field
does not contain a key
. Thus, no way to use key value during the mapping of this field. This happens as the keys are tags and are available at the compile time only. Because of this, runtime evaluation of the field object haven’t shown any presence of the key. Solution came from my colleague, he explained that I need to define an implicit parameter of type Witness.Aux[K]
for my functions that will provide me the key value. Key comes literally out of nowhere and I found neither list nor explanation of these Aux
es. So it is not clear what functionality you could get out of those, and you need to browse different source code packages in order to collect different Aux
implementation and get some understanding. At least I’ve got that working, so the getDistinctValues
and combineValues
look like following,
object getDistinctValues extends Poly1 {
implicit def caseFieldDouble [K](implicit wk: Witness.Aux[K]) =
at[FieldType[K, Option[Double]]] { field => wk.value -> List[Double]() }
implicit def caseFieldBoolean[K](implicit wk: Witness.Aux[K]) =
at[FieldType[K, Option[Boolean]]] { field => wk.value -> field.toList }
implicit def caseFieldString [K](implicit wk: Witness.Aux[K]) =
at[FieldType[K, Option[String]]] { field => wk.value -> field.toList }
implicit def caseFieldLong [K](implicit wk: Witness.Aux[K]) =
at[FieldType[K, Option[Long]]] { field => wk.value -> List[Long]() }
}
object combineValues extends Poly1 {
implicit def caseAny[K,V] = at[((K, List[V]), (K, List[V]))](fields =>
fields._1._1 -> (fields._1._2 ++ fields._2._2).distinct)
}
Poly combineValues
has just one case as getDistinctValues
produces the same result type List
for all arguments. In this case a simple monomorphic might be enough. However, as we are continuing to work with HList
s we still need to define a Poly
. That is just concatenating two lists of different records and excluding the repetitive values.
After that we came to the statistics calculation, that is pretty straitforward,
val cardinality = distinctValues.map(size).toList.reduce(_ + _)
val clusterAmount = math.min(cardinality/clusterGrowthFactor, objects.size - 1)
except that, again, we need to define custom Poly
function size
as we are working with HList
s (that might be quite useful to have an implicit from usual monomorphic functions to a polymorphic with just one case, if that is possible at all):
object size extends Poly1 {
implicit def caseList[K,V] = at[(K, List[V])](_._2.size)
}
So far so good, first part is done! And we are going to build features. There are no new conceptions here, just one new Poly
named createAttribute
(key values are already available in runtime due to the previous mapping):
val features = distinctValues.map(createAttribute)
object createAttribute extends Poly1 {
implicit def caseDouble [K] = at[(K, List[Double])] { values =>
new Attribute(values._1.toString) }
implicit def caseBoolean[K] = at[(K, List[Boolean])] { values =>
new Attribute(values._1.toString, values._2.map(_.toString)) }
implicit def caseString [K] = at[(K, List[String])] { values =>
new Attribute(values._1.toString, values._2) }
implicit def caseLong [K] = at[(K, List[Long])] { values =>
new Attribute(values._1.toString, Seq()) }
}
Ok, we’ve come to the configuration of the clustering algorithm. Now we need to transform the HList
to some regular list to feed it to the Instances
‘ constructor not loosing the type information. For that purpose unify
function is very handy. With toList
it will produce List
of the least upper bounds (List[Attribute]
in our case):
def makeInstances = new Instances("objects", features.unify.toList, objects.size)
And after that we need to add instance to this object. Instance creation requires the number of features in this instance, but features
is an HList
and its length
method returns Nat
which is a peano number available in compile time and it is not the easiest thing to convert it to an Int
. Fortunately there is runtimeLength
method on HList
implemented doing just this!
To fill instance with input object data, we’re going to combine the newly created instance
, features
HList and specific input object
. Features and object combined smoothly as both are HList
s and have zip
. Although, passing instance
to the Poly
filing that instance was tricky for me, because it is the same object to all the features. I’ve tried to pass it as an implicit parameter, but had some mysterious compiler errors. Currying on Poly
s is also tricky as those are objects and not accepting constructor parameters. Finally, I found method zipConst
that matched best. Eventually, algorithm initialization code it as follows,
val kmeans = make(new SimpleKMeans()) { clusterer =>
val instances = make(makeInstances) { instances =>
objects foreach { obj =>
instances.add(make(new Instance(features.runtimeLength)) { instance =>
features.zip(objectGeneric.to(obj)).zipConst(instance).map(fillInstance)
})
}
}
clusterer.setNumClusters(clusterAmount)
clusterer.setPreserveInstancesOrder(true)
clusterer.buildClusterer(instances)
}
Implementation for fillInstance
tackles with the Java API that actually setting value to instance by producing an effect. To mimic functional behaviour the instance filled with value returned from the fillInstance
in fve._2
. There is again different treatment for different types, the only difference though, lies in the value conversion before passing it to setValue
method:
object fillInstance extends Poly1 {
implicit def caseDouble[K] =
at[((Attribute, FieldType[K, Option[Double]]), Instance)] { fve =>
if(fve._1._2.isDefined) fve._2.setValue(fve._1._1, fve._1._2.get)
else fve._2.setMissing(fve._1._1); fve._2
}
implicit def caseBoolean[K] =
at[((Attribute, FieldType[K, Option[Boolean]]), Instance)] { fve =>
if(fve._1._2.isDefined) fve._2.setValue(fve._1._1, fve._1._2.get.toString)
else fve._2.setMissing(fve._1._1); fve._2
}
implicit def caseString[K] =
at[((Attribute, FieldType[K, Option[String]]), Instance)] { fve =>
if(fve._1._2.isDefined) fve._2.setValue(fve._1._1, fve._1._2.get)
else fve._2.setMissing(fve._1._1); fve._2
}
implicit def caseLong[K] =
at[((Attribute, FieldType[K, Option[Long]]), Instance)] { fve =>
if(fve._1._2.isDefined) fve._2.setValue(fve._1._1, fve._1._2.get.toDouble)
else fve._2.setMissing(fve._1._1); fve._2
}
}
Launching the clustering algorithm is all the same,
val centroids = enumerate(kmeans.getClusterCentroids.enumerateInstances())
val objectsByCluster = kmeans.getAssignments.zip(objects).groupBy(_._1)
And by now building clusters is what it is left. First, let’s remember that there should be centroid constructed for each cluster. That is an instance of the input object class (e.g. Example
). I know of no ways in Scala to model polymorphic class constructors except providing a factory method, which is a burden for the algorithm api client, though I found a trick to avoid that burden. As input objects list is not empty (it is checked by the condition at the very beginning), there is a head, and it might be used as an initial object to map it to HList
, fill with values from the corresponding instance
and convert back to the input object class (e.g. Example
).
To obtain a value from the instance
an index of a feature is needed, that is why I was searching for zipWithIndex
on HList
and it wasn’t there. Great help came from the community. Travis Brown implemented that and posted on Stack Overflow.
Applying this zipWithIndex
on reference
and adding instance
with zipConst
, as mentioned above, we are able to apply Poly
named fitWithInstanceData
producing record for the instance
and map back to Example
with objectGeneric.from
:
val reference = objectGeneric.to(objects.head)
val typicalObject = objectGeneric.from(
zipWithIndex(reference)
.zipConst(instance.asInstanceOf[Instance])
.map(fitWithInstanceData))
object fitWithInstanceData extends Poly1 {
implicit def caseDouble[K, N <: Nat: ToInt] =
at[((FieldType[K, Option[Double]], N), Instance)] { fv =>
field[K](Option(fv._2.value(toInt[N]))) }
implicit def caseBoolean[K, N <: Nat: ToInt] =
at[((FieldType[K, Option[Boolean]], N), Instance)] { fv =>
field[K](Option("true".equals(fv._2.stringValue(toInt[N])))) }
implicit def caseString[K, N <: Nat: ToInt] =
at[((FieldType[K, Option[String]], N), Instance)] { fv =>
field[K](Option(fv._2.stringValue(toInt[N]))) }
implicit def caseLong[K, N <: Nat: ToInt] =
at[((FieldType[K, Option[Long]], N), Instance)] { fv =>
field[K](Option(fv._2.value(toInt[N]).toLong)) }
}
There is one implementation detail worth mentioning. As I was mapping the records processing the fields, I was needed to construct fields back with the same keys but different values. Receiving values with Witness.Aux[K]
and then using ->>
operator haven’t helped as the value loses type information and it is not provable that it is the same K
. Fortunately, I was making the finishing touches during Scala Days ’15, so I was able to ask Miles. He suggested using some hardly noticeable function in the source code, namely field[K]
. That was exactly what I was searching for. Unfortunately, there is no trace of this function description in the documentation.
Well, now we are picking that typicalObject
and mapping the centroids
returned by KMeans to the Cluster
instances:
val clusters = centroids.zipWithIndex map { case (instance, index) =>
Cluster[T](
Id(index.toString),
objectsByCluster.getOrElse(index, Array[(Int, T)]()).map(_._2),
typicalObject)
} toSeq
Weka gives no clusters for a single object, hence we are providing fallback for uniformity, the final result is, as follows,
Option(clusters).filterNot(_.isEmpty && objects.isEmpty) getOrElse {
List(Cluster[T](Id("0"), objects.toArray, objects.head))
}
Well! That must work fine! But remember, as I mentioned earlier, LabelledGeneric
was created with specific Example
class parameter. It is time to generalize over that, along with the necessity to have an enclosure (function and class) for the code aforementioned.
Enclosure and manual type inference
It is impossible to create LabelledGeneric[T]
with regular constructor, like that:
val objectGeneric = new LabelledGeneric[T]
It might be obtained just as an implicit parameter of type LabelledGeneric.Aux[T,R]
. Along with that, new type parameter R
(for “representation”) is required. We must declare this and at least specify that it is going to be an HList
(thanks to xeno_by for clarification). Thus, complete enclosure for the algorithm looks like,
class KMeans[T : ClassTag, R <: HList]
(objectsList: List[T], clusterGrowthFactor: Int = 12)
(implicit objectGeneric: LabelledGeneric.Aux[T, R]) {
def clusters: Seq[Cluster[T]] = objectsList match {
case Nil => Nil
case objects =>
...
}
}
Then it happens that type R
doesn’t contain any information on whether this HList
is able to do something, e.g. be mapped (Mapper.Aux
), zipped (Zip.Aux
), asked of length (Length.Aux
) etc. Each combinator on that needs some kind of a special evidence, that might be received as implicit parameters and its evidence types all must be enumerated at the type parameters. There must be as many type parameters, as many intermediate steps in the algorithm and all those must be commutated well (see the code for clarification on what I mean by the “commutation”). Some combinators require also specific type parameters, like Nat
or, for example Mapper.Aux
required types of the functions that are going to map an HList
, hence all the Poly
definitions where moved to a separate object Functions
and the code turned to the following,
class KMeans[T : ClassTag, R <: HList, R2 <: HList, R3 <: HList, R4 <: HList,
R5 <: HList, R6 <: HList, R7 <: HList, R8 <: HList, R9 <: HList,
R10 <: HList, R11 <: HList, R12 <: HList, L <: Nat]
(objectsList: List[T], clusterGrowthFactor: Int = 12)
(implicit objectGeneric: LabelledGeneric.Aux[T, R],
gdvm: Mapper.Aux[Functions.getDistinctValues.type, R, R2],
dvz: Zip.Aux[R2 :: R2 :: HNil, R3],
cvm: Mapper.Aux[Functions.combineValues.type, R3, R2],
sm: Mapper.Aux[Functions.size.type, R2, R4],
vtt: ToTraversable.Aux[R4, List, Int],
cam: Mapper.Aux[Functions.createAttribute.type, R2, R5],
fu: Unifier.Aux[R5, R6],
uft: ToTraversable.Aux[R6, List, Attribute],
len: Length.Aux[R, L],
fz: Zip.Aux[R5 :: R :: HNil, R7],
fgez: ZipConst.Aux[Instance, R7, R10],
gem: Mapper.Aux[Functions.fillExample.type, R10, R11],
rng: Range.Aux[nat._0, L, R8],
rfz: Zip.Aux[R :: R8 :: HNil, R9],
riz: ZipConst.Aux[Instance, R9, R12],
rfm: Mapper.Aux[Functions.fitWithInstanceData.type, R12, R]) {
def clusters: Seq[Cluster[T]] = objectsList match {
case Nil => Nil
case objects =>
...
}
}
As you may imagine, it was a way tough to implement that. I had plenty of time to think when was matching those R
’s together. And all that time I was thinking that I am doing the compiler’s job, namely type inference. I have no clear idea of how to return that back to the compiler, but hopefully there is one, maybe macros are to the rescue?
Conclusion
It happened that for my single entity with a bunch of fields there was more of this general code than I had specific code. That was peculiar with the shapeless’ promise to scrap down the boilerplate. One idea I’ve came up with is similar to the Expression problem: there is a trade-off between general and specialized code. Writing specialized code is more laconic when algorithms are complex and there are just a few of data structures, and more verbose when algorithm is simple and there are plenty of the data structures, general code is just the opposite. With specialised code you’re to tackle more with each of the data field (data), and with general with each algorithm step (codata). I would call that “Generality problem”. For optimum, code generality must gravitate to the Data-Code ratio, and shapeless is the way to push that ratio.
Regarding shapeless, it is obviously cool library but lacks some sufficient documentation. Either reference of all the available combinators and Aux
es, or some “concepts” paper with description of the core ideas and related source code packages to browse. Otherwise users will be doomed to learn all of its implementation details even for the simple cases. What, according to Dick Wall, makes using of a library pointless. Effort to write such a paper seems feasible and maybe I’ll do that myself at some point in the future.
All the source code from this post might be found at the github project.
Hopefully, that post was useful.