Scaling up Inference in MLNs with Spark


Typically, inference algorithms for big data address non-relational data. However, much real-world data, such as social network data and healthcare data, is relational in nature. Therefore, we need more powerful techniques that can scale up richer inference algorithms on relational data. Markov Logic Networks (MLNs) are arguably one of the most popular statistical relational models and can represent complex, uncertain knowledge succinctly. In this paper, we scale up inference algorithms for MLNs to big relational data. The probabilistic graphical model underlying an MLN is typically extremely large even for small problems, and performing inference on this model is highly challenging. A predominant approach to improving scalability is lifted inference, which avoids constructing the full graphical model underlying the MLN. Instead, lifted inference uses symmetries in the distribution to reduce the size of the model. A popular way to perform lifting uses clustering techniques to group together variables with similar distributional characteristics. However, for big relational data, identifying these symmetries quickly becomes infeasible. In this paper, we design a novel lifted inference system built on top of Spark that takes advantage of parallelism to identify symmetries in the MLN. Our work thus unifies advances in inference for relational data with advances in big data processing technologies. Utilizing the power of Spark, we show that we can perform more accurate inference and scale up relational inference to datasets that are orders of magnitude larger than those currently handled by state-of-the-art MLN systems.
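
To make the idea of lifting by grouping similar variables concrete, the following is a minimal sketch in plain Python. It is not the system described in the paper: it substitutes exact matching of evidence signatures for a real clustering technique, and the evidence facts, the function names `signature_pairs` and `cluster_constants`, and the choice of signature are all assumptions made for illustration. The map and reduce steps are written separately because they are the kind of per-fact and per-key operations that a framework such as Spark could run in parallel.

```python
from collections import Counter, defaultdict

# Hypothetical toy evidence: (predicate, arguments) facts assumed to be true.
evidence = [
    ("Smokes", ("Ann",)),
    ("Smokes", ("Bob",)),
    ("Friends", ("Ann", "Cat")),
    ("Friends", ("Bob", "Dan")),
    ("Friends", ("Cat", "Ann")),
    ("Friends", ("Dan", "Bob")),
]


def signature_pairs(fact):
    """Map step: emit (constant, (predicate, argument position)) for one fact."""
    predicate, args = fact
    return [(constant, (predicate, position))
            for position, constant in enumerate(args)]


def cluster_constants(facts):
    """Group constants whose evidence signatures are identical."""
    # Reduce step: count, per constant, how often it fills each predicate slot.
    counts = defaultdict(Counter)
    for fact in facts:
        for constant, slot in signature_pairs(fact):
            counts[constant][slot] += 1
    # Assumption: constants with equal signatures are treated as interchangeable.
    clusters = defaultdict(list)
    for constant, signature in counts.items():
        clusters[frozenset(signature.items())].append(constant)
    return sorted(sorted(members) for members in clusters.values())


clusters = cluster_constants(evidence)
print(clusters)
```

In this toy example the four constants collapse into two groups, so a lifted model would need to reason about two representative objects rather than four. Exact signature matching is only a stand-in here; an approximate clustering would merge constants whose signatures are merely similar.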

Publication Title

Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018