By Michael Strobl, PhD Student, University of Alberta
A common use of Cybera’s Rapid Access Cloud has been for classroom learning. Instructors from a number of Alberta institutions have used the cloud to run student assignments, teach the basics of cloud computing, or run modern-day analytics. This post highlights one such use case from the University of Alberta.
Introduction
For our third year computing science undergraduate course “Database Management Systems” (which ran in the 2016 winter term), the main course project was to develop a big data system. Students were required to use a NoSQL (Not Only SQL) database system running on the Cybera cloud as their computing infrastructure. NoSQL databases provide a different database architecture, and are well suited for hosting very large datasets that could not otherwise be handled by conventional SQL databases. In this project, students (working in three groups of three) were asked to develop a simple big data system for a wireless carrier to store a list of all call detail records (CDR) in a distributed database management system. They were also asked to conduct experimental evaluation of the SQL query performance on the distributed database system with different partitions of the data.
The system they were asked to build was required to:
- store the list of all CDRs as an online backup system for their mission-critical system
- evaluate the Online Analytical Processing (OLAP) queries
The students had free choice of a NoSQL system or their own implementation using Hadoop, but most settled on the NoSQL systems Apache Cassandra and RubatoDB — a NoSQL system developed at the University of Alberta.
Table Creation
We provided different table schemas with at least 100 columns that the students could use. Since the table was populated with a large number of rows, the students had to partition the table into:
- a list of node-partitions (to store the table into a list of distributed nodes)
- a list of column-partitions (to store the table as several tables to meet a number of conditions, ex. all partitioned tables must contain the same primary key)
Table Population
The table was populated with CDR data using these rules:
- the list of CDRs for each day should be generated according to the table schema
- the populated table must contain all the CDRs of the last N days such that the total size of the table must be at least 2TB (in terms of disk space occupied)
- all the values must be generated using a random number generator, and/or sequences
Query Evaluation
Based on the students’ understanding of telecommunication transactions, they were asked to specify five queries to retrieve information from the populated table, satisfying the following requirements:
- Four of the five queries will retrieve information from all distributed nodes
- One query must contain at least 10 conditions (atomic formulas) in the “where” clause
- Two queries must contain both the “group by” and “order by” clauses
- At least three queries are range queries, i.e., a query retrieving all rows where a column value is between an upper and lower boundary
- None of the queries are trivial, i.e., a simple keysearch
This included runtime experiments of the queries for two different partitions, such that at least one query ran faster on the second partition.
Computing infrastructure
Each of the three groups got access to a small cluster of three instances, with 4GB of memory and two CPUs each. This was sufficient to generate a fairly large dataset, create partitions and test the performance throughout the term. Towards the end, they got access to more powerful resources (three instances, eight CPUs and 16GB memory each) in order to evaluate the performance of the two different partitions on a generated dataset of at least 2TB.
Conclusion
Through the flexible configuration of the Cybera cloud, it was easily possible to adjust the system to our students’ needs as required. In both cases (using Apache Cassandra and RubatoDB), the students were able to configure the system properly and run their experiments according to the project specifications.
Using this setup, the students were able to gain a hands-on, real-world understanding of working with a cluster of machines for setting up, configuring, and querying NoSQL databases. This is particularly important when dealing with very large datasets of several terabytes in size, which is becoming increasingly common as more and more devices come online and create data. Students learned about optimizing database queries, and what applications may or may not be suitable for handling such datasets and challenges.
If you are interested in using the Rapid Access Cloud in your classroom, please contact Cybera’s cloud administration team.