Introduction to Cassandra Query Language (CQL) – Part 3

In this part, I will explain important CQL commands. In previous parts (part1 and part2) I wrote about basic CQL concepts and data types. If you have read previous parts, you know what a keyspace or table is and how to define them.

As I said in the first part, there are two categories of CQL commands:

  1. Data Definition commands: Used to define or manipulate data structure and schema of the data
  2. Data Manipulation commands: Used to read or write data.

The first group is explained in this post. I will explain the second group in a future post.

Data Definition commands

List of important DD commands:

  1. CREATE/ALTER/DROP KEYSPACE
  2. CREATE/ALTER/DROP TABLE
  3. USE
  4. DESCRIBE

Keyspace related commands

You are familiar with “CREATE KEYSPACE” command which is used to create a new keyspace. Once a keyspace is created you can run “DROP KEYSPACE” to delete it.

CREATE KEYSPACE test WITH REPLICATION={'class':'SimpleStrategy', 'replication_factor':3};
DROP KEYSPACE test;
DROP KEYSPACE test2;  --this will give error because test2 does not exit
DROP KEYSPACE IF EXISTS test2; --Does nothing as test2 does not exist

The “CREATE KEYSPACE” command has some arguments which can be used to specify replication strategy. Replication strategy determines how and where replicas of the data will be stored when you are adding data to the tables inside the keyspace. In a future post, I will explain the different replication strategies that can be used in Cassandra. You can use “ALTER KEYSPACE” command to change replication strategy of a keyspace.

ALTER KEYSPACE "test" WITH REPLICATION =
  { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };

Another option when creating/updating a keyspace is “DURABLE WRITES“. Normally Cassandra caches data in memory and flushed them to persistent storage periodically. If something wrong happens between two consecutive flush operations (e.g. machine shutdown), you will loose whatever data stored in the memory but not flushed to disk. If you enable “DURABLE WRITES” for a keyspace, Cassandra will use a special storage called “commit log” which will contain a copy of whatever there is in memory but not flushed to the disk. In the case of any data loss for memory-saved data, Cassandra will use the “commit log” to re-create the lost data and write them to the disk. Of course, this has it’s own load on the database. So you have to make a compromise. If data is not very important, you can disable this option but if not, you will have to enable it.

Here is an example of how to create a keyspace with durable writes disabled:

CREATE KEYSPACE test2
  WITH REPLICATION = 
  { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 } 
  AND DURABLE_WRITES = false;

Table related commands

These commands are used to create, edit or delete a table. You have seen “CREATE TABLE” command before:

-- create an empty table
CREATE TABLE test1 (
  field_name1 field_type1,
  field_name2 field_type2, 
...
  PRIMARY KEY (field_name1, field_name2, ...)
);

DROP TABLE test1;  -- delete the table and all of it's rows

To change a table you can use “ALTER TABLE” command. This command can be used to change the type of an existing column, add a new column or delete an existing column.

ALTER TABLE test1 ALTER field_name1 TYPE integer;

ALTER TABLE test1 ADD field_name3 integer;

ALTER TABLE test1 DROP field_name2;

You can also use “ALTER TABLE” command to edit table’s metadata. These metadata tell Cassandra how it should handle internal storage mechanisms for this table (e.g. caching parameters, compaction settings, …).

Other Commands

There are two other utility commands: USE and DESCRIBE.

You can use “USE x” to switch currently active keyspace.

You can use “DESCRIBE keyspace1” or “DESCRIBE table1” to see definition of a keyspace or a table. Note that to run the second version, you must have already switched to the keyspace which contains the table.

Introduction to Cassandra Query Language (CQL) – Part 2

In this part, I am going to give some more explanations about different data types that you can store in Cassandra. If you are familiar with programming you may know most of these data types. Here is the list of important data types:

  1. int: represents signed integer numbers (32-bit)
  2. float: floating point number (32-bit)
  3. double: floating point number (64-bit)
  4. boolean: can only hold true or false
  5. text: represents a string of characters
  6. blob
  7. counter
  8. list
  9. map
  10. set

The first five data types are almost familiar for most people because they are used in programming languages and other databases too. But maybe the second half of the list is not quite clear. So I will explain them with examples.

‘blob’ data type

This data type is used to store a very large chunk of data bytes. Mostly this data type is used to store files on the database.

‘counter’ data type

This data type is basically a 64-bit integer number. The main differences are:

  • You have to define dedicated tables for counter columns. This means that if you want to have a counter column, you will need to define a new table containing only primary key and the counter column you want. You cannot add non-counter fields which are not part of primary key.
  • This data type is optimized for concurrent updates. That’s the main difference between counter and integer. If you have an int column, to update its value, the database needs to lock, read the current value, update value and unlock. This is quite a heavy processing operation. But when you use counter, there will be no locking mechanism and there is no need to read current value before doing the update.

Collection data types

Cassandra supports collection data types. These are data types that ‘contain’ other non-collection data types.

A ‘list‘ can be defined like ‘list<text>‘ or ‘list<boolean>‘ and indicates an array or vector of zero or more values of that types.

map‘ means a relationship between two item which are called key and value. We can say that ‘key’ is mapped to ‘value’. For example, suppose that we have a list of U.S. states and their population. We can then say that ‘state name’ is mapped to ‘population’. Hence, when we need the population of a state we can easily use this list to figure that out. Here ‘state name’ is key and ‘population’ is value. We can define a map by specifying the data type of its key and value items, like ‘map<text, int>‘.

set‘ is exactly the same as ‘list’ with one exception: it cannot have duplicate values.

Example1

CREATE TABLE table1 (
key1 text PRIMARY KEY,
population_map map<text, int>,
list1 list<int>,
set1 set<boolean>
);

INSERT INTO table1 (key1, population_map, list1, set1) 
VALUES ('mahdi', {'NY': 10, 'CA':30, 'TX': 6}, [1, 2, 3], {true, false, false});

Above statements create a table named table1 containing 4 columns (We assume keyspace is already created and activated). The first column is the primary key of the table (Each table must have a primary key) and remaining 3 columns are of different collection types which were introduced.

Next statement inserts a single row into this table. Note the syntax used to specify a value for a map, list, and set.

Using a combination of collections, you can define quite complex data structures in a table in Cassandra. The reason for having complex data types is that in Cassandra everything is optimized for a lot of write operations and a few reads. Hence, it is better if you can read everything that you need with a single query on a single table rather than multiple queries across multiple tables. In contrast, in traditional RDBMSs, you will scatter columns onto multiple tables with relations to each other and when you want to read, you can join all those tables together to read multiple rows in a single query.

Example2

UPDATE table1 SET population_map['NY'] = 12 
WHERE key1 = 'mahdi';

DELETE population_map['TX'] FROM table1
WHERE key1 = 'mahdi';

Used-defined data types

You can define your own data type with its own attributes in Cassandra. If you are familiar with C or C++ this is same as ‘struct‘. You will then be able to use that data type as the type of columns when you define tables.

Below sample shows how to define a custom data type and use it when creating a new table and when inserting (or updating) data of the new table.

CREATE TYPE person_info(
                    name text,
                    family text,
                    age int
);

CREATE TABLE student(
                    student_id text PRIMARY KEY,
                    personal_info person_info
);                  

INSERT INTO student (student_id, personal_info) VALUES 
('A19201', { name: 'mahdi', family: 'mohammadi', age: 32 });

UPDATE student SET personal_info.age = 31 where id = 'A19201';

In the next part, I will explain more about CQL commands.

Introduction to Cassandra Query Language (CQL) – Part 1

Cassandra has it’s own language when you want to talk to it. This language is called CQL (Cassandra Query Language). If you are familiar with SQL (Structured Query Language) then this post will be very familiar to you because CQL is designed to be very similar to SQL.

In this post I will start with an introduction to the concept of database and it’s parts. Then I will explain some of the most important commands in CQL.

Basically, Cassandra (and most other database systems) organizes data in a way that is more intuitive and easy to retrieve. The highest level of this organization is named keyspace (same as a database in traditional database systems). You can think of a keyspace as a container into which you store all data that are related to each other. Normally this means all data of a single application.

Inside a keyspace, there are a number of tables. You can think of a table as an excel spreadsheet which has rows and columns. Each row represents a single piece of information. Each column means a specific attribute you wish to store in that table. For example, if we define a table to store student scores, its columns would be “student_id”, “student_name” and “score”. Each row would represent a single score of a student. Note that each column is capable of storing a specific type of data. Some columns are designed to store numbers while other may only store texts.

Generally, there are 2 types of actions we can do with a database:

  1. Working with data structures (keyspace, table, column)
  2. Working with data itself

The first group of commands is called DDL (Data Definition Language). The second group is called DML (Data Modification Language).

Obviously, without some place to store your data, you cannot do anything! so first of all you will need to create a data structure to support storage of your data. This structure will contain keyspaces, tables and columns. Normally you will define this structure to reflect the business requirements of your application (But in case of Cassandra there are other factors determining the structure of your data about which I will talk later).

I believe in “explanation by example”. So, as a very basic example, let’s create a keyspace and a table and insert some data into this table. Below are the commands that you need to execute:

 CREATE KEYSPACE test1 WITH REPLICATION  = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

USE test1;
CREATE TABLE users (
 userid text PRIMARY KEY,
 first_name text,
 last_name text,
 age int
);
INSERT INTO users (userid, first_name, last_name, age) VALUES ('12345', 'mahdi', 'mohammadinasab', 32);

SELECT * FROM users;

How can I run these commands?“, you may ask. I assume you have downloaded and installed Cassandra. If you have not done so, please refer to Getting Started .

After starting Cassandra, you will need to run the cqlsh utility. This utility is located inside './bin' directory.

Explanation for three commands:

  1. CREATE KEYSPACE: This command will create a new keyspace for you. Remember that all data is organized in keyspaces so you cannot define data structure or read/write data unless you have at least one keyspace. Note that CQL commands end with a semicolon. In next parts I will explain about the ‘WITH REPLICATION‘ section of the command.
  2. USE test1: After creating the initial and only keyspace you will need to select it for next commands. This command activates the ‘test1‘ keyspace so future commands will be running inside this keyspace and won’t affect any other keyspace. Note that after running this command the cqlsh prompt changes to denote current active keyspace.
  3. CREATE TABLE: Although you have a keyspace, you cannot read or write anything from/to the keyspace. It’s because keyspace is not used to organize your data directly. You have to put them inside tables which in turn are located inside a keyspace. This command will create a simple table named ‘users‘ which has 4 attributes (columns): userid, first_name, last_name and age. Notice the use of underscore in field names? That’s because you cannot have spaces in the name of a column. For each column you will specify the type of the data it will accept. Later I will explain about the PRIMARY KEY option of this command.
  4. INSERT INTO: This command writes data into a table. You need to specify a table name, columns you want to write values for and provide corresponding values. This will insert a new record in users table under test1 keyspace. Note that for text fields, you will need to enclose values within single quotes.
  5. SELECT: This command is used to read all of the rows from a table. ‘*’ means you want to read values for all columns. When you run this command, it will output data you have written in the previous step.

So this is the most basic set of commands you could execute. You created a simple data structure, wrote some data in it and read the data back. Try to experiment with these commands. Create multiple keyspaces and tables. Write data and read them back.

In the next part I will explain different data types in Cassandra.

 

How Apache Cassandra executes CQL queries?

Apache Cassandra is a distributed, highly scalable database which organizes data into rows and columns (Somehow like traditional DBs). Each row must have a unique primary key. This key is used to distribute the data to the DB nodes. If you are not familiar with Apache Cassandra please refer to cassandra.apache.org.

I have been a member of Cassandra developer community and I would like to shed some lights on the query execution process from a developer’s perspective.

The query language of Cassandra is called CQL (Cassandra Query Language). Although it is much similar to SQL language, it has a lot of underlying differences. You can use `cqlsh` tool to run CQL commands against a Cassandra system or you can run queries from source code (using one of Cassandra drivers to talk to a Cassandra instance).

When you submit a command from `cqlsh` tool, it is sent to Cassandra over a network connection. On the other side of the connection, a daemon is waiting for commands. It will produce a response for each received commands and send it to the originator. Cassandra uses Netty for its network protocol handling. Netty allows Cassandra to define complex message encoding, decoding and dispatching to handle the communication protocol with its clients.

Here is the general overview of the flow of method calls during processing of a CQL query:

Cassandra

  1. The starting point, on the server side, is CassandraDaemon which initializes the service and Server which will be network endpoint to receive requests from clients. When you run ‘./bin/cassandra‘ it will invoke JVM to run ‘main‘ method of this class. This will start the initialization process for Cassandra service and then set up NativeTransportService. This service will configure and start the network service (Defined in Server class).
  2. org.apache.cassandra.transport.Server: This class is responsible for setting up a socket server. According to Netty terminology, it defines a ‘childHandler’ which will be responsible for setting up and initialize each newly established Channel (network connection). This is done by an internal static class, named Initializer (If Cassandra is set up to use SSL mode, this will be SecureInitializer). This is the component which receives a CQL query request, delegates the processing and sends the response.
  3.  org.apache.cassandra.transport.Server.Initializer: Upon establishment of a new connection with a client, this class will configure the pipeline for processing incoming and outgoing messages through this channel. This includes frame encoding/decoding, compression/decompression and message encoding/decoding. Additionally, it will add a message dispatcher to the pipeline (An instance of Message.Dispatcher) which is responsible for handling incoming messages, delegate execution to corresponding modules and produce an appropriate response.
  4. org.apache.cassandra.transport.Message.Dispatcher: All network related classes act on Request and Response classes. Requests in Cassandra are all based on org.apache.cassandra.transport.Request abstract class. For each different type of request, there is a separate concrete implementation of this class. The most important of which for this post is org.apache.cassandra.transport.messages.QueryMessage. Message dispatcher class will extend SimpleChannelInboundHandler from Netty library. The main implemented method here is ‘channelRead0‘. This method is called when a new message is received (and processed through the configured pipeline). This method calls ‘execute’ method of the received request. The output of this method will be a Response which will be sent back to the client.
  5. org.apache.cassandra.transport.Message.Request$execute: This is an abstract method which is implemented for different types of messages. The message type in which we are interested is QueryMessage.
  6. org.apache.cassandra.transport.messages.QueryMessage.execute: This method delegates the processing to QueryProcessor.process method. The only logic here is for tracing and exception handling.
  7. org.apache.cassandra.cql3.QueryProcessor$process: This method first calls ‘getStatement‘ to parse the query string (using ANTLR parser) and create an instance of ‘ParsedStatement.Prepared‘. This will include a ‘statement‘ attribute which is of type CQLStatement. CQLStatement is an interface which has implementations for each query type. For example org.apache.cassandra.cql3.statements.UpdateStatement handles INSERT/UPDATE statements. After getting an instance of CQLStatement, its ‘execute‘ method will be invoked to do the processing and return the response.