How Cassandra Writes Data – Part 1

As you know, Cassandra is a distributed database. This means that the Cassandra application will be running on multiple machines. We call each running application, a Cassandra instance and each of those machines, a host machine (or node). The collection of these host machines is called a Cluster.

Instances on host machines, always communicate to each other, sending and receiving data. The data and operations on them are distributed between hosts based on this ongoing communication. The details of this distribution depend on the operation which is taking place.

In this post (and the follow-up post), I am going to explain what exactly happens when an INSERT statement is executed on a Cassandra cluster. Of course, this post assumes that a cluster is already set-up and all nodes know each other. The explanation of the set-up process needs it’s own blog post which I will do in future.

First I am going to explain some of the terms I will need:

  1. Client: This is the machine, on which the INSERT command is issued. We assume this machine is connected to one of the nodes in the cluster and sends the command to the node to be executed. This can be any node in the cluster. There is no special ‘master‘ node in the cluster, all of them are the same (from a client’s perspective).
  2. Coordinator: This is one of the nodes in the cluster that is being contacted by a client. It is responsible for inserting data into Cassandra database and communicate with other nodes in the cluster to do their job and finally returning the result to the client.

What is replication?

One feature of Cassandra is it’s replication capabilities. This means that any data that you write to Cassandra can be stored on multiple nodes in the cluster. You have the option to determine how many nodes should store copies of the data and how they are chosen.

So if replication factor is 3, inserted data will be written to persistent storage on 3 different nodes. The advantage of this is that in case of failure (in network, hardware, …) the system can handle the situation and return correct data in a timely manner to the client.

You can set-up different replication strategies which determine replica nodes for data. For clarity, we assume that replication strategy is set to replicate data to 3 nodes in the cluster. There are more advanced replication strategies explained here.

Primary key, Cluster key and Partition Key

When you create a table, you specify one or more columns as the primary key. If you specify only one column, you will have a simple primary key. If you specify two or more columns, it will be a called a compound primary key.

CREATE TABLE tbl1 (user_id int, name text, PRIMARY KEY (user_id));
CREATE TABLE tbl2 (user_id int, ssn text, name text, PRIMARY KEY (user_id, ssn))

In the above example, the first table (tbl1) has a simple primary key while the second one has a compound primary key.

When we have a compound primary key, it will have two parts: Partition key and Cluster key. If you do not specify which fields belong to which part, by default, the first field of the primary key will be considered partition key and the rest will be cluster key (obviously, a simple primary key has no cluster key). In the above example, for tbl2, user_id is partition key and ssn is clustering key.

(user_id int, ssn text, region int, order_number int, name text, 
PRIMARY KEY ((user_id, ssn), region, order_number);

In the above example, we have a table with a compound primary key (it has 4 fields). The user_id and ssn pair is the partition key and region, order_number are clustering key.

In the next post I will explain how these keys are used when writing data.


Introduction to Cassandra Query Language (CQL) – Part 5

In this part, I will explain the two remaining CQL commands: UPDATE and DELETE.

DELETE command

As the name suggests, this command is used to delete data from a table. The structure of the command is very similar to SELECT command. You can specify a condition determining which rows you want to delete.

DELETE FROM table2 WHERE user_id = 'mahdix';

The first command will remove all rows from table table1. The second command will only remove rows whose user_id is matching the given criteria.

Note that you can also use keyspace_name.table_name syntax instead of the table name.

In Cassandra, the relation between rows and columns is looser than in traditional RDBMSs. Because of that, each row can have its own columns which can be same as other rows or different. You can use DELETE command to delete some columns from a row. This will not remove the whole row but some columns will be removed without touching other remaining columns. For example:

DELETE field1, field2 FROM table1 WHERE name = 'mahdi';

This will delete two fields from rows in table1 which match the given criteria.

If you want to delete all rows from a table, there is a simpler command for that:

TRUNCATE keyspace1.table1;

UPDATE command

This command is used to change values of attributes for one or more rows:

UPDATE table1 SET age = age + 1 WHERE name = 'mahdi';

This command will update all rows in table1 who have ‘mahdi‘ as the value of their name field. This update statement will increment the value of age attribute.

To change value of set collection data types you should use plus and minus sign:

UPDATE table1 SET emails = emails + { '' } WHERE name = 'mahdi';
UPDATE table1 SET emails = emails - { '' } WHERE name = 'mahdi';
UPDATE table1 SET emails = {} WHERE name = 'mahdi';

Above syntax is for set data type. For a list, follow the same syntax, except using brackets instead of braces:

UPDATE table1 SET members = ['person1', 'person2'] WHERE id = 1;
UPDATE table1 SET members = members + ['person3'] WHERE id = 3;
UPDATE table1 SET members = ['person4'] + members WHERE id = 4;
UPDATE table1 SET members = members - ['person1'] WHERE id = 2;

As you can see, you can either append or prepend data to a list. This is doable because, in a list, the order of items is important (unlike a set where the order or repetition does not matter).

For map data type, you follow the same syntax used for insertion of data:

UPDATE table1 SET state_population = { 'OH': 8, 'CA': 30} WHERE code = 'US';
UPDATE table1 SET state_population['OH'] = 9 WHERE code = 'US';
UPDATE table1 SET state_population = state_population + { 'TX': 17 } WHERE code = 'US';

Other commands

There are some other commands which are not discussed in my posts. Mostly because they are not very common. For example, there are permission control commands (CREATE USER, ALTER USER, GRANT, REVOKE) to manage users and their permissions (such as CREATE, DROP, SELECT, MODIFY, …) to a table or keyspace.

Another useful set of commands is for working with index (CREATE INDEX, DROP INDEX). Indexes are used to help Cassandra find rows faster. If you have some columns which are not part of primary key but are commonly used in your queries’ criteria, they may be a good candidate to create an index. In future posts, I will explain more about indexes.

Introduction to Cassandra Query Language (CQL) – Part 4

After discussing CQL commands for Data Definition in this post I am going to explain “Data Manipulation” commands. These commands are used to add new data to a table or read/update/delete data.

Each of above four types of actions has it’s own command:

  1. INSERT: Is used to add new a row to a table,
  2. SELECT: Is used to read one or multiple rows from a table,
  3. UPDATE: Used to change values of rows in a table,
  4. DELETE: To delete one or more rows.

INSERT command

Suppose that we have a keyspace and a table in our Cassandra database. Here is how we can insert a single row into the table:

INSERT INTO mytable (field1, field2, field3) VALUES ('value1', 'value2', 'value3);

Above command will set value for three fields in the mytable table. It’s a common practice to capitalize all CQL keywords but it’s not necessary. It’s assumed that all fields have text type. Let’s see how to insert data when fields have other types (If you are not familiar with different data types in Cassandra, you can refer to this post).

INSERT INTO mytable (text_field, set_field, map_field, list_field) VALUES 
   'this is a text', 
   {'set_element1', 'set_element2', 'set_element3'}, 
   {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'},
   ['list_element1', 'list_element2', 'list_element3']

As you can see, the syntax used to specify values for collection data types is pretty intuitive.

Some notes about INSERT statement:

  • You can specify keyspace name and table name altogether, so you don’t need to USE the keyspace before inserting data (INSERT INTO keyspace1.table1 …)
  • You can set a TTL when inserting data. TTL means Time To Live and specifies the seconds after which, the values you are inserting will have to be deleted. So for example:
INSERT INTO table1 (field1) VALUES ('temp_data') USING TTL 10;

After 10 seconds, this row will be removed. You can query TTL For a column using below syntax:

SELECT TTL(field1) FROM table1;

Above command will return current TTL of field1 for all rows of the table.

SELECT command

This command is used to read (or select) one or more rows from a table. The simplest version of this command is:

SELECT * FROM table1;
SELECT * FROM keytspace1.table1;

As with the INSERT command, here you can either use table name or keyspace name + table name to indicate the target. Above commands will return all the rows that we have in table1 table.

But what if you need a special row or rows? There is a keyword, “WHERE” which you can use to describe common properties of the rows you are looking for:

SELECT * FROM table1 WHERE id=12;
SELECT * FROM table2 WHERE id=12 AND age < 5;
SELECT * FROM table3 WHERE state='OH' or age > 30;

Above three commands, will return a set of rows according to given conditions. The syntax used to describe condition is pretty straightforward. I will explain some more advanced examples of SELECT command below:

SELECT * FROM table1 WHERE id IN (10, 11, 12);
SELECT * from table1 WHERE id = 10 OR id = 11 OR id = 12;

Above two commands are exactly the same and will returns rows that their ‘id‘ field either has a value of 10, 11 or 12.

SELECT * FROM table1 WHERE emails_set CONTAINS 'mahdix at';
SELECT * FROM table2 WHERE map_field1 CONTAINS 'The Value1' OR map_field1 CONTAINS KEY 'OH';

Note that if you are filtering SELECT output using a condition which relies on a non-primary-key field, then this will be considered an expensive query and Cassandra will refuse to run it unless you have included “ALLOW FILTERING” at the end of the query.

SELECT * FROM tbl_people ORDER BY age DESC LIMIT 100;

The above query will return top 100 oldest people who have a record in our ‘tbl_people‘ table.

I try to limit my posts to 500 words, that’s why I will explain ‘UPDATE‘ and ‘DELETE‘ statements in another post :-)