Why do I prefer statically typed programming languages?

tl;dr: Dynamic typing sucks!

I have been working with a dynamically typed programming language, Perl, for the past 1.5 years. I have also worked with Java and C# during my career, and at this point I can really say that static typing makes developers' lives much easier.

So let me first explain the difference between these two terms and then I will talk about my reasons.

Dynamic Typing

Dynamically typed programming languages are languages where the type of the data is not specified in the source code but is determined only when the program is executed. The developer is not responsible for thinking about whether some variable should hold a number, a string or an instance of a DateTime object; they just use a variable name. When the code runs, the runtime environment (interpreter, JIT compiler, …) deduces the type that is most appropriate for that variable.

For example, the developer can write something like this, in Perl:

my $birth_date = Some::Module::get_birth_date();
$birth_date->plus_time_interval('18y');
my $day_of_week = $birth_date->day_of_week;

In the above code, the types of `$birth_date` and `$day_of_week` are not explicitly specified. Reading this snippet, you can guess that `$day_of_week` should be an integer, but in a large codebase such guesses are not always possible just by looking at the code. For example, you really don't know the type of `$birth_date` unless you read the source code of the `get_birth_date` function, and you can't know whether `plus_time_interval` expects an integer, a string or something else unless you have used that function before.

Now assume someone is responsible for maintaining or fixing a bug in a 100K SLOC codebase. When investigating the flow of execution of the code, it sometimes becomes really hard and complicated to find out the real type of a variable.

Examples of dynamically typed programming languages are Perl, Python, Ruby and JavaScript.

Static typing

On the other hand, in statically typed programming languages, the developer is responsible for explicitly stating the expected type of every variable and function output. This limits the set of valid operations on a variable and helps the compiler (or possibly the interpreter) check for invalid operations. For example, when the developer defines an integer variable, they cannot treat that variable as a string or assign a string literal to it, because the compiler will catch such errors. An example of statically typed code:

int x = 12;
int y = x + 30;
Customer c = find_customer("id", y);
c.name = "Mahdi";
c.save();

In the above code, the types of the variables `x` and `y` and the input/output of the function `find_customer` are all specified statically within the source code. You cannot pass an integer as the first argument of `find_customer` and still compile the code, because you will get a compiler error.

So writing statically typed code seems harder, because the developer has to write more code and keep track of variable types while writing it. Right? Yes! That is right if and only if you want to write a few lines of code, keep them for a relatively short period of time and ignore everything afterward. But if you want to maintain a relatively large code base over a longer period of time (which is the case most of the time in the software industry), this burden proves to have some benefits.


Examples of statically typed programming languages include C, C++, Java and C#.

Why static typing?

As I pointed out in the previous section, unless you want to write a few lines of code, keep them for a short period of time and discard everything afterward, statically typed programming languages have a clear advantage over dynamically typed ones.

The point is, source code is written once but read many times (for review, bug fixes, refactoring, migrations, optimizations, …). So the basic idea of static typing is that you write a bit more code and take care of a few more things while writing it, and in exchange you (or others) have a much easier life when reading it. Another clear benefit is that the compiler can help you find lots of possible errors before your code is released to the production environment.

I have had numerous cases of the `Can't locate object method 'xyz' via package "ABC::DEF"` error on a production server (because somewhere in the code, someone was calling a method on a variable that was expected to be of type T but at runtime had another type). That is because we, human beings, make mistakes, and computers are there to help us prevent those mistakes. If we just ignore their help (and use dynamic typing), then this is what happens next 🙂
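
To make that concrete, here is a minimal Java sketch of the same class of mistake; the `Customer` class and method names are made up for illustration. The compiler rejects the bad call long before the code reaches a production server:

public class CompileTimeCheckDemo {
    static class Customer {
        String name;
        void save() { /* pretend this persists the customer */ }
    }

    public static void main(String[] args) {
        Customer c = new Customer();
        c.name = "Mahdi";
        c.save();         // fine: the compiler knows Customer has save()
        // c.sendEmail(); // compile error: cannot find symbol 'sendEmail'
        //                // the Perl equivalent would only blow up at runtime
    }
}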

But don't get me wrong. There are cases where dynamic typing is the preferred choice when writing a software system. Generally speaking, if the code base is relatively small (say 1-10K SLOC) with a small development team (~5 people), then dynamic typing can have benefits: developers can write the code much faster and time-to-market will be much shorter. This is what normally happens in startups, where the founders want to deliver the product or prototype as soon as possible.

I have seen some people argue that we can achieve the same advantages of static typing in a dynamically typed programming language by writing good tests. This is possible in theory, but if you implement all of the test cases required for this purpose, you will be doing manually what the computer is supposed to do: something which can be automated. And by the way, the resulting code base will become much larger because of all those tests, which eliminates one of the important advantages of dynamic typing: more concise code with less boilerplate.

What is wrong with Object Oriented Programming and what is right with Functional Programming?

I have been reading numerous articles about OOP and FP and comparisons of the two recently. There are lots of differences and pros/cons for each of them, but one important and interesting thing I recently realized is how tangled behavior and data are in OOP. I have been writing OO code for years, but I had not noticed this until I read about FP and the way it handles data and behavior.

When you write software in an OO approach, you must attach every behavior to a single data entity (or class) in the business domain of the application. For example, when writing code to send email, you have a data entity called "Email Message" and a behavior "Send a message". There is no way for you to define these two separately; by separately I mean being able to invoke the "Send Email Message" functionality without an "Email Message" object. Of course, you need a message when you want to send one, but this is a difference in the way we organize and approach things in OOP. When writing software in an OO approach, you MUST define each behavior under one and exactly one class.

The question is, what if we cannot fit a behavior into this pattern? What if it doesn't make sense, or it is confusing, to attach a behavior to one and only one class?

Considering the above example about sending an email message, what happens if we have other entities in the application, like Sender, Recipient and Storage? Where should the "Send email message" functionality go? Each of the candidates below could be an acceptable place for this behavior.

  1. You can put the "Send" method in the "Message" class: msg1.Send();
  2. It can also be placed inside the Recipient class: recipient.ReceiveMessage(msg1);
  3. It can also be part of the Sender class: sender1.SendMessage(msg1);
  4. It can be part of a separate class, for example MessageBroker: MessageBroker.getInstance().SendMessage(msg1);

This is really confusing and IMHO unneeded complexity.

The advantage of the Functional Programming approach is that you don't have to bind each behavior to one class. Data and behavior are completely separate, so you can define and use them separately. This type of modeling is more consistent with the way we think about concepts in the software world.

Of course, for things which exist in the physical world this is less of an issue (a car is the only object which can provide the "Start" functionality), but for more abstract concepts which exist only in the software world, I think the FP approach makes more sense.
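
As a rough illustration (in Java, since that is the language used elsewhere in these posts; the Message, Sender, Recipient and Messaging names are hypothetical, and records need Java 16+), the FP-style version keeps the data types as plain records and defines the behavior as a free-standing function that none of them owns:

// Plain data carriers: no behavior attached.
record Message(String subject, String body) {}
record Sender(String address) {}
record Recipient(String address) {}

// The behavior lives on its own; it is not a method of Message, Sender or Recipient.
final class Messaging {
    private Messaging() {}

    static void send(Sender from, Recipient to, Message msg) {
        System.out.printf("From %s to %s: %s%n", from.address(), to.address(), msg.subject());
    }
}

Calling Messaging.send(sender, recipient, message) still needs all three pieces of data, but none of them "owns" the send behavior, which sidesteps the where-does-this-method-belong question above.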

 

What is Apache Maven and why do we need it?

I was going to write a blog post about the Spring framework, but I thought it might be better to start with Maven as the first step.

With the increase in the number of Java libraries, frameworks and projects, there was a need for a standard way of describing how a piece of software should be built and what its dependencies are. Normally a Java project uses a lot of frameworks and tools (e.g. logging, database, encryption, math, …), so we need a standard (machine-readable) way of describing these dependencies and how to compile and build the project.

Using Maven enforces a specific directory structure on your project: there is a dedicated directory path for your source code, runtime resource files and so on. The other change in your project is adding a file named pom.xml, which describes dependencies and build options.

I will explain some basic concepts in the Maven world, then will proceed to create a new project based on Maven.

Artifact

An artifact is a JAR file which is available for use in a Java project. Artifacts can be downloaded from a Maven repository (a publicly available website), but normally this is done automatically by the Maven command-line tool.

Each artifact represents a small piece of software that can be re-used in a software project; for example, there are artifacts for logging and encryption. There are a number of Maven repositories available, but the best-known repository is the one provided by Maven itself, which can be searched and browsed at http://search.maven.org/. You can search for artifacts and download them. You also have a local repository, which contains JAR files downloaded from remote repositories or installed locally.

Each artifact is uniquely identified by using 3 elements: group-id, artifact-id and version.

The group-id represents the company or organization which produced the artifact (e.g. org.apache.logging.log4j), the artifact-id is a unique string for the artifact (e.g. log4j) and the version is a version string (e.g. 1.2.17). When you specify your project's dependencies, you have to write their group-id, artifact-id and version.

You can also tell Maven to produce an artifact when compiling your project. The result will be a .jar file including metadata about the artifact. You can then install this jar file into a Maven repository and use it as a dependency in other projects.

Build lifecycle

A build lifecycle is a set of steps (phases) which Maven executes to build the project. You can define your own lifecycle, but the default lifecycle is used most of the time. Here is a list of the most important phases in the default lifecycle:

  1. validate: Make sure project structure is correct and all required data are available
  2. compile: Compile source code files
  3. test: Run unit test files
  4. package: Package compiled files in a distributable format (e.g. JAR)
  5. install: Install package into the local repository.

There is also another important lifecycle named 'clean'. This lifecycle cleans up artifacts created during the build lifecycle.

You can call the Maven command-line tool and give it the name of the phase or lifecycle to execute:

mvn clean package

The above command runs the clean lifecycle and then runs the default build lifecycle up to the 'package' phase. The target of the operation is the current directory.

Directory structure

Maven expects a standard project structure in your project. You have to follow this structure:

Suppose the root directory of the project is named ‘myapp‘:

  • myapp/pom.xml
  • myapp/src/main/java/com/mahdix/App.java: Source code files should reside in this location and below, according to your package names.
  • myapp/src/test/java/com/mahdix/Test.java: Location for test files

You run the Maven command-line utility in the root directory of the project, where the pom.xml file is located.

POM.xml

Here is the general structure of a pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0">
 <modelVersion>4.0.0</modelVersion>
 
 <groupId>com.mahdix</groupId>
 <artifactId>my-app</artifactId>
 <version>1.0</version>
 <packaging>jar</packaging>
 
 <name>Test Application</name>
 <url>http://www.mahdix.com</url>
 
 <dependencies>
   <dependency>
     <groupId>junit</groupId>
     <artifactId>junit</artifactId>
     <version>4.8.2</version>
   </dependency>
</dependencies>
</project>

Here is an explanation of the important parts:

  • modelVersion: Specifies the version of the POM model we are using; this should always be 4.0.0
  • groupId, artifactId, version: These tags specify information about the current project
  • dependencies: This tag lists the dependencies of the project. Maven will automatically fetch, download and install the required JAR files according to the dependency list provided; you just need to specify the artifacts you need.

 

Tutorial: Java Persistence API (JPA)

 

Java Persistence API (JPA) is a standard specification and a set of classes which helps developers write a provider-independent persistence layer in their applications. JPA has taken the best ideas from different persistence providers (e.g. Hibernate, TopLink, etc.).

The general workflow for JPA is:

  1. Define POJO classes corresponding to database entities. You annotate these classes to let JPA find them at runtime.
  2. Add a persistence.xml file which describes your POJO classes and database connection information.
  3. Write the code to work with database entities (CRUD operations) using JPA classes and JPQL language (which is similar to SQL).

Defining Entities

Entity classes are simple Java classes; they don't need to inherit from a specific class. All you need to do is annotate the class, add fields representing the database table columns and add getter/setter methods.

You can also annotate fields to indicate the primary key or give more information about the database representation of the field.

Below is an example of a student entity class:

import javax.persistence.*;

@Entity
@Table(name = "TBL_STUDENT")
public class Student {
    @Id
    @Column(name="ST_ID", nullable=false)
    @GeneratedValue
    private long id;

    @Column(name="FIRST_NAME")
    private String firstName;

    @Column(name="LAST_NAME")
    private String lastName;

    @Column(name="AVG_SCORE")
    private float avgScore;

    //getter and setter for fields go here
}

Note that we use Java annotations to represent metadata explaining more information about the entity.

  • @Column is used to define the database column name for the class field. You can omit this if the names are the same.
  • @Entity is used to declare this class as a DB entity (with @Table giving the table name).
  • @Id is used to mark the primary key of the entity.
  • @GeneratedValue is used to declare that the JPA provider should automatically generate a value for this field upon insertion.

JPA has a lot of other annotations to explain the type of data that can be stored in a field, table inheritance, composite primary key, entity relationships and much more. These will be explained in a future post.

Adding persistence.xml

In this XML file, you declare some metadata which cannot be written in the code easily. This is a sample XML file:

<?xml version="1.0" encoding="UTF-8"?>
<persistence>
 <persistence-unit name="mytest" transaction-type="RESOURCE_LOCAL">
  <provider>org.eclipse.persistence.jpa.PersistenceProvider</provider>
  <class>test.Student</class>
  <properties>
  <property name="javax.persistence.jdbc.driver" value="org.sqlite.JDBC" />
  <property name="javax.persistence.jdbc.url" value="jdbc:sqlite:sample.db" />
  <property name="eclipselink.ddl-generation" value="drop-and-create-tables" />
  <property name="eclipselink.ddl-generation.output-mode" value="database" />
  </properties>
 </persistence-unit>
</persistence>

The XML file structure is almost the same across different projects. Two parts of the file are particularly important:

  • name of the persistence-unit is a unique string which is used in the code to refer to this XML file.
  • provider XML tag is used to indicate which provider will be used at runtime to talk to the database.

The properties section of the XML file contains database connection information (e.g. database driver type, database name, …).

The code

The last step is using JPA classes to get a connection to the database and working with your POJO classes and EntityManager to do CRUD operations.

The basic flow is:

  1. Getting an instance of EntityManagerFactory class through Persistence class. Here you specify the persistence-unit name from XML file.
  2. Getting an instance of EntityManager class through EntityManagerFactory.
  3. Working with the EntityManager instance and its EntityTransaction to create, read, update or delete POJO objects.

Here is the basic code:

//step 1
EntityManagerFactory factory = Persistence.createEntityManagerFactory("mytest");
//step 2
EntityManager entityManager = factory.createEntityManager();
//step 3
EntityTransaction txn = entityManager.getTransaction();
txn.begin();

//step 4 - CRUD operations
...

//step 5 - finalization
txn.commit(); //you need this line if you have changed some db data
entityManager.close();
factory.close();

CRUD operations

The CRUD operations become pretty straightforward considering the fact that you just need to work with your POJO classes as a proxy to the database table rows.

  • Create: You instantiate POJO class, set field values, call persist method of EntityManager.
  • Read: You create a JPQL query and get the result as a List<POJOClass>.
  • Find: Calling find method on EntityManager and giving the primary key value of the row. The result will be an instance of a POJO class.
  • Update: First find the row, update the POJO class fields.
  • Delete: Find the row, then call EntityManager.remove method.

Below is sample code for CRUD operations:

//CREATE a new row
Student st = new Student();
st.setFirstName("mahdi");

entityManager.persist(st);

//UPDATE row
Student st = entityManager.find(Student.class, 19L);
st.setLastName("Mohammadinasab");

//DELETE row
Student st = entityManager.find(Student.class, 19L);
entityManager.remove(st);

//READ rows
Query q = entityManager.createQuery("SELECT s FROM Student s");
List<Student> students = q.getResultList();

Note that in the update process you don’t need to call any JPA method. Just update POJO fields and when you commit the transaction, data will be updated.

For queries that read rows from the database, the JPQL language is used. I will explain more about JPQL syntax and how to handle relationships (one-to-many and many-to-many) in a future post.
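
As a small illustration of a filtered JPQL query (continuing the Student entity and the entityManager from the snippets above; the parameter name is arbitrary):

//READ rows matching a named parameter
Query q = entityManager.createQuery(
        "SELECT s FROM Student s WHERE s.lastName = :lname");
q.setParameter("lname", "Mohammadinasab");
List<Student> matching = q.getResultList();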

 

Cassandra Internals: Bloom filters

In a previous post I explained SSTables and their role in the persistent storage of data in Cassandra.

Suppose that the following query is received on a node (news_id is the primary key):

SELECT * FROM tblNewsItems WHERE news_id = '8bacfe891fa89bfab98d9e99f9a9';

How can the node determine whether this specific row exists in its local data? As we know, there are in-memory tables (memtables), but they are used as a caching mechanism and not as the real storage, so a lookup in these in-memory structures won't be enough: they don't hold the whole range of data.

Another solution is to read the SSTable and check whether it contains the key we want. Even using efficient search algorithms like binary search, this is too expensive because it involves a lot of disk reads, which may not be efficient enough for a system with a lot of data and a lot of read queries.

The solution is to use the Bloom filter data structure. This is an in-memory, probabilistic data structure which can tell you whether an element is a member of a set or not. Here the set is the Cassandra table (which is actually a set of rows) and the element is the primary key we are looking for. So we have these features:

  1. It is in-memory, so it will be fast and efficient.
  2. It is probabilistic, meaning the result is not guaranteed to be correct. There is a high chance that the result is correct, and that is enough for us; this is the trade-off we pay to have such an efficient in-memory data structure. As a result, if the Bloom filter says a row is inside the set, there is a small probability that this is wrong, but the other direction is guaranteed: if it says a row is NOT in the set, it definitely is not.

How bloom filter works

A Bloom filter is a bitmap containing n bits plus a set of hash functions which are expected to be independent. Let's assume we have a Bloom filter with 100 bits and 3 hash functions (f1, f2 and f3); in practice the size of the Bloom filter will be much larger (MBs or GBs). Each of the hash functions receives an arbitrary input (the row's primary key) and outputs a number between 1 and n. You can easily achieve this output by using the modulo operator with common hash functions.

There are two simple, straightforward algorithms: one for inserting data into a Bloom filter and one for checking whether something exists in the set (a toy implementation follows after the two lists below). To insert data x into the Bloom filter:

  1. Apply hash functions on x. So we will have a=f1(x), b=f2(x) and c=f3(x).
  2. Set the bits numbered a, b and c in the Bloom filter to one (note that some of them may already be one, but this doesn't matter).

To check whether data y exists in the Bloom filter:

  1. Apply hash functions on y: So we will have d=f1(y), e=f2(y), f=f3(y).
  2. Check the bits numbered d, e and f in the Bloom filter's bitmap. Are they all set to one? If so, the row probably exists in the set. If even one of those bits is zero, the row definitely does not exist in the set.
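
Here is a toy Java sketch of those two algorithms, just to make the bitmap idea concrete. It is not how Cassandra implements its Bloom filters; real implementations use better, independent hash functions and far larger bitmaps:

import java.util.BitSet;

class ToyBloomFilter {
    private final BitSet bits;
    private final int size;

    ToyBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Three cheap, made-up "hash functions" derived from the key, each mapped into [0, size).
    private int[] positions(String key) {
        int h = key.hashCode();
        return new int[] {
            Math.floorMod(h, size),
            Math.floorMod(h * 31 + 7, size),
            Math.floorMod(h * 17 + 3, size)
        };
    }

    // Insert: set the bits at positions a, b and c to one.
    void insert(String key) {
        for (int p : positions(key)) bits.set(p);
    }

    // Lookup: if any bit is zero the key is definitely absent; otherwise it is probably present.
    boolean mightContain(String key) {
        for (int p : positions(key)) {
            if (!bits.get(p)) return false;
        }
        return true;
    }
}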

How are false positives handled?

A false positive for a Bloom filter is a case where the filter indicates that a row exists in the table but it does not. Remember that this is a probabilistic data structure, so these cases can happen.

If there is a false positive, we won't know about it until we scan the SSTable. So, if the Bloom filter indicates that a row exists in the table, Cassandra scans the SSTable looking for that row; if the scan completes and the row is not found, this is recorded as a false positive.

You can run the 'nodetool cfstats' command on a Cassandra node to view a lot of statistics about the node. Among these are Bloom filter statistics, which show you the memory consumed by the Bloom filter and the number of false positives.

 

Cassandra Consistency Levels

In this post, I will explain more about data consistency in Cassandra. In a non-distributed system, it rarely happens that some data is missing or corrupt. But in a large scale distributed database system, this will be the norm rather than the exception. So there should be mechanisms to detect, handle and fix these situations. Customizable consistency levels are one of those mechanisms.

Write Consistency

In the last post, I explained what happens when data is written in Cassandra. But how does Cassandra determine whether a write operation was successful or not?

Each write operation can specify the consistency level it needs. This is a number which determines how many replicas have to send a success reply to the coordinator so that the whole operation can be considered successful.

If for example this is set to ONE, with a replication factor of 3, the coordinator will ask three nodes to store the data and will return a success status upon receiving the first success response from any of these three nodes. This, in the worst case, may mean that the other two nodes have failed to write the data. This is not something which happens normally but in a large system, with a lot of nodes and a lot of data flying around, things may go wrong.

We can use the consistency level to adjust the trade-off between performance (lower consistency level = faster response time) and reliability (higher consistency level = less risk of an acknowledged write being lost or left inconsistent).

Read Consistency

Just as with write operations, we can specify a consistency level when reading data. When data is written in Cassandra, it is written to more than one node (refer to this and this post). Now, when we want to read the data back, how should we proceed? Which of those nodes should be contacted, and what if some of the contacted nodes don't return a response?

The read consistency level determines how the coordinator node should respond when some of the nodes don't reply to a READ request or reply too late. For example, if the read consistency is set to ALL, the coordinator must wait for a response from all replicas; this provides the highest level of reliability but the lowest performance. You can set it to TWO or THREE so that the coordinator waits for the two or three nearest replicas to return a response.
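
For concreteness, here is roughly how a client can request a per-statement consistency level, assuming the DataStax Java driver (3.x-style API); the contact point, keyspace and query values are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("my_keyspace");

// Ask the coordinator to wait for every replica before answering this read.
SimpleStatement read = new SimpleStatement(
        "SELECT * FROM tblNewsItems WHERE news_id = ?",
        "8bacfe891fa89bfab98d9e99f9a9");
read.setConsistencyLevel(ConsistencyLevel.ALL);
ResultSet rows = session.execute(read);

cluster.close();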

Example

The figure above shows a cluster with 12 nodes and a replication factor of 3. This means each row of data is written to 3 nodes (R1, R2 and R3 in the figure). When a client asks the coordinator (node 10) to read data with a consistency level of ONE, the coordinator contacts the nearest replica (R1) for the data. In the background, it makes sure R2 and R3 have the most recent data and, if not, a read repair operation is initiated.

Handling Inconsistency

If the coordinator gets different data from different replicas, which one should it pick? The answer is that Cassandra timestamps all data, so the coordinator can easily determine the most recent version, which is what gets returned to the client. After this happens, the coordinator also starts a read repair process in the background to make sure all replicas have up-to-date data.

 

How Cassandra Writes Data – Part 2

This is the second part in a two-part series about the internals of Apache Cassandra for writing data. Click here to read the first part.

Flow of execution

The process begins with a client sending a request containing an INSERT statement to the coordinator. Let's assume we have a table named table1 with the definition below:

CREATE TABLE table1 (id int, name text, PRIMARY KEY (id));

The cluster consists of five nodes, which we will call node1, ..., node5, and the replication factor is 3. The cluster has a partitioner algorithm which, given the primary key of a row, outputs a big, effectively random number. We call this number the identifier of the row. We assume the identifier is a 128-bit number, which means it will be between 0 and 2^128-1 (call it max). Now consider this as the range (0, max). Upon cluster configuration, this range is divided into five equal sub-ranges (because the cluster has five nodes):

R1=(0, max/5), R2=(max/5+1, 2*max/5), ... .

Each node will have its own range. R1 will be assigned to node1, R2 to node2 etc.

All nodes in the cluster know about other nodes and their corresponding range.

The coordinator receives this CQL statement:

INSERT INTO table1(id, name) VALUES (100, 'mahdi');

First, it applies the partitioner algorithm to '100' (the primary key). The result is a number. It then determines the range (Ri, i=1,..,5) within which the number lies. Let's assume the number lies within range R2, so node2 will receive the first copy of the row. But we have a replication factor of 3, which means the data needs to be stored on three different nodes. With the simple replication strategy, the additional nodes are the next nodes after the original receiver of the data, which are node3 and node4 in our example. So the coordinator sends 3 requests, to node2, node3 and node4, to store the values of the new row in their local data (there is another, more complex strategy called NetworkTopologyStrategy; see here for more information).

Note that node1 and node5 know nothing about the new row, but the cluster as a whole contains the new data and you can later query it from any node in the cluster.
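
A toy Java sketch of the simple placement logic described above (the ring model is heavily simplified: real tokens are 128-bit, the ranges need not be equal, and the class and numbers here are only for illustration):

import java.util.ArrayList;
import java.util.List;

class ToyRing {
    // Returns the indexes of the nodes that should store a row, given its token:
    // the node owning the token's range plus the next (replicationFactor - 1) nodes clockwise.
    static List<Integer> pickReplicas(long token, long max, int nodeCount, int replicationFactor) {
        long rangeSize = max / nodeCount;
        int owner = (int) Math.min(token / rangeSize, nodeCount - 1);
        List<Integer> replicas = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) {
            replicas.add((owner + i) % nodeCount);   // wrap around the ring
        }
        return replicas;   // e.g. a token in R2 with RF 3 gives [1, 2, 3] -> node2, node3, node4
    }
}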

SSTable, commit log and memtable

In a node which belongs to a Cassandra Cluster, there are 3 components which are used to store data:

  • SSTable: The database storage on persistent disk (e.g. a hard disk). When data is written to this storage it is permanently persisted, but the problem is that writing to it is expensive. That's why we need the other two components.
  • Memtable: For each table in the database, there is a memory area allocated to store its data. It is extremely fast because it is in memory, but it is not reliable: in case of a problem on the node, all of its data is lost.
  • Commit log: A persistent file written to local storage which contains a copy of all the actions applied to the database. This data can be used to reconstruct the SSTables in case of a problem on the node.

When data is written to the database, it is appended to the commit log and written to the memtable. After that, a success response is sent to the requester, indicating that the write operation completed successfully. Note that the data has not yet been written to an SSTable, but it is on persistent storage (the commit log), so it is safe. Periodically, the node flushes memtables to disk, which writes all accumulated updates to the SSTable (the final permanent storage for the data).

How Cassandra Writes Data – Part 1

As you know, Cassandra is a distributed database. This means that the Cassandra application runs on multiple machines. We call each running application a Cassandra instance, and each of those machines a host machine (or node). The collection of these host machines is called a cluster.

Instances on host machines constantly communicate with each other, sending and receiving data. Data and the operations on it are distributed between hosts through this ongoing communication; the details of the distribution depend on the operation taking place.

In this post (and the follow-up post), I am going to explain exactly what happens when an INSERT statement is executed on a Cassandra cluster. Of course, this post assumes that a cluster is already set up and all nodes know each other; the set-up process needs its own blog post, which I will write in the future.

First I am going to explain some of the terms I will need:

  1. Client: The machine on which the INSERT command is issued. We assume this machine is connected to one of the nodes in the cluster and sends the command to that node to be executed. This can be any node in the cluster; there is no special 'master' node, all nodes are the same (from a client's perspective).
  2. Coordinator: The node in the cluster that is contacted by the client. It is responsible for inserting the data into the Cassandra database, communicating with the other nodes in the cluster so they do their part, and finally returning the result to the client.

What is replication?

One feature of Cassandra is its replication capability. This means that any data you write to Cassandra can be stored on multiple nodes in the cluster. You have the option of determining how many nodes should store copies of the data and how they are chosen.

So if the replication factor is 3, inserted data will be written to persistent storage on 3 different nodes. The advantage is that in case of a failure (network, hardware, …) the system can handle the situation and still return correct data to the client in a timely manner.

You can set up different replication strategies, which determine the replica nodes for the data. For clarity, we assume the replication strategy is set to replicate data to 3 nodes in the cluster. More advanced replication strategies are explained here.

Primary key, Cluster key and Partition Key

When you create a table, you specify one or more columns as the primary key. If you specify only one column, you will have a simple primary key. If you specify two or more columns, it will be called a compound primary key.

CREATE TABLE tbl1 (user_id int, name text, PRIMARY KEY (user_id));
CREATE TABLE tbl2 (user_id int, ssn text, name text, PRIMARY KEY (user_id, ssn));

In the above example, the first table (tbl1) has a simple primary key while the second one has a compound primary key.

When we have a compound primary key, it has two parts: the partition key and the clustering key. If you do not specify which fields belong to which part, by default the first field of the primary key is the partition key and the rest form the clustering key (obviously, a simple primary key has no clustering key). In the above example, for tbl2, user_id is the partition key and ssn is the clustering key.

CREATE TABLE tbl3 
(user_id int, ssn text, region int, order_number int, name text, 
PRIMARY KEY ((user_id, ssn), region, order_number));

In the above example, we have a table with a compound primary key (it has 4 fields). The (user_id, ssn) pair is the partition key, and region and order_number are the clustering keys.

In the next post I will explain how these keys are used when writing data.

 

Introduction to Cassandra Query Language (CQL) – Part 5

In this part, I will explain the two remaining CQL commands: UPDATE and DELETE.

DELETE command

As the name suggests, this command is used to delete data from a table. The structure of the command is very similar to the SELECT command. You can specify a condition determining which rows you want to delete.

DELETE FROM table1;
DELETE FROM table2 WHERE user_id = 'mahdix';

The first command removes all rows from table1. The second command only removes rows whose user_id matches the given criteria.

Note that you can also use keyspace_name.table_name syntax instead of the table name.

In Cassandra, the relationship between rows and columns is looser than in traditional RDBMSs. Because of that, each row can have its own set of columns, which can be the same as in other rows or different. You can use the DELETE command to delete some columns from a row. This does not remove the whole row; the specified columns are removed without touching the remaining ones. For example:

DELETE field1, field2 FROM table1 WHERE name = 'mahdi';

This will delete two fields from rows in table1 which match the given criteria.

If you want to delete all rows from a table, there is a simpler command for that:

TRUNCATE keyspace1.table1;

UPDATE command

This command is used to change values of attributes for one or more rows:

UPDATE table1 SET age = age + 1 WHERE name = 'mahdi';

This command updates all rows in table1 that have 'mahdi' as the value of their name field, incrementing the value of the age attribute.

To change the value of a set collection data type, you use the plus and minus signs:

UPDATE table1 SET emails = emails + { 'mahdix@gmail.com' } WHERE name = 'mahdi';
UPDATE table1 SET emails = emails - { 'info@mahdix.com' } WHERE name = 'mahdi';
UPDATE table1 SET emails = {} WHERE name = 'mahdi';

The above syntax is for the set data type. For a list, follow the same syntax, but use brackets instead of braces:

UPDATE table1 SET members = ['person1', 'person2'] WHERE id = 1;
UPDATE table1 SET members = members + ['person3'] WHERE id = 3;
UPDATE table1 SET members = ['person4'] + members WHERE id = 4;
UPDATE table1 SET members = members - ['person1'] WHERE id = 2;

As you can see, you can either append or prepend data to a list. This is doable because, in a list, the order of items is important (unlike a set where the order or repetition does not matter).

For the map data type, you follow the same syntax used for inserting data:

UPDATE table1 SET state_population = { 'OH': 8, 'CA': 30} WHERE code = 'US';
UPDATE table1 SET state_population['OH'] = 9 WHERE code = 'US';
UPDATE table1 SET state_population = state_population + { 'TX': 17 } WHERE code = 'US';

Other commands

There are some other commands which are not discussed in my posts, mostly because they are not very common. For example, there are permission-control commands (CREATE USER, ALTER USER, GRANT, REVOKE) to manage users and their permissions (such as CREATE, DROP, SELECT, MODIFY, …) on a table or keyspace.

Another useful set of commands is for working with indexes (CREATE INDEX, DROP INDEX). Indexes help Cassandra find rows faster. If you have columns which are not part of the primary key but are commonly used in your queries' criteria, they may be good candidates for an index. In future posts, I will explain more about indexes.

Introduction to Cassandra Query Language (CQL) – Part 4

After discussing CQL commands for data definition, in this post I am going to explain "data manipulation" commands. These commands are used to add new data to a table and to read, update or delete data.

Each of the above four types of actions has its own command:

  1. INSERT: Is used to add a new row to a table,
  2. SELECT: Is used to read one or multiple rows from a table,
  3. UPDATE: Is used to change values of rows in a table,
  4. DELETE: Is used to delete one or more rows.

INSERT command

Suppose that we have a keyspace and a table in our Cassandra database. Here is how we can insert a single row into the table:

INSERT INTO mytable (field1, field2, field3) VALUES ('value1', 'value2', 'value3');

The above command sets values for three fields in the mytable table. It's common practice to capitalize all CQL keywords, but it's not necessary. It's assumed that all fields have the text type. Let's see how to insert data when fields have other types (if you are not familiar with the different data types in Cassandra, you can refer to this post).

INSERT INTO mytable (text_field, set_field, map_field, list_field) VALUES 
(
   'this is a text', 
   {'set_element1', 'set_element2', 'set_element3'}, 
   {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'},
   ['list_element1', 'list_element2', 'list_element3']
)

As you can see, the syntax used to specify values for collection data types is pretty intuitive.

Some notes about INSERT statement:

  • You can specify the keyspace name together with the table name, so you don't need to USE the keyspace before inserting data (INSERT INTO keyspace1.table1 …)
  • You can set a TTL when inserting data. TTL means Time To Live and specifies the number of seconds after which the values you are inserting will be deleted. For example:
INSERT INTO table1 (field1) VALUES ('temp_data') USING TTL 10;

After 10 seconds, this row will be removed. You can query the TTL for a column using the syntax below:

SELECT TTL(field1) FROM table1;

The above command returns the current TTL of field1 for all rows of the table.

SELECT command

This command is used to read (or select) one or more rows from a table. The simplest version of this command is:

SELECT * FROM table1;
SELECT * FROM keyspace1.table1;

As with the INSERT command, you can use either the table name or keyspace name + table name to indicate the target. The above commands return all the rows in the table1 table.

But what if you need a specific row or rows? There is a keyword, WHERE, which you can use to describe common properties of the rows you are looking for:

SELECT * FROM table1 WHERE id=12;
SELECT * FROM table2 WHERE id=12 AND age < 5;
SELECT * FROM table3 WHERE state='OH' or age > 30;

The above three commands return rows according to the given conditions. The syntax used to describe a condition is pretty straightforward. I will show some more advanced examples of the SELECT command below:

SELECT * FROM table1 WHERE id IN (10, 11, 12);
SELECT * from table1 WHERE id = 10 OR id = 11 OR id = 12;

The above two commands are equivalent and return rows whose 'id' field has a value of 10, 11 or 12.

SELECT * FROM table1 WHERE emails_set CONTAINS 'mahdix at gmail.com';
SELECT * FROM table2 WHERE map_field1 CONTAINS 'The Value1' OR map_field1 CONTAINS KEY 'OH';

Note that if you are filtering SELECT output using a condition which relies on a non-primary-key field, then this will be considered an expensive query and Cassandra will refuse to run it unless you have included “ALLOW FILTERING” at the end of the query.

SELECT * FROM tbl_people ORDER BY age DESC LIMIT 100;

The above query returns the 100 oldest people who have a record in our 'tbl_people' table.

I try to limit my posts to 500 words; that's why I will explain the UPDATE and DELETE statements in another post 🙂