
Apache Cassandra: Difference between revisions

From Wikipedia, the free encyclopedia
Slebresne (talk | contribs)
mNo edit summary
Data model: just a sentence to introduce the acronym RDBMS
 
(648 intermediate revisions by more than 100 users not shown)
{{short description|Free and open-source database management system}}
{{External links|date=October 2023}}
{{Use mdy dates|date=September 2024}}
{{Use American English|date=September 2024}}
{{Infobox software
{{Infobox software
| name = Apache Cassandra
| name = Apache Cassandra
| logo = [[File:Cassandra logo.svg|frameless|Cassandra logo]]
| logo = [[File:Cassandra logo.svg|frameless|Cassandra logo]]
| author = Avinash Lakshman, Prashant Malik / [[Facebook]]
| screenshot =
| caption =
| author = Avinash Lakshman, Prashant Malik
| developer = [[Apache Software Foundation]]
| developer = [[Apache Software Foundation]]
| released = 2008
| released = {{Start date and age|2008|07}}
| latest release version = {{wikidata|property|edit|reference|P548=Q2804309|P348}}
| status = Active
| latest release date = {{start date and age|{{wikidata|qualifier|mdy|P548=Q2804309|P348|P577}}}}
| latest release version = 1.2.4
| latest release date = {{release date|2013|04|11}}
| frequently updated = yes
| programming language = [[Java (programming language)|Java]]
| programming language = [[Java (programming language)|Java]]
| operating system = [[Cross-platform]]
| operating system = [[Cross-platform]]
| language = English
| language = English
| genre = [[key-value store]]
| genre = [[NoSQL]] [[Database]], [[data store]]
| license = [[Apache License 2.0]]
| license = [[Apache License 2.0]]
| website = {{URL|http://cassandra.apache.org/}}
}}
}}
'''Apache Cassandra''' is a [[free and open-source software|free and open-source]] [[database management system]] designed to handle large volumes of data across multiple [[Commodity computing|commodity servers]]. The system prioritizes availability and [[scalability]] over [[consistency (database systems)|consistency]], making it particularly suited for systems with high write throughput requirements due to its [[Log-structured merge-tree|LSM tree]] indexing storage layer.<ref name="carpenter2022">{{cite book |last1=Carpenter |first1=Jeff |last2=Hewitt |first2=Eben |title=Cassandra: The Definitive Guide |edition=3rd |publisher=[[O'Reilly Media]] |year=2022 |isbn=978-1-4920-9710-5 |pages=}}</ref> As a [[wide column store|wide-column database]], Cassandra supports flexible schemas and efficiently handles data models with numerous sparse columns. The system is optimized for applications with well-defined data access patterns that can be incorporated into the schema design.<ref name="carpenter2022" /> Cassandra supports [[computer cluster]]s which may span multiple [[data center]]s,<ref>{{cite web |access-date=2013-07-25 |first=Joaquin |last=Casares |date=2012-11-05 |publisher=DataStax |title=Multi-datacenter Replication in Cassandra |quote=Cassandra's innate datacenter concepts are important as they allow multiple workloads to be run across multiple datacenters... |url=http://www.datastax.com/dev/blog/multi-datacenter-replication}}</ref> featuring [[Asynchrony (computer programming)|asynchronous]] and masterless replication. It enables [[Latency (engineering)|low-latency]] operations for all clients and incorporates [[Amazon (company)|Amazon]]'s [[Dynamo (storage system)|Dynamo]] [[distributed storage]] and replication techniques, combined with [[Google]]'s [[Bigtable]] data storage engine model.<ref>{{cite web |url=https://cassandra.apache.org/doc/latest/architecture/overview.html |title=Apache Cassandra Documentation Overview |access-date=2021-01-21}}</ref>
'''Apache Cassandra''' is an [[open source software|open source]] [[distributed database|distributed]] [[database management system]]. It is an [[Apache Software Foundation]] top-level project<ref name=GRAD>{{cite web|url=http://www.mail-archive.com/cassandra-dev@incubator.apache.org/msg01518.html |title=Cassandra is an Apache top level project |publisher=Mail-archive.com |date=2010-02-18 |accessdate=2010-03-29| archiveurl= http://web.archive.org/web/20100328090322/http://www.mail-archive.com/cassandra-dev@incubator.apache.org/msg01518.html| archivedate= 28 March 2010 <!--DASHBot-->| deadurl= no}}</ref> designed to handle very large amounts of data spread out across many [[commodity server]]s while providing a highly available service with no [[single point of failure]]. It is a [[NoSQL (concept)|NoSQL]] solution that was initially developed by [[Facebook]] and powered their Inbox Search feature until late 2010.<ref name=FBIS>{{cite web|url=http://www.facebook.com/note.php?note_id=24413138919&id=9445547199&index=9 |title=Niet compatibele browser |publisher=Facebook |date= |accessdate=2010-03-29}}</ref><ref name=KM2010>{{cite web
|url=http://www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919
|title=The Underlying Technology of Messages
|author=Kannan Muthukkaruppan
}}</ref> Jeff Hammerbacher, who led the Facebook Data team at the time, has described Cassandra as a [[BigTable]] data model running on an [[Dynamo (storage system)|Amazon Dynamo]]-like infrastructure.<ref name=JH2008>{{cite web
|accessdate=2009-06-04
|date=July 12, 2008
|url=http://perspectives.mvdirona.com/2008/07/12/FacebookReleasesCassandraAsOpenSource.aspx
|title=Facebook Releases Cassandra as Open Source
|author=James Hamilton
}}</ref>


== History ==
Cassandra provides a structured [[key-value store]] with tunable consistency.<ref name=LADIS2009>http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf</ref> Keys map to multiple values, which are grouped into [[column families]]. The column families are fixed when a Cassandra database is created, but columns can be added to a family at any time. Furthermore, columns are added only to specified keys, so different keys can have different numbers of columns in any given family. The values from a column family for each key are stored together.
Avinash Lakshman, a co-author of [[Amazon (company)|Amazon]]'s [[Dynamo (storage system)|Dynamo]], and Prashant Malik developed Cassandra at [[Facebook]] to support the [[inbox]] [[Search engine|search]] functionality. Facebook released Cassandra as open-source software on [[Google Code]] in July 2008.<ref name=JH2008>{{cite web |access-date= 2009-06-04 |date= July 12, 2008 |url= http://perspectives.mvdirona.com/2008/07/12/FacebookReleasesCassandraAsOpenSource.aspx |title= Facebook Releases Cassandra as Open Source |first= James |last= Hamilton}}</ref> In March 2009, it became an Apache Incubator project<ref>{{cite web |date=2009-03-02 |title=Is this the new hotness now? |url=http://www.mail-archive.com/cassandra-dev@incubator.apache.org/msg00004.html |url-status=live |archive-url=https://web.archive.org/web/20100425071855/http://www.mail-archive.com/cassandra-dev%40incubator.apache.org/msg00004.html |archive-date=25 April 2010 |access-date=2010-03-29 |publisher=Mail-archive.com}}</ref> and on February 17, 2010, it graduated to a top-level project.<ref name=GRAD>{{cite web|url=http://www.mail-archive.com/cassandra-dev@incubator.apache.org/msg01518.html |title=Cassandra is an Apache top level project |publisher=Mail-archive.com |date=2010-02-18 |access-date=2010-03-29 |archive-url=https://web.archive.org/web/20100328090322/http://www.mail-archive.com/cassandra-dev%40incubator.apache.org/msg01518.html |archive-date=28 March 2010 |url-status=live }}</ref>


The developers at [[Facebook]] named their database after [[Cassandra]], the [[mythological]] [[Troy|Trojan]] prophetess, referencing her curse of making prophecies that were never believed.<ref>{{cite web |url= http://kellabyte.com/2013/01/04/the-meaning-behind-the-name-of-apache-cassandra/ |archive-url= https://web.archive.org/web/20161101091045/http://kellabyte.com/2013/01/04/the-meaning-behind-the-name-of-apache-cassandra |archive-date= 2016-11-01 |title= The meaning behind the name of Apache Cassandra |access-date= 2016-07-19 |quote= Apache Cassandra is named after the Greek mythological prophet Cassandra. [...] Because of her beauty Apollo granted her the ability of prophecy. [...] When Cassandra of Troy refused Apollo, he put a curse on her so that all of her and her descendants' predictions would not be believed. [...] Cassandra is the cursed Oracle[.] |url-status= dead }}</ref>
Additional features include: using the [[BigTable]] way of modeling, [[eventual consistency]], and the [[Gossip protocol]], a master-master way of serving read and write requests inspired by [[Dynamo (storage system)|Amazon's Dynamo]].<ref>{{cite web
| accessdate = 2010-03-22
| author = Olivier Mallassi
| date = 2010-06-09
| location = http://blog.octo.com/
| publisher = OCTO Talks
| title = Let’s play with Cassandra… (Part 1/3)
| quote = Hybrid firstly because Cassandra uses a column-oriented way of modeling data (inspired by the BigTable) and permit to use Hadoop Map/Reduce jobs and secondly because it uses patterns inspired by Dynamo like Eventually Consistent, Gossip protocols, a master-master way of serving both read and write requests…
| url = http://blog.octo.com/en/nosql-lets-play-with-cassandra-part-13/
}}</ref>


== Features and Limitations ==
== History ==
Cassandra uses a [[distributed architecture]] where all nodes perform identical functions, eliminating single points of failure. The system employs configurable replication strategies to distribute data across clusters, providing redundancy and disaster recovery capabilities. The system is capable of linear scaling, which increases read and write throughput with the addition of new nodes, while maintaining continuous service.
Apache Cassandra was developed at [[Facebook]] by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik to power the Inbox Search feature. It was released as an open-source project on [[Google Code]] in July 2008.<ref name=JH2008 /> In March 2009, it became an [[Apache Incubator]] project.<ref>{{cite web|url=http://www.mail-archive.com/cassandra-dev@incubator.apache.org/msg00004.html |title=Is this the new hotness now? |publisher=Mail-archive.com |date=2009-03-02 |accessdate=2010-03-29| archiveurl= http://web.archive.org/web/20100425071855/http://www.mail-archive.com/cassandra-dev@incubator.apache.org/msg00004.html| archivedate= 25 April 2010 <!--DASHBot-->| deadurl= no}}</ref> On February 17, 2010, it graduated to a top-level project.<ref name=GRAD />


Cassandra is categorized as an AP ([[Availability (system)|Availability]] and Partition Tolerance) system, emphasizing availability and partition tolerance over [[Consistency (database systems)|consistency]]. While it offers tunable consistency levels for both read and write operations, its architecture makes it less suitable for use cases requiring strict consistency guarantees.<ref name="carpenter2022" /> Additionally, Cassandra's compatibility with [[Apache Hadoop|Hadoop]] and related tools allows for integration with existing big data processing workflows. Eventual consistency is maintained using [[Tombstone (data store)|tombstones]] to manage reads, [[UPSERT|upserts]], and deletes.
Releases after graduation include:
* 0.6, released April 12, 2010, added support for integrated caching and [[Apache Hadoop]] [[MapReduce]]<ref>[https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces3 The Apache Software Foundation Announces Apache Cassandra Release 0.6 : The Apache Software Foundation Blog<!-- Bot generated title -->]</ref>
* 0.7, released January 8, 2011, added secondary indexes and online schema changes<ref>[https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces9 The Apache Software Foundation Announces Apache Cassandra 0.7 : The Apache Software Foundation Blog<!-- Bot generated title -->]</ref>
* 0.8, released June 2, 2011, added the [[Cassandra Query Language]] (CQL), self-tuning memtables, and support for zero-downtime upgrades<ref>[http://grokbase.com/t/cassandra/user/1162fkpwx2/release-0-8-0 [Cassandra-user&#93; [RELEASE&#93; 0.8.0 - Grokbase<!-- Bot generated title -->]</ref>
* 1.0, released October 17, 2011, added integrated compression, leveled compaction, and improved read performance<ref>[http://www.infoq.com/news/2011/10/Cassandra-1 Cassandra 1.0.0. Is Ready for the Enterprise<!-- Bot generated title -->]</ref>
* 1.1, released April 23, 2012, added self-tuning caches, row-level isolation, and support for mixed SSD/spinning disk deployments<ref>[https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces26 The Apache Software Foundation Announces Apache Cassandra™ v1.1 : The Apache Software Foundation Blog<!-- Bot generated title -->]</ref>
* 1.2, released January 2, 2013, added clustering across virtual nodes, inter-node communication, atomic batches, and request tracing<ref>[https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces38 The Apache Software Foundation Announces Apache Cassandra™ v1.2]</ref>


The system's query capabilities have notable limitations. Cassandra does not support advanced query patterns such as multi-table [[Join (SQL)|JOINs]], ad hoc aggregations, or complex queries.<ref name="carpenter2022" /> These limitations stem from its distributed architecture, which optimizes for scalability and availability rather than complex query operations.
== Licensing and support ==


== Data model ==
Apache Cassandra is an Apache Software Foundation project and is distributed under the [[Apache License|Apache License (version 2.0)]].
As a [[wide column store|wide-column store]], Cassandra combines features of both key-value and tabular database systems. It implements a partitioned row store model with adjustable consistency levels.<ref name="tunable_consistency">{{cite web |access-date=2013-07-25 |author=DataStax |author-link=DataStax |date=2013-01-15 |title=About data consistency |url=http://www.datastax.com/docs/1.2/dml/data_consistency |archive-url=https://web.archive.org/web/20130726185743/http://www.datastax.com/docs/1.2/dml/data_consistency |archive-date=2013-07-26 |url-status=dead }}</ref> The following table compares Cassandra and [[relational database management systems]] (RDBMS).


{| class="wikitable"
Professional-grade support is available from a few companies. The official wiki of the Apache Cassandra project<ref>[http://wiki.apache.org/cassandra/ThirdPartySupport "Third Party Support"] article on Apache Cassandra's wiki</ref> lists the following companies, which collaborate with developers of the project:
|+ Data Model Comparison: Cassandra vs RDBMS
* [[DataStax]]
! Feature !! Cassandra !! RDBMS
* [http://www.acunu.com/ Acunu]
|-
| Organization || Keyspace → Table → Row || Database → Table → Row
|-
| Row Structure || Dynamic columns || Fixed schema
|-
| Column Data || Name, type, value, timestamp || Name, type, value
|-
| Schema Changes || Runtime modifications || Usually requires downtime
|-
| Data Model || Denormalized || Normalized with JOINs
|}


The data model consists of several hierarchical components:
== Main features ==


=== Keyspace ===
; Decentralized
A keyspace in Cassandra is analogous to a database in [[relational database management system|relational systems]]. It contains multiple tables and manages configuration information, including replication strategy and user-defined types (UDTs).<ref name="carpenter2022" />
: Every node in the cluster has the same role. There is '''no single point of failure'''. Data is distributed across the cluster (so each node contains different data), but there is no master as every node can service any request.


=== Tables ===
; Supports replication and multi data center replication
Tables (formerly called [[Column family|column families]] prior to CQL 3) are containers for rows of data. Each table has a name and configuration information for its stored data. Tables may be created, dropped, or altered at run-time without blocking [[Update (SQL)|updates]] and queries.<ref>{{cite web |access-date=2013-07-25 |first=Jonathan |last=Ellis |date=2012-03-02 |title=The Schema Management Renaissance in Cassandra 1.1 |publisher=DataStax |url=http://www.datastax.com/dev/blog/the-schema-management-renaissance}}</ref>
: Replication strategies are configurable.<ref>[http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers "Deploying Cassandra across Multiple Data Centers" article on Datastax Cassandra Developer Center]</ref> Cassandra is designed as a distributed system, for deployment of large numbers of nodes across multiple data centers. Key features of Cassandra’s distributed architecture are specifically tailored for multiple-data center deployment, for redundancy, for failover and disaster recovery.


=== Rows and Columns ===
; Scalability
Each row is identified by a [[primary key]] and contains columns. The first component of a table's primary key is the partition key; within a partition, rows are [[Clustered index|clustered]] by the remaining columns of the key.<ref>{{cite web |access-date=2013-07-25 |first=Jonathan |last=Ellis |date=2012-02-15 |title=Schema in Cassandra 1.1 |publisher=DataStax |url=http://www.datastax.com/dev/blog/schema-in-cassandra-1-1}}</ref>
: Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.


Columns contain data belonging to a row and consist of:
; Fault-tolerant
* A name
: Data is automatically replicated to multiple nodes for [[fault-tolerance]]. [[Replication (computer science)|Replication]] across multiple data centers is supported. Failed nodes can be replaced with no downtime.
* A type
* A value
* Timestamp metadata (used for write conflict resolution via "last write wins")


Unlike traditional RDBMS tables, rows within the same table can have varying columns, providing a flexible structure. This flexibility distinguishes Cassandra from relational databases, as not all columns need to be specified for each row.<ref name="carpenter2022" /> Other columns may be indexed separately from the primary key.<ref>{{cite web |access-date=2013-07-25 |first=Jonathan |last=Ellis |date=2010-12-03 |title=What's new in Cassandra 0.7: Secondary indexes |publisher=DataStax |url=http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes}}</ref>
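The timestamp-based "last write wins" rule and the sparse row structure described above can be illustrated with a minimal Python sketch. The <code>Cell</code> class, the <code>merge_rows</code> function, and the sample values are hypothetical names chosen for illustration and are not part of Cassandra's API; ties between equal timestamps are resolved here by keeping the first version, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One column value together with its write timestamp (illustrative only)."""
    value: object
    timestamp: int  # e.g. microseconds since the epoch


def merge_rows(a, b):
    """Merge two versions of a row, keeping the newest cell for each column.

    Rows are plain dicts mapping column name -> Cell, so a row only
    carries the columns that were actually written (sparse columns).
    """
    merged = dict(a)
    for name, cell in b.items():
        current = merged.get(name)
        # Simplification: on equal timestamps the cell from `a` is kept.
        if current is None or cell.timestamp > current.timestamp:
            merged[name] = cell
    return merged


# Two versions of the same row with different column sets.
version_1 = {"name": Cell("Ada", 100), "email": Cell("ada@old.example", 100)}
version_2 = {"email": Cell("ada@new.example", 200), "city": Cell("London", 150)}

row = merge_rows(version_1, version_2)
```

In this sketch the merged row contains all three columns, and the <code>email</code> column takes the value with the larger timestamp.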
; Tunable consistency
: Writes and reads offer a tunable level of consistency, all the way from "writes never fail" to "block for all replicas to be readable", with the [[Quorum_(distributed_computing)|quorum level]] in the middle.
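The arithmetic behind the quorum level can be sketched in a few lines of Python. The function names are hypothetical; the sketch assumes a single data center with replication factor <code>rf</code>, and uses the common rule that a read is guaranteed to overlap the latest acknowledged write when the number of replicas contacted for the read plus the number contacted for the write exceeds the replication factor.

```python
def quorum(rf):
    """Number of replicas that form a majority for replication factor rf."""
    return rf // 2 + 1


def read_overlaps_write(rf, write_replicas, read_replicas):
    """True if every read replica set must intersect every write replica set."""
    return write_replicas + read_replicas > rf


rf = 3
q = quorum(rf)                                    # majority of 3 replicas
strong = read_overlaps_write(rf, q, q)            # quorum writes + quorum reads
weak = read_overlaps_write(rf, 1, 1)              # one replica each way
```

With three replicas, quorum writes combined with quorum reads always overlap on at least one replica, whereas writing to and reading from a single replica gives no such guarantee.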


== Storage Model ==
; MapReduce support
Cassandra uses a [[Log-structured merge-tree|Log Structured Merge Tree (LSM tree)]] index to optimize write throughput, in contrast to the [[B tree indexing|B-tree indexes]] used by most databases.<ref name="carpenter2022" />
: Cassandra has [[Hadoop]] integration, with [[MapReduce]] support. There is support also for [http://pig.apache.org/ Apache Pig] and [http://hive.apache.org/ Apache Hive].<ref name="hadoopsupport">[http://wiki.apache.org/cassandra/HadoopSupport "Hadoop Support"] article on Cassandra's wiki</ref>


{| class="wikitable"
; Query language
|+ Storage Model Comparison: Cassandra vs RDBMS
: [[Cassandra Query Language]] (CQL) is an SQL-like alternative to the traditional RPC interface. Language drivers are available for '''Java''' (JDBC), '''Python''' (DBAPI2) and '''Node.JS''' (Helenus).
! Feature !! Cassandra !! RDBMS
|-
| Index Structure || LSM Tree || B-Tree
|-
| Write Process || Append-only with Memtable || In-place updates
|-
| Storage Components || Commit Log, Memtable, SSTable || Data files, Transaction Log
|-
| Update Strategy || New entry for each change || Modify existing data
|-
| Delete Handling || Tombstone markers || Direct removal
|-
| Read Optimization || Secondary || Primary
|-
| Write Optimization || Primary || Secondary
|}


The storage architecture consists of three main components:<ref name="carpenter2022" />
== Data model ==
{{Expand section|informational details and clarification|date=September 2012}}
Cassandra is essentially a hybrid between a key-value and a row-oriented (or tabular) database.


=== Core Components ===
:A [[column family]] resembles a table in an RDBMS. Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and a timestamp. Unlike a table in an RDBMS, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.<ref>{{cite web|last=DataStax|title=Apache Cassandra 0.7 Documentation - Column Families|url=http://www.datastax.com/docs/0.7/data_model/column_families#column-families|work=Apache Cassandra 0.7 Documentation|accessdate=29 October 2012}}</ref>
* '''Commit Log''': A [[Write-ahead logging|write-ahead log]] that ensures write durability
* '''Memtable''': An [[In-memory processing|in-memory]] data structure that stores writes, sorted by primary key
* '''SSTable''' (Sorted String Table): Immutable files containing data flushed from Memtables


=== Write and Read Processes ===
In other words, each key in Cassandra corresponds to a value that is an object. Each key has values stored as columns, and columns are grouped together into sets called column families. Column families can in turn be grouped into super column families.
Write operations follow a two-stage process:
# The write is recorded in the commit log and added to the Memtable
# When the Memtable reaches size or time thresholds, it flushes to an SSTable


Read operations:
Each key therefore identifies a row with a variable number of elements. These column families can then be considered tables. A table in Cassandra is a distributed multidimensional map indexed by a key.
# Check Memtable for latest data
# Search SSTables from newest to oldest using bloom filters for efficiency
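The write and read steps above can be sketched as a toy log-structured store in Python. This is a simplified illustration under stated assumptions, not Cassandra's implementation: the class name <code>TinyLSM</code> and the <code>memtable_limit</code> parameter are hypothetical, a flush is triggered only by entry count, an exact key set stands in for a Bloom filter (a real Bloom filter may return false positives), and everything is kept in memory instead of on disk.

```python
class TinyLSM:
    """Toy model of the commit log / Memtable / SSTable write and read path."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.commit_log = []                  # append-only record of writes
        self.memtable = {}                    # in-memory writes, newest value per key
        self.sstables = []                    # immutable (key_filter, data) pairs, oldest first

    def write(self, key, value):
        # Step 1: record the write in the commit log and add it to the Memtable.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        # Step 2: flush once the Memtable reaches its size threshold.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        data = dict(sorted(self.memtable.items()))      # SSTables are sorted by key
        self.sstables.append((frozenset(data), data))   # exact set as Bloom filter stand-in
        self.memtable = {}
        self.commit_log.clear()  # flushed writes no longer need the log

    def read(self, key):
        # Check the Memtable first: it holds the latest data.
        if key in self.memtable:
            return self.memtable[key]
        # Then search SSTables from newest to oldest, skipping files
        # whose filter says the key is absent.
        for key_filter, data in reversed(self.sstables):
            if key in key_filter:
                return data[key]
        return None


db = TinyLSM(memtable_limit=2)
db.write("a", 1)
db.write("b", 2)   # second entry triggers a flush to the first SSTable
db.write("a", 3)   # newer value for "a" lives in the Memtable
```

Because nothing is updated in place, the old value of <code>a</code> remains in the first SSTable while reads return the newer value found earlier in the search order.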


=== Data Management ===
Furthermore, applications can specify the sort order of columns within a Super Column or Simple Column family.
==== Tombstones ====
Every operation (create/update/delete) generates a new entry, with deletes handled via "[[Tombstone (data store)|tombstones]]". While common in many databases, tombstones can cause performance degradation in delete-heavy workloads.<ref>{{cite web |last1=Rodriguez |first1=Alain |title=About Deletes and Tombstones in Cassandra |url=https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html |date=27 Jul 2016}}</ref>


==== Compaction ====
==Clustering==
Compaction consolidates multiple SSTables to:
When the cluster for Apache Cassandra is designed, an important point is to select the right partitioner. Two partitioners exist:<ref>{{cite web
* Reduce storage usage
| accessdate = 2011-03-23
* Remove deleted row tombstones
| author = Dominic Williams
* Improve read performance
| location = http://wordpress.com/
| publisher = WordPress.com
| title = Cassandra: RandomPartitioner vs OrderPreservingPartitioner
| quote = When building a Cassandra cluster, the “key” question (sorry, that’s weak) is whether to use the RandomPartitioner (RP), or the OrderPreservingPartitioner (OPP). These control how your data is distributed over your nodes. Once you have chosen your partitioner, you cannot change without wiping your data, so think carefully! The problem with OPP: If the distribution of keys used by individual column families is different, their sets of keys will not fall evenly across the ranges assigned to nodes. Thus nodes will end up storing preponderances of keys (and the associated data) corresponding to one column family or another. If as is likely column families store differing quantities of data with their keys, or store data accessed according to differing usage patterns, then some nodes will end up with disproportionately more data than others, or serving more “hot” data than others.
| url = http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/}}</ref>
# RandomPartitioner (RP): This partitioner randomly distributes the key-value pairs over the network, resulting in good load balancing. Compared to OPP, more nodes have to be accessed to retrieve a range of keys.
# OrderPreservingPartitioner (OPP): This partitioner distributes the key-value pairs in key order, so that similar keys are stored close together. The advantage is that fewer nodes have to be accessed for a range of keys. The drawback is the uneven distribution of the key-value pairs.
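The trade-off between the two partitioners can be illustrated with a small Python sketch of a token ring. The node names, ring tokens, and key format are hypothetical. The sketch assumes that the random partitioner derives a token from an MD5 hash of the key, that the order-preserving partitioner uses the key itself as the token, and that each node owns the range of tokens up to and including its own token.

```python
import bisect
import hashlib


def md5_token(key):
    """Token for the random partitioner: the MD5 digest read as an integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def owner(token, ring):
    """Return the node owning `token`.

    `ring` is a list of (token, node) pairs sorted by token; a node owns
    all tokens up to and including its own, wrapping around at the end.
    """
    tokens = [t for t, _ in ring]
    index = bisect.bisect_left(tokens, token) % len(ring)
    return ring[index][1]


MAX_TOKEN = 2**128 - 1  # largest value of a 128-bit MD5 digest

# Three hypothetical nodes with evenly spaced hash tokens ...
random_ring = [
    (MAX_TOKEN // 3, "node-a"),
    (2 * MAX_TOKEN // 3, "node-b"),
    (MAX_TOKEN, "node-c"),
]
# ... and the same nodes splitting the key space alphabetically.
ordered_ring = [("h", "node-a"), ("p", "node-b"), ("z", "node-c")]

keys = [f"user:{i:04d}" for i in range(100)]
random_owners = [owner(md5_token(k), random_ring) for k in keys]
ordered_owners = [owner(k, ordered_ring) for k in keys]
```

In this sketch every key shares the prefix <code>user:</code>, so the order-preserving ring places all of them on one node, while the hashed tokens spread the same keys across the ring.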


== Prominent users ==
== Cassandra Query Language ==
Cassandra Query Language (CQL) is the interface for accessing Cassandra, as an alternative to the traditional [[SQL|Structured Query Language]] (SQL). CQL adds an [[abstraction layer]] that hides implementation details of the underlying storage structure and provides native syntaxes for collections and other common encodings. Language drivers are available for [[Java (programming language)|Java]] ([[Java Database Connectivity|JDBC]]), [[Python (programming language)|Python]] (DBAPI2), [[Node.js|Node.JS]] ([[DataStax]]), [[Go (programming language)|Go]] (gocql), and [[C++]].<ref>{{cite web |title=DataStax C/C++ Driver for Apache Cassandra |url=https://github.com/datastax/cpp-driver |access-date=15 December 2014 |work=DataStax}}</ref>
* [[AppScale]] uses Cassandra as a back-end for Google App Engine applications<ref>{{cite web|url=http://appscale.cs.ucsb.edu/datastores.html#cassandra}}</ref>
* [[Cisco]]'s [[WebEx]] uses Cassandra to store user feed and activity in near real time.<ref name=CISCO>{{cite web|url=http://www.mail-archive.com/cassandra-dev@incubator.apache.org/msg01163.html |title=Re: Cassandra users survey |publisher=Mail-archive.com |date=2009-11-21 |accessdate=2010-03-29| archiveurl= http://web.archive.org/web/20100417083733/http://www.mail-archive.com/cassandra-dev@incubator.apache.org/msg01163.html| archivedate= 17 April 2010 <!--DASHBot-->| deadurl= no}}</ref>
* The [[CERN]] [[ATLAS experiment]] uses Cassandra to archive its online DAQ system's monitoring information<ref name=CERN-ATLAS>{{cite web|url=https://cdsweb.cern.ch/record/1432912 | title=A Persistent Back-End for the ATLAS Online Information Service (P-BEAST)}}</ref>
* [[Cloudkick]] uses Cassandra to store the server metrics of their users.<ref>[https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/ 4 Months with Cassandra, a love story | Cloudkick, manage servers better<!-- Bot generated title -->]</ref>
* [[Constant Contact]] uses Cassandra in their social media marketing application.<ref name=ConstantContact>{{cite web |url=http://www.readwriteweb.com/enterprise/2011/02/this-week-in-consolidation-hp.php |title=This Week in Consolidation: HP Buys Vertica, Constant Contact Buys Bantam Live and More |author=Klint Finley |publisher=Read Write Enterprise |date=2011-02-18}}</ref>
* [[Digg]], a large social news website, announced on September 9, 2009, that it was rolling out its use of Cassandra<ref name=NM2009>{{cite web
|url=http://blog.digg.com/?p=966
|title=Looking to the future with Cassandra
|author=Ian Eure
}}</ref> and confirmed this on March 8, 2010.<ref name=DG2010>{{cite web
|url=http://about.digg.com/node/564
|title=Saying Yes to NoSQL; Going Steady with Cassandra
|author=John Quinn
}}</ref> [[TechCrunch]] has since linked Cassandra to Digg v4 reliability criticisms and recent company struggles.<ref name=ES2010>{{cite web|url=http://techcrunch.com/2010/09/07/digg-struggles-vp-engineering-door/|title=As Digg Struggles, VP Of Engineering Is Shown The Door|author=Erick Schonfeld}}</ref> Lead engineers at Digg later rebutted these criticisms as a red herring and blamed a lack of load testing.<ref name=QU2010>{{cite web|url=http://www.quora.com/Is-Cassandra-to-blame-for-Digg-v4s-technical-failures/|title=Is Cassandra to Blame for Digg v4's Failures?}}</ref>
* [[Facebook]] used Cassandra to power Inbox Search, with over 200 nodes deployed.<ref name=FBIS /> This was abandoned in late 2010 when they built the Facebook Messaging platform on [[HBase]].<ref name="KM2010"/>
* [[IBM]] has done research in building a scalable email system based on Cassandra.<ref name=IBM>{{cite web|url=http://docs.google.com/viewer?url=http://ewh.ieee.org/r6/scv/computer//nfic/2009/IBM-Jun-Rao.pdf |title=Powered by Google Docs |publisher=Docs.google.com |date= |accessdate=2010-03-29}}</ref>
* [[InWorldz]] has researched and developed a scalable high-performance storage system for user inventory items based on Cassandra.<ref>{{cite web|last=Dexler|first=Tranquillity|title=InWorldz 0.7.0 R1542X|url=http://inworldz.com/forums/viewtopic.php?f=19&t=11628&p=87147#p87147|publisher=InWorldz LLC|accessdate=9 February 2012}}</ref>
* [[Netflix]] uses Cassandra as their back-end database for their streaming services<ref name=Netflix1>{{cite web|url=http://www.slideshare.net/adrianco/migrating-netflix-from-oracle-to-global-cassandra}}</ref><ref name=Netflix2>{{cite web|url=http://techblog.netflix.com/2011/01/nosql-at-netflix.html |date=2011-01-28 |author=Yury Izrailevsky |title=NoSQL at Netflix}}</ref>
* [[Formspring]] uses Cassandra to count responses, as well as store Social Graph data (followers, following, blockers, blocking) for 26 Million accounts with 10 million responses a day<ref>{{cite web|url=http://www.slideshare.net/martincozzi/cassandra-formspring |date=2011-08-31 |author=Martin Cozzi |title=Cassandra at Formspring}}</ref>
* [[Mahalo.com]] uses Cassandra to record user activity logs and topics for their Q&A website<ref name=Mahalo>{{cite web|title=http://www.datastax.com/wp-content/uploads/2011/06/DataStax-CaseStudy-Mahalo.pdf|url=http://www.datastax.com/wp-content/uploads/2011/06/DataStax-CaseStudy-Mahalo.pdf}}</ref><ref name=Mahalo2>[http://blip.tv/datastax/cassandra-at-mahalo-com-4030941 Watch Cassandra at Mahalo.com | DataStax Episodes | Blip<!-- Bot generated title -->]</ref>
* [[Ooyala]] built a scalable, flexible, real-time analytics engine using Cassandra<ref name=Ooyala>http://www.datastax.com/wp-content/uploads/2011/04/WP-Ooyala.pdf</ref>
* At [[Openwave]], Cassandra acts as a distributed database and serves as a distributed storage mechanism for Openwave’s next generation messaging platform<ref name=Openwave>http://www.datastax.com/wp-content/uploads/2011/05/DataStax-CaseStudy-Openwave.pdf</ref>
* [[OpenX (software)|OpenX]] is running over 130 nodes on Cassandra for their OpenX Enterprise product to store and replicate advertisements and targeting data for ad delivery<ref name=OpenX>[http://openx.com/publisher/technology Ad Serving Technology - Advanced Optimization, Forecasting, & Targeting | OpenX<!-- Bot generated title -->]</ref>
* [[Plaxo]] has "reviewed 3 billion contacts in [their] database, compared them with publicly available data sources, and identified approximately 600 million unique people with contact info."<ref name=Plaxo>{{cite web|url=http://blog.plaxo.com/2011/03/an-important-milestone-and-its-only-the-beginning/ |title=An important milestone - and it's only the beginning! |date=2011-03-20 |author=Preston Smalley}}</ref>
* [[PostRank]] uses Cassandra as their backend database<ref name=PostRank>{{cite web|url=http://blog.postrank.com/2011/03/webpulp-tv-scaling-postrank-with-ilya-grigorik/ |author=Ilya Grigorik |date=2011-03-29 |title=Webpulp TV: Scaling PostRank with Ilya Grigorik}}</ref>
* [[Rackspace]] is known to use Cassandra internally.<ref name=Rackspace>{{cite web|url=http://www.slideshare.net/stuhood/hadoop-and-cassandra-at-rackspace |title=Hadoop and Cassandra (at Rackspace) |publisher=Stu Hood |date=2010-04-23 |accessdate=2011-09-01}}</ref>
* [[Reddit]] switched to Cassandra from [[memcacheDB]] on March 12, 2010<ref name=REDDIT>{{cite web|author=Posted by david [ketralnis] |url=http://blog.reddit.com/2010/03/she-who-entangles-men.html |title=what's new on reddit: She who entangles men |publisher=blog.reddit |date=2010-03-12 |accessdate=2010-03-29| archiveurl= http://web.archive.org/web/20100325115755/http://blog.reddit.com/2010/03/she-who-entangles-men.html| archivedate= 25 March 2010 <!--DASHBot-->| deadurl= no}}</ref> and experienced some problems with overload handling in Cassandra in May.<ref name=REDDIT2>{{cite web|author= Posted by the reddit admins at |url=http://blog.reddit.com/2010/05/reddits-may-2010-state-of-servers.html |title=blog.reddit -- what's new on reddit: reddit's May 2010 "State of the Servers" report |publisher=blog.reddit |date=2010-05-11 |accessdate=2010-05-16| archiveurl= http://web.archive.org/web/20100514085008/http://blog.reddit.com/2010/05/reddits-may-2010-state-of-servers.html| archivedate= 14 May 2010 <!--DASHBot-->| deadurl= no}}</ref>
* [[RockYou]] uses Cassandra to record every single click for 50 million Monthly Active Users in real-time for their online games<ref name=RockYou>{{cite web |url=http://mysqldba.blogspot.com/2010/03/cassandra-is-my-nosql-solution-but.html |date=2011-03-23 |author=Dathan Vance Pattishall |title=Cassandra is my NoSQL Solution but}}</ref>
* [[SoundCloud]] uses Cassandra to store user account information<ref name=SoundCloud>{{cite web|url=http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/cassandra%20workshop%20berlin%20buzzword%202011-%20Soundcloud.pdf |title=Cassandra at SoundCloud}}</ref>
* [[Talentica Software]] uses Cassandra as the back end for an analytics application, running a 30-node cluster that ingests around 200 GB of data daily.<ref>[http://www.talentica.com Talentica Software]</ref>
* [[Twitter]] announced it is planning to use Cassandra because it can be run on large server clusters and is capable of taking in very large amounts of data at a time.<ref name=TWITTER>{{cite web|last=Popescu |first=Alex |url=http://nosql.mypopescu.com/post/407159447/cassandra-twitter-an-interview-with-ryan-king |title=Cassandra @ Twitter: An Interview with Ryan King |publisher=myNoSQL |date= |accessdate=2010-03-29| archiveurl= http://web.archive.org/web/20100301151656/http://nosql.mypopescu.com/post/407159447/cassandra-twitter-an-interview-with-ryan-king| archivedate= 1 March 2010 <!--DASHBot-->| deadurl= no}}</ref><ref name=TWITTER2>{{cite web|last=Babcock |first=Charles |url=http://www.informationweek.com/news/software/open_source/showArticle.jhtml?articleID=223100894&pgno=1&queryText=&isPrev= |title=Twitter Drops MySQL For Cassandra - Cloud databases |publisher=InformationWeek |date= |accessdate=2010-03-29| archiveurl= http://web.archive.org/web/20100402075726/http://www.informationweek.com/news/software/open_source/showArticle.jhtml?articleID=223100894&pgno=1&queryText=&isPrev=| archivedate= 2 April 2010 <!--DASHBot-->| deadurl= no}}</ref> Twitter continues to use it but not for Tweets themselves.<ref>{{cite web|url=http://engineering.twitter.com/2010/07/cassandra-at-twitter-today.html|title=Cassandra at Twitter Today}}</ref>
* [[Urban Airship]] uses Cassandra with the mobile service hosting for over 160 million application installs across 80 million unique devices<ref name=UrbanAirship>{{cite web|url=http://www.slideshare.net/eonnen/from-100s-to-100s-of-millions |title=From 100s to 100s of Millions |author=Erik Onnen}}</ref>
* [[@WalmartLabs]]<ref>[http://www.walmartlabs.com Walmart Labs]</ref> (previously [[Kosmix]]) uses Cassandra with SSD<ref name=kosmix>{{cite web|url=http://blog.kosmix.com/2011/01/21/cassandra-on-ssd/ |title=Cassandra on SSD |author=Karl Mueller}}</ref>
* [[Yakaz]] uses Cassandra on a five-node cluster to store millions of images as well as its social data.<ref name=Yakaz>{{cite web|url=http://www.yakaz.com/about/technologies.php |title=Yakaz Technologies}}</ref>
* [[Zoho]] uses Cassandra to generate the inbox preview in its [[Zoho#Zoho_Mail|Zoho Mail]] service.
[[Facebook]] moved off its early Cassandra deployment in late 2010 when it replaced Inbox Search with the Facebook Messaging platform.<ref name="KM2010"/> Facebook never deployed an Apache Cassandra release.


A keyspace in Cassandra is a namespace that defines how data is replicated across nodes; replication is therefore configured at the keyspace level. Below is an example that creates a keyspace and a table (column family) in CQL 3.0:<ref>{{cite web |title=CQL |url=https://cassandra.apache.org/doc/cql3/CQL.html |url-status=dead |archive-url=https://web.archive.org/web/20160113141740/http://cassandra.apache.org/doc/cql3/CQL.html |archive-date=13 January 2016 |access-date=5 January 2016}}</ref><syntaxhighlight lang="mysql">
== Tools for Cassandra ==
CREATE KEYSPACE MyKeySpace
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };


USE MyKeySpace;
The Cassandra distribution includes built-in tools such as cassandra-cli and nodetool.


CREATE COLUMNFAMILY MyColumns (id text, lastName text, firstName text, PRIMARY KEY(id));
Third-party tools are also available, including the following:<ref>[http://wiki.apache.org/cassandra/FAQ#gui FAQ] on Cassandra's wiki</ref>


INSERT INTO MyColumns (id, lastName, firstName) VALUES ('1', 'Doe', 'John');
'''Data browsers'''
* [http://github.com/driftx/chiton chiton], a GTK data browser.
* [http://code.google.com/p/cassandra-gui cassandra-gui], a Swing data browser.
* [http://www.quest.com/toad-for-cloud-databases/ Toad for Cloud Databases], an Eclipse plug-in data browser
'''Administration tools'''
* [http://www.datastax.com/products/opscenter OpsCenter], a tool for managing and monitoring a Cassandra cluster. The Community Edition is free to download and use; an Enterprise Edition includes additional features.
* [https://github.com/sebgiroux/Cassandra-Cluster-Admin Cassandra Cluster Admin], a GUI tool for administering an Apache Cassandra cluster and modifying its content, similar to phpMyAdmin for MySQL.


SELECT * FROM MyColumns;
=== Client interfaces and language Support ===
</syntaxhighlight>
Which gives:
<syntaxhighlight lang="text">
id | lastName | firstName
----+----------+----------
1 | Doe | John


(1 rows)
Cassandra has high-level client libraries for many languages, including Python, Java, .NET, Ruby, PHP, Perl, and C++.<ref>[http://wiki.apache.org/cassandra/ClientOptions "Client Options" article] on Cassandra Wiki</ref>
</syntaxhighlight>


== Distributed Architecture ==
For a detailed list of client software go to [http://wiki.apache.org/cassandra/ClientOptions "Client Options" article] on Cassandra Wiki
=== Gossip Protocol ===
Cassandra uses a peer-to-peer gossip protocol for cluster communication. Nodes routinely exchange information about cluster state, including:
* Node availability status
* Schema versions
* Generation timestamps (node bootstrap time)
* Version numbers (logical clock values)


The system uses [[vector clock]]s to track information currency and ignore outdated state data.<ref name="carpenter2022" />
=== Integration with other tools ===


=== Seed Nodes ===
The [[MariaDB]] developers have created a storage engine that allows MariaDB or MySQL to use Cassandra as a data source.<ref>[https://kb.askmonty.org/en/cassandra-storage-engine/ Cassandra Storage Engine Documentation]</ref>
The architecture designates certain nodes as "seed" nodes that:
* Bootstrap the cluster
* Serve as guaranteed gossip communication points
* Prevent cluster fragmentation
* Remain discoverable via service discovery methods


This design eliminates single points of failure while maintaining cluster-wide consistency of operational knowledge.<ref name="carpenter2022" />
Other tools include '''Solandra''',<ref>[https://github.com/tjake/Solandra Solandra source at GitHub]</ref> a Cassandra back end for [http://lucene.apache.org/solr/ Apache Solr], a web application built around Lucene for full-text indexing and search.


=== Fault Tolerance ===
For monitoring, Cassandra integrates with [[Ganglia (software)|Ganglia]],<ref>[http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf Cassandra - A Decentralized Structured Storage System], a 2009 paper presenting Cassandra by its creators Avinash Lakshman and Prashant Malik</ref> and plugins are available for other monitoring systems such as [[Nagios]].
Cassandra employs the Phi Accrual Failure Detector to manage node failures during cluster operation.<ref>{{cite conference
| title = The Φ Accrual Failure Detector
| first1 = Naohiro | last1 = Hayashibara
| first2 = Xavier | last2 = Défago
| first3 = Rami | last3 = Yared
| first4 = Takuya | last4 = Katayama
| book-title = IEEE Symposium on Reliable Distributed Systems
| year = 2004
| pages = 66–78
| doi = 10.1109/RELDIS.2004.1353004
}}</ref> Through this system, each node independently assesses the availability of other nodes during gossip communication. When a node fails to respond, it is "convicted" and removed from write operations, though it can rejoin the cluster upon resuming heartbeat signals.<ref name="carpenter2022" />


To maintain data integrity during node outages, Cassandra uses a "hinted handoff" mechanism. When writing to an offline node, the coordinator node temporarily stores the write data as a "hint." Once the offline node returns to service, these hints are forwarded to restore data consistency. Notably, Cassandra only permanently removes nodes through explicit administrative decommissioning or rebuilding, preventing temporary communication failures or restarts from triggering unnecessary data rebalancing.<ref name="carpenter2022" />
== See also ==
{{Portal|Free software}}


==Management and monitoring==
* [[NoSQL]]
Cassandra is a Java-based system that can be managed and monitored via [[Java Management Extensions]] (JMX). The JMX-compliant ''Nodetool'' utility, for instance, can be used to manage a Cassandra cluster.<ref>{{cite web|title=NodeTool|url=https://wiki.apache.org/cassandra/NodeTool|website=Cassandra Wiki|access-date=5 January 2016|archive-url=https://web.archive.org/web/20160113122938/http://wiki.apache.org/cassandra/NodeTool|archive-date=13 January 2016|url-status=dead}}</ref> Nodetool also offers a number of commands to return Cassandra metrics pertaining to disk usage, latency, compaction, garbage collection, and more.<ref>{{cite web|title=How to monitor Cassandra performance metrics|date=3 December 2015|url=https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/|publisher=Datadog|access-date=5 January 2016}}</ref>
* [[Berkeley_db|Berkeley DB]]

* [[MongoDB]] - The most-popular NoSQL database
Since the release of Cassandra 2.0.2 in 2013, measures of several metrics are produced via the Dropwizard metrics framework,<ref>{{cite web|title=Metrics|url=https://wiki.apache.org/cassandra/Metrics|website=Cassandra Wiki|access-date=5 January 2016|archive-date=12 November 2015|archive-url=https://web.archive.org/web/20151112112756/http://wiki.apache.org/cassandra/Metrics|url-status=dead}}</ref> and may be queried via JMX using tools such as [[JConsole]] or passed to external monitoring systems via Dropwizard-compatible reporter plugins.<ref>{{cite web|title=Monitoring|url=http://cassandra.apache.org/doc/latest/operating/metrics.html|website=Cassandra Documentation|access-date=1 February 2018}}</ref>
* [[BigTable]] - Original distributed database by Google

== Releases ==
Releases after graduation include:
{| class="wikitable"
|-
! Version
! Original release date
! Latest version
! Release date
! Status<ref>
{{cite web |title=Cassandra Server Releases |url=http://cassandra.apache.org/download/ |access-date=15 December 2015 |work=cassandra.apache.org}}
</ref>
|-
| {{Version|o|0.6}}
| 2010-04-12
| 0.6.13
| 2011-04-18
| No longer maintained
|-
| {{Version|o|0.7}}
| 2011-01-10
| 0.7.10
| 2011-10-31
| No longer maintained
|-
| {{Version|o|0.8}}
| 2011-06-03
| 0.8.10
| 2012-02-13
| No longer maintained
|-
| {{Version|o|1.0}}
| 2011-10-18
| 1.0.12
| 2012-10-04
| No longer maintained
|-
| {{Version|o|1.1}}
| 2012-04-24
| 1.1.12
| 2013-05-27
| No longer maintained
|-
| {{Version|o|1.2}}
| 2013-01-02
| 1.2.19
| 2014-09-18
| No longer maintained
|-
| {{Version|o|2.0}}
| 2013-09-03
| 2.0.17
| 2015-09-21
| No longer maintained
|-
| {{Version|o|2.1}}
| 2014-09-16
| 2.1.22
| 2020-08-31
| No longer maintained
|-
| {{Version|o|2.2}}
| 2015-07-20
| 2.2.19
| 2020-11-04
| No longer maintained
|-
| {{Version|o|3.0}}
| 2015-11-09
| 3.0.29
| 2023-05-15
| No longer maintained
|-
| {{Version|o|3.11}}
| 2017-06-23
| 3.11.15
| 2023-05-05
| No longer maintained
|-
| {{Version|co|4.0}}
| 2021-07-26
| 4.0.13
| 2023-05-20
| Maintained until 5.1.0 release
|-
| {{Version|co|4.1}}
| 2022-06-17
| 4.1.6
| 2024-08-19
| Maintained until 5.2.0 release
|-
| {{Version|c|5.0}}
| 2024-09-05
| 5.0.2
| 2024-10-19
| Latest release. Maintained until 5.3.0 release
|-

| colspan="5" | <small>{{Version|l|show=111110}}</small>
|}
<!-- o=Old-Not-Supported; co=Old-Still-Supported; c=Latest-Stable; cp=Preview; p=Planned-Future -->

== See also ==
{{Portal|Free and open-source software}}
* [[Bigtable]] – Original distributed database by Google
* [[Distributed database]]
* [[Distributed database]]
* [[Distributed hash table]] (DHT)
* [[Distributed hash table]] (DHT)
* [[Dynamo (storage system)]] - Cassandra borrows many elements from Dynamo
* [[Dynamo (storage system)]] Cassandra borrows many elements from Dynamo
* [[HBase|Apache HBase]] - [[Hadoop|Apache Hadoop]] based distributed database. Very similar to BigTable
* [[Hypertable]] - [[Hadoop|Apache Hadoop]] based distributed database. Very similar to BigTable
* [[Riak]]


==References==
== References ==
{{Reflist|2}}
{{Reflist|30em}}


==Bibliography==
==Bibliography==
{{refbegin}}
{{refbegin}}
* {{cite book
* {{cite book
| first1 = Eben
| first1 = Jeff
| last1 = Hewitt
| last1 = Carpenter
| date = December 15, 2010
| first2 = Eben
| last2 = Hewitt
| date = January 23, 2022
| title = Cassandra: The Definitive Guide
| title = Cassandra: The Definitive Guide
| publisher = [[O'Reilly Media]]
| publisher = [[O'Reilly Media]]
| edition = 1st
| edition = 3rd
| page = 300
| page = 432
| isbn = 978-1-4493-9041-9
| isbn = 978-1-4920-9710-5
| url = http://oreilly.com/catalog/0636920010852
}}
}}
* {{cite book
* {{cite book
Line 213: Line 312:
| edition = 1st
| edition = 1st
| page = 324
| page = 324
| isbn = 1-84951-512-3
| isbn = 978-1-84951-512-2
| url = http://www.packtpub.com/cassandra-apache-high-performance-cookbook/book
| url = http://www.packtpub.com/cassandra-apache-high-performance-cookbook/book
}}
* {{cite book
| first1 = Eben
| last1 = Hewitt
| date = December 15, 2010
| title = Cassandra: The Definitive Guide
| publisher = [[O'Reilly Media]]
| edition = 1st
| page = 300
| isbn = 978-1-4493-9041-9
| url = http://shop.oreilly.com/product/0636920010852.do
}}
}}
{{refend}}
{{refend}}


==External links==
==External links==
{{Commons category}}
{{external links|date=January 2012}}
{{Wikiversity|Big Data/Cassandra}}
* {{cite web
|title=Cassandra - A structured storage system on a P2P Network
* {{cite web |title=Cassandra - A structured storage system on a P2P Network |url=https://www.facebook.com/note.php?note_id=24413138919&id=9445547199&index=9 |first=Avinash |last=Lakshman
|date=2008-08-25 |access-date=2014-06-17 |publisher=Engineering @ Facebook's Notes}}
|url=http://www.facebook.com/note.php?note_id=24413138919&id=9445547199&index=9
* {{cite web |url=https://cassandra.apache.org/ |title=The Apache Cassandra Project |access-date=2014-06-17 |publisher=[[Apache Software Foundation|The Apache Software Foundation]] |location=Forest Hill, MD, USA}}
|author=Avinash Lakshman
* {{cite web |url=https://wiki.apache.org/cassandra/ |title=Project Wiki |access-date=2014-06-17 |publisher=[[Apache Software Foundation|The Apache Software Foundation]] |location=Forest Hill, MD, USA |archive-url=https://web.archive.org/web/20140614175405/http://wiki.apache.org/cassandra/ |archive-date=2014-06-14 |url-status=dead }}
|date=25 August 2008
* {{cite web |url=http://www.infoq.com/presentations/Adopting-Apache-Cassandra |title=Adopting Apache Cassandra |first=Eben |last=Hewitt |date=2010-12-01 |access-date=2014-06-17 |website=infoq.com |publisher=InfoQ, C4Media Inc}}
|accessdate=2009-06-04
* {{cite web |first1=Avinash |last1=Lakshman |first2=Prashant |last2=Malik |url=https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf |title=Cassandra - A Decentralized Structured Storage System |website=cs.cornell.edu |date=2009-08-15 |access-date=2014-06-17 |others=The authors are from [[Facebook]]}}
|publisher=Engineering @ Facebook's Notes
* {{cite web |url=http://www.slideshare.net/jbellis/what-every-developer-should-know-about-database-scalability |title=What Every Developer Should Know About Database Scalability |first=Jonathan |last=Ellis |date=2009-07-29 |access-date=2014-06-17 |website=slideshare.net}} From the [[O'Reilly Open Source Convention|OSCON]] 2009 talk on RDBMS vs. Dynamo, Bigtable, and Cassandra.
}}
* {{cite web |url=https://code.google.com/p/cassandra-rpm/ |title=Cassandra-RPM - Red Hat Package Manager (RPM) build for the Apache Cassandra project |website=code.google.com |access-date=2014-06-17 |publisher=[[Google Code#Project hosting|Google Project Hosting]] |location=Menlo Park, CA, USA}}
* [http://cassandra.apache.org/ Project Website]
* {{cite web |url=http://de.slideshare.net/grro/cassandra-by-example-the-path-of-read-and-write-requests |title=Cassandra by example - the path of read and write requests |first=Gregor |last=Roth |date=2012-10-14|access-date=2014-06-17 |website=slideshare.net}}
* [http://wiki.apache.org/cassandra/ Project Wiki]
* {{cite web |url=http://10kloc.wordpress.com/category/cassandra-2/ |title=A collection of Cassandra tutorials |first=Umer |last=Mansoor |date=2012-11-04 |access-date=2015-02-08}}
* [http://www.infoq.com/presentations/Adopting-Apache-Cassandra Adopting Apache Cassandra] presented by Eben Hewitt on December 1, 2010
* {{cite web|url=http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html |title=A vendor-independent comparison of NoSQL databases: Cassandra, HBase, MongoDB, Riak |first=Sergey |last=Bushik |date=2012-10-22 |work=[[Network World|NetworkWorld]] |publisher=[[International Data Group|IDG]] |location=Framingham, MA, USA and Staines, Middlesex, UK |access-date=2014-06-17 |url-status=dead |archive-url=https://web.archive.org/web/20140528110238/http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html |archive-date=2014-05-28 }}
* [http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf LADIS 2009 WhitePaper by the original contributors Avinash Lakshman & Prashant Malik]
* [http://www.nosqldatabases.com/main/tag/cassandra Cassandra Articles on NoSQLDatabases.com]
* [http://nosql.mypopescu.com/tagged/cassandra Cassandra News and Articles on myNoSQL]
* [http://nosql.mypopescu.com/post/407159447/cassandra-twitter-an-interview-with-ryan-king Cassandra @ Twitter: an Interview with Ryan King]
* [http://www.slideshare.net/jbellis/what-every-developer-should-know-about-database-scalability Presentation on RDBMS vs. Dynamo, BigTable, and Cassandra]
* [http://code.google.com/p/cassandra-rpm/ RPM build for the apache cassandra project]
* [http://de.slideshare.net/grro/cassandra-by-example-the-path-of-read-and-write-requests Cassandra by example - the path of read and write requests]


{{Apache Software Foundation}}
{{apache}}
{{Facebook navbox}}
{{Facebook navbox}}


[[Category:2008 software]]
{{DEFAULTSORT:Cassandra (Database)}}
[[Category:Apache Software Foundation]]
[[Category:Apache Software Foundation projects]]
[[Category:Apache Software Foundation projects]]
[[Category:BigTable implementations]]
[[Category:Big data products]]
[[Category:Bigtable implementations]]
[[Category:Column-oriented DBMS software for Linux]]
[[Category:Distributed data stores]]
[[Category:Distributed data stores]]
[[Category:Facebook software]]
[[Category:Free database management systems]]
[[Category:Free database management systems]]
[[Category:NoSQL]]
[[Category:Structured storage]]
[[Category:Structured storage]]
[[Category:NoSQL]]

Latest revision as of 11:44, 15 December 2024

Apache Cassandra
Original author(s): Avinash Lakshman, Prashant Malik / Facebook
Developer(s): Apache Software Foundation
Initial release: July 2008
Stable release: 5.0.2[1] / October 19, 2024
Written in: Java
Operating system: Cross-platform
Available in: English
Type: NoSQL Database, data store
License: Apache License 2.0
Website: cassandra.apache.org

Apache Cassandra is a free and open-source database management system designed to handle large volumes of data across multiple commodity servers. The system prioritizes availability and scalability over consistency, making it particularly suited for systems with high write throughput requirements due to its LSM tree indexing storage layer.[2] As a wide-column database, Cassandra supports flexible schemas and efficiently handles data models with numerous sparse columns. The system is optimized for applications with well-defined data access patterns that can be incorporated into the schema design.[2] Cassandra supports computer clusters which may span multiple data centers,[3] featuring asynchronous and masterless replication. It enables low-latency operations for all clients and incorporates Amazon's Dynamo distributed storage and replication techniques, combined with Google's Bigtable data storage engine model.[4]

History

Avinash Lakshman, a co-author of Amazon's Dynamo, and Prashant Malik developed Cassandra at Facebook to support the inbox search functionality. Facebook released Cassandra as open-source software on Google Code in July 2008.[5] In March 2009, it became an Apache Incubator project[6] and on February 17, 2010, it graduated to a top-level project.[7]

The developers at Facebook named their database after Cassandra, the mythological Trojan prophetess, referencing her curse of making prophecies that were never believed.[8]

Features and Limitations

Cassandra uses a distributed architecture where all nodes perform identical functions, eliminating single points of failure. The system employs configurable replication strategies to distribute data across clusters, providing redundancy and disaster recovery capabilities. The system is capable of linear scaling, which increases read and write throughput with the addition of new nodes, while maintaining continuous service.

Under the CAP theorem, Cassandra is categorized as an AP system, emphasizing availability and partition tolerance over consistency. While it offers tunable consistency levels for both read and write operations, its architecture makes it less suitable for use cases requiring strict consistency guarantees.[2] Additionally, Cassandra's compatibility with Hadoop and related tools allows for integration with existing big data processing workflows. Eventual consistency is maintained using tombstones to manage reads, upserts, and deletes.
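
The effect of tunable consistency levels can be illustrated with a little arithmetic. The sketch below is not Cassandra code; the function names are hypothetical, and it only encodes the commonly cited rule that a read is guaranteed to overlap the most recent acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_overlaps_write(read_replicas: int, write_replicas: int,
                        replication_factor: int) -> bool:
    """True when every read set must share a replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
# Quorum reads combined with quorum writes always overlap.
print(quorum(rf), read_overlaps_write(quorum(rf), quorum(rf), rf))
# Reading and writing a single replica each gives no such guarantee.
print(read_overlaps_write(1, 1, rf))
```

Lower levels trade this overlap for latency and availability, which is the tuning choice the paragraph above refers to.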

The system's query capabilities have notable limitations. Cassandra does not support advanced query patterns such as multi-table JOINs, ad hoc aggregations, or complex queries.[2] These limitations stem from its distributed architecture, which optimizes for scalability and availability rather than complex query operations.

Data model

As a wide-column store, Cassandra combines features of both key-value and tabular database systems. It implements a partitioned row store model with adjustable consistency levels.[9] The following table compares Cassandra and relational database management systems (RDBMS).

Data Model Comparison: Cassandra vs RDBMS

Feature        | Cassandra                    | RDBMS
---------------+------------------------------+--------------------------
Organization   | Keyspace → Table → Row       | Database → Table → Row
Row Structure  | Dynamic columns              | Fixed schema
Column Data    | Name, type, value, timestamp | Name, type, value
Schema Changes | Runtime modifications        | Usually requires downtime
Data Model     | Denormalized                 | Normalized with JOINs

The data model consists of several hierarchical components:

Keyspace

A keyspace in Cassandra is analogous to a database in relational systems. It contains multiple tables and manages configuration information, including replication strategy and user-defined types (UDTs).[2]

Tables

Tables (formerly called column families prior to CQL 3) are containers for rows of data. Each table has a name and configuration information for its stored data. Tables may be created, dropped, or altered at run-time without blocking updates and queries.[10]

Rows and Columns

Each row is identified by a primary key and contains columns. The first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key.[11]

Columns contain data belonging to a row and consist of:

  • A name
  • A type
  • A value
  • Timestamp metadata (used for write conflict resolution via "last write wins")

Unlike traditional RDBMS tables, rows within the same table can have varying columns, providing a flexible structure. This flexibility distinguishes Cassandra from relational databases, as not all columns need to be specified for each row.[2] Other columns may be indexed separately from the primary key.[12]
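
The "last write wins" rule for column timestamps can be sketched in a few lines of Python. This is an illustrative model only: the `Cell` type and `merge_rows` function are hypothetical names, timestamps are plain integers, and ties simply keep the first value seen, which is a simplification.

```python
from typing import Any, NamedTuple


class Cell(NamedTuple):
    value: Any
    timestamp: int  # write time supplied with the mutation


def merge_rows(*versions: dict) -> dict:
    """Merge several versions of one row, column by column.

    For each column name, the cell with the highest timestamp wins.
    """
    merged = {}
    for row in versions:
        for name, cell in row.items():
            current = merged.get(name)
            if current is None or cell.timestamp > current.timestamp:
                merged[name] = cell
    return merged


# Two versions of the same row; the second updates only one column.
older = {"firstName": Cell("John", 100), "lastName": Cell("Doe", 100)}
newer = {"lastName": Cell("Smith", 200)}
print(merge_rows(older, newer))
```

Because resolution happens per column, the merged row keeps `firstName` from the older version and `lastName` from the newer one. It also shows why rows in the same table may carry different sets of columns.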

Storage Model

Cassandra uses a Log Structured Merge Tree (LSM tree) index to optimize write throughput, in contrast to the B-tree indexes used by most databases.[2]

Storage Model Comparison: Cassandra vs RDBMS

Feature            | Cassandra                     | RDBMS
-------------------+-------------------------------+----------------------------
Index Structure    | LSM Tree                      | B-Tree
Write Process      | Append-only with Memtable     | In-place updates
Storage Components | Commit Log, Memtable, SSTable | Data files, Transaction Log
Update Strategy    | New entry for each change     | Modify existing data
Delete Handling    | Tombstone markers             | Direct removal
Read Optimization  | Secondary                     | Primary
Write Optimization | Primary                       | Secondary

The storage architecture consists of three main components:[2]

Core Components

  • Commit Log: A write-ahead log that ensures write durability
  • Memtable: An in-memory data structure that stores writes, sorted by primary key
  • SSTable (Sorted String Table): Immutable files containing data flushed from Memtables

Write and Read Processes

Write operations follow a two-stage process:

  1. The write is recorded in the commit log and added to the Memtable
  2. When the Memtable reaches size or time thresholds, it flushes to an SSTable

Read operations:

  1. Check Memtable for latest data
  2. Search SSTables from newest to oldest using bloom filters for efficiency
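
The write and read paths above can be modeled with a toy in-memory structure. The following is a minimal sketch under stated assumptions, not Cassandra's implementation: `TinyLSM` and its fields are hypothetical, the flush threshold counts keys rather than bytes or time, and a plain Python set stands in for a real Bloom filter (which, unlike a set, can return false positives).

```python
class TinyLSM:
    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # append-only record of unflushed writes
        self.memtable = {}     # in-memory table, newest value per key
        self.sstables = []     # flushed immutable tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # stage 1: log for durability
        self.memtable[key] = value            # stage 1: update the Memtable
        if len(self.memtable) >= self.memtable_limit:
            self._flush()                     # stage 2: flush to an SSTable

    def _flush(self):
        rows = dict(sorted(self.memtable.items()))  # sorted by key
        self.sstables.append({"rows": rows, "filter": frozenset(rows)})
        self.memtable = {}
        self.commit_log.clear()  # flushed writes no longer need replaying

    def read(self, key):
        if key in self.memtable:               # 1. check the Memtable
            return self.memtable[key]
        for table in reversed(self.sstables):  # 2. newest SSTable first
            if key in table["filter"]:         # skip tables lacking the key
                return table["rows"][key]
        return None


lsm = TinyLSM(memtable_limit=2)
lsm.write("a", 1)
lsm.write("b", 2)   # second key reaches the limit and triggers a flush
lsm.write("a", 3)   # newer value lives in the Memtable
print(lsm.read("a"), lsm.read("b"), lsm.read("z"))
```

Writes never modify an existing SSTable, which is what makes the write path append-only. Reads pay for that by consulting several structures.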

Data Management

Tombstones

Every operation (create/update/delete) generates a new entry, with deletes handled via "tombstones". While common in many databases, tombstones can cause performance degradation in delete-heavy workloads.[13]

Compaction

Compaction consolidates multiple SSTables to:

  • Reduce storage usage
  • Remove deleted row tombstones
  • Improve read performance
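
A compaction pass can be sketched as a merge that keeps the newest entry per key and then discards tombstones. The names below (`TOMBSTONE`, `compact`) are hypothetical, and the sketch drops tombstones immediately; a real deployment retains them for a configurable grace period so that deletes can reach every replica first.

```python
TOMBSTONE = object()  # marker written in place of a deleted value


def compact(sstables):
    """Merge SSTables given as dicts of key -> (timestamp, value).

    Keeps only the newest entry per key, then removes keys whose
    newest entry is a tombstone.
    """
    newest = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    return {key: newest[key] for key in sorted(newest)
            if newest[key][1] is not TOMBSTONE}


old_table = {"a": (1, "x"), "b": (1, "y")}
new_table = {"a": (2, "z"), "b": (3, TOMBSTONE)}  # update a, delete b
print(compact([old_table, new_table]))
```

Four stored entries collapse into one, which illustrates both the storage saving and why later reads have fewer tables to consult.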

Cassandra Query Language

Cassandra Query Language (CQL) is the interface for accessing Cassandra, as an alternative to the traditional Structured Query Language (SQL). CQL adds an abstraction layer that hides the implementation details of Cassandra's storage structure and provides native syntax for collections and other common encodings. Language drivers are available for Java (JDBC), Python (DBAPI2), Node.js (DataStax), Go (gocql), and C++.[14]

A keyspace in Cassandra is a namespace that defines how data is replicated across nodes; replication is therefore configured at the keyspace level. Below is an example that creates a keyspace and a table (column family) in CQL 3.0:[15]

CREATE KEYSPACE MyKeySpace
  WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

USE MyKeySpace;

CREATE COLUMNFAMILY MyColumns (id text, lastName text, firstName text, PRIMARY KEY(id));

INSERT INTO MyColumns (id, lastName, firstName) VALUES ('1', 'Doe', 'John');

SELECT * FROM MyColumns;

Which gives:

 id | lastName | firstName
----+----------+----------
  1 | Doe      | John

(1 rows)
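
To show what `'replication_factor' : 3` with `SimpleStrategy` means for a single row, the sketch below places a partition key on a token ring and picks the owning node plus the next nodes clockwise. It is an illustration under assumptions: node names are made up, each node has a single token, and an MD5-derived integer stands in for the partitioner's hash.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Stand-in hash (assumption): first 8 bytes of MD5 as an integer."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def replicas_for(partition_key: str, ring, replication_factor: int):
    """Return the nodes that store the key.

    ring is a list of (token, node) pairs sorted by token. The first
    replica is the first node whose token is >= the key's token, and
    the remaining replicas are the next distinct nodes clockwise.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token(partition_key)) % len(ring)
    chosen = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == replication_factor:
            break
    return chosen


nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster
ring = sorted((token(name), name) for name in nodes)
print(replicas_for("1", ring, replication_factor=3))   # row with id '1'
```

The row inserted in the example above would be stored on three of the five nodes, and any of them can serve it if the others are unavailable.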

Distributed Architecture

Gossip Protocol

Cassandra uses a peer-to-peer gossip protocol for cluster communication. Nodes routinely exchange information about cluster state, including:

  • Node availability status
  • Schema versions
  • Generation timestamps (node bootstrap time)
  • Version numbers (logical clock values)

The system uses vector clocks to track information currency and ignore outdated state data.[2]
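
How a node might reconcile gossiped state can be sketched by comparing the generation and version numbers listed above. The `EndpointState` type and `merge_views` function are hypothetical; the sketch assumes that a higher generation (a more recent bootstrap) always supersedes a lower one, and that within a generation the higher version is newer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointState:
    generation: int  # bootstrap time of the node
    version: int     # logical clock, increases while the node runs
    status: str      # e.g. "UP" or "DOWN"


def merge_views(local: dict, remote: dict) -> dict:
    """Combine two cluster views, keeping the freshest state per node."""
    merged = dict(local)
    for node, state in remote.items():
        known = merged.get(node)
        if known is None or (state.generation, state.version) > (
                known.generation, known.version):
            merged[node] = state
    return merged


local_view = {"n1": EndpointState(100, 7, "UP"),
              "n2": EndpointState(100, 3, "UP")}
remote_view = {"n2": EndpointState(100, 5, "DOWN"),   # newer version
               "n3": EndpointState(120, 1, "UP")}     # previously unknown
print(merge_views(local_view, remote_view))
```

Stale information is ignored simply because it compares lower, so nodes can exchange views in any order and still converge.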

Seed Nodes

The architecture designates certain nodes as "seed" nodes that:

  • Bootstrap the cluster
  • Serve as guaranteed gossip communication points
  • Prevent cluster fragmentation
  • Remain discoverable via service discovery methods

This design eliminates single points of failure while maintaining cluster-wide consistency of operational knowledge.[2]

Fault Tolerance

Cassandra employs the Phi Accrual Failure Detector to manage node failures during cluster operation.[16] Through this system, each node independently assesses the availability of other nodes during gossip communication. When a node fails to respond, it is "convicted" and removed from write operations, though it can rejoin the cluster upon resuming heartbeat signals.[2]
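
The idea behind an accrual failure detector can be sketched as follows. This is a simplified variant, not the cited paper's exact method: it assumes heartbeat intervals are exponentially distributed, so the suspicion level is phi = elapsed / (mean interval × ln 10). The class name, the threshold of 8, and the window size are illustrative choices.

```python
import math
from collections import deque


class PhiAccrualDetector:
    def __init__(self, threshold: float = 8.0, window: int = 1000):
        self.threshold = threshold
        self.intervals = deque(maxlen=window)  # recent heartbeat gaps
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat observed at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: -log10 of the chance the node is merely late."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        return (now - self.last_heartbeat) / (mean * math.log(10))

    def is_convicted(self, now: float) -> bool:
        return self.phi(now) > self.threshold


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):    # one heartbeat per second
    detector.heartbeat(t)
print(detector.is_convicted(4.0))   # one second of silence
print(detector.is_convicted(30.0))  # 27 seconds of silence
```

Rather than a fixed timeout, suspicion grows continuously with silence and adapts to the observed heartbeat rhythm, so a slow network raises the bar for conviction.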

To maintain data integrity during node outages, Cassandra uses a "hinted handoff" mechanism. When writing to an offline node, the coordinator node temporarily stores the write data as a "hint." Once the offline node returns to service, these hints are forwarded to restore data consistency. Notably, Cassandra only permanently removes nodes through explicit administrative decommissioning or rebuilding, preventing temporary communication failures or restarts from triggering unnecessary data rebalancing.[2]
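
The hinted handoff flow can be modeled with a small in-memory sketch. The `Replica` and `Coordinator` classes are hypothetical, every replica receives every write, and details such as hint expiry and consistency-level checks are omitted.

```python
class Replica:
    def __init__(self, name: str):
        self.name = name
        self.online = True
        self.data = {}


class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = {}  # replica name -> list of missed (key, value) writes

    def write(self, key, value) -> None:
        for replica in self.replicas:
            if replica.online:
                replica.data[key] = value
            else:
                # Store the write as a hint instead of failing it.
                self.hints.setdefault(replica.name, []).append((key, value))

    def replay_hints(self, replica: Replica) -> None:
        """Forward stored hints once the replica is reachable again."""
        for key, value in self.hints.pop(replica.name, []):
            replica.data[key] = value


a, b, c = Replica("a"), Replica("b"), Replica("c")
coordinator = Coordinator([a, b, c])
c.online = False
coordinator.write("user:1", "John Doe")   # c misses this write
c.online = True
coordinator.replay_hints(c)               # c catches up
print(c.data)
```

The offline replica stays a member of the cluster throughout; it only receives the missed writes later, which matches the point above about avoiding unnecessary rebalancing.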

Management and monitoring

Cassandra is a Java-based system that can be managed and monitored via Java Management Extensions (JMX). The JMX-compliant Nodetool utility, for instance, can be used to manage a Cassandra cluster.[17] Nodetool also offers a number of commands to return Cassandra metrics pertaining to disk usage, latency, compaction, garbage collection, and more.[18]

Since the release of Cassandra 2.0.2 in 2013, several metrics have been produced via the Dropwizard metrics framework;[19] they can be queried via JMX using tools such as JConsole or passed to external monitoring systems via Dropwizard-compatible reporter plugins.[20]

Releases

Releases after graduation include:

Version | Original release date | Latest version | Release date | Status[21]
--------+-----------------------+----------------+--------------+-----------------------------------------------
0.6     | 2010-04-12            | 0.6.13         | 2011-04-18   | No longer maintained
0.7     | 2011-01-10            | 0.7.10         | 2011-10-31   | No longer maintained
0.8     | 2011-06-03            | 0.8.10         | 2012-02-13   | No longer maintained
1.0     | 2011-10-18            | 1.0.12         | 2012-10-04   | No longer maintained
1.1     | 2012-04-24            | 1.1.12         | 2013-05-27   | No longer maintained
1.2     | 2013-01-02            | 1.2.19         | 2014-09-18   | No longer maintained
2.0     | 2013-09-03            | 2.0.17         | 2015-09-21   | No longer maintained
2.1     | 2014-09-16            | 2.1.22         | 2020-08-31   | No longer maintained
2.2     | 2015-07-20            | 2.2.19         | 2020-11-04   | No longer maintained
3.0     | 2015-11-09            | 3.0.29         | 2023-05-15   | No longer maintained
3.11    | 2017-06-23            | 3.11.15        | 2023-05-05   | No longer maintained
4.0     | 2021-07-26            | 4.0.13         | 2023-05-20   | Maintained until 5.1.0 release
4.1     | 2022-06-17            | 4.1.6          | 2024-08-19   | Maintained until 5.2.0 release
5.0     | 2024-09-05            | 5.0.2          | 2024-10-19   | Latest release. Maintained until 5.3.0 release

See also

References

  1. "Release cassandra-5.0.2".
  2. Carpenter, Jeff; Hewitt, Eben (2022). Cassandra: The Definitive Guide (3rd ed.). O'Reilly Media. ISBN 978-1-4920-9710-5.
  3. Casares, Joaquin (November 5, 2012). "Multi-datacenter Replication in Cassandra". DataStax. Retrieved July 25, 2013. Cassandra's innate datacenter concepts are important as they allow multiple workloads to be run across multiple datacenters...
  4. "Apache Cassandra Documentation Overview". Retrieved January 21, 2021.
  5. Hamilton, James (July 12, 2008). "Facebook Releases Cassandra as Open Source". Retrieved June 4, 2009.
  6. "Is this the new hotness now?". Mail-archive.com. March 2, 2009. Archived from the original on April 25, 2010. Retrieved March 29, 2010.
  7. "Cassandra is an Apache top level project". Mail-archive.com. February 18, 2010. Archived from the original on March 28, 2010. Retrieved March 29, 2010.
  8. "The meaning behind the name of Apache Cassandra". Archived from the original on November 1, 2016. Retrieved July 19, 2016. Apache Cassandra is named after the Greek mythological prophet Cassandra. [...] Because of her beauty Apollo granted her the ability of prophecy. [...] When Cassandra of Troy refused Apollo, he put a curse on her so that all of her and her descendants' predictions would not be believed. [...] Cassandra is the cursed Oracle[.]
  9. DataStax (January 15, 2013). "About data consistency". Archived from the original on July 26, 2013. Retrieved July 25, 2013.
  10. Ellis, Jonathan (March 2, 2012). "The Schema Management Renaissance in Cassandra 1.1". DataStax. Retrieved July 25, 2013.
  11. Ellis, Jonathan (February 15, 2012). "Schema in Cassandra 1.1". DataStax. Retrieved July 25, 2013.
  12. Ellis, Jonathan (December 3, 2010). "What's new in Cassandra 0.7: Secondary indexes". DataStax. Retrieved July 25, 2013.
  13. Rodriguez, Alain (July 27, 2016). "About Deletes and Tombstones in Cassandra".
  14. "DataStax C/C++ Driver for Apache Cassandra". DataStax. Retrieved December 15, 2014.
  15. "CQL". Archived from the original on January 13, 2016. Retrieved January 5, 2016.
  16. Hayashibara, Naohiro; Défago, Xavier; Yared, Rami; Katayama, Takuya (2004). "The Φ Accrual Failure Detector". IEEE Symposium on Reliable Distributed Systems. pp. 66–78. doi:10.1109/RELDIS.2004.1353004.
  17. "NodeTool". Cassandra Wiki. Archived from the original on January 13, 2016. Retrieved January 5, 2016.
  18. "How to monitor Cassandra performance metrics". Datadog. December 3, 2015. Retrieved January 5, 2016.
  19. "Metrics". Cassandra Wiki. Archived from the original on November 12, 2015. Retrieved January 5, 2016.
  20. "Monitoring". Cassandra Documentation. Retrieved February 1, 2018.
  21. "Cassandra Server Releases". cassandra.apache.org. Retrieved December 15, 2015.

Bibliography
