Database Basics
Data is the backbone of every application and a database is used to store the data. One of the common design discussions is to determine which database to use for a particular application. It would be good to be equipped with basic knowledge so that as a TPM, you can follow along the design discussions. If you are already proficient in databases you can guide the teams too in choosing the right database. Here are some of the basics. Definitions are copied over from various sites and the links to the sites are provided both for attribution and for you to explore more if you are interested.
There are two basic types of databases: RDBMS (Relational Database Management Systems), NoSQL (Not only – SQL Structured Query Language)
RDBMS/SQL vs. NoSQL
RDBMS | NoSQL | |
How data is stored | 1. Structured as tables. 2. Data is normalized (no duplication of data) 3. Enforces constraints (columns can be integers, or strings etc.) 4. Build relationships | 1. Unstructured way – document based, key-value based, graph based, wide column based 2. Hashes the input 3. Some duplication of data |
Examples | Oracle, MySQL, Microsoft SQL Server | Cassandra, HBase, Neo4j, MongoDB |
Schema definition | Schemas are predefined | Dynamic schemas for unstructured data |
Ability to scale | 1. Vertically scalable (more memory, better CPU etc. There is an upper limit on vertically scaling) 2. Horizontally scalable by adding a master and read replicas. Scales well for reads. 3. Single master and so bottleneck for writes | 1. Horizontally scalable 2. Scales well for both reads and writes |
Access | 1. Raw SQL 2. Direct database connection 3. Object relational mappers | 1. REST APIs 2. CRUD (create, read, update, delete) in vendor specific language |
When to use | 1. When your business case is evolving and you are not sure how you are going to access the data 2. When you want to perform flexible queries 3. When you want to perform relational or complex queries 4. When consistency is important so that malformed data cannot make it into your tables 5. Important to have correct data than fast data (ex: financial transactions) | 1. When you know how you are going to access your data 2. When your primary key is known 3. When availability is important 4. Important to have fast data vs. correct data (ex: number of views on youtube) 5. Need to scale based on changing requirements. 6. Store Time series data or hierarchical data 7. Doesn’t need to update data in several places |
Data consistency model | ACID( Atomicity, Consistency, Isolation, and Durability) model. | Base ( Basically Available, Soft state, Eventually Consistent) |
Performance | Limitations for thousands of reads and writes per second | Great performance for mass simple reads and writes |
Type of online processing | Used for Online Transaction Processing (OLTP) | Used for Online Analytical Processing (OLAP) |
Example projects | 1. Small project + low scale + unknown access patterns 2. Large project + high scale + relational queries (SQL with read replicas) | Medium / large project + high scale + high performance |
ACID consistency model
In the context of databases, a sequence of database operations that satisfies the ACID properties (which can be perceived as a single logical operation on the data) is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction. Here are the details of ACID properties:
Atomicity | Atomicity guarantees that each transaction is treated as a single “unit”, which either succeeds completely, or fails completely: if any of the statements constituting a transaction fails to complete, the entire transaction fails and the database is left unchanged |
Consistency | Consistency ensures that a transaction can only bring the database from one valid state to another, maintaining database invariants: any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. |
Isolation | Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially. |
Durability | Durability guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g., power outage or crash). |
BASE consistency model
Eventual consistency is a consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
Eventually-consistent services are often classified as providing BASE (Basically Available, Soft state, Eventual consistency) semantics, in contrast to traditional ACID consistency.
Basically Available | Basic reading and writing operations are available as much as possible (using all nodes of a database cluster), but without any kind of consistency guarantees (the write may not persist after conflicts are reconciled, the read may not get the latest write) |
Soft state | Without consistency guarantees, after some amount of time, we only have some probability of knowing the state, since it may not yet have converged |
Eventually consistent | If the system is functioning and we wait long enough after any given set of inputs, we will eventually be able to know what the state of the database is, and so any further reads will be consistent with our expectations |
CAP theorem
States that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
- Consistency: Every read receives the most recent write or an error
- Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
- Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
Sharding
Sharding is a database architecture pattern related to horizontal partitioning — the practice of separating one table’s rows into multiple different tables, known as partitions. Each partition has the same schema and columns, but also entirely different rows. Likewise, the data held in each is unique and independent of the data held in other partitions.
Oftentimes, sharding is implemented at the application level, meaning that the application includes code that defines which shard to transmit reads and writes to. However, some database management systems have sharding capabilities built in, allowing you to implement sharding directly at the database level.
References
- SQL vs. NoSQL whats the difference
- SQL vs. NoSQL video
- SQL vs. NoSQL explained
- ACID vs. BASE explained
- Understanding database sharding