💰Maximizing Efficiency: A Guide to Cost-Optimizing Your DynamoDB Population
DynamoDB is a NoSQL database service that provides fast and flexible performance, scalability, and high availability for applications that require low-latency data access. One of the critical aspects of using DynamoDB is populating it with data. DynamoDB allows for different ways of data ingestion, such as manual insertion through the AWS Management Console, SDKs, and CLI. However, for larger datasets or automated ingestion, it’s essential to have an efficient and reliable population strategy. In this article, we will explore different approaches to populating DynamoDB, their advantages and disadvantages, and best practices for managing data ingestion in DynamoDB.
What is DynamoDB?
DynamoDB is a fully managed NoSQL database service provided by Amazon Web Services (AWS). It is designed for applications that require high performance, scalability, and low latency. DynamoDB can store and retrieve any amount of data and is a great option for building web, mobile, gaming, and IoT applications.
DynamoDB is a key-value and document database with a flexible data model. It allows users to store and retrieve structured and semi-structured data and provides features such as automatic scaling, high availability, and data encryption.
DynamoDB uses a partitioned architecture, which allows it to scale horizontally to handle any amount of data and traffic. It also replicates data across multiple data centers (Availability Zones), which provides high availability and fault tolerance.
DynamoDB provides a rich set of APIs and SDKs for several programming languages, which makes it easy to integrate with other AWS services and build applications quickly. It also integrates with other AWS services, such as AWS Lambda, Amazon S3, and Amazon Redshift, to enable the building of fully functional and scalable applications.
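To make the SDK point concrete, here is a minimal sketch using boto3, the AWS SDK for Python. The table name and attributes are hypothetical, and the table is assumed to have a string partition key named pk.

```python
import boto3

# Connect to DynamoDB and reference a table (the name is hypothetical).
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("example-table")

# Write a single item; attributes beyond the key are schemaless.
table.put_item(Item={"pk": "user#123", "name": "Alice"})

# Read it back by its primary key.
response = table.get_item(Key={"pk": "user#123"})
print(response.get("Item"))
```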
Use Cases
DynamoDB is a versatile and highly scalable database that can be used for a wide range of use cases. Here are some examples of how DynamoDB is commonly used:
- 🌐📱Web and Mobile Applications: DynamoDB is commonly used to store user data, session data, and other application data for web and mobile applications. It provides fast and reliable access to data and can handle millions of requests per second.
- 🎮🕹️Gaming Applications: Many online gaming applications use DynamoDB to store game state, player data, and other game-related information. Its low latency and high throughput make it an ideal choice for real-time gaming applications.
- 🌐📈Internet of Things (IoT): DynamoDB is used to store and process data from IoT devices such as sensors and cameras. It can handle large volumes of data in real time and allows for fast and reliable analysis of IoT data.
- 📊📈Ad Tech and Marketing Applications: Many ad tech and marketing applications use DynamoDB to store and analyze user behavior data such as clicks, impressions, and conversions. Its flexible data model and high performance make it an ideal choice for storing and processing large volumes of data.
- 💰🏦Financial Services: DynamoDB is used by many financial services companies to store and analyze financial data such as transaction records, customer data, and fraud detection data. Its scalability, reliability, and security features make it an ideal choice for financial services applications.
Overall, DynamoDB is a popular choice for any application that requires a highly scalable, flexible, and reliable database that can handle large volumes of data and traffic.
Context of application
My application (plagiarism detection) needs to store 100 million key/value hashes, where the key is a word hash and the value is the document in which that word hash occurred. In the future, I want to store more than 1 billion to make the results more precise.
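As an illustration of that data model, here is a sketch of how such items could be written with boto3; the table name word-hashes, the attribute names, and the hash values are all assumptions for illustration, not the exact schema of my application.

```python
import boto3

# Hypothetical table keyed by the word hash; each item records the
# document in which that hash occurred.
table = boto3.resource("dynamodb").Table("word-hashes")

# batch_writer buffers puts into BatchWriteItem calls of up to 25 items
# and resends any unprocessed items for you.
with table.batch_writer() as batch:
    batch.put_item(Item={"word_hash": "1f2e3d4c", "document_id": "doc-42"})
    batch.put_item(Item={"word_hash": "9a8b7c6d", "document_id": "doc-17"})
```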
Let’s talk about the approaches we can take to populate the database.
What is WCU?
In DynamoDB, WCU stands for Write Capacity Unit. It is a unit of measurement that represents the amount of write capacity available for a DynamoDB table or index. One WCU represents the capacity to write one item of up to 1 KB in size per second.
When you create a DynamoDB table, you need to provision the desired read and write capacity for the table. You can do this by specifying the number of read capacity units (RCUs) and write capacity units (WCUs) for the table. The provisioned capacity determines the number of reads and writes that can be performed on the table per second.
For example, if you provision a table with 10 WCUs, you can write up to 10 items of up to 1 KB in size per second, or fewer items if they are larger than 1 KB. If you need to write more items per second or larger items, you can increase the number of WCUs for the table.
With auto scaling enabled, DynamoDB can automatically adjust a table's provisioned read and write capacity based on traffic and usage patterns. It can also provide burst capacity to handle sudden spikes in traffic. The cost of using DynamoDB is based on the provisioned capacity and the actual usage, and you pay only for what you use.
🔗: Read more in Read/write capacity mode — Amazon DynamoDB.
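As a sketch, provisioning capacity at table-creation time looks like this in boto3 (the table and key names are assumptions):

```python
import boto3

client = boto3.client("dynamodb")

# 10 WCUs = up to ten 1 KB writes per second on this table.
client.create_table(
    TableName="word-hashes",
    AttributeDefinitions=[{"AttributeName": "word_hash", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "word_hash", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
)
```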
Challenges of storing 100 million key/value hashes
Storing 100 million key/value hashes can present several challenges, such as:
- 💪🚀Scalability: Traditional relational databases may struggle to handle such a large volume of data and may require additional hardware to scale up. This can be costly and time-consuming.
- 🛡️High availability: It’s important to ensure that the data is always available and accessible, even in the event of hardware failures or other issues.
- 💨Performance: Retrieving and updating data at scale can be time-consuming, especially if the database is distributed across multiple servers.
- 🔄Data consistency: With a distributed database, it can be challenging to ensure that data is consistent across all nodes.
How does DynamoDB address those challenges?
DynamoDB addresses these challenges by providing a highly scalable, highly available NoSQL database that can store and retrieve large amounts of data quickly and efficiently. Some of the specific ways DynamoDB addresses these challenges include:
- 💪🚀Scalability: DynamoDB is designed to scale horizontally, meaning that it can easily add more servers to handle additional data as needed. This allows it to handle large volumes of data without requiring additional hardware.
- 🛡️High availability: DynamoDB is designed to be highly available, with built-in redundancy and automatic failover to ensure that data is always accessible.
- 💨Performance: DynamoDB is optimized for fast data retrieval and updates, with support for low-latency queries and high-throughput data access.
- 🔄Data consistency: DynamoDB supports multiple levels of consistency, allowing developers to choose the level of consistency that best suits their application’s needs.
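To make the consistency point concrete, here is a minimal sketch of choosing the consistency level per read in boto3; the table and key are the hypothetical ones from earlier.

```python
import boto3

table = boto3.resource("dynamodb").Table("word-hashes")

# Reads are eventually consistent by default; ConsistentRead=True asks
# for a strongly consistent read at twice the read-capacity cost.
response = table.get_item(
    Key={"word_hash": "1f2e3d4c"},
    ConsistentRead=True,
)
print(response.get("Item"))
```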
Overall, DynamoDB is a powerful database solution for storing and retrieving large volumes of data quickly and efficiently, while also providing high availability and scalability.
Calculating the price of each mode
Calculate the price for on-demand mode (very expensive)
For every million writes, we pay $1.25 in that mode.
For our scenario, 100 million rows will cost $125.
ℹ️: Note that read operations are five times cheaper.
🔗: See the prices in Amazon DynamoDB Pricing for On-Demand Capacity
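For transparency, the arithmetic behind that number (assuming the $1.25-per-million on-demand write price above; always check the pricing page for your region):

```python
# Back-of-the-envelope on-demand cost for items up to 1 KB each.
writes = 100_000_000
cost = writes / 1_000_000 * 1.25  # $1.25 per million writes
print(f"${cost:.2f}")  # $125.00
```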
Calculate the price for provisioned mode
For provisioned mode, we can scale capacity up for the duration of the population and scale it back down once we are done.
In this case, we should spread our writes evenly over time, because you pay per hour of provisioned capacity, and each WCU gives you one 1 KB write per second.
For example, to load everything within one hour:
100,000,000 rows / 3,600 seconds ≈ 27,778 WCU
In real situations, we will need more than what the math says; in my experience, about 1.4 times more. So we get roughly 40,000 WCU.
One hour of 40,000 WCU costs about $27.
Not bad, but spreading the writes evenly is hard to achieve.
Compared to on-demand mode, provisioned can be about four times cheaper.
⚠️: You should test it on your dataset and not take this information for granted.
🔗: See the prices in Amazon DynamoDB Pricing for Provisioned Capacity
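One possible shape for such a run, sketched with boto3's update_table; the table name and capacity numbers are assumptions: scale up, load, scale back down.

```python
import boto3

client = boto3.client("dynamodb")

# Scale up just before the bulk load. At roughly $0.00065 per WCU-hour
# (us-east-1), 40,000 WCU costs on the order of $26-27 per hour.
client.update_table(
    TableName="word-hashes",
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 40_000},
)
# Wait until the table returns to ACTIVE status.
client.get_waiter("table_exists").wait(TableName="word-hashes")

# ... run the evenly paced bulk load here ...

# Scale back down as soon as the load finishes (note that DynamoDB
# limits how many times per day you can decrease provisioned capacity).
client.update_table(
    TableName="word-hashes",
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```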
Import from S3 — the cheapest way
⚠️: Disclaimer:
“During the Amazon S3 import process, DynamoDB creates a new target table that will be imported into. Import into existing tables is not currently supported by this feature” — DynamoDB data import from Amazon S3: how it works — Amazon DynamoDB
To populate an existing table, use the previous methods.
“Import from Amazon S3 does not consume write capacity on the new table, so you do not need to provision any extra capacity for importing data into DynamoDB. Data import pricing is based on the uncompressed size of the source data in Amazon S3, that is processed as a result of the import” — DynamoDB data import from Amazon S3: how it works — Amazon DynamoDB
Let’s assume every row is 1 KB.
1 KB × 100,000,000 rows = 100 GB
100 GB will cost us about $15.
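The arithmetic, assuming the import is billed at roughly $0.15 per GB of uncompressed source data (check your region's pricing):

```python
# Back-of-the-envelope S3 import cost.
rows, item_kb = 100_000_000, 1
gigabytes = rows * item_kb / 1_000_000  # KB -> GB
print(f"{gigabytes:.0f} GB -> ${gigabytes * 0.15:.2f}")  # 100 GB -> $15.00
```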
Importing from S3 into DynamoDB
DynamoDB allows you to import data from Amazon S3 into a DynamoDB table using the AWS Management Console, the AWS Command Line Interface (CLI), or the DynamoDB API. When importing data from S3, you must specify the source S3 bucket and the format of the data.
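A minimal sketch of kicking off an import with boto3's import_table; the bucket, prefix, and key schema here are assumptions:

```python
import boto3

client = boto3.client("dynamodb")

# The import always creates a new table; it cannot write into an
# existing one (see the disclaimer above).
client.import_table(
    S3BucketSource={"S3Bucket": "my-import-bucket", "S3KeyPrefix": "hashes/"},
    InputFormat="DYNAMODB_JSON",
    InputCompressionType="GZIP",
    TableCreationParameters={
        "TableName": "word-hashes",
        "AttributeDefinitions": [
            {"AttributeName": "word_hash", "AttributeType": "S"}
        ],
        "KeySchema": [{"AttributeName": "word_hash", "KeyType": "HASH"}],
        "BillingMode": "PAY_PER_REQUEST",
    },
)
```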
Import formats
- CSV (Comma-Separated Values): CSV is a popular format for importing and exporting data. When importing CSV data into DynamoDB, each line in the file represents a single item in the table. The first line of the file contains the column names, and each subsequent line contains the values for each column.
- JSONL (JSON Lines) is a format for storing structured data in a text file, where each line of the file contains a single JSON object. The format is commonly used for logging, data exchange, and batch processing of large datasets.
JSONL files are similar to JSON files, but each line of the file must contain a valid JSON object, separated by a newline character. Unlike JSON, which requires the entire file to be parsed as a single object, JSONL files can be processed line by line, making them ideal for large datasets that do not fit into memory.
JSONL files can be created and processed using various tools and programming languages, including Python, Node.js, and Ruby. Many modern data processing frameworks and libraries, such as Apache Spark and Pandas, support reading and writing JSONL files.
Overall, JSONL provides a lightweight and flexible format for storing and processing structured data in a line-by-line fashion. Its simplicity and compatibility with a wide range of tools and programming languages make it a popular choice for many data processing applications. A small writer sketch follows this list.
- Amazon Ion is a richly-typed, self-describing, hierarchical data serialization format that is designed to be both human-readable and efficient for machine processing. It was developed by Amazon Web Services (AWS) and is used as a data interchange format in various AWS services such as DynamoDB, S3, and Kinesis.
At its core, Ion is a superset of JSON that includes additional data types such as timestamps, binary data, and symbol tables. It also supports advanced features such as annotations, which allow users to add metadata to data elements, and symbol tables, which allow for efficient representation of repeated values.
One of the key advantages of Ion is its ability to represent complex and nested data structures in a compact and efficient format. The self-describing nature of Ion means that data elements are accompanied by type information, making it easier to validate and process the data. Additionally, Ion supports partial parsing, which allows users to parse and process portions of a larger Ion document without having to parse the entire document.
Ion is supported by various programming languages, including Python, Java, and JavaScript, and there are several open-source libraries available for working with Ion data.
Overall, Amazon Ion is a flexible and efficient data serialization format that is particularly well-suited for complex and nested data structures. Its support for advanced features and self-describing nature make it a popular choice for various data processing and interchange use cases.
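Here is the writer sketch mentioned above. For the DynamoDB JSON input format, each line is expected to be a JSON object that wraps the typed attributes in an Item key; the attribute names and hash values below are hypothetical.

```python
import json

# Two hypothetical rows: word hash -> document id.
rows = [("1f2e3d4c", "doc-42"), ("9a8b7c6d", "doc-17")]

# One JSON object per line, each wrapping typed attributes in "Item".
with open("hashes.jsonl", "w") as f:
    for word_hash, doc_id in rows:
        item = {"Item": {"word_hash": {"S": word_hash},
                         "document_id": {"S": doc_id}}}
        f.write(json.dumps(item) + "\n")
```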
🔗: Read more in the official documentation: Amazon S3 import formats for DynamoDB — Amazon DynamoDB
📗: Read more about popular file formats in my previous article.
Compression and number of files
“Data can be compressed in ZSTD or GZIP format, or can be directly imported in uncompressed form. Source data can either be a single Amazon S3 object or multiple Amazon S3 objects that use the same prefix” - DynamoDB data import from Amazon S3: how it works — Amazon DynamoDB
☝️: Compression can reduce the price of storing a big file on S3
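For example, a file could be GZIP-compressed before upload with a few lines of Python (the file names are assumptions):

```python
import gzip
import shutil

# Compress the newline-delimited import file before uploading to S3;
# the import accepts GZIP- (and ZSTD-) compressed objects.
with open("hashes.jsonl", "rb") as src, gzip.open("hashes.jsonl.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```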
In conclusion, DynamoDB is a powerful database that offers high performance, scalability, and low latency for a wide range of use cases. It is an excellent option for applications that require a highly scalable, flexible, and reliable database that can handle large volumes of data and traffic. When it comes to populating DynamoDB, there are three approaches to consider: on-demand mode, provisioned mode, and import from S3. On-demand mode is very expensive; provisioned mode offers a more cost-effective solution, especially for large-scale data populations, although spreading the writes evenly can be challenging; and import from S3 is the cheapest option of all, with the caveat that it can only create a new table. In summary, by understanding the different modes and approaches for populating DynamoDB, you can make the most of this powerful database technology to build scalable and reliable applications.