Day 17: Building an MVP – Modelling the Database

CofounderGPT helped me set up the backend project structure and the most critical part of the project, the CSV import and processing, on Day 15. Our next step is to define the data structure, set up the database, and, if we have time, set up the signup/login flow.

Data flow in the Knowlo application

The data structure is one of the critical parts of every application. Getting it right can help you ship features easier in the future, and doing it fast and without thinking can slow you down significantly by increasing the number of issues and complicating adding new features.

We tried both in production with Vacation Tracker. We didn’t plan our MVP data structure properly, and it caused a lot of issues when the project grew. Then we rewrote it from scratch, and the new system allows us to move a lot faster.

But even the first poorly planned data structure took Vacation Tracker from $0 to more than $20k MRR! So, we do not want to overthink the data structure for Knowlo MVP. Instead, we want to analyze how our users will use the app and then try to develop a data structure that works for these use cases.

There will be two types of users in our system: a Knowlo app user, someone that uses our app to generate tooltips, and an end-user, a person that interacts with tooltips on other websites and web apps. Both of these user types are equally important for us, but we’ll start with a Knowlo app user, as they will also be our customers.

In an MVP version, we’ll allow Knowlo users to have only one project. However, there’s a big chance that users will want to manage multiple projects in the future. For example, I should be able to use the Knowlo app to create tooltips for both Vacation Tracker and Knowlo.

Each project will have a knowledge base. There’s a chance that people will ask for multiple knowledge bases for a single project, but from the current perspective, that seems less likely, so we’ll ignore it for now. Besides the knowledge base, a project will have many tooltips. So the initial project structure looks like the following image:

This is where end-users enter the system. Knowlo app users will generate tooltips and put them in their web apps or websites. Their users will then click on these tooltips to learn more about some of their features:

The main difference between a standard app tooltip and a Knowlo AI-driven tooltip is that it gives end-users an option to ask a follow-up question! Or even multiple follow-up questions:

However, each follow-up question costs money because Knowlo will use OpenAI GPT-4 API (at least for its MVP version), so we’ll need to introduce some limitations. We’ll limit each question and answer to a specific number of characters, and we’ll also limit the number of questions an end user can ask for each tooltip (see the following image). Our goal is to give end users helpful information just in time, but we do not want to convert tooltips to funny little chat apps. There might be some other limitations later.

Also, we want to track all interactions with tooltips to show different analytics to Knowlo users. For example, we would like to be able to show the number of clicks and questions for each tooltip in a specified period (such as one day, one week, one month, etc.). We also want to know which tooltips are helpful and which are not because that information allows both Knowlo and Knowlo’s users to improve them.

Finally, if the MVP version gains some users, we would like to allow people to invite collaborators. That feature would probably require at least some basic roles and permissions, but we’ll keep that for later. For our data structure, we want to plan that each project can have multiple users:

This looks good so far! As mentioned on Day 12, we’ll use Amazon DynamoDB as our database for Knowlo. Let’s see how to translate our data flow into the data structure that works well with DynamoDB.

Storing the data in DynamoDB tables

DynamoDB tables are fast and efficient if you design the database structure correctly. You can read the data by querying indexes or scanning the whole table. Indexes are fast and much cheaper, and full scans are slow and expensive. Our goal is to create a database structure that uses indexes efficiently, and then we can use only indexes. We can go a step further and disallow scans with our IAM permissions, which will help with data separation.

DynamoDB tables must have a primary key, which can be either a single-attribute partition key or a composite partition-sort key. The primary key uniquely identifies each item in the table and allows for efficient querying. In addition to the primary key, DynamoDB supports two types of secondary indexes: Global Secondary Indexes (GSIs) and Local Secondary Indexes (LSIs). GSIs have different partition keys and sort keys from the base table and span all partitions. GSIs basically copy a DynamoDB table with different indexes. LSIs have the same partition key as the base table but different sort keys, and they are scoped to a single partition.

This explanation of DynamoDB and its indexes is oversimplified. It would take a lot of time to explain all the details and benefits of DynamoDB, so we’ll not even try it. If you want to learn more about it, you should read Alex DeBrie’s excellent book called “The DynamoDB Book.”

There are two popular models for structuring data in DynamoDB: a multi-table design, which is similar to structuring most other NoSQL databases, and a single-table design.

Single-table design in DynamoDB refers to storing multiple types of entities (like users, projects, tooltips, etc.) in a single DynamoDB table instead of using a separate table for each entity type. This design pattern is commonly used with DynamoDB because it fits well with DynamoDB’s strengths and usage patterns.

The main advantage of the DynamoDB single-table design is its efficiency and performance. With a single-table design, we store all related data together, which leads to fewer read and write operations and reduces costs. It is especially beneficial when dealing with large datasets because of its scalability, making it highly cost-effective for read-heavy workloads. Because data for a single query is located on the same partition, retrieval becomes faster, which improves application performance.

But a single-table design has certain drawbacks. The main challenge is the increased complexity of data modeling and application code. Since we store multiple entity types in the same table, it can be challenging to understand the data structure just by looking at the table, and the application code needs to handle this complexity. This model often involves composite, which may be more complex to work with than traditional relational models. In addition, data migration and schema changes can be more difficult and error-prone with a single-table design, as changes may impact multiple entity types stored in the table.

But building a new product is an excellent place to experiment with some things that are hard to add to existing products. We use the DynamoDB single-table design for the core events table of Vacation Tracker, so let’s make the Knowlo data structure a bit more fun with the same approach.

As I mentioned, creating a good data model for a single-table design can be tricky. But CofounderGPT might be able to help me. Let’s see! I asked CofounderGPT to suggest a single-table design for the Knowlo app. To get some useful results, I explained the user interactions in detail.

CofounderGPT: Thank you for the detailed explanation. Based on the information you’ve given, I’ll suggest an initial single-table design for your DynamoDB setup.

Here are the main entities I identified:

Knowlo App User (User)
Project
Knowledge Base
Tooltip
End User
Interactions

DynamoDB is a key-value and document database, where data is accessed via primary key. This key is composed of two parts: partition key (PK) and sort key (SK). For querying across multiple partitions, we need secondary indexes, which allow to use alternative keys.

Here’s a suggested table design:

PK	SK	GSI1PK	GSI1SK	Other Attributes
USER#<user_id>	METADATA#<user_id>			user attributes
USER#<user_id>	PROJECT#<project_id>	PROJECT#<project_id>	USER#<user_id>	project attributes, user_id
PROJECT#<project_id>	KNOWLEDGEBASE#<knowledge_base_id>			knowledge_base attributes, project_id
PROJECT#<project_id>	TOOLTIP#<tooltip_id>	TOOLTIP#<tooltip_id>	PROJECT#<project_id>	tooltip attributes, project_id
TOOLTIP#<tooltip_id>	ENDUSER#<end_user_id>	ENDUSER#<end_user_id>	TOOLTIP#<tooltip_id>	interaction count, last interaction, etc.
TOOLTIP#<tooltip_id>	INTERACTION#<interaction_id>	NTERACTION#<interaction_id>	TOOLTIP#<tooltip_id>	interaction details

CofounderGPT’s first iteration of the Knowlo app DynamoDB single-table design

Here are the entities represented:

User: A user entity is identified by the partition key USER#<user_id> and the sort key METADATA#<user_id>. User attributes can be stored as other attributes.
Project: A project is a child of a user and has a partition key USER#<user_id> and sort key PROJECT#<project_id>. We also add the project id as the partition key in a Global Secondary Index (GSI1PK) and the user id as the sort key (GSI1SK) to allow querying projects across multiple users.
Knowledge Base: A knowledge base is a child of a project and has a partition key PROJECT#<project_id> and sort key KNOWLEDGEBASE#<knowledge_base_id>.
Tooltip: A tooltip is a child of a project, with a partition key PROJECT#<project_id> and sort key TOOLTIP#<tooltip_id>. We also add the tooltip id as the partition key in GSI1PK and the project id as the sort key in GSI1SK to allow querying tooltips across multiple projects.
End User Interaction: The interactions of end users are children of a tooltip. Each interaction has the partition key TOOLTIP#<tooltip_id> and sort key ENDUSER#<end_user_id>. We also add the end user id as the partition key in GSI1PK and the tooltip id as the sort key in GSI1SK to allow querying interactions across multiple tooltips.
Interaction: An interaction is identified by the partition key TOOLTIP#<tooltip_id> and sort key INTERACTION#<interaction_id>. We also add the interaction id as the partition key in GSI1PK and the tooltip id as the sort key in GSI1SK to allow querying interactions across multiple tooltips.

This design should allow you to query all data relevant to a particular user, project, or tooltip, and to analyze the interactions. As your application grows, you might need to introduce more GSIs to support additional query patterns.

In the future, when you want to add support for collaborators, you can add a new relation between projects and users. This might also involve adding more GSIs to support the new query patterns that this feature will introduce.

Slobodan: Let’s design the database so we don’t need to modify it when adding a few more features, such as adding multiple users. For example, I think we should keep project attributes unrelated to users and then somehow connect users and projects. Here are some access patterns we should cover:

Each project user needs to be able to query the project attributes and billing (this will be for administrators only in the future) with a simple query. The knowledge base can be a project attribute (not part of an index) because I do not think we’ll need to get it by ID at any point.
Each project user needs to be able to query the project to get their attributes/settings (such as name, profile image, etc.)
Each project user should be able to see all users in that project with their roles (remember, we’ll add user roles later).
Each project user should be able to get all tooltips for that project. One tooltip is always part of one project, but the project has many tooltips. Users should be able to see and update tooltip attributes.
Each project user should be able to see the analytics for all tooltips in the project (maybe it would be great to support cross-project analytics later, but we can keep it project specific for now). We can create aggregates for the interactions, but users should be able to see the number of interactions in a period and filter by interaction type.
End users are not authenticated, but they have some generated IDs. They should be able to get tooltip data by tooltip ID only (they’ll not know the project ID, but we might be able to pass it if needed).
End users will create different interactions with tooltips. Here‘s the list of interaction types: VIEW (end user opened a tooltip, and we want to know if that is a new or returning user), QUESTION_ASKED (end user asked a follow-up question, we should block them at two questions per tooltip), TOOLTIP_UPVOTE or TOOLTIP_DOWNVOTE (end user clicked the tooltip upvote/downvote button), TOOLTIP_BUTTON_CLICKED (end user clicked an action button in a tooltip, there can be multiple links or buttons in a tooltip, and we need to know when they are clicked and which button is clicked).
All project users should be able to see all the questions asked for each tooltip, the number of interactions with each tooltip, and the number of downvotes/upvotes for each tooltip. They should be able to filter this info by the selected period or show everything from the tooltip creation until now.

CofounderGPT: Based on your access patterns and future considerations, we can adapt the DynamoDB table as follows:

PK	SK	GSI1PK	GSI1SK	Other Attributes
USER#<user_id>	METADATA#<user_id>			user attributes
PROJECT#<project_id>	METADATA#<project_id>			project attributes, knowledge base
PROJECT#<project_id>	USER#<user_id>	PROJECTUSER#<user_id>	PROJECT#<project_id>	user roles in project, user attributes
PROJECT#<project_id>	TOOLTIP#<tooltip_id>	TOOLTIP#<tooltip_id>	PROJECT#<project_id>	tooltip attributes
TOOLTIP#<tooltip_id>	INTERACTION#<interaction_id>	INTERACTION#<tooltip_id>	TOOLTIP#<interaction_id>	interaction details (end user id, interaction type, timestamp, button clicked etc.)

Here‘s the explanation for each line in the table:

User: This is the same as before, it represents an individual user.
Project: The Project entity now has its own partition key and is separated from the User entity. It holds all the project attributes and knowledge base.
Project User: This entity represents the relationship between a project and a user. We use the GSI1 (global secondary index) to enable querying all projects that a user belongs to.
Tooltip: Same as before, represents a tooltip that belongs to a project.
Interaction: Same as before, but now we add all details of the interaction to this item, including end user id, interaction type, timestamp, button clicked, etc.

This should allow you to support all the access patterns you mentioned. We separate the project from the user, which enables support for multiple users per project. We also add the user roles and user attributes to the Project User entity, which will support the roles feature when you want to add it.

Here’s how you would support each of your access patterns:

Query the Project item to get the project attributes and billing.
Query the Project User item with the user_id and project_id to get the user‘s attributes/settings for that project.
Query the Project User items with the project_id to get all users and their roles in a project.
Query the Tooltip items with the project_id to get all tooltips for a project.
Query the Interaction items with the tooltip_id to get all interactions for a tooltip, then aggregate and filter as needed to get analytics for a project.
Get the Tooltip item by tooltip_id (project_id is not needed).
Create a new Interaction item for each interaction, including the interaction type and any other data needed.
Query the Interaction items for each Tooltip and aggregate the data as needed to get the metrics for each tooltip. You can use filters to limit this to a specific period.

Slobodan: Ok, there are some good suggestions here. Thank! But, we need to have a tooltip interaction type in the index, too (maybe in the GSI) because we need to show the number of questions without loading all the interactions for that tooltip. Or, we need to be able to list all questions or downvotes for all tooltips in a selected project.

Also, we should keep the end-user ID somewhere because we need to get their previous interaction with tooltips. For example, if an end user with an ID “eu123” opens a tooltip with ID “‘t1“ and adds one question today, tomorrow they should see that question, an answer, and be able to ask just one more question.

CofounderGPT: Based on your additional requirements, you will need to include the tooltip interaction type in the secondary index to support querying for specific interaction types. Also, you need to include end-user interactions in the index to support querying for their past interactions.

Here’s how you can modify the table:

PK	SK	GSI1PK	GSI1SK	GSI2PK	GSI2SK	Other Attributes
USER#<user_id>	METADATA#<user_id>					user attributes
PROJECT#<project_id>	METADATA#<project_id>					project attributes, knowledge base
PROJECT#<project_id>	USER#<user_id>	PROJECTUSER#<user_id>	PROJECT#<project_id>			user roles in project, user attributes
PROJECT#<project_id>	TOOLTIP#<tooltip_id>	TOOLTIP#<tooltip_id>	PROJECT#<project_id>			tooltip attributes
TOOLTIP#<tooltip_id>	INTERACTION#<interaction_id>	INTERACTION#<tooltip_id>	TOOLTIP#<interaction_id>	ENDUSER#<end_user_id>	TOOLTIP#<tooltip_id>	interaction details (end user id, interaction type, timestamp, button clicked etc.)
TOOLTIP#<tooltip_id>	INTERACTION#<interaction_id>_<interaction_type>					interaction details (end user id, interaction type, timestamp, button clicked etc.)

In this design:

A new GSI2 (global secondary index) is added. The partition key is the end user id and the sort key is the tooltip id. This allows you to quickly find all interactions of an end user with a specific tooltip.
The interaction type is added to the sort key for interaction items. This allows you to quickly find all interactions of a specific type for a tooltip.

This revised design should allow you to:

Get the number of questions for a tooltip by querying for INTERACTION#<interaction_id>_QUESTION_ASKED in the tooltip’s partition.
List all questions or downvotes for all tooltips in a selected project by querying for INTERACTION#<interaction_id>_QUESTION_ASKED or INTERACTION#<interaction_id>_TOOLTIP_DOWNVOTE in the tooltips‘ partitions.
Get an end user‘s previous interaction with a tooltip by querying for ENDUSER#<end_user_id> in GSI2PK.

Slobodan: Ok, this is better, but there‘s still one more problem: we use interaction ID for interactions. I guess that’s some unique ID. We need it. But how do I query all the “QUESTION_ASKED” interactions in the project? Or how do I count the number of upvotes and downvotes for a selected tooltip?

Also, now that I think about the interactions, we should probably have an option to archive questions because users will often “answer“ them by updating tooltips and improving them. These questions should still be available in the analytics, but we need to show only non-archived questions on the questions page in the app.

CofounderGPT: Thank you for your patience, it seems that we’ll need to slightly adjust our approach to cover all requirements. The solution is to change the GSI1 to track interactions and to include interaction type in it. Additionally, we’ll add an “archived“ attribute to interactions to support archiving.

Here’s how we can modify the table:

PK	SK	GSI1PK	GSI1SK	GSI2PK	GSI2SK	Other Attributes
USER#<user_id>	METADATA#<user_id>					user attributes
PROJECT#<project_id>	METADATA#<project_id>					project attributes, knowledge base
PROJECT#<project_id>	USER#<user_id>	PROJECTUSER#<user_id>	PROJECT#<project_id>			user roles in project, user attributes
PROJECT#<project_id>	TOOLTIP#<tooltip_id>	TOOLTIP#<tooltip_id>	PROJECT#<project_id>			tooltip attributes
TOOLTIP#<tooltip_id>	INTERACTION#<interaction_id>	INTERACTION#<interaction_id>	TYPE#<interaction_type>	ENDUSER#<end_user_id>	TOOLTIP#<tooltip_id>	interaction details (end user id, interaction type, timestamp, button clicked etc.), archived flag

Here’s the explanation for each line in the table:

User: This is the same as before, it represents an individual user.
Project: The Project entity now has its own partition key and is separated from the User entity. It holds all the project attributes and knowledge base.
Project User: This entity represents the relationship between a project and a user. We use the GSI1 (global secondary index) to enable querying all projects that a user belongs to.
Tooltip: Same as before, represents a tooltip that belongs to a project.
Interaction: Now, we add interaction type to GSI1SK to allow querying for a specific type of interaction. We also add the end user id to GSI2PK to enable querying for interactions by the end user. The “archived“ flag is added to support archiving of questions.

Now, you can use GSI1 to:

Get all “QUESTION_ASKED“ interactions in a project by querying GSI1PK with INTERACTION#<interaction_id> and GSI1SK with TYPE#QUESTION_ASKED.
Count the number of upvotes and downvotes for a tooltip by querying GSI1PK with INTERACTION#<interaction_id> and GSI1SK with TYPE#TOOLTIP_UPVOTE or TYPE#TOOLTIP_DOWNVOTE.

You can use the “archived“ flag to filter out archived questions when displaying them in the app, but still include them in the analytics.

This structure is not perfect. For example, an interaction type can’t be a sort key with an interaction ID as a primary key. But it’s good enough to start with. Let’s convert it to code, and we’ll adjust it later.

Let’s write some code

Creating a DynamoDB table using CDK is trivial compared to planning the data structure. We need to import the DynamoDB construct and then create a table and two global secondary indexes. And that’s it!

To write the code, I’ll open the CDK stack file (“backend/lib/knowlo-backend-stack.ts”) and add the following code at the top of the file where we imported other constructs:

Copy Code

import * as dynamodb from 'aws-cdk-lib/aws-dynamodb'

Then, I’ll create a DynamoDB table with the following code:

Copy Code

const coreDbTable = new dynamodb.Table(this, 'KnowloCoreDatabaseTable', {
  partitionKey: {
    name: 'PK',
    type: dynamodb.AttributeType.STRING,
  },
  sortKey: {
    name: 'SK',
    type: dynamodb.AttributeType.STRING,
  },
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
  encryption: dynamodb.TableEncryption.AWS_MANAGED,
  pointInTimeRecovery: environmentParameter.valueAsString === 'production',
  removalPolicy: environmentParameter.valueAsString === 'production' ? RemovalPolicy.RETAIN : RemovalPolicy.DESTROY,
  deletionProtection: environmentParameter.valueAsString === 'production' ? true : false,
})

In the code snippet above, we did the following:

We created a partition key named “PK.” We’ll keep the generic name because our partition key’s value depends on the context. The type of this field is always a string.
We created a sort key named “SK.” The type of our sort key is also a string.
We set the billing mode to “PAY_PER_REQUEST” because we want our DynamoDB table to scale automatically. AWS will charge us for used capacity. There are some specific cases when provisioned capacity sometimes makes more sense than the pay-per-request mode, but that’s rarely in the MVP phase because we can’t accurately predict the usage patterns at this stage.
We enabled encryption with AWS-managed keys.
We’ll enable point-in-time recovery for the production version only. We do not need to have it in non-production environments.
We’ll set a special removal policy with deletion protection for the production version. These will be disabled for non-production versions because we might need to delete the table at some point during the development.

There’s no strict rule on where to put a DynamoDB table in the CDK stack. But because a database table is a stateful resource that will be used by many Lambda functions, I like keeping it near the top of the stack, after the parameters.

Once we have the DynamoDB resource, we can add indexes with the following code:

Copy Code

coreDbTable.addGlobalSecondaryIndex({
  indexName: 'GSI1',
  partitionKey: {
    name: 'GSI1PK',
    type: dynamodb.AttributeType.STRING,
  },
  sortKey: {
    name: 'GSI1SK',
    type: dynamodb.AttributeType.STRING,
  },
  projectionType: dynamodb.ProjectionType.ALL,
})

coreDbTable.addGlobalSecondaryIndex({
  indexName: 'GSI2',
  partitionKey: {
    name: 'GSI2PK',
    type: dynamodb.AttributeType.STRING,
  },
  sortKey: {
    name: 'GSI2SK',
    type: dynamodb.AttributeType.STRING,
  },
  projectionType: dynamodb.ProjectionType.ALL,
})

Each index has its name (we’ll need to provide this name when we want to query that index), its partition and sort keys, and a projection type. In some cases, we can keep just specific data in an index. However, in our case, we want to replicate all of our data to each index.

Scoreboard

Time spent today: 8h

Total time spent: 113h

Investment today: $0 USD

Total investment: $959.54 USD

Paying customers: 0

Revenue: $0

What’s next?

CofounderGPT and I made some progress on the signup and login process, but because this article is already long enough, we’ll keep it for the next one.

So far, we created Cognito User Pool, a place where we’ll store our users. Our next step is to connect that flow with the frontend application and allow users to sign up and log in to our Knowlo. Once we finish the authentication flow, we’ll continue working on the API and the frontend application.