Data Modeling Interview Questions
Written August 19, 2024
Curated From The Web
I have put together some concepts, key words, and questions about data modeling that I think are highly relevant to today's job market along with my own two-cents. Data modeling interviews can be focused or split in several areas from schema, query optimization,
Concepts & Keywords
Dimensional Data Modeling AKA "Star Schema" - Breaks data into "dimensions" and "facts" to more effectively balance model simplicity, join performance, and storage costs.
Medallion Architecture - Create progressively more refined layers of database tables from raw ingress to accurate and efficient reporting tables.
Data Lake
Data Warehouse
Data Mesh
OLAP versus OLTP - Online Analytical Processing and Online Transaction Processing
Normalization or denormalization: This describes the decision of including or excluding columns in a table. In denormalization, we add a column to two or more tables which is considered redundant and will increase storage costs but can improve query performance and simplicity by skipping a JOIN. Normalization means to remove redundant columns to save on storage costs which may increase the JOINS required. Normalization can also help to provide consistency as there are fewer or no chances to have mismatches across column values.
3N or N-Normal Form: A way to quantify the desired level of normalization or denormalization in a database. Literally, the number is a target for how many tables or joins are required to perform critical or frequent queries . Higher values mean more normalized schema.
Indexing
Partitioning
Predicate pushdown
Cardinality
Sparsity
Aggregation
Projection
Questions
My Questions
Given this data model <shows data model>, what are its strengths and weaknesses?
Given this query on that data model, how do you expect it to perform? What could you change about the model to make it perform better? What could you change in the query?
Given a miscellaneous set of requirements and some potential use cases, come up with a data model.
Given <X> new requirements, would you change anything? What would you change and why?
Simplilearn has a collection of definitional and trivia questions and some more sophisticated questions. If all of these questions seem new, focus on learning the definitions. If the definitions are familiar, I wouldn't practice answering the question.
Definitions/Trivia:
What is a data model? I likely would not ask this question. The term is so overloaded it could refer to the definition of the concept, as simplilearn suggests or it could mean the table schema, ERD, or even software object model.
What is a table? Same thing here. Most people interact with "tables" in all sorts of ways and if you really get into the details, such as specific databases' internal implementations of tables, the answers will actually differ quite a lot.
What is a surrogate key? Surrogate keys can be useful but this is a weak question. You either know the answer or not. More interesting is how and when you would use it. However, many sufficiently creative problem solvers can come up with this on their own. No columns stand out as a natural primary key? Generate one with the properties you want.
Sophisticated:
What is data sparsity and how does it impact aggregation? While not a common question, this is an interesting one. It tests a bit more of your understanding of the nature of data and statistical methods. It can also potentially test your awareness and understanding of how databases are implemented internally.
Discuss the three types of data models in an ER model. The answer is "conceptual, logical, and physical" with some discussion on the differences. This is misleading because it's limited to a particular perspective on data modeling. It's good to know this in a large and established organization (e.g. a large health insurer or bank) that is not known as a web-scale tech company (e.g. Google or Uber). This is due to legacy practices and regulatory requirements that drive data governance over scalability.
Discuss in-memory analytics and column-store databases in a data warehousing context. This doesn't seem like a question I would really expect. There's no reason to compare and contrast these topics. Columnar and row-wise storage are interesting comparisons. Databases with on-disk storage versus totally in-memory databases are interesting to compare.
Lots of questions about NoSQL. These are good refreshers if you are already familiar with NoSQL databases and their differences to relational databases. You may have practiced them on the job already but these questions precisely target key decision points to measure your experience instead of critical thinking ability.