Data scientist is considered a prestigious job. Back in 2012, Harvard Business Review called 'data scientist' the sexiest job of the 21st century, and the growing number of roles in the industry seems to confirm that statement. The appeal has lasted: according to Glassdoor, data scientist was the second-best job in America in 2021.

To get such a prestigious job, you have to go through rigorous job interviews. Questions asked can be very broad and complex. This is expected, considering the role of a data scientist usually incorporates so many areas. To help you prepare for the job interviews, I have reviewed all the applicable questions and separated them into different question categories. Here’s how I did that.

## Description and Methodology of the Analysis

I have gathered data from various job boards and company review platforms such as Glassdoor, Indeed, Reddit, and Blind App. In total, 903 questions were collected, covering the past four years.

The questions are sectioned into pre-determined categories. These categories are the result of an expert analysis of the interview experience descriptions taken from our sources. The categories are:

1. Algorithms
2. Business case
3. Coding
4. Modeling
5. Probability
6. Product
7. Statistics
8. System design
9. Technical

## What Types of Questions Should You Expect?

This chart shows you the question type per category according to the collected data.

Translated to percentages, the chart looks like this:

As you can see, the coding and modeling questions are most dominant. More than half of all questions come from that area. It’s not surprising when you think about it. Coding and modeling are probably the two most important skills for a data scientist. Coding-type questions are widespread, comprising more than one-third of all questions. Other question types, such as algorithms and statistics, are also fairly significant; 24% of all questions come from these two categories. Other categories are not as represented. I find that reasonable, considering the nature of a data scientist role.

Now I want to guide you through every question category and show you some examples of the questions being asked.

## The Most Tested Concepts on Data Science Interviews

### Coding

As you already saw, coding questions are the single most important topic in data science interviews. Such questions require some sort of data manipulation using code to identify insights. The questions are designed to test coding ability, problem-solving skills, and creativity. You’ll usually do that on a computer or a whiteboard.

#### Coding Question Example

One __example from Microsoft__ is this one:

**QUESTION**: *“Calculate the share of new and existing users. Output the month, share of new users, and share of existing users as a ratio. New users are defined as users who started using services in the current month. Existing users are users who used services in the current month and also used services in any previous month. Assume that the dates are all from the year 2020.”*

You’ll be using the table fact_events, with the sample data looking like this:

To get the desired output, you should write this code:

    -- Distinct active users per month
    WITH all_users AS (
        SELECT date_part('month', time_id) AS month,
               count(DISTINCT user_id) AS all_users
        FROM fact_events
        GROUP BY month
    ),
    -- Users whose first event falls in each month
    new_users AS (
        SELECT date_part('month', new_user_start_date) AS month,
               count(DISTINCT user_id) AS new_users
        FROM (
            SELECT user_id,
                   min(time_id) AS new_user_start_date
            FROM fact_events
            GROUP BY user_id
        ) sq
        GROUP BY month
    )
    SELECT au.month,
           new_users / all_users::decimal AS share_new_users,
           1 - (new_users / all_users::decimal) AS share_existing_users
    FROM all_users au
    JOIN new_users nu ON nu.month = au.month

Writing SQL code is the most often tested concept when it comes to coding. That’s no surprise, since SQL is one of the most used tools in data science. One of the concepts you almost can’t avoid in interviews is joins, so make sure you know the difference between the join types and how to use them to get the required result.

You can also expect to group data with the GROUP BY clause very often. Other commonly tested concepts are filtering data with the WHERE and HAVING clauses and selecting distinct data. Make sure you also know the aggregate functions, such as SUM(), AVG(), COUNT(), MIN(), and MAX().
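
To make these clauses concrete, here is a minimal sketch that runs them against an in-memory SQLite database through Python's built-in `sqlite3` module. The `orders` table and its rows are hypothetical and invented purely for illustration; they do not come from the research data.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ann", "east", 20.0),
        ("ann", "east", 30.0),
        ("bob", "west", 5.0),
        ("cat", "west", 50.0),
        ("cat", "west", 70.0),
        ("dan", "east", 10.0),
    ],
)

# WHERE filters rows before grouping; HAVING filters groups after aggregation.
per_customer = conn.execute(
    """
    SELECT customer,
           COUNT(*)    AS n_orders,
           SUM(amount) AS total,
           AVG(amount) AS avg_amount
    FROM orders
    WHERE amount >= 10
    GROUP BY customer
    HAVING SUM(amount) > 40
    ORDER BY customer
    """
).fetchall()

# DISTINCT removes duplicate values from the result.
regions = conn.execute(
    "SELECT DISTINCT region FROM orders ORDER BY region"
).fetchall()

print(per_customer)
print(regions)
```

Here `bob` is dropped by WHERE because his only order is below 10, and `dan` is dropped by HAVING because his total does not exceed 40.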

Some concepts don’t come up as often, but they are worth mentioning so you can be prepared for them. Common Table Expressions (CTEs) are one such topic; the CASE expression is another. Also, don’t forget to refresh your memory on handling string data types and dates.
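
As a rough illustration of CTEs, CASE, and date handling working together, the sketch below computes the share of new users per month on a tiny `fact_events` table. It is an assumption-laden variant, not the reference solution: it runs on SQLite through Python's `sqlite3` module, uses SQLite's `strftime` rather than PostgreSQL's `date_part`, and the six rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_events (user_id TEXT, time_id TEXT)")
# Hypothetical events, all in 2020 as the question assumes.
conn.executemany(
    "INSERT INTO fact_events VALUES (?, ?)",
    [
        ("u1", "2020-01-05"),
        ("u2", "2020-01-20"),
        ("u1", "2020-02-03"),
        ("u3", "2020-02-10"),
        ("u2", "2020-02-11"),
        ("u4", "2020-02-25"),
    ],
)

query = """
WITH first_seen AS (   -- CTE: first active month per user
    SELECT user_id, MIN(strftime('%m', time_id)) AS first_month
    FROM fact_events
    GROUP BY user_id
),
monthly AS (           -- CTE: one row per user per active month
    SELECT DISTINCT user_id, strftime('%m', time_id) AS month
    FROM fact_events
)
SELECT m.month,
       -- CASE marks each active user as new (1) or existing (0)
       AVG(CASE WHEN f.first_month = m.month THEN 1.0 ELSE 0.0 END)
           AS share_new_users
FROM monthly m
JOIN first_seen f ON f.user_id = m.user_id
GROUP BY m.month
ORDER BY m.month
"""
rows = conn.execute(query).fetchall()
print(rows)
```

In January both active users are new, so the share is 1.0; in February two of the four active users are new, so the share is 0.5.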

### 1. Modeling

Modeling was the second-largest category in our research data, with 20% of all questions coming from here. These questions are designed to test your knowledge of building statistical models and implementing machine learning models.

#### Modeling Question Examples

Regression is the most common technical concept asked about in interviews. That’s not surprising, considering the nature of statistical modeling.

One __example from Galvanize__ would be the following:

**QUESTION**: *“What is regularization in regression?”*

Here is how you could answer this question:

**ANSWER**: *“Regularization is a technique in which the coefficient estimates of a regression model are constrained, or shrunk, toward zero. Doing this reduces the variance of the model at the cost of a small increase in bias. Regularization is used to avoid or reduce overfitting, which happens when the model learns the training data so well, including its noise, that its performance on new data suffers. Ridge and Lasso are the most commonly used regularization methods.”*
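
To see the shrinkage effect in numbers, here is a minimal sketch of ridge regression for a single feature with no intercept, written in plain Python. The function name `ridge_slope` and the toy data are assumptions made for illustration. Minimizing `sum((y - w*x)^2) + lam * w^2` gives the closed form `w = sum(x*y) / (sum(x*x) + lam)`.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for a one-feature model y ~ w * x (no intercept).

    Minimizes sum((y - w*x)^2) + lam * w^2, which gives
    w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical toy data, roughly y = 2x plus a little noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A larger penalty pulls the estimated slope further toward zero.
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(ridge_slope(xs, ys, lam), 3))
```

With `lam = 0` this is ordinary least squares (a slope of about 1.99 here), and the slope shrinks steadily as `lam` grows. Lasso replaces the squared penalty with an absolute-value penalty, which can set coefficients exactly to zero.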

Other concepts tested regularly include logistic regression, Bayesian logistic regression, and naive Bayes classifiers. You can also be asked about random forests, as well as testing and evaluating models.

### 2. Algorithms

Algorithm questions require solving a mathematical problem, mainly through code in a programming language. They involve a step-by-step process, usually requiring adjustment or computation to produce an answer. These questions test basic problem-solving and data manipulation skills, which carry over to complex problems at work.

#### Algorithm Question Examples

The technical concept tested most under algorithms is solving a mathematical or syntax problem with a programming language.

Here is __one example you can find on Leetcode__:

**QUESTION**: *“You are given two non-empty linked lists representing two non-negative integers. The digits are stored in reverse order, and each of their nodes contains a single digit. Add the two numbers and return the sum as a linked list.”*

The example of the data could be something like this:

**ANSWER** (one possible solution, written in Java):

    // Assumes LeetCode's singly linked ListNode class with fields val and next.
    public ListNode addTwoNumbers(ListNode l1, ListNode l2) {
        // Placeholder node in front of the result list
        ListNode dummyHead = new ListNode(0);
        ListNode p = l1, q = l2, curr = dummyHead;
        int carry = 0;
        while (p != null || q != null) {
            // A list that has run out contributes 0
            int x = (p != null) ? p.val : 0;
            int y = (q != null) ? q.val : 0;
            int sum = carry + x + y;
            carry = sum / 10;
            curr.next = new ListNode(sum % 10);
            curr = curr.next;
            if (p != null) p = p.next;
            if (q != null) q = q.next;
        }
        // A leftover carry becomes the most significant digit
        if (carry > 0) {
            curr.next = new ListNode(carry);
        }
        return dummyHead.next;
    }
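
To check the logic on a concrete case, here is a Python sketch of the same digit-by-digit addition with carry. The `ListNode` class and the helpers `from_digits` and `to_digits` are assumptions added so the example runs on its own; they are not part of the original question.

```python
class ListNode:
    """Minimal singly linked list node (assumed shape, as on LeetCode)."""

    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next


def from_digits(digits):
    """Build a linked list from reversed digits, e.g. [2, 4, 3] for 342."""
    dummy = ListNode()
    curr = dummy
    for d in digits:
        curr.next = ListNode(d)
        curr = curr.next
    return dummy.next


def to_digits(node):
    """Collect the node values back into a Python list."""
    out = []
    while node is not None:
        out.append(node.val)
        node = node.next
    return out


def add_two_numbers(l1, l2):
    """Add two numbers stored as reversed-digit linked lists."""
    dummy = ListNode()
    curr, carry = dummy, 0
    # Keep going while either list has digits or a carry is pending.
    while l1 is not None or l2 is not None or carry:
        total = carry
        if l1 is not None:
            total += l1.val
            l1 = l1.next
        if l2 is not None:
            total += l2.val
            l2 = l2.next
        carry, digit = divmod(total, 10)
        curr.next = ListNode(digit)
        curr = curr.next
    return dummy.next


result = to_digits(add_two_numbers(from_digits([2, 4, 3]), from_digits([5, 6, 4])))
print(result)  # [7, 0, 8], i.e. 342 + 465 = 807
```

Folding the carry check into the loop condition removes the separate final step, at the cost of one extra comparison per iteration.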

Other general concepts often tested by this type of question are arrays, dynamic programming, strings, greedy algorithms, depth-first search, trees, hash tables, and binary search.

### 3. Statistics

Statistics interview questions test your knowledge of statistical theory and associated principles. They are meant to gauge how familiar you are with the founding theoretical principles of data science. Being able to understand the theoretical and mathematical background of the analyses being done is important. Answer those questions well, and every interviewer will appreciate you.

#### Statistics Question Examples

The most mentioned technical concept is sampling and distribution. This is one of the statistical principles a data scientist applies most often in daily work.

For example, __an interview question from IBM__ asks:

**QUESTION**: *“What is an example of a data type with a non-Gaussian distribution?”*

To answer the question, you could first define a Gaussian distribution. Then you could follow this by giving examples of the non-Gaussian distribution. Something like this:

**ANSWER**: *“A Gaussian distribution, also known as a normal distribution, is a symmetric, bell-shaped distribution in which a known percentage of the data falls within a given number of standard deviations from the mean. Examples of non-Gaussian distributions include the exponential distribution and the binomial distribution.”*
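
One way to see the difference is to measure how much of a sample lies within one standard deviation of its mean. The sketch below does this for simulated normal and exponential data using only Python's standard library; the sample size, seed, and distribution parameters are arbitrary choices made for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable
N = 100_000


def share_within_one_sd(sample):
    """Fraction of values within one standard deviation of the sample mean."""
    mean = sum(sample) / len(sample)
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / len(sample))
    return sum(1 for x in sample if abs(x - mean) <= sd) / len(sample)


normal_sample = [random.gauss(0.0, 1.0) for _ in range(N)]
exponential_sample = [random.expovariate(1.0) for _ in range(N)]

normal_share = share_within_one_sd(normal_sample)
exponential_share = share_within_one_sd(exponential_sample)
print(f"normal: {normal_share:.3f}, exponential: {exponential_share:.3f}")
```

For the normal sample the share should land close to the familiar 68%, while for the exponential sample it should be noticeably higher, near 86%, because that distribution is skewed.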

When preparing for the job interview, make sure you also cover the following topics: variance and standard deviation, covariance and correlation, the p-value, mean and median, hypothesis testing, and Bayesian statistics. These are all concepts you’ll need as a data scientist, so expect them in the job interviews too.

### 4. Probability

These questions require only theoretical knowledge of probability concepts. Interviewers ask them to understand how well you know the methods and uses of probability needed for the complex data studies usually performed in the workplace.

#### Probability Question Example

It’s highly probable, pun intended, that the question you’ll get is to calculate the probability of getting a certain card or number from a deck of cards or a set of dice. This seems to be the most common type of probability question in our research, as many companies have asked it.

An example of such a __probability question from Facebook__:

**QUESTION**: *“What is the probability of getting a pair by drawing two cards separately in a 52-card deck?”*

Here is how you can answer this:

**ANSWER**: *“The first card you draw can be any card, so it does not affect the result other than leaving one card fewer in the deck. Once the first card is drawn, three of the remaining cards match its rank and would make a pair. So the chance of pairing your first card is 3 out of 51 remaining cards. The probability of this event is therefore 3/51, or about 5.88%.”*
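
The answer can be double-checked by brute force. The following sketch enumerates every ordered two-card draw from a deck modeled as (rank, suit) tuples and counts the draws in which both cards share a rank. The deck representation is an assumption made for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Model the deck as (rank, suit) tuples: 13 ranks x 4 suits = 52 cards.
deck = [(rank, suit) for rank in range(13) for suit in range(4)]

# All ordered two-card draws without replacement: 52 * 51 outcomes.
draws = list(permutations(deck, 2))
pairs = sum(1 for first, second in draws if first[0] == second[0])

p_pair = Fraction(pairs, len(draws))
print(p_pair, f"{float(p_pair):.2%}")  # 1/17 5.88%
```

The exact fraction 1/17 is the reduced form of 3/51, which matches the reasoning in the answer.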

Since this is a kind of “specialized” question that deals only with probability, no other concepts are asked. The only difference is how imaginative the question is. But basically, you’ll always have to calculate the probability of some event and show your thinking.

### 5. Product

Product interview questions will ask you to evaluate the performance of a product/service through data. These questions test your knowledge of adapting and using data science principles in any environment, as is the case with daily work.

#### Product Question Example

The most prominent technical concept in this category is identifying a company’s product and proposing improvements from a data scientist’s perspective. The high variance in technical concepts tested on the product side can be explained by the nature of product questions and the higher level of creativity required to answer these.

An example of a __product question from Facebook__ would be:

**QUESTION**: *“What is your favorite Facebook product, and how would you improve it?”*

**ANSWER**: *Due to the nature of the question, we will let you answer this one yourself.*

The general concepts tested heavily depend on the company that’s interviewing you. Just make sure you are familiar with the company’s business and their products (ideally, you’re their user, as well), and you’ll be fine.

### 6. Business Case

This category includes case studies and generic questions related to the business that would test a data science skill. The significance of knowing how to answer these questions can be enormous as some interviewers would like the candidates to know how to apply data science principles to solve a company’s specific problems before hiring them.

#### Business Case Question Example

Due to the nature of the question type, I could not identify a single technical concept that stands out. Since most of the questions categorized here are case studies, they are unique in a certain way.

However, here is an example of a __business case question from Uber__:

**QUESTION**: *“There is a pool of people who took Uber rides from two cities that were close in proximity, for example, Menlo Park and Palo Alto, and any data you could think of could be collected. What data would you collect so that the city the passenger took a ride from could be determined?”*

**ANSWER**: *“To determine the city, we need to have access to the location/geographical data. The data collected could be the GPS coordinates (latitude and longitude) and the ZIP code of the pickup location.”*

### 7. System Design

System design questions are all questions related to designing technology systems. They are asked to analyze the candidate’s process in solving problems, creating, and designing systems to help customers/clients. Knowing system design can be quite important for a data scientist; even if your role is not to design a system, you will most likely play a role in an established system and need to know how it works in order to do your work.

#### System Design Question Example

These questions cover different topics and tasks. But the one that stands out is building a database. Data scientists deal heavily with databases daily, so it makes sense to ask this question to see whether you can build a database from scratch.

Here is one __question example from Audible__ uncovered in our research:

**QUESTION**: *“Can you walk us through how you would build a recommendation system?”*

**ANSWER**: *Since there is such a variety of approaches to answer this question, we will leave you to come up with your own way of building one.*
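
As one possible starting point among many, here is a minimal sketch of user-based collaborative filtering in plain Python. It is not how Audible or any other company builds its system; the ratings, titles, and function names are all hypothetical. The idea is to score items a user has not rated by the ratings of other users, weighted by how similar those users are.

```python
from math import sqrt

# Hypothetical ratings (user -> {title: rating}), invented for illustration.
ratings = {
    "ann": {"book_a": 5, "book_b": 4, "book_c": 1},
    "bob": {"book_a": 4, "book_b": 5, "book_d": 5},
    "cat": {"book_c": 5, "book_e": 4},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(user, k=2):
    """Rank unrated items by similarity-weighted ratings of other users."""
    scores = {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their_ratings)
        for item, rating in their_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(recommend("ann"))
```

In an interview, a sketch like this is only the beginning of the answer: you would also be expected to discuss cold-start users, scale, evaluation, and how the data is stored and refreshed.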

Again, to answer these questions, it’s essential to know the company’s business. Think a little about databases that the company most probably needs, and try to elaborate your approach a little before the interview.

### 8. Technical

Technical questions ask you to explain various data science technical concepts. They are theoretical and require knowledge of the technology you will be using at the company. Due to their nature, they can seem similar to coding questions. Knowing the theory behind what you are doing is quite important, so technical questions come up often in interviews.

#### Technical Question Example

The most tested area is theoretical knowledge of Python and SQL. That’s not surprising, since these two languages are dominant in data science, along with R to complement Python.

An example of a __real-world technical question from Walmart__ would be:

**QUESTION**: *“What are the data structures in Python?”*

**ANSWER**: *“Data structures are used for storing and organizing data. Python has four built-in data structures: list, dictionary, tuple, and set. A list is an ordered, mutable collection that can contain different types of data. A dictionary is a collection of key-value pairs; it stores a value under a key and retrieves the value using the same key. Tuples are similar to lists, except that the data in a tuple can’t be changed. A set contains unordered elements with no duplicates. Along with the built-in data structures, there are also user-defined data structures.”*
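
A short sketch can back up an answer like this. The variable names and values below are arbitrary and chosen only for illustration.

```python
# List: ordered, mutable, and may mix types.
items = [1, "two", 3.0]
items.append(4)

# Dictionary: maps keys to values; a value is looked up by its key.
ages = {"ann": 31, "bob": 27}
ages["cat"] = 45

# Tuple: ordered like a list, but immutable.
point = (2, 5)
try:
    point[0] = 9
except TypeError as err:
    tuple_error = str(err)  # assignment is rejected

# Set: unordered collection with no duplicates.
tags = {"sql", "python", "sql"}

print(items, ages["cat"], point, sorted(tags))
```

The attempted assignment into the tuple raises a `TypeError`, and the duplicate `"sql"` is silently dropped from the set.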

These are catch-all types of questions. It’s a category for all the questions that can’t cleanly fit into other categories. Due to that, there are no specific concepts that occur more or less often.

## Conclusion

This data science interview guide has been written to support the research undertaken to understand the types of questions being asked at a data science interview. The interview questions’ data are taken from dozens of companies over a four-year period and analyzed. The questions have been categorized under nine different question types (algorithms, business case, coding, modeling, probability, product, statistics, system design, and technical questions).

As part of the analysis, I talked about some of the most common technical concepts from each question type category. For example, the most asked statistics questions have to do with sampling and distribution. Every question category is supported by a practical example of a real question.

The article is intended to serve you as an important guide for interview preparation or simply learning more about data science. I hope I have helped you to feel more comfortable about the data science interview process. Good luck with the interviews!