15 min. read

June 16, 2021

data science

SQL Interview Questions and Answers for Data Scientists

Be ready for the most common SQL questions and anwsers tested in a data science job interview

Nate Rosidi, Data scientist & product manager

As you already know, SQL is the most used tool in data science. But not only in data science. Whether you’re a programmer, data analyst, reporting specialist, database designer, or admin, you can count on being asked about SQL in your job interview.

The purpose of such questions is to check whether the candidate can write a code on a level required for his or her job. Of course, different job positions require different types and levels of SQL knowledge. However, there are some concepts that are a must-know for anyone using SQL. We’ll cover those concepts by showing you the most popular SQL data science questions (and answers!).

Technical concepts of SQL data science interview questions

This guide doesn’t cover everything that can come up in the SQL interviews. But it can reasonably prepare you for the most SQL interviews and SQL questions. Use this guide to prepare for the complex questions and practice some SQL data science concepts that your future employer could test. Practice doesn’t make perfect, but it sure makes you more confident in your knowledge. And confidence shouldn’t be underestimated in the interviews. Also, use this guide to make yourself familiar with how companies approach SQL interviews.

The technical SQL concepts we’re going to talk about are:

Categorizations, aggregations, ratios
JOINs and subqueries and CTEs
Subqueries in the WHERE clause
Window functions
Date manipulations

1. Categorizations & aggregations

Even though these are two different concepts, they’re usually tested together in one coding question. This concept is probably the most commonly tested concept in SQL interviews, especially in the beginning rounds.

Just to make sure we’re on the same page, let me guide you shortly through the meaning of every concept.

By categorizations, we mean CASE statements in SQL.

Aggregations relate to using the aggregate functions in SQL. The ones like SUM(), AVG(), COUNT(). Aggregation is tested almost without exception and is usually applied after the categorization takes place.

To know how to categorize and aggregate data in SQL is a very important concept to know. When you’re working with data, you’re usually working with a large amount of data. Your task is usually to transform this data into something meaningful that can be used by people other than data experts. Categorization helps you to organize data in a certain way. And then you can use it to perform some calculations. Such as summing data, finding averages, min and max values, etc. That’s what aggregation does. These are basic calculations but used whenever you want to create some decision-making insights.

Here’s a question from Airbnb: “Find the total number of searches for each room type (apartments, private, shared) by city.”

Answer: To answer the question, you need to use the table airbnb_search_details.

SELECT city, COUNT(CASE WHEN room_type ILIKE '%apt%' THEN id ELSE NULL END) apt_count, COUNT(CASE WHEN room_type ILIKE '%private%' THEN id ELSE NULL END) private_count, COUNT(CASE WHEN room_type ILIKE '%shared%' THEN id ELSE NULL END) shared_count FROM airbnb_search_details GROUP BY 1

To answer this question, you need to know well how to use COUNT() and CASE statement. This solution is written in PostgreSQL, so we used its ILIKE keyword, which is not a standard SQL. This is the same as LIKE, but it’s not case-sensitive.

2. JOINs & Subqueries & CTEs

The second technical concept that is usually tested in the interviews is, again, actually three technical concepts. Those are JOINs, subqueries, and CTEs. Subqueries and CTEs are rather similar in their purpose (with several distinctions), and usually, either of them can be used to achieve the same result. JOINs are, of course, used whenever you’re getting the data from more than one table. Which is always, unless you’re writing some really basic SQL code. And JOINs are used within subqueries and CTEs too.

JOINs are unavoidable when working with data. Simply because the database definition itself calls for using JOINs. In databases, data is always organized in multiple tables. So whenever you need to use data from the database, you’ll need to connect two or more tables to get the data you need. Subqueries and CTEs are not much different; they can be used instead of JOINs. But when you’re writing complex queries, subqueries and CTEs are often easier to use. It’s because they break down the query into logical parts, which are much easier to read than multiple JOINs. This is especially beneficial when you have a complex SQL code to maintain. Subqueries additionally allow you to use the result of the main SELECT statement in a subquery. CTEs even will enable you to reference CTE itself, which you can’t do with the subqueries.

Here’s one example from Google: “There are two tables with user activities. The 'google_gmail_emails` table contains information about emails being sent to users. Each row in that table represents a message with a unique identifier in the `id` field. The `google_fit_location` table contains user activity logs from the Google Fit app. There is no primary key, however, unique rows that represent `user sessions` are created using a set of columns: user_id, session, step_id.

Find the correlation between the number of emails received and the total exercise per day. The total exercise per day is calculated by counting the number of user sessions per day.”

Answer: To answer the question, you need to use the tables google_gmail_emails, google_fit_location

SELECT corr(n_emails :: NUMERIC, total_exercise :: NUMERIC) FROM (SELECT to_user, DAY, COUNT(*) AS n_emails FROM google_gmail_emails GROUP BY to_user, DAY) mail_base INNER JOIN (SELECT user_id, DAY, COUNT(*) AS total_exercise FROM google_fit_location GROUP BY user_id, DAY) loc_base ON mail_base.to_user = loc_base.user_id

As you can see, this query is written using the subquery and INNER JOIN. If you want, you can also re-write it by using the CTE. You also have to know the aggregate functions and a function that calculates the correlation to solve this question. What’s being tested is if you can use two sources of data, aggregate data on a required level, and get only one number as a result.

3. Subqueries in the WHERE Clause

Speaking of subqueries, they can be found in the FROM clause, as in the example above. But they can also be found in the WHERE clause, which can be rather slippery for some people.

Using the WHERE clause is helpful when you need to filter data. Using the subquery in the WHERE clause allows you not only to filter data but also do it according to criteria or data that is not explicitly stated in the data you have.

The question from Yelp is a good example: “Find the business and the review_text that received the highest number of 'cool' votes. Output the business name along with the review text”

Answer: To answer this question you need to use the table yelp_reviews.

SELECT business_name, review_text FROM yelp_reviews WHERE cool = (SELECT max(cool) FROM yelp_reviews)

The solution above uses the subquery to get data with the maximum cool votes. The maximum is not something you have ready in your data, so you have to find it to return the business_name and review_text in your result. Why not solve this problem another way? For example, maybe you could get the same result if you just sorted data descendingly and then select the top row. Would that get you the same result? Yes, if you’re lucky, but not in this case. If several businesses have the same (top) number of cool votes, you would get only one of them and miss all others. That way, you would fail this question because the correct result returns two businesses, which both have the highest number of cool votes.

4. Window Functions

To sail through the job interview, you’ll also need to know the window functions in SQL. They are heavily utilized in working with data.

The window functions can be seen as some sort of advanced aggregate functions. Like aggregate functions, they allow you to aggregate data. But they also do it in a way that the aggregation is not done in a single row; window functions can aggregate data so that every single stays visible, and the aggregate results are shown in the additional columns.

For example, you could use the RANK() function the rank all the rows and choose according to which column. But you can also do it not for the entire data; you can rank the rows within the specific subgroup. Let’s say, you have data on employee's salaries you can, of course, rank the employees from the highest to the lowest paid in general. But windows function also allows you to do that, for example, for every department separately too quite easily. That way you’ll have overall and departmental ranking in the same result.

You can also use NTILE function to create groups in data. For example, if your company has a that is sold daily. With the NTILE function, you can group the monthly sales into, let’s say, groups of the first ten days in a month, the second ten-day, and the final then days.

You can do all sorts of other things with other aggregate functions if you use them as window functions. They allow you to get the data from any previous or following row. That way you can calculate running total, moving averages, differences between different periods (such as monthly/quarterly/yearly revenue differences), etc. Again, you can do that not only on the overall data but according to subgroups you define when you define a window. That’s why they’re called window functions.

To see how the window functions can be used, here’s one Lyft question: “Find contract starting dates of the top 5 most paid Lyft drivers. Consider drivers who are still working with Lyft.”

Answer: To answer this question, you should use the table lyft_drivers.

SELECT start_date FROM (SELECT start_date, rank() OVER (ORDER BY yearly_salary DESC) FROM lyft_drivers WHERE end_date IS NULL ORDER BY yearly_salary DESC) sq WHERE rank <=5

This one’s not complicated. However, it manages to test you on window functions, RANK() in this case. This function will allow you to rank the drivers first and then implement other criteria to get you the desired result. Besides that, you’ll also have to use the subquery.

5. Date Manipulations

The last technical concept of SQL data science interview questions is date manipulations. This one is always asked because everyone in data analysis works with dates and analyzes time series. This is one of the main jobs for a data analyst: get the raw data and aggregate it by weeks, months, or years. To do that, you’ll need to know how to manipulate dates.

One such question is asked by Facebook: "Calculate the total revenue from each customer in March 2019. Revenue for each order is calculated by multiplying the order_quantity with the order_cost.

Output the revenue along with the customer id and sort the results based on the revenue in descending order.”

Answer: To answer this question, you’ll have to use the table orders.

SELECT cust_id, SUM(order_cost * order_quantity) AS revenue FROM orders WHERE EXTRACT('MONTH' FROM order_date :: TIMESTAMP) = 3 GROUP BY cust_id ORDER BY revenue DESC

To solve this one, you’ll need to extract the month from the order_date to get the customers from March 2019. The code is not too difficult; it tests some simple aggregation, ordering, and filtering of data. You probably know that already, so you just need to know some Date/Time functions, such as EXTRACT().

...

Without knowing these five technical concepts, you probably won’t fare well in the SQL job interviews. There are some simple and some more complex questions that can ask you these concepts. Whatever it is, make sure you’re familiar with them.

So make sure you practice them enough and go through other questions on StrataScratch.

Nate Rosidi, Data scientist & product manager