How would you deal with an ambiguous problem? (Data science interview question)
I was asked the following two questions during a data science interview at a large tech company recently:
1. How would you identify users “interested in ______” from their social media data?
2. How would you deal with an ambiguous problem?
I felt my answer was pretty hand-wavy, so in this post I’ll structure my thoughts and give both questions a better answer.
For the purposes of this post, let’s assume the problem is to identify users “interested in considering Master of Public Health (MPH) programs”. Astute observers will note that this is a classic ambiguous data science problem… it’s not clear what “interested in” or “MPH program” really means. We’ll discuss my proposed approach for solving this classification problem and getting buy-in from key decision makers and stakeholders.
My basic workflow is the following:
- Clarify requirements
- Establish working definitions
- Identify proxy signals
- Validate proxy signals
- Iterate (until you convince yourself)
- Communicate results & share evidence
- Iterate (until you convince others)
Details below.
1) Clarify requirements
When a data science research question is proposed by a Product Manager or other decision-making stakeholder, it’s often guided by vague intuitions.
For a classification problem like this, we primarily care about two details: (a) the classes and (b) the intended use case.
Knowing the classes requires translating intuitions into discrete categories. For example, is this a binary problem (“users interested in MPHs” vs. “everyone else”) or is the problem motivated by more specific subsets (“users currently attending an MPH program”, “users considering graduate school”, “users who don’t know what an MPH is but are interested in public health”, etc.)?
To evaluate the classification task, we have to know something about the intended use case. Three cases:
Precision-focused problems: Precision-focused problems care a lot about avoiding false positives but are okay “missing” some users. In this context, that might be a problem like “identify a set of users that we can contact for interviews about why they’re interested in MPHs” or “characterize the link-sharing behavior of people interested in MPHs”.
Recall-focused problems: Recall-focused problems care about identifying as many of the users who meet the criteria as possible but are okay with including some less relevant users (false positives). In this context, that might be a problem like “what’s the upper bound on the market share for advertising MPH programs to prospective students” or “for legal compliance reasons, we need to maintain a list of every MPH-interested user on our platform”.
Balanced problems: Balanced problems care roughly equally about precision and recall. In my experience they are fairly uncommon, and tend to have a few causes: (a) The stakeholders don’t know what the resulting analysis will be used for, and are trying to keep their options open. (b) The stakeholders need to communicate with others that have diverse needs; this includes demonstrating a general capability (e.g. in an academic publication) or satisfying conflicting demands. (c) The stakeholders want to estimate the proportion of the classes. (d) The stakeholders actually care about all of the classes equally (e.g. “we are launching two ad campaigns and need to assign every user to exactly one campaign based on their interest in MPH programs”).
Identifying the type of problem and its intended use case will affect how the stakeholder is likely to evaluate your success (and the evaluation metrics and methods you should use). In the event that the stakeholder is unable to give you guidance about the intended use case, you’re in trouble: the stakeholder may not know what success will look like or what they hope the output of the analysis will be.
In these cases, you should generally make multiple proposals (ideally with preliminary results), with at least one focused on precision and at least one focused on recall, and see what stakeholders respond to. Failing that, a precision focus is the more conservative choice and will help limit unjustified extrapolations from your results down the line. (N.B.: be wary when presenting proportion estimates for precision-focused problems.)
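To make the distinction concrete, here is a minimal sketch of computing precision and recall from a small hand-labeled validation sample. The labels and predictions are made up for illustration and aren’t from any real data:

```python
# A hand-labeled validation sample and a proxy signal's predictions; both are
# made up for illustration (1 = "interested in MPH programs", 0 = everyone else).
true_labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted   = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]

tp = sum(1 for t, p in zip(true_labels, predicted) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(true_labels, predicted) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(true_labels, predicted) if t == 1 and p == 0)

precision = tp / (tp + fp)  # of the users we flagged, how many are truly interested?
recall = tp / (tp + fn)     # of the truly interested users, how many did we flag?

print(f"precision={precision:.2f}, recall={recall:.2f}")  # 0.75 and 0.60 here
```

A precision-focused stakeholder mostly cares about the first number; a recall-focused one mostly cares about the second.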
2) Establish working definitions
With an intended use case in hand, we can now assert definitions about the conceptual objects of interest. These assumptions will underpin our analysis and inform our measurement decisions. Generally, we should create our definitions based on our domain knowledge, which can come from a few places: (a) knowledge of existing business processes or market research, (b) academic research or competitor processes, or (c) investigation of available data (e.g., qualitative content analysis of posts that explicitly mention “MPH”).
In this case, I might define “interested” as “likely to apply for 1+ MPH programs in the next 2 years”, and “MPH programs” as “accredited US-based 2-year MPH programs”. Discussing your working definitions with stakeholders is an effective way to elicit further requirements and to develop a shared language.
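As a quick aside on path (c) above, pulling the posts that explicitly mention the term can seed that kind of qualitative review. A minimal sketch, assuming posts are available as (user_id, text) pairs; the toy data, the matching rule, and the sample size are assumptions for illustration only:

```python
import random

# Hypothetical input: each post is a (user_id, text) pair from the platform's data.
posts = [
    ("u1", "Just submitted my MPH application to Emory!"),
    ("u2", "Weekend hike photos"),
    ("u3", "Torn between an MPH and an MS in epidemiology..."),
]

# Crude pull of posts that explicitly mention "MPH"; a real pass would tokenize
# more carefully and handle casing, plurals, etc.
mph_posts = [(uid, text) for uid, text in posts if "MPH" in text]

# Read a small random sample manually to refine the working definition of "interested".
random.seed(0)
for uid, text in random.sample(mph_posts, k=min(25, len(mph_posts))):
    print(uid, "::", text)
```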
3) Identify proxy signals
Given working definitions, we can now operationalize those definitions as something we can compute with and measure. That might involve training a statistical classifier or identifying a series of rules.
For example, a rule like “do any of the user’s posts contain the tokens ‘MPH’ and ‘application’” might be a good way to start investigating potential proxy signals: how many users does this rule return? What is its apparent precision?
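A minimal sketch of that rule, assuming posts have already been grouped by user; the toy data and the helper name are hypothetical:

```python
import re

# Hypothetical input: posts grouped by user id; in practice this would come
# from the platform's data warehouse.
posts_by_user = {
    "u1": ["Just submitted my MPH application!", "So nervous about decisions."],
    "u2": ["Weekend hike photos", "New trail shoes arrived"],
}

MPH = re.compile(r"\bMPH\b")
APPLICATION = re.compile(r"\bapplication\b", re.IGNORECASE)

def rule_flags(posts):
    """Proxy rule: does any single post contain both 'MPH' and 'application'?"""
    return any(MPH.search(p) and APPLICATION.search(p) for p in posts)

flagged = {uid for uid, posts in posts_by_user.items() if rule_flags(posts)}
print(f"{len(flagged)} of {len(posts_by_user)} users flagged by the rule")
```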
4) Validate proxy signals
Validating proxy signals is a question of sampling procedure and labeling procedure.
Sampling procedure determines which users we will inspect in order to assess the performance of our proxy signal for the classification task. If the task is precision-focused, we can sample randomly from the users classified as positive. If the task is recall-focused, things are harder: we need to cast a wider net, but if our positive class is rare (e.g. <1% prevalence), we will need to look at a lot of data to find just a few positive examples. In that case, we will likely need to make additional assumptions, triangulating our recall evaluation by randomly sampling from a few different subsets (e.g. “what’s our recall among users who list an MPH in their Twitter bio” + “what’s our positive class precision” + “what’s our recall among users who mention MPH in a recent post”).¹
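A minimal sketch of both sampling moves, with synthetic user-id sets standing in for real platform data; the bio-based subset is one example of the triangulation slices mentioned above:

```python
import random

random.seed(42)

# Hypothetical user-id sets: in practice, flagged_users would come from the
# proxy signal and bio_mph_users from a metadata filter (an MPH listed in the bio).
flagged_users = {f"u{i}" for i in range(0, 500, 7)}
bio_mph_users = {f"u{i}" for i in range(0, 500, 13)}

# Precision check: hand-label a random sample of the users the proxy flagged.
precision_sample = random.sample(sorted(flagged_users), k=50)
print(f"{len(precision_sample)} flagged users queued for manual labeling")

# Recall triangulation: among users who list an MPH in their bio (assumed to be
# mostly true positives), what fraction does the proxy signal also flag?
recall_in_bio_subset = len(bio_mph_users & flagged_users) / len(bio_mph_users)
print(f"recall within the bio subset: {recall_in_bio_subset:.2f}")
```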
Labeling procedure determines the methods we will use to inspect the sampled users and assess them against some notion of “ground truth”. If we expect the users’ data to contain signals that humans can use to reliably create a ground truth, then manual annotation by ourselves or by crowdworkers (potentially with some form of IRR measure) can be a good solution. Otherwise, we’ll likely have to bring in outside data, such as annotation by experts (if they exist for the problem in question), self-report (by conducting a survey), or some other computable evaluation criterion (e.g. “do the identified users enroll in an MPH program within 2 years?”; we might have to wait a while to collect that metric, but it will be a direct measure of our concept of interest!).
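If we go the manual-annotation route, a simple inter-rater reliability check such as Cohen’s kappa tells us whether humans can in fact label this reliably. A minimal sketch with made-up annotator labels (the formula itself is standard):

```python
from collections import Counter

# Hypothetical labels from two annotators on the same sampled users
# (1 = "interested in MPH programs", 0 = not).
annotator_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]

def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in set(a) | set(b)) / n**2
    return (observed - expected) / (1 - expected)

print(f"kappa = {cohens_kappa(annotator_a, annotator_b):.2f}")  # 0.60 here
```

Low agreement is a signal that the working definitions (step 2) need another pass before trusting any labels built on top of them.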
5) Iterate (until you convince yourself)
The first person to convince is yourself. If I don’t think I’ve reasonably addressed the question (because the base assumptions don’t make sense, because the proxy signals I’m using are noisy, etc.), I keep iterating from step 2 until I have a clear narrative to share. (It’s okay if the “clear narrative” ends up being “given the data we have, we can only get a very unsatisfying answer to our original question”.)
6) Communicate results & share evidence
I use two primary tools when communicating results: examples and numbers. Communicating results is fundamentally about narrative, and different people respond to different narratives. Particularly when your results challenge the intuitions or assumptions of stakeholders, you’ll need to adjust your presentation to match their preferred forms of evidence (and their preferred presentation medium). I’ve presented detailed quantitative visualizations demonstrating the infeasibility of an approach without getting much of a response… and then shared a single example that completely transformed the conversation.
Even if the primary stakeholder is myself, I always create both numeric summaries of my results and examples — pulled from the data — that serve as both exemplars and counter-examples to general trends.
7) Iterate (until you convince others)
Ultimately, your goal is to move forward with an analysis that will benefit a product or process decision. Discussing results with others is valuable because they will have perspectives that can challenge your own assumptions and understandings and lead to future analyses. Often, it’s only at the end of an analysis that we can properly assess whether our working definitions were reasonable, or whether it’s worth pursuing more expensive proxy signals.
I’m interested in other perspectives about data science workflows and handling ambiguous problems; please share any pointers. These are my thoughts on how I’ve handled these problems in the past, but it really just boils down to: state my assumptions clearly, ground my analysis in the data, and iterate as needed.
Interested in reading more about data science workflows? Check out “Characterizing Practices, Limitations, and Opportunities Related to Text Information Extraction Workflows”, a CHI 2022 paper I recommended on my reading list.
¹How to evaluate recall-centric problems is an open research question. Interested readers might consider NLP researcher Yoav Goldberg’s thoughts (shared March 2023).