My top 5 takeaways from Build a Career in Data Science (Settling into data science) – PART 3

Posted on: May 9, 2024
Post Category: Book Notes

‘Build a Career in Data Science’ is one of the most consequential books I’ve read for my growth in the data analytics space – particularly the earlier parts of the book, which helped me understand the space and apply for data analytics jobs.

If you’re more interested in that aspect, I’ve attached links to my notes on those two parts here:

Getting started with data science (part 1)

Finding your data science job (part 2)

For this particular blog post, I’ll be focussing on the third part of the book: Settling into data science.

Not every data analysis/science job is the same; the tools differ, the expectations may differ, and so on. But this part of the book gives readers a general idea of what might be in store for them in the early months of their role.

So here are the 5 key takeaways I took from this part of the guide:

Note that this part of the book is very comprehensive, and that my coverage below will not align with everything that was written in the book; there are a few pages about thinking about leaving your data science job, which I have intentionally left out for the sake of relevance. If you want a complete view of the contents, I would recommend getting the book.

1. What to expect from the first months of the job

    When you start your new data science role, you will want to get as much done as possible. But the first couple of months should be about learning and about doing tasks the right way.

    • Your first months should be the chance to set up a system and support network that will help you be successful.
    • When you’re starting, you’ll often be given a few weeks to undergo training and get access approved. You may be frustrated that it’s taking so long to feel productive, but a slow start is natural in this environment.
    • You might be given a task, but you should be focussing on the process, not the result. Understand the approach and ask lots of questions at this stage, because this will help you do your job better later.
    • This is more challenging when you work at a small company, where you might need to build out processes yourself that will be successful in the long run.
    • Ideally, your manager has a vision of what you’ll be doing but is open to your priorities and strengths. You’ll need to meet with your manager to discuss priorities, and together you will want to define what success means in your job.
      • And the way to find out whether you’re meeting expectations is to have regular meetings with your direct supervisor.
    • Make a plan with your manager to have a review after your first three months if that’s not common practice.
    • If your company has been doing data science for a while, a great place to start is reading reports that employees have written. Reports will tell you not only what types of data your company keeps (and give you key insights), but also the tone and style of how you should communicate your results (to non-technical stakeholders).
    • Be familiar with where the data lives and get access to it, and read the documentation available for these tables.
    • Learn how the data got to you – and the systems/steps involved.
    • You will make things harder for your manager at the start of your role – but that is expected. You can still deliver value early on though – focus on simple, descriptive questions to investigate (e.g., distribution of client size) and check your findings and approach with your manager.
    • Your manager would rather you ask questions and take up a few minutes of someone’s time than be stuck spinning your wheels for days. Learn about the question culture at the company. Try to find the answers yourself (or do research) before you ask questions. Try to book meetings with experts for whom you have many questions. See how others arrived at their answers, and look at their environment and techniques. Keep a list of questions.
    • Set up nontechnical talks with people so you can become comfortable relying on one another. Ask your manager for a list of people you should get to know – and try to meet all of them. And if you want someone to be a long-term mentor/sponsor, keep them updated on how you’ve followed their advice or taken advantage of the opportunity they helped you with.
    • If you’re the first data scientist, you have to consider the programming languages you should use (generally a commonly-used one would be good), and manage expectations about what data science is capable of and how quickly goals could be achieved.
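
To make the "simple and descriptive questions" idea concrete, here is a minimal Python sketch of the client-size example. This is my own illustration, not something from the book: the client sizes are made-up numbers, and the size bands in `size_band` are assumed cut-offs that you would agree with your manager.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical data: number of employees at each client (made up for illustration).
client_sizes = [3, 8, 12, 12, 25, 40, 60, 150, 480, 2200]


def size_band(employees: int) -> str:
    """Assign a client to an assumed size band (cut-offs are illustrative)."""
    if employees < 10:
        return "small (<10)"
    if employees < 100:
        return "medium (10-99)"
    return "large (100+)"


def describe(sizes: list[int]) -> dict:
    """Return simple descriptive statistics plus client counts per size band."""
    bands = Counter(size_band(s) for s in sizes)
    return {
        "n_clients": len(sizes),
        "mean": mean(sizes),
        "median": median(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "bands": dict(bands),
    }


summary = describe(client_sizes)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With this made-up data the mean is far above the median, which is the kind of simple finding (a few very large clients skew the average) that is worth checking with your manager before going further.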

    2. How to make an effective analysis

      An analysis is typically a PowerPoint deck, a PDF or Word file, or an Excel spreadsheet that can be shared with non-data scientists, containing insights from the data.

      These examples have varying levels of technical complexity; some require only summarizing and visualizing data, whereas others need optimization methods or machine learning models, but all of them answer a one-time question.

      What makes a good analysis?

      • It answers the question – with an appropriate data science approach.
      • It is made quickly – in time for critical business decisions/deadlines.
      • It can be shared – not just scripted with a programming language, but in something sharable like a slide deck or document.
      • It is self-contained – understandable on its own.
      • It can be revisited – in the common case that the question is asked again.

      “A good analysis is something that helps non-data scientists do their job.”

      3. Have an analysis plan

        Here’s a simple workflow/timeline of stages that can happen when working with a data request:

        • The request – a person will come to you with a question, and your job is to turn that business question into a data science question (which you can answer with a data science answer, translated into a business answer). Here you need to understand the context (so you know what is helpful), ask questions, get acquainted with your stakeholder, and you must know whether you have the data to plausibly answer the question (so you manage expectations).
        • The analysis plan: an analysis plan helps you keep track of how much you have completed, and gives you something concrete to discuss with your manager when you meet about how things are going. Generally this involves listing the questions you would like to answer, and listing the (independent) tasks that would help answer those questions. It is also helpful to share with stakeholders to confirm the approach and manage expectations.
        • Doing the analysis
          • Importing and cleaning data – here you are supposed to focus on data-cleaning work that supports your actual analysis (i.e., cleaning the columns you will actually use), ask for help or look for ways to avoid data connection problems, and raise any unusual data.
          • Data exploration and modelling – using general summarisation/transformation, visualisations, and modelling. You should save as much as possible of what works and flag which results/findings are “good”, isolate modelling code from general analysis work, avoid sharing exploratory plots with non-technical audiences, stick to simple modelling techniques so they are easy to understand/defend/debug, check in frequently with your stakeholder, and design your analysis code for a one-button run (which you can use to produce results quickly and repeatedly without error).
          • Wrapping up. This means creating a final document that tells a story – so that anyone who doesn’t have your level of contextual knowledge can understand.
            • Final presentation: Walk them through each component, describing what you did, what you learned, and what you chose not to look into. Be up front with what you know and don’t know and don’t be afraid to say you will look into something. Pushing back on (inconsequential) requests may be involved here.
            • Mothballing your work: Save your code (with documentation) so that it can be executed again if need be sometime down the line.
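
As a rough illustration of the "one-button run" idea from the workflow above, here is a small Python sketch of my own (not code from the book). The CSV text, the column names and the function names (`load`, `clean`, `analyse`, `report`, `run_all`) are all hypothetical; the point is only that each stage is a separate function and a single call runs everything end to end.

```python
import csv
import io
from statistics import mean

# Hypothetical raw export; in practice this would come from a file or a database query.
RAW_CSV = """client,region,revenue
Acme,North,1200
Bolt,South,
Cord,North,800
Dune,South,500
"""


def load(raw: str) -> list[dict]:
    """Import step: parse the raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))


def clean(rows: list[dict]) -> list[dict]:
    """Cleaning step: only touch the column we analyse, dropping rows with no revenue."""
    cleaned = []
    for row in rows:
        value = row["revenue"].strip()
        if value:
            cleaned.append({**row, "revenue": float(value)})
    return cleaned


def analyse(rows: list[dict]) -> dict[str, float]:
    """Analysis step: average revenue per region, in alphabetical region order."""
    by_region: dict[str, list[float]] = {}
    for row in rows:
        by_region.setdefault(row["region"], []).append(row["revenue"])
    return {region: mean(values) for region, values in sorted(by_region.items())}


def report(results: dict[str, float]) -> str:
    """Reporting step: format the results as plain text that can be shared."""
    lines = ["Average revenue by region"]
    lines += [f"- {region}: {value:.2f}" for region, value in results.items()]
    return "\n".join(lines)


def run_all(raw: str = RAW_CSV) -> str:
    """The 'one button': run every step in order and return the final report."""
    return report(analyse(clean(load(raw))))


print(run_all())
```

Keeping the steps separate also helps with mothballing: whoever reruns the analysis later only needs to call `run_all` with fresh data.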

        4. How to work with stakeholders

          The term stakeholder can be thought of as “a person, group, or organisation that is actively involved in a project, is affected by its outcome, or can influence its outcome.”

          Here are the different types of stakeholders you can come across – and what to expect from each of them:

          • Business stakeholders: these are people from marketing, customer care, product, etc. who have little technical background. Often it is best to help them understand what you did and how you did it.
          • Engineering stakeholders: they are in charge of maintaining code that the company delivers – and especially when a company product involves a machine learning model or analysis, they become stakeholders. The role of a data scientist is more exploratory/uncertain, which might appear foreign to a software engineer; in that scenario, you will need to communicate the process early and communicate often about the progress you are making, so there aren’t any surprises.
          • Corporate leadership: Corporate leadership tend to work with data scientists when they need data to make major decisions or when they want to have a better understanding of the company. Collaborating poorly could be a serious blow to the team. Collaborating well will help the data science team gain leverage to use data/analytics in more places.
          • Your manager: They want you to succeed, and they help the project get as far as possible. Give clear updates, communicate continuously, and deliver presentable work. With your manager, you can also let your guard down and be more vulnerable.

          And to communicate effectively with stakeholders during your data science projects, there are four core tenets to think about:

          • Understand the stakeholder’s goals: understand their goals/motivations as quickly as possible – ask what’s important to them and ask your colleagues about them. Frame your interactions/relationship as a collaboration.
          • Communicate constantly: You should keep stakeholders in the loop about how the project is meeting the expected timeline, you should communicate progress including findings and areas that are more difficult than expected, and you should update the stakeholder about how the work informs the business and what comes next.
          • Be consistent: You are a mini-business. Standardise how your analysis is structured, how your analysis is delivered and how your analysis is styled (i.e., use the same colours and templates). And if you’re building many data science products, be consistent with the input data structure, the output data structure, and the authentication.
          • Create a relationship

          To prioritise your work between these different stakeholders, think about which bucket each task falls into:

          • Innovative and impactful
          • Not innovative but impactful
          • Innovative but not impactful
          • Neither innovative nor impactful

          Prioritise by impact first, then by innovation. Delivering value is the most important thing as an analyst.
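
The "impact first, then innovation" rule can be written as a simple two-key sort. The sketch below is my own illustration in Python; the `Request` class and the example backlog items are hypothetical, and in practice deciding whether something is impactful is the hard part, not the sorting.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    impactful: bool
    innovative: bool


def prioritise(requests: list[Request]) -> list[Request]:
    """Order requests by impact first, then innovation; ties keep their original order."""
    # False sorts before True, so negate each flag to put impactful/innovative work first.
    return sorted(requests, key=lambda r: (not r.impactful, not r.innovative))


# Hypothetical backlog, deliberately listed in the wrong order.
backlog = [
    Request("Rebuild an old dashboard in a new tool", impactful=False, innovative=False),
    Request("Experiment with a novel clustering method", impactful=False, innovative=True),
    Request("Weekly sales report", impactful=True, innovative=False),
    Request("Churn model for the retention team", impactful=True, innovative=True),
]

for rank, request in enumerate(prioritise(backlog), start=1):
    print(rank, request.name)
```

The sorted order matches the four buckets listed above, from "innovative and impactful" down to "neither innovative nor impactful".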

          5. Putting a machine learning model into production

            Since this can get a bit technical, I will share a high-level overview of the process of deploying a machine learning model (which is useful, specifically in the context of deploying one for an industry use case).

            Essentially, the point of a data science project is sometimes to create a model that runs continuously, e.g., as part of a product.

            ‘Deploying a model into production’ basically means making it run continuously to provide near-real-time predictions/classifications – specifically for something customer-facing. And it needs to run regardless of how weird the data gets.

            And the steps for building one are as follows:

            • Gathering appropriate data, doing feature selection, training the model, and getting business buy-in.
            • Converting the model so that it is accessible for other programs. Typically this is done by allowing the model to be accessed as an API.
            • Writing code so that the model can handle many possible inputs without crashing.
            • Deploying the model to a test environment to ensure it handles the traffic it might get when going live.
            • Deploying the model into a production environment.
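
To give a feel for the "handle many possible inputs without crashing" step, here is a minimal Python sketch of my own (not from the book). Everything in it is an assumption for illustration: `model_predict` is a stand-in formula rather than a trained model, the feature names `tenure_months` and `monthly_spend` are invented, and `handle_request` is a framework-agnostic function that a real API layer would call with the raw request body.

```python
import json
import math

FEATURES = ("tenure_months", "monthly_spend")  # hypothetical input fields


def model_predict(tenure_months: float, monthly_spend: float) -> float:
    """Hypothetical stand-in for a trained model: returns a score between 0 and 1."""
    score = 1 / (1 + math.exp(0.05 * tenure_months + 0.01 * monthly_spend - 2))
    return round(score, 4)


def error(status: int, message: str) -> tuple[int, str]:
    """Build an (HTTP status, JSON body) error response."""
    return status, json.dumps({"error": message})


def handle_request(body: str) -> tuple[int, str]:
    """Turn a raw JSON request body into an (HTTP status, JSON body) pair without raising."""
    try:
        payload = json.loads(body)
    except (json.JSONDecodeError, TypeError):
        return error(400, "body must be valid JSON")
    if not isinstance(payload, dict):
        return error(400, "body must be a JSON object")

    features = {}
    for name in FEATURES:
        value = payload.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return error(422, f"{name} must be a number")
        try:
            number = float(value)
        except OverflowError:  # integers too large to represent as a float
            return error(422, f"{name} is out of range")
        if not math.isfinite(number) or number < 0:
            return error(422, f"{name} must be a finite, non-negative number")
        features[name] = number

    try:
        score = model_predict(**features)
    except Exception:
        # Whatever goes wrong inside the model, the service itself keeps running.
        return error(500, "prediction failed")
    return 200, json.dumps({"score": score})


print(handle_request('{"tenure_months": 12, "monthly_spend": 50}'))
print(handle_request('{"tenure_months": "twelve", "monthly_spend": 50}'))
print(handle_request("not json"))
```

The design choice here is that bad input gets a clear error response instead of an exception, so one weird request cannot take down something customer-facing.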

            About the author

            Jason Khu is the creator of Data & Development Deep Dives and currently a Data Analyst at Quantium.