Saturday, July 23, 2016

8 Reasons why Data Scientists Should not be Managed with Scrum

Scrum is successfully used by many organizations when it comes to software development. It's only natural to assume that it will work well with Data Scientists.

My experience as the VP of Data Science at Treexor reveals a rather different scenario. Using Scrum with Data Engineers works very well for us, just like with the rest of the developers in our organization. But when we tried to use Scrum with our Data Scientists, we started running into serious trouble. This is why it didn't work for us.





The Data Engineer


The world where a Data Engineer lives looks like this:


  1. Tasks can be explained (well-defined) in user stories
  2. There is certainty in the sprint
    1. Tasks can be estimated
    2. With clear outcomes (acceptance criteria)
    3. Tasks can be planned 2 weeks ahead. Unexpected high-priority tasks are the exception
    4. Tasks can deliver business value incrementally within a 2-week sprint
  3. Engineers can usually pick up any task from the sprint. The team can easily switch to another business domain (product, service or process), as long as the technology is similar
  4. They focus on the solution
  5. They can easily understand, present and defend the work of a colleague
  6. The Product Owner (PO) is the only business stakeholder to deal with
  7. Most stakeholders are not interested: they work backstage
  8. The team does not scale beyond 6-8 people due to Scrum's team-size limit

The Data Scientist

The world where a Data Scientist lives looks exactly the opposite!


  1. Tasks can’t be easily explained in user stories
  2. There is uncertainty in the sprint
    1. Tasks can’t be estimated
    2. Unexpected high-priority tasks are the norm
    3. High sensitivity to data availability and quality
    4. All-or-nothing value: Tasks do not deliver business value incrementally
    5. Exploration and discovery: Sometimes, it is a matter of good luck!
  3. She specializes in areas of product functionality (subject-matter expert) and therefore cannot pick up just any task. Domain expertise is required!
  4. She focuses on the problem as well as the solution
  5. She can only effectively present and defend her own results and doesn't need to wait until the end of the sprint. Storytelling is key!
  6. She deals directly with both business and technical stakeholders (executive level, managers)
  7. Most stakeholders are interested: she works on stage!
  8. The team scales beyond 6-8 people.


Let's go through them one by one.

1. Tasks can't be easily Explained in User Stories

We found that describing the exact questions the Data Scientist has to answer in the user story was not a good practice, since often part of her work is to define the questions that need to be answered in the first place.

For instance, the Product Owner would write:

As [whoever] I want to measure the impact of feature [whatever] on the user conversion rate, so that I can decide if it was a good idea or not to develop it.

But then the Data Scientist would start analyzing the data and conclude that the interesting question for the business is to see the impact of the feature on the amount of user fraud, not on the conversion rate.

For this reason the Product Owner was forced to write really high-level and rather ambiguous user-story descriptions, which doesn't make much sense in Scrum.


2. There is Uncertainty in the Sprint

We found it very hard to estimate Data Science tasks. Sometimes a machine learning algorithm would take a few days, sometimes a few weeks or months, depending on how lucky you got with the feature selection. Sometimes a data analysis task would take a few hours, sometimes it would consume the whole sprint, depending on many things, including data availability and quality.

Also, with a rapidly changing business and agile software teams practicing continuous delivery, it is normal for us to receive a new question or problem to solve at any time. Making these questions or problems wait until the end of the current sprint is simply unacceptable. Adding them to the current sprint as unexpected tasks systematically ruined our sprints.

3. Domain Expertise is Required

Scrum tries to reach the point at which every developer on the team can pick up any task from the sprint. This is simply not realistic for a team of data scientists. It's like managing researchers in a university department and having them switch their research subject at random every two weeks.

Data Science requires domain expertise, and this is something that cannot be improvised. It takes time to develop this expertise (say, about online fraud), to read about the subject, to go to conferences and so on.

4. Focus on Problem and Solution

A Data Engineer is given a well-defined problem in a user story and focuses on how to solve this problem in the most efficient and parsimonious way. As mentioned in point 1, in Data Science the problems are usually not well-defined, and part of the challenge a data scientist faces is to define them. Because a data scientist knows about both the business and the data, she can find the question that:
  • a) is relevant for the business to answer, and
  • b) can actually be answered with the available data.

A business person knows a), a data engineer knows b), but only a data scientist knows both.

5. The Data Scientist Presents her own Results when Available

In Scrum sometimes it's encouraged that each member of the team presents the results for the whole sprint for every sprint review. We follow this practice at Treexor, since it motivates developers and helps them own and be proud of the results achieved through their collective effort.

In Data Science this simply doesn't work. Forcing data scientist A to present the results of a study by data scientist B is like asking a researcher to present another researcher's PhD thesis. Domain expertise is required for presenting and defending the work of a data scientist, and this is something that cannot be improvised.

Also, the results of a Data Scientist should be presented to the business stakeholders as soon as they are available. Waiting for the end of a sprint to present them in the sprint review imposes an artificial delay that can slow down business operations.

6. The Data Scientist Deals with Business Stakeholders

In Scrum all interaction between developers and business stakeholders takes place through the Product Owner. Data scientists need not only to speak the language of business, but also to address the business directly in order to understand its problems. It's also often the case that a data scientist must work in close collaboration with a business person. For instance, a Data Scientist may work with the Product Owner of a company product in order to devise a way to optimize some aspect of the product or to develop a model to predict some of its main KPIs.

7. Works on Stage

The Data Scientist is in the spotlight for two reasons. First, because she presents her results directly to the business stakeholders. Second, because she receives pressure directly from the business. A Data Engineer is isolated from this pressure thanks to the Product Owner. In our experience at Treexor, having a Product Owner sitting between business and Data Science often slows down the business and turns the Product Owner into a bottleneck as the number of Data Scientists in the team grows.

8. Scaling beyond 6-8 Data Scientists

A team of Data Scientists who:
  • manage interactions with the business directly, without depending on a Product Owner
  • manage their own cycle of work and presentation of results
  • do not need to collectively estimate tasks
can grow to a number well above the Scrum limit of 6-8. There is no Product Owner acting as a bottleneck, and there are no Scrum meetings where tasks need to be collectively estimated.

The Conclusion


What works best for us at Treexor is to use Scrum for the team of Data Engineers with a Kanban for unexpected tasks and bug fixes, but no Scrum at all for the team of Data Scientists. Instead, for Data Scientists we just use a Kanban.

Data Scientists act as clients for Data Engineers, who develop infrastructure to enable them to do their job better. For Data Scientists, Data Engineers typically develop:
  • Dashboards & tools
  • Implementation in production of machine learning algorithms developed by Data Scientists
Requests of this sort coming from Data Scientists are prioritized by the Product Owner of the Data Engineering team, together with requests coming from the rest of the company.

Data Scientists don't have the Scrum meetings. They:
  • Present results whenever they are ready
  • Meet every day. But instead of telling what they did, what's blocking them and what they'll do, as in Scrum, they talk about results and numbers, and ask each other for help
  • Don't have a Product Owner, although they have a chief, the VP of Data Science, who helps them coordinate their work
  • Don't have a Planning or Refinement meeting. Instead, they have a weekly meeting with the VP of Data Science to follow up on their progress and remove bottlenecks
  • Don't structure their work in 2-week sprints
  • Have Learn & Debate sessions in which they learn new things together
  • Have Forensic sessions in which one Data Scientist presents her results and the rest of them contribute and comment on the methodology to enhance it and to make sure everyone uses the best practices
