Data Science: Cloud-based Tools for Effective Remote Collaboration

David Layton

David / 14 April 2020

If you think your data scientists and engineers can get by on just email and phone calls, you are in for a rude awakening. While IT has its hands full with broader company needs, let’s look at some data science tools for working effectively as a team. As a Data Engineering consultant, I've put these tools into place for several teams. Getting up and running is quick once you know what you are doing.

We’ll cover several tools for each of the basic communication needs for a data team:

  1. Chat: How is your day going?
  2. Tracking work: Who's doing what and how is it going?
  3. Knowledge base/wiki: How do we do X again?
  4. Presentation: I made a thing. What do you think?
  5. Version Control: Where did you put that thing you made?

Data Science Tools: Chat

I’m so old I remember IRC and AIM, but email eventually became the lowest-common-denominator communication tool. That is, until the rise of the likes of Twitter and Slack. My lost youth and “why email is not an appropriate means of communication in the vast majority of situations” are topics for another post. Suffice to say, better teamwork requires a richer and more synchronous form of communication. Here come the chat tools.

Slack

Slack is the de facto solution for developers. It's a hosted solution, so you can sign up to Slack with just an email. However, chat history is limited with the free plan. Upgrading to the next level (currently £5.25 per person, per month, when billed yearly) gets you past this limitation.

Seriously, if you are not familiar with this type of chat tool, sign up and check it out now. Other chat tools merely ape Slack--adding at most one unique selling point; so try Slack first.

Microsoft Teams

Microsoft never met a popular tool it didn't want to clone, poorly. And Teams is just that--a bad Slack clone. Another hosted solution, the only advantage here is the Microsoft stack. That is, if you're already on Office 365 (Microsoft's bad G Suite clone), you already have Teams. It's integrated with Active Directory, so it should know your organisational structure, emails, etc.

If you are on the Microsoft stack give it a try; otherwise, I'd give it a hard pass.

Rocket.Chat

Our first self-hosted solution, and that's its USP, is Rocket.Chat--it's free as in beer and open source. The code's on GitHub, though you can pay for support. They also offer a SaaS solution with its own pricing plans. However, if I wanted SaaS, I'd just go with Slack.

Need to keep all of your chat on your own servers, but not on the Atlassian stack (JIRA, Confluence, etc.)? Then I'd recommend Rocket.Chat. Get in touch with me on social media, or comment below, if you'd like a howto on installing Rocket.Chat. Show enough interest, and I'll put something together.

HipChat

HipChat is the Slack for the Atlassian stack (JIRA, Confluence, etc.)--or at least it was. It seems Atlassian is ditching HipChat for a "strategic partnership" with Slack. You can read more, but no huge loss here.

Data Science Tools: Work/Task Tracking

Managing workloads chiefly requires talking about items of technical work. To do this effectively, you need both a reference for each piece of work and its context. That is, you need a description, all the notes, related material, and the conversation surrounding that work. In scrum, this would be a scrum board. For support work, this would be a ticketing system.

For remote work, we need something digital and online.
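
To make that concrete, here is a minimal sketch in Python of what a tracking tool stores for each piece of work. The class and field names (WorkItem, ref, notes, and so on) are my own illustration, not the data model of JIRA, Trello, or any other tool.

```python
from dataclasses import dataclass, field


@dataclass
class WorkItem:
    """One piece of work: a stable reference plus the context around it."""

    ref: str                       # short, unique reference, e.g. "DS-42"
    description: str
    assignee: str = "unassigned"
    status: str = "to do"          # e.g. "to do", "in progress", "done"
    notes: list[str] = field(default_factory=list)
    links: list[str] = field(default_factory=list)  # related material
    comments: list[tuple[str, str]] = field(default_factory=list)  # (author, text)

    def comment(self, author: str, text: str) -> None:
        """Keep the conversation attached to the work it is about."""
        self.comments.append((author, text))


# Hypothetical example item, for illustration only.
item = WorkItem(ref="DS-42", description="Rebuild the churn model features")
item.status = "in progress"
item.comment("sam", "Blocked on access to the billing tables.")
```

Whatever tool you pick, the point is the same: everyone can say "DS-42" in chat and land on the same description, notes, and conversation.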

JIRA

Need an all-bells-and-whistles solution for tracking work? JIRA will accommodate any type of workflow with the right configuration, but that's not always easy. Consequently, the user experience varies between busy and mind-boggling depending on the skill of the administrator.

Try out the hosted solution for free here. Once you exceed ten users, you will have to pay $7 USD per user per month. You can also run it on your own server for a one-time payment of $10 (until you exceed ten users). Once again the price jumps ($3,500 for 11-25 users).
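
For a rough feel of how those two pricing models compare, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above, assumes a team of 15 (my own example number), and ignores the cost of running your own server, so treat it as a starting point rather than a quote.

```python
HOSTED_PER_USER_PER_MONTH = 7   # USD, hosted plan once you exceed ten users
SELF_HOSTED_ONE_TIME = 3500     # USD, one-time payment for 11-25 users


def hosted_cost(users: int, months: int) -> int:
    """Total hosted spend in USD for a team of `users` over `months`."""
    return users * HOSTED_PER_USER_PER_MONTH * months


def break_even_months(users: int) -> int:
    """First whole month in which hosted spend reaches the one-time price."""
    monthly = hosted_cost(users, 1)
    return -(-SELF_HOSTED_ONE_TIME // monthly)  # ceiling division


team = 15  # assumed team size, for illustration only
print(hosted_cost(team, 12))    # 1260
print(break_even_months(team))  # 34
```

So for a 15-person team, hosted JIRA costs $1,260 a year and takes nearly three years to catch up with the self-hosted licence, before you count any server or admin time.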

Trello

Trello is a great, spartan solution that's quick to get off the ground and might be all you ever need. It's what I use. Sign up here for free (limited to ten team boards). Then it's $10 per user per month (detailed pricing here). Note: As of January 2017, Trello is also owned by Atlassian.

It's also really good for retrospectives.

Others

GitHub Issues could work in a pinch unless you are doing more support ticket style work. Then I'd use a ticket tracking system like Zendesk. Salesforce also has ticket tracking. You just need something everyone can access.

Get in touch if you would like me to do a deep dive on the alternatives--particularly anything self-hosted.

Data Science Tools: Knowledge Base (Wiki)

You’re going to need some sort of wiki — you can’t always walk over to someone’s desk and, shoulder-to-shoulder, sort out computer problems like our ancient hunter-gatherer ancestors did.

Curate howtos for your basic workflow and anything that usually assists or replaces in-person training/support. These data science tools build your workplace's organisational knowledge, i.e. knowledge that doesn’t just walk out the door when the employee does.

Confluence

Confluence is an amazing tool and the centrepiece of the Atlassian stack. It's quick to learn and easy to use. You can sign up for free here. Once you exceed ten users, however, you'll be paying $5 per user per month. Further pricing information can be found here. It's difficult to migrate away from Confluence, so please consider the long-term implications of their pricing model.

MediaWiki

MediaWiki powers Wikipedia so the user experience will be familiar. However, if your users are not regular Wikipedia moderators or contributors, the learning curve will be significant. It is free and open source if you want to host it yourself. It's such a popular and mature piece of software that there are a number of hosted providers.

I have personally used Cloudways for this for small teams. It was so simple and just a wonderful experience. However, I'd be happy to write a howto on getting this set up on AWS, GCP, or DigitalOcean manually for anyone interested.

SharePoint

SharePoint is similar to Confluence but on the Microsoft stack. I do not have much personal experience with it. If you are on the Office 365 stack, have a go.

GitHub/GitLab

Already have GitHub (or GitLab)? Edit wikis directly on GitHub, and the information will be linked and searchable. When your needs outgrow GitHub, migrate to another solution. It's easy because it's git. Find out more about the limitations here.

I'd love to hear about your experience using the GitHub wiki or any of the above data science tools.

Data Science Tools: Presentation and Conferencing

The internet is currently flooded with articles to help you pick a video conferencing tool. I won't discuss it here, but I have used Zoom, Skype, Whereby and Microsoft Teams. They're all equally lacking in various areas. Invest in a good webcam. The major challenge is integrating with the in-office staff. That is too large a topic to address here, but people in the office need a good omni-speaker.

Let's focus instead on the specific needs of a data science team.

Data Science Hub

Presenting insights from data can be particularly difficult through words alone. Collaborating towards finding those insights is even more challenging--you must see and manipulate the data itself. SHOW what you did and build upon each other's work in tandem.

One solution is "notebooks"--the central data science tools. Let’s look at sharing and co-creating notebooks.

JupyterLab

Jupyter lets you code in Python, R and Julia and makes visualisation a breeze. Try JupyterLab here in your browser for free. Installing it on your own server is straightforward, but you'll soon hit performance issues--particularly around memory. I use Docker because the dependencies are not trivial, and you can easily trash a system installing it directly. Scalable solutions like Amazon SageMaker and Google DataLab can have you up in minutes, but calculating costs is non-trivial -- it depends on your analyses and work patterns. Additionally, you may need a bit of infrastructure to easily and reliably get your own packages working with any of these solutions. If you get in the weeds, reach out to me.

Zeppelin

Zeppelin is an Apache project and more for those using JVM languages like Java, Scala, or Kotlin -- though it does support Python. Deployment in the cloud is an option with both Amazon and Google: Zeppelin runs on Amazon EMR, and I've personally set it up on Google Dataproc with no headaches. I really like its SQL and Spark integration when working with terabyte-sized datasets.

Dashboards

Sometimes you need something interactive for use outside of the team. You could make dedicated notebooks--tailoring to the target audience and applying the right levels of visibility and security. However, a notebook might be a bit daunting for non-technical users. For those situations, use dashboards.

JIRA/Confluence

Both JIRA and Confluence have dashboard capabilities, though limited. If either is already popular where you work, give it a try. You should be able to deliver something functional. But if you really want to wow your users, check out these other options.

Google Data Studio

Data Studio was in beta for a while and still has some limitations. However, you can start using it in minutes, and it integrates well with BigQuery and other Google products. Actually, it integrates well with most data sources. It's aimed at marketing, for delivering continuously updated infographics, but it is fairly flexible. As a result, it's really easy to learn. In contrast, alternatives like Tableau can quickly have you disappearing up your own ass.

The next two options require more development time, but can deliver more bespoke, complex user experiences.

Bokeh

With Bokeh, data scientists and engineers create web-based interactive dashboards in pure Python. It is most commonly used with Flask to deliver the webpages that house these dashboards, but you can also use Bokeh within notebooks.
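
As a rough sketch of that workflow, the snippet below builds a small chart and turns it into the script and div fragments that a Flask template could embed. It assumes Bokeh is installed (pip install bokeh); the data and the build_plot name are made up for illustration.

```python
from bokeh.embed import components
from bokeh.plotting import figure


def build_plot():
    """Build a small interactive line chart from made-up sample data."""
    weeks = [1, 2, 3, 4, 5]
    signups = [12, 18, 9, 24, 30]  # hypothetical numbers, for illustration only

    plot = figure(
        title="Weekly signups (sample data)",
        x_axis_label="Week",
        y_axis_label="Signups",
        tools="pan,wheel_zoom,reset,hover",
    )
    plot.line(weeks, signups, line_width=2)
    return plot


# components() returns a <script> block and a <div> placeholder as strings,
# ready to be dropped into an HTML page.
script, div = components(build_plot())
```

In a Flask view you would pass script and div into your template and load the matching BokehJS files in the page head; in a notebook you would call Bokeh's show() on the plot instead.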

Although I do often use this solution, you will eventually hit the limits of what can be done without JavaScript. If you have serious in-house JavaScript skills and want to make a serious product, take a look at D3.js.

Dash from Plotly

https://www.youtube.com/watch?v=o5fgj1AIq4s

If you already use Plotly in your JupyterHub/Lab notebooks, this can be a smooth transition. Dash is another "No JavaScript" solution and the one I would currently recommend for those who need more than can easily be done in Data Studio. Although I wouldn't say it is intuitive, I'm looking forward to doing more with Dash in the future.

Version Control

No one wants to spend a day, or more, wrestling a bug that someone's already fixed in another version. Worse yet is putting the wrong version of the code into production!

Although not traditionally considered a data science tool, proper version control prevents these nightmares. It’s so fundamental, there are several options. Let's explore a few.

GitLab

I love GitLab and will definitely be writing many dedicated posts (more like love letters) on what you can do with this software. You can install it for free on your own servers or use GitLab's hosted offering (which also has a free tier suitable for smaller operations).

I personally prefer running it on AWS or Google Cloud. The reason is GitLab CI, GitLab's build machine solution. It's a fully integrated version control and continuous integration solution.

Still using Jenkins? Stop. If you are starting greenfield, get in touch. I love getting new teams to their most productive with the latest best practices for this tool.

GitHub & Bitbucket

GitHub and Bitbucket offer largely the same functionality, and both are free for small and medium-sized teams. Most developers are already familiar with both, although your data scientists may require an introduction.

GitHub is far more popular, but I still use Bitbucket for historical reasons. However, popularity usually translates into better integration and features further down the road. Almost every open source project has moved over to GitHub over the past few years--mostly from older platforms. Unless you are using the other Atlassian tools we've discussed, GitHub would be my recommendation between the two.

That being said, neither has a serious advantage over the other solutions.

AWS CodeCommit

If you are using AWS, consider CodeCommit. You only get charged for the related S3 storage--so peanuts. However, that's still more than the no-cost options we discussed above. The main advantage is that it will be easier to integrate with your other Amazon resources.

Google Cloud Source Repositories

Similar to CodeCommit, Cloud Source Repositories is Google's offering. Again, the sole advantage is integration with its proprietor's cloud, Google Cloud.

Where to start?

We've discussed a lot of tools and it may seem overwhelming, but once you choose a stack and/or a cloud vendor, it's just getting the pieces to fit together. What we haven't discussed is the bigger picture of getting your data science team's practices aligned with the tools.

I hope this introduction has you bubbling with ideas while giving you some concrete next actions. Try these tools and figure out which ones fit your needs. If you get stuck or get into trouble, feel free to reach out to me via the comments, the contact form, or social media. This is how I get ideas for my next post. I love helping people give their team the best tools. And even if I've made things crystal clear, and you just don't have the time or the hands, I can help.

I’d love to hear about times when you used any of the above tools or their ilk--especially when it went spectacularly well or wrong. If you have any questions, comments, impassioned speeches, or heated denunciations, reach out.