Data Science Digest

DataScience Digest — 19.05.21

Hey folks,

As promised, we keep coming up with ideas to make life easier for the members of our amazing community. This time, we are tackling the how-do-I-find-the-right-dataset problem.


Well, if you are (and, frankly, you should be), I’d appreciate it if you could spend a few minutes filling out the survey. Your answers will help us get going with the dataset catalog (more about it inside) and learn about your dataset needs.

Looking forward to your feedback!

Best regards,

Dmitry Spodarets


What was new last week?

AI is a dynamic field. Just five years ago, for example, we didn’t use Transformers in deep learning and didn’t have access to Transformer-family models like BERT, ALBERT, and the GPT series. Now, scaling up the amount of training data and compute power has become a common way to improve the performance of such models. This is exactly what OpenAI did, first with GPT-2 and then with GPT-3.

However, OpenAI’s models are accessible to only a select group of companies, because OpenAI withholds public access to its trained GPT models. Thankfully, that’s going to change with the release of GPT-Neo, an open-source alternative to GPT-3.

The engineering team at LinkedIn also chose to follow the open-source path. They released Greykite, an open-source Python library built to support LinkedIn’s forecasting needs. Its main forecasting algorithm, called Silverkite, is fast, accurate, and intuitive, making it suitable for interactive and automated forecasting at scale.

AI isn’t just about technology, though. Its practical applications in various industries, from agriculture to entertainment, make a huge difference for day-to-day users. For example, Reface, a face-swapping video application, now lets its users work not just with selfies but also with a variety of self-uploaded content and bring it to life with AI. The new feature, enabled by GAN algorithms, expands the app’s potential by letting users supply their own source material to face swap and animate.

Reach DataScience Digest readers by sponsoring an issue.

Click here for details.


Meet skweak: A Python Toolkit For Applying Weak Supervision To NLP Tasks

Skweak is a Python toolkit developed for applying weak supervision to various NLP tasks. In this article, you will learn how to use skweak for such NLP tasks as labelling and text classification. The article is illustrated with a practical implementation for reference.


Build with PyCaret, Deploy with FastAPI

PyCaret is an open-source ML library that helps ML engineers build and deploy end-to-end ML models quickly and efficiently. In this tutorial, we’ll explore how you can use it to build an end-to-end ML pipeline, develop an API using FastAPI to generate predictions on unseen data, and use Python to send a request to the API to generate predictions.


Data Scientist vs Machine Learning Engineer Skills. Here’s the Difference.

Data Science and Machine Learning have become buzzwords in the tech community. Let’s cut through the hype and actually figure out what Data Scientists and ML Engineers do, where their roles overlap, and where they differ. Please note that this is an opinionated piece, and the thoughts and ideas expressed in the article are the author’s own.


A Definitive Primer on Robotic Process Automation

Automation of repetitive manual activities has always been associated with AI and machine learning. In this detailed article, you will learn all the ins and outs of Robotic Process Automation (RPA) from a business value standpoint.


Write and Train Your Own Custom Machine Learning Models Using PyCaret

PyCaret is a simple and easy-to-use ML library and end-to-end model management tool built in Python for automating machine learning workflows. In this tutorial, you’ll learn how to use PyCaret to design and build custom machine learning models from scratch, from PyCaret installation to writing and training custom models.


Supercharge Your Machine Learning Experiments with PyCaret and Gradio

In this tutorial, you’ll learn how to integrate PyCaret and Gradio to improve your machine learning experimentation quickly and easily. Specifically, you’ll train and evaluate multiple models, and develop a lightweight UI to interact with the models in the Notebook — all done in less than 25 lines of code thanks to the simplicity of PyCaret.


Teaching AI How to Forget at Scale

The engineering team at Facebook continues to explore the relationship between AI/ML, data, and specific algorithms, and how all of that relates to the human mind. The team has announced a novel deep learning method: Expire-Span, a first-of-its-kind operation that equips neural networks with the ability to forget at scale.


Animating Pictures with Eulerian Motion Fields

In this paper, Aleksander Holynski et al. demonstrate a fully automatic method for converting a still image into a realistic animated looping video. The images are animated using a deep warping technique: pixels are encoded as deep features, features are warped via Eulerian motion, and the warped feature maps are decoded as images.
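
The Eulerian motion step can be illustrated with a minimal sketch. It assumes a static velocity field that assigns each position a fixed per-frame velocity, and it accumulates that velocity frame by frame (Euler integration) to find how far a point has moved. The `uniform_flow` field and the function names are hypothetical, and the feature encoding and decoding stages are left out entirely.

```python
from typing import Callable, Tuple

Vec = Tuple[float, float]


def integrate_displacement(motion: Callable[[float, float], Vec],
                           x: float, y: float, frames: int) -> Vec:
    """Accumulate a static velocity field over `frames` steps (Euler integration)."""
    dx, dy = 0.0, 0.0
    for _ in range(frames):
        # Sample the velocity where the point currently is, then advance it.
        vx, vy = motion(x + dx, y + dy)
        dx += vx
        dy += vy
    return dx, dy


def uniform_flow(x: float, y: float) -> Vec:
    """Hypothetical toy field: every position moves right and slightly down."""
    return 1.0, 0.5


# Displacement of the point (0, 0) after four frames.
print(integrate_displacement(uniform_flow, 0.0, 0.0, frames=4))
```

In a real pipeline, the resulting displacement would be used to warp feature maps rather than a single point.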


Discovering Diverse Athletic Jumping Strategies

In this paper, the researchers present a “smart” framework to discover motion strategies for such athletic skills as the high jump. It allows us to come up with, explore, and optimize a wide range of novel motion strategies for jumpers through a sample-efficient Bayesian diversity search (BDS) algorithm.


End-to-end Alternating Optimization for Blind Super Resolution

In this paper, the team of researchers experiments with a new approach to the blind super-resolution (SR) problem. Instead of breaking it down into two sequential steps, they adopt an alternating optimization algorithm, which can estimate the blur kernel and restore the SR image in a single model (two convolutional neural modules).


Diffusion Models Beat GANs on Image Synthesis

In this research paper, Prafulla Dhariwal and Alex Nichol demonstrate that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. For unconditional image synthesis, they achieve this by finding a better architecture through a series of ablations. For conditional image synthesis, they further improve sample quality with classifier guidance.


Ordering-Based Causal Discovery with Reinforcement Learning

In this paper, Xiaoqiang Wang et al. utilize reinforcement learning to discover causal relations among a set of variables. They propose a novel RL-based approach for causal discovery, by incorporating RL into the ordering-based paradigm, and formulate the ordering search problem as a multi-step Markov decision process.
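
The idea of treating ordering search as a multi-step decision process can be sketched as follows, under stated assumptions: the state is the list of variables placed so far, each action appends one remaining variable, and an episode ends once every variable is placed. The random policy is only a stand-in for a learned RL policy, and all names are hypothetical; this is not the authors' implementation.

```python
import random
from typing import Dict, List, Sequence


def rollout_ordering(variables: Sequence[str], rng: random.Random) -> List[str]:
    """Run one episode: pick one remaining variable per step until none are left."""
    state: List[str] = []          # variables placed so far
    remaining = list(variables)
    while remaining:
        # Placeholder policy: a trained agent would score the actions instead.
        action = rng.choice(remaining)
        remaining.remove(action)
        state.append(action)
    return state


def candidate_parents(ordering: Sequence[str]) -> Dict[str, List[str]]:
    """In an ordering, only earlier variables can be parents of a later one."""
    return {var: list(ordering[:i]) for i, var in enumerate(ordering)}


ordering = rollout_ordering(["A", "B", "C", "D"], random.Random(0))
print(ordering)
print(candidate_parents(ordering))
```

Each complete ordering restricts the set of admissible graphs, so a follow-up step that prunes the candidate parents (for example, variable selection) would be needed to arrive at a final causal graph.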

Event Materials

PyTorch Ecosystem Day 2021 — Presentations and Posters

If you missed the first-ever PyTorch Ecosystem Day, no worries. You’ll find all talks, demos, and tutorials, including 71 posters, 32 breakout sessions, and 6 keynote speakers, on this materials page. Enjoy!


Machine Learning for Art

Machine Learning for Art (ml4a) is a collection of tools and educational resources which apply techniques from machine learning to arts and creativity. If you are looking for creative (or just fun) models for art, this project is worth checking out.

Having a good time with DataScience Digest? Well, we hope so, because we do our best to keep you updated about what’s new and important in the Data Science world. 

We can do even better, though, if you support our project on Patreon.

Any donation helps!