Data Science Digest

DataScience Digest — 13.05.21


What was new last week?

AI is ready to take on rare diseases, which cost us $1 trillion yearly. Though each rare disease affects fewer than 200,000 people (by the US definition), diagnosis and treatment are very expensive. Because ~80% of rare diseases are genetic (and, by now, we have lots of tools to handle genetic data), advances in AI and analytics technology can make it easier to diagnose them earlier, thus making preventive treatment possible.

But let’s not become too optimistic about what AI can and cannot do. It holds huge potential, but if history teaches us anything, it is that after every AI summer comes an AI winter. AI is harder than we think, and though we have had some successes with narrow AI, general AI is still more fiction than reality.

Some of these overly optimistic misconceptions about AI are described by Melanie Mitchell, Davis Professor of Complexity at the Santa Fe Institute, in her new paper. Those include:

  • Narrow intelligence is on a continuum with general intelligence
  • Easy things are easy and hard things are hard
  • Current AI systems work like the human mind
  • Intelligence is all in the brain
  • AI can learn common sense through data

Read more on the topic in “Why AI is Harder Than We Think”.

Despite all of that, the work on advancing general AI continues. For example, last week marked the start of the International Conference on Learning Representations (ICLR) 2021, an event dedicated to research in deep learning. The conference has already accepted 860 research papers from thousands of participants, up from 687 papers in 2020.

One of the participants is Jilei Hou, VP of engineering at Qualcomm, who heads up the AI Research division that advances such capabilities of AI as perception, reasoning, and action. Hou presented new papers at ICLR in the areas of power and energy efficiency, computer vision, natural language processing, and machine learning fundamentals.

And speaking of AI successes… For the first time, artificial intelligence managed to outscore the human solvers at the American Crossword Puzzle Tournament. It was a triumph for the developers of Dr. Fill, a crossword-solving AI system that has been competing against humans for nearly a decade.

While developing AI, however, we should not forget about ethics and the simple fact that, if used incorrectly, AI systems can be dangerous. This point was proved by Latitude, a startup from Utah that launched an online game called AI Dungeon to demonstrate a new form of human-machine collaboration. What began as an exciting experience ended up with AI generating scenes of violence, nudity, and sexual encounters involving children. It is hard to predict how AI is going to behave in the wild, and we should really put in the effort to exert more control over AI language systems.

It seems that AI should be more about generating practical value or, at least, providing us humans with ways to become better. For instance, Liverpool, a professional football club from England (in case you did not know this already, duh), has collaborated with DeepMind to apply AI to football tactics. DeepMind is hoping to combine computer vision, statistical learning, and game theory to help teams spot patterns in the data they’re collecting. Applying artificial intelligence to football could make players and coaches smarter.



Improving Model Performance Through Human Participation

In this article, Preetam Joshi (Netflix) and Mudit Jain (Google) explore the complex topic of AI-to-human cooperation. Specifically, they explain how to bring human feedback into the model inference loop (human-in-the-loop) and how that feedback can raise the final precision and recall.
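
To make the idea concrete, here is a minimal sketch of one common human-in-the-loop pattern, not the authors' implementation: predictions whose score falls close to the decision threshold are routed to a human reviewer, and everything else is decided automatically. The names (Prediction, decide, review_band), the toy scores, and the assumption that the reviewer is always right are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    score: float  # model's estimated probability of the positive class
    truth: bool   # ground-truth label (what a perfect reviewer would assign)


def precision_recall(decisions, truths):
    """Compute precision and recall for boolean decisions against boolean truths."""
    tp = sum(d and t for d, t in zip(decisions, truths))
    fp = sum(d and not t for d, t in zip(decisions, truths))
    fn = sum((not d) and t for d, t in zip(decisions, truths))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def decide(preds, threshold=0.5, review_band=0.0):
    """Auto-decide confident cases; send scores near the threshold to a human.

    Assumption: the human reviewer always returns the true label.
    Returns the list of final decisions and the number of reviewed items.
    """
    decisions, reviewed = [], 0
    for p in preds:
        if abs(p.score - threshold) < review_band:
            decisions.append(p.truth)  # human overrides the model
            reviewed += 1
        else:
            decisions.append(p.score >= threshold)
    return decisions, reviewed


# Toy data: two of the uncertain predictions are wrong at threshold 0.5.
preds = [
    Prediction(0.95, True), Prediction(0.85, True),
    Prediction(0.60, False), Prediction(0.55, True),
    Prediction(0.45, True), Prediction(0.40, False),
    Prediction(0.15, False), Prediction(0.05, False),
]
truths = [p.truth for p in preds]

model_only, _ = decide(preds)
with_human, n_reviewed = decide(preds, review_band=0.15)

print("model only:", precision_recall(model_only, truths))
print("with review:", precision_recall(with_human, truths), "reviewed:", n_reviewed)
```

The width of the review band is the knob to tune: a wider band sends more items to people and costs more, while a narrower band leaves more borderline mistakes in place.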


AutoNLP: Automatic Text Classification with SOTA Models

Developing NLP models can be challenging as you need to account for multiple factors, including model selection, data preprocessing, training, optimization, and infrastructure. AutoNLP, a tool to automate the end-to-end life cycle of an NLP model, can make this process much easier. Learn how to use AutoNLP in this step-by-step guide.


How to Plot XGBoost Trees in R

XGBoost is a popular ML algorithm that is frequently used in Kaggle competitions and has many practical use cases. If you have always wanted to learn more about XGBoost, this short tutorial is for you. You will learn how to prepare the dataset for modeling, train the XGBoost model, plot the XGBoost trees, export tree plots, and plot multiple trees at once.


Feature Engineering of DateTime Variables for Data Science, Machine Learning

DateTime fields require feature engineering to turn them from raw data into insightful information that ML models can use. In this article, you will learn how to extract date and time components, create Boolean flags, and calculate date and time differences using a combination of built-in pandas and NumPy functions.
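
The three kinds of features mentioned above (components, Boolean flags, and differences) can be sketched as follows. This is an illustrative stand-in that uses only Python's standard datetime module rather than the pandas and NumPy functions the article covers; in pandas, the same components are exposed through the Series.dt accessor. The function name, feature names, and sample timestamps are hypothetical.

```python
from datetime import datetime


def datetime_features(ts: datetime, reference: datetime) -> dict:
    """Turn one timestamp into model-ready features.

    `reference` is an earlier event (for example, a signup date) used to
    compute elapsed-time features.
    """
    elapsed = ts - reference
    return {
        # Date and time components
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "day_of_week": ts.weekday(),  # Monday=0 ... Sunday=6
        # Boolean flags
        "is_weekend": ts.weekday() >= 5,
        "is_month_start": ts.day == 1,
        # Date and time differences
        "days_since_reference": elapsed.days,
        "hours_since_reference": elapsed.total_seconds() / 3600,
    }


signup = datetime(2021, 5, 1, 9, 30)
order = datetime(2021, 5, 13, 18, 45)
features = datetime_features(order, reference=signup)

for name, value in features.items():
    print(f"{name}: {value}")
```

Flags and differences usually carry more signal than the raw timestamp, because a model cannot easily infer "weekend" or "twelve days after signup" from a single opaque number.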


Multiple Time Series Forecasting with PyCaret

PyCaret is a popular machine learning library and a model management tool for automating machine learning workflows. It allows us to build and deploy end-to-end ML prototypes quickly and efficiently. In this step-by-step tutorial, you will learn how to use PyCaret to forecast multiple time series in less than 50 lines of code.


What’s Lost in JPEG?

In this article, the author looks into the JPEG format. She explains how you can replicate the lossy part of the JPEG compression process in Python using common libraries (e.g., NumPy and SciPy), from chroma subsampling to the discrete cosine transform and quantization.
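
As a rough illustration of where the loss happens, the sketch below runs the transform-and-quantize step on a single 8x8 block in plain Python. It is not the author's code, and it makes two simplifying assumptions: chroma subsampling is left out, and a single uniform quantization step (the hypothetical value 16) replaces the per-frequency quantization tables that real JPEG encoders use.

```python
import math

N = 8  # JPEG transforms the image in 8x8 blocks


def _alpha(k):
    """Normalization factor that makes the DCT orthonormal."""
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


def _basis(i, k):
    """Cosine basis value for sample index i and frequency k."""
    return math.cos((2 * i + 1) * k * math.pi / (2 * N))


def dct2(block):
    """Orthonormal 2D DCT-II: pixel block[x][y] -> coefficients[u][v]."""
    return [[_alpha(u) * _alpha(v) * sum(
                block[x][y] * _basis(x, u) * _basis(y, v)
                for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]


def idct2(coeffs):
    """Inverse of dct2: coefficients[u][v] -> pixel block[x][y]."""
    return [[sum(_alpha(u) * _alpha(v) * coeffs[u][v]
                 * _basis(x, u) * _basis(y, v)
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]


def quantize(coeffs, step):
    """Lossy step: divide by the step size and round to integers."""
    return [[round(c / step) for c in row] for row in coeffs]


def dequantize(levels, step):
    """Scale the integer levels back; the rounding error is not recoverable."""
    return [[v * step for v in row] for row in levels]


def max_abs_diff(a, b):
    return max(abs(a[x][y] - b[x][y]) for x in range(N) for y in range(N))


# A smooth gradient block, level-shifted by 128 as JPEG does for 8-bit samples.
block = [[100 + 4 * x + 2 * y - 128 for y in range(N)] for x in range(N)]

STEP = 16  # assumption: one uniform step instead of a JPEG quantization table
levels = quantize(dct2(block), STEP)
restored = idct2(dequantize(levels, STEP))

kept = sum(v != 0 for row in levels for v in row)
print("non-zero coefficients kept:", kept, "of", N * N)
print("max pixel error after round trip:", max_abs_diff(block, restored))
```

The transform itself is reversible; the information is discarded only when the coefficients are rounded. Smooth blocks end up with very few non-zero coefficients, which is what makes the later entropy-coding stage effective.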


Using Machine Learning to Predict Customers’ Next Purchase Day

Predicting the customer's next action is one of the most popular use cases for machine learning. In this article, the author explains how to design and build an accurate next-purchase prediction model from scratch.


What Is Face Recognition?

In this 101 tutorial, Adrian Rosebrock of the PyImageSearch team explains everything you need to know about face recognition: what it is, how it works, and how it differs from face detection. He also covers advanced face recognition algorithms you can start using today.


Motion Representations for Articulated Animation

In this research, Aliaksandr Siarohin et al. present novel motion representations for animating articulated objects consisting of distinct parts. Learn about the new method they propose, how it differs from keypoint-based works, and how it can be used to animate a variety of objects, surpassing previous methods on existing benchmarks.


Advancing the State of the Art in Computer Vision with Self-Supervised Transformers and 10x more Efficient Training

In this exploratory article, Facebook AI presents its new method, called DINO, to train Vision Transformers (ViT) with no supervision. The resulting model can discover and segment objects in an image or a video without labeled data and without being given a segmentation-targeted objective.


EigenGAN: Layer-Wise Eigen-Learning for GANs

In this paper, Zhenliang He et al. present EigenGAN, a new GAN model that can mine interpretable and controllable dimensions from different generator layers without supervision. The team theoretically proves that their algorithm derives the principal components as efficiently as PCA does. Review the code here.


Machine Learning in Python with scikit-learn

This course will help you master machine learning with scikit-learn, even if you are a beginner and do not have a strong technical background. You will progress from the basic concepts of ML and predictive modeling pipelines all the way to model performance evaluation and feature selection. The course is a work in progress, so stay tuned!

Podcasts & Interviews

Reinforcement Learning for Industrial AI with Pieter Abbeel

Join Pieter Abbeel, a professor at UC Berkeley, for a talk about industrial AI applications, robots, and reinforcement learning. Also, learn about his vision of end-to-end deep learning and about his recent paper on pretrained transformers.


R Charts

R Charts is a collection of code examples of R graphs made with base R graphics, ggplot2, and other packages. So far, the project features eight categories, and it is open to new contributions, as well as to bug fixes and code suggestions on GitHub.


Chris Albon Notes on Data Science & Machine Learning

Chris Albon, Director of Machine Learning at Wikimedia Foundation, has spent years collecting useful resources on statistical learning, artificial intelligence, and software engineering. Here you can find flashcards and notes on almost any topic on AI/ML.