Data Science Digest

DataScience Digest — 02.06.21


What’s new this week?
A new AI startup fund. AI combats fake news and disinformation. One more self-driving car will hit the road (probably, most likely). The power of synthetic data. And another batch of privacy problems for Clearview in the EU.

OpenAI is launching a $100 million startup fund, the OpenAI Startup Fund, to invest in early-stage AI companies. Microsoft has been announced as one of the key partners and investors. The fund’s priority is companies tackling major issues such as healthcare, climate change, and education, but also those that focus on productivity improvements through technology like GPT-3.

Researchers at MIT Lincoln Laboratory have built a program that can automatically detect and analyze social media accounts that spread disinformation across a network. The program is called RIO, short for Reconnaissance of Influence Operations. RIO combines multiple analytics techniques to create a comprehensive view of where and how disinformation narratives are spreading.

Trucking has been an industry ripe for AI disruption for at least a decade. Though considerable progress has been made, self-driving trucks are still more fantasy than reality. The autonomous trucking company Plus plans to change that by using AI and billions of miles of data to train self-driving semis. Let’s hope they finally move the needle.

Speaking of why self-driving semis are not yet on the road en masse: data is often a key problem. The issue can be data availability, quality, or security, you name it. The answer to all these problems is synthetic data: artificial data generated by computer programs rather than collected from real-world events. At least, that’s what David Yunger, CEO of Vaital, is sure of.

Clearview’s problems in the EU are not over. Just this week privacy groups from France, Austria, Greece, Italy, and the UK accused it of stockpiling biometric data on more than 3 billion people without their knowledge or permission, by scraping their images from websites. Let’s see what’s going to come out of that.



Lessons on ML Platforms — From Netflix, DoorDash, Spotify, and More

In this article, the author draws on the experience of AI industry leaders to answer a ubiquitous question: how can organizations enable data scientists to repeatedly deliver value outside the scope of existing ML production systems? He also looks into best practices, tools, and management approaches for resolving the value delivery problem.

Build a Scalable Machine Learning Pipeline for Ultra-High Resolution Medical Images using Amazon SageMaker
In this comprehensive article by the AWS team, you’ll learn how to preprocess ultra-high-resolution medical images, train an image classifier on the preprocessed images, and deploy a pretrained model as an API. Everything is done on the Amazon SageMaker platform, and the result is a highly scalable machine learning pipeline.

Easy MLOps with PyCaret + MLflow

PyCaret is an open-source, low-code library for machine learning. Built in Python, it’s simple to use and lets you work with ML models quickly and efficiently. MLflow is an open-source platform for managing the ML lifecycle. In this article, you’ll learn how to integrate MLOps into your ML experiments using PyCaret and MLflow.

R vs Python: The Data Science Language Debate

The battle of titans — R or Python, which do you choose? Both are extremely popular languages for Data Science; both are open source and excel at data analysis. In this article, you’ll look into the debate once again, with all pros and cons, specifics, and caveats. The review is prepared by the ImaginaryCloud team.

Six Business Trends Benefiting Data Scientists

Data Scientist is one of the most expensive roles in any organization. Companies do their best to hunt for talent, yet professionals skilled in data, algorithms, and models are as hard to find as ever. In this article, we’ll explore six business trends that keep overheating the market and stimulating the demand for DS jobs.


GAN Prior Embedded Network for Blind Face Restoration in the Wild

In this paper, Tao Yang et al. build on generative adversarial network-based methods to solve the problem of blind face restoration (BFR) from severely degraded face images in the wild. The proposed GAN prior embedded network (GPEN) generates photo-realistic results that are significantly superior to those of existing BFR methods, both quantitatively and qualitatively.

Image Cropping on Twitter: Fairness Metrics, their Limitations, and the Importance of Representation, Design, and Agency

In this paper, researchers look into fairness and bias issues in Twitter’s automated image cropping system. They find systematic disparities in cropping, identify contributing factors, and, to resolve the problem, propose removing saliency-based cropping in favor of a solution that better preserves user agency.

High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network
In this paper, Jie Liang et al. propose the Laplacian Pyramid Translation Network (LPTN), a new method for speeding up high-resolution photorealistic image-to-image translation (I2IT) tasks. It translates the low-frequency components at reduced resolution and refines the high-frequency ones, which makes it possible to translate 4K images in real time on a single ordinary GPU.
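
For readers unfamiliar with the idea, here is a minimal sketch of a Laplacian pyramid in plain Python. It is not the LPTN network from the paper: it only illustrates how an image can be split into a small low-frequency base plus high-frequency residuals and then put back together. The helper names, the 2x2 averaging, and the nearest-neighbour upsampling are all assumptions made for this illustration.

```python
# Illustrative Laplacian pyramid on a tiny grayscale image (nested lists).
# NOT the LPTN architecture; every helper name here is hypothetical.

def downsample(img):
    """Halve the resolution by averaging each 2x2 block (assumes even sides)."""
    h, w = len(img), len(img[0])
    return [
        [(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
         for x in range(0, w, 2)]
        for y in range(0, h, 2)
    ]

def upsample(img):
    """Double the resolution by repeating each pixel (nearest neighbour)."""
    out = []
    for row in img:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def subtract(a, b):
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def build_pyramid(img, levels):
    """Return (high-frequency residuals, low-frequency base)."""
    residuals = []
    current = img
    for _ in range(levels):
        smaller = downsample(current)
        # The residual keeps the detail that is lost by downsampling.
        residuals.append(subtract(current, upsample(smaller)))
        current = smaller
    return residuals, current

def reconstruct(residuals, base):
    """Invert build_pyramid: upsample the base and add the detail back."""
    current = base
    for residual in reversed(residuals):
        current = add(upsample(current), residual)
    return current

# An 8x8 synthetic test image with integer-valued pixels.
image = [[float((x * 7 + y * 13) % 32) for x in range(8)] for y in range(8)]
residuals, base = build_pyramid(image, levels=2)
restored = reconstruct(residuals, base)
```

With two levels, the 8x8 image is reduced to a 2x2 base that holds 16 times fewer pixels. Heavy processing on such a base is much cheaper than on the full image, which is the intuition behind working on low-frequency components at reduced resolution.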

LAPAR: Linearly-Assembled Pixel-Adaptive Regression Network for Single Image Super-Resolution and Beyond
In this paper, a team of researchers proposes a linearly-assembled pixel-adaptive regression network (LAPAR), designed to tackle the fundamental problem of upsampling a low-resolution (LR) image to its high-resolution (HR) version. LAPAR is lightweight and easy to optimize, and achieves superb results on single image super-resolution (SISR) benchmarks.

Latent Gaussian Model Boosting
Latent Gaussian models and boosting are widely used in statistics and machine learning thanks to their predictive accuracy. This article introduces a novel approach that combines boosting and latent Gaussian models. The author demonstrates that the method helps increase predictive accuracy in simulated and real-world data experiments.


Awesome List of Datasets in 100+ Categories

Data is the lifeblood of any AI/DS project. In this article, Etienne D. Noumen and his team have collected over 100 extensive datasets encompassing a variety of topics and industries, from cancer genomes to UFO reports. The article features a link to another collection of 100+ datasets at the end, so make sure you scroll!


3D Computer Vision - National University of Singapore - 2021
This is an introductory course on 3D Computer Vision, which was recorded for online learning at NUS due to COVID-19. During the course, you’ll learn the basics of Computer Vision, from 2D and 1D projective geometry to Auto-Calibration.


Albumentations 1.0.0 has been released! 
Albumentations is a computer vision tool and a Python library designed to improve the performance of deep convolutional neural networks by enabling fast, flexible, cost- and resource-efficient image augmentations. The tool can be used for different CV tasks, including object classification, segmentation, and detection.
The new version contains 10 new transforms, independence from imgaug, bug fixes, etc.
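
To make "image augmentation" concrete, here is a small sketch in plain Python. It deliberately does not use the Albumentations API: the function names and the flip-then-crop recipe are assumptions chosen for illustration. The point it shows is that for segmentation, the same random transform has to be applied to both the image and its mask so that they stay aligned.

```python
import random

# Hypothetical helpers for illustration only; this is not Albumentations code.

def horizontal_flip(grid):
    """Mirror a 2D grid left to right."""
    return [list(reversed(row)) for row in grid]

def crop(grid, top, left, size):
    """Cut a size x size window starting at (top, left)."""
    return [row[left:left + size] for row in grid[top:top + size]]

def augment(image, mask, crop_size, rng):
    """Apply the same random flip and random crop to an image and its mask."""
    if rng.random() < 0.5:
        image, mask = horizontal_flip(image), horizontal_flip(mask)
    top = rng.randrange(len(image) - crop_size + 1)
    left = rng.randrange(len(image[0]) - crop_size + 1)
    return crop(image, top, left, crop_size), crop(mask, top, left, crop_size)

# A 6x6 toy image and a mask that marks the even-valued pixels.
image = [[y * 6 + x for x in range(6)] for y in range(6)]
mask = [[1 if value % 2 == 0 else 0 for value in row] for row in image]

aug_image, aug_mask = augment(image, mask, crop_size=4, rng=random.Random(0))
```

Because the flip decision and the crop offsets are drawn once and reused for both inputs, every pixel in the augmented mask still describes the pixel at the same position in the augmented image.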