Data Science Digest

Data Science Digest — 05.05.21


Articles

Deep Learning for Audio with the Speech Commands Dataset

If you want to learn how to train a simple model on the Speech Commands audio dataset, this article by Peter Gao is for you. He explains how to choose a dataset and handle data, how to train, test, and tune the model, and, most importantly, how to analyze errors and failure cases to improve model performance over time.

 

Boosting Natural Language Processing with Wikipedia

In this hands-on tutorial, Nicola Melluso explains how you can take advantage of Wikipedia to improve your Natural Language Processing models. He uses two NLP tasks, Named Entity Recognition and Topic Modeling, as examples and walks step by step through collecting and processing the data and building and training the models.

 

Face Detection Tips, Suggestions, and Best Practices

In this tutorial, Adrian Rosebrock and the PyImageSearch team continue to explore the topic of face detection. You will learn their tips, suggestions, and best practices for achieving high face detection accuracy with OpenCV and dlib. Though the tutorial is mostly theoretical, it includes code and plenty of useful links.

 

Building A Simple ETL With SAYN

In this article, Robin Watteaux continues to explore SAYN, an open-source data processing framework designed for simplicity and flexibility, and explains how it works and how it can be used in ETL/ELT processes. The article is illustrated with an example SAYN project that mimics an ETL process. Robin's first article on SAYN provides more context.


Papers

Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet

In this paper, Zihang Jiang, Qibin Hou et al. explore vision transformers applied to ImageNet classification. They develop new training techniques and show that, by slightly tuning the structure of vision transformers and introducing token labeling, the models can achieve better results than their CNN counterparts and other transformer-based classification models. The code is available on GitHub.

 

DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks

In this paper, Md Vasimuddin et al. present DistGNN, which optimizes the Deep Graph Library (DGL) for full-batch training on CPU clusters. The results on GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets, both measured against baseline DGL implementations running on a single CPU socket. The team hopes the research will make full-batch GNN training easier to run.

 

skweak: Weak Supervision Made Easy for NLP

In this paper, Pierre Lison et al. present skweak, a versatile, Python-based software toolkit to help NLP developers apply weak supervision to a wide range of NLP tasks. The toolkit makes it easy to implement a large spectrum of labelling functions (such as heuristics, gazetteers, neural models or linguistic constraints) on text data, apply them on a corpus, and aggregate their results in a fully unsupervised fashion.
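To make the idea of labelling functions and aggregation concrete, here is a minimal sketch in plain Python. It does not use skweak or its API: the function names, the tiny gazetteer, and the label set are all hypothetical, and a simple majority vote stands in for whatever aggregation model the toolkit actually uses.

```python
from collections import Counter

# Hypothetical gazetteer of organisation names (illustration only).
ORG_GAZETTEER = {"wikipedia", "github"}

# A labelling function returns ABSTAIN when it has no opinion on a token.
ABSTAIN = None


def lf_gazetteer(token):
    """Vote ORG if the token appears in the gazetteer, otherwise abstain."""
    return "ORG" if token.lower() in ORG_GAZETTEER else ABSTAIN


def lf_capitalised(token):
    """Heuristic: a capitalised alphabetic token is guessed to be an entity."""
    return "ENT" if token[:1].isupper() and token.isalpha() else ABSTAIN


def lf_digits(token):
    """Heuristic: an all-digit token is labelled as a number."""
    return "NUM" if token.isdigit() else ABSTAIN


# Assumed set of labelling functions; a real project would have many more.
LABELLING_FUNCTIONS = [lf_gazetteer, lf_capitalised, lf_digits]


def apply_lfs(tokens):
    """Apply every labelling function to every token (one vote list per token)."""
    return [[lf(tok) for lf in LABELLING_FUNCTIONS] for tok in tokens]


def majority_vote(votes, default="O"):
    """Aggregate votes without gold labels: most common non-abstain label wins."""
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return default  # every function abstained
    return counts.most_common(1)[0][0]


def weak_labels(tokens):
    """Produce one aggregated weak label per token."""
    return [majority_vote(votes) for votes in apply_lfs(tokens)]


if __name__ == "__main__":
    print(weak_labels(["in", "2021", "Paris"]))
```

Majority voting ignores how reliable each labelling function is, which is the main thing a learned aggregation model would add.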

 

Fully Convolutional Line Parsing

In this paper, Xili Dai et al. present a one-stage Fully Convolutional Line Parsing network (F-Clip) that detects line segments from images. The proposed network is simple and flexible with variations that trade off between speed and accuracy for different applications. F-Clip is reported to significantly outperform all state-of-the-art line detectors on accuracy at a similar or even higher frame rate.


Projects

NLP Profiler

NLP Profiler is a simple but useful NLP library created by @neomatrix369. It enables Data Science practitioners to easily profile datasets with one or more text columns. Given a dataset and the name of a column containing text data, the library returns either high-level insights or low-level, granular statistical information about the text in that column. Check out the library and let us know what you think.
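As a rough illustration of what granular, per-row text statistics for a single column can look like, here is a small standard-library sketch. It is not NLP Profiler's API: `profile_text`, `profile_column`, and the chosen statistics are assumptions made for this example.

```python
import string


def profile_text(text):
    """Compute a few low-level statistics for one piece of text."""
    words = text.split()
    return {
        "char_count": len(text),
        "word_count": len(words),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "uppercase_count": sum(ch.isupper() for ch in text),
    }


def profile_column(rows, column):
    """Profile the given text column of a dataset held as a list of dicts.

    Returns new rows: the original fields plus the computed statistics.
    """
    return [{**row, **profile_text(row[column])} for row in rows]


if __name__ == "__main__":
    # Hypothetical two-row dataset with a single text column.
    dataset = [{"text": "Hello, world!"}, {"text": "NLP is fun"}]
    for row in profile_column(dataset, "text"):
        print(row)
```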