TAMU Datathon - Workshop Descriptions

Workshop Descriptions

High Dimensional Data Clustering - STAT Department

2:30 PM

Clustering is a major goal in data science in which we seek to partition a set of objects into a much smaller number of groups, or clusters. Clustering has a wide array of applications in solving real-world problems, such as detecting anomalies (like tumors or asteroids), discovering cancer types using genetic information, grouping similar documents, and recommending products to customers. We will briefly discuss some popular clustering methods and give examples of their use in practice as well as their limitations. Additionally, we will discuss some recent advances in clustering methodology that we are currently developing. We will also discuss some of the exciting educational and research opportunities currently available in the Statistics department.
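As a rough illustration of what partitioning objects into a small number of groups can look like, here is a minimal sketch of k-means, one widely used clustering method. The choice of k-means, the toy points, and the function names are assumptions made for this example; they are not taken from the workshop materials.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain k-means. Assumption: the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids


# Hypothetical toy data: two visually separate groups in the plane.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The result depends on the starting centroids and on the choice of k, which is one of the practical limitations of this kind of method.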

The Future of PropTech - CBRE

3:00 PM

Big Data: After the Honeymoon Phase - USAA

3:30 PM

Over the past decade, focus on Big Data has exploded to the point that it is now ubiquitous across nearly every business sector. Yet, even though the appetite to harness the power of Big Data has never been stronger, expectations about the speed and ease of its implementation are often based on an oversimplified and idealized view of the technology. This can lead to scope creep, friction, and commitment fatigue whenever organizations are unprepared for the unique challenges Big Data presents. In this talk we’ll discuss how Big Data often appears in the real world: dirty, noisy, and uncooperative. We’ll share some of the techniques and technologies that help wrangle Big Data and deploy Big Data-driven solutions. We’ll talk about some of our own examples with Big Data and Machine Learning in the financial fraud and cybercrime domains.

Overfitting and the Double Descent Curve in Neural Networks and High Dimensional Regression - MATH Department

4:00 PM

Overfitting is an important problem in data science and machine learning. It occurs when a model perfectly fits training data but fails to generalize to new examples. Intuitively, this can happen when the model is overparameterized (e.g., has more free parameters than data points) and can simply "memorize" all the training data. Neural networks, however, have the curious property that they are highly overparameterized and yet not only fit training data perfectly but also generalize quite well to unseen and even corrupted data. That this is possible has sparked a revolution in our understanding of overfitting in complex models, giving rise to the so-called Double Descent Curve. This workshop will describe what this is and explain what we know about it so far, both for neural networks and in high dimensional regression problems.
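The classical picture of overfitting described above can be sketched in a few lines. This is only an illustration under assumed toy data (a noisy straight line with hand-picked noise values), not material from the workshop, and it shows the "memorization" regime only, not the double descent behavior itself. A polynomial with as many coefficients as data points fits the training set exactly, while a two-parameter line does not, yet the line predicts held-out points better.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (two free parameters)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every training point
    (as many free parameters as data points)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Assumed toy data: true relationship y = 2x + 1 plus fixed alternating noise.
noise = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
train_x = [0, 1, 2, 3, 4, 5]
train_y = [2 * x + 1 + e for x, e in zip(train_x, noise)]

# Held-out points between the training inputs, without noise.
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [2 * x + 1 for x in test_x]

line = fit_line(train_x, train_y)
interp = fit_interpolant(train_x, train_y)

for name, model in [("line", line), ("interpolant", interp)]:
    print(name,
          "train MSE:", mse(model, train_x, train_y),
          "test MSE:", mse(model, test_x, test_y))
```

The interpolating polynomial reproduces the noise in the training data, so it oscillates between the training inputs, which is where its held-out error comes from.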

Introduction to Convolutional Neural Networks - ConocoPhillips

5:30 PM

In this tech talk, you will learn about convolutional neural networks (CNNs), a type of neural network used in AI for machine learning problems such as image classification, object detection, and object segmentation. You will also have an opportunity to participate in an interactive activity where you or a small group act as a CNN to classify images into categories. No prior experience with CNNs is required. Come learn and gain access to a Python GitHub repo to play with CNNs, transfer learning, and synthetic data generation from original training images.

Saving Lives, Time & Resources with World-Class Transportation Data Research - TTI

6:00 PM

This workshop will provide an overview of data science and analytics used in transportation research at the Texas A&M Transportation Institute (TTI) to make our streets and highways safer and more efficient for moving people and goods. Plus, we will show some cool car crash testing videos and talk about how big data is used in many aspects of TTI research.

Introduction to Natural Language Processing - Walmart

6:30 PM

In this workshop, we introduce three widely used open-source Python libraries for analyzing the syntax and semantics of text: spaCy, Gensim, and TextBlob. These libraries provide practical implementations of state-of-the-art techniques for understanding the grammar and meaning of a text. spaCy performs text preprocessing tasks that give insight into a text’s grammatical structure. It extracts word-level features such as parts of speech and word shape (i.e., capitalization, punctuation, or digits), performs lemmatization, draws the dependency tree (i.e., the relations between words), extracts named entities, and more. We then introduce Gensim to compute text similarity based on word embedding models. Finally, we perform a sentiment analysis using TextBlob.

Scaling Capacity to Support the Democratization of Machine Learning - Facebook

10:00 PM

Machine learning is an increasingly crucial tool at Facebook, used in an ever-expanding list of products and services ranging from newsfeed ranking to blocking harmful content. Correspondingly, Facebook, like other major companies, has made substantial investments in understanding and streamlining the development cycle for ML models, putting the power of machine learning in the hands of more software engineers and data scientists. With this explosive growth in both the scale of ML models and the number of potential use cases comes added strain on the company’s technical infrastructure, invariably raising unique and interesting challenges at the juncture of scheduling and resource management. In this talk, we explore some of the ways in which data science has been used to address issues around the efficiency and reliability of running ML workloads at scale, and some lessons we’ve learned in the process.

Engineering Panel Discussion - Goldman Sachs

10:30 PM

Join Goldman Sachs Engineers for a panel discussion in which they will discuss what they do, how they got there, the technologies they use, and more!
