Hackathon Projects (“Hacks”):
The projects that have been created for DSSE hackathons (or for anyone who wishes to use them for learning/teaching purposes!) are based on real-world problems and make use of real data. They cover a diverse range of topics relating to development, scientific research, or commercial/industry applications. They also vary in their level of difficulty and the computing power necessary to run the code, i.e. some may be suitable for high school learners while others require more advanced knowledge of programming and machine learning, and some may run with adequate compute power from the average laptop while others may require access to more compute power, e.g. the use of cloud computing. These details can be found in each of the project descriptions on the respective GitHub page. All project tutorials are written in Python.
SKA Data Challenge
The SKA planned for 4 data challenges to prepare the scientific community for the challenges that the SKA will present. We took the 1st data challenge and turned it into a tutorial format suitable for early career researchers. From the tutorials, you will learn the following:
- Source finding (RA, Dec)
- Source property characteristics
- Source classification (Star-forming galaxies, Active galactic nuclei)
Image Classification of Galaxies
This project makes use of galaxy images from the GalaxyMNIST dataset. The dataset contains 10,000 images which have been classified through the Galaxy Zoo citizen science project. This dataset will be used in order to build an image classifier model! We will also take a look at how we could perform this classification using unsupervised learning methods in cases where we don’t have access to labelled data.
This project makes use of infrared spectra data for three apple cultivars in order to classify between bruised or sound apples. The following is covered in the tutorials:
- Tutorial 1 – Data cleaning & visualisation
- Tutorial 2 – Baseline calculation
- Tutorial 3 – Feature engineering and selection
- Tutorial 4 – Classification of apples using machine learning
Sentiment Analysis of Tweets
This project involves Natural Language Processing (NLP) and sentiment analysis of Twitter data. NLP involves giving computers the ability to understand text and spoken words similar to how humans can. The tutorials will demonstrate how to collect and clean Twitter data, perform sentiment analysis using existing toolkits and finally, perform sentiment analysis using machine learning. The dataset we will use consists of 2000 tweets collected from a dataset called Sentiment140, which contains 1.6 million general tweets and their corresponding sentiment labels.
Rooibos Tea Classification
In this project you will make use of chemical signatures to classify rooibos tea. The tutorials will cover the following:
- Data visualisation
- Data correlation
- Classification of tea using statistics
- Classification using machine learning
Flood Detection using Remote Sensing
This project makes use of Earth Observation data from the Sentinel satellite. The multi-spectral image data is used in order to determine land coverage, be it vegetation, water etc. This allows one to determine regions of flooding after a natural disaster.
The tutorials cover the following:
- Introduction to optical satellite imagery
- Data preparation and clustering methods
- Image segmentation
Search for Extraterrestrial Intelligence
The goal of the Voyager Tutorial is to take you through the “SETI Pipeline”, that is the method used by the Breakthrough Listen team to search for alien techno-signatures! You will take real data gathered by Breakthrough Listen at the Green Bank Telescope in West Virginia, run it through a few algorithms, and see if you can find an alien or two!
Web Scraping & Image Classification
In this challenge you will learn how to web-scrape images from Google and use them to train/test a machine learning model. The aim is to come up with an image classification problem (cats vs dogs, people vs trees, Trump vs an orange Cheeto etc), web-scrape the images and then use ML for the classification.
In this challenge you are tasked with building a classifier to separate out real astronomical signals from man-made radio frequency interference (RFI). The astronomical signals that you’re looking for come from pulsars, the ultra-dense relics of exploded stars. The dataset is available in two formats: (i) as a set of eight numerical features per sample suitable for classification using random forests, SVMs, neural nets etc. and (ii) as a set of images that show the data that the numerical features are drawn from, which are suitable for CNN based classification. Your hack challenge is to build the best classifier that you possibly can, remembering that we want as many correctly classified pulsars as possible, with as little contamination from RFI as possible.