Where the Earth Meets the Sky

Europe/Copenhagen
Zoom Webinar (online)

Iary DAVIDZON (NBI/Cosmic Dawn Center), Adriano AGNELLO (Niels Bohr Institute)
Description

<earth and sky banner (picture by Gabriel Brammer)>

This inter-disciplinary workshop aims to bring together researchers from Astrophysics and Medical/Public Health Sciences to discuss the statistical and machine learning methods that are transforming both fields, and to create an occasion for "cross-pollination".

At first glance, the two research fields may seem disconnected from each other, as one focuses on problems affecting our planet while the other explores the depths of the sky. However, the frameworks used to address those problems are actually very similar: for example, in the next few years answers will come from large data sets (digital astronomical surveys, digital medical records) analyzed through innovative "data-driven" tools.

This workshop will be held at the DTU branch of the Cosmic Dawn Center in Copenhagen on 27 and 28 May 2021. It will offer an ideal space for scientific interaction (talks, discussion panels) and practical coding sessions. The programme will take different levels of skill and knowledge into account: whether you have several years of experience or have only recently started to play with these techniques, you are most welcome to participate!

More details can be found in the Scientific Rationale and the scheduled Programme. We expect to have 50 seats available at the venue in addition to on-line participation. These figures are not definitive, since regulations change periodically according to the evolution of the Covid-19 pandemic. In any case, online participation is guaranteed. Please check the FAQ page regularly; it also contains the latest updates.


Invited speakers:

 

SOC and LOC:

  • Adriano Agnello (DARK, University of Copenhagen)
  • Nikki Arendse (DARK, University of Copenhagen)
  • Nina Bonaventura (DAWN, University of Copenhagen)
  • Iary Davidzon (DAWN, University of Copenhagen)
  • Theis Lange (IFSV, University of Copenhagen)
  • Thomas Mellan (Imperial College London)
  • John R. Weaver (DAWN, University of Copenhagen)

 

We acknowledge support from the Novo Nordisk Foundation through the grant NNF19OC0059326 and from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 896225. The image above is a detail of a photo by Gabriel Brammer (all rights reserved).

Participants
  • Aashish GUPTA
  • Aayush GAUTAM
  • Adam MCCARRON
  • Adelie GORCE
  • Adriano AGNELLO
  • Alain SCHMITT
  • Alba Vega ALONSO TETILLA
  • Aleksandra CIPRIJANOVIC
  • Alexander RATHCKE
  • Alexander SZALAY
  • Alexandros ANTONIADIS KARNAVAS
  • Alireza VAFAEI SADR
  • Amar Deo CHANDRA
  • Amir Hossein FEIZ
  • Amir JABBARPOUR
  • Andrea JOENSEN
  • Andreas RIECKMANN
  • Andrew ZWANIGA
  • Angela PINOT DE MOIRA
  • Aritra GHOSH
  • Athanasios ANASTASIOU
  • Benjamin GREEN
  • Bo MILVANG-JENSEN
  • Brice MÉNARD
  • Carter RHEA
  • Chinedu NNAJI
  • Christian Kragh JESPERSEN
  • Colin JACOBS
  • Constantina NICOLAOU
  • Daniela SAADEH
  • David BLÁNQUEZ
  • Davide PIRAS
  • Difu SHI
  • Dimitrios IRODOTOU
  • Dmitry MEDVEDEV
  • Doogesh KODI RAMANAH
  • Elahe KHALOUEI
  • Emiliano MERLIN
  • Emille ISHIDA
  • Emmanuel SEKYI
  • Filip MORAWSKI
  • Filippo PAGANI
  • Fionagh THOMSON
  • Francesc TORRADEFLOT
  • Francisco ARDEVOL MARTINEZ
  • George NAPOLITANO
  • Georgios MAGDIS
  • Ghassem GOZALIASL
  • Gustav NERVIL
  • Henri BOFFIN
  • Heresh AMINI
  • Hubert BRETONNIÈRE
  • Ian HOTHI
  • Ian MCCHEYNE
  • Iary DAVIDZON
  • Isabella CORTZEN
  • Iwona HAWRYLUK
  • James NIGHTINGALE
  • Jayatee KANWAR
  • Jean-Luc STARCK
  • Jeffrey ROSKES
  • Jeroen AUDENAERT
  • Jessica May HISLOP
  • Jielai ZHANG
  • John RUAN
  • John WEAVER
  • Josephine BILSTEEN
  • Joshua DOYLE
  • Julius OHRNBERGER
  • Ka Ho YUEN
  • Kamile LUKOSIUTE
  • Karan MOLAVERDIKHANI
  • Kartheik IYER
  • Katarzyna Magdalena DUTKOWSKA
  • Laurence FITZPATRICK
  • Lavanya NEMANI
  • Laya GHODSI
  • Leonid CHINDELEVITCH
  • Line CLEMMENSEN
  • Mara KONT
  • Marc HUERTAS-COMPANY
  • Marco CASTELLANO
  • Margaret EMINIZER
  • Margit RIIS
  • Maria del Carmen CAMPOS VARILLAS
  • Mariela MARTINEZ
  • Marko SHUNTOV
  • Martin ERIKSEN
  • Max HIPPERSON
  • Maya LEVIN SCHTULBERG
  • Miloš KOVAČEVIĆ
  • Mohammad Javad SHAHHOSEINI
  • Mohammadhossein NAMDAR
  • Mohsen SHAMOHAMMADI
  • Monica ALDERIGHI
  • Motahare TORKI
  • Natasja EHLERS
  • Nicole LØNFELDT
  • Niels KVORNING TERNOV
  • Nikki ARENDSE
  • Nina BONAVENTURA
  • Noah KASMANOFF
  • Pablo LEMOS
  • Paola SANTINI
  • Patrick KOCH
  • Philippe CIUCIU
  • Ping LIN
  • Regina SARMIENTO
  • Renata RETKUTE
  • Richard MASSEY
  • Ripon SAHA
  • Rishabh SINGH
  • Sam WRIGHT
  • Samantha BROWN SEVILLA
  • Sami DIB
  • Samir BHATT
  • Samuel FARRENS
  • Samuel GAGNON-HARTMAN
  • Sandeep Singh SENGAR
  • Sara ALMEIDA
  • Sarah CADDY
  • Sarah CASURA
  • Sepideh GHAZIASGAR
  • Sergio D'ANGELO
  • Shashank MARKANDE
  • Siddharth CHAINI
  • Sihao CHENG
  • Sim VIRDI
  • Sneha DAS
  • Sofie BRUUN
  • Subhrata DEY
  • Søren BRUNAK
  • Tao WANG
  • Terese Sara JØRGENSEN
  • Teymoor SAIFOLLAHI
  • Theis LANGE
  • Themiya NANAYAKKARA
  • Thomas MELLAN
  • Till KAEUFER
  • V. Ashley VILLAR
  • Vijay VASWANI
  • Viraj NISTANE
  • Yu-Ting WU
  • Yu-Yen CHANG
  • Zorana ANDERSEN
    • 09:00
      Webinar is now open for connections (Zoom Webinar)

      If you haven't obtained your personal link yet, register at https://indico.nbi.ku.dk/event/1330/registrations/ and follow the instructions you will receive via email.
    • 1
      Welcome and introduction (Zoom Webinar, online)

      A few words from the organizers to introduce the first day of the workshop.

    • Morning 1: first half

      Morning session of the first day.

      • 2
        When the brain meets the sky: Compressed Sensing for Magnetic Resonance Imaging & Radio-Astronomy

        The discrete nature of radio interferometry measurements can be interpreted through the "Compressed Sensing" (CS) acquisition theory, which supports the idea of using a specific property, called sparsity, to reconstruct images from measured data called visibilities. In Magnetic Resonance Imaging (MRI), similar observations have been made, leading to a new range of MRI acquisition and reconstruction techniques. We will first briefly introduce the ideas of CS and sparsity, and then show how the resolution of a radio-astronomical image can be improved by a factor of four compared to the state of the art. We will then present new MRI acquisition schemes developed for the NeuroSpin MRI instrument, which will reach 11.7 tesla this year. Using such CS acquisitions, combined with our astrophysical reconstruction methods, allows a significant acceleration of the MRI acquisition time. We present results on real MRI measurements.

        Speakers: Jean-Luc STARCK (CEA Saclay, CosmoStat laboratory), Philippe CIUCIU (CEA Saclay, NeuroSpin)
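
        To make the sparsity idea concrete, here is a minimal Python sketch of compressed-sensing recovery via ISTA (iterative soft-thresholding). It is a toy stand-in for the speakers' actual pipelines, which use wavelet dictionaries and non-uniform Fourier sampling; all data here are synthetic.

        import numpy as np

        rng = np.random.default_rng(0)
        n, m, k = 200, 80, 8                        # signal size, measurements, non-zeros
        x_true = np.zeros(n)
        x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
        A = rng.normal(size=(m, n)) / np.sqrt(m)    # random sensing matrix
        y = A @ x_true                              # toy "visibilities" (noiseless)

        soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
        L = np.linalg.norm(A, 2) ** 2               # Lipschitz constant of the gradient
        x, lam = np.zeros(n), 0.01
        for _ in range(500):                        # ISTA: gradient step + soft shrinkage
            x = soft(x + A.T @ (y - A @ x) / L, lam / L)

        print("relative recovery error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))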
      • 3
        Time Series Classification: From Astronomical Transients to Human Heart Sounds

        Background and objectives: Cardiovascular disease (CVD) is still the leading cause of mortality, and early diagnosis of heart disease is crucial in preventing deaths. Moreover, according to the World Health Organization, approximately 80% of deaths due to cardiovascular disease occur in low- and middle-income countries, where there is a lack of medical specialists in remote areas to diagnose these diseases. Artificial-intelligence-based decision support systems and automated diagnostic tools can help with heart disease diagnosis.
        In this study we investigate an automated classification method using LightGBM and deep learning to classify four types of heart sounds: artifact, extra heart sound, murmur, and normal. The classification algorithm is inherited from the algorithm we previously used to classify LSST astronomical time series (the PLAsTiCC dataset).
        Methods: The classification model is developed using a total of 542 phonocardiogram (PCG) samples collected by electronic stethoscopes or mobile devices. We use both deep and feature-based approaches for classification. After pre-processing and denoising of the PCG signals, features are extracted from the raw data and fed to LightGBM. We also use a deep learning approach, feeding raw signals to a convolutional neural network (CNN). To improve accuracy in noisy environments and obtain a more robust model, the proposed method applies data augmentation to the training data.
        Results: The accuracy of the LightGBM model on the test set is 0.82 for the multi-class classification of heart sounds; it also achieved good results in our previous work on astronomical time series classification. The precision, recall, and F1 score of the multi-class classification are 0.84, 0.84, and 0.82, respectively. If we consider all abnormal cases as one class and normal cases as another, LightGBM reaches an accuracy of 0.87 for the binary classification of normal versus abnormal. The CNN approach obtains 79% accuracy in PCG multi-class classification.

        Conclusions: Machine-learning techniques provide high-precision results in cardiac diagnosis and screening. The proposed model can be transferred to any computing device. This non-invasive method is efficient, suitable for real-time applications, and can provide a diagnostic tool in primary health care centers.

        Speaker: Motahare TORKI (Shahid Beheshti University)
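
        A hedged sketch of the feature-based branch described above: compute a few summary features from synthetic PCG-like signals and classify them with LightGBM. The real study's 542 recordings, feature set and preprocessing are not reproduced; the four synthetic classes merely stand in for artifact/extra sound/murmur/normal.

        import numpy as np
        from lightgbm import LGBMClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import accuracy_score

        rng = np.random.default_rng(1)

        def features(sig):
            # Toy features: energy, zero-crossing rate, spectral centroid.
            spec = np.abs(np.fft.rfft(sig))
            freqs = np.fft.rfftfreq(sig.size)
            return [np.mean(sig ** 2),
                    np.mean(np.diff(np.sign(sig)) != 0),
                    np.sum(freqs * spec) / np.sum(spec)]

        X, y = [], []
        for label in range(4):                      # four synthetic "heart sound" classes
            for _ in range(100):
                t = np.linspace(0, 1, 512)
                sig = np.sin(2 * np.pi * (5 + 3 * label) * t) + 0.3 * rng.normal(size=t.size)
                X.append(features(sig)); y.append(label)

        Xtr, Xte, ytr, yte = train_test_split(np.array(X), np.array(y), random_state=0)
        clf = LGBMClassifier(n_estimators=200).fit(Xtr, ytr)
        print("toy accuracy:", accuracy_score(yte, clf.predict(Xte)))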
      • 4
        Deep learning framework for clinical diagnosis - a healthcare system

        One of the main targets of computer vision is to interpret the content of images and videos. A common approach is to build a model from a known set of features extracted from image data; the model is then employed to perform inference on unknown data. Medical image segmentation is the branch of computer vision whose goal is to label each pixel of an object of interest. It is often a key task for clinical applications, ranging from computer-aided diagnosis and lesion detection to therapy planning and guidance, and it helps clinicians focus on a particular area of the disease and extract detailed information for a more accurate diagnosis. Convolutional neural networks, an end-to-end deep learning approach, have shown state-of-the-art performance for automated medical image segmentation, but they do not perform well in complex environments. U-Net is another popular deep learning architecture, especially for biomedical imaging: it consists of a contraction path and an expansion path that produce pixel-wise predictions, and it outperforms previously available medical image segmentation approaches. However, it in turn fails to produce promising results on volumetric (voxel) data; for that, an incremental version of U-Net, the Multiplanar U-Net, was developed in 2019. In this talk, we will discuss a simple and thoroughly evaluated deep learning framework for the segmentation of arbitrary medical image volumes. We will also discuss a specific framework that requires no human interaction or task-specific information and is based on a fixed model topology and hyperparameter set.

        Speaker: Sandeep Singh SENGAR (University of Copenhagen, Denmark)
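
        For readers unfamiliar with the architecture, the following is a minimal U-Net-style model in Keras, sketching the contraction/expansion paths with skip connections described above; it is not the Multiplanar U-Net itself, and the input shape and filter counts are arbitrary.

        from tensorflow.keras import layers, Model

        def conv_block(x, n):
            x = layers.Conv2D(n, 3, padding="same", activation="relu")(x)
            return layers.Conv2D(n, 3, padding="same", activation="relu")(x)

        inp = layers.Input((128, 128, 1))
        c1 = conv_block(inp, 16); p1 = layers.MaxPooling2D()(c1)   # contraction
        c2 = conv_block(p1, 32);  p2 = layers.MaxPooling2D()(c2)
        bn = conv_block(p2, 64)                                    # bottleneck
        u2 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(bn)
        c3 = conv_block(layers.concatenate([u2, c2]), 32)          # expansion + skip
        u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c3)
        c4 = conv_block(layers.concatenate([u1, c1]), 16)
        out = layers.Conv2D(1, 1, activation="sigmoid")(c4)        # per-pixel label

        model = Model(inp, out)
        model.compile(optimizer="adam", loss="binary_crossentropy")
        model.summary()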
    • 10:50
      Break

      unfortunately we cannot serve virtual coffee :P

    • Morning 1: second half

      Morning session of the first day.

      • 5
        Supervised classification of variable stars from the NASA TESS survey with features originating from the biomedical domain

        The currently ongoing NASA TESS space mission is expected to observe tens of millions of stars. The resulting stellar surface brightness measurement time series (“light curves”) allow astronomers to search for specific types of stars or planets, as well as to then infer their fundamental physical parameters. Given that we can observe different types of light curves for different types of stars, the first step is to classify the light curves according to their underlying variability type. As we are working with vast amounts of data, it is infeasible to manually classify all observations and we therefore require automated techniques.

        Hence, we developed a classification method based on a Random Forest classifier that can successfully classify stars according to their variability type. In order to find the ideal feature sets to characterize the different types of stellar variability, we turned to the biomedical literature on EEG signal processing as these signals share some common characteristics with stellar variability signals. We specifically turned to the field of entropy analysis, from which we then adopted the multiscale entropy from Costa et al. (2005) to characterize the complexity and uncertainty present in stellar variability signals. We used this to complement our more traditional Fourier and statistical feature sets, and discovered that the entropy metrics proved to be important features in our classifier due to their ability to differentiate light curves based on their unpredictability and complexity levels.

        We then incorporated our classifier into the larger TESS Data for Asteroseismology (T’DA) classification pipeline to obtain the best results. In the pipeline, we first train multiple distinct classifiers with different feature sets on the same data and then pass their results (the class probabilities) on to a meta-classifier that combines the predictions from this ensemble of models and returns a final classification. The benefit of this approach is that the meta-classifier accounts for the strengths and weaknesses of each of the classifiers and in this way returns an optimal classification. We validated our method on data from the previous NASA Kepler mission, for which labelled datasets were already available.

        Speaker: Jeroen AUDENAERT (KU Leuven, Institute of Astronomy)
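
        A hedged sketch of the multiscale entropy of Costa et al. (2005) referenced above: coarse-grain the series at increasing scales and compute the sample entropy of each coarse-grained version. This is a simplified O(N^2) illustration on a synthetic light curve, not the speaker's implementation.

        import numpy as np

        def sample_entropy(x, m=2, r=0.2):
            x = np.asarray(x)
            tol = r * np.std(x)
            def count(mm):
                templates = np.array([x[i:i + mm] for i in range(len(x) - mm)])
                d = np.max(np.abs(templates[:, None] - templates[None, :]), axis=2)
                # count template pairs (i != j) closer than the tolerance
                return np.sum(d < tol) - len(templates)
            B, A = count(m), count(m + 1)
            return -np.log(A / B) if A > 0 and B > 0 else np.inf

        def multiscale_entropy(x, max_scale=5):
            out = []
            for tau in range(1, max_scale + 1):       # coarse-grain at scale tau
                n = len(x) // tau
                coarse = np.asarray(x[:n * tau]).reshape(n, tau).mean(axis=1)
                out.append(sample_entropy(coarse))
            return out

        rng = np.random.default_rng(0)
        t = np.linspace(0, 10, 1000)
        print(multiscale_entropy(np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=t.size)))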
      • 6
        Towards an interpretable and transferable acoustic emotion recognition for the detection of mental disorders

        Motivation
        Automatic speech emotion recognition (ASER) refers to a group of algorithms that deduce the emotional state of an individual from their speech utterances. The methods are deployed in a wide range of tasks, including the detection of, and intervention in, mental disorders. State-of-the-art ASER techniques have evolved from the more conventional ML-based methods to the current advanced deep-neural-network-based solutions. Despite the long history of research contributions in this domain, state-of-the-art methods still struggle to generalize across languages, between corpora with different recording conditions, etc. Furthermore, most methods lack interpretability and transparency in the models and their decision-making process. These aspects are especially crucial when the methods are deployed in applications with an impact on human lives.

        Contribution
        Autoencoders and latent representation studies are useful tools in the exploration of interpretable and generalizable models. We present results on the benefits of using autoencoders and their variants for ASER, predominantly on emotional states like anger, sadness, happiness and the neutral state. We show that the clusters in the latent space are representative of the desired emotional clusters, although some classes of emotions are more discriminative than others. We take a step further to illustrate the use of SHAP and DeepLIFT to gain insights into the feature subsets that contribute to the discriminative clustering of emotion classes in the latent space. Furthermore, we study the robustness of the methods by investigating the differences that occur in the latent representations when the underlying data conditions are modified; in other words, how differences in the language of the corpus and its recording conditions (acted, 'in the wild') manifest in the latent space. In addition, we explore the discrete and continuous scales for their appropriateness in modelling speech emotions and their correspondence to each other. Lastly, we use the feature subset that provides the most stable representations of emotional states over different corpora, languages and recording conditions to transfer the knowledge to languages with few or no labelled emotion corpora.

        Speaker: Sneha DAS (Technical University of Denmark)
      • 7
        Hands-On Image Processing with PySAP

        PySAP (Python Sparse data Analysis Package) is an open-source, multidisciplinary image processing tool developed in a collaboration between astrophysicists and biomedical imaging experts at CEA Paris-Saclay through the COSMIC project (see Farrens et al. (2020)). The objective of this collaboration is to share the image processing experience gained in different domains with the wider community.

        In this hands-on session I will demonstrate some of the "out of the box" image processing tools available in the PySAP-MRI and PySAP-Astro plugins. In particular, I will focus on some standard problems like denoising or deconvolving galaxy images, and reconstructing subsampled MRI scans of a brain. I will also show how to use some of the modular features of PySAP to design image processing tools for new problems or even different imaging domains.

        Throughout the session I will endeavour to provide some basic technical background and present the problems such that it should be possible for everyone in the audience to follow. I will also include links to tutorials for more in-depth explanations of some of these topics for those who are interested.

        Finally, I will reserve a few minutes at the end to explain how you can contribute to the development of PySAP.

        Speaker: Samuel FARRENS (CosmoStat, CEA Paris-Saclay)
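
        The following is not PySAP's API but a generic sketch, using PyWavelets, of the transform-threshold-reconstruct pattern that sparsity-based denoising relies on; the toy image, wavelet and threshold choices are arbitrary.

        import numpy as np
        import pywt

        rng = np.random.default_rng(0)
        img = np.zeros((64, 64))
        img[24:40, 24:40] = 1.0                            # toy "galaxy"
        noisy = img + 0.2 * rng.normal(size=img.shape)

        coeffs = pywt.wavedec2(noisy, "db2", level=3)      # wavelet transform
        sigma = 0.2
        thresh = [coeffs[0]] + [
            tuple(pywt.threshold(c, 3 * sigma, mode="soft") for c in detail)
            for detail in coeffs[1:]
        ]
        denoised = pywt.waverec2(thresh, "db2")            # inverse transform

        print("noisy MSE:   ", np.mean((noisy - img) ** 2))
        print("denoised MSE:", np.mean((denoised - img) ** 2))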
    • 12:30
      Lunch break
    • Poster session 1: four posters

      Posters are presented virtually by playing a pre-recorded presentation. After each video it is possible to ask the authors questions.

      • 8
        The AstroPath Image Acquisition and Segmentation Workflow

        Multidimensional, spatially resolved analyses of cells from pathology slides are of great diagnostic and prognostic interest. New multispectral, multiplex immunofluorescence microscopy platforms have the potential to facilitate such analyses, and here, we further improve and standardize the image acquisition and cell classification workflow. Studies to date on this emerging technology have typically assessed ~10 operator-dependent high power fields (HPFs) per slide, which represents a fraction of the tissue available for study. Standard cell segmentation and classification algorithms often oversegment larger cells when they are segmented at the same time as smaller cells. Here we describe our AstroPath imaging platform, which addresses each of these considerations. In our study, slides from formalin-fixed paraffin-embedded tissue specimens were stained with an optimized 6-plex multiplex immunofluorescence (mIF) assay. The slides were then scanned at 35 unique wavelengths using a multispectral microscope (Vectra 3.0 or Vectra Polaris) with 20% overlap of HPFs in an operator-independent fashion. An average of 1300 HPFs per slide was required to image the entire tissue, and each microscope scanned between 2 and 3 slides per day with this approach. After the images were captured and organized, the overlaps were used to measure, quantify and correct systematics in the imagery (see Eminizer abstract). The central parts of the images were used to create a set of seamless “primary” tiles, similar to the strategy of the Sloan Digital Sky Survey, for a statistically fair pixel coverage of the whole tissue area (see Roskes abstract). Images were then linearly unmixed from the 35 wavelengths to 8 component layers (DAPI, tissue auto-fluorescence, and the 6 added fluorescent dyes) using inForm Cell Analysis©. We then employed a bespoke method for ‘multi-pass’ classification of cells wherein each marker was segmented and classified separately from the other markers, then merged into a single plane using a unique set of rules and a predefined cell hierarchy. We showed that our segmentation and classification method reduced the error from over-counting larger cells, e.g. tumor cells, by 25% and increased the specificity and sensitivity of each classification algorithm. Due to the amount of data, each algorithm was run automatically through one of 20 virtual machines housed on a set of servers in the Physics and Astronomy Department. Following the methodology developed during the SDSS project, image data were stored in a well-defined file system structure that facilitated further automatic processing and ingestion into a SQL Server database. The raw data for each slide were 200-300 GB, which is on par with a full-scale (30x) human genome. In summary, we have developed a unique facility and workflow that generates whole-slide multispectral imagery with high fidelity and single-cell resolution. Our facility houses five multispectral microscopes (2 Vectra 3.0 and 3 Vectra Polaris), allowing us to collect a petabyte of raw data per year, on the scale of the largest sky surveys.

        Speaker: Benjamin GREEN (Johns Hopkins University)
      • 9
        Automated morphological decomposition of galaxies in large imaging surveys

        Many studies of the properties and evolution of galaxies need reliable structural parameters of their components. There is no "universal solution" to this problem yet due to the diversity of the galaxy population along with the evolving quality of images (in terms of depth, resolution, wavelength coverage and data volume). We present our efforts to decompose ~13000 galaxies from the Galaxy and Mass Assembly (GAMA) survey, in 9 optical and near-infrared images each, into their bulges and disks. The model fitting uses a fully automated MCMC analysis with the 2-dimensional Bayesian profile fitting code ProFit (Robotham et al. 2017). The preparatory work includes image segmentation/source identification, background subtraction and point-spread-function estimation, and is also carried out in an automated fashion using the sister package ProFound (Robotham et al. 2018; an astronomical image analysis package rooted in medical imaging software). After fitting the galaxies, we perform outlier rejection, model selection and quality control in various semi-automated steps, including a detailed study of systematic uncertainties. We find that our fit results are robust across a variety of galaxy types and image qualities with minimal biases, while MCMC uncertainties typically underestimate the true errors by factors of 2-3.

        Speaker: Sarah CASURA (Universität Hamburg, Germany)
      • 10
        Building Seamless Whole Slide Multiplex Images in the AstroPath Project

        The analysis of microscope images is rapidly advancing. Where previously most analyses relied on manual inspection of certain regions of images, advanced quantitative techniques are now being utilized to analyze large sets of entire images at high magnifications. Current multiplex microscopy for cancer imaging in the AstroPath project relies on collecting more than a thousand high resolution images over the tissue on a given microscope slide (see Green abstract). To create a seamless whole slide image over the whole tissue area, a precise registration of individual high-power fields within a single whole slide coordinate system is needed. The mechanical positioning of the slide in the focal plane of the microscope has a small but noticeable jitter, which requires additional compensation before we can assemble the whole slide images. In the AstroPath project, we have designed a strategy that involves collecting overlapping images, similar to the Sloan Digital Sky Survey, enabling the correct segmentation of cells near the edges of the images. This poster describes the AstroPath framework's method for stitching the images together, in which high-power fields are registered using the pixels in the 20% overlap areas. Adjacent high-power fields are pairwise aligned using the cross correlation of their overlapping regions, and a final whole slide registration is performed using a spring-based model to minimize overall pixel shift. The results of the technique are illustrated using sets of archival pathology specimens from patients with melanoma imaged using a Vectra 3.0 microscope. Whole-slide registration with the AstroPath framework reduces the 5th to 95th percentile misalignment error from 3 horizontal pixels and 5 vertical pixels to less than one pixel in both directions, and the cells on the edges of high-power fields, which represent 3-6% of the total number of cells in the image, become more meaningful.

        Speaker: Jeffrey ROSKES (Johns Hopkins University)
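
        As a hedged illustration of aligning overlapping fields by cross-correlation, the sketch below uses scikit-image's phase correlation on a synthetically jittered tile; the AstroPath pipeline itself uses its own registration code and the spring-based model described above.

        import numpy as np
        from scipy.ndimage import shift as nd_shift
        from skimage.registration import phase_cross_correlation

        rng = np.random.default_rng(0)
        ref = rng.normal(size=(256, 256))          # stand-in for an overlap region
        mov = nd_shift(ref, (3.0, -5.0))           # simulated stage jitter

        shift, error, _ = phase_cross_correlation(ref, mov, upsample_factor=10)
        print("estimated shift (rows, cols):", shift)   # ~ (-3, 5): undoes the jitter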
      • 11
        ODUSSEAS: a machine learning tool to derive effective temperature and metallicity for M dwarf stars

        The derivation of spectroscopic parameters for M dwarf stars is very important in the fields of stellar and exoplanet characterization.
        We present our easy-to-use computational tool ODUSSEAS, which is based on the measurement of the pseudo equivalent widths for more than 4000 stellar absorption lines and on the use of the machine learning Python package "scikit-learn" for predicting the stellar parameters. It offers a quick automatic derivation of effective temperature and [Fe/H] for M dwarf stars using their 1D spectra and resolutions as input. The main advantage of this tool is that it can operate simultaneously in an automatic fashion for spectra of different resolutions and different wavelength ranges in the optical.
        ODUSSEAS is able to derive parameters accurately and with high precision, having precision errors of 30 K for Teff and 0.04 dex for [Fe/H]. The results are consistent for spectra with resolutions between 48000 and 115000 and S/N above 20.

        Speaker: Alexandros ANTONIADIS KARNAVAS (Instituto de Astrofísica e Ciências do Espaço, Universidade do Porto)
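
        A hedged sketch of the underlying idea: learn a mapping from measured pseudo-equivalent widths to a stellar parameter with scikit-learn. The data below are synthetic (the real tool uses pseudo-EWs of more than 4000 lines and reference-star labels), and the choice of a ridge regressor is illustrative.

        import numpy as np
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        n_stars, n_lines = 200, 50                       # toy stand-in for >4000 lines
        pseudo_ew = rng.normal(size=(n_stars, n_lines))
        # Fake Teff labels correlated with the first "line" plus noise:
        teff = 3200 + 400 * pseudo_ew[:, 0] + 20 * rng.normal(size=n_stars)

        model = Ridge(alpha=1.0)
        scores = cross_val_score(model, pseudo_ew, teff, cv=5, scoring="r2")
        print("cross-validated R^2:", scores.mean())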
    • Afternoon 1: first half

      Afternoon session of the first day

      • 12
        The role of domain knowledge experts in the era of big data

        The significant increase in volume and complexity of data resulting from technological development is a common challenge faced by many scientific disciplines. More sensitive detectors and large scale experiments are currently overwhelming researchers who turn to automatic machine learning algorithms in order to filter, order, or pre-select potentially interesting subsets of the original data for further scrutiny. In this scenario, algorithms need to be carefully designed to select scientifically interesting/useful examples, thus optimizing the distribution of human efforts in scanning a large data set.
        In this talk I will show a few examples of such adaptive learning environments, where expert feedback is sequentially incorporated into the learning algorithm. Using examples from astronomy, I will describe how we can achieve optimal classification results with minimal labeling effort and discuss the role played by domain knowledge experts in the era of data-driven scientific discoveries.

        Speaker: Emille ISHIDA (CNRS/Laboratoire de Physique de Clermont)
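
        A minimal active-learning loop in the spirit of the talk: repeatedly query the object the current model is least certain about, obtain its label from the "expert", and retrain. The uncertainty-sampling strategy and random forest below are illustrative choices on toy data.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier

        X, y = make_classification(n_samples=500, random_state=0)
        rng = np.random.default_rng(0)
        labeled = list(rng.choice(len(X), 10, replace=False))   # tiny initial sample
        pool = [i for i in range(len(X)) if i not in labeled]

        for _ in range(20):                                     # 20 expert queries
            clf = RandomForestClassifier(random_state=0).fit(X[labeled], y[labeled])
            proba = clf.predict_proba(X[pool])
            margin = np.abs(proba[:, 0] - proba[:, 1])          # small = uncertain
            query = pool.pop(int(np.argmin(margin)))            # most informative object
            labeled.append(query)                               # "expert" labels it

        clf = RandomForestClassifier(random_state=0).fit(X[labeled], y[labeled])
        print("accuracy after 20 queries:", clf.score(X, y))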
      • 13
        Training an interpretable ML algorithm with only a dab of real data: An extragalactic perspective

        In the last decade, convolutional neural networks (CNNs) have revolutionized the field of image processing and have become increasingly popular among astronomers for morphological analysis of galaxies. This push has been driven by the fact that they are the perfect alternative to the traditional techniques of obtaining morphological classifications --- expert visual classification, citizen science projects, and fitting light profiles, none of which is easily scalable to large data volumes.

        However, most previous applications of CNNs to morphological analysis have required a large training set of real galaxies with pre-determined classifications. If CNNs are to become the method of choice for analyzing unclassified data from future surveys, we need an algorithm that does not require a large pre-classified training set of real galaxies from the same survey. The challenge of training a machine learning algorithm to classify brand new data that has not previously been inspected is not unique to astronomy; it applies to many other data-intensive scientific fields, such as the biomedical sciences.

        In this talk, I will outline how we have successfully trained a Bayesian CNN called Galaxy Morphology Network (GaMorNet) with a very small amount of real data and used it to extract morphological parameters of galaxies at a variety of redshifts from different surveys. We first trained GaMorNet on a large simulation suite of galaxies and then used a small amount of real data to perform transfer-learning/domain adaptation. We have already demonstrated that a preliminary classification-version of GaMorNet (Ghosh et. al. 2020) can be successfully applied to data from different surveys with misclassification rates of $\leq 5\%$. We have also used GaMorNet to study the morphology and quenching of $\sim100,000$ ($z\sim0$) SDSS and $\sim20,000$ ($z\sim1$) CANDELS galaxies using morphology-separated color-mass diagrams. Using the GaMorNet classifications, we find that bulge- and disk-dominated galaxies have distinct color-mass diagrams with separate evolutionary pathways. For both datasets, disk-dominated galaxies peak in the blue cloud, across a broad range of masses, consistent with the slow exhaustion of star-forming gas. In contrast, bulge-dominated galaxies are mostly red, with much smaller numbers down toward the blue cloud, suggesting rapid quenching and fast evolution across the green valley. GaMorNet is one of the very few publicly available CNNs in astronomy, complete with trained models.

        I will also outline in this talk why GaMorNet is not a black-box and how the representations learned by the network are highly amenable to visual interpretation. We have used a combination of different CNN visualization techniques to investigate and shed light on GaMorNet’s decision-making process, making our results interpretable, reproducible, and robust.

        Speaker: Aritra GHOSH
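
        A hedged sketch of the simulation-first strategy in Keras: pre-train a small CNN on (here random) "simulated" images, then freeze the feature layers and fine-tune only the head on a handful of "real" ones. GaMorNet's actual architecture, data and domain-adaptation procedure are not reproduced.

        import numpy as np
        from tensorflow.keras import layers, models

        def make_cnn():
            return models.Sequential([
                layers.Input((32, 32, 1)),
                layers.Conv2D(16, 3, activation="relu"), layers.MaxPooling2D(),
                layers.Conv2D(32, 3, activation="relu"), layers.GlobalAveragePooling2D(),
                layers.Dense(2, activation="softmax"),
            ])

        rng = np.random.default_rng(0)
        x_sim, y_sim = rng.normal(size=(1000, 32, 32, 1)), rng.integers(0, 2, 1000)
        x_real, y_real = rng.normal(size=(50, 32, 32, 1)), rng.integers(0, 2, 50)

        model = make_cnn()
        model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
        model.fit(x_sim, y_sim, epochs=2, verbose=0)        # pre-train on "simulations"

        for layer in model.layers[:-1]:                     # freeze the feature layers
            layer.trainable = False
        model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
        model.fit(x_real, y_real, epochs=5, verbose=0)      # fine-tune on small "real" set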
      • 14
        PyAutoFit: A Classy Probabilistic Programming Language For Cosmology and Cancer

        A major trend in astronomy and healthcare is the rapid adoption of Bayesian statistics for data analysis and modeling. With modern data-sets growing by orders of magnitude in size, the focus is now on developing methods capable of applying contemporary inference techniques to extremely large datasets. To this aim, I present PyAutoFit (https://github.com/rhayes777/PyAutoFit), an open-source probabilistic programming language for automated Bayesian inference.

        PyAutoFit is an offshoot of the astrophysics project PyAutoLens (https://github.com/Jammy2211/PyAutoLens), which uses PyAutoFit’s advanced modeling tools to automate the analysis of images of strong lens galaxies. In collaboration with UK tech-healthcare company ConcR, PyAutoFit is now being adapted to a healthcare setting, for example modeling patient responses to cancer treatments and as part of a clinical trial run by healthcare company Roche.

        In this hands-on demonstration, I will:

        • Give an overview of how to compose a model in PyAutoFit.
        • Demonstrate a simple model-fitting example using an Astronomy based science-case.
        • Illustrate PyAutoFit’s graphical modeling tools using a healthcare-based toy model, which is the core feature we are developing with ConcR for analysing large patient datasets.

        PyAutoFit offers a generalized framework for performing Bayesian inference which can be adapted to many different scientific domains. The aim of this presentation is therefore to instigate new multi-disciplinary collaborations which will enable the statistical techniques being developed in Astronomy to be applied by the wider science community.

        Speaker: James NIGHTINGALE (Durham University)
    • 15:20
      Break
    • Afternoon 1: second half

      Afternoon session of the first day

      • 15
        Modelling spatial phenomena: a brief tour of the current state of the art

        Interpolation is critical for filling in gaps in data, regardless of the field, and the closeness of objects in space can easily be leveraged to greatly improve it. In this talk I will briefly introduce the Gaussian process and show why it is a popular tool for spatial analysis. I will then give a brief tour of the computational and theoretical aspects of the current state of the art.

        Speaker: Samir BHATT (University of Copenhagen, Imperial College London)
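
        A minimal Gaussian-process interpolation example with scikit-learn: noisy 1D observations with a gap are interpolated, and the predictive uncertainty grows inside the gap. The kernel and data are illustrative.

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF, WhiteKernel

        rng = np.random.default_rng(0)
        # Observations on [0, 4] and [7, 10], leaving a gap in between.
        x = np.sort(np.concatenate([rng.uniform(0, 4, 15), rng.uniform(7, 10, 15)]))[:, None]
        y = np.sin(x).ravel() + 0.1 * rng.normal(size=x.shape[0])

        kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
        gp = GaussianProcessRegressor(kernel=kernel).fit(x, y)

        x_new = np.linspace(0, 10, 6)[:, None]
        mean, std = gp.predict(x_new, return_std=True)
        print(np.c_[x_new.ravel(), mean, std])   # columns: x, mean, sigma (sigma grows in the gap)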
      • 16
        AstroPath: Astronomy Meets Pathology to Characterize the Tumor MicroEnvironment

        Multispectral, multiplex immunofluorescence (mIF) enables the study of the complex tumor microenvironment (TME) through quantification of key immunomodulatory marker co-expression patterns on specific cells and of spatial relationships between different cell types. We have developed a detailed, automated, multi-step approach to mIF staining and image analysis that can support the standardization of results across large, multi-institutional datasets. Our framework for assay development and the associated data quality standards for mIF are likely to become universal performance standards for multiple commercial and open-source hardware and software offerings. Such universal standards facilitate scaling, which is particularly important as Tumor-Immune Atlases with billions of cells are generated. These Atlases will be mined by thousands of laboratory and clinical investigators, amplifying discovery, similar to sky surveys in astronomy.
        Using expertise developed over 20 years in astronomy we have built AstroPath to scale up to whole slide analysis by combining a robust hardware infrastructure with a complex processing workflow. We perform automated 35 to 45-band imaging with six Vectra-3 and Vectra Polaris microscopes from Akoya BioSciences. These are then processed, flat-fielded, unwarped, calibrated, segmented, stitched, and loaded into a database. The database and the integrated visual interface, modelled after the SDSS SkyServer, can display complex spatial and multicolor data interactively.
        Using the AstroPath pipelines, we have processed 3 cohorts of 235 slides, which have all been ingested into the AstroPath database. They contain a total of 184,320 High Powered Fields and 226,619,428 detected cells, of which our unique statistical sample contains 97,041,546 cells, with full geometric boundaries for their nuclei and membranes. In order to support the TME studies, the databases contain the precomputed neighbors within 25 microns of each cell, for a total of 3.5 billion cell pairs with their respective phenotypes. Doing this required complex image processing over 8.7 trillion pixels. For comparison, the Sloan Digital Sky Survey (SDSS) has collected about 6.4 trillion raw pixels from the sky over 16 years of imaging.
        Our group brings together astronomers from the SDSS, physicists who have analyzed LHC data, and machine vision and deep learning experts with pathologists working on cancer. We have relied heavily on lessons from sky surveys, and over the last two years we have worked very hard to adapt these to multiplex imaging in a systematic fashion. The resulting AstroPath platform demonstrates the feasibility of an automated system feeding into an open public database. It will also lead to open-source tumor-immune Atlases with billions of spatially mapped single cells, enabling analyses at unprecedented scales. Our talk will describe the main features and components of the AstroPath system, and how lessons learned in astronomy have enabled us to short-circuit the whole design and implementation process.

        Speaker: Alexander SZALAY (Johns Hopkins University)
      • 17
        Correcting for Systematics in Multiplex Cancer Imaging in the AstroPath Project

        Historically, analysis of microscope images has relied mainly upon extensive manual review by experienced medical professionals, but recent developments in microscope technology, biological workflows, and data handling capabilities have opened the door to new ways of thinking about microscopy data. The AstroPath group is working to translate a host of "big data" methods familiar in astronomy image analysis to the field of microscopy, with a focus on accelerating the pace of discovery and development by increasing data throughput and accessibility. A main goal is to increase automation in analysis pipelines, which requires a high degree of quantitative precision. To this end, the AstroPath platform includes software methods to characterize and correct systematic effects inherent to microscopy data, and to automatically detect image regions of interest in large datasets. These methods were developed using multiplexed immunofluorescence microscopy data from a cohort of archival pathology specimens from patients with melanoma imaged using an Akoya Biosciences Vectra 3.0 digital microscope. Corrections are made for spatially dependent illumination variation introduced by nonuniformities in the microscope optical paths by averaging over tissue regions in large datasets. Subtle warping effects introduced by microscope objectives and camera lenses are corrected using models from OpenCV optimized by comparing overlapping regions of slides. Otsu's algorithm is adapted to find background thresholds, and tissue is disambiguated from empty background by applying those thresholds and analyzing multiple image layers at once. Blur detection with the Laplacian operator is used to identify regions of images that have been obscured by dust or that depict tissue that is folded over on itself. Images whose exposures are automatically curtailed at the time of collection are used to determine saturation thresholds, and those thresholds are used to identify regions of images that are overbright due to excess stain, the presence of red blood cells, or other exogenous and endogenous artifacts. Overlapping regions of images are compared to identify images that exhibit ambiguous unmixing of their spectral components. Segmenting empty background and artifacts from images and applying the derived corrections is shown to reduce the overall variation in 5th to 95th percentile illumination from 11.2% relative to the mean to 1.2% on average over all image layers.

        Speaker: Margaret EMINIZER (The Johns Hopkins University)
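
        Two of the steps named above are easy to sketch: Otsu background thresholding (scikit-image) and blur detection via the variance of the Laplacian (OpenCV). The toy image below stands in for a tissue field; thresholds and kernel sizes are illustrative.

        import numpy as np
        import cv2
        from skimage.filters import threshold_otsu

        rng = np.random.default_rng(0)
        img = np.zeros((128, 128), dtype=np.float32)
        img[32:96, 32:96] = 0.8                                # "tissue" on empty background
        img += 0.05 * rng.normal(size=img.shape).astype(np.float32)

        t = threshold_otsu(img)                                # background threshold
        tissue_mask = img > t
        print("tissue fraction:", tissue_mask.mean())

        sharp = cv2.Laplacian(img, cv2.CV_32F).var()           # blur metric: higher = sharper
        blurred = cv2.Laplacian(cv2.GaussianBlur(img, (9, 9), 3), cv2.CV_32F).var()
        print("Laplacian variance, sharp vs blurred:", sharp, blurred)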
    • 17:05
      Wrap-up and see you next morning (Zoom Webinar, online)

      Conclusion for the end of the day

    • 09:00
      Webinar is now open for connections (Zoom Webinar)

      If you haven't obtained your personal link yet, register at https://indico.nbi.ku.dk/event/1330/registrations/ and follow the instructions you will receive via email.
    • 18
      Welcome back (Zoom Webinar, online)

      A few words from the organizers to introduce the second day of the workshop.

    • Morning 2: first half

      Morning session of the second day.

      • 19
        Longitudinal Disease Trajectories at Population Scale

        Multi-step disease and prescription trajectories are key to the understanding of human disease progression patterns and their underlying molecular level etiologies. The number of human protein coding genes is small, and many genes are presumably impacting more than one disease, a fact that complicates the process of identifying actionable variation for use in precision medicine efforts. We present approaches to the identification of frequent disease and prescription trajectories from population-wide healthcare data comprising millions of patients and corresponding strategies for linking disease co-occurrences to genomic individuality. We carry out temporal analysis of clinical data in a life-course oriented fashion. We use data covering 7-10 million patients from Denmark collected over a 20-40 year period and use them to “condense” millions of individual trajectories into a smaller set of recurrent ones. Such sets represent patient subgroups sharing longitudinal phenotypes that could form a basis for differential treatment designs of relevance to individual patients.

        Speaker: Søren BRUNAK
      • 20
        Fast complex dynamics emulators using deep generative models

        Cosmological simulations of galaxy formation are inexorably limited by the availability of finite computational resources. Drawing from recent advances in deep generative modelling techniques, we present two physical emulators, motivated by our understanding of physics and knowledge of fundamental symmetries, to emulate the complex dynamics and currently unresolved physics involved in galaxy formation. First, we design a (Wasserstein) generative adversarial network for mapping approximate dark matter simulations to 3D halo fields, thereby obviating the need for full N-body simulations and halo finding algorithms (arXiv:1903.10524). We subsequently extend our existing framework to emulate high-resolution cosmological simulations from low-resolution ones (arXiv:2001.05519). These halo painting and super-resolution emulators pave the way to detailed and high-resolution analyses of next-generation galaxy surveys via Bayesian forward modelling techniques.

        Speaker: Doogesh KODI RAMANAH (DARK, Niels Bohr Institute, University of Copenhagen)
      • 21
        CycleGANs as a bridge to understanding and closing the reality gap for CMB simulations

        Deep learning models have brought considerable improvements to machine learning problems. On the other hand, using more complex models leads to less interpretability when one needs to analyze and extract the most important features.
        Layer visualization techniques and CycleGAN are proposed here for finding important features/regions; in medical images, for example, such results can serve as potential biometrics.
        In this study, we used CycleGAN to translate images between CMB simulations and Planck observations. We also show how one can use CycleGAN to find differences between simple simulations and the observations, and to model the simulation pipeline.

        Speaker: Alireza VAFAEI SADR (University of Geneva)
    • 10:50
      Break
    • Morning 2: second half

      Morning session of the second day.

      • 22
        How is deep learning useful to understand the physics of galaxies?

        As available data grow in size and complexity, deep learning has rapidly emerged as an appealing solution to a variety of astrophysical problems. In my talk, I will review applications of supervised, unsupervised and self-supervised deep learning to several science cases related to galaxy formation, ranging from basic low-level data processing tasks such as segmentation and deblending, through anomaly detection, to more advanced problems involving simulations and observations. I will try to emphasize successes and failures, and discuss promising research lines for the future.

        Speaker: Marc HUERTAS-COMPANY
      • 23
        What to do with curved distributions in MCMC

        Curved posterior distributions come up quite often in practice when running Markov chain Monte Carlo (MCMC) on complex models. They can go undetected, and even when they are detected, people often just ignore them because there is very little guidance on how to deal with them in practice. If the distribution is particularly difficult, results may unknowingly be quite biased. So, I thought it would be useful to discuss what I would personally do (as an MCMC person) if I found that some of the posterior distributions from my MCMC sample were curved/L-shaped/difficult. There will be a lot of pretty pictures.

        Speaker: Filippo PAGANI (University of Cambridge)
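
        A toy illustration of the problem: random-walk Metropolis on a curved, Rosenbrock-style ("banana") posterior. Naive isotropic proposals mix poorly on such targets, which is exactly the situation the talk addresses; the step size and density parameters are arbitrary.

        import numpy as np

        def log_post(theta, a=1.0, b=5.0):
            x, y = theta                              # Rosenbrock-style curved density
            return -((a - x) ** 2 + b * (y - x ** 2) ** 2)

        rng = np.random.default_rng(0)
        theta = np.zeros(2)
        chain, accepted = [], 0
        for _ in range(20000):
            prop = theta + 0.3 * rng.normal(size=2)   # isotropic random-walk proposal
            if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
                theta, accepted = prop, accepted + 1
            chain.append(theta)

        chain = np.array(chain)
        print("acceptance rate:", accepted / len(chain))
        print("posterior mean:", chain.mean(axis=0))  # biased if mixing is poor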
    • 12:30
      Lunch break
    • Poster session 2: seven posters

      Posters are presented virtually by playing a pre-recorded presentation. After each video it is possible to ask the authors questions.

      • 24
        Anomaly Detection using Dimensionality Reduction: an Active Learning Approach

        Anomaly detection can be extremely challenging in real-world, big-data situations, where the features that distinguish the anomalies are usually unknown. In this case, standard anomaly detection algorithms may perform very poorly because they are not being fed the correct features, and learning these features from only a few examples of anomalies is challenging. We introduce an algorithm based on dimensionality reduction methods: it learns about primary prototypes in the data while identifying the anomalies by their large distances from those prototypes. Moreover, it can treat the anomalies as a new class and be customized to find interesting objects. We evaluated our algorithm on a wide variety of simulated and real datasets, in up to 3000 dimensions. It proves to be robust and highly competitive with commonly used anomaly detection algorithms, especially in high dimensions.

        Speaker: Emmanuel SEKYI (African Institute for Mathematical Sciences, Cape Town)
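
        A hedged sketch of the core idea (not the speaker's algorithm, which adds an active-learning loop): reduce dimensionality, learn prototypes, and flag points far from all of them, here with PCA and k-means from scikit-learn on synthetic data.

        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        normal = rng.normal(size=(500, 50))
        anomalies = rng.normal(loc=6.0, size=(5, 50))        # shifted outlier population
        X = np.vstack([normal, anomalies])

        Z = PCA(n_components=5).fit_transform(X)             # dimensionality reduction
        km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)
        dist = np.min(km.transform(Z), axis=1)               # distance to nearest prototype

        top = np.argsort(dist)[-5:]                          # most distant points
        print("flagged indices:", sorted(top))               # expect rows 500..504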
      • 25
        A Method to Distinguish Quiescent and Dusty Star-forming Galaxies with t-SNE

        Large galaxy surveys have revealed a surprising number of galaxies which have ceased (or quenched) their star formation, seen as they were when the Universe was only half its current age. However, identifying large, but clean, samples of these “quiescent” galaxies has proven difficult. Their spectral shapes, as measured by broad-band photometry, are highly degenerate with those of dusty star-forming galaxies, whose light is attenuated by thick clouds of dust and which present as an interloper population. We describe a new technique for identifying pure samples of quiescent galaxies based upon t-distributed stochastic neighbor embedding (t-SNE), an unsupervised machine learning algorithm for dimensionality reduction. This t-SNE selection provides an improvement in both purity and completeness over traditional methods (e.g. UVJ, NUVrJ) and over spectral template fitting. We find that t-SNE outperforms spectral fitting in 63% of trials at distances where large training samples already exist. Remarkably, we find evidence for increased performance when applying the method to samples in the even more distant Universe.

        Speaker: John WEAVER (Cosmic Dawn Center, NBI)
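
        A minimal t-SNE embedding with scikit-learn, using two synthetic photometric populations as stand-ins for quiescent and dusty star-forming galaxies; in the embedded space, a simple cut can already separate the populations.

        import numpy as np
        from sklearn.manifold import TSNE

        rng = np.random.default_rng(0)
        quiescent = rng.normal(loc=0.0, scale=0.3, size=(300, 8))   # toy photometry
        dusty = rng.normal(loc=0.8, scale=0.3, size=(300, 8))
        X = np.vstack([quiescent, dusty])

        emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
        sel = emb[:, 0] < np.median(emb[:, 0])       # a simple cut in the embedding
        # Fraction of each population falling in the selected half; near 1 for one of them:
        print("population A in selection:", sel[:300].mean())
        print("population B in selection:", sel[300:].mean())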
      • 26
        Deep Multi-Task Learning for Series Classification by Pulse Sequence Type and Orientation

        Workshop Relevance: Astrophysics and the medical/public health sciences share the opportunity for machine learning to complement rule-based image routing and alert systems. In this work, we demonstrate how deep learning, random forests, and an ensemble of the two can aid and improve medical image (MRI) routing systems. This work has parallels to astronomical datasets such as the Photometric LSST Astronomical Time-Series Classification Challenge (https://plasticc.org/), where an observation routing or alert system based on machine learning algorithms can prove effective. In particular, these systems may assist in collecting observations of similar transient phenomena and celestial objects across observatories and nights, allowing us to learn more about their behavior. We have created a system with such an objective for medical imaging, and present it to the Earth Meets Sky community for discussion on how such a tool could assist astronomical surveys.

        Background and Purpose: Increasingly complex MRI studies and variable series naming conventions reveal limitations of rule-based image routing, especially in health systems with multiple scanners/sites. Accurate methods to identify series based on image content would aid post-processing and PACS viewing. Recent deep/machine learning efforts classify 5-8 basic brain sequences. We present an ensemble model combining a convolutional neural network (CNN) and a random forest classifier (RFC) to differentiate 23 sequences and image orientation.

        Materials and Methods: Series were grouped by descriptions into 23 sequences and 4 orientations. Dataset A, obtained from our institution, was divided into training (20,904 studies; 86,089 series; 235,674 images) and test sets (7,265 studies; 62,223 series; 3,658,450 images). Dataset B, obtained from a separate hospital, was used for out-of-domain external validation (1,252 studies; 2,150 series; 234,944 images). We developed an ensemble model combining a 2D CNN with a custom multi-task learning (MTL) architecture and RFC trained on DICOM metadata.

        Results: Dataset A overall accuracy by RFC was 97%, CNN 97%, and ensemble 98%. Dataset B overall accuracy by RFC was 99%, CNN 98%, and ensemble 99%. Error analysis revealed different types of discrepancies: non-brain studies; incorrect ground-truth labels; and incorrect predictions, some with possible explanations.

        Conclusion: The ensemble model for series identification accommodates the complexity of brain MRI studies in state-of-the-art clinical practice, performing slightly better than the CNN and RFC separately. Expanding on previous work demonstrating proof-of-concept, our approach is more comprehensive with more classes and orientation classification.

        Future Work: Using a network trained on this data as a tool for fine-tuning on downstream tasks, scaling up to predict more attributes of medical images, and exploring the potential of an analogous system to be inserted in astronomical observation systems.

        Speaker: Noah KASMANOFF (New York University)
      • 27
        Novel Application in Machine Learning: Predicting the Issuance of COVID-19 Stay-at-Home Orders in Africa

        During the COVID-19 pandemic many countries have issued stay-at-home orders (SAHO) to reduce viral transmission. Because of their social and economic consequences, SAHO are a politically risky decision for governments. Within the health policy literature, five factors are identified as theoretically significant to the issuance of SAHO: economic, external, medical, political, and social; however, research exploring the relative importance of these factors is limited. To explore it, we apply a random forest classifier to a novel dataset of n=54 African countries, which includes a wide range of variables from the World Bank, World Health Organization, and State Fragility Index. Generated using 10,000 simulations, our model predicts the issuance of SAHO in our sample with >80% accuracy based on six variables.

        Speaker: Carter RHEA (L'Université de Montréal)
      • 28
        A new vocabulary for textures and its astronomical applications

        Textures and patterns are ubiquitous in imaging data but challenging for quantitative analysis. I will present a new tool, called the “scattering transform”. It borrows ideas from convolutional neural nets while sharing advantages of traditional statistical estimators. As an example, I will show its application to cosmic density maps for cosmological parameter inference and show that it outperforms classic statistics. It is a powerful new approach in astrophysics and beyond.

        Speaker: Sihao CHENG (Johns Hopkins University)
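
        A hedged sketch, assuming the kymatio package is available, of computing 2D scattering coefficients of a toy field; the spatially averaged coefficients form a compact, translation-invariant summary that could feed a standard regressor for parameter inference.

        import numpy as np
        from kymatio.numpy import Scattering2D

        rng = np.random.default_rng(0)
        field = rng.normal(size=(32, 32)).astype(np.float32)   # stand-in density map

        scattering = Scattering2D(J=2, shape=(32, 32))         # 2 octaves of scales
        coeffs = scattering(field)
        print("scattering coefficient array shape:", coeffs.shape)

        # Spatial averaging yields one translation-invariant summary per channel:
        summary = coeffs.mean(axis=(-2, -1))
        print("summary statistics shape:", summary.shape)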
      • 29
        Cross-Registration of Whole Slide Images from Different Modalities in AstroPath

        In this work, we established routines and methods for the registration of digital whole-slide histological imaging, specifically focused on the challenges associated with combining multiplexed immunofluorescent imaging with immunohistochemical imaging. There exist multiple platforms for whole-slide digital image registration; however, these are primarily limited to brightfield hematoxylin and eosin and brightfield immunohistochemical microscopy, or to registering multiplex images in the same marker or narrow filter band. Other studies have demonstrated the ability to use mutual information to register immunofluorescent images with brightfield hematoxylin and eosin; however, this process requires the acquisition of a fluorescent nuclear image for each additional registered image.
        We have developed and implemented a fully automated process to perform a hierarchical, parallelizable, and deformable registration of multiplexed digital whole-slide images, applicable even between different modalities (e.g. immunofluorescent vs. H&E). We generalized the calculation of mutual information as a registration criterion to an arbitrary number of dimensions, a task well-suited for multiplexed imaging. We also further leveraged the power/utility of information theory by using the self-information of a given immunofluorescent channel to determine its suitability for the registration process. This allows for the identification and avoidance of channels containing staining artifacts and the identification and selection of channels containing adequate information for the registration.

        Challenges associated with multiplexed image registration include the potential sparseness of channels and independence between channels. Registration algorithms focused on multispectral images should be designed to accommodate these unique challenges. This effort builds on and extends existing whole-slide image registration studies by demonstrating the ability to register whole-slide multispectral immunofluorescent images with whole-slide immunohistochemical brightfield images using the fluorescent channels determined by their self information. We present our first results from successfully registering whole-slide six-channel multiplex immunofluorescent images with whole-slide brightfield immunohistochemical ones. Future work will focus on sequential slide registration and reconstruction.

        Speaker: Joshua DOYLE (Johns Hopkins University)
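
        A hedged sketch of histogram-based mutual information between two image channels, the registration criterion generalized in this work; the AstroPath implementation extends this to an arbitrary number of dimensions. Data and bin counts are illustrative.

        import numpy as np

        def mutual_information(a, b, bins=64):
            joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
            pxy = joint / joint.sum()                 # joint distribution
            px, py = pxy.sum(axis=1), pxy.sum(axis=0) # marginals
            nz = pxy > 0                              # avoid log(0)
            return np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))

        rng = np.random.default_rng(0)
        base = rng.normal(size=(128, 128))
        related = 0.7 * base + 0.3 * rng.normal(size=base.shape)   # shared structure
        unrelated = rng.normal(size=base.shape)

        print("MI, related channels:  ", mutual_information(base, related))
        print("MI, unrelated channels:", mutual_information(base, unrelated))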
      • 30
        CellView: an Interactive Image Viewer for Multiplex Cancer Images in the AstroPath Project

        We present CellView, a web-based image viewer built using free and open-source software components that aims to provide some of the basic functionality available in more sophisticated commercial image analysis tools used in digital pathology, such as HALO. CellView is designed as a three-tier application with a rich single-page viewer client, tile server middleware, and a SQL database storing imagery and spatial features. We are using the OpenLayers JavaScript library to display high-resolution tissue imagery, cell geometry and region annotations in a scalable manner using a hierarchy of image tiles, allowing for smooth transition between zoom levels. Image rendering tasks are distributed between the client and the tile server, simpler vector graphics and color adjustments performed on the client and more complicated rendering assigned to the tile server. This, combined with efficient spatial indexing at the database level, allows for fast rendering of thousands of cells with minimal delay while zooming or scrolling, using an underlying space-filling curve (Morton z-index). Various region metadata (pathological tissue annotations, field outlines, outlines of cell membranes and cell nuclei) are stored as GIS polygons. Rendering of polygons is also done on a per tile basis. Our indexing enables us to render cells straddling tile boundaries correctly. Visualization of cells is handled by switching between multiple methods, depending on the level of detail, from many cells per screen pixel to many pixels covering a single cell.
        Another interesting feature is the user-defined color mixing of the images. For speed, the color mixes are performed server-side and a JPEG-compressed color tile is sent to the client, but users are able to specify a mixing matrix of the 8 (or more) image layers applied on the server. The client UI, built using the Vue.js framework, provides many customization options, including color and transparency settings for imagery layers and geometrical features, cell filtering settings, and configurable user-defined and predefined region annotation layers. Customization presets can be saved either on the client side in a browser’s local storage, or in a database on the server side. CellView will soon host data about close to a billion cells for the public as part of an Open Cancer Cell Atlas.

        Speaker: Dmitry MEDVEDEV (Johns Hopkins University)
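
        A small sketch of the Morton z-index mentioned above: interleaving the bits of tile (x, y) coordinates yields a 1D key under which nearby tiles tend to stay nearby, which is what makes spatially indexed queries and per-tile rendering efficient. The bit width is arbitrary.

        def morton_index(x: int, y: int, bits: int = 16) -> int:
            z = 0
            for i in range(bits):                 # interleave bit i of x and of y
                z |= ((x >> i) & 1) << (2 * i)
                z |= ((y >> i) & 1) << (2 * i + 1)
            return z

        # Neighboring cells map to nearby keys along the Z-order curve:
        for x, y in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0)]:
            print((x, y), "->", morton_index(x, y))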
    • Afternoon 2: first half

      Afternoon session of the second day.

      • 31
        Searching for needles in the cosmic haystack

        There is a shortage of multiwavelength and spectroscopic follow-up capabilities given the number of transient and variable astrophysical events discovered through wide-field optical surveys. From the haystack of potential science targets, astronomers must pick the valuable needles to study. Given millions of events discovered annually, how does one find a one-in-a-million anomaly? Here we present an unsupervised method to search for out-of-distribution transient events in real-time, multivariate, aperiodic data. In particular, we develop a novel variational recurrent autoencoder architecture and search the resulting learned encoding space for anomalous events. Our pipeline is able to flag anomalous events well before their brightest point (at which astronomers want to trigger follow-up resources), and it can be applied to similar multivariate, sparse time series.

        Speaker: Ashley VILLAR (Columbia University)
      • 32
        Domain Adaptation for Cross-Domain Studies in Astronomy: Merging Galaxies Identification

        In astronomy as well as other sciences, neural networks are often trained on simulation data with the prospect of being used on real instrument data. Astronomical large-scale surveys are already producing very large datasets, and machine learning will play a crucial role in enabling us to fully utilize all of the available data. Unfortunately, training a model on simulated data and then applying it to observations can lead to a substantial decrease in model accuracy on the new target dataset. Simulated and telescope data represent different data domains, and for an algorithm to work in both, domain-invariant learning is necessary. Here we study the problem of distinguishing between merging and non-merging galaxies in simulated (Illustris-1 cosmological simulation) and observational data (Sloan Digital Sky Survey). Understanding galaxy mergers is an important step in understanding the evolution of matter in the universe, and our ability to utilize and combine knowledge from different data domains will be very important for these efforts. In order to enable deep learning algorithms to work in multiple domains, we test two domain adaptation techniques: Maximum Mean Discrepancy (MMD) and Domain Adversarial Neural Networks (DANNs). We show that the addition of domain adaptation improves classification accuracy by up to ${\sim}20\%$ in the target domain. With further development, these techniques will allow scientists in different domains to construct machine learning models that can successfully combine the knowledge from simulated and instrument data, or from data originating from multiple instruments.

        Speaker: Aleksandra CIPRIJANOVIC (Fermi National Accelerator Laboratory, USA)
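
        A hedged sketch of a simple (biased, diagonal-included) RBF-kernel estimate of the squared Maximum Mean Discrepancy named above: it measures how far apart the source ("simulation") and target ("telescope") feature sets are. In training, a penalty like this is added to the classification loss so the network learns features on which the two domains look alike.

        import numpy as np

        def rbf_mmd2(X, Y, gamma=1.0):
            def k(A, B):
                d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                return np.exp(-gamma * d2)
            return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

        rng = np.random.default_rng(0)
        sim = rng.normal(size=(200, 16))                   # "simulation" features
        tel_close = rng.normal(loc=0.1, size=(200, 16))    # nearly matched domain
        tel_far = rng.normal(loc=1.0, size=(200, 16))      # shifted domain

        print("MMD^2, close domains:", rbf_mmd2(sim, tel_close))
        print("MMD^2, far domains:  ", rbf_mmd2(sim, tel_far))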
      • 33
        Bayesian Model Comparison applied to COVID-19

        Bayesian parameter estimation and model comparison are widely used in cosmology. This has led to the development of very efficient and user-friendly codes that perform these complex calculations. In this presentation, we will demonstrate the wider applicability of these algorithms by applying them to study the COVID-19 pandemic. We will perform Bayesian parameter estimation and model comparison, using MCMC and nested sampling, on different variations of the SIR model. This serves not only to learn which models of the pandemic are favored by the data, but also to illustrate the usefulness of these algorithms outside of cosmology.

        Speaker: Pablo LEMOS (University of Sussex)
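
        A hedged sketch of the SIR model named in the abstract, integrated with scipy; in the talk, a likelihood built on such trajectories would feed the MCMC or nested sampler. The rates below are arbitrary example values.

        import numpy as np
        from scipy.integrate import odeint

        def sir(y, t, beta, gamma):
            S, I, R = y
            N = S + I + R
            return [-beta * S * I / N,                 # susceptible -> infected
                    beta * S * I / N - gamma * I,      # infected grows, then recovers
                    gamma * I]                         # recovered

        t = np.linspace(0, 120, 121)                   # days
        y0 = [0.999, 0.001, 0.0]                       # initial S, I, R fractions
        beta, gamma = 0.25, 0.1                        # assumed example rates

        S, I, R = odeint(sir, y0, t, args=(beta, gamma)).T
        print("epidemic peak: day", t[np.argmax(I)],
              "with", round(I.max(), 3), "of the population infected")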
    • 15:20
      Break
    • Afternoon 2: second half

      Afternoon session of the second day.

      • 34
        Chaotic_Neural: Improving Literature Surveys in Astronomy with Machine Learning

        Literature surveys in astronomy are greatly facilitated both by open-access preprint servers (ArXiv) and by online tools like the Astrophysics Data System (ADS). However, the astrophysics literature often uses specialised jargon, sometimes with multiple identifiers for the same phenomenon. For example, the terms SFR-M$_*$ correlation, Star Forming Sequence and Star Formation Main Sequence all mean the same thing in the galaxy context, not to be confused with just Main Sequence, which pertains to stellar evolution. This can often be challenging for young researchers to parse, and can cause even established astrophysicists to miss relevant references. Other issues include in-group bias in referencing and citing literature in a paper, and papers getting overlooked due to the large volume of new literature.

        To help circumvent these issues and provide agnostic, context-aware searches for relevant literature, we present chaotic_neural, a public python package that trains a Doc2Vec model on abstracts from the ArXiv to enable finding relevant literature. The model works by using a neural network to transform abstracts into a high-dimensional vector space. An input vector is generated using an abstract or a set of keywords. Relevant literature can then be searched for by looking for papers that lie in the vicinity of the input vector. Since the computation happens in a vector space, the search can be further refined with linear algebra using keywords. This introduces the possibility of adding and subtracting keywords and/or papers from other keywords and/or papers. The model also provides utility beyond literature surveys, creating a discovery space for future analysis and hypothesis testing. The currently available model (available at https://github.com/kartheikiyer/chaotic_neural) is trained on a large galaxies dataset (https://arxiv.org/list/astro-ph.GA), but can easily be adapted to other fields and datasets.

        Speaker: Kartheik IYER (Dunlap Institute for Astronomy and Astrophysics, University of Toronto)
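
        A minimal Doc2Vec example with gensim (the 4.x API, with model.dv, is assumed), mirroring the package's idea: embed abstracts as vectors and retrieve the nearest ones to a query. Because the result lives in a vector space, keyword addition and subtraction of the kind described above is simple linear algebra on these vectors.

        from gensim.models.doc2vec import Doc2Vec, TaggedDocument

        abstracts = [
            "star formation main sequence in galaxies",
            "stellar evolution on the main sequence",
            "machine learning classification of transients",
        ]
        docs = [TaggedDocument(text.split(), [i]) for i, text in enumerate(abstracts)]
        model = Doc2Vec(docs, vector_size=32, min_count=1, epochs=100)

        query = model.infer_vector("galaxy star formation".split())
        print(model.dv.most_similar([query], topn=2))   # nearest abstracts to the query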
    • Open Discussion (Zoom meeting room 673 0303 4673, passcode 28521)

      direct link to Zoom Meeting room:
      https://ucph-ku.zoom.us/j/67303034673?pwd=M2ZGWDhBY1JDVEJOenV3NGQ3NkJEQT09

      Conveners: Fionagh THOMSON (Centre for Extragalactic Astronomy, Physics, Durham University), Laurence FITZPATRICK (Durham University)
    • 16:50
      Final remarks and goodbye