A brief description of each student's project is available below. The corresponding poster will be available here later in the semester.


Gender Homophily in Open Source Software Networks [Poster]

Katy Yu (Advisor: Bogdan Vasilescu)

Abstract: While gender diversity in open-source software (OSS) has been widely studied, the seemingly opposite concept, gender homophily, where individuals tend to collaborate with those of the same gender, is less explored. It is, therefore, unclear whether gender homophily is present in OSS. If so, what are the reasons behind it? Does it benefit gender diversity in OSS or work against it? This study presents an initial exploration into the existence of gender homophily within OSS networks and seeks to unravel the underlying mechanisms of this phenomenon.

Our study utilizes the World of Code dataset, which encompasses the FLOSS ecosystem and includes binary gender data inferred from contributor names. We construct OSS collaboration networks whose ties represent developer collaborations on projects from 2008 to 2022, and extract the network backbone using significance scores computed from edge weights. The network's gender composition is extremely skewed (66.38% men, 2.70% women, 30.92% unknown), rendering direct estimation of gender homophily unreliable. We mitigate this problem by estimating the mean shortest path distance between dyads of each gender composition and using it as a proxy for the social proximity of ties. In addition, we use exponential random graph models (ERGMs) to estimate the probability of same-gender ties (i.e., the "node-match" statistic with respect to gender).
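
As a concrete illustration of the shortest-path proxy, here is a minimal sketch assuming a networkx graph whose nodes carry a "gender" attribute; the function names, sampling scheme, and toy graph are ours, not the study's:

```python
import random
import networkx as nx

def mean_dyad_distance(G, gender_a, gender_b, n_samples=1000, seed=0):
    """Estimate the mean shortest-path distance over sampled dyads
    whose endpoints carry the given gender labels."""
    rng = random.Random(seed)
    nodes_a = [v for v, d in G.nodes(data=True) if d.get("gender") == gender_a]
    nodes_b = [v for v, d in G.nodes(data=True) if d.get("gender") == gender_b]
    dists = []
    while len(dists) < n_samples:
        u, v = rng.choice(nodes_a), rng.choice(nodes_b)
        if u != v and nx.has_path(G, u, v):
            dists.append(nx.shortest_path_length(G, u, v))
    return sum(dists) / len(dists)

# Toy usage on a random graph with randomly assigned labels; the real
# analysis would compare each dyad type against a shuffled-label baseline.
G = nx.erdos_renyi_graph(200, 0.05, seed=1)
nx.set_node_attributes(
    G, {v: random.Random(v).choice(["man", "woman", "unknown"]) for v in G}, "gender")
print(mean_dyad_distance(G, "woman", "woman"))
```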

Our preliminary findings from a comprehensive OSS network dataset reveal nuanced patterns of gender-based collaboration. In particular, women contributors exhibit a strong preference for collaborating with other women, a behavior confirmed by ERGM nodematch coefficients: women-women ties are 28.3% more likely than men-men ties. The observed mean shortest path length of women-women dyads was 0.453 standard deviations shorter than the random baseline, though still longer than that of men-men dyads, which was 0.36 standard deviations longer than the random baseline.

Using Large Language Models to Generate Pragmatic Test Cases [Poster]

Raunak Sood (Advisors: Daniel Fried, Saujas Vaduguru)

Abstract: A major area of research in neural code generation involves synthesizing programs from sample test cases. However, the main issue with this task is that many different programs can fit the specification outlined by a given test suite. For instance, consider the task of generating regular expressions from example strings. Given the example (ab, ✔), the synthesizer could generate a+b* and a*b+c, both of which are consistent with the input example. Hence, the goal of this area of research is to pragmatically generate programs and test cases by reasoning about what the user intended. In the example above, the synthesizer could reason that, since the test case was chosen intentionally and the character c wasn't included, the user intended the regex a+b* rather than a*b+c. Previous work by Vaduguru et al. tackles this problem for regular-expression synthesis by training speaker and listener models under the Rational Speech Acts (RSA) framework to generate test cases and programs, respectively. In this research project, we aim to adapt this technique to Python programs by generating test cases using the large language model CAT-LM. Then, after the model produces these test cases, we can use inference-time algorithms to pragmatically select the most informative results.
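
As a sketch of the pragmatic reasoning described above, here is the standard RSA recursion applied to a tiny boolean consistency matrix; the programs, tests, and matrix entries are hypothetical, not from the project:

```python
import numpy as np

# Rows are candidate programs, columns are candidate test cases.
# consistent[i, j] = 1 if program i is consistent with test j.
# Program 0 fits both tests; program 1 fits only test 1.
consistent = np.array([
    [1.0, 1.0],
    [0.0, 1.0],
])

def normalize(M, axis):
    s = M.sum(axis=axis, keepdims=True)
    return np.divide(M, s, out=np.zeros_like(M), where=s > 0)

L0 = normalize(consistent, axis=0)  # literal listener: P(program | test)
S1 = normalize(L0, axis=1)          # pragmatic speaker: P(test | program)
L1 = normalize(S1, axis=0)          # pragmatic listener: P(program | test)

# Shown the ambiguous test 1, the pragmatic listener favors program 1
# (0.75 vs 0.25): a user intending program 0 would likely have shown
# the more distinctive test 0 instead.
print(L1[:, 1])
```
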
Toward Finding the Minimal Superpermutation of Six Symbols [Poster]

Kyle Booker (Advisor: Marijn Heule)

Abstract: Superpermutations are combinatorial structures whose properties have remained enigmatic to mathematicians for decades. Specifically, inquiries into the length of the minimal superpermutation on \( n \) symbols have yielded mixed results. While proofs have shown that the lower and upper bounds converge on long-predicted values for \( n \) between 1 and 5, the case \( n = 6 \) defies such predictions and its exact value is not known. In this paper, we present three different techniques for encoding the superpermutation problem as an instance of the satisfiability (SAT) problem. Using state-of-the-art SAT solving techniques, we replicate the previously known bounds, opening the door to potentially solving the case \( n = 6 \).
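
To make the notion of a SAT encoding concrete, here is a minimal sketch of one straightforward (and deliberately unoptimized) encoding, not necessarily any of the three techniques presented in the paper:

```python
from itertools import permutations

def superperm_cnf(n, L):
    """CNF satisfiable iff some string of length L over n symbols
    contains every permutation of the n symbols as a substring."""
    nv = 0
    def new_var():
        nonlocal nv
        nv += 1
        return nv
    # x[i][c]: position i holds symbol c (exactly-one constraints below).
    x = [[new_var() for _ in range(n)] for _ in range(L)]
    clauses = []
    for i in range(L):
        clauses.append([x[i][c] for c in range(n)])
        for c in range(n):
            for d in range(c + 1, n):
                clauses.append([-x[i][c], -x[i][d]])
    for p in permutations(range(n)):
        occ = []
        for i in range(L - n + 1):
            a = new_var()                 # a -> p occurs starting at i
            occ.append(a)
            for k in range(n):
                clauses.append([-a, x[i + k][p[k]]])
        clauses.append(occ)               # p must occur somewhere
    return nv, clauses

# The minimal superpermutation on 3 symbols has length 9 ("123121321"),
# so this instance is satisfiable while (3, 8) is not.
nv, cls = superperm_cnf(3, 9)
with open("superperm_n3_L9.cnf", "w") as f:
    f.write(f"p cnf {nv} {len(cls)}\n")
    for cl in cls:
        f.write(" ".join(map(str, cl)) + " 0\n")
```
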
Predicting Market Movement using Large Language Models [Poster]

Jerick Shi (Advisor: Burton Hollifield)

Abstract: Predicting the movement of the stock market and other assets has been valuable over the past few decades. Knowing how the value of a certain market sector may move in the future provides much information for investors, who use that information to develop strategies that maximize profit or minimize risk. However, market data is quite noisy, and it is challenging to choose the right data or the right model for such predictions.
Our goal is to use LLMs such as GPT-3.5 to create a portfolio of assets that minimizes risk. We use data from economic texts such as the Federal Reserve Beige Book, which provides summaries of economic conditions across districts in the US. Using such data, we then employ these LLMs to predict the correlations between different stocks and bonds as well as their volatility. We compare these results with other statistical learning models, assess whether knowing the economic conditions improves investment decisions, and provide an investment strategy fully reliant on these LLMs.
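
As a sketch of how predicted correlations and volatilities feed into a risk-minimizing portfolio, here is the textbook global minimum-variance computation; the numbers are illustrative placeholders, not model outputs:

```python
import numpy as np

# Hypothetical predicted correlations and volatilities for a tiny
# universe of two stocks and one bond.
corr = np.array([
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
vol = np.array([0.25, 0.20, 0.08])      # annualized volatilities
cov = corr * np.outer(vol, vol)         # covariance matrix

# Global minimum-variance weights: w = C^{-1} 1 / (1' C^{-1} 1).
ones = np.ones(len(vol))
w = np.linalg.solve(cov, ones)
w /= w.sum()

print("weights:", np.round(w, 3))
print("portfolio volatility:", round(float(np.sqrt(w @ cov @ w)), 4))
```
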
Youth-Driven Algorithm Auditing [Poster]

Emily Amspoker (Advisors: Jaemarie Solyst, Jessica Hammer)

Abstract: Due to the proliferation of AI-driven technologies, youth have become key stakeholders in algorithmic systems. Youth, especially those from underserved communities, are often exposed to the harm inflicted by biases in these algorithms. Unfortunately, they are rarely consulted by the creators of these systems due to misconceptions about their ability to understand bias in AI at a high level as well as notions of equity and fairness. Prior workshop-based studies, however, demonstrated that with proper scaffolding, youth can understand and explain examples of bias in AI and ideate their own solutions. Building on this work, this research project seeks to further understand how youth envision themselves changing algorithmic systems and what motivates youth to be involved in combating algorithmic bias. This will be achieved using two primary methods. The first is a storyboarding study in which storytelling serves as a means to understand youth's and parents' conceptions of their roles in algorithm auditing. The second focuses on prototyping and testing youth-friendly interface features within the framework of a pre-existing algorithmic system.
Enhancing Spatial Transcriptomics with Vision Transformers for Single-Cell Resolution Mapping [Poster]

Anisha Chatterjee (Advisor: Jose Lugo-Martinez)

Abstract: Spatial transcriptomics allows scientists to measure gene activity in a given tissue sample and map where that activity occurs. It has greatly improved our ability to analyze gene expression within the context of different tissue structures, providing insights into cell-cell interactions and structural organization. Our project aims to enhance current spatial transcriptomics methods by transforming low-resolution data into single-cell resolution. This process occurs in two steps. First, a pixel-level gene expression profile is created by applying XFuse, which relies on statistical modeling, to stained spot data. Using that data and additional cell image data, cells can then be spatially mapped and matched to their gene expression. The main challenge for this step is cell segmentation: separating cells and determining which cells belong in which spots. A popular method, SCS, first identifies cell nuclei from staining images using the watershed algorithm. Then, a transformer model infers, for each spot, its relative position with respect to the nucleus of its cell. My project is to improve upon the SCS model by combining the steps from XFuse and SCS and using a vision transformer for both tasks. The goal is to improve accuracy and efficiency as more information becomes available.
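
For illustration, here is a minimal sketch of the watershed-based nuclei-detection step that SCS begins with, run on a synthetic binary mask with scikit-image; the real pipeline operates on stained images:

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

# Synthetic stand-in for a thresholded nuclei stain: two overlapping
# disks that the watershed should split into two nuclei.
yy, xx = np.mgrid[0:80, 0:80]
mask = ((yy - 40) ** 2 + (xx - 28) ** 2 < 14 ** 2) | \
       ((yy - 40) ** 2 + (xx - 52) ** 2 < 14 ** 2)

# Peaks of the distance transform serve as one marker per nucleus.
distance = ndi.distance_transform_edt(mask)
coords = peak_local_max(distance, min_distance=10, labels=mask)
markers = np.zeros(mask.shape, dtype=int)
markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)

# Watershed on the inverted distance map separates the touching nuclei.
labels = watershed(-distance, markers, mask=mask)
print("nuclei found:", labels.max())  # should report 2
```
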
Hardness Estimation for Learning Parity with Noise: Adding Variety to Quantum Cryptography [Poster]

Tanisha Saxena (Advisor: Aayush Jain)

Abstract: Following the advent of quantum computing, many people have lost faith in the power of standard cryptographic measures for data protection. Rightfully so, as the mathematical hardness assumptions made to prove the security of many algorithms can be undone in the quantum world. To adapt, cryptographers have developed "Learning with Errors" (LWE), an assumption that holds strong even against quantum computing power. The promising nature of LWE means many cryptographers have chosen to avoid other assumptions for lack of research into their strength. Specifically, Learning Parity with Noise (LPN) has been studied mainly in theory, and its usefulness in real applications is thus unknown. In this paper, we estimate the concrete security of LPN by referencing Albrecht et al.'s lattice estimator for LWE and showing that LPN's parameters strongly influence its security. Furthermore, we demonstrate the usefulness of the LPN assumption by taking previous quantum algorithms and showing how LPN simplifies their proof strategies significantly. Our research on LPN allows cryptographers to develop a wider variety of protocols that are not only more efficient but also more resilient to adversarial attacks, because they use targeted assumptions rather than generic ones.
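
For a rough sense of how LPN parameters drive security, here is a back-of-the-envelope sketch of one simple classical attack (repeatedly hoping for an error-free batch of samples and solving by Gaussian elimination); this is far cruder than an estimator-based analysis, and the parameters are illustrative:

```python
import math

def lpn_gauss_bit_security(n, tau):
    """Crude bit-security of LPN with secret length n and Bernoulli noise
    rate tau: each attempt draws n samples, which are all error-free with
    probability (1 - tau)^n, and solves an n x n linear system (~ n^3 ops)."""
    log2_attempts = n * math.log2(1.0 / (1.0 - tau))
    log2_per_attempt = 3 * math.log2(n)
    return log2_attempts + log2_per_attempt

for n, tau in [(512, 0.125), (1024, 0.125), (1024, 0.25)]:
    print(f"n={n:5d}  tau={tau:5.3f}  ~{lpn_gauss_bit_security(n, tau):6.1f} bits")
```
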
AI-Based Recommendations for Learning Optimal Treatment Strategies for Sepsis with UPMC Data [Poster]

Maggie Cai (Advisors: Adam Perer, Venkatesh Sivaraman)

Abstract: AI/ML in healthcare has become increasingly popular because of its potential to improve patient outcomes. My research builds upon previous studies of AI-based recommendations, trained on MIMIC, a large public ICU dataset, for learning optimal treatment strategies for sepsis in intensive care. Sepsis is a condition that occurs when the body's response to infection causes inflammation, which can lead to organ failure. Sepsis treatment consists of giving patients certain amounts of IV fluids and vasopressor drugs. However, treatment practices vary widely and decisional uncertainty is common. Thus, an AI model was built to suggest optimal treatments for patients with sepsis. Based on patient information, such as vital signs and fluids/vasopressors previously received, and clinician recommendations, the AI model predicts the patient's future disease severity. Researchers found that clinician recommendations have little or no impact on the model's predictions. My work involves analyzing data from UPMC, a larger, more diverse, and not yet publicly available dataset, to determine whether the information from clinician recommendations can be more meaningful to the model's predictions. Analyzing the UPMC data showed that clinician recommendations do impact the model's predictions, contrary to the findings on MIMIC. Using a slice-based analysis technique, I aim to determine the reason behind this discrepancy.
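
A minimal sketch of the slice-based idea: compare how much the clinician-recommendation features move the model's predictions within clinically meaningful cohorts. The column names and the synthetic data below are hypothetical, not the UPMC schema:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "lactate": rng.gamma(2.0, 1.2, n),
    "mean_arterial_pressure": rng.normal(75, 12, n),
    "vasopressor_dose": rng.exponential(0.1, n) * rng.integers(0, 2, n),
    "pred_full": rng.uniform(0, 1, n),    # prediction with clinician features
    "pred_noclin": rng.uniform(0, 1, n),  # prediction with them ablated
})
df["clin_effect"] = (df["pred_full"] - df["pred_noclin"]).abs()

# Slice the cohort and see where clinician inputs move predictions most.
slices = {
    "vasopressor given": df["vasopressor_dose"] > 0,
    "high lactate":      df["lactate"] > 2.0,
    "low MAP":           df["mean_arterial_pressure"] < 65,
}
for name, idx in slices.items():
    print(f"{name:18s} n={idx.sum():4d}  mean |delta| = {df.loc[idx, 'clin_effect'].mean():.3f}")
```
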
Optimization of MLSys Framework Using Speculative Inference and Asynchronous Verification [Poster]

Lijie Yang (Advisor: Zhihao Jia)

Abstract: This project explores the optimization of machine learning systems (MLSys) frameworks through speculative inference and asynchronous verification, specifically targeting generative large language models (LLMs) like ChatGPT and GPT-4. These models, characterized by their extensive computational and memory requirements, pose significant challenges in achieving fast and cost-effective service. The proposed system, SpecInfer, leverages small speculative models (SSMs) and a novel approach to token tree verification to predict and verify LLM outputs in parallel, reducing both processing time and computational resource demands. The innovation lies in conducting speculative inference and token tree verification simultaneously and asynchronously, addressing the latency predominantly caused by the sequential verification process, which constitutes up to 80% of total latency in existing systems. The project aims for a 1.5x-2x speedup in overall performance by rearchitecting the inference process to allow for more efficient speculation and verification phases. This involves the development of a new tokenizer optimized for speculative inference, with milestones set from initial planning to final implementation and testing. The research also encompasses a thorough literature review of tokenization methods, speculative inference, and verification optimization to support the development of an optimized framework. With the ultimate aim of contributing to the open-source community, this project endeavors to significantly improve the inference latency of LLMs, enhancing their accessibility and efficiency for widespread use.
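
For intuition, here is a toy sketch of the accept/verify loop at the core of speculative decoding (greedy variant), with stand-in functions playing the roles of the SSM and the LLM; SpecInfer's token-tree, asynchronous version is substantially more sophisticated:

```python
def draft_next(tokens):   # stand-in for the small speculative model
    return (tokens[-1] * 31 + 7) % 100

def target_next(tokens):  # stand-in for the large model (ground truth)
    t = (tokens[-1] * 31 + 7) % 100
    return t if tokens[-1] % 3 else (tokens[-1] + 1) % 100

def speculative_decode(prompt, n_new, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1. The draft model speculates k tokens sequentially (cheap).
        spec = []
        for _ in range(k):
            spec.append(draft_next(tokens + spec))
        # 2. The target model scores all k positions; in a real system
        #    this is one batched forward pass, which is where the
        #    parallelism (and the speedup) comes from.
        verified = [target_next(tokens + spec[:i]) for i in range(k)]
        # 3. Accept the longest agreeing prefix, then take one corrected
        #    token from the target model.
        n_ok = 0
        while n_ok < k and spec[n_ok] == verified[n_ok]:
            n_ok += 1
        tokens += spec[:n_ok]
        if n_ok < k:
            tokens.append(verified[n_ok])
    return tokens[:len(prompt) + n_new]

print(speculative_decode([42], 10))
```
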
Spatially Adaptive Fluids Using Dynamic Radius in SPH Methods [Poster]

Trey DuBose (Advisor: Minchen Li)

Abstract: The current practice of fluid simulation with smoothed particle hydrodynamics (SPH) methods uses fixed radius sizes, or at least fixed integration sizes. However, this fixes the level of detail and accuracy that can be represented in any simulation. In any given simulation there will be areas where high detail is necessary or beneficial, and other areas where low detail suffices and oversampling is redundant. With this project I aim to demonstrate how an adaptive radius scheme can improve the computational efficiency of fluid simulations. To achieve the intended effects, I will first demonstrate how to handle particles of different sizes in a complex simulation environment, and second, how radius sizes can be changed over time depending on the surrounding environment with regard to fluid density and internal forces. Lastly, I will show how my results compare to current state-of-the-art methods that do not have adaptive radii.
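
A minimal sketch of the core ingredient for mixed particle sizes: an SPH density sum where each particle carries its own smoothing radius and pairwise interactions use a symmetrized radius. The Gaussian kernel and all parameters are illustrative choices, not the project's method:

```python
import numpy as np

def gaussian_W(r, h):
    """3D Gaussian smoothing kernel, normalized so it integrates to 1."""
    return np.exp(-(r / h) ** 2) / (np.pi ** 1.5 * h ** 3)

def sph_density(pos, mass, h):
    """Density with a per-particle smoothing radius h[i]; pairwise radii
    are symmetrized as h_ij = (h_i + h_j) / 2, a common way to keep
    interactions symmetric when neighboring radii differ."""
    diff = pos[:, None, :] - pos[None, :, :]
    r = np.linalg.norm(diff, axis=-1)
    h_ij = 0.5 * (h[:, None] + h[None, :])
    return (mass[None, :] * gaussian_W(r, h_ij)).sum(axis=1)

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 1.0, size=(200, 3))
mass = np.full(200, 1.0 / 200)
h = np.full(200, 0.10)
h[pos[:, 0] > 0.5] = 0.05   # finer resolution on one half of the domain
print(sph_density(pos, mass, h)[:5])
```
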
Deterministic Parallel Batch-Dynamic Tree Contraction [Poster]

William Gay (Advisor: Daniel Anderson)

Abstract: Dynamic trees are an important tool in the design of dynamic graph algorithms. The problem is to maintain a set of vertex-disjoint trees that can be updated by inserting and deleting edges. The structure should be able to efficiently (in \( O(\log n) \) time) answer queries about the trees, such as the weight of a path, the size of a subtree, or the lowest common ancestor of two vertices. The Rake-Compress (RC) tree algorithm solves this problem by iteratively contracting vertices into clusters, which are themselves contracted until the structure is balanced and can be efficiently traversed. In the parallel batch-dynamic setting, the goal is to handle multiple insert and delete operations concurrently. Previous solutions for the parallel batch-dynamic update problem have been randomized. We present deterministic algorithms for performing static compression and batch updates on RC trees. The static algorithm performs \( O(n) \) work with \( O(\log n \log^* n) \) span, and the dynamic update algorithm performs \( O(k \log(1 + \frac{n}{k})) \) work with \( O(\log n \log^* k) \) span on an update batch of size \( k \), making both algorithms work-efficient with low span.
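
For intuition about the contraction process underlying RC trees, here is a toy sequential sketch of repeated rake and compress rounds; the deterministic parallel algorithms in this work are considerably more involved:

```python
def contract(adj):
    """adj: dict vertex -> set of neighbors, representing a tree.
    Returns the number of rake/compress rounds until one vertex remains."""
    rounds = 0
    while len(adj) > 1:
        rounds += 1
        # Rake: remove vertices that are currently leaves.
        for v in [u for u in adj if len(adj[u]) == 1]:
            if v in adj and len(adj[v]) == 1:
                (u,) = adj[v]
                adj[u].discard(v)
                del adj[v]
        # Compress: splice out an independent set of degree-2 vertices.
        picked = set()
        for v in list(adj):
            if len(adj[v]) == 2 and not (adj[v] & picked):
                picked.add(v)
        for v in picked:
            a, b = adj[v]
            adj[a].discard(v); adj[b].discard(v)
            adj[a].add(b); adj[b].add(a)
            del adj[v]
    return rounds

# A path on 1024 vertices contracts in O(log n) rounds.
n = 1024
adj = {i: set() for i in range(n)}
for i in range(n - 1):
    adj[i].add(i + 1); adj[i + 1].add(i)
print("rounds:", contract(adj))
```
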
Examining Catastrophic Forgetting in Synthetic Setups [Poster]

Kunal Kapoor (Advisors: Suhas Kotha, Aditi Raghunathan)

Abstract: This research project investigates the phenomenon of forgetting in large neural networks and aims to isolate its underlying causes, moving beyond traditional capacity-related explanations. Building on the observation that increasing model size reduces forgetting, the study employs a systematic approach to identify the "real" reasons behind forgetting and explores the hypothesis that increasing capacity changes the loss landscape, making it easier to optimize. A synthetic experiment is conducted, involving fine-tuning on Gaussian distributions, to analyze how the model output evolves during the process and assess the extent of forgetting. We find that larger networks, in our synthetic setup, tend to see an increase in forgetting. We also evaluate strategies to mitigate forgetting, such as data replay. Finally, we experimentally isolate instances of forgetting in LLMs.
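
Here is a minimal numpy sketch in the spirit of the synthetic setup: fit a linear model on one Gaussian task, fine-tune it on a second, and measure forgetting as the increase in error on the first task. The dimensions, means, and learning rate are illustrative, not the study's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def make_task(mean, w_true, n=512):
    """Regression task whose inputs are Gaussian, centered at `mean`."""
    X = rng.normal(mean, 1.0, size=(n, d))
    return X, X @ w_true

def train(X, y, w, lr=0.01, steps=500):
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(X)
    return w

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

Xa, ya = make_task(-2.0, rng.normal(size=d))   # task A
Xb, yb = make_task(+2.0, rng.normal(size=d))   # task B

w = train(Xa, ya, np.zeros(d))
before = mse(Xa, ya, w)
w = train(Xb, yb, w)                           # fine-tune on task B
after = mse(Xa, ya, w)
print(f"task-A MSE before fine-tuning: {before:.4f}, after: {after:.4f}")
# The gap (after - before) quantifies forgetting; sweeping model size
# traces how forgetting scales in this setup.
```
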
Codified audio language representations for Music Information Retrieval [Poster]

Jaehee Kim and Advika Jayanti (Advisor: Chris Donahue)

Abstract: In this project, we aim to replicate and expand upon the Music Information Retrieval (MIR) capabilities demonstrated by Jukebox, a state-of-the-art music generation model, using MusicGen as a more computationally efficient alternative. Drawing inspiration from recent advancements in Natural Language Processing (NLP), we propose a novel approach that treats music as a sequence of tokens, enabling the application of NLP methods to extract musically relevant information from audio recordings. Our methodology involves pre-training MusicGen on a diverse dataset of audio recordings paired with labels, followed by fine-tuning the model for specific MIR tasks. We provide a comprehensive evaluation of MusicGen's performance on benchmark MIR tasks, including beat and chord detection, source separation, and genre classification, comparing its efficacy against Jukebox. Additionally, we explore the potential for transfer learning by leveraging MusicGen's learned representations for downstream MIR tasks, such as predicting melodic structure and ragas in Indian classical music. By leveraging MusicGen's simplicity, we aim to broaden access to applications in music analysis and synthesis.
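
The probing recipe from the Jukebox line of work, sketched below with a linear probe: extract one representation per clip, then train a shallow classifier. Random vectors stand in for MusicGen hidden states so the sketch runs end to end; the extraction step itself is model-specific and not shown:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical setup: feats[i] would be a time-averaged hidden-state
# vector from MusicGen's language model for clip i; genre[i] its label.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 1024))
genre = rng.integers(0, 10, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(feats, genre, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # ~chance on random data
```
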
Exploring Youth Empowerment in Fair AI [Poster]

Emily Estrada (Advisor: Hanan Hibshi)

Abstract: Due to the innovation in AI-driven technologies, youth have become key stakeholders in algorithmic systems. Youth, especially those from underserved communities, are regular users of AI and are exposed to harm from AI. Despite this, youth are not yet included as contributors to creating responsible AI, even though they can be sensitive to and articulate AI bias. This research project explores youth empowerment in fair AI in two ways. First, we investigate and share findings about how youth envision their role in changing AI to be more fair through a series of educational workshops. Second, this work yields insight into how child-ideated features can support youth in giving feedback about algorithms through low-fidelity prototyping.
Faster Music Generation with Speculative Decoding [Poster]

Dongkyun Kim (Advisor: Chris Donahue)

Abstract: Autoregressive transformer decoders paired with quantized audio tokenizers are the dominant approach to music generation, owing to their high-fidelity reconstruction capabilities. The autoregressive decoding process, however, causes the inference time of these models to scale with the number of codebooks. While methods using parallel codebook interleaving patterns have been proposed, their implementation is impeded by the challenge of modeling the conditional dependence between different codebook levels.
In this work, we introduce a novel inference strategy for music generation models that incorporates dual-pattern speculative decoding. Speculative decoding is a technique widely used in large language models, where a smaller draft model predicts potential future sequences and the original model validates them, thus taking advantage of parallelism. Dual-pattern speculative decoding extends this concept by using a delayed-codebook-pattern model as the draft model and a flattened-codebook-pattern model as the validator. The method thus effectively leverages parallel codebook interleaving without sacrificing the generation quality of sequential methods. Our work offers new pathways for improving the performance of music generation systems. It may also enable real-time, high-fidelity music generation, opening up new possibilities for interactive music applications, live performance aids, and more efficient music composition tools.
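
To make the two interleaving patterns concrete, here is a toy illustration of how K codebooks over T timesteps are laid out under the flatten pattern (fully sequential, the validator above) versus the delay pattern (parallel across codebooks, the draft above); the token labels are placeholders:

```python
K, T = 4, 6
token = [[f"t{t}k{k}" for k in range(K)] for t in range(T)]

# Flatten pattern: one token per decoding step, K * T steps in total,
# capturing all conditional dependencies between codebook levels.
flatten = [token[t][k] for t in range(T) for k in range(K)]

# Delay pattern: codebook k is shifted right by k steps, so all K
# codebooks are emitted in parallel at each of T + K - 1 steps.
delay = [[token[s - k][k] if 0 <= s - k < T else "PAD" for k in range(K)]
         for s in range(T + K - 1)]

print("flatten decoding steps:", len(flatten))  # 24 sequential steps
print("delay decoding steps:  ", len(delay))    # 9 parallel steps
```
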
Proving Validity of CSL With Relaxed Memory Models [Poster]

Aaron Gostein (Advisor: Stephen Brookes)

Abstract: We provide a semantics for concurrent separation logic (CSL) capable of reasoning about parallel and sequential composition of concurrent programs, parameterized by a choice of memory model characterized by a relaxation relation and a linearity guarantee. We explore memory models such as sequential consistency (SC), the model assumed in the original 2007 proof of validity of CSL, in which each pair of memory actions appears in program order; total store order (TSO), in which each pair of writes appears in program order; and partial store order (PSO), in which, for each variable, each pair of writes to that variable appears in program order. Since our semantic definitions are parameterized on the memory model used, we obtain a framework in which it is easy to compare programs across a range of memory models.
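
To illustrate how these models differ, here is a toy enumeration of the classic store-buffering litmus test, where TSO's write-to-read relaxation admits an outcome that SC forbids; the encoding is a stand-alone illustration, unrelated to the paper's semantics:

```python
from itertools import combinations

# Thread 1 runs {x = 1; r1 = y}; thread 2 runs {y = 1; r2 = x}.
T1 = [("W", "x", None), ("R", "y", "r1")]
T2 = [("W", "y", None), ("R", "x", "r2")]

def outcomes(t1, t2):
    """All (r1, r2) results over interleavings preserving each list's order."""
    results, n = set(), len(t1) + len(t2)
    for slots in combinations(range(n), len(t1)):
        seq, i1, i2 = [], iter(t1), iter(t2)
        for s in range(n):
            seq.append(next(i1) if s in slots else next(i2))
        mem, reg = {"x": 0, "y": 0}, {}
        for op, loc, r in seq:
            if op == "W":
                mem[loc] = 1
            else:
                reg[r] = mem[loc]
        results.add((reg["r1"], reg["r2"]))
    return results

sc = outcomes(T1, T2)
# TSO: a read may be reordered before an older write (to a different
# location) from the same thread, as if the write sat in a store buffer.
tso = sc | outcomes(T1[::-1], T2) | outcomes(T1, T2[::-1]) | outcomes(T1[::-1], T2[::-1])
print("SC  allows r1 = r2 = 0:", (0, 0) in sc)    # False
print("TSO allows r1 = r2 = 0:", (0, 0) in tso)   # True
```
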
Investigating the SimSearchNet Dataset with the ParlayANN Library [Poster]

Taekseung Kim (Advisors: Guy Blelloch, Magdalen Dobson)

Abstract: Prior work presented ParlayANN, a parallel, graph-based library containing some of the fastest known algorithms for nearest-neighbor and range search. However, on a specific dataset called SimSearchNet, it runs slower than the well-known IVF algorithm in FAISS. Our group investigates this dataset in several ways and develops a new version of ParlayANN that performs well on SimSearchNet too.
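
For context, here is a minimal sketch of the IVF idea behind the FAISS index mentioned above: cluster the points, then probe only the few lists nearest to each query. ParlayANN's graph-based search works quite differently; all parameters here are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 32)).astype(np.float32)

nlist, nprobe = 64, 4
km = KMeans(n_clusters=nlist, n_init=4, random_state=0).fit(data)
lists = {c: np.where(km.labels_ == c)[0] for c in range(nlist)}

def ivf_search(q, k=10):
    """Scan only the nprobe inverted lists whose centroids are closest."""
    d2c = np.linalg.norm(km.cluster_centers_ - q, axis=1)
    cand = np.concatenate([lists[c] for c in np.argsort(d2c)[:nprobe]])
    d = np.linalg.norm(data[cand] - q, axis=1)
    return cand[np.argsort(d)[:k]]

q = rng.normal(size=32).astype(np.float32)
exact = set(np.argsort(np.linalg.norm(data - q, axis=1))[:10])
print("recall@10:", len(exact & set(ivf_search(q))) / 10)
```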