Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeDon't Rank, Combine! Combining Machine Translation Hypotheses Using Quality Estimation
Neural machine translation systems estimate probabilities of target sentences given source sentences, yet these estimates may not align with human preferences. This work introduces QE-fusion, a method utilizing a quality estimation metric (QE) that better correlates with human judgments to synthesize improved translations. QE-fusion leverages a candidate pool sampled from a model, combining spans from different candidates using QE metrics such as CometKiwi. We compare QE-fusion against beam search and recent reranking techniques, such as Minimum Bayes Risk decoding or QE-reranking. Our method consistently improves translation quality in terms of COMET and BLEURT scores when applied to large language models (LLMs) used for translation (PolyLM, XGLM, Llama2, and Mistral) and to multilingual translation models (NLLB), over five language pairs. Notably, QE-fusion exhibits larger improvements for LLMs due to their ability to generate diverse outputs. We demonstrate that our approach generates novel translations in over half of the cases and consistently outperforms other methods across varying numbers of candidates (5-200). Furthermore, we empirically establish that QE-fusion scales linearly with the number of candidates in the pool. QE-fusion proves effective in enhancing LLM-based translation without the need for costly retraining of LLMs.
Best-First Beam Search
Decoding for many NLP tasks requires an effective heuristic algorithm for approximating exact search since the problem of searching the full output space is often intractable, or impractical in many settings. The default algorithm for this job is beam search -- a pruned version of breadth-first search. Quite surprisingly, beam search often returns better results than exact inference due to beneficial search bias for NLP tasks. In this work, we show that the standard implementation of beam search can be made up to 10x faster in practice. Our method assumes that the scoring function is monotonic in the sequence length, which allows us to safely prune hypotheses that cannot be in the final set of hypotheses early on. We devise effective monotonic approximations to popular nonmonontic scoring functions, including length normalization and mutual information decoding. Lastly, we propose a memory-reduced variant of Best-First Beam Search, which has a similar beneficial search bias in terms of downstream performance, but runs in a fraction of the time.
On Hallucination and Predictive Uncertainty in Conditional Language Generation
Despite improvements in performances on different natural language generation tasks, deep neural models are prone to hallucinating facts that are incorrect or nonexistent. Different hypotheses are proposed and examined separately for different tasks, but no systematic explanations are available across these tasks. In this study, we draw connections between hallucinations and predictive uncertainty in conditional language generation. We investigate their relationship in both image captioning and data-to-text generation and propose a simple extension to beam search to reduce hallucination. Our analysis shows that higher predictive uncertainty corresponds to a higher chance of hallucination. Epistemic uncertainty is more indicative of hallucination than aleatoric or total uncertainties. It helps to achieve better results of trading performance in standard metric for less hallucination with the proposed beam search variant.
Toward Reliable Biomedical Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models
Large language models (LLMs) have shown significant potential in scientific disciplines such as biomedicine, particularly in hypothesis generation, where they can analyze vast literature, identify patterns, and suggest research directions. However, a key challenge lies in evaluating the truthfulness of generated hypotheses, as verifying their accuracy often requires substantial time and resources. Additionally, the hallucination problem in LLMs can lead to the generation of hypotheses that appear plausible but are ultimately incorrect, undermining their reliability. To facilitate the systematic study of these challenges, we introduce TruthHypo, a benchmark for assessing the capabilities of LLMs in generating truthful biomedical hypotheses, and KnowHD, a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge. Our results show that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs. Human evaluations further validate the utility of KnowHD in identifying truthful hypotheses and accelerating scientific discovery. Our data and source code are available at https://github.com/Teddy-XiongGZ/TruthHypo.
Calculation of prompt diphoton production cross sections at Tevatron and LHC energies
A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy. The region of phase space is specified in which the calculation is most reliable. Good agreement is demonstrated with data from the Fermilab Tevatron, and predictions are made for more detailed tests with CDF and DO data. Predictions are shown for distributions of diphoton pairs produced at the energy of the Large Hadron Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs boson are contrasted with those produced from QCD processes at the LHC, showing that enhanced sensitivity to the signal can be obtained with judicious selection of events.
Statistics of X-Ray Polarization Measurements
The polarization of an X-ray beam that produces electrons with velocity components perpendicular to the beam generates an azimuthal distribution of the ejected electrons. We present methods for simulating and for analyzing the angular dependence of electron detections which enable us to derive simple analytical expressions for useful statistical properties of observable data. The derivations are verified by simulations. While we confirm the results of previous work on this topic, we provide an extension needed for analytical treatment of the full range of possible polarization amplitudes.
Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning
Recent advancements have significantly augmented the reasoning capabilities of Large Language Models (LLMs) through various methodologies, especially chain-of-thought (CoT) reasoning. However, previous methods fail to address reasoning errors in intermediate steps, leading to accumulative errors. In this paper, we propose Deductive Beam Search (DBS), which seamlessly integrates CoT and deductive reasoning with step-wise beam search for LLMs. Our approach deploys a verifier, verifying the deducibility of a reasoning step and its premises, thus alleviating the error accumulation. Furthermore, we introduce a scalable and labor-free data construction method to amplify our model's verification capabilities. Extensive experiments demonstrate that our approach significantly enhances the base performance of LLMs of various scales (7B, 13B, 70B, and ChatGPT) across 8 reasoning datasets from 3 diverse reasoning genres, including arithmetic, commonsense, and symbolic. Moreover, our analysis proves DBS's capability of detecting diverse and subtle reasoning errors and robustness on different model scales.
Sparse Autoencoders for Hypothesis Generation
We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
Returning The Favour: When Regression Benefits From Probabilistic Causal Knowledge
A directed acyclic graph (DAG) provides valuable prior knowledge that is often discarded in regression tasks in machine learning. We show that the independences arising from the presence of collider structures in DAGs provide meaningful inductive biases, which constrain the regression hypothesis space and improve predictive performance. We introduce collider regression, a framework to incorporate probabilistic causal knowledge from a collider in a regression problem. When the hypothesis space is a reproducing kernel Hilbert space, we prove a strictly positive generalisation benefit under mild assumptions and provide closed-form estimators of the empirical risk minimiser. Experiments on synthetic and climate model data demonstrate performance gains of the proposed methodology.
Interpreting Black Box Models via Hypothesis Testing
In science and medicine, model interpretations may be reported as discoveries of natural phenomena or used to guide patient treatments. In such high-stakes tasks, false discoveries may lead investigators astray. These applications would therefore benefit from control over the finite-sample error rate of interpretations. We reframe black box model interpretability as a multiple hypothesis testing problem. The task is to discover "important" features by testing whether the model prediction is significantly different from what would be expected if the features were replaced with uninformative counterfactuals. We propose two testing methods: one that provably controls the false discovery rate but which is not yet feasible for large-scale applications, and an approximate testing method which can be applied to real-world data sets. In simulation, both tests have high power relative to existing interpretability methods. When applied to state-of-the-art vision and language models, the framework selects features that intuitively explain model predictions. The resulting explanations have the additional advantage that they are themselves easy to interpret.
Sparks of Science: Hypothesis Generation Using Structured Paper Data
Generating novel and creative scientific hypotheses is a cornerstone in achieving Artificial General Intelligence. Large language and reasoning models have the potential to aid in the systematic creation, selection, and validation of scientifically informed hypotheses. However, current foundation models often struggle to produce scientific ideas that are both novel and feasible. One reason is the lack of a dedicated dataset that frames Scientific Hypothesis Generation (SHG) as a Natural Language Generation (NLG) task. In this paper, we introduce HypoGen, the first dataset of approximately 5500 structured problem-hypothesis pairs extracted from top-tier computer science conferences structured with a Bit-Flip-Spark schema, where the Bit is the conventional assumption, the Spark is the key insight or conceptual leap, and the Flip is the resulting counterproposal. HypoGen uniquely integrates an explicit Chain-of-Reasoning component that reflects the intellectual process from Bit to Flip. We demonstrate that framing hypothesis generation as conditional language modelling, with the model fine-tuned on Bit-Flip-Spark and the Chain-of-Reasoning (and where, at inference, we only provide the Bit), leads to improvements in the overall quality of the hypotheses. Our evaluation employs automated metrics and LLM judge rankings for overall quality assessment. We show that by fine-tuning on our HypoGen dataset we improve the novelty, feasibility, and overall quality of the generated hypotheses. The HypoGen dataset is publicly available at huggingface.co/datasets/UniverseTBD/hypogen-dr1.
Uncertain Evidence in Probabilistic Models and Stochastic Simulators
We consider the problem of performing Bayesian inference in probabilistic models where observations are accompanied by uncertainty, referred to as "uncertain evidence." We explore how to interpret uncertain evidence, and by extension the importance of proper interpretation as it pertains to inference about latent variables. We consider a recently-proposed method "distributional evidence" as well as revisit two older methods: Jeffrey's rule and virtual evidence. We devise guidelines on how to account for uncertain evidence and we provide new insights, particularly regarding consistency. To showcase the impact of different interpretations of the same uncertain evidence, we carry out experiments in which one interpretation is defined as "correct." We then compare inference results from each different interpretation illustrating the importance of careful consideration of uncertain evidence.
MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback
Hypothesis ranking is a crucial component of automated scientific discovery, particularly in natural sciences where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking, relying solely on large language model's internal reasoning without incorporating empirical outcomes from experiments. We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones. However, developing such strategies is challenging due to the impracticality of repeatedly conducting real experiments in natural science domains. To address this, we propose a simulator grounded in three domain-informed assumptions, modeling hypothesis performance as a function of similarity to a known ground truth hypothesis, perturbed by noise. We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator. Building on this simulator, we develop a pseudo experiment-guided ranking method that clusters hypotheses by shared functional characteristics and prioritizes candidates based on insights derived from simulated experimental feedback. Experiments show that our method outperforms pre-experiment baselines and strong ablations.
Halu-J: Critique-Based Hallucination Judge
Large language models (LLMs) frequently generate non-factual content, known as hallucinations. Existing retrieval-augmented-based hallucination detection approaches typically address this by framing it as a classification task, evaluating hallucinations based on their consistency with retrieved evidence. However, this approach usually lacks detailed explanations for these evaluations and does not assess the reliability of these explanations. Furthermore, deficiencies in retrieval systems can lead to irrelevant or partially relevant evidence retrieval, impairing the detection process. Moreover, while real-world hallucination detection requires analyzing multiple pieces of evidence, current systems usually treat all evidence uniformly without considering its relevance to the content. To address these challenges, we introduce Halu-J, a critique-based hallucination judge with 7 billion parameters. Halu-J enhances hallucination detection by selecting pertinent evidence and providing detailed critiques. Our experiments indicate that Halu-J outperforms GPT-4o in multiple-evidence hallucination detection and matches its capability in critique generation and evidence selection. We also introduce ME-FEVER, a new dataset designed for multiple-evidence hallucination detection. Our code and dataset can be found in https://github.com/GAIR-NLP/factool .
Literature Meets Data: A Synergistic Approach to Hypothesis Generation
AI holds promise for transforming scientific processes, including hypothesis generation. Prior work on hypothesis generation can be broadly categorized into theory-driven and data-driven approaches. While both have proven effective in generating novel and plausible hypotheses, it remains an open question whether they can complement each other. To address this, we develop the first method that combines literature-based insights with data to perform LLM-powered hypothesis generation. We apply our method on five different datasets and demonstrate that integrating literature and data outperforms other baselines (8.97\% over few-shot, 15.75\% over literature-based alone, and 3.37\% over data-driven alone). Additionally, we conduct the first human evaluation to assess the utility of LLM-generated hypotheses in assisting human decision-making on two challenging tasks: deception detection and AI generated content detection. Our results show that human accuracy improves significantly by 7.44\% and 14.19\% on these tasks, respectively. These findings suggest that integrating literature-based and data-driven approaches provides a comprehensive and nuanced framework for hypothesis generation and could open new avenues for scientific inquiry.
HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation
There is growing interest in hypothesis generation with large language models (LLMs). However, fundamental questions remain: what makes a good hypothesis, and how can we systematically evaluate methods for hypothesis generation? To address this, we introduce HypoBench, a novel benchmark designed to evaluate LLMs and hypothesis generation methods across multiple aspects, including practical utility, generalizability, and hypothesis discovery rate. HypoBench includes 7 real-world tasks and 5 synthetic tasks with 194 distinct datasets. We evaluate four state-of-the-art LLMs combined with six existing hypothesis-generation methods. Overall, our results suggest that existing methods are capable of discovering valid and novel patterns in the data. However, the results from synthetic datasets indicate that there is still significant room for improvement, as current hypothesis generation methods do not fully uncover all relevant or meaningful patterns. Specifically, in synthetic settings, as task difficulty increases, performance significantly drops, with best models and methods only recovering 38.8% of the ground-truth hypotheses. These findings highlight challenges in hypothesis generation and demonstrate that HypoBench serves as a valuable resource for improving AI systems designed to assist scientific discovery.
Observation of nuclear modification of energy-energy correlators inside jets in heavy ion collisions
Energy-energy correlators are constructed by averaging the number of charged particle pairs within jets, weighted by the product of their transverse momenta, as a function of the angular separation of the particles within a pair. They are sensitive to a multitude of perturbative and nonperturbative quantum chromodynamics phenomena in high-energy particle collisions. Using lead-lead data recorded with the CMS detector, energy-energy correlators inside high transverse momentum jets are measured in heavy ion collisions for the first time. The data are obtained at a nucleon-nucleon center-of-mass energy of 5.02 TeV and correspond to an integrated luminosity of 1.70 nb^{-1}. A similar analysis is done for proton-proton collisions at the same center-of-mass energy to establish a reference. The ratio of lead-lead to proton-proton energy-energy correlators reveals significant jet substructure modifications in the quark-gluon plasma. The results are compared to different models that incorporate either color coherence or medium response effects, where the two effects predict similar substructure modifications.
IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery
The rapid advancement in capabilities of large language models (LLMs) raises a pivotal question: How can LLMs accelerate scientific discovery? This work tackles the crucial first stage of research, generating novel hypotheses. While recent work on automated hypothesis generation focuses on multi-agent frameworks and extending test-time compute, none of the approaches effectively incorporate transparency and steerability through a synergistic Human-in-the-loop (HITL) approach. To address this gap, we introduce IRIS: Interactive Research Ideation System, an open-source platform designed for researchers to leverage LLM-assisted scientific ideation. IRIS incorporates innovative features to enhance ideation, including adaptive test-time compute expansion via Monte Carlo Tree Search (MCTS), fine-grained feedback mechanism, and query-based literature synthesis. Designed to empower researchers with greater control and insight throughout the ideation process. We additionally conduct a user study with researchers across diverse disciplines, validating the effectiveness of our system in enhancing ideation. We open-source our code at https://github.com/Anikethh/IRIS-Interactive-Research-Ideation-System
Addendum to Research MMMCV; A Man/Microbio/Megabio/Computer Vision
In October 2007, a Research Proposal for the University of Sydney, Australia, the author suggested that biovie-physical phenomenon as `electrodynamic dependant biological vision', is governed by relativistic quantum laws and biovision. The phenomenon on the basis of `biovielectroluminescence', satisfies man/microbio/megabio/computer vision (MMMCV), as a robust candidate for physical and visual sciences. The general aim of this addendum is to present a refined text of Sections 1-3 of that proposal and highlighting the contents of its Appendix in form of a `Mechanisms' Section. We then briefly remind in an article aimed for December 2007, by appending two more equations into Section 3, a theoretical II-time scenario as a time model well-proposed for the phenomenon. The time model within the core of the proposal, plays a significant role in emphasizing the principle points on Objectives no. 1-8, Sub-hypothesis 3.1.2, mentioned in Article [arXiv:0710.0410]. It also expresses the time concept in terms of causing quantized energy f(|E|) of time |t|, emit in regard to shortening the probability of particle loci as predictable patterns of particle's un-occurred motion, a solution to Heisenberg's uncertainty principle (HUP) into a simplistic manner. We conclude that, practical frames via a time algorithm to this model, fixates such predictable patterns of motion of scenery bodies onto recordable observation points of a MMMCV system. It even suppresses/predicts superposition phenomena coming from a human subject and/or other bio-subjects for any decision making event, e.g., brainwave quantum patterns based on vision. Maintaining the existential probability of Riemann surfaces of II-time scenarios in the context of biovielectroluminescence, makes motion-prediction a possibility.
Model-agnostic search for the quasinormal modes of gravitational wave echoes
Post-merger gravitational wave echoes provide a unique opportunity to probe the near-horizon structure of astrophysical black holes, that may be modified due to non-perturbative quantum gravity phenomena. However, since the waveform is subject to large theoretical uncertainties, it is necessary to develop model-agnostic search methods for detecting echoes from observational data. A promising strategy is to identify the characteristic quasinormal modes (QNMs) associated with echoes, {\it in frequency space}, which complements existing searches of quasiperiodic pulses in time. In this study, we build upon our previous work targeting these modes by incorporating relative phase information to optimize the Bayesian search algorithm. Using a new phase-marginalized likelihood, the performance can be significantly improved for well-resolved QNMs. This enables an efficient model-agnostic search for QNMs of different shapes by using a simple search template. To demonstrate the robustness of the search algorithm, we construct four complementary benchmarks for the echo waveform that span a diverse range of different theoretical possibilities for the near-horizon structure. We then validate our Bayesian search algorithms by injecting the benchmark models into different realizations of Gaussian noise. Using two types of phase-marginalized likelihoods, we find that the search algorithm can efficiently detect the corresponding QNMs. Therefore, our search strategy provides a concrete Bayesian and model-agnostic approach to "quantum black hole seismology".
Cosmic Multipoles in Galaxy Surveys Part I: How Inferences Depend on Source Counts and Masks
We present a new approach to constructing and fitting dipoles and higher-order multipoles in synthetic galaxy samples over the sky. Within our Bayesian paradigm, we illustrate that this technique is robust to masked skies, allowing us to make credible inferences about the relative contributions of each multipole. We also show that dipoles can be recovered in surveys with small footprints, determining the requisite source counts required for concrete estimation of the dipole parameters. This work is motivated by recent probes of the cosmic dipole in galaxy catalogues. Namely, the kinematic dipole of the Cosmic Microwave Background, as arising from the motion of our heliocentric frame at approx 370 km,s^{-1}, implies that an analogous dipole should be observed in the number counts of galaxies in flux-density-limited samples. Recent studies have reported a dipole aligning with the kinematic dipole but with an anomalously large amplitude. Accordingly, our new technique will be important as forthcoming galaxy surveys are made available and for revisiting previous data.
Early Warning Signals and the Prosecutor's Fallacy
Early warning signals have been proposed to forecast the possibility of a critical transition, such as the eutrophication of a lake, the collapse of a coral reef, or the end of a glacial period. Because such transitions often unfold on temporal and spatial scales that can be difficult to approach by experimental manipulation, research has often relied on historical observations as a source of natural experiments. Here we examine a critical difference between selecting systems for study based on the fact that we have observed a critical transition and those systems for which we wish to forecast the approach of a transition. This difference arises by conditionally selecting systems known to experience a transition of some sort and failing to account for the bias this introduces -- a statistical error often known as the Prosecutor's Fallacy. By analysing simulated systems that have experienced transitions purely by chance, we reveal an elevated rate of false positives in common warning signal statistics. We further demonstrate a model-based approach that is less subject to this bias than these more commonly used summary statistics. We note that experimental studies with replicates avoid this pitfall entirely.
Efficient Massive Black Hole Binary parameter estimation for LISA using Sequential Neural Likelihood
The inspiral, merger, and ringdown of Massive Black Hole Binaries (MBHBs) is one the main sources of Gravitational Waves (GWs) for the future Laser Interferometer Space Antenna (LISA), an ESA-led mission in the implementation phase. It is expected that LISA will detect these systems throughout the entire observable universe. Robust and efficient data analysis algorithms are necessary to detect and estimate physical parameters for these systems. In this work, we explore the application of Sequential Neural Likelihood, a simulation-based inference algorithm, to detect and characterize MBHB GW signals in synthetic LISA data. We describe in detail the different elements of the method, their performance and possible alternatives that can be used to enhance the performance. Instead of sampling from the conventional likelihood function, which requires a forward simulation for each evaluation, this method constructs a surrogate likelihood that is ultimately described by a neural network trained from a dataset of simulations of the MBHB signals and noise. One important advantage of this method is that, given that the likelihood is independent of the priors, we can iteratively train models that target specific observations in a fraction of the time and computational cost that other traditional and machine learning-based strategies would require. Because of the iterative nature of the method, we are able to train models to obtain qualitatively similar posteriors with less than 2\% of the simulator calls that Markov Chain Monte Carlo methods would require. We compare these posteriors with those obtained from Markov Chain Monte Carlo techniques and discuss the differences that appear, in particular in relation with the important role that data compression has in the modular implementation of the method that we present. We also discuss different strategies to improve the performance of the algorithms.
BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering
Large language models (LLMs) have demonstrated strong reasoning capabilities. Nevertheless, they still suffer from factual errors when tackling knowledge-intensive tasks. Retrieval-augmented reasoning represents a promising approach. However, significant challenges still persist, including inaccurate and insufficient retrieval for complex questions, as well as difficulty in integrating multi-source knowledge. To address this, we propose Beam Aggregation Reasoning, BeamAggR, a reasoning framework for knowledge-intensive multi-hop QA. BeamAggR explores and prioritizes promising answers at each hop of question. Concretely, we parse the complex questions into trees, which include atom and composite questions, followed by bottom-up reasoning. For atomic questions, the LLM conducts reasoning on multi-source knowledge to get answer candidates. For composite questions, the LLM combines beam candidates, explores multiple reasoning paths through probabilistic aggregation, and prioritizes the most promising trajectory. Extensive experiments on four open-domain multi-hop reasoning datasets show that our method significantly outperforms SOTA methods by 8.5%. Furthermore, our analysis reveals that BeamAggR elicits better knowledge collaboration and answer aggregation.
Reconstruction of inclined extensive air showers using radio signals: from arrival times and amplitudes to direction and energy
Radio detection is now an established technique for the study of ultra-high-energy (UHE) cosmic rays with energies above sim10^{17} eV. The next-generation of radio experiments aims to extend this technique to the observation of UHE earth-skimming neutrinos, which requires the detection of very inclined extensive air showers (EAS). In this article we present a new reconstruction method for the arrival direction and the energy of EAS. It combines a point-source-like description of the radio wavefront with a phenomenological model: the Angular Distribution Function (ADF). The ADF describes the angular distribution of the radio signal amplitude in the 50-200 MHz frequency range, with a particular focus on the Cherenkov angle, a crucial feature of the radio amplitude pattern. The method is applicable to showers with zenith angles larger than 60^circ, and in principle up to neutrino-induced showers with up-going trajectories. It is tested here on a simulated data set of EAS induced by cosmic rays. A resolution better than 4 arc-minutes (0.07^circ) is achieved on arrival direction, as well as an intrinsic resolution of 5% on the electromagnetic energy, and around 15% on the primary energy.
Extension of the J-PARC Hadron Experimental Facility: Third White Paper
The J-PARC Hadron Experimental Facility was constructed with an aim to explore the origin and evolution of matter in the universe through the experiments with intense particle beams. In the past decade, many results on particle and nuclear physics have been obtained at the present facility. To expand the physics programs to unexplored regions never achieved, the extension project of the Hadron Experimental Facility has been extensively discussed. This white paper presents the physics of the extension of the Hadron Experimental Facility for resolving the issues in the fields of the strangeness nuclear physics, hadron physics, and flavor physics.
MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses
Scientific discovery contributes largely to human society's prosperity, and recent progress shows that LLMs could potentially catalyze this process. However, it is still unclear whether LLMs can discover novel and valid hypotheses in chemistry. In this work, we investigate this central research question: Can LLMs automatically discover novel and valid chemistry research hypotheses given only a chemistry research background (consisting of a research question and/or a background survey), without limitation on the domain of the research question? After extensive discussions with chemistry experts, we propose an assumption that a majority of chemistry hypotheses can be resulted from a research background and several inspirations. With this key insight, we break the central question into three smaller fundamental questions. In brief, they are: (1) given a background question, whether LLMs can retrieve good inspirations; (2) with background and inspirations, whether LLMs can lead to hypothesis; and (3) whether LLMs can identify good hypotheses to rank them higher. To investigate these questions, we construct a benchmark consisting of 51 chemistry papers published in Nature, Science, or a similar level in 2024 (all papers are only available online since 2024). Every paper is divided by chemistry PhD students into three components: background, inspirations, and hypothesis. The goal is to rediscover the hypothesis, given only the background and a large randomly selected chemistry literature corpus consisting the ground truth inspiration papers, with LLMs trained with data up to 2023. We also develop an LLM-based multi-agent framework that leverages the assumption, consisting of three stages reflecting the three smaller questions. The proposed method can rediscover many hypotheses with very high similarity with the ground truth ones, covering the main innovations.
Maximising information from weak lensing galaxy surveys
Weak lensing galaxy surveys are currently undergoing a dramatic revolution as the dawn of the Stage-IV surveys are upon us. Hence, ensuring that our analysis methods are as accurate and precise as the raw data is of upmost importance. This motivated the development of a new implementation of the quadratic maximum likelihood power spectrum estimation technique, the application of the theoretical uncertainties approach to mitigate baryonic feedback biases, and to re-evaluating the criterion from which binary scale cuts are derived when aiming to eliminate baryonic biases. These techniques maximise the available information from weak lensing observations while minimising potential systematic biases, and shows how this PhD thesis contributes to the advancement of weak lensing cosmology.
Massive MIMO Beam Management in Sub-6 GHz 5G NR
Beam codebooks are a new feature of massive multiple-input multiple-output (M-MIMO) in 5G new radio (NR). Codebooks comprised of beamforming vectors are used to transmit reference signals and obtain limited channel state information (CSI) from receivers via the codeword index. This enables large arrays that cannot otherwise obtain sufficient CSI. The performance, however, is limited by the codebook design. In this paper, we show that machine learning can be used to train site-specific codebooks for initial access. We design a neural network based on an autoencoder architecture that uses a beamspace observation in combination with RF environment characteristics to improve the synchronization signal (SS) burst codebook. We test our algorithm using a flexible dataset of channels generated from QuaDRiGa. The results show that our model outperforms the industry standard (DFT beams) and approaches the optimal performance (perfect CSI and singular value decomposition (SVD)-based beamforming), using only a few bits of feedback.
Beyond Symmetries : Anomalies in Transverse Ward--Takahashi Identities
Anomalies in transverse Ward--Takahashi identities are studied, allowing discussion of the feasibility of anomalies arising in general non-symmetry Ward--Takahashi identities. We adopt the popular Fujikawa's method and rigorous dimensional renormalization to verify the existence of transverse anomalies to one-loop order and any loop order, respectively. The arbitrariness of coefficients of transverse anomalies is revealed, and a way out is also proposed after relating transverse anomalies to Schwinger terms and comparing symmetry and non-symmetry anomalies. Papers that claim the non-existence of transverse anomalies are reviewed to find anomalies hidden in their approaches. The role played by transverse anomalies is discussed.
Detecting LHC Neutrinos at Surface Level
The first direct detection of neutrinos at the LHC not only marks the beginning of a novel collider neutrino program at CERN but also motivates considering additional neutrino detectors to fully exploit the associated physics potential. We investigate the feasibility and physics potential of neutrino experiments located at the surface-level. A topographic desk study was performed to identify all points at which the LHC's neutrino beams exit the earth. The closest location lies about 9 km east of the CMS interaction point, at the bottom of Lake Geneva. Several detectors to be placed at this location are considered, including a water Cherenkov detector and an emulsion detector. The detector concepts are introduced, and projections for their contribution to the LHC forward neutrino program and searches for dark sector particles are presented. However, the dilution of the neutrino flux over distance reduces the neutrino yield significantly, limiting the physics potential of surface-level detectors compared to ones closer to the interaction point, including the proposed FPF.
Fast kernel methods for Data Quality Monitoring as a goodness-of-fit test
We here propose a machine learning approach for monitoring particle detectors in real-time. The goal is to assess the compatibility of incoming experimental data with a reference dataset, characterising the data behaviour under normal circumstances, via a likelihood-ratio hypothesis test. The model is based on a modern implementation of kernel methods, nonparametric algorithms that can learn any continuous function given enough data. The resulting approach is efficient and agnostic to the type of anomaly that may be present in the data. Our study demonstrates the effectiveness of this strategy on multivariate data from drift tube chamber muon detectors.
The Tracking Machine Learning challenge : Throughput phase
This paper reports on the second "Throughput" phase of the Tracking Machine Learning (TrackML) challenge on the Codalab platform. As in the first "Accuracy" phase, the participants had to solve a difficult experimental problem linked to tracking accurately the trajectory of particles as e.g. created at the Large Hadron Collider (LHC): given O(10^5) points, the participants had to connect them into O(10^4) individual groups that represent the particle trajectories which are approximated helical. While in the first phase only the accuracy mattered, the goal of this second phase was a compromise between the accuracy and the speed of inference. Both were measured on the Codalab platform where the participants had to upload their software. The best three participants had solutions with good accuracy and speed an order of magnitude faster than the state of the art when the challenge was designed. Although the core algorithms were less diverse than in the first phase, a diversity of techniques have been used and are described in this paper. The performance of the algorithms are analysed in depth and lessons derived.
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
There is growing excitement about the potential of Language Models (LMs) to accelerate scientific discovery. Falsifying hypotheses is key to scientific progress, as it allows claims to be iteratively refined over time. This process requires significant researcher effort, reasoning, and ingenuity. Yet current benchmarks for LMs predominantly assess their ability to generate solutions rather than challenge them. We advocate for developing benchmarks that evaluate this inverse capability - creating counterexamples for subtly incorrect solutions. To demonstrate this approach, we start with the domain of algorithmic problem solving, where counterexamples can be evaluated automatically using code execution. Specifically, we introduce REFUTE, a dynamically updating benchmark that includes recent problems and incorrect submissions from programming competitions, where human experts successfully identified counterexamples. Our analysis finds that the best reasoning agents, even OpenAI o3-mini (high) with code execution feedback, can create counterexamples for only <9% of incorrect solutions in REFUTE, even though ratings indicate its ability to solve up to 48% of these problems from scratch. We hope our work spurs progress in evaluating and enhancing LMs' ability to falsify incorrect solutions - a capability that is crucial for both accelerating research and making models self-improve through reliable reflective reasoning.
Neutron capture measurements for s-process nucleosynthesis; A review about CERN n_TOF developments and contributions
This article presents a review about the main CERN n\_TOF contributions to the field of neutron-capture experiments of interest for s-process nucleosynthesis studies over the last 25 years, with special focus on the measurement of radioactive isotopes. A few recent capture experiments on stable isotopes of astrophysical interest are also discussed. Results on s-process branching nuclei are appropriate to illustrate how advances in detection systems and upgrades in the facility have enabled increasingly challenging experiments and, as a consequence, have led to a better understanding and modeling of the s-process mechanism of nucleosynthesis. New endeavors combining radioactive-ion beams from ISOLDE for the production of radioisotopically pure samples for activation experiments at the new NEAR facility at n\_TOF are briefly discussed. On the basis of these new exciting results, also current limitations of state-of-the-art TOF and activation techniques will be depicted, thereby showing the pressing need for further upgrades and enhancements on both facilities and detection systems. A brief account of the potential technique based on inverse kinematics for direct neutron-capture measurements is also presented.
Language Models Do Not Follow Occam's Razor: A Benchmark for Inductive and Abductive Reasoning
Reasoning is a core capability in artificial intelligence systems, for which large language models (LLMs) have recently shown remarkable progress. However, most work focuses exclusively on deductive reasoning, which is problematic since other types of reasoning are also essential in solving real-world problems, and they are less explored. This work focuses on evaluating LLMs' inductive and abductive reasoning capabilities. We introduce a programmable and synthetic dataset, InAbHyD (pronounced in-a-bid), where each reasoning example consists of an incomplete world model and a set of observations. The task for the intelligent agent is to produce hypotheses to explain observations under the incomplete world model to solve each reasoning example. We propose a new metric to evaluate the quality of hypotheses based on Occam's Razor. We evaluate and analyze some state-of-the-art LLMs. Our analysis shows that LLMs can perform inductive and abductive reasoning in simple scenarios, but struggle with complex world models and producing high-quality hypotheses, even with popular reasoning-enhancing techniques such as in-context learning and RLVR.
Large Language Models for Automated Open-domain Scientific Hypotheses Discovery
Hypothetical induction is recognized as the main reasoning type when scientists make observations about the world and try to propose hypotheses to explain those observations. Past research on hypothetical induction is under a constrained setting: (1) the observation annotations in the dataset are carefully manually handpicked sentences (resulting in a close-domain setting); and (2) the ground truth hypotheses are mostly commonsense knowledge, making the task less challenging. In this work, we tackle these problems by proposing the first dataset for social science academic hypotheses discovery, with the final goal to create systems that automatically generate valid, novel, and helpful scientific hypotheses, given only a pile of raw web corpus. Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses even new to humanity. A multi-module framework is developed for the task, including three different feedback mechanisms to boost performance, which exhibits superior performance in terms of both GPT-4 based and expert-based evaluation. To the best of our knowledge, this is the first work showing that LLMs are able to generate novel (''not existing in literature'') and valid (''reflecting reality'') scientific hypotheses.
Enhancing the significance of astrophysical events with multimessenger coincidences
Coincident multimessenger observations of cosmic sources can offer numerous benefits, especially when used in the context of synergistic astrophysics. One significant advantage is enhancing the detection significance of separate detectors by correlating their data and assuming joint emission. We have formulated an approach for updating the Bayesian posterior probability of an astrophysical origin, namely p_{rm astro}, relying on multimessenger coincidences assuming an emission model. The description is applicable to any combination of messengers. We demonstrated the formalism for the gravitational waves and high-energy neutrinos case. Applying our method to the public data of candidate coincident high-energy neutrinos with subthreshold gravitational-wave triggers, we found that in the case of highly energetic neutrino coincidences, p_{rm astro} can increase from approximately sim 0.1 to sim 0.9. The amount of improvement depends on the assumed joint emission model. If models are trusted, the marked improvement makes subthreshold detections much more confident. Moreover, the model dependency can also be used to test the consistency of different models. This work is a crucial step toward the goal of uniting all detectors on equal footing into a statistically integrated, Earth-sized observatory for comprehensive multimessenger astrophysics.
BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery
Understanding the world and explaining it with scientific theories is a central aspiration of artificial intelligence research. Proposing theories, designing experiments to test them, and then revising them based on data are fundamental to scientific discovery. Despite the significant promise of LLM-based scientific agents, no benchmarks systematically test LLM's ability to propose scientific models, collect experimental data, and revise them in light of new data. We introduce BoxingGym, a benchmark with 10 environments for systematically evaluating both experimental design (e.g. collecting data to test a scientific theory) and model discovery (e.g. proposing and revising scientific theories). To enable tractable and quantitative evaluation, we implement each environment as a generative probabilistic model with which a scientific agent can run interactive experiments. These probabilistic models are drawn from various real-world scientific domains ranging from psychology to ecology. To quantitatively evaluate a scientific agent's ability to collect informative experimental data, we compute the expected information gain (EIG), an information-theoretic quantity which measures how much an experiment reduces uncertainty about the parameters of a generative model. A good scientific theory is a concise and predictive explanation. Therefore, to quantitatively evaluate model discovery, we ask a scientific agent to explain their model and then assess whether this explanation enables another scientific agent to make reliable predictions about this environment. In addition to this explanation-based evaluation, we compute standard model evaluation metrics such as prediction errors. We find that current LLMs, such as GPT-4o, struggle with both experimental design and model discovery. We find that augmenting the LLM-based agent with an explicit statistical model does not reliably improve these results.
A Survey on Hypothesis Generation for Scientific Discovery in the Era of Large Language Models
Hypothesis generation is a fundamental step in scientific discovery, yet it is increasingly challenged by information overload and disciplinary fragmentation. Recent advances in Large Language Models (LLMs) have sparked growing interest in their potential to enhance and automate this process. This paper presents a comprehensive survey of hypothesis generation with LLMs by (i) reviewing existing methods, from simple prompting techniques to more complex frameworks, and proposing a taxonomy that categorizes these approaches; (ii) analyzing techniques for improving hypothesis quality, such as novelty boosting and structured reasoning; (iii) providing an overview of evaluation strategies; and (iv) discussing key challenges and future directions, including multimodal integration and human-AI collaboration. Our survey aims to serve as a reference for researchers exploring LLMs for hypothesis generation.
A Heavy-Metal Scenario of Ultra-High-Energy Cosmic Rays
The mass composition of ultra-high-energy cosmic rays is an open problem in astroparticle physics. It is usually inferred from the depth of the shower maximum (Xmax) of cosmic-ray showers, which is only ambiguously determined by modern hadronic interaction models. We examine a data-driven scenario, in which we consider the expectation value of Xmax as a free parameter. We test the novel hypothesis whether the cosmic-ray data from the Pierre Auger Observatory can be interpreted in a consistent picture, under the assumption that the mass composition of cosmic rays at the highest energies is dominated by high metallicity, resulting in pure iron nuclei at energies above ~40 EeV. We investigate the implications on astrophysical observations and hadronic interactions, and we discuss the global consistency of the data assuming this heavy-metal scenario. We conclude that the data from the Pierre Auger Observatory can be interpreted consistently if the expectation values for Xmax from modern hadronic interaction models are shifted to larger values.
Characterising the Surface Resistance of Laser-Treated LHC Beam Screens with the Shielded Pair Method
The presence of strong electron clouds in the quadrupole magnetic field regions of the Large Hadron Collider (LHC) leads to considerable heating that poses challenges for the cryogenic cooling system, and under certain conditions to proton beam quality deterioration. Research is being conducted on laser-treated inner beam screen surfaces for the upgraded High-Luminosity LHC to mitigate this issue. Laser-induced surface structuring, a technique that effectively roughens surfaces, has been shown to reduce secondary electron emission; an essential factor in controlling electron cloud formation. Conversely, the resulting surface roughening also alters the material's surface impedance, potentially impacting beam stability and increasing beam-induced resistive wall heating. Different laser treatment patterns have been applied to LHC beam screens to estimate this potential impact and assessed for their microwave responses.
The Virtual Quantum Optics Laboratory
We present a web-based software tool, the Virtual Quantum Optics Laboratory (VQOL), that may be used for designing and executing realistic simulations of quantum optics experiments. A graphical user interface allows one to rapidly build and configure a variety of different optical experiments, while the runtime environment provides unique capabilities for visualization and analysis. All standard linear optical components are available as well as sources of thermal, coherent, and entangled Gaussian states. A unique aspect of VQOL is the introduction of non-Gaussian measurements using detectors modeled as deterministic devices that "click" when the amplitude of the light falls above a given threshold. We describe the underlying theoretical models and provide several illustrative examples. We find that VQOL provides a a faithful representation of many experimental quantum optics phenomena and may serve as both a useful instructional tool for students as well as a valuable research tool for practitioners.
Likelihood Reconstruction for Radio Detectors of Neutrinos and Cosmic Rays
Ultra-high-energy neutrinos and cosmic rays are excellent probes of astroparticle physics phenomena. For astroparticle physics analyses, robust and accurate reconstruction of signal parameters such as arrival direction and energy is essential. Radio detection is an established detector concept explored by many observatories; however, current reconstruction methods ignore bin-to-bin noise correlations, which limits reconstruction resolution and, so far, has prevented calculations of event-by-event uncertainties. In this work, we present a likelihood description of neutrino or cosmic-ray signals in radio detectors with correlated noise, as present in all neutrino and cosmic-ray radio detectors. We demonstrate, with simulation studies of both neutrinos and cosmic-ray radio signals, that signal parameters such as energy and direction, including event-by-event uncertainties with correct coverage, can be obtained. This method reduces reconstruction uncertainties and biases compared to previous approaches. Additionally, the Likelihood can be used for event selection and enables differentiable end-to-end detector optimization. The reconstruction code is available through the open-source software NuRadioReco.
The Linear Representation Hypothesis and the Geometry of Large Language Models
Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does "linear representation" actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of "linear representation", one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.
Preserving Statistical Validity in Adaptive Data Analysis
A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses. In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of m adaptively chosen functions on an unknown distribution given n random samples. We show that, surprisingly, there is a way to estimate an exponential in n number of expectations accurately even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.
Diquark Correlations in Hadron Physics: Origin, Impact and Evidence
The last decade has seen a marked shift in how the internal structure of hadrons is understood. Modern experimental facilities, new theoretical techniques for the continuum bound-state problem and progress with lattice-regularised QCD have provided strong indications that soft quark+quark (diquark) correlations play a crucial role in hadron physics. For example, theory indicates that the appearance of such correlations is a necessary consequence of dynamical chiral symmetry breaking, viz. a corollary of emergent hadronic mass that is responsible for almost all visible mass in the universe; experiment has uncovered signals for such correlations in the flavour-separation of the proton's electromagnetic form factors; and phenomenology suggests that diquark correlations might be critical to the formation of exotic tetra- and penta-quark hadrons. A broad spectrum of such information is evaluated herein, with a view to consolidating the facts and therefrom moving toward a coherent, unified picture of hadron structure and the role that diquark correlations might play.
The Quest for the Origins of Ultra-High-Energy Cosmic Rays
Significant progress has been made over the past decades towards unveiling the sources of the most energetic particles in nature, the ultra-high-energy cosmic rays (UHECRs). Despite these advancements, the exact astrophysical sites capable of accelerating these particles to such extreme energies remain largely unknown. Moreover, the mechanisms by which they achieve these extreme energies are poorly understood. Here, I provide a concise overview of the theory underlying the acceleration and propagation of UHECRs. I then critically discuss three recent results that could help unveil their origins: the reported excess around Centaurus A, the correlation with starburst galaxies, and the efforts to jointly model the energy spectrum, composition, and arrival directions. Finally, I discuss strategies for advancing this field, emphasising the need for refined theoretical models, the challenges in building them, and the potential for new observatories to shed light on the mysteries of UHECRs.
How well do SOTA legal reasoning models support abductive reasoning?
We examine how well the state-of-the-art (SOTA) models used in legal reasoning support abductive reasoning tasks. Abductive reasoning is a form of logical inference in which a hypothesis is formulated from a set of observations, and that hypothesis is used to explain the observations. The ability to formulate such hypotheses is important for lawyers and legal scholars as it helps them articulate logical arguments, interpret laws, and develop legal theories. Our motivation is to consider the belief that deep learning models, especially large language models (LLMs), will soon replace lawyers because they perform well on tasks related to legal text processing. But to do so, we believe, requires some form of abductive hypothesis formation. In other words, while LLMs become more popular and powerful, we want to investigate their capacity for abductive reasoning. To pursue this goal, we start by building a logic-augmented dataset for abductive reasoning with 498,697 samples and then use it to evaluate the performance of a SOTA model in the legal field. Our experimental results show that although these models can perform well on tasks related to some aspects of legal text processing, they still fall short in supporting abductive reasoning tasks.
Lamarr: LHCb ultra-fast simulation based on machine learning models deployed within Gauss
About 90% of the computing resources available to the LHCb experiment has been spent to produce simulated data samples for Run 2 of the Large Hadron Collider at CERN. The upgraded LHCb detector will be able to collect larger data samples, requiring many more simulated events to analyze the data to be collected in Run 3. Simulation is a key necessity of analysis to interpret signal, reject background and measure efficiencies. The needed simulation will far exceed the pledged resources, requiring an evolution in technologies and techniques to produce these simulated data samples. In this contribution, we discuss Lamarr, a Gaudi-based framework to speed-up the simulation production parameterizing both the detector response and the reconstruction algorithms of the LHCb experiment. Deep Generative Models powered by several algorithms and strategies are employed to effectively parameterize the high-level response of the single components of the LHCb detector, encoding within neural networks the experimental errors and uncertainties introduced in the detection and reconstruction phases. Where possible, models are trained directly on real data, statistically subtracting any background components by applying appropriate reweighing procedures. Embedding Lamarr in the general LHCb Gauss Simulation framework allows to combine its execution with any of the available generators in a seamless way. The resulting software package enables a simulation process independent of the detailed simulation used to date.
Is your stochastic signal really detectable?
Separating a stochastic gravitational wave background (SGWB) from noise is a challenging statistical task. One approach to establishing a detection criterion for the SGWB is using Bayesian evidence. If the evidence ratio (Bayes factor) between models with and without the signal exceeds a certain threshold, the signal is considered detected. We present a formalism to compute the averaged Bayes factor, incorporating instrumental-noise and SGWB uncertainties. As an example, we consider the case of power-law-shaped SGWB in LISA and generate the corresponding bayesian sensitivity curve. Unlike existing methods in the literature, which typically neglect uncertainties in both the signal and noise, our approach provides a reliable and realistic alternative. This flexible framework opens avenues for more robust stochastic gravitational wave background detection across gravitational-wave experiments.
Weak localization in radiative transfer of acoustic waves in a randomly-fluctuating slab
This paper concerns the derivation of radiative transfer equations for acoustic waves propagating in a randomly fluctuating slab (between two parallel planes) in the weak-scattering regime, and the study of boundary effects through an asymptotic analysis of the Wigner transform of the wave solution. These radiative transfer equations allow to model the transport of wave energy density, taking into account the scattering by random heterogeneities. The approach builds on the method of images, where the slab is extended to a full-space, with a periodic map of mechanical properties and a series of sources located along a periodic pattern. Two types of boundary effects, both on the (small) scale of the wavelength, are observed: one at the boundaries of the slab, and one inside the domain. The former impact the entire energy density (coherent as well as incoherent) and is also observed in half-spaces. The latter, more specific to slabs, corresponds to the constructive interference of waves that have reflected at least twice on the boundaries of the slab and only impacts the coherent part of the energy density.
Bayesian Updates Compose Optically
Bayes' rule tells us how to invert a causal process in order to update our beliefs in light of new evidence. If the process is believed to have a complex compositional structure, we may ask whether composing the inversions of the component processes gives the same belief update as the inversion of the whole. We answer this question affirmatively, showing that the relevant compositional structure is precisely that of the lens pattern, and that we can think of Bayesian inversion as a particular instance of a state-dependent morphism in a corresponding fibred category. We define a general notion of (mixed) Bayesian lens, and discuss the (un)lawfulness of these lenses when their contravariant components are exact Bayesian inversions. We prove our main result both abstractly and concretely, for both discrete and continuous states, taking care to illustrate the common structures.
Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation
The rapid growth of biomedical knowledge has outpaced our ability to efficiently extract insights and generate novel hypotheses. Large language models (LLMs) have emerged as a promising tool to revolutionize knowledge interaction and potentially accelerate biomedical discovery. In this paper, we present a comprehensive evaluation of LLMs as biomedical hypothesis generators. We construct a dataset of background-hypothesis pairs from biomedical literature, carefully partitioned into training, seen, and unseen test sets based on publication date to mitigate data contamination. Using this dataset, we assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings. To enhance the exploration of uncertainty, a crucial aspect of scientific discovery, we incorporate tool use and multi-agent interactions in our evaluation framework. Furthermore, we propose four novel metrics grounded in extensive literature review to evaluate the quality of generated hypotheses, considering both LLM-based and human assessments. Our experiments yield two key findings: 1) LLMs can generate novel and validated hypotheses, even when tested on literature unseen during training, and 2) Increasing uncertainty through multi-agent interactions and tool use can facilitate diverse candidate generation and improve zero-shot hypothesis generation performance. However, we also observe that the integration of additional knowledge through few-shot learning and tool use may not always lead to performance gains, highlighting the need for careful consideration of the type and scope of external knowledge incorporated. These findings underscore the potential of LLMs as powerful aids in biomedical hypothesis generation and provide valuable insights to guide further research in this area.
Novel |V_{cb}| extraction method via boosted bc-tagging with in-situ calibration
We present a novel method for measuring |V_{cb}| at the LHC using an advanced boosted-jet tagger to identify "bc signatures". By associating boosted W rightarrow bc signals with bc-matched jets from top-quark decays, we enable an in-situ calibration of the tagger. This approach significantly suppresses backgrounds while reducing uncertainties in flavor tagging efficiencies -- key to improving measurement precision. Our study is enabled by the development of realistic, AI-based large- and small-radius taggers, Sophon and the newly introduced SophonAK4, validated to match ATLAS and CMS's state-of-the-art taggers. The method complements the conventional small radius jet approach and enables a ~30% improvement in |V_{cb}| precision under HL-LHC projections. As a byproduct, it enhances H^{pm} rightarrow bc search sensitivity by a factor of 2--5 over the recent ATLAS result based on Run 2 data. Our work offers a new perspective for the precision |V_{cb}| measurement and highlights the potential of using advanced tagging models to probe unexplored boosted regimes at the LHC.
Interpretation of excess in H to Z γ using a light axion-like particle
We interpret the recent excess in a rare decay of the Higgs boson, Hto Zgamma, using a light axion-like particle (ALP) in the massrange 0.05 - 0.1 GeV.The dominant decay of such a light ALP is into a pair of collimated photons, whose decay is required to happen before reaching the ECAL detector, such that it mimics a single photon in the detector. It can explain the excess with a coupling C^{rm eff}_{aZH} / Lambda sim 4 times 10^{-5};{rm GeV}^{-1}, while the decay of the ALP before reaching the ECAL requires the diphoton coupling C^{rm eff}_{gammagamma}/ Lambda ge 0.35 ,{rm TeV}^{-1} (0.1,{rm eV}/m_a)^2. A potential test would be the rare decay of the Z boson Z to a H^* to a (b bar b) at the Tera-Z option of the future FCC and CEPC. However, it has a branching ratio of only O(10^{-12}), and thus barely testable. The production cross section for pp to Z^* to a H via the same coupling C^{rm eff}_{aZH} / Lambda at the LHC is too small for detection.
mini-TimeCube as a Neutron Scatter Camera
We present Monte Carlo (MC) simulation results from a study of a compact plastic-scintillator detector suitable for imaging fast neutrons in the 1 -- 10 MeV energy range: the miniTimeCube (mTC). Originally designed for antineutrino detection, the mTC consists of 24 MultiChannel Plate (MCP) photodetectors surrounding a 13 cm cube of boron-doped plastic scintillator. Our simulation results show that waveform digitization of 1536 optically sensitive channels surrounding the scintillator should allow for spatiotemporal determination of individual neutron-proton scatters in the detector volume to thicksim100 picoseconds and thicksim5 mm. A Bayesian estimation framework is presented for multiple-scatter reconstruction, and is used to estimate the incoming direction and energy of simulated individual neutrons. Finally, we show how populations of reconstructed neutrons can be used to estimate the direction and energy spectrum of nearby simulated neutron sources.
Solar System Experiments in the Search for Dark Energy and Dark Matter
We reassess the realistic discovery reach of Solar-System experiments for dark energy (DE) and dark matter (DM), making explicit the bridge from cosmology-level linear responses to local, screened residuals. In scalar-tensor frameworks with a universal conformal coupling A(phi) and chameleon/Vainshtein screening, we map cosmological responses {mu(z,k),Sigma(z,k)} inferred by DESI and Euclid to thin-shell or Vainshtein residuals in deep Solar potentials Phi_N. We emphasize a two-branch strategy. In a detection-first branch, a verified local anomaly -- an Einstein equivalence principle (EEP) violation, a Shapiro-delay signal with |gamma-1|simfewtimes 10^{-6}, an AU-scale Yukawa tail, or a ultralight DM (ULDM) line in clocks/atom interferometers in space (AIS) -- triggers a joint refit of cosmology and Solar-System data under a common microphysical parameterization {V(phi),A(phi)}. In a guardrail branch, Solar-System tests enforce constraints (EEP; PPN parameters gamma,beta; and dot G/G) and close unscreened or weakly screened corners indicated by cosmology. We forecast, per conjunction, |gamma-1|lesssim (2-5)times 10^{-6} (Ka-/X-band or optical Shapiro), eta_{EEP}sim (1--10)times 10^{-17} (drag-free AIS), |dot G/G|sim(3-5)times10^{-15},yr^{-1} (sub-mm-class LLR), a uniform ~2x tightening of AU-scale Yukawa/DM-density bounds, and (3-10)times improved ULDM-coupling reach from clocks. For a conformal benchmark, mu_{ lin,0}=0.10 implies chisimeq mu_{lin,0/2} and a Sun thin shell Delta R/Rlesssim (1/3chi)|gamma-1|/2=2.4times 10^{-3} at |gamma-1|=5times 10^{-6}; Vainshtein screening at 1 AU yields |gamma-1|lesssim 10^{-11}, naturally below near-term reach. We recommend a cost-effective guardrail+discovery portfolio with explicit triggers for escalation to dedicated missions.
HYPRO: A Hybridly Normalized Probabilistic Model for Long-Horizon Prediction of Event Sequences
In this paper, we tackle the important yet under-investigated problem of making long-horizon prediction of event sequences. Existing state-of-the-art models do not perform well at this task due to their autoregressive structure. We propose HYPRO, a hybridly normalized probabilistic model that naturally fits this task: its first part is an autoregressive base model that learns to propose predictions; its second part is an energy function that learns to reweight the proposals such that more realistic predictions end up with higher probabilities. We also propose efficient training and inference algorithms for this model. Experiments on multiple real-world datasets demonstrate that our proposed HYPRO model can significantly outperform previous models at making long-horizon predictions of future events. We also conduct a range of ablation studies to investigate the effectiveness of each component of our proposed methods.
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-o1 to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-o1-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference time scaling method, LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
DeepA2: A Modular Framework for Deep Argument Analysis with Pretrained Neural Text2Text Language Models
In this paper, we present and implement a multi-dimensional, modular framework for performing deep argument analysis (DeepA2) using current pre-trained language models (PTLMs). ArgumentAnalyst -- a T5 model (Raffel et al. 2020) set up and trained within DeepA2 -- reconstructs argumentative texts, which advance an informal argumentation, as valid arguments: It inserts, e.g., missing premises and conclusions, formalizes inferences, and coherently links the logical reconstruction to the source text. We create a synthetic corpus for deep argument analysis, and evaluate ArgumentAnalyst on this new dataset as well as on existing data, specifically EntailmentBank (Dalvi et al. 2021). Our empirical findings vindicate the overall framework and highlight the advantages of a modular design, in particular its ability to emulate established heuristics (such as hermeneutic cycles), to explore the model's uncertainty, to cope with the plurality of correct solutions (underdetermination), and to exploit higher-order evidence.
PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models
Physics problem-solving is a challenging domain for large AI models, requiring integration of conceptual understanding, mathematical reasoning, and interpretation of physical diagrams. Current evaluation methodologies show notable limitations in capturing the breadth and complexity of undergraduate-level physics, underscoring the need for more rigorous assessments. To this end, we present PhysUniBench, a large-scale multimodal benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs) specifically on undergraduate-level physics problems. PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagrams. The benchmark includes both open-ended and multiple-choice questions, systematically curated and difficulty-rated through an iterative model-in-the-loop process. The benchmark's construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced difficulty grading system with five levels. Through extensive experiments, we observe that current state-of-the-art models encounter substantial challenges in physics reasoning. For example, GPT-4o mini achieves only about 34.2\% accuracy in the proposed PhysUniBench. These results highlight that current MLLMs struggle with advanced physics reasoning, especially on multi-step problems and those requiring precise diagram interpretation. By providing a broad and rigorous assessment tool, PhysUniBench aims to drive progress in AI for Science, encouraging the development of models with stronger physical reasoning, problem-solving skills, and multimodal understanding. The benchmark and evaluation scripts are available at https://prismax-team.github.io/PhysUniBenchmark/.
A Method to Simultaneously Facilitate All Jet Physics Tasks
Machine learning has become an essential tool in jet physics. Due to their complex, high-dimensional nature, jets can be explored holistically by neural networks in ways that are not possible manually. However, innovations in all areas of jet physics are proceeding in parallel. We show that specially constructed machine learning models trained for a specific jet classification task can improve the accuracy, precision, or speed of all other jet physics tasks. This is demonstrated by training on a particular multiclass generation and classification task and then using the learned representation for different generation and classification tasks, for datasets with a different (full) detector simulation, for jets from a different collision system (pp versus ep), for generative models, for likelihood ratio estimation, and for anomaly detection. We consider, our OmniLearn approach thus as a jet-physics foundation model. It is made publicly available for use in any area where state-of-the-art precision is required for analyses involving jets and their substructure.
Accurate and robust methods for direct background estimation in resonant anomaly detection
Resonant anomaly detection methods have great potential for enhancing the sensitivity of traditional bump hunt searches. A key component of these methods is a high quality background template used to produce an anomaly score. Using the LHC Olympics R&D dataset, we demonstrate that this background template can also be repurposed to directly estimate the background expectation in a simple cut and count setup. In contrast to a traditional bump hunt, no fit to the invariant mass distribution is needed, thereby avoiding the potential problem of background sculpting. Furthermore, direct background estimation allows working with large background rejection rates, where resonant anomaly detection methods typically show their greatest improvement in significance.
PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models
Large language models (LLMs) have rapidly advanced and are increasingly capable of tackling complex scientific problems, including those in physics. Despite this progress, current LLMs often fail to emulate the concise, principle-based reasoning characteristic of human experts, instead generating lengthy and opaque solutions. This discrepancy highlights a crucial gap in their ability to apply core physical principles for efficient and interpretable problem solving. To systematically investigate this limitation, we introduce PhySense, a novel principle-based physics reasoning benchmark designed to be easily solvable by experts using guiding principles, yet deceptively difficult for LLMs without principle-first reasoning. Our evaluation across multiple state-of-the-art LLMs and prompt types reveals a consistent failure to align with expert-like reasoning paths, providing insights for developing AI systems with efficient, robust and interpretable principle-based scientific reasoning.
Two 100 TeV neutrinos coincident with the Seyfert galaxy NGC 7469
In 2013, the IceCube collaboration announced the detection of a diffuse high-energy astrophysical neutrino flux. The origin of this flux is still largely unknown. The most significant individual source is the close-by Seyfert galaxy NGC 1068 at 4.2-sigma level with a soft spectral index. To identify sources based on their counterpart, IceCube releases realtime alerts corresponding to neutrinos with a high probability of astrophysical origin. We report here the spatial coincidence of two neutrino alerts, IC220424A and IC230416A, with the Seyfert galaxy NGC 7469 at a distance of 70 Mpc. We evaluate, a-posteriori, the chance probability of such a coincidence and discuss this source as a potential neutrino emitter based on its multi-wavelength properties and in comparison to NGC 1068 by performing a Goodness-of-Fit test. The test statistic is derived from a likelihood ratio that includes the neutrino angular uncertainty and the source distance. We apply this test first to a catalog of AGN sources and second to a catalog of Seyfert galaxies only. Our a-posteriori evaluation excludes the possibility of an accidental spatial coincidence of both neutrinos with the Seyfert galaxy NGC 7469 at 3.2-sigma level, leaving open the possibility that either one or both neutrinos originated from the source. To be compatible with non-detections of TeV neutrinos, the source would need to have a hard spectral index.
On the Relationship Between Explanation and Prediction: A Causal View
Being able to provide explanations for a model's decision has become a central requirement for the development, deployment, and adoption of machine learning models. However, we are yet to understand what explanation methods can and cannot do. How do upstream factors such as data, model prediction, hyperparameters, and random initialization influence downstream explanations? While previous work raised concerns that explanations (E) may have little relationship with the prediction (Y), there is a lack of conclusive study to quantify this relationship. Our work borrows tools from causal inference to systematically assay this relationship. More specifically, we study the relationship between E and Y by measuring the treatment effect when intervening on their causal ancestors, i.e., on hyperparameters and inputs used to generate saliency-based Es or Ys. Our results suggest that the relationships between E and Y is far from ideal. In fact, the gap between 'ideal' case only increase in higher-performing models -- models that are likely to be deployed. Our work is a promising first step towards providing a quantitative measure of the relationship between E and Y, which could also inform the future development of methods for E with a quantitative metric.
Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency
Despite its simplicity and efficacy, the high token expenditure of self-consistency can limit its practical utility. Here we investigate if self-consistency can be made more token-efficient for long chain-of-thought reasoning tasks, while preserving its parallelism, through early hypothesis pruning. Concretely, we generate all solutions in parallel, but periodically prune intermediate hypotheses that are deemed unnecessary based on two lightweight indicators: (a) the model's own confidence in individual hypotheses, and (b) lexical coverage of all current hypotheses by candidate subsets that are under consideration for continued retention. We design a fast weighted set cover algorithm that utilizes the two indicators; our evaluation of five LLMs on three math benchmarks shows that this method can improve token efficiency for all models, by 10-35% in many cases.
Parallel Bayesian Optimization of Agent-based Transportation Simulation
MATSim (Multi-Agent Transport Simulation Toolkit) is an open source large-scale agent-based transportation planning project applied to various areas like road transport, public transport, freight transport, regional evacuation, etc. BEAM (Behavior, Energy, Autonomy, and Mobility) framework extends MATSim to enable powerful and scalable analysis of urban transportation systems. The agents from the BEAM simulation exhibit 'mode choice' behavior based on multinomial logit model. In our study, we consider eight mode choices viz. bike, car, walk, ride hail, driving to transit, walking to transit, ride hail to transit, and ride hail pooling. The 'alternative specific constants' for each mode choice are critical hyperparameters in a configuration file related to a particular scenario under experimentation. We use the 'Urbansim-10k' BEAM scenario (with 10,000 population size) for all our experiments. Since these hyperparameters affect the simulation in complex ways, manual calibration methods are time consuming. We present a parallel Bayesian optimization method with early stopping rule to achieve fast convergence for the given multi-in-multi-out problem to its optimal configurations. Our model is based on an open source HpBandSter package. This approach combines hierarchy of several 1D Kernel Density Estimators (KDE) with a cheap evaluator (Hyperband, a single multidimensional KDE). Our model has also incorporated extrapolation based early stopping rule. With our model, we could achieve a 25% L1 norm for a large-scale BEAM simulation in fully autonomous manner. To the best of our knowledge, our work is the first of its kind applied to large-scale multi-agent transportation simulations. This work can be useful for surrogate modeling of scenarios with very large populations.
The LHCb ultra-fast simulation option, Lamarr: design and validation
Detailed detector simulation is the major consumer of CPU resources at LHCb, having used more than 90% of the total computing budget during Run 2 of the Large Hadron Collider at CERN. As data is collected by the upgraded LHCb detector during Run 3 of the LHC, larger requests for simulated data samples are necessary, and will far exceed the pledged resources of the experiment, even with existing fast simulation options. An evolution of technologies and techniques to produce simulated samples is mandatory to meet the upcoming needs of analysis to interpret signal versus background and measure efficiencies. In this context, we propose Lamarr, a Gaudi-based framework designed to offer the fastest solution for the simulation of the LHCb detector. Lamarr consists of a pipeline of modules parameterizing both the detector response and the reconstruction algorithms of the LHCb experiment. Most of the parameterizations are made of Deep Generative Models and Gradient Boosted Decision Trees trained on simulated samples or alternatively, where possible, on real data. Embedding Lamarr in the general LHCb Gauss Simulation framework allows combining its execution with any of the available generators in a seamless way. Lamarr has been validated by comparing key reconstructed quantities with Detailed Simulation. Good agreement of the simulated distributions is obtained with two-order-of-magnitude speed-up of the simulation phase.
Large Language Models can Learn Rules
When prompted with a few examples and intermediate steps, large language models (LLMs) have demonstrated impressive performance in various reasoning tasks. However, prompting methods that rely on implicit knowledge in an LLM often generate incorrect answers when the implicit knowledge is wrong or inconsistent with the task. To tackle this problem, we present Hypotheses-to-Theories (HtT), a framework that learns a rule library for reasoning with LLMs. HtT contains two stages, an induction stage and a deduction stage. In the induction stage, an LLM is first asked to generate and verify rules over a set of training examples. Rules that appear and lead to correct answers sufficiently often are collected to form a rule library. In the deduction stage, the LLM is then prompted to employ the learned rule library to perform reasoning to answer test questions. Experiments on relational reasoning, numerical reasoning and concept learning problems show that HtT improves existing prompting methods, with an absolute gain of 10-30% in accuracy. The learned rules are also transferable to different models and to different forms of the same problem.
Planck 2018 results. V. CMB power spectra and likelihoods
This paper describes the 2018 Planck CMB likelihoods, following a hybrid approach similar to the 2015 one, with different approximations at low and high multipoles, and implementing several methodological and analysis refinements. With more realistic simulations, and better correction and modelling of systematics, we can now make full use of the High Frequency Instrument polarization data. The low-multipole 100x143 GHz EE cross-spectrum constrains the reionization optical-depth parameter tau to better than 15% (in combination with with the other low- and high-ell likelihoods). We also update the 2015 baseline low-ell joint TEB likelihood based on the Low Frequency Instrument data, which provides a weaker tau constraint. At high multipoles, a better model of the temperature-to-polarization leakage and corrections for the effective calibrations of the polarization channels (polarization efficiency or PE) allow us to fully use the polarization spectra, improving the constraints on the LambdaCDM parameters by 20 to 30% compared to TT-only constraints. Tests on the modelling of the polarization demonstrate good consistency, with some residual modelling uncertainties, the accuracy of the PE modelling being the main limitation. Using our various tests, simulations, and comparison between different high-ell implementations, we estimate the consistency of the results to be better than the 0.5sigma level. Minor curiosities already present before (differences between ell<800 and ell>800 parameters or the preference for more smoothing of the C_ell peaks) are shown to be driven by the TT power spectrum and are not significantly modified by the inclusion of polarization. Overall, the legacy Planck CMB likelihoods provide a robust tool for constraining the cosmological model and represent a reference for future CMB observations. (Abridged)
HoloBeam: Learning Optimal Beamforming in Far-Field Holographic Metasurface Transceivers
Holographic Metasurface Transceivers (HMTs) are emerging as cost-effective substitutes to large antenna arrays for beamforming in Millimeter and TeraHertz wave communication. However, to achieve desired channel gains through beamforming in HMT, phase-shifts of a large number of elements need to be appropriately set, which is challenging. Also, these optimal phase-shifts depend on the location of the receivers, which could be unknown. In this work, we develop a learning algorithm using a {\it fixed-budget multi-armed bandit framework} to beamform and maximize received signal strength at the receiver for far-field regions. Our algorithm, named \Algo exploits the parametric form of channel gains of the beams, which can be expressed in terms of two {\it phase-shifting parameters}. Even after parameterization, the problem is still challenging as phase-shifting parameters take continuous values. To overcome this, {\it\HB} works with the discrete values of phase-shifting parameters and exploits their unimodal relations with channel gains to learn the optimal values faster. We upper bound the probability of {\it\HB} incorrectly identifying the (discrete) optimal phase-shift parameters in terms of the number of pilots used in learning. We show that this probability decays exponentially with the number of pilot signals. We demonstrate that {\it\HB} outperforms state-of-the-art algorithms through extensive simulations.
The Machine Learning Landscape of Top Taggers
Based on the established task of identifying boosted, hadronically decaying top quarks, we compare a wide range of modern machine learning approaches. Unlike most established methods they rely on low-level input, for instance calorimeter output. While their network architectures are vastly different, their performance is comparatively similar. In general, we find that these new approaches are extremely powerful and great fun.
Experimental Estimation of Quantum State Properties from Classical Shadows
Full quantum tomography of high-dimensional quantum systems is experimentally infeasible due to the exponential scaling of the number of required measurements on the number of qubits in the system. However, several ideas were proposed recently for predicting the limited number of features for these states, or estimating the expectation values of operators, without the need for full state reconstruction. These ideas go under the general name of shadow tomography. Here we provide an experimental demonstration of property estimation based on classical shadows proposed in [H.-Y. Huang, R. Kueng, J. Preskill. Nat. Phys. https://doi.org/10.1038/s41567-020-0932-7 (2020)] and study its performance in the quantum optical experiment with high-dimensional spatial states of photons. We show on experimental data how this procedure outperforms conventional state reconstruction in fidelity estimation from a limited number of measurements.
Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics
We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology. The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level. These problems are novel in the sense that they do not come from public problem collections. We evaluate our data set on various open and closed language models, including o3-mini, o1, DeepSeek-R1, GPT-4o and versions of Llama and Qwen. While we find impressive progress in model performance with the most recent models, our research-level difficulty problems are mostly unsolved. We address challenges of auto-verifiability and grading, and discuss common failure modes. While currently state-of-the art models are still of limited use for researchers, our results show that AI assisted theoretical physics research may become possible in the near future. We discuss the main obstacles towards this goal and possible strategies to overcome them. The public problems and solutions, results for various models, and updates to the data set and score distribution, are available on the website of the dataset tpbench.org.
Neural Networks for cosmological model selection and feature importance using Cosmic Microwave Background data
The measurements of the temperature and polarisation anisotropies of the Cosmic Microwave Background (CMB) by the ESA Planck mission have strongly supported the current concordance model of cosmology. However, the latest cosmological data release from ESA Planck mission still has a powerful potential to test new data science algorithms and inference techniques. In this paper, we use advanced Machine Learning (ML) algorithms, such as Neural Networks (NNs), to discern among different underlying cosmological models at the angular power spectra level, using both temperature and polarisation Planck 18 data. We test two different models beyond LambdaCDM: a modified gravity model: the Hu-Sawicki model, and an alternative inflationary model: a feature-template in the primordial power spectrum. Furthermore, we also implemented an interpretability method based on SHAP values to evaluate the learning process and identify the most relevant elements that drive our architecture to certain outcomes. We find that our NN is able to distinguish between different angular power spectra successfully for both alternative models and LambdaCDM. We conclude by explaining how archival scientific data has still a strong potential to test novel data science algorithms that are interesting for the next generation of cosmological experiments.
Statistical selection of high-redshift, neutral-hydrogen-rich, lensed galaxies with the Square Kilometre Array
Deep wide spectral line surveys with the Square Kilometre Array (SKA) will expand the cosmic frontiers of neutral atomic hydrogen (HI) in galaxies. However, at cosmologically significant redshifts (z gtrsim 0.5), detections will typically be spatially unresolved and limited to the highest mass systems. Gravitational lensing could potentially alleviate these limitations, enabling lower mass systems to be studied at higher redshift and spatially resolved dynamical studies of some HI discs. Additionally, lensed HI systems would select foreground dark matter haloes using a different, more extended baryonic tracer compared to other lens surveys. This may result in a wider selected range of foreground dark matter halo properties, such as the concentration parameter. This paper uses the distortion of the observed HI mass function (HIMF) produced by strong gravitational lensing to find a flux density criterion for selecting lensed HI sources in future SKA-Mid spectral line surveys. This selection approach could yield lensed HI source densities in the range of sim 0.1--10 galaxies per square degree out to a redshift of z simeq 3 covered by SKA-MID Band 1. Although the sample sizes are modest, even with the proposed SKA-Mid surveys, the selection approach is straightforward and should have a 50% efficiency without any additional information, such as low-impact-factor or lower-redshift massive galaxies. The efficiency of selecting high-redshift, neutral-hydrogen-rich, lensed galaxies should then be greatly enhanced by using SKA-MID data in concert with the Vera C. Rubin Large Survey of Space and Time.
A Mathematical Approach to Constraining Neural Abstraction and the Mechanisms Needed to Scale to Higher-Order Cognition
Artificial intelligence has made great strides in the last decade but still falls short of the human brain, the best-known example of intelligence. Not much is known of the neural processes that allow the brain to make the leap to achieve so much from so little beyond its ability to create knowledge structures that can be flexibly and dynamically combined, recombined, and applied in new and novel ways. This paper proposes a mathematical approach using graph theory and spectral graph theory, to hypothesize how to constrain these neural clusters of information based on eigen-relationships. This same hypothesis is hierarchically applied to scale up from the smallest to the largest clusters of knowledge that eventually lead to model building and reasoning.
Evidence of Nonlinear Signatures in Solar Wind Proton Density at the L1 Lagrange point
The solar wind is a medium characterized by strong turbulence and significant field fluctuations on various scales. Recent observations have revealed that magnetic turbulence exhibits a self-similar behavior. Similarly, high-resolution measurements of the proton density have shown comparable characteristics, prompting several studies into the multifractal properties of these density fluctuations. In this work, we show that low-resolution observations of the solar wind proton density over time, recorded by various spacecraft at Lagrange point L1, also exhibit non-linear and multifractal structures. The novelty of our study lies in the fact that this is the first systematic analysis of solar wind proton density using low-resolution (hourly) data collected by multiple spacecraft at the L1 Lagrange point over a span of 17 years. Furthermore, we interpret our results within the framework of non-extensive statistical mechanics, which appears to be consistent with the observed nonlinear behavior. Based on the data, we successfully validate the q-triplet predicted by non-extensive statistical theory. To the best of our knowledge, this represents the most rigorous and systematic validation to date of the q-triplet in the solar wind.
Modeling the Machine Learning Multiverse
Amid mounting concern about the reliability and credibility of machine learning research, we present a principled framework for making robust and generalizable claims: the multiverse analysis. Our framework builds upon the multiverse analysis (Steegen et al., 2016) introduced in response to psychology's own reproducibility crisis. To efficiently explore high-dimensional and often continuous ML search spaces, we model the multiverse with a Gaussian Process surrogate and apply Bayesian experimental design. Our framework is designed to facilitate drawing robust scientific conclusions about model performance, and thus our approach focuses on exploration rather than conventional optimization. In the first of two case studies, we investigate disputed claims about the relative merit of adaptive optimizers. Second, we synthesize conflicting research on the effect of learning rate on the large batch training generalization gap. For the machine learning community, the multiverse analysis is a simple and effective technique for identifying robust claims, for increasing transparency, and a step toward improved reproducibility.
A Review of NEST Models for Liquid Xenon and Exhaustive Comparison to Other Approaches
This paper will discuss the microphysical simulation of interactions in liquid xenon, the active detector medium in many leading rare-event searches for new physics, and describe experimental observables useful for understanding detector performance. The scintillation and ionization yield distributions for signal and background will be presented using the Noble Element Simulation Technique (NEST), which is a toolkit based on experimental data and simple, empirical formulae, which mimic previous microphysics modeling, but are guided by data. The NEST models for light and charge production as a function of the particle type, energy, and electric field will be reviewed, as well as models for energy resolution and final pulse areas. NEST will be compared to other models or sets of models, and vetted against real data, with several specific examples pulled from XENON, ZEPLIN, LUX, LZ, PandaX, and table-top experiments used for calibrations.
Bounds on geometric wakefields in collimators and step transitions of arbitrary cross sections
We present the wakefield conformal mapping technique that can be readily applied to the analysis of the radiation generated by an ultra-relativistic particle in the step transition and a collimator. We derive simple analytical expressions for the lower and upper bounds of both longitudinal and transverse wake potentials. We test the derived expressions against well-known formulas in several representative examples. The proposed method can greatly simplify the optimization of collimating sections, as well as become a useful tool in the shape optimization problems.
Probing neural language models for understanding of words of estimative probability
Words of estimative probability (WEP) are expressions of a statement's plausibility (probably, maybe, likely, doubt, likely, unlikely, impossible...). Multiple surveys demonstrate the agreement of human evaluators when assigning numerical probability levels to WEP. For example, highly likely corresponds to a median chance of 0.90+-0.08 in Fagen-Ulmschneider (2015)'s survey. In this work, we measure the ability of neural language processing models to capture the consensual probability level associated to each WEP. Firstly, we use the UNLI dataset (Chen et al., 2020) which associates premises and hypotheses with their perceived joint probability p, to construct prompts, e.g. "[PREMISE]. [WEP], [HYPOTHESIS]." and assess whether language models can predict whether the WEP consensual probability level is close to p. Secondly, we construct a dataset of WEP-based probabilistic reasoning, to test whether language models can reason with WEP compositions. When prompted "[EVENTA] is likely. [EVENTB] is impossible.", a causal language model should not express that [EVENTA&B] is likely. We show that both tasks are unsolved by off-the-shelf English language models, but that fine-tuning leads to transferable improvement.
From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems
Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in this domain. Specifically, we organize the relevant studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication. Hypothesis formulation involves knowledge synthesis and hypothesis generation. Hypothesis validation includes the verification of scientific claims, theorem proving, and experiment validation. Manuscript publication encompasses manuscript writing and the peer review process. Furthermore, we identify and discuss the current challenges faced in these areas, as well as potential future directions for research. Finally, we also offer a comprehensive overview of existing benchmarks and tools across various domains that support the integration of AI into the research process. We hope this paper serves as an introduction for beginners and fosters future research. Resources have been made publicly available at https://github.com/zkzhou126/AI-for-Research.
Theoretical Antineutrino Detection, Direction and Ranging at Long Distances
In this paper we introduce the concept of what we call "NUDAR" (NeUtrino Direction and Ranging), making the point that measurements of the observed energy and direction vectors can be employed to passively deduce the exact three-dimensional location and thermal power of geophysical and anthropogenic neutrino sources from even a single detector. We present the most precise background estimates to date, all handled in full three dimensions, as functions of depth and geographical location. For the present calculations, we consider a hypothetical 138 kiloton detector which can be transported to an ocean site and deployed to an operational depth. We present a Bayesian estimation framework to incorporate any a priori knowledge of the reactor that we are trying to detect, as well as the estimated uncertainty in the background and the oscillation parameters. Most importantly, we fully employ the knowledge of the reactor spectrum and the distance-dependent effects of neutrino oscillations on such spectra. The latter, in particular, makes possible determination of range from one location, given adequate signal statistics. Further, we explore the rich potential of improving detection with even modest improvements in individual neutrino direction determination. We conclude that a 300 MWth reactor can indeed be geolocated, and its operating power estimated with one or two detectors in the hundred kiloton class at ranges out to a few hundred kilometers. We note that such detectors would have natural and non-interfering utility for scientific studies of geo-neutrinos, neutrino oscillations, and astrophysical neutrinos. This motivates the development of cost effective methods of constructing and deploying such next generation detectors.
A new type of Neutrino Detector for Sterile Neutrino Search at Nuclear Reactors and Nuclear Nonproliferation Applications
We describe a new detector, called NuLat, to study electron anti-neutrinos a few meters from a nuclear reactor, and search for anomalous neutrino oscillations. Such oscillations could be caused by sterile neutrinos, and might explain the "Reactor Antineutrino Anomaly". NuLat, is made possible by a natural synergy between the miniTimeCube and mini-LENS programs described in this paper. It features a "Raghavan Optical Lattice" (ROL) consisting of 3375 boron or ^6Li loaded plastic scintillator cubical cells 6.3\,cm (2.500") on a side. Cell boundaries have a 0.127\,mm (0.005") air gap, resulting in total internal reflection guiding most of the light down the 3 cardinal directions. The ROL detector technology for NuLat gives excellent spatial and energy resolution and allows for in-depth event topology studies. These features allow us to discern inverse beta decay (IBD) signals and the putative oscillation pattern, even in the presence of other backgrounds. We discuss here test venues, efficiency, sensitivity and project status.
Lake- and Surface-Based Detectors for Forward Neutrino Physics
We propose two medium-baseline, kiloton-scale neutrino experiments to study neutrinos from LHC proton-proton collisions: SINE, a surface-based scintillator panel detector observing muon neutrinos from the CMS interaction point, and UNDINE, a water Cherenkov detector submerged in lake Geneva observing all-flavor neutrinos from LHCb. Using a Monte Carlo simulation, we estimate millions of neutrino interactions during the high-luminosity LHC era. We show that these datasets can constrain neutrino cross sections, charm production in pp collisions, and strangeness enhancement as a solution to the cosmic-ray muon puzzle. SINE and UNDINE thus offer a cost-effective medium-baseline complement to the proposed short-baseline forward physics facility.
Probing Invisible Decay of Z^prime at Muon Collider with Topological Data Analysis and Machine Learning
We explore the use of topological data analysis (TDA) combined with machine learning for discriminating standard model backgrounds from the invisible decay of the Z^prime boson associated with monophoton emission at a 3 TeV muon collider. Reconstructed events are mapped into a six-dimensional kinematic space and aggregated into bags of events, from which persistent homology is used to extract Betti number distributions. Within the Multiple Instance Learning paradigm, classifiers trained on these topological descriptors demonstrate significantly improved classification accuracy compared to the conventional ML approaches based on event-wise kinematic inputs. We also draw exclusion contours at 95\% CL in the (m_{Z^prime}, m_chi) parameter space, highlighting the potential of topological features to extend the discovery reach of future collider experiments.
Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Language Models
Large Language Models (LLMs) have shown impressive capabilities but still suffer from the issue of hallucinations. A significant type of this issue is the false premise hallucination, which we define as the phenomenon when LLMs generate hallucinated text when confronted with false premise questions. In this paper, we perform a comprehensive analysis of the false premise hallucination and elucidate its internal working mechanism: a small subset of attention heads (which we designate as false premise heads) disturb the knowledge extraction process, leading to the occurrence of false premise hallucination. Based on our analysis, we propose FAITH (False premise Attention head constraIining for miTigating Hallucinations), a novel and effective method to mitigate false premise hallucinations. It constrains the false premise attention heads during the model inference process. Impressively, extensive experiments demonstrate that constraining only approximately 1% of the attention heads in the model yields a notable increase of nearly 20% of model performance.
Digital Discovery of interferometric Gravitational Wave Detectors
Gravitational waves, detected a century after they were first theorized, are spacetime distortions caused by some of the most cataclysmic events in the universe, including black hole mergers and supernovae. The successful detection of these waves has been made possible by ingenious detectors designed by human experts. Beyond these successful designs, the vast space of experimental configurations remains largely unexplored, offering an exciting territory potentially rich in innovative and unconventional detection strategies. Here, we demonstrate the application of artificial intelligence (AI) to systematically explore this enormous space, revealing novel topologies for gravitational wave (GW) detectors that outperform current next-generation designs under realistic experimental constraints. Our results span a broad range of astrophysical targets, such as black hole and neutron star mergers, supernovae, and primordial GW sources. Moreover, we are able to conceptualize the initially unorthodox discovered designs, emphasizing the potential of using AI algorithms not only in discovering but also in understanding these novel topologies. We've assembled more than 50 superior solutions in a publicly available Gravitational Wave Detector Zoo which could lead to many new surprising techniques. At a bigger picture, our approach is not limited to gravitational wave detectors and can be extended to AI-driven design of experiments across diverse domains of fundamental physics.
Disintegration and Bayesian Inversion via String Diagrams
The notions of disintegration and Bayesian inversion are fundamental in conditional probability theory. They produce channels, as conditional probabilities, from a joint state, or from an already given channel (in opposite direction). These notions exist in the literature, in concrete situations, but are presented here in abstract graphical formulations. The resulting abstract descriptions are used for proving basic results in conditional probability theory. The existence of disintegration and Bayesian inversion is discussed for discrete probability, and also for measure-theoretic probability --- via standard Borel spaces and via likelihoods. Finally, the usefulness of disintegration and Bayesian inversion is illustrated in several examples.
Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve
The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that in order to develop a holistic understanding of these systems we need to consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. This approach - which we call the teleological approach - leads us to identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. We predict that LLMs will achieve higher accuracy when these probabilities are high than when they are low - even in deterministic settings where probability should not matter. To test our predictions, we evaluate two LLMs (GPT-3.5 and GPT-4) on eleven tasks, and we find robust evidence that LLMs are influenced by probability in the ways that we have hypothesized. In many cases, the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability word sequence but only 13% when it is low-probability. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system - one that has been shaped by its own particular set of pressures.
Dark Matter Subhalos and Higher Order Catastrophes in Gravitational Wave Lensing
Gravitational lensing is an invaluable probe of the nature of dark matter, and the structures it forms. Lensed gravitational waves in particular allow for unparalleled sensitivity to small scale structures within the lenses, due to the precise time resolution in combination with the continuous monitoring of the entire sky. In this work, we show two distinct ways of using strongly lensed gravitational waves to identify the presence of dark matter subhalos: {i)} through higher order caustics generating high relative magnification (mu_r > 2), short time delay image pairs that break the caustic universality relations of single dark matter halos, which occur for sim 1-10 percent of strongly lensed events in our cold dark matter models, and ii) through the presence of more than three highly magnified images, which occur for sim 0.01-1 percent of the same simulated events. We find that these results are highly sensitive to the concentrations of subhalos in our simulations, and more mildly to their number densities. The presence of low-mass subhalos increases the probability of observing wave-optics lensing in lensed gravitational waves, which is studied by solving the diffraction integral with the stationary phase approximation, as well as numerically. We also report distinct quantitative and qualitative differences in the distributions of relative magnifications and time delays for subhalo populations with increased number densities or concentrations. With the upcoming detection of strongly lensed events by ground- and space- based detectors, comparisons against these simulated distributions will provide insight into the nature of dark matter.
Understanding Deep Networks via Extremal Perturbations and Smooth Masks
The problem of attribution is concerned with identifying the parts of an input that are responsible for a model's output. An important family of attribution methods is based on measuring the effect of perturbations applied to the input. In this paper, we discuss some of the shortcomings of existing approaches to perturbation analysis and address them by introducing the concept of extremal perturbations, which are theoretically grounded and interpretable. We also introduce a number of technical innovations to compute extremal perturbations, including a new area constraint and a parametric family of smooth perturbations, which allow us to remove all tunable hyper-parameters from the optimization problem. We analyze the effect of perturbations as a function of their area, demonstrating excellent sensitivity to the spatial properties of the deep neural network under stimulation. We also extend perturbation analysis to the intermediate layers of a network. This application allows us to identify the salient channels necessary for classification, which, when visualized using feature inversion, can be used to elucidate model behavior. Lastly, we introduce TorchRay, an interpretability library built on PyTorch.
What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models
Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler's predictions of planetary motion later led to the discovery of Newtonian mechanics. However, evaluating whether these models truly capture deeper structure remains a challenge. We develop a technique for evaluating foundation models that examines how they adapt to synthetic datasets generated from some postulated world model. Our technique measures whether the foundation model's inductive bias aligns with the world model, and so we refer to it as an inductive bias probe. Across multiple domains, we find that foundation models can excel at their training tasks yet fail to develop inductive biases towards the underlying world model when adapted to new tasks. We particularly find that foundation models trained on orbital trajectories consistently fail to apply Newtonian mechanics when adapted to new physics tasks. Further analysis reveals that these models behave as if they develop task-specific heuristics that fail to generalize.
Light Schrödinger Bridge
Despite the recent advances in the field of computational Schr\"odinger Bridges (SB), most existing SB solvers are still heavy-weighted and require complex optimization of several neural networks. It turns out that there is no principal solver which plays the role of simple-yet-effective baseline for SB just like, e.g., k-means method in clustering, logistic regression in classification or Sinkhorn algorithm in discrete optimal transport. We address this issue and propose a novel fast and simple SB solver. Our development is a smart combination of two ideas which recently appeared in the field: (a) parameterization of the Schr\"odinger potentials with sum-exp quadratic functions and (b) viewing the log-Schr\"odinger potentials as the energy functions. We show that combined together these ideas yield a lightweight, simulation-free and theoretically justified SB solver with a simple straightforward optimization objective. As a result, it allows solving SB in moderate dimensions in a matter of minutes on CPU without a painful hyperparameter selection. Our light solver resembles the Gaussian mixture model which is widely used for density estimation. Inspired by this similarity, we also prove an important theoretical result showing that our light solver is a universal approximator of SBs. Furthemore, we conduct the analysis of the generalization error of our light solver. The code for our solver can be found at https://github.com/ngushchin/LightSB
Advancing AI-Scientist Understanding: Making LLM Think Like a Physicist with Interpretable Reasoning
Large Language Models (LLMs) are playing an expanding role in physics research by enhancing reasoning, symbolic manipulation, and numerical computation. However, ensuring the reliability and interpretability of their outputs remains a significant challenge. In our framework, we conceptualize the collaboration between AI and human scientists as a dynamic interplay among three modules: the reasoning module, the interpretation module, and the AI-scientist interaction module. Recognizing that effective physics reasoning demands rigorous logical consistency, quantitative precision, and deep integration with established theoretical models, we introduce the interpretation module to improve the understanding of AI-generated outputs, which is not previously explored in the literature. This module comprises multiple specialized agents, including summarizers, model builders, UI builders, and testers, which collaboratively structure LLM outputs within a physically grounded framework, by constructing a more interpretable science model. A case study demonstrates that our approach enhances transparency, facilitates validation, and strengthens AI-augmented reasoning in scientific discovery.
HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark?
Recently, the physical capabilities of (M)LLMs have garnered increasing attention. However, existing benchmarks for physics suffer from two major gaps: they neither provide systematic and up-to-date coverage of real-world physics competitions such as physics Olympiads, nor enable direct performance comparison with humans. To bridge these gaps, we present HiPhO, the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. Specifically, HiPhO highlights three key innovations. (1) Comprehensive Data: It compiles 13 latest Olympiad exams from 2024-2025, spanning both international and regional competitions, and covering mixed modalities that encompass problems spanning text-only to diagram-based. (2) Professional Evaluation: We adopt official marking schemes to perform fine-grained grading at both the answer and step level, fully aligned with human examiners to ensure high-quality and domain-specific evaluation. (3) Comparison with Human Contestants: We assign gold, silver, and bronze medals to models based on official medal thresholds, thereby enabling direct comparison between (M)LLMs and human contestants. Our large-scale evaluation of 30 state-of-the-art (M)LLMs shows that: across 13 exams, open-source MLLMs mostly remain at or below the bronze level; open-source LLMs show promising progress with occasional golds; closed-source reasoning MLLMs can achieve 6 to 12 gold medals; and most models still have a significant gap from full marks. These results highlight a substantial performance gap between open-source models and top students, the strong physical reasoning capabilities of closed-source reasoning models, and the fact that there is still significant room for improvement. HiPhO, as a rigorous, human-aligned, and Olympiad-focused benchmark for advancing multimodal physical reasoning, is open-source and available at https://github.com/SciYu/HiPhO.
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context. However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate ~150,000 generations from 14 language models, finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain). We further define a novel error classification for LLM hallucinations based on whether they likely stem from incorrect recollection of training data (Type A errors), or incorrect knowledge in training data (Type B errors), or are fabrication (Type C errors). We hope our framework provides a foundation to enable the principled study of why generative models hallucinate, and advances the development of trustworthy large language models.
To Believe or Not to Believe Your LLM
We explore uncertainty quantification in large language models (LLMs), with the goal to identify when uncertainty in responses given a query is large. We simultaneously consider both epistemic and aleatoric uncertainties, where the former comes from the lack of knowledge about the ground truth (such as about facts or the language), and the latter comes from irreducible randomness (such as multiple possible answers). In particular, we derive an information-theoretic metric that allows to reliably detect when only epistemic uncertainty is large, in which case the output of the model is unreliable. This condition can be computed based solely on the output of the model obtained simply by some special iterative prompting based on the previous responses. Such quantification, for instance, allows to detect hallucinations (cases when epistemic uncertainty is high) in both single- and multi-answer responses. This is in contrast to many standard uncertainty quantification strategies (such as thresholding the log-likelihood of a response) where hallucinations in the multi-answer case cannot be detected. We conduct a series of experiments which demonstrate the advantage of our formulation. Further, our investigations shed some light on how the probabilities assigned to a given output by an LLM can be amplified by iterative prompting, which might be of independent interest.
LLMs Will Always Hallucinate, and We Need to Live With This
As Large Language Models become more ubiquitous across domains, it becomes important to examine their inherent limitations critically. This work argues that hallucinations in language models are not just occasional errors but an inevitable feature of these systems. We demonstrate that hallucinations stem from the fundamental mathematical and logical structure of LLMs. It is, therefore, impossible to eliminate them through architectural improvements, dataset enhancements, or fact-checking mechanisms. Our analysis draws on computational theory and Godel's First Incompleteness Theorem, which references the undecidability of problems like the Halting, Emptiness, and Acceptance Problems. We demonstrate that every stage of the LLM process-from training data compilation to fact retrieval, intent classification, and text generation-will have a non-zero probability of producing hallucinations. This work introduces the concept of Structural Hallucination as an intrinsic nature of these systems. By establishing the mathematical certainty of hallucinations, we challenge the prevailing notion that they can be fully mitigated.
Do Language Models Know When They're Hallucinating References?
State-of-the-art language models (LMs) are notoriously susceptible to generating hallucinated information. Such inaccurate outputs not only undermine the reliability of these models but also limit their use and raise serious concerns about misinformation and propaganda. In this work, we focus on hallucinated book and article references and present them as the "model organism" of language model hallucination research, due to their frequent and easy-to-discern nature. We posit that if a language model cites a particular reference in its output, then it should ideally possess sufficient information about its authors and content, among other relevant details. Using this basic insight, we illustrate that one can identify hallucinated references without ever consulting any external resources, by asking a set of direct or indirect queries to the language model about the references. These queries can be considered as "consistency checks." Our findings highlight that while LMs, including GPT-4, often produce inconsistent author lists for hallucinated references, they also often accurately recall the authors of real references. In this sense, the LM can be said to "know" when it is hallucinating references. Furthermore, these findings show how hallucinated references can be dissected to shed light on their nature. Replication code and results can be found at https://github.com/microsoft/hallucinated-references.
Optimizing the L-σ Relation of HII Galaxies for Improving Cosmological Application
The basic premise of using HII starburst galaxies (HIIGs) as cosmic "standard candels" is that there is a significant correlation between the Hbeta luminosity (L) and the velocity dispersion (sigma) of the ionized gas from HIIGs measurements, which can be called as the empirical L - sigma relation. However, the scaling L - sigma relation well-calibrated with the lower-redshift HIIGs is unfitted for the higher-redshift HIIGs. To solve this problem, we explore new relational expression for the L - sigma relation which should be suitable for both lower-redshift and higher-redshift HIIGs. After reconstructing the Hubble diagram with the Gaussian process (GP) method from the Pantheon+ supernovae Ia sample, we examine and compare six different revised formulas of L - sigma relation. Furthermore, we use the Bayesian evidence to compare the revised L - sigma relations with the analysis of a joint sample of 36 giant extragalactic HII regions (GEHRs) and 145 HIIGs. It turns out that the redshift-dependent bilinear correction and the quadratic sigma based correction are significantly better than the others. Moreover, a quadratic sigma based correction is the most supported one. It suggests that the appropriate corrections to the L - sigma relation should be considered when the HIIGs are used as a kind of cosmological probes.
A category theory framework for Bayesian learning
Inspired by the foundational works by Spivak and Fong and Cruttwell et al., we introduce a categorical framework to formalize Bayesian inference and learning. The two key ideas at play here are the notions of Bayesian inversions and the functor GL as constructed by Cruttwell et al.. In this context, we find that Bayesian learning is the simplest case of the learning paradigm. We then obtain categorical formulations of batch and sequential Bayes updates while also verifying that the two coincide in a specific example.
Why is AI hard and Physics simple?
We discuss why AI is hard and why physics is simple. We discuss how physical intuition and the approach of theoretical physics can be brought to bear on the field of artificial intelligence and specifically machine learning. We suggest that the underlying project of machine learning and the underlying project of physics are strongly coupled through the principle of sparsity, and we call upon theoretical physicists to work on AI as physicists. As a first step in that direction, we discuss an upcoming book on the principles of deep learning theory that attempts to realize this approach.
Spectral properties of bottomonium at high temperature: a systematic investigation
We investigate spectral features of bottomonium at high temperature, in particular the thermal mass shift and width of ground state S-wave and P-wave state. We employ and compare a range of methods for determining these features from lattice NRQCD correlators, including direct correlator analyses (multi-exponential fits and moments of spectral functions), linear methods (Backus-Gilbert, Tikhonov and HLT methods), and Bayesian methods for spectral function reconstruction (MEM and BR). We comment on the reliability and limitations of the various methods.
Response: Emergent analogical reasoning in large language models
In their recent Nature Human Behaviour paper, "Emergent analogical reasoning in large language models," (Webb, Holyoak, and Lu, 2023) the authors argue that "large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems." In this response, we provide counterexamples of the letter string analogies. In our tests, GPT-3 fails to solve even the easiest variants of the problems presented in the original paper. Zero-shot reasoning is an extraordinary claim that requires extraordinary evidence. We do not see that evidence in our experiments. To strengthen claims of humanlike reasoning such as zero-shot reasoning, it is important that the field develop approaches that rule out data memorization.
Cosmic Calipers: Precise and Accurate Neutron Star Radius Measurements with Next-Generation Gravitational Wave Detectors
Gravitational waves from merging binary neutron stars carry characteristic information about their astrophysical properties, including masses and tidal deformabilities, that are needed to infer their radii. In this study, we use Bayesian inference to quantify the precision with which radius can inferred with upgrades in the current gravitational wave detectors and next-generation observatories such as the Einstein Telescope and Cosmic Explorer. We assign evidences for a set of plausible equations of state, which are then used as weights to obtain radius posteriors. We find that prior choices and the loudness of observed signals limit the precision and accuracy of inferred radii by current detectors. In contrast, next-generation observatories can resolve the radius precisely and accurately, across most of the mass range to within lesssim 5% for both soft and stiff equations of state. We also explore how the choice of the neutron star mass prior can influence the inferred masses and potentially affect radii measurements, finding that choosing an astrophysically motivated prior does not notably impact an individual neutron star's radius measurements.
Intensity statistics inside an open wave-chaotic cavity with broken time-reversal invariance
Using the supersymmetric method of random matrix theory within the Heidelberg approach framework we provide statistical description of stationary intensity sampled in locations inside an open wave-chaotic cavity, assuming that the time-reversal invariance inside the cavity is fully broken. In particular, we show that when incoming waves are fed via a finite number M of open channels the probability density {cal P}(I) for the single-point intensity I decays as a power law for large intensities: {cal P}(I)sim I^{-(M+2)}, provided there is no internal losses. This behaviour is in marked difference with the Rayleigh law {cal P}(I)sim exp(-I/I) which turns out to be valid only in the limit Mto infty. We also find the joint probability density of intensities I_1, ldots, I_L in L>1 observation points, and then extract the corresponding statistics for the maximal intensity in the observation pattern. For Lto infty the resulting limiting extreme value statistics (EVS) turns out to be different from the classical EVS distributions.
A Machine Learning Pipeline for Hunting Hidden Axion Signals in Pulsar Dispersion Measurements
In the axion model, electromagnetic waves interacting with axions induce frequency-dependent time delays, determined by the axion mass and decay constant. These small delays are difficult to detect, making traditional methods ineffective. To address this, we computed time delays for various parameters and found a prominent dispersion signal when the wave frequency equals half the axion mass. Based on this, we developed a machine learning-based pipeline, achieving 95\% classification accuracy and demonstrating strong detection capability in low signal-to-noise data. Applying this to PSR J1933-6211, we found no axion-induced delays within current sensitivity limits. While existing constraints are limited by atomic clock resolution in radio telescopes, future advances in optical clocks and broader bandwidths will enable more extensive searches. In particular, combining high-precision optical clocks with next-generation radio telescopes, such as the Qitai Radio Telescope, could improve decay constant constraints by four orders of magnitude for axion masses in the 10^{-6} sim 10^{-4} eV range.
Projections of Earth's Technosphere: Luminosity and Mass as Limits to Growth
Earth remains the only known example of a planet with technology, and future projections of Earth's trajectory provide a basis and motivation for approaching the search for extraterrestrial technospheres. Conventional approaches toward projecting Earth's technosphere include applications of the Kardashev scale, which suggest the possibility that energy-intensive civilizations may expand to harness the entire energy output available to their planet, host star, or even the entire galaxy. In this study, we argue that the Kardashev scale is better understood as a "luminosity limit" that describes the maximum capacity for a civilization to harvest luminous stellar energy across a given spatial domain, and we note that thermodynamic efficiency will always keep a luminosity-limited technosphere from actually reaching this theoretical limit. We suggest the possibility that an advanced technosphere might evolve beyond this luminosity limit to draw its energy directly from harvesting stellar mass, and we also discuss possible trajectories that could exist between Earth today and such hypothetical "stellivores." We develop a framework to describe trajectories for long-lived technospheres that optimize their growth strategies between exploration and exploitation, unlike Earth today. We note that analyses of compact accreting stars could provide ways to test the stellivore hypothesis, and we more broadly suggest an expansion of technosignature search strategies beyond those that reside exactly at the luminosity limit.
PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in diverse reasoning tasks, yet their application to complex physics reasoning remains underexplored. Physics reasoning presents unique challenges, requiring grounding in physical conditions and the interpretation of multimodal information. Current physics benchmarks are limited, often focusing on text-only inputs or solely on problem-solving, thereby overlooking the critical intermediate steps of variable identification and process formulation. To address these limitations, we introduce PhysicsArena, the first multimodal physics reasoning benchmark designed to holistically evaluate MLLMs across three critical dimensions: variable identification, physical process formulation, and solution derivation. PhysicsArena aims to provide a comprehensive platform for assessing and advancing the multimodal physics reasoning abilities of MLLMs.
The Test of Tests: A Framework For Differentially Private Hypothesis Testing
We present a generic framework for creating differentially private versions of any hypothesis test in a black-box way. We analyze the resulting tests analytically and experimentally. Most crucially, we show good practical performance for small data sets, showing that at epsilon = 1 we only need 5-6 times as much data as in the fully public setting. We compare our work to the one existing framework of this type, as well as to several individually-designed private hypothesis tests. Our framework is higher power than other generic solutions and at least competitive with (and often better than) individually-designed tests.
Cosmic Evolution Early Release Science (CEERS) survey: The colour evolution of galaxies in the distant Universe
The wavelength-coverage and sensitivity of JWST now enables us to probe the rest-frame UV - optical spectral energy distributions (SEDs) of galaxies at high-redshift (z>4). From these SEDs it is, in principle, through SED fitting possible to infer key physical properties, including stellar masses, star formation rates, and dust attenuation. These in turn can be compared with the predictions of galaxy formation simulations allowing us to validate and refine the incorporated physics. However, the inference of physical properties, particularly from photometry alone, can lead to large uncertainties and potential biases. Instead, it is now possible, and common, for simulations to be forward-modelled to yield synthetic observations that can be compared directly to real observations. In this work, we measure the JWST broadband fluxes and colours of a robust sample of 5<z<10 galaxies using the Cosmic Evolution Early Release Science (CEERS) Survey. We then analyse predictions from a variety of models using the same methodology and compare the NIRCam/F277W magnitude distribution and NIRCam colours with observations. We find that the predicted and observed magnitude distributions are similar, at least at 5<z<8. At z>8 the distributions differ somewhat, though our observed sample size is small and thus susceptible to statistical fluctuations. Likewise, the predicted and observed colour evolution show broad agreement, at least at 5<z<8. There is however some disagreement between the observed and modelled strength of the strong line contribution. In particular all the models fails to reproduce the F410M-F444W colour at z>8, though, again, the sample size is small here.
Deep Learning and genetic algorithms for cosmological Bayesian inference speed-up
In this paper, we present a novel approach to accelerate the Bayesian inference process, focusing specifically on the nested sampling algorithms. Bayesian inference plays a crucial role in cosmological parameter estimation, providing a robust framework for extracting theoretical insights from observational data. However, its computational demands can be substantial, primarily due to the need for numerous likelihood function evaluations. Our proposed method utilizes the power of deep learning, employing feedforward neural networks to approximate the likelihood function dynamically during the Bayesian inference process. Unlike traditional approaches, our method trains neural networks on-the-fly using the current set of live points as training data, without the need for pre-training. This flexibility enables adaptation to various theoretical models and datasets. We perform simple hyperparameter optimization using genetic algorithms to suggest initial neural network architectures for learning each likelihood function. Once sufficient accuracy is achieved, the neural network replaces the original likelihood function. The implementation integrates with nested sampling algorithms and has been thoroughly evaluated using both simple cosmological dark energy models and diverse observational datasets. Additionally, we explore the potential of genetic algorithms for generating initial live points within nested sampling inference, opening up new avenues for enhancing the efficiency and effectiveness of Bayesian inference methods.
Search for dark matter subhalos among unassociated Fermi-LAT sources in presence of dataset shift
We search for dark matter (DM) annihilating subhalos of the Milky Way halo among the Fermi Large Area Telescope (LAT) unassociated sources. We construct, for the first time, a statistical model of the unassociated sources at latitudes above 10 degrees. The latter is built as a combination of both DM annihilation subhalos as well as Galactic and extragalactic astrophysical components. The astrophysical components are constructed based on distributions of associated sources, while the distribution of DM subhalos is derived from Monte Carlo simulations. In this model we take into account the differences in the distributions of associated and unassociated sources including both covariate and prior probability shifts (both being forms of ``dataset shifts''). Previous searches of DM subhalos were based on classify-and-count strategies, while the approach adopted in this work is based on quantification learning, which allows one to determine a well-defined statistical interpretation of the contribution of a population of DM subhalos to the unassociated Fermi-LAT sources. In the bb annihilation channel and for a range of DM masses from 10 GeV to 1 TeV, we don't find a significant contribution from DM subhalos and derive a statistical 95% confidence upper limit on the DM annihilation cross section in this channel. While the derived limits are consistent with previous classify-and-count approaches, our generative statistical model opens new avenues for population studies of Fermi-LAT sources and, more generally, for searches of anomalies on top of backgrounds in presence of statistical and systematic uncertainties.
PathFinder: Guided Search over Multi-Step Reasoning Paths
With recent advancements in large language models, methods like chain-of-thought prompting to elicit reasoning chains have been shown to improve results on reasoning tasks. However, tasks that require multiple steps of reasoning still pose significant challenges to state-of-the-art models. Drawing inspiration from the beam search algorithm, we propose PathFinder, a tree-search-based reasoning path generation approach. It enhances diverse branching and multi-hop reasoning through the integration of dynamic decoding, enabled by varying sampling methods and parameters. Using constrained reasoning, PathFinder integrates novel quality constraints, pruning, and exploration methods to enhance the efficiency and the quality of generation. Moreover, it includes scoring and ranking features to improve candidate selection. Our approach outperforms competitive baselines on three complex arithmetic and commonsense reasoning tasks by 6% on average. Our model generalizes well to longer, unseen reasoning chains, reflecting similar complexities to beam search with large branching factors.
Calibrated Language Models Must Hallucinate
Recent language models have a mysterious tendency to generate false but plausible-sounding text. Such "hallucinations" are an obstacle to the usability of language-based AI systems and can harm people who rely upon their outputs. This work shows shows that there is an inherent statistical reason that pretrained language models hallucinate certain types of facts, having nothing to do with the transformer LM architecture or data quality. For "arbitrary" facts whose veracity cannot be determined from the training data, we show that hallucination is necessary for language models that satisfy a statistical calibration condition appropriate for generative language models. Specifically, if the maximum probability of any fact is bounded, we show that the probability of generating a hallucination is close to the fraction of facts that occur exactly once in the training data (a "Good-Turing" estimate), even assuming ideal training data without errors. One conclusion is that models pretrained to be sufficiently good predictors (i.e., calibrated) may require post-training to mitigate hallucinations on the type of arbitrary facts that tend to appear once in the training set. However, our analysis also suggests that there is no statistical reason that pretraining will lead to hallucination on facts that tend to appear more than once in the training data (like references to publications such as articles and books, whose hallucinations have been particularly notable and problematic) or on systematic facts (like arithmetic calculations). Therefore, different architectures and learning algorithms may mitigate these latter types of hallucinations.
WALLABY Pilot Survey & ASymba: Comparing HI Detection Asymmetries to the SIMBA Simulation
An avenue for understanding cosmological galaxy formation is to compare morphometric parameters in observations and simulations of galaxy assembly. In this second paper of the ASymba: Asymmetries of HI in SIMBA Galaxies series, we measure atomic gas HI asymmetries in spatially-resolved detections from the untargetted WALLABY survey, and compare them to realizations of WALLABY-like mock samples from the SIMBA cosmological simulations. We develop a Scanline Tracing method to create mock galaxy HI datacubes which minimizes shot noise along the spectral dimension compared to particle-based methods, and therefore spurious asymmetry contributions. We compute 1D and 3D asymmetries for spatially-resolved WALLABY Pilot Survey detections, and find that the highest 3D asymmetries A3D>0.5 stem from interacting systems or detections with strong bridges or tails. We then construct a series of WALLABY-like mock realizations drawn from the SIMBA 50 Mpc simulation volume, and compare their asymmetry distributions. We find that the incidence of high A3D detections is higher in WALLABY than in the SIMBA mocks, but that difference is not statistically significant (p-value = 0.05). The statistical power of quantitative comparisons of asymmetries such as the one presented here will improve as the WALLABY survey progresses, and as simulation volumes and resolutions increase.
Forecasting Thermoacoustic Instabilities in Liquid Propellant Rocket Engines Using Multimodal Bayesian Deep Learning
The 100 MW cryogenic liquid oxygen/hydrogen multi-injector combustor BKD operated by the DLR Institute of Space Propulsion is a research platform that allows the study of thermoacoustic instabilities under realistic conditions, representative of small upper stage rocket engines. We use data from BKD experimental campaigns in which the static chamber pressure and fuel-oxidizer ratio are varied such that the first tangential mode of the combustor is excited under some conditions. We train an autoregressive Bayesian neural network model to forecast the amplitude of the dynamic pressure time series, inputting multiple sensor measurements (injector pressure/ temperature measurements, static chamber pressure, high-frequency dynamic pressure measurements, high-frequency OH* chemiluminescence measurements) and future flow rate control signals. The Bayesian nature of our algorithms allows us to work with a dataset whose size is restricted by the expense of each experimental run, without making overconfident extrapolations. We find that the networks are able to accurately forecast the evolution of the pressure amplitude and anticipate instability events on unseen experimental runs 500 milliseconds in advance. We compare the predictive accuracy of multiple models using different combinations of sensor inputs. We find that the high-frequency dynamic pressure signal is particularly informative. We also use the technique of integrated gradients to interpret the influence of different sensor inputs on the model prediction. The negative log-likelihood of data points in the test dataset indicates that predictive uncertainties are well-characterized by our Bayesian model and simulating a sensor failure event results as expected in a dramatic increase in the epistemic component of the uncertainty.
PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave\&acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5\%, 42.2\%, and 45.8\% accuracy respectively-performance gaps exceeding 29\% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely-used toolkits such as VLMEvalKit, enabling one-click evaluation.
Applications of Machine Learning to Lattice Quantum Field Theory
There is great potential to apply machine learning in the area of numerical lattice quantum field theory, but full exploitation of that potential will require new strategies. In this white paper for the Snowmass community planning process, we discuss the unique requirements of machine learning for lattice quantum field theory research and outline what is needed to enable exploration and deployment of this approach in the future.
Dynamical Model of J/Ψ photo-production on the nucleon
A dynamical model based on a phenomenological charm quark-nucleon(c-N) potential v_{cN} and the Pomeron-exchange mechanism is constructed to investigate the J/Psi photo-production on the nucleon from threshold to invariant mass W=300 GeV. The J/Psi-N potential,V_{J/Psi N}(r),is constructed by folding v_{cN} into the wavefunction Phi_{J/Psi}(cc) of J/Psi within a Constituent Quark Model(CQM) of Ref.[43]. A photo-production amplitude is also generated by v_{cN} by a cc-loop integration over the gammarightarrow cc vertex function and Phi_{J/Psi}(cc). No commonly used Vector Meson Dominance assumption is used to define this photo-production amplitude which is needed to describe the data near the threshold. The potential v_{cN}(r) is parameterized in a form such that the predicted V_{J/Psi N}(r) at large distances has the same Yukawa potential form extracted from a Lattice QCD(LQCD) calculation of Ref.[18]. The parameters of v_{cN} are determined by fitting the total cross section data of JLab by performing calculations that include J/Psi-N final state interactions(FSI). The resulting differential cross sections are found in good agreements with the data. It is shown that the FSI effects dominate the cross section in the very near threshold region, allowing for sensitive testing of the predicted J/Psi-N scattering amplitudes. By imposing the constraints of J/Psi-N potential extracted from the LQCD calculation, we have obtained three J/Psi-N potentials which fit the JLab data equally well. The resulting J/Psi-N scattering lengths are in the range of a=(-0.05 fm sim -0.25 fm). With the determined v_{cN}(r) and the wavefunctions generated from the same CQM, the constructed model is used to predict the cross sections of photo-production of eta_c(1S) and Psi(2S) mesons for future experimental tests.
I Don't Know: Explicit Modeling of Uncertainty with an [IDK] Token
Large Language Models are known to capture real-world knowledge, allowing them to excel in many downstream tasks. Despite recent advances, these models are still prone to what are commonly known as hallucinations, causing them to emit unwanted and factually incorrect text. In this work, we propose a novel calibration method that can be used to combat hallucinations. We add a special [IDK] ("I don't know") token to the model's vocabulary and introduce an objective function that shifts probability mass to the [IDK] token for incorrect predictions. This approach allows the model to express uncertainty in its output explicitly. We evaluate our proposed method across multiple model architectures and factual downstream tasks. We find that models trained with our method are able to express uncertainty in places where they would previously make mistakes while suffering only a small loss of encoded knowledge. We further perform extensive ablation studies of multiple variations of our approach and provide a detailed analysis of the precision-recall tradeoff of our method.
Distinguishing Ignorance from Error in LLM Hallucinations
Large language models (LLMs) are susceptible to hallucinations-outputs that are ungrounded, factually incorrect, or inconsistent with prior generations. We focus on close-book Question Answering (CBQA), where previous work has not fully addressed the distinction between two possible kinds of hallucinations, namely, whether the model (1) does not hold the correct answer in its parameters or (2) answers incorrectly despite having the required knowledge. We argue that distinguishing these cases is crucial for detecting and mitigating hallucinations. Specifically, case (2) may be mitigated by intervening in the model's internal computation, as the knowledge resides within the model's parameters. In contrast, in case (1) there is no parametric knowledge to leverage for mitigation, so it should be addressed by resorting to an external knowledge source or abstaining. To help distinguish between the two cases, we introduce Wrong Answer despite having Correct Knowledge (WACK), an approach for constructing model-specific datasets for the second hallucination type. Our probing experiments indicate that the two kinds of hallucinations are represented differently in the model's inner states. Next, we show that datasets constructed using WACK exhibit variations across models, demonstrating that even when models share knowledge of certain facts, they still vary in the specific examples that lead to hallucinations. Finally, we show that training a probe on our WACK datasets leads to better hallucination detection of case (2) hallucinations than using the common generic one-size-fits-all datasets. The code is available at https://github.com/technion-cs-nlp/hallucination-mitigation .
Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis
With the exponential growth of research facilitated by modern technology and improved accessibility, scientific discoveries have become increasingly fragmented within and across fields. This makes it challenging to assess the significance, novelty, incremental findings, and equivalent ideas between related works, particularly those from different research communities. Large language models (LLMs) have recently demonstrated strong quantitative and qualitative reasoning abilities, and multi-agent LLM debates have shown promise in handling complex reasoning tasks by exploring diverse perspectives and reasoning paths. Inspired by this, we introduce Tree-of-Debate (ToD), a framework which converts scientific papers into LLM personas that debate their respective novelties. To emphasize structured, critical reasoning rather than focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling fine-grained analysis of independent novelty arguments within scholarly articles. Through experiments on scientific literature across various domains, evaluated by expert researchers, we demonstrate that ToD generates informative arguments, effectively contrasts papers, and supports researchers in their literature review.
Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models
As Vision-Language Models (VLMs) grow in sophistication, their ability to perform reasoning is coming under increasing supervision. While they excel at many tasks, their grasp of fundamental scientific principles, such as physics, remains an underexplored frontier. To reflect the advancements in these capabilities, we introduce a novel and accessible framework designed to rigorously evaluate VLMs on their understanding of 2D physics. Our framework features a pragmatic scenario generator that creates a diverse testbed of over 400 problems across four core domains: Projectile Motion, Collision Dynamics, Mechanics, and Fluid Dynamics. Through comprehensive evaluation of four state-of-the-art VLMs, we demonstrate a strong correlation between model scale and reasoning ability, with our top-performing model, Qwen2.5-VL-7B, achieving an overall score of 0.815. We find that while models excel at formulaic problems, they struggle significantly with domains requiring abstract spatial reasoning. By designing this framework, we aim to democratize the study of scientific reasoning in VLMs and foster deeper insights into their capabilities and limitations.
Orbits and Dynamical Masses for Six Binary Systems in the Hyades Cluster
We report long baseline interferometric observations with the CHARA Array that resolve six previously known double-lined spectroscopic binary systems in the Hyades cluster, with orbital periods ranging from 3 to 358 days: HD 27483, HD 283882, HD 26874, HD 27149, HD 30676, and HD 28545. We combine those observations with new and existing radial-velocity measurements, to infer the dynamical masses for the components as well as the orbital parallaxes. For most stars the masses are determined to better than 1%. Our work significantly increases the number of systems with mass determinations in the cluster. We find that while current models of stellar evolution for the age and metallicity of the Hyades are able to reproduce the overall shape of the empirical mass-luminosity relation, they overestimate the V-band fluxes by about 0.1 mag between 0.5 and 1.4 M_{odot}. The disagreement is smaller in H, and near zero in K, and depends somewhat on the model. We also make use of the TESS light curves to estimate rotation periods for our targets, and detect numerous flares in one of them (HD 283882), estimating an average flaring rate of 0.44 events per day.
God(s) Know(s): Developmental and Cross-Cultural Patterns in Children Drawings
This paper introduces a novel approach to data analysis designed for the needs of specialists in psychology of religion. We detect developmental and cross-cultural patterns in children's drawings of God(s) and other supernatural agents. We develop methods to objectively evaluate our empirical observations of the drawings with respect to: (1) the gravity center, (2) the average intensities of the colors green and yellow, (3) the use of different colors (palette) and (4) the visual complexity of the drawings. We find statistically significant differences across ages and countries in the gravity centers and in the average intensities of colors. These findings support the hypotheses of the experts and raise new questions for further investigation.
MAGIC: Near-Optimal Data Attribution for Deep Learning
The goal of predictive data attribution is to estimate how adding or removing a given set of training datapoints will affect model predictions. In convex settings, this goal is straightforward (i.e., via the infinitesimal jackknife). In large-scale (non-convex) settings, however, existing methods are far less successful -- current methods' estimates often only weakly correlate with ground truth. In this work, we present a new data attribution method (MAGIC) that combines classical methods and recent advances in metadifferentiation to (nearly) optimally estimate the effect of adding or removing training data on model predictions.
Elliptical orbits in the phase-space quantization
The energy levels of hydrogen-like atoms are obtained from the phase-space quantization, one of the pillars of the old quantum theory, by three different methods - (i) direct integration, (ii) Sommerfeld's original method, and (iii) complex integration. The difficulties come from the imposition of elliptical orbits to the electron, resulting in a variable radial component of the linear momentum. Details of the calculation, which constitute a recurrent gap in textbooks that deal with phase-space quantization, are shown in depth in an accessible fashion for students of introductory quantum mechanics courses.
A differentiable binary microlensing model using adaptive contour integration method
We present microlux, which is a Jax-based code that can compute the binary microlensing light curve and its derivatives both efficiently and accurately. The key feature of microlux is the implementation of a modified version of the adaptive sampling algorithm that was originally proposed by V. Bozza to account for the finite-source effect most efficiently. The efficiency and accuracy of microlux have been verified across the relevant parameter space for binary microlensing. As a differentiable code, microlux makes it possible to apply gradient-based algorithms to the search and posterior estimation of the microlensing modeling. As an example, we use microlux to model a real microlensing event and infer the model posterior via both Fisher information matrix and Hamiltonian Monte Carlo, neither of which would have been possible without the access to accurate model gradients.
A search for extremely-high-energy neutrinos and first constraints on the ultra-high-energy cosmic-ray proton fraction with IceCube
We present a search for the diffuse extremely-high-energy neutrino flux using 12.6 years of IceCube data. The non-observation of neutrinos with energies well above 10 , PeV constrains the all-flavor neutrino flux at 10^{18} , eV to a level of E^2 Phi_{nu_e + nu_mu + nu_tau} simeq 10^{-8} , GeV , cm^{-2} , s^{-1} , sr^{-1}, the most stringent limit to date. Using this data, we constrain the proton fraction of ultra-high-energy cosmic rays (UHECRs) above simeq 30 , EeV to be lesssim 70,% (at 90,% CL) if the cosmological evolution of the sources is comparable to or stronger than the star formation rate. This result complements direct air-shower measurements by being insensitive to uncertainties associated with hadronic interaction models. It is the first such result to disfavor the ``proton-only" hypothesis for UHECRs using neutrino data.
Predictable Compression Failures: Why Language Models Actually Hallucinate
Large language models perform near-Bayesian inference yet violate permutation invariance on exchangeable data. We resolve this by showing transformers minimize expected conditional description length (cross-entropy) over orderings, E_pi[ell(Y mid Gamma_pi(X))], which admits a Kolmogorov-complexity interpretation up to additive constants, rather than the permutation-invariant description length ell(Y mid X). This makes them Bayesian in expectation, not in realization. We derive (i) a Quantified Martingale Violation bound showing order-induced deviations scale as O(log n) with constants; (ii) the Expectation-level Decompression Law linking information budgets to reliability for Bernoulli predicates; and (iii) deployable planners (B2T/RoH/ISR) for answer/abstain decisions. Empirically, permutation dispersion follows a+bln n (Qwen2-7B b approx 0.377, Llama-3.1-8B b approx 0.147); permutation mixtures improve ground-truth likelihood/accuracy; and randomized dose-response shows hallucinations drop by sim 0.13 per additional nat. A pre-specified audit with a fixed ISR=1.0 achieves near-0\% hallucinations via calibrated refusal at 24\% abstention. The framework turns hallucinations into predictable compression failures and enables principled information budgeting.
Dependent Bayesian Lenses: Categories of Bidirectional Markov Kernels with Canonical Bayesian Inversion
We generalise an existing construction of Bayesian Lenses to admit lenses between pairs of objects where the backwards object is dependent on states on the forwards object (interpreted as probability distributions). This gives a natural setting for studying stochastic maps with Bayesian inverses restricted to the points supported by a given prior. In order to state this formally we develop a proposed definition by Fritz of a support object in a Markov category and show that these give rise to a section into the category of dependent Bayesian lenses encoding a more canonical notion of Bayesian inversion.
Approaching an unknown communication system by latent space exploration and causal inference
This paper proposes a methodology for discovering meaningful properties in data by exploring the latent space of unsupervised deep generative models. We combine manipulation of individual latent variables to extreme values with methods inspired by causal inference into an approach we call causal disentanglement with extreme values (CDEV) and show that this method yields insights for model interpretability. With this, we can test for what properties of unknown data the model encodes as meaningful, using it to glean insight into the communication system of sperm whales (Physeter macrocephalus), one of the most intriguing and understudied animal communication systems. The network architecture used has been shown to learn meaningful representations of speech; here, it is used as a learning mechanism to decipher the properties of another vocal communication system in which case we have no ground truth. The proposed methodology suggests that sperm whales encode information using the number of clicks in a sequence, the regularity of their timing, and audio properties such as the spectral mean and the acoustic regularity of the sequences. Some of these findings are consistent with existing hypotheses, while others are proposed for the first time. We also argue that our models uncover rules that govern the structure of units in the communication system and apply them while generating innovative data not shown during training. This paper suggests that an interpretation of the outputs of deep neural networks with causal inference methodology can be a viable strategy for approaching data about which little is known and presents another case of how deep learning can limit the hypothesis space. Finally, the proposed approach can be extended to other architectures and datasets.
Annotation Artifacts in Natural Language Inference Data
Large-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise), and asking them to generate three new sentences (hypotheses) that it entails, contradicts, or is logically neutral with respect to. We show that, in a significant portion of such data, this protocol leaves clues that make it possible to identify the label by looking only at the hypothesis, without observing the premise. Specifically, we show that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI (Bowman et. al, 2015) and 53% of MultiNLI (Williams et. al, 2017). Our analysis reveals that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes. Our findings suggest that the success of natural language inference models to date has been overestimated, and that the task remains a hard open problem.
Trend-Based SAC Beam Control Method with Zero-Shot in Superconducting Linear Accelerator
The superconducting linear accelerator is a highly flexiable facility for modern scientific discoveries, necessitating weekly reconfiguration and tuning. Accordingly, minimizing setup time proves essential in affording users with ample experimental time. We propose a trend-based soft actor-critic(TBSAC) beam control method with strong robustness, allowing the agents to be trained in a simulated environment and applied to the real accelerator directly with zero-shot. To validate the effectiveness of our method, two different typical beam control tasks were performed on China Accelerator Facility for Superheavy Elements (CAFe II) and a light particle injector(LPI) respectively. The orbit correction tasks were performed in three cryomodules in CAFe II seperately, the time required for tuning has been reduced to one-tenth of that needed by human experts, and the RMS values of the corrected orbit were all less than 1mm. The other transmission efficiency optimization task was conducted in the LPI, our agent successfully optimized the transmission efficiency of radio-frequency quadrupole(RFQ) to over 85% within 2 minutes. The outcomes of these two experiments offer substantiation that our proposed TBSAC approach can efficiently and effectively accomplish beam commissioning tasks while upholding the same standard as skilled human experts. As such, our method exhibits potential for future applications in other accelerator commissioning fields.
The Compositional Structure of Bayesian Inference
Bayes' rule tells us how to invert a causal process in order to update our beliefs in light of new evidence. If the process is believed to have a complex compositional structure, we may observe that the inversion of the whole can be computed piecewise in terms of the component processes. We study the structure of this compositional rule, noting that it relates to the lens pattern in functional programming. Working in a suitably general axiomatic presentation of a category of Markov kernels, we see how we can think of Bayesian inversion as a particular instance of a state-dependent morphism in a fibred category. We discuss the compositional nature of this, formulated as a functor on the underlying category and explore how this can used for a more type-driven approach to statistical inference.
Verif.ai: Towards an Open-Source Scientific Generative Question-Answering System with Referenced and Verifiable Answers
In this paper, we present the current progress of the project Verif.ai, an open-source scientific generative question-answering system with referenced and verified answers. The components of the system are (1) an information retrieval system combining semantic and lexical search techniques over scientific papers (PubMed), (2) a fine-tuned generative model (Mistral 7B) taking top answers and generating answers with references to the papers from which the claim was derived, and (3) a verification engine that cross-checks the generated claim and the abstract or paper from which the claim was derived, verifying whether there may have been any hallucinations in generating the claim. We are reinforcing the generative model by providing the abstract in context, but in addition, an independent set of methods and models are verifying the answer and checking for hallucinations. Therefore, we believe that by using our method, we can make scientists more productive, while building trust in the use of generative language models in scientific environments, where hallucinations and misinformation cannot be tolerated.
Neural network emulator to constrain the high-z IGM thermal state from Lyman-α forest flux auto-correlation function
We present a neural network emulator to constrain the thermal parameters of the intergalactic medium (IGM) at 5.4z6.0 using the Lyman-displaystylealpha (Lydisplaystylealpha) forest flux auto-correlation function. Our auto-differentiable JAX-based framework accelerates the surrogate model generation process using approximately 100 sparsely sampled Nyx hydrodynamical simulations with varying combinations of thermal parameters, i.e., the temperature at mean density T_{{0}}, the slope of the temperaturedisplaystyle-density relation displaystylegamma, and the mean transmission flux langle{F}{rangle}. We show that this emulator has a typical accuracy of 1.0% across the specified redshift range. Bayesian inference of the IGM thermal parameters, incorporating emulator uncertainty propagation, is further expedited using NumPyro Hamiltonian Monte Carlo. We compare both the inference results and computational cost of our framework with the traditional nearest-neighbor interpolation approach applied to the same set of mock Lyalpha flux. By examining the credibility contours of the marginalized posteriors for T_{{0}},gamma,and{langle}{F}{rangle} obtained using the emulator, the statistical reliability of measurements is established through inference on 100 realistic mock data sets of the auto-correlation function.
Shaping Laser Pulses with Reinforcement Learning
High Power Laser (HPL) systems operate in the attoseconds regime -- the shortest timescale ever created by humanity. HPL systems are instrumental in high-energy physics, leveraging ultra-short impulse durations to yield extremely high intensities, which are essential for both practical applications and theoretical advancements in light-matter interactions. Traditionally, the parameters regulating HPL optical performance have been manually tuned by human experts, or optimized using black-box methods that can be computationally demanding. Critically, black box methods rely on stationarity assumptions overlooking complex dynamics in high-energy physics and day-to-day changes in real-world experimental settings, and thus need to be often restarted. Deep Reinforcement Learning (DRL) offers a promising alternative by enabling sequential decision making in non-static settings. This work explores the feasibility of applying DRL to HPL systems, extending the current research by (1) learning a control policy relying solely on non-destructive image observations obtained from readily available diagnostic devices, and (2) retaining performance when the underlying dynamics vary. We evaluate our method across various test dynamics, and observe that DRL effectively enables cross-domain adaptability, coping with dynamics' fluctuations while achieving 90\% of the target intensity in test environments.
Machine learning-driven Anomaly Detection and Forecasting for Euclid Space Telescope Operations
State-of-the-art space science missions increasingly rely on automation due to spacecraft complexity and the costs of human oversight. The high volume of data, including scientific and telemetry data, makes manual inspection challenging. Machine learning offers significant potential to meet these demands. The Euclid space telescope, in its survey phase since February 2024, exemplifies this shift. Euclid's success depends on accurate monitoring and interpretation of housekeeping telemetry and science-derived data. Thousands of telemetry parameters, monitored as time series, may or may not impact the quality of scientific data. These parameters have complex interdependencies, often due to physical relationships (e.g., proximity of temperature sensors). Optimising science operations requires careful anomaly detection and identification of hidden parameter states. Moreover, understanding the interactions between known anomalies and physical quantities is crucial yet complex, as related parameters may display anomalies with varied timing and intensity. We address these challenges by analysing temperature anomalies in Euclid's telemetry from February to August 2024, focusing on eleven temperature parameters and 35 covariates. We use a predictive XGBoost model to forecast temperatures based on historical values, detecting anomalies as deviations from predictions. A second XGBoost model predicts anomalies from covariates, capturing their relationships to temperature anomalies. We identify the top three anomalies per parameter and analyse their interactions with covariates using SHAP (Shapley Additive Explanations), enabling rapid, automated analysis of complex parameter relationships. Our method demonstrates how machine learning can enhance telemetry monitoring, offering scalable solutions for other missions with similar data challenges.
Shadow Cones: A Generalized Framework for Partial Order Embeddings
Hyperbolic space has proven to be well-suited for capturing hierarchical relations in data, such as trees and directed acyclic graphs. Prior work introduced the concept of entailment cones, which uses partial orders defined by nested cones in the Poincar\'e ball to model hierarchies. Here, we introduce the ``shadow cones" framework, a physics-inspired entailment cone construction. Specifically, we model partial orders as subset relations between shadows formed by a light source and opaque objects in hyperbolic space. The shadow cones framework generalizes entailment cones to a broad class of formulations and hyperbolic space models beyond the Poincar\'e ball. This results in clear advantages over existing constructions: for example, shadow cones possess better optimization properties over constructions limited to the Poincar\'e ball. Our experiments on datasets of various sizes and hierarchical structures show that shadow cones consistently and significantly outperform existing entailment cone constructions. These results indicate that shadow cones are an effective way to model partial orders in hyperbolic space, offering physically intuitive and novel insights about the nature of such structures.
Optical Spectroscopy of Classical Be Stars in Old Open Clusters
We performed the optical spectroscopy of 16 classical Be stars in 11 open clusters older than 100 Myr. Ours is the first spectroscopic study of classical Be stars in open clusters older than 100 Myr. We found that the H alpha emission strength of most of the stars is less than 40 Angstrom, in agreement with previous studies. Our analysis further suggests that one of the stars, KW97 35 12, might be a weak H alpha emitter in nature, showing H alpha equivalent width of negative 0.5 Angstrom. Interestingly, we also found that the newly detected classical Be star LS III 47 37b might be a component of the possible visual binary system LS III 47 37, where the other companion is also a classical Be star. Hence, the present study indicates the possible detection of a binary Be system. Moreover, it is observed that all 16 stars exhibit a lesser number of emission lines compared to classical Be stars younger than 100 Myr. Furthermore, the spectral type distribution analysis of B type and classical Be stars for the selected clusters points out that the existence of CBe stars can depend on the spectral type distribution of B type stars present in these clusters.
Multi-marginal Schrödinger Bridges with Iterative Reference Refinement
Practitioners frequently aim to infer an unobserved population trajectory using sample snapshots at multiple time points. For instance, in single-cell sequencing, scientists would like to learn how gene expression evolves over time. But sequencing any cell destroys that cell. So we cannot access any cell's full trajectory, but we can access snapshot samples from many cells. Stochastic differential equations are commonly used to analyze systems with full individual-trajectory access; since here we have only sample snapshots, these methods are inapplicable. The deep learning community has recently explored using Schr\"odinger bridges (SBs) and their extensions to estimate these dynamics. However, these methods either (1) interpolate between just two time points or (2) require a single fixed reference dynamic within the SB, which is often just set to be Brownian motion. But learning piecewise from adjacent time points can fail to capture long-term dependencies. And practitioners are typically able to specify a model class for the reference dynamic but not the exact values of the parameters within it. So we propose a new method that (1) learns the unobserved trajectories from sample snapshots across multiple time points and (2) requires specification only of a class of reference dynamics, not a single fixed one. In particular, we suggest an iterative projection method inspired by Schr\"odinger bridges; we alternate between learning a piecewise SB on the unobserved trajectories and using the learned SB to refine our best guess for the dynamics within the reference class. We demonstrate the advantages of our method via a well-known simulated parametric model from ecology, simulated and real data from systems biology, and real motion-capture data.
Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily reflect experimental design limitations rather than fundamental reasoning failures. Our analysis reveals three critical issues: (1) Tower of Hanoi experiments systematically exceed model output token limits at reported failure points, with models explicitly acknowledging these constraints in their outputs; (2) The authors' automated evaluation framework fails to distinguish between reasoning failures and practical constraints, leading to misclassification of model capabilities; (3) Most concerningly, their River Crossing benchmarks include mathematically impossible instances for N > 5 due to insufficient boat capacity, yet models are scored as failures for not solving these unsolvable problems. When we control for these experimental artifacts, by requesting generating functions instead of exhaustive move lists, preliminary experiments across multiple models indicate high accuracy on Tower of Hanoi instances previously reported as complete failures. These findings highlight the importance of careful experimental design when evaluating AI reasoning capabilities.
Foundation Models for Scientific Discovery: From Paradigm Enhancement to Paradigm Transition
Foundation models (FMs), such as GPT-4 and AlphaFold, are reshaping the landscape of scientific research. Beyond accelerating tasks such as hypothesis generation, experimental design, and result interpretation, they prompt a more fundamental question: Are FMs merely enhancing existing scientific methodologies, or are they redefining the way science is conducted? In this paper, we argue that FMs are catalyzing a transition toward a new scientific paradigm. We introduce a three-stage framework to describe this evolution: (1) Meta-Scientific Integration, where FMs enhance workflows within traditional paradigms; (2) Hybrid Human-AI Co-Creation, where FMs become active collaborators in problem formulation, reasoning, and discovery; and (3) Autonomous Scientific Discovery, where FMs operate as independent agents capable of generating new scientific knowledge with minimal human intervention. Through this lens, we review current applications and emerging capabilities of FMs across existing scientific paradigms. We further identify risks and future directions for FM-enabled scientific discovery. This position paper aims to support the scientific community in understanding the transformative role of FMs and to foster reflection on the future of scientific discovery. Our project is available at https://github.com/usail-hkust/Awesome-Foundation-Models-for-Scientific-Discovery.
Baryonic Effects on Lagrangian Clustering and Angular Momentum Reconstruction
Recent studies illustrate the correlation between the angular momenta of cosmic structures and their Lagrangian properties. However, only baryons are observable and it is unclear whether they reliably trace the cosmic angular momenta. We study the Lagrangian mass distribution, spin correlation, and predictability of dark matter, gas, and stellar components of galaxy-halo systems using IllustrisTNG, and show that the primordial segregations between components are typically small. Their protoshapes are also similar in terms of the statistics of moment of inertia tensors. Under the common gravitational potential they are expected to exert the same tidal torque and the strong spin correlations are not destroyed by the nonlinear evolution and complicated baryonic effects, as confirmed by the high-resolution hydrodynamic simulations. We further show that their late-time angular momenta traced by total gas, stars, or the central galaxies, can be reliably reconstructed by the initial perturbations. These results suggest that baryonic angular momenta can potentially be used in reconstructing the parameters and models related to the initial perturbations.
Constructor Theory of Life
Neo-Darwinian evolutionary theory explains how the appearance of purposive design in the sophisticated adaptations of living organisms can have come about without their intentionally being designed. The explanation relies crucially on the possibility of certain physical processes: mainly, gene replication and natural selection. In this paper I show that for those processes to be possible without the design of biological adaptations being encoded in the laws of physics, those laws must have certain other properties. The theory of what these properties are is not part of evolution theory proper, and has not been developed, yet without it the neo-Darwinian theory does not fully achieve its purpose of explaining the appearance of design. To this end I apply Constructor Theory's new mode of explanation to provide an exact formulation of the appearance of design, of no-design laws, and of the logic of self-reproduction and natural selection, within fundamental physics. I conclude that self-reproduction, replication and natural selection are possible under no-design laws, the only non-trivial condition being that they allow digital information to be physically instantiated. This has an exact characterisation in the constructor theory of information. I also show that under no-design laws an accurate replicator requires the existence of a "vehicle" constituting, together with the replicator, a self-reproducer.
A 2.4% Determination of the Local Value of the Hubble Constant
We use the Wide Field Camera 3 (WFC3) on the Hubble Space Telescope (HST) to reduce the uncertainty in the local value of the Hubble constant (H_0) from 3.3% to 2.4%. Improvements come from new, near-infrared observations of Cepheid variables in 11 new hosts of recent SNe~Ia, more than doubling the sample of SNe~Ia having a Cepheid-calibrated distance for a total of 19; these leverage the magnitude-z relation based on 300 SNe~Ia at z<0.15. All 19 hosts and the megamaser system NGC4258 were observed with WFC3, thus nullifying cross-instrument zeropoint errors. Other improvements include a 33% reduction in the systematic uncertainty in the maser distance to NGC4258, more Cepheids and a more robust distance to the LMC from late-type DEBs, HST observations of Cepheids in M31, and new HST-based trigonometric parallaxes for Milky Way (MW) Cepheids. We consider four geometric distance calibrations of Cepheids: (i) megamasers in NGC4258, (ii) 8 DEBs in the LMC, (iii) 15 MW Cepheids with parallaxes, and (iv) 2 DEBs in M31. H_0 from each is 72.25+/-2.51, 72.04+/-2.67, 76.18+/-2.37, and 74.50+/-3.27 km/sec/Mpc, respectively. Our best estimate of 73.24+/-1.74 km/sec/Mpc combines the anchors NGC4258, MW, and LMC, and includes systematic errors for a final uncertainty of 2.4%. This value is 3.4 sigma higher than 66.93+/-0.62 km/sec/Mpc predicted by LambdaCDM with 3 neutrinos with mass 0.06 eV and the Planck data, but reduces to 2.1 sigma relative to the prediction of 69.3+/-0.7 km/sec/Mpc with the combination of WMAP+ACT+SPT+BAO, suggesting systematic uncertainties in CMB measurements may play a role in the tension. If we take the conflict between Planck and H_0 at face value, one plausible explanation could involve an additional source of dark radiation in the early Universe in the range of Delta N_eff=0.4-1. We anticipate significant improvements in H_0 from upcoming parallax measurements.
O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning
Building upon our previous investigations of O1 replication (Part 1: Journey Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]), this work explores the potential of inference-time scaling in large language models (LLMs) for medical reasoning tasks, ranging from diagnostic decision-making to treatment planning. Through extensive experiments on medical benchmarks of varying complexity (MedQA, Medbullets, and JAMA Clinical Challenges), our investigation reveals several key insights: (1) Increasing inference time does lead to improved performance. With a modest training set of 500 samples, our model yields substantial performance improvements of 6%-11%. (2) Task complexity directly correlates with the required length of reasoning chains, confirming the necessity of extended thought processes for challenging problems. (3) The differential diagnoses generated by our model adhere to the principles of the hypothetico-deductive method, producing a list of potential conditions that may explain a patient's symptoms and systematically narrowing these possibilities by evaluating the evidence. These findings demonstrate the promising synergy between inference-time scaling and journey learning in advancing LLMs' real-world clinical reasoning capabilities.
Blackbox Model Provenance via Palimpsestic Membership Inference
Suppose Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice's model to produce text. Can Alice prove that Bob is using her model, either by querying Bob's derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem--in which the null hypothesis is that Bob's model or text is independent of Alice's randomized training run--and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice's model using test statistics that capture correlation between Bob's model or text and the ordering of training examples in Alice's training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice's training data. In the query setting, we directly estimate (via prompting) the likelihood Bob's model gives to Alice's training examples and order; we correlate the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to 12B parameters with the base model's training data order, achieving a p-value on the order of at most 1e-8 in all but six cases. In the observational setting, we try two approaches based on estimating 1) the likelihood of Bob's text overlapping with spans of Alice's training examples and 2) the likelihood of Bob's text with respect to different versions of Alice's model we obtain by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably distinguish Bob's text from as little as a few hundred tokens; the first does not involve any retraining but requires many more tokens (several hundred thousand) to achieve high power.
Modeling Performance of Data Collection Systems for High-Energy Physics
Exponential increases in scientific experimental data are outstripping the rate of progress in silicon technology. As a result, heterogeneous combinations of architectures and process or device technologies are increasingly important to meet the computing demands of future scientific experiments. However, the complexity of heterogeneous computing systems requires systematic modeling to understand performance. We present a model which addresses this need by framing key aspects of data collection pipelines and constraints, and combines them with the important vectors of technology that shape alternatives, computing metrics that allow complex alternatives to be compared. For instance, a data collection pipeline may be characterized by parameters such as sensor sampling rates, amount of data collected, and the overall relevancy of retrieved samples. Alternatives to this pipeline are enabled by hardware development vectors including advancing CMOS, GPUs, neuromorphic computing, and edge computing. By calculating metrics for each alternative such as overall F1 score, power, hardware cost, and energy expended per relevant sample, this model allows alternate data collection systems to be rigorously compared. To demonstrate this model's capability, we apply it to the CMS experiment (and planned HL-LHC upgrade) to evaluate and compare the application of novel technologies in the data acquisition system (DAQ). We demonstrate that improvements to early stages in the DAQ are highly beneficial, greatly reducing the resources required at later stages of processing (such as a 60% power reduction) and increasing the amount of relevant data retrieved from the experiment per unit power (improving from 0.065 to 0.31 samples/kJ) However, we predict further advances will be required in order to meet overall power and cost constraints for the DAQ.
Knowledge Graph in Astronomical Research with Large Language Models: Quantifying Driving Forces in Interdisciplinary Scientific Discovery
Identifying and predicting the factors that contribute to the success of interdisciplinary research is crucial for advancing scientific discovery. However, there is a lack of methods to quantify the integration of new ideas and technological advancements in astronomical research and how these new technologies drive further scientific breakthroughs. Large language models, with their ability to extract key concepts from vast literature beyond keyword searches, provide a new tool to quantify such processes. In this study, we extracted concepts in astronomical research from 297,807 publications between 1993 and 2024 using large language models, resulting in a set of 24,939 concepts. These concepts were then used to form a knowledge graph, where the link strength between any two concepts was determined by their relevance through the citation-reference relationships. By calculating this relevance across different time periods, we quantified the impact of numerical simulations and machine learning on astronomical research. The knowledge graph demonstrates two phases of development: a phase where the technology was integrated and another where the technology was explored in scientific discovery. The knowledge graph reveals that despite machine learning has made much inroad in astronomy, there is currently a lack of new concept development at the intersection of AI and Astronomy, which may be the current bottleneck preventing machine learning from further transforming the field of astronomy.
Topological Obstructions to Autoencoding
Autoencoders have been proposed as a powerful tool for model-independent anomaly detection in high-energy physics. The operating principle is that events which do not belong to the space of training data will be reconstructed poorly, thus flagging them as anomalies. We point out that in a variety of examples of interest, the connection between large reconstruction error and anomalies is not so clear. In particular, for data sets with nontrivial topology, there will always be points that erroneously seem anomalous due to global issues. Conversely, neural networks typically have an inductive bias or prior to locally interpolate such that undersampled or rare events may be reconstructed with small error, despite actually being the desired anomalies. Taken together, these facts are in tension with the simple picture of the autoencoder as an anomaly detector. Using a series of illustrative low-dimensional examples, we show explicitly how the intrinsic and extrinsic topology of the dataset affects the behavior of an autoencoder and how this topology is manifested in the latent space representation during training. We ground this analysis in the discussion of a mock "bump hunt" in which the autoencoder fails to identify an anomalous "signal" for reasons tied to the intrinsic topology of n-particle phase space.
Audio Entailment: Assessing Deductive Reasoning for Audio Understanding
Recent literature uses language to build foundation models for audio. These Audio-Language Models (ALMs) are trained on a vast number of audio-text pairs and show remarkable performance in tasks including Text-to-Audio Retrieval, Captioning, and Question Answering. However, their ability to engage in more complex open-ended tasks, like Interactive Question-Answering, requires proficiency in logical reasoning -- a skill not yet benchmarked. We introduce the novel task of Audio Entailment to evaluate an ALM's deductive reasoning ability. This task assesses whether a text description (hypothesis) of audio content can be deduced from an audio recording (premise), with potential conclusions being entailment, neutral, or contradiction, depending on the sufficiency of the evidence. We create two datasets for this task with audio recordings sourced from two audio captioning datasets -- AudioCaps and Clotho -- and hypotheses generated using Large Language Models (LLMs). We benchmark state-of-the-art ALMs and find deficiencies in logical reasoning with both zero-shot and linear probe evaluations. Finally, we propose "caption-before-reason", an intermediate step of captioning that improves the zero-shot and linear-probe performance of ALMs by an absolute 6% and 3%, respectively.
Stellar evolution and axion-like particles: new constraints and hints from globular clusters in the GAIA DR3 data
Axion-like particles (ALPs) are hypothetical pseudoscalar bosons, natural in extensions of the Standard Model. Their interactions with ordinary matter and radiation are suppressed, making it challenging to detect them in laboratory experiments. However, these particles, produced within stellar interiors, can provide an additional mechanism for energy loss, potentially influencing stellar evolution. Prominent methods for searching for such effects involve measuring the properties of red giants and helium-burning stars in globular clusters (GCs). Here we use published catalogs of stars selected as members of seven GCs on the basis of parallaxes and proper motions measured by Gaia (Data Realease 3). Making use of previously derived theoretical relations and the new data, we find the upper limit on the ALP-electron coupling, g_{ae}<5.2*10^{-14} (95% CL), and an indication (3.3 sigma) to nonzero ALP-photon coupling, g_{a\gamma}=(6.5+1.1-1.3)*10^{-11} GeV^{-1}. Given the precision of contemporary observational data, it is imperative to refine ALP constraints through more sophisticated analyses, which will be explored in detail elsewhere.
MSDiagnosis: An EMR-based Dataset for Clinical Multi-Step Diagnosis
Clinical diagnosis is critical in medical practice, typically requiring a continuous and evolving process that includes primary diagnosis, differential diagnosis, and final diagnosis. However, most existing clinical diagnostic tasks are single-step processes, which does not align with the complex multi-step diagnostic procedures found in real-world clinical settings. In this paper, we propose a multi-step diagnostic task and annotate a clinical diagnostic dataset (MSDiagnosis). This dataset includes primary diagnosis, differential diagnosis, and final diagnosis questions. Additionally, we propose a novel and effective framework. This framework combines forward inference, backward inference, reflection, and refinement, enabling the LLM to self-evaluate and adjust its diagnostic results. To assess the effectiveness of our proposed method, we design and conduct extensive experiments. The experimental results demonstrate the effectiveness of the proposed method. We also provide a comprehensive experimental analysis and suggest future research directions for this task.
Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models
Large Language Models have demonstrated remarkable capabilities across diverse tasks, yet they frequently generate hallucinations outputs that are fluent but factually incorrect or unsupported. We propose Counterfactual Probing, a novel approach for detecting and mitigating hallucinations in LLM outputs. Our method dynamically generates counterfactual statements that appear plausible but contain subtle factual errors, then evaluates the model's sensitivity to these perturbations. We hypothesize that genuine knowledge exhibits robustness to counterfactual variations, while hallucinated content shows inconsistent confidence patterns when confronted with plausible alternatives. Our comprehensive evaluation on TruthfulQA, factual statement datasets, and curated hallucination examples demonstrates that counterfactual probing achieves superior detection performance compared to baseline methods, while our adaptive mitigation strategies reduce hallucination scores by an average of 24.5%. The approach requires no model retraining and can be integrated into existing LLM pipelines as a realtime verification mechanism.
Latent Field Discovery In Interacting Dynamical Systems With Neural Fields
Systems of interacting objects often evolve under the influence of field effects that govern their dynamics, yet previous works have abstracted away from such effects, and assume that systems evolve in a vacuum. In this work, we focus on discovering these fields, and infer them from the observed dynamics alone, without directly observing them. We theorize the presence of latent force fields, and propose neural fields to learn them. Since the observed dynamics constitute the net effect of local object interactions and global field effects, recently popularized equivariant networks are inapplicable, as they fail to capture global information. To address this, we propose to disentangle local object interactions -- which are SE(n) equivariant and depend on relative states -- from external global field effects -- which depend on absolute states. We model interactions with equivariant graph networks, and combine them with neural fields in a novel graph network that integrates field forces. Our experiments show that we can accurately discover the underlying fields in charged particles settings, traffic scenes, and gravitational n-body problems, and effectively use them to learn the system and forecast future trajectories.
Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation
Large language models (LLMs) are rapidly transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. However, LLM outputs vary significantly depending on the implementation choices made by researchers (e.g., model selection, prompting strategy, or temperature settings). Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I, Type II, Type S, or Type M errors. We call this LLM hacking. We quantify the risk of LLM hacking by replicating 37 data annotation tasks from 21 published social science research studies with 18 different models. Analyzing 13 million LLM labels, we test 2,361 realistic hypotheses to measure how plausible researcher choices affect statistical conclusions. We find incorrect conclusions based on LLM-annotated data in approximately one in three hypotheses for state-of-the-art models, and in half the hypotheses for small language models. While our findings show that higher task performance and better general model capabilities reduce LLM hacking risk, even highly accurate models do not completely eliminate it. The risk of LLM hacking decreases as effect sizes increase, indicating the need for more rigorous verification of findings near significance thresholds. Our extensive analysis of LLM hacking mitigation techniques emphasizes the importance of human annotations in reducing false positive findings and improving model selection. Surprisingly, common regression estimator correction techniques are largely ineffective in reducing LLM hacking risk, as they heavily trade off Type I vs. Type II errors. Beyond accidental errors, we find that intentional LLM hacking is unacceptably simple. With few LLMs and just a handful of prompt paraphrases, anything can be presented as statistically significant.
Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification
Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that "the gold standard for supporting a mathematical claim is to provide a proof". We propose a retrospective, step-aware formal verification framework Safe. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework Safe across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose FormalStep as a benchmark for step correctness theorem proving with 30,809 formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying natural language content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs.
MLR-Copilot: Autonomous Machine Learning Research based on Large Language Models Agents
Machine learning research, crucial for technological advancements and innovation, often faces significant challenges due to its inherent complexity, slow pace of experimentation, and the necessity for specialized expertise. Motivated by this, we present a new systematic framework, autonomous Machine Learning Research with large language models (MLR-Copilot), designed to enhance machine learning research productivity through the automatic generation and implementation of research ideas using Large Language Model (LLM) agents. The framework consists of three phases: research idea generation, experiment implementation, and implementation execution. First, existing research papers are used to generate hypotheses and experimental plans vis IdeaAgent powered by LLMs. Next, the implementation generation phase translates these plans into executables with ExperimentAgent. This phase leverages retrieved prototype code and optionally retrieves candidate models and data. Finally, the execution phase, also managed by ExperimentAgent, involves running experiments with mechanisms for human feedback and iterative debugging to enhance the likelihood of achieving executable research outcomes. We evaluate our framework on five machine learning research tasks and the experimental results show the framework's potential to facilitate the research progress and innovations.
Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim rightarrow Evidence Reasoning
Large language models (LLMs) are increasingly being used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand and process the intricate relationships within complex research papers, such as the logical links between claims and supporting evidence remains largely unexplored. In this study, we present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs' capabilities in scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three approaches which are inspired by divide and conquer approaches, across six diverse LLMs, highlighting model-specific strengths and weaknesses in scientific comprehension. Through evaluation involving over 300 claim-evidence pairs across multiple research domains, we reveal significant limitations in LLMs' ability to process complex scientific content. Our results demonstrate that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall across claim-evidence identification tasks. Furthermore, strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs' abilities to accurately link dispersed evidence with claims, although this comes at increased computational cost. CLAIM-BENCH sets a new standard for evaluating scientific comprehension in LLMs, offering both a diagnostic tool and a path forward for building systems capable of deeper, more reliable reasoning across full-length papers.
Lyra: Orchestrating Dual Correction in Automated Theorem Proving
Large Language Models (LLMs) present an intriguing avenue for exploration in the field of formal theorem proving. Nevertheless, their full potential, particularly concerning the mitigation of hallucinations and refinement through prover error messages, remains an area that has yet to be thoroughly investigated. To enhance the effectiveness of LLMs in the field, we introduce the Lyra, a new framework that employs two distinct correction mechanisms: Tool Correction (TC) and Conjecture Correction (CC). To implement Tool Correction in the post-processing of formal proofs, we leverage prior knowledge to utilize predefined prover tools (e.g., Sledgehammer) for guiding the replacement of incorrect tools. Tool Correction significantly contributes to mitigating hallucinations, thereby improving the overall accuracy of the proof. In addition, we introduce Conjecture Correction, an error feedback mechanism designed to interact with prover to refine formal proof conjectures with prover error messages. Compared to the previous refinement framework, the proposed Conjecture Correction refines generation with instruction but does not collect paired (generation, error & refinement) prompts. Our method has achieved state-of-the-art (SOTA) performance on both miniF2F validation (48.0% -> 55.3%) and test (45.5% -> 51.2%). We also present 3 IMO problems solved by Lyra. We believe Tool Correction (post-process for hallucination mitigation) and Conjecture Correction (subgoal adjustment from interaction with environment) could provide a promising avenue for future research in this field.
Causal Evidence for the Primordiality of Colors in Trans-Neptunian Objects
The origins of the colors of Trans-Neptunian Objects (TNOs) represent a crucial unresolved question, central to understanding the history of our Solar System. Recent observational surveys have revealed correlations between the eccentricity and inclination of TNOs and their colors. This has rekindled the long-standing debate on whether these colors reflect the conditions of TNO formation or their subsequent collisional evolution. In this study, we address this question with 98.7% certainty, using a model-agnostic, data-driven approach based on causal graphs. First, as a sanity check, we demonstrate how our model can replicate the currently accepted paradigms of TNOs' dynamical history, blindly and without any orbital modeling or physics-based assumptions. In fact, our causal model (with no knowledge of the existence of Neptune) predicts the existence of an unknown perturbing body, i.e., Neptune. We then show how this model predicts, with high certainty, that the color of TNOs is the root cause of their inclination distribution, rather than the other way around. This strongly suggests that the colors of TNOs reflect an underlying dynamical property, most likely their formation location. Moreover, our causal model excludes formation scenarios that invoke substantial color modification by subsequent irradiation. We therefore conclude that the colors of TNOs are predominantly primordial.
MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science
Pre-trained on extensive text and image corpora, current Multi-Modal Large Language Models (MLLM) have shown strong capabilities in general visual reasoning tasks. However, their performance is still lacking in physical domains that require understanding diagrams with complex physical structures and quantitative analysis based on multi-modal information. To address this, we develop a new framework, named Multi-Modal Scientific Reasoning with Physics Perception and Simulation (MAPS) based on an MLLM. MAPS decomposes expert-level multi-modal reasoning task into physical diagram understanding via a Physical Perception Model (PPM) and reasoning with physical knowledge via a simulator. The PPM module is obtained by fine-tuning a visual language model using carefully designed synthetic data with paired physical diagrams and corresponding simulation language descriptions. At the inference stage, MAPS integrates the simulation language description of the input diagram provided by PPM and results obtained through a Chain-of-Simulation process with MLLM to derive the underlying rationale and the final answer. Validated using our collected college-level circuit analysis problems, MAPS significantly improves reasoning accuracy of MLLM and outperforms all existing models. The results confirm MAPS offers a promising direction for enhancing multi-modal scientific reasoning ability of MLLMs. We will release our code, model and dataset used for our experiments upon publishing of this paper.
Sensitivity of BEACON to Ultra-High Energy Diffuse and Transient Neutrinos
Ultra-high energy neutrinos (E>10^{17} eV) can provide insight into the most powerful accelerators in the universe, however their flux is extremely low. The Beamforming Elevated Array for COsmic Neutrinos (BEACON) is a detector concept which efficiently achieves sensitivity to this flux by employing phased radio arrays on mountains, which search for the radio emission of up-going extensive air showers created by Earth-skimming tau neutrinos. Here, we calculate the point-source effective area of BEACON and characterize its sensitivity to transient neutrino fluences with both short (<15 min) and long (> 1 day) durations. Additionally, by integrating the effective area, we provide an updated estimate of the diffuse flux sensitivity. With just 100 stations, BEACON achieves sensitivity to short-duration transients such as nearby short gamma-ray bursts. With 1000 stations, BEACON achieves a sensitivity to long-duration transients, as well as the cosmogenic flux, ten times greater than existing experiments at 1 EeV. With an efficient design optimized for ultrahigh energy neutrinos, BEACON is capable of discovering the sources of neutrinos at the highest energies.
Searching for a Leptophilic Z' and a 3-3-1 symmetry at CLIC
We derive the discovery potential of a leptophilic Z', and a Z' rising from a SU(3) times SU(3)_L times U(1)_N symmetry at the Compact Linear Collider (CLIC), which is planned to host e^+e^- collisions with 3 TeV center-of-mass energy. We perform an optimized selection cut strategy on the transverse momentum, pseudorapidity, and invariant mass of the dileptons in order to enhance the collider sensitivity. We find that CLIC can potentially reach a 5sigma signal of a 1-3~TeV leptophilic Z' with less than 1fb^{-1} of integrated luminosity. As for the Z' belonging to a 3-3-1 symmetry, CLIC will offer a complementary probe with the potential to impose M_{Z^prime} > 3~TeV with L=2fb^{-1}.
Disentangling axion-like particle couplings to nucleons via a delayed signal in Super-Kamiokande from a future supernova
In this work, we show that, if axion-like particles (ALPs) from core-collapse supernovae (SNe) couple to protons, they would produce very characteristic signatures in neutrino water Cherenkov detectors through their scattering off free protons via a , p rightarrow p , gamma interactions. Specifically, sub-MeV ALPs would generate photons with energies sim 30 MeV, which could be observed by Super-Kamiokande and Hyper-Kamiokande as a delayed signal after a future detection of SN neutrinos. We apply this to a hypothetical neighbouring SN (at a maximum distance of 100 kpc) and demonstrate that the region in the parameter space with ALP masses between 10^{-4} MeV and 1 MeV and ALP-proton couplings in the range 3 times 10^{-6}-4 times 10^{-5} could be probed. We argue that this new signature, combined with the one expected at sim 7 MeV from oxygen de-excitation, would allow us to disentangle ALP-neutron and ALP-proton couplings.
Snowmass 2021 Computational Frontier CompF03 Topical Group Report: Machine Learning
The rapidly-developing intersection of machine learning (ML) with high-energy physics (HEP) presents both opportunities and challenges to our community. Far beyond applications of standard ML tools to HEP problems, genuinely new and potentially revolutionary approaches are being developed by a generation of talent literate in both fields. There is an urgent need to support the needs of the interdisciplinary community driving these developments, including funding dedicated research at the intersection of the two fields, investing in high-performance computing at universities and tailoring allocation policies to support this work, developing of community tools and standards, and providing education and career paths for young researchers attracted by the intellectual vitality of machine learning for high energy physics.
A Hierarchical Bayesian Model for Deep Few-Shot Meta Learning
We propose a novel hierarchical Bayesian model for learning with a large (possibly infinite) number of tasks/episodes, which suits well the few-shot meta learning problem. We consider episode-wise random variables to model episode-specific target generative processes, where these local random variables are governed by a higher-level global random variate. The global variable helps memorize the important information from historic episodes while controlling how much the model needs to be adapted to new episodes in a principled Bayesian manner. Within our model framework, the prediction on a novel episode/task can be seen as a Bayesian inference problem. However, a main obstacle in learning with a large/infinite number of local random variables in online nature, is that one is not allowed to store the posterior distribution of the current local random variable for frequent future updates, typical in conventional variational inference. We need to be able to treat each local variable as a one-time iterate in the optimization. We propose a Normal-Inverse-Wishart model, for which we show that this one-time iterate optimization becomes feasible due to the approximate closed-form solutions for the local posterior distributions. The resulting algorithm is more attractive than the MAML in that it is not required to maintain computational graphs for the whole gradient optimization steps per episode. Our approach is also different from existing Bayesian meta learning methods in that unlike dealing with a single random variable for the whole episodes, our approach has a hierarchical structure that allows one-time episodic optimization, desirable for principled Bayesian learning with many/infinite tasks. The code is available at https://github.com/minyoungkim21/niwmeta.
Scalable Oversight for Superhuman AI via Recursive Self-Critiquing
As AI capabilities increasingly surpass human proficiency in complex tasks, current alignment techniques including SFT and RLHF face fundamental challenges in ensuring reliable oversight. These methods rely on direct human assessment and become untenable when AI outputs exceed human cognitive thresholds. In response to this challenge, we explore two hypotheses: (1) critique of critique can be easier than critique itself, extending the widely-accepted observation that verification is easier than generation to the critique domain, as critique itself is a specialized form of generation; (2) this difficulty relationship is recursively held, suggesting that when direct evaluation is infeasible, performing high-order critiques (e.g., critique of critique of critique) offers a more tractable supervision pathway. To examine these hypotheses, we perform Human-Human, Human-AI, and AI-AI experiments across multiple tasks. Our results demonstrate encouraging evidence supporting these hypotheses and suggest that recursive self-critiquing is a promising direction for scalable oversight.
Analysis of Two Models for the Angular Structure of the Outflows Producing the Swift/XRT "Larger-Angle Emission" of Gamma-Ray Bursts
The instantaneous emission from a relativistic surface endowed with a Lorentz factor Gamma that decreases away from the outflow symmetry axis can naturally explain the three phases observed by Swift/XRT in GRBs and their afterglows (GRB tail, afterglow plateau and post-plateau). We expand the analytical formalism of the "Larger-Angle Emission" model previously developed for "Power-Law" outflows to "n-Exponential" outflows (e.g. exponential with n=1 and Gaussian with n=2) and compare their abilities to account for the X-ray emission of XRT afterglows. We assume power-law Gamma-dependences of two spectral characteristics (peak-energy and peak intensity) and find that, unlike Power-Law outflows, n-Exponential outflows cannot account for plateaus with a temporal dynamical range larger than 100. To include all information existing in the Swift/XRT measurements of X-ray aferglows (0.3-10 keV unabsorbed flux and effective spectral slope), we calculate 0.3 keV and 10 keV light-curves using a broken power-law emission spectrum of peak-energy and low-and high-energy slopes that are derived from the effective slope measured by XRT. This economical peak-energy determination is found to be consistent with more expensive spectral fits. The angular distributions of the Lorentz factor, comoving frame peak-energy, and peak-intensity (Gamma (theta), E'_p (theta), i'_p(theta)) constrain the (yet-to-be determined) convolution of various features of the production of relativistic jets by solar-mass black-holes and of their propagation through the progenitor/circumburst medium, while the E'_p (Gamma) and i'_p (Gamma) dependences may constrain the GRB dissipation mechanism and the GRB emission process.
Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning
Deductive reasoning plays a pivotal role in the formulation of sound and cohesive arguments. It allows individuals to draw conclusions that logically follow, given the truth value of the information provided. Recent progress in the domain of large language models (LLMs) has showcased their capability in executing deductive reasoning tasks. Nonetheless, a significant portion of research primarily assesses the accuracy of LLMs in solving such tasks, often overlooking a deeper analysis of their reasoning behavior. In this study, we draw upon principles from cognitive psychology to examine inferential strategies employed by LLMs, through a detailed evaluation of their responses to propositional logic problems. Our findings indicate that LLMs display reasoning patterns akin to those observed in humans, including strategies like supposition following or chain construction. Moreover, our research demonstrates that the architecture and scale of the model significantly affect its preferred method of reasoning, with more advanced models tending to adopt strategies more frequently than less sophisticated ones. Importantly, we assert that a model's accuracy, that is the correctness of its final conclusion, does not necessarily reflect the validity of its reasoning process. This distinction underscores the necessity for more nuanced evaluation procedures in the field.
Probing the shape of the Milky Way dark matter halo with hypervelocity stars: a new method
We propose a new method to determine the shape of the gravitational potential of the dark matter (DM) halo of the Milky Way (MW) with the galactocentric tangential velocities of a sample of hypervelocity stars (HVSs). We compute the trajectories of different samples of HVSs in a MW where the baryon distribution is axisymmetric and the DM potential either is spherical or is spheroidal or triaxial with radial-dependent axis ratios. We determine the shape of the DM potential with the distribution of the latitudinal velocity |v_{vartheta}| in axisymmetric Galactic potentials, or with the distribution of |v_{vartheta}| and of a function bar v_{varphi} of the azimuthal velocity in non-axisymmetric Galactic potentials. We recover the correct shape of the DM potential by comparing the distribution of |v_{vartheta}| and bar v_{varphi} against the corresponding distributions of mock samples of HVSs that traveled in DM halos of different shapes. We use the largest possible sample of sim 800 HVSs of 4~M_odot ejected with the Hills mechanism at a rate sim 10^{-4} yr^{-1}, currently outgoing, and located at more than 10 kpc from the Galactic center. In our ideal case of galactocentric velocities with null uncertainties and no observational limitations, our method recovers the correct shape of the DM potential with a success rate Sgtrsim 89% in axisymmetric Galactic potentials, and S > 96% in the explored non-axisymmetric cases. The unsuccessful cases yield axis ratios of the DM potential that are off by pm 0.1. The success rate decreases with decreasing sample size: for example, for a spherical DM halo, S drops from sim 98% to sim 38% when the sample size decreases from sim 800 to sim 40 HVSs. A robust determination of the shape of the DM potential thus requires the measure of the galactocentric velocity of a few hundred genuine HVSs.
PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems
The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems - a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, {rm P{small HYSICS}E{small VAL}}, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval.
From Street Views to Urban Science: Discovering Road Safety Factors with Multimodal Large Language Models
Urban and transportation research has long sought to uncover statistically meaningful relationships between key variables and societal outcomes such as road safety, to generate actionable insights that guide the planning, development, and renewal of urban and transportation systems. However, traditional workflows face several key challenges: (1) reliance on human experts to propose hypotheses, which is time-consuming and prone to confirmation bias; (2) limited interpretability, particularly in deep learning approaches; and (3) underutilization of unstructured data that can encode critical urban context. Given these limitations, we propose a Multimodal Large Language Model (MLLM)-based approach for interpretable hypothesis inference, enabling the automated generation, evaluation, and refinement of hypotheses concerning urban context and road safety outcomes. Our method leverages MLLMs to craft safety-relevant questions for street view images (SVIs), extract interpretable embeddings from their responses, and apply them in regression-based statistical models. UrbanX supports iterative hypothesis testing and refinement, guided by statistical evidence such as coefficient significance, thereby enabling rigorous scientific discovery of previously overlooked correlations between urban design and safety. Experimental evaluations on Manhattan street segments demonstrate that our approach outperforms pretrained deep learning models while offering full interpretability. Beyond road safety, UrbanX can serve as a general-purpose framework for urban scientific discovery, extracting structured insights from unstructured urban data across diverse socioeconomic and environmental outcomes. This approach enhances model trustworthiness for policy applications and establishes a scalable, statistically grounded pathway for interpretable knowledge discovery in urban and transportation studies.
Fate and detectability of rare gas hydride ions in nova ejecta: A case study with nova templates
HeH^+ was the first heteronuclear molecule to form in the metal-free Universe after the Big Bang. The molecule gained significant attention following its first circumstellar detection in the young and dense planetary nebula NGC 7027. We target some hydride ions associated with the noble gases (HeH^+, ArH^+, and NeH^+) to investigate their formation in harsh environments like the nova outburst region. We use a photoionization modeling (based on previously published best-fit physical parameters) of the moderately fast ONe type nova, QU Vulpeculae 1984, and the CO type novae, RS Ophiuchi and V1716 Scorpii. Our steady-state modeling reveals a convincing amount of HeH^+, especially in the dense clump of RS Ophiuchi and V1716 Scorpii. The calculated upper limit on the surface brightness of HeH^+ transitions suggests that the James Webb Space Telescope (JWST) could detect some of them, particularly in sources like RS Ophiuchi and V1716 Scorpii, which have similar physical and chemical conditions and evolution. It must be clearly noted that the sources studied are used as templates, and not as targets for observations. The detection of these lines could be useful for determining the physical conditions in similar types of systems and for validating our predictions based on new electron-impact ro-vibrational collisional data at temperatures of up to 20,000 K.
Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs
Human reasoning relies on constructing and manipulating mental models-simplified internal representations of situations that we use to understand and solve problems. Conceptual diagrams (for example, sketches drawn by humans to aid reasoning) externalize these mental models, abstracting irrelevant details to efficiently capture relational and spatial information. In contrast, Large Language Models (LLMs) and Large Multimodal Models (LMMs) predominantly reason through textual representations, limiting their effectiveness in complex multi-step combinatorial and planning tasks. In this paper, we propose a zero-shot fully automatic framework that enables LMMs to reason through multiple chains of self-generated intermediate conceptual diagrams, significantly enhancing their combinatorial planning capabilities. Our approach does not require any human initialization beyond a natural language description of the task. It integrates both textual and diagrammatic reasoning within an optimized graph-of-thought inference framework, enhanced by beam search and depth-wise backtracking. Evaluated on multiple challenging PDDL planning domains, our method substantially improves GPT-4o's performance (for example, from 35.5% to 90.2% in Blocksworld). On more difficult planning domains with solution depths up to 40, our approach outperforms even the o1-preview reasoning model (for example, over 13% improvement in Parking). These results highlight the value of conceptual diagrams as a complementary reasoning medium in LMMs.
First Light And Reionisation Epoch Simulations (FLARES) VI: The colour evolution of galaxies z=5-15
With its exquisite sensitivity, wavelength coverage, and spatial and spectral resolution, the James Webb Space Telescope is poised to revolutionise our view of the distant, high-redshift (z>5) Universe. While Webb's spectroscopic observations will be transformative for the field, photometric observations play a key role in identifying distant objects and providing more comprehensive samples than accessible to spectroscopy alone. In addition to identifying objects, photometric observations can also be used to infer physical properties and thus be used to constrain galaxy formation models. However, inferred physical properties from broadband photometric observations, particularly in the absence of spectroscopic redshifts, often have large uncertainties. With the development of new tools for forward modelling simulations it is now routinely possible to predict observational quantities, enabling a direct comparison with observations. With this in mind, in this work, we make predictions for the colour evolution of galaxies at z=5-15 using the FLARES: First Light And Reionisation Epoch Simulations cosmological hydrodynamical simulation suite. We predict a complex evolution, driven predominantly by strong nebular line emission passing through individual bands. These predictions are in good agreement with existing constraints from Hubble and Spitzer as well as some of the first results from Webb. We also contrast our predictions with other models in the literature: while the general trends are similar we find key differences, particularly in the strength of features associated with strong nebular line emission. This suggests photometric observations alone should provide useful discriminating power between different models.
Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs
Large Language Models (LLMs) often generate outputs that lack grounding in real-world facts, a phenomenon known as hallucinations. Prior research has associated hallucinations with model uncertainty, leveraging this relationship for hallucination detection and mitigation. In this paper, we challenge the underlying assumption that all hallucinations are associated with uncertainty. Using knowledge detection and uncertainty measurement methods, we demonstrate that models can hallucinate with high certainty even when they have the correct knowledge. We further show that high-certainty hallucinations are consistent across models and datasets, distinctive enough to be singled out, and challenge existing mitigation methods. Our findings reveal an overlooked aspect of hallucinations, emphasizing the need to understand their origins and improve mitigation strategies to enhance LLM safety. The code is available at https://github.com/technion-cs-nlp/Trust_me_Im_wrong .
ContPhy: Continuum Physical Concept Learning and Reasoning from Videos
We introduce the Continuum Physical Dataset (ContPhy), a novel benchmark for assessing machine physical commonsense. ContPhy complements existing physical reasoning benchmarks by encompassing the inference of diverse physical properties, such as mass and density, across various scenarios and predicting corresponding dynamics. We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy, which shows that the current AI models still lack physical commonsense for the continuum, especially soft-bodies, and illustrates the value of the proposed dataset. We also introduce an oracle model (ContPRO) that marries the particle-based physical dynamic models with the recent large language models, which enjoy the advantages of both models, precise dynamic predictions, and interpretable reasoning. ContPhy aims to spur progress in perception and reasoning within diverse physical settings, narrowing the divide between human and machine intelligence in understanding the physical world. Project page: https://physical-reasoning-project.github.io.
pyhgf: A neural network library for predictive coding
Bayesian models of cognition have gained considerable traction in computational neuroscience and psychiatry. Their scopes are now expected to expand rapidly to artificial intelligence, providing general inference frameworks to support embodied, adaptable, and energy-efficient autonomous agents. A central theory in this domain is predictive coding, which posits that learning and behaviour are driven by hierarchical probabilistic inferences about the causes of sensory inputs. Biological realism constrains these networks to rely on simple local computations in the form of precision-weighted predictions and prediction errors. This can make this framework highly efficient, but its implementation comes with unique challenges on the software development side. Embedding such models in standard neural network libraries often becomes limiting, as these libraries' compilation and differentiation backends can force a conceptual separation between optimization algorithms and the systems being optimized. This critically departs from other biological principles such as self-monitoring, self-organisation, cellular growth and functional plasticity. In this paper, we introduce pyhgf: a Python package backed by JAX and Rust for creating, manipulating and sampling dynamic networks for predictive coding. We improve over other frameworks by enclosing the network components as transparent, modular and malleable variables in the message-passing steps. The resulting graphs can implement arbitrary computational complexities as beliefs propagation. But the transparency of core variables can also translate into inference processes that leverage self-organisation principles, and express structure learning, meta-learning or causal discovery as the consequence of network structural adaptation to surprising inputs. The code, tutorials and documentation are hosted at: https://github.com/ilabcode/pyhgf.
The HalluRAG Dataset: Detecting Closed-Domain Hallucinations in RAG Applications Using an LLM's Internal States
Detecting hallucinations in large language models (LLMs) is critical for enhancing their reliability and trustworthiness. Most research focuses on hallucinations as deviations from information seen during training. However, the opaque nature of an LLM's parametric knowledge complicates the understanding of why generated texts appear ungrounded: The LLM might not have picked up the necessary knowledge from large and often inaccessible datasets, or the information might have been changed or contradicted during further training. Our focus is on hallucinations involving information not used in training, which we determine by using recency to ensure the information emerged after a cut-off date. This study investigates these hallucinations by detecting them at sentence level using different internal states of various LLMs. We present HalluRAG, a dataset designed to train classifiers on these hallucinations. Depending on the model and quantization, MLPs trained on HalluRAG detect hallucinations with test accuracies ranging up to 75 %, with Mistral-7B-Instruct-v0.1 achieving the highest test accuracies. Our results show that IAVs detect hallucinations as effectively as CEVs and reveal that answerable and unanswerable prompts are encoded differently as separate classifiers for these categories improved accuracy. However, HalluRAG showed some limited generalizability, advocating for more diversity in datasets on hallucinations.
Kernel regression estimates of time delays between gravitationally lensed fluxes
Strongly lensed variable quasars can serve as precise cosmological probes, provided that time delays between the image fluxes can be accurately measured. A number of methods have been proposed to address this problem. In this paper, we explore in detail a new approach based on kernel regression estimates, which is able to estimate a single time delay given several datasets for the same quasar. We develop realistic artificial data sets in order to carry out controlled experiments to test of performance of this new approach. We also test our method on real data from strongly lensed quasar Q0957+561 and compare our estimates against existing results.
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models
Large vision-language models (LVLMs) hallucinate: certain context cues in an image may trigger the language module's overconfident and incorrect reasoning on abnormal or hypothetical objects. Though a few benchmarks have been developed to investigate LVLM hallucinations, they mainly rely on hand-crafted corner cases whose fail patterns may hardly generalize, and finetuning on them could undermine their validity. These motivate us to develop the first automatic benchmark generation approach, AUTOHALLUSION, that harnesses a few principal strategies to create diverse hallucination examples. It probes the language modules in LVLMs for context cues and uses them to synthesize images by: (1) adding objects abnormal to the context cues; (2) for two co-occurring objects, keeping one and excluding the other; or (3) removing objects closely tied to the context cues. It then generates image-based questions whose ground-truth answers contradict the language module's prior. A model has to overcome contextual biases and distractions to reach correct answers, while incorrect or inconsistent answers indicate hallucinations. AUTOHALLUSION enables us to create new benchmarks at the minimum cost and thus overcomes the fragility of hand-crafted benchmarks. It also reveals common failure patterns and reasons, providing key insights to detect, avoid, or control hallucinations. Comprehensive evaluations of top-tier LVLMs, e.g., GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, show a 97.7% and 98.7% success rate of hallucination induction on synthetic and real-world datasets of AUTOHALLUSION, paving the way for a long battle against hallucinations.
Large Language Models Hallucination: A Comprehensive Survey
Large language models (LLMs) have transformed natural language processing, achieving remarkable performance across diverse tasks. However, their impressive fluency often comes at the cost of producing false or fabricated information, a phenomenon known as hallucination. Hallucination refers to the generation of content by an LLM that is fluent and syntactically correct but factually inaccurate or unsupported by external evidence. Hallucinations undermine the reliability and trustworthiness of LLMs, especially in domains requiring factual accuracy. This survey provides a comprehensive review of research on hallucination in LLMs, with a focus on causes, detection, and mitigation. We first present a taxonomy of hallucination types and analyze their root causes across the entire LLM development lifecycle, from data collection and architecture design to inference. We further examine how hallucinations emerge in key natural language generation tasks. Building on this foundation, we introduce a structured taxonomy of detection approaches and another taxonomy of mitigation strategies. We also analyze the strengths and limitations of current detection and mitigation approaches and review existing evaluation benchmarks and metrics used to quantify LLMs hallucinations. Finally, we outline key open challenges and promising directions for future research, providing a foundation for the development of more truthful and trustworthy LLMs.
Halo: Estimation and Reduction of Hallucinations in Open-Source Weak Large Language Models
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP). Although convenient for research and practical applications, open-source LLMs with fewer parameters often suffer from severe hallucinations compared to their larger counterparts. This paper focuses on measuring and reducing hallucinations in BLOOM 7B, a representative of such weaker open-source LLMs that are publicly available for research and commercial applications. We introduce HaloCheck, a lightweight BlackBox knowledge-free framework designed to quantify the severity of hallucinations in LLMs. Additionally, we explore techniques like knowledge injection and teacher-student approaches to alleviate hallucinations in low-parameter LLMs. Our experiments effectively demonstrate the reduction of hallucinations in challenging domains for these LLMs.
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
Scientific reasoning, the process through which humans apply logic, evidence, and critical thinking to explore and interpret scientific phenomena, is essential in advancing knowledge reasoning across diverse fields. However, despite significant progress, current scientific reasoning models still struggle with generalization across domains and often fall short of multimodal perception. Multimodal Large Language Models (MLLMs), which integrate text, images, and other modalities, present an exciting opportunity to overcome these limitations and enhance scientific reasoning. Therefore, this position paper argues that MLLMs can significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology. First, we propose a four-stage research roadmap of scientific reasoning capabilities, and highlight the current state of MLLM applications in scientific reasoning, noting their ability to integrate and reason over diverse data types. Second, we summarize the key challenges that remain obstacles to achieving MLLM's full potential. To address these challenges, we propose actionable insights and suggestions for the future. Overall, our work offers a novel perspective on MLLM integration with scientific reasoning, providing the LLM community with a valuable vision for achieving Artificial General Intelligence (AGI).
Focus on conceptual ideas in quantum mechanics for teacher training
In this work, we describe strategies and provide case-study activities that can be used to examine the properties of superposition, entanglement, tagging, complementarity, and measurement in quantum curricula geared for teacher training. Having a solid foundation in these conceptual ideas is critical for educators who will be adopting quantum ideas within the classroom. Yet they are some of the most difficult concepts to master. We show how one can systematically develop these conceptual foundations with thought experiments on light and with thought experiments that employ the Stern-Gerlach experiment. We emphasize the importance of computer animations in aiding the instruction on these concepts.
Causal Inference by String Diagram Surgery
Extracting causal relationships from observed correlations is a growing area in probabilistic reasoning, originating with the seminal work of Pearl and others from the early 1990s. This paper develops a new, categorically oriented view based on a clear distinction between syntax (string diagrams) and semantics (stochastic matrices), connected via interpretations as structure-preserving functors. A key notion in the identification of causal effects is that of an intervention, whereby a variable is forcefully set to a particular value independent of any prior propensities. We represent the effect of such an intervention as an endofunctor which performs `string diagram surgery' within the syntactic category of string diagrams. This diagram surgery in turn yields a new, interventional distribution via the interpretation functor. While in general there is no way to compute interventional distributions purely from observed data, we show that this is possible in certain special cases using a calculational tool called comb disintegration. We demonstrate the use of this technique on a well-known toy example, where we predict the causal effect of smoking on cancer in the presence of a confounding common cause. After developing this specific example, we show this technique provides simple sufficient conditions for computing interventions which apply to a wide variety of situations considered in the causal inference literature.
From Neurons to Neutrons: A Case Study in Interpretability
Mechanistic Interpretability (MI) promises a path toward fully understanding how neural networks make their predictions. Prior work demonstrates that even when trained to perform simple arithmetic, models can implement a variety of algorithms (sometimes concurrently) depending on initialization and hyperparameters. Does this mean neuron-level interpretability techniques have limited applicability? We argue that high-dimensional neural networks can learn low-dimensional representations of their training data that are useful beyond simply making good predictions. Such representations can be understood through the mechanistic interpretability lens and provide insights that are surprisingly faithful to human-derived domain knowledge. This indicates that such approaches to interpretability can be useful for deriving a new understanding of a problem from models trained to solve it. As a case study, we extract nuclear physics concepts by studying models trained to reproduce nuclear data.
CP violation in the hyperon decays Σto Nπ
The study of CP violation in hyperon transitions has a long history. In the early 2000s the HyperCP experiment made a major effort to seek CP-odd signals in the decay sequence Xi^-toLambda pi^- and Lambdato ppi^-, which motivated more searches. Most recently the BESIII and LHCb Collaborations have acquired or improved the upper bounds on CP violation in a variety of hyperon nonleptonic processes, including Sigma^+to npi^+ and Sigma^+to ppi^0. These measurements have not reached the standard-model level yet, but have stimulated a renewed interest in CP-violating new physics in strange-quark decay beyond what is constrained by the parameters varepsilon and varepsilon^prime from the kaon sector. In this paper, after updating the standard-model expectations for CP-odd observables in the modes Sigma^pmto Npi, we revisit new-physics scenarios that could enhance the corresponding quantities in Lambdato Npi and XitoLambdapi and apply them to the Sigma^pm modes. We find that the CP asymmetries in the latter can be significantly increased over the standard-model expectations, at levels which may be tested in the ongoing BESIII experiment and in future endeavors such as PANDA and the Super Tau Charm Facility.
1FLAT: a Firmamento-based catalog of AGN in Fermi-LAT high Galactic latitude γ-ray sources
We present a systematic reassessment of 5,062 high-Galactic latitude gamma-ray sources from the Fermi-LAT 4FGL-DR4 catalog using Firmamento, a web-based platform for multi-frequency source discovery and analysis. Our goal is to provide an independent evaluation of LAT gamma-ray source associations through alternative spectral and spatial methods that combine recent and legacy survey data, supplemented by human supervision of spectral energy distributions (SEDs), source morphology, flux variability, and template-based comparisons. Firmamento confirms the 4FGL-DR4 and 4LAC-DR3 counterparts or unassociated sources in 4,493 cases (88.8%), demonstrating the robustness of both approaches. Beyond this general agreement, we identify 421 new blazar counterparts among previously unassociated sources, thereby reducing the fraction of unidentified extragalactic Fermi-LAT sources from 25% to 17%. In addition, in 64 cases we find alternative blazar associations, while in 49 instances we do not confirm the 4FGL-DR4 association. For all confirmed blazar counterparts we provide homogeneous estimates of synchrotron peak frequency and peak flux using machine-learning and template-based methods; these agree with 4LAC-DR3 values in most cases, though significant discrepancies appear for a few dozen sources, often due to improved X-ray coverage. The primary outcome of this work is the 1st Firmamento LAT AGN table (1FLAT), made publicly available through the Firmamento platform (https://firmamento.nyuad.nyu.edu), where all related multi-wavelength data and images are available. The project involved extensive manual validation and benefited from the active participation of graduate and undergraduate students, highlighting the platform's value for both research and education.
On the Higgs spectra of the 3-3-1 model with the sextet of scalars engendering the type II seesaw mechanism
In the 3-3-1 model with right-handed neutrinos, three triplets of scalars engender the correct sequence of symmetry breaking, SU(3)_C times SU(3)_L times U(1)_X rightarrow SU(3)_C times SU(2)_L times U(1)_Y rightarrow SU(3)_C times U(1)_{EM}, generating mass for all fermions, except neutrinos. Tiny neutrino masses may be achieved by adding one sextet of scalars to the original scalar content. As consequence, it emerges a very complex scalar sector, involving terms that violate lepton number explicitly, too. The main obstacle to the development of the phenomenology of such scenario is the knowledge of its spectrum of scalars since, now, there are 15 massive scalar particles on it. The proposal of this work is to do an exhaustive analysis of such scalar sector with lepton number being explicitly violated at low, electroweak and high energy scales by means of trilinear terms in the potential. The first case can be addressed analytically and, as a nice result, we have observed that the scalar content of such case is split into two categories: One belonging to the 331 energy scale and the other belonging to the EWSB energy scale, with the last recovering the well known THDM+triplet. For the other cases, the scalar sector can be addressed only numerically. Hence, we proposed a very general approach for the numerical study of the potential, avoiding simplifications that can make us reach conclusions without foundation. We show that, in the case of lepton number being explicitly violated at electroweak scale, it is possible to recover the same physics of the THDM+triplet, as the previous case. Among all the possibilities, we call the attention to one special case which generates the 3HDM+triplet scenario. For the last case, when lepton number is violated at high energy scale, the sextet become very massive and decouples from the original scalar content of the 3-3-1 model.
Why Language Models Hallucinate
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections
Linear time-invariant state space models (SSM) are a classical model from engineering and statistics, that have recently been shown to be very promising in machine learning through the Structured State Space sequence model (S4). A core component of S4 involves initializing the SSM state matrix to a particular matrix called a HiPPO matrix, which was empirically important for S4's ability to handle long sequences. However, the specific matrix that S4 uses was actually derived in previous work for a particular time-varying dynamical system, and the use of this matrix as a time-invariant SSM had no known mathematical interpretation. Consequently, the theoretical mechanism by which S4 models long-range dependencies actually remains unexplained. We derive a more general and intuitive formulation of the HiPPO framework, which provides a simple mathematical interpretation of S4 as a decomposition onto exponentially-warped Legendre polynomials, explaining its ability to capture long dependencies. Our generalization introduces a theoretically rich class of SSMs that also lets us derive more intuitive S4 variants for other bases such as the Fourier basis, and explains other aspects of training S4, such as how to initialize the important timescale parameter. These insights improve S4's performance to 86% on the Long Range Arena benchmark, with 96% on the most difficult Path-X task.
Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning
Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs' ability to surface task-relevant knowledge. Finally, we conduct a lightweight analysis, comparing our science-focused data composition with concurrent efforts on long CoT SFT, and release SciLit01, a strong 8B baseline for scientific reasoning.
Sets are all you need: Ultrafast jet classification on FPGAs for HL-LHC
We study various machine learning based algorithms for performing accurate jet flavor classification on field-programmable gate arrays and demonstrate how latency and resource consumption scale with the input size and choice of algorithm. These architectures provide an initial design for models that could be used for tagging at the CERN LHC during its high-luminosity phase. The high-luminosity upgrade will lead to a five-fold increase in its instantaneous luminosity for proton-proton collisions and, in turn, higher data volume and complexity, such as the availability of jet constituents. Through quantization-aware training and efficient hardware implementations, we show that O(100) ns inference of complex architectures such as deep sets and interaction networks is feasible at a low computational resource cost.
The Duality of Whittaker Potential Theory: Fundamental Representations of Electromagnetism and Gravity, and Their Orthogonality
E. T. Whittaker produced two papers in 1903 and 1904 that, although sometimes considered mere mathematical statements (Barrett, 1993), held important implications for physical theory. The Whittaker 1903 paper united electrostatic and gravitational attraction as resulting from longitudinal waves - waves whose wavefronts propagate parallel to their direction. The Whittaker 1904 paper showed that electromagnetic waves resulted from the interference of two such longitudinal waves or scalar potential functions. Although unexplored, the implications of these papers are profound: gravitational lensing, gravitational waves, the Aharonov-Bohm effect, the existence of a hyperspace above or behind normal space, the elimination of gravitational and point charge singularities, MOND, and the expansion of the universe. This last implication can be related to the recent finding that black holes with posited vacuum energy interior solutions alongside cosmological boundaries have a cosmological coupling constant of k=3, meaning that black holes gain mass-proportional to a3 in a parameterization equation within a Robertson-Walker cosmology and are a cosmological accelerated expansion species (Farrah et al., 2023). This expansion and many features of General Relativity can be explained by the mass-proportionality and preferred direction of the longitudinal waves within the two underlying non-local Whittaker potentials (Titleman, 2022). Whittaker potential theory also offers a simple explanation for expansion of the universe - it is produced as longitudinal motion within the Whittaker potentials only when dynamic electromagnetism is separate from time-static gravity in intergalactic space.
Precision measurement of the last bound states in H_2 and determination of the H + H scattering length
The binding energies of the five bound rotational levels J=0-4 in the highest vibrational level v=14 in the X^1Sigma_g^+ ground electronic state of H_2 were measured in a three-step ultraviolet-laser experiment. Two-photon UV-photolysis of H_2S produced population in these high-lying bound states, that were subsequently interrogated at high precision via Doppler-free spectroscopy of the F^1Sigma_g^+ - X^1Sigma_g^+ system. A third UV-laser was used for detection through auto-ionizing resonances. The experimentally determined binding energies were found to be in excellent agreement with calculations based on non-adiabatic perturbation theory, also including relativistic and quantum electrodynamical contributions. The s-wave scattering length of the H + H system is derived from the binding energy of the last bound J=0 level via a direct semi-empirical approach, yielding a value of a_s = 0.2724(5) a_0, in good agreement with a result from a previously followed theoretical approach. The subtle effect of the malpha^4 relativity contribution to a_s was found to be significant. In a similar manner a value for the p-wave scattering volume is determined via the J=1 binding energy yielding a_p = -134.0000(6) a_0^3. The binding energy of the last bound state in H_2, the (v=14, J=4) level, is determined at 0.023(4) cm^{-1}, in good agreement with calculation. The effect of the hyperfine substructure caused by the two hydrogen atoms at large internuclear separation, giving rise to three distinct dissociation limits, is discussed.
MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM
Multimodal hallucination in multimodal large language models (MLLMs) restricts the correctness of MLLMs. However, multimodal hallucinations are multi-sourced and arise from diverse causes. Existing benchmarks fail to adequately distinguish between perception-induced hallucinations and reasoning-induced hallucinations. This failure constitutes a significant issue and hinders the diagnosis of multimodal reasoning failures within MLLMs. To address this, we propose the {\dataset} benchmark, which isolates reasoning hallucinations by constructing questions where input images are correctly perceived by MLLMs yet reasoning errors persist. {\dataset} introduces multi-granular evaluation metrics: accuracy, factuality, and LLMs hallucination score for hallucination quantification. Our analysis reveals that (1) the model scale, data scale, and training stages significantly affect the degree of logical, fabrication, and factual hallucinations; (2) current MLLMs show no effective improvement on spatial hallucinations caused by misinterpreted spatial relationships, indicating their limited visual reasoning capabilities; and (3) question types correlate with distinct hallucination patterns, highlighting targeted challenges and potential mitigation strategies. To address these challenges, we propose {\method}, a method that combines curriculum reinforcement fine-tuning to encourage models to generate logic-consistent reasoning chains by stepwise reducing learning difficulty, and collaborative hint inference to reduce reasoning complexity. {\method} establishes a baseline on {\dataset}, and reduces the logical hallucinations in original base models.
Using Strong Lensing to Detect Subhalos with Steep Inner Density Profiles
The inner region of a subhalo's density distribution is particularly sensitive to dark matter microphysics, with alternative dark matter models leading to both cored and steeply-rising inner density profiles. This work investigates how the lensing signature and detectability of dark matter subhalos in mock HST-, Euclid-, and JWST-like strong lensing observations depends on the subhalo's radial density profile, especially with regards to the inner power-law slope, beta. We demonstrate that the minimum-mass subhalo detectable along the Einstein ring of a system is strongly dependent on beta. In particular, we show that subhalos with beta sim 2.2 can be detected down to masses over an order-of-magnitude lower than their Navarro-Frenk-White (NFW) counterparts with beta sim 1. Importantly, we find that the detectability of subhalos with steep inner profiles is minimally affected by increasing the complexity of the main lens galaxy's mass model. This is a unique characteristic of these subhalos, as those with NFW or shallower profiles become essentially undetectable when multipole perturbations are added to the lens model. The results of this work highlight how the underlying dark matter physics can significantly impact the expected number of subhalo detections from strong gravitational lensing observations. This is important for testing Cold Dark Matter against alternatives, such as Self-Interacting Dark Matter, which predict the existence of subhalos with diverse inner density profiles.
LAMBADA: Backward Chaining for Automated Reasoning in Natural Language
Remarkable progress has been made on automated reasoning with natural text, by using Language Models (LMs) and methods such as Chain-of-Thought and Selection-Inference. These techniques search for proofs in the forward direction from axioms to the conclusion, which suffers from a combinatorial explosion of the search space, and thus high failure rates for problems requiring longer chains of reasoning. The classical automated reasoning literature has shown that reasoning in the backward direction (i.e. from the intended conclusion to supporting axioms) is significantly more efficient at proof-finding. Importing this intuition into the LM setting, we develop a Backward Chaining algorithm, called LAMBADA, that decomposes reasoning into four sub-modules. These sub-modules are simply implemented by few-shot prompted LM inference. We show that LAMBADA achieves sizable accuracy boosts over state-of-the-art forward reasoning methods on challenging logical reasoning datasets, particularly when deep and accurate proof chains are required.
