
ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context

Jan 1, 2024
Victoria R. Li, Yida Chen, Naomi Saphra
Type: Conference paper
Publication: Empirical Methods in Natural Language Processing (EMNLP)
Last updated on Jan 1, 2024
Tags: Large Language Models, Fairness

