
Commit

Rename files to hubble_paperclip; version sent out for comments
smsharma committed Mar 7, 2024
1 parent 68e3de5 commit ea65b53
Showing 3 changed files with 9 additions and 10 deletions.
File renamed without changes.
Binary file added paper/hubble_paperclip.pdf
Binary file not shown.
19 changes: 9 additions & 10 deletions paper/main.tex → paper/hubble_paperclip.tex
@@ -62,7 +62,7 @@
\newcommand{\eqrefb}[1]{(\ref{#1})}


- \def\preprintno{XXXX \SM{Requested from Charles}} % Insert correct preprint number
+ \def\preprintno{XXXX} % Insert correct preprint number

\usepackage[
pdfnewwindow=true, % links in new window
@@ -76,8 +76,7 @@

% Define a new fancy page style
\fancypagestyle{firstpage}{
- \rhead{MIT-CTP/\preprintno}
- % Define other header and footer elements if necessary
+ % \rhead{MIT-CTP/\preprintno}
}

\lstdefinestyle{mystyle}{
@@ -192,7 +191,7 @@ \section{Introduction}
%
\textsc{AstroLLaMA}~\citep{nguyen2023astrollama,perkowski2024astrollama} is another recent effort to fine-tune a publicly-available model (\textsc{Llama-2}) on astrophysics-specific textual data from the arXiv.

- In this paper, we describe \text{PAPERCLIP} (Proposal Abstracts Provide an Effective Representation \SM{I like `Representation' rather than `Refinement', I think the former emphasizes the central abstracts-as-captions aspect a bit more} for Contrastive Language-Image Pre-training\footnote{Technically, we fine tune rather than pre train, but ``PAPERCLIFT'' was rejected by the senior author of this paper.}), a method that connects astronomical image observations with natural language by leveraging the association between abstracts of successful observing proposals and images corresponding to downstream observations.
+ In this paper, we describe \text{PAPERCLIP} (Proposal Abstracts Provide an Effective Representation for Contrastive Language-Image Pre-training\footnote{Technically, we fine tune rather than pre train, but ``PAPERCLIFT'' was rejected by the senior author of this paper.}), a method that connects astronomical image observations with natural language by leveraging the association between abstracts of successful observing proposals and images corresponding to downstream observations.
%
Concretely, here we showcase the method using observations imaged by the \hubble Space Telescope (HST).
%
@@ -372,7 +371,7 @@ \subsection{Abstract Summarization via Guided Generation}
\input{\thedatafolder/id1_0.txt} & {\scriptsize \input{\thedatafolder/abs1_0.txt}} & {\scriptsize \input{\thedatafolder/obj1_0.txt}} & {\scriptsize \input{\thedatafolder/sci1_0.txt}} \tabularnewline
\bottomrule
\end{tabular}
- \caption{Examples of the clipped \hubble proposal abstracts (second column) and LLM (\textsc{Mixtral-8x7B})-extracted summaries (right-most two columns), separately extracting objects and phenomena as well as potential downstream science use cases. \SM{Don't rotate?}}
+ \caption{Examples of the clipped \hubble proposal abstracts (second column) and LLM (\textsc{Mixtral-8x7B})-extracted summaries (right-most two columns), separately extracting objects and phenomena as well as potential downstream science use cases.}
\label{tab:datasetsumm}
\end{table}
\end{landscape}
@@ -676,7 +675,7 @@ \subsubsection*{Software}

\subsubsection*{Broader Impact}

- This work relies on using abstracts from successful \hubble Space Telescope observing proposals as part of a dataset for training and evaluating machine learning models. While these abstracts are publicly accessible, the authors likely did not anticipate their text being used in this manner, raising questions around consent and appropriate use of data. Since this research intends to develop methods to aid astronomical research and does not use sensitive personal information or target commercial gain, we believe that the scientific benefits outweigh the potential concerns in this case while acknowledging good-faith arguments to the contrary. As the use of foundation models in the sciences increases, it will be important for the astronomy community to consider norms and guidelines around the appropriate use and attribution of various data sources for model training and evaluation, including qualitative textual data, to ensure transparency and maintain trust in the community.
+ This work relies on using abstracts from successful \hubble Space Telescope observing proposals as part of a dataset for training and evaluating machine learning models. While these abstracts are publicly available, the authors likely did not anticipate their text being used in this manner, raising questions around consent, attribution, and appropriate use of data. Since this research intends to develop methods to aid astronomical research and does not use sensitive personal information or target commercial gain, we believe that the scientific benefits outweigh the potential concerns in this case while acknowledging good-faith arguments to the contrary. As the use of foundation models in the sciences increases, it will be important for the community to consider norms and guidelines around the appropriate use and attribution of various data sources for model training and evaluation, including qualitative textual data, to ensure transparency and maintain trust.

\subsubsection*{Acknowledgments}

@@ -699,7 +698,7 @@ \subsubsection*{Acknowledgments}
Based on observations made with the NASA/ESA Hubble Space Telescope, and obtained from the Hubble Legacy Archive, which is a collaboration between the Space Telescope Science Institute (STScI/NASA), the Space Telescope European Coordinating Facility (ST-ECF/ESAC/ESA) and the Canadian Astronomy Data Centre (CADC/NRC/CSA).


- \bibliography{main}
+ \bibliography{hubble_paperclip}
\bibliographystyle{tmlr}

\appendix
@@ -727,7 +726,7 @@ \subsection{Prompts and Schema Used for Summarization}

We list here the prompts and schema (i.e., desired output formats) used for guided text generation via \package{Outlines} package interfacing with the \textsc{Mixtral-8x7B-Instruct} open-weights LLM.

- The following schema is used to guide the generation of the summaries, intended to produce between one and five objects and hypotheses, as well as science use cases.
+ The following schema is used to guide the generation of the summaries, intended to produce between one and five objects and hypotheses, as well as science use cases. \\

\begin{lstlisting}[language=Python]
from pydantic import BaseModel, conlist
@@ -737,7 +736,7 @@ \subsection{Prompts and Schema Used for Summarization}
science_use_cases: conlist(str, min_length=1, max_length=5)
\end{lstlisting}

- The following prompt function is used to produce a list of possible objects and phenomena shown in HST observations downstream of a proposal abstract, as well as one to five possible science use cases.
+ The following prompt function is used to produce a list of possible objects and phenomena shown in HST observations downstream of a proposal abstract, as well as one to five possible science use cases. \\

\begin{lstlisting}[language=Python]
import outlines
@@ -857,7 +856,7 @@ \section{List of Categories for Text Retrieval Task}

The following curated categories are used in the text retrieval experiment in Sec.~\ref{sec:results}.
%
- These are derived by initially prompting \textsc{Claude 2}, having attached a subsample of 30 proposal abstracts in the online interface to be used as context, to produce a list of categories corresponding to typical HST observations. The list is then manually curated to remove similar entries and ensure a representative sample of categories. \SM{I realized when looking up the prompt that I did actually attach a sample of abstracts as context}
+ These are derived by initially prompting \textsc{Claude 2}, having attached a subsample of 30 proposal abstracts in the online interface to be used as context, to produce a list of categories corresponding to typical HST observations. The list is then manually curated to remove similar entries and ensure a representative sample of categories. \\

\begin{lstlisting}[language=Python]
["star forming galaxies", "lyman alpha", "dust", "crowded stellar field", "core-collapse supernova", "cosmology", "gravitational lensing", "supernovae", "diffuse galaxies", "globular clusters", "stellar populations", "interstellar medium", "black holes", "dark matter", "galaxy clusters", "galaxy evolution", "galaxy formation", "quasars", "circumstellar disks", "exoplanets", "Kuiper Belt objects", "solar system objects", "cosmic web structure", "distant galaxies", "galaxy mergers", "galaxy interactions", "star formation", "stellar winds", "brown dwarfs", "white dwarfs", "nebulae", "star clusters", "galaxy archeology", "galactic structure", "active galactic nuclei", "gamma-ray bursts", "stellar nurseries", "intergalactic medium", "dark energy", "dwarf galaxies", "barred spiral galaxies", "irregular galaxies", "starburst galaxies", "low surface brightness galaxies", "ultra diffuse galaxies", "circumgalactic medium", "intracluster medium", "cosmic dust", "interstellar chemistry", "star formation histories", "initial mass function", "stellar proper motions", "binary star systems", "open clusters", "pre-main sequence stars", "protostars", "protoplanetary disks", "jets and outflows", "interstellar shocks", "planetary nebulae", "supernova remnants", "red giants", "Cepheid variables", "RR Lyrae variables", "stellar abundances", "stellar dynamics", "compact stellar remnants", "Einstein rings", "trans-Neptunian objects", "cosmic microwave background", "reionization epoch", "first stars", "first galaxies", "high-redshift quasars", "primordial black holes", "resolved binaries", "binary stars"]
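Reviewer note on the category list in the last hunk: the diff says these categories feed a text retrieval experiment but does not show how they are scored. The following is a minimal sketch, assuming retrieval ranks category embeddings by cosine similarity to an image embedding, as is typical for CLIP-style models. The function names and the toy 3-dimensional vectors are hypothetical and stand in for real model outputs; nothing here is taken from the paper's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_categories(image_embedding, category_embeddings, top_k=3):
    """Return the top_k category names, most similar to the image first."""
    scored = sorted(
        category_embeddings.items(),
        key=lambda item: cosine_similarity(image_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]


# Toy embeddings, invented for illustration only (not model outputs).
toy_category_embeddings = {
    "globular clusters": [0.9, 0.1, 0.0],
    "gravitational lensing": [0.0, 1.0, 0.1],
    "planetary nebulae": [0.1, 0.0, 1.0],
}
toy_image_embedding = [0.8, 0.2, 0.1]

top = rank_categories(toy_image_embedding, toy_category_embeddings, top_k=2)
```

In practice the category strings would be embedded by the text encoder and the image by the image encoder of the same fine-tuned model, so that both live in one shared space.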

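Reviewer note on the summarization schema hunk: the listing constrains `science_use_cases` with `conlist(str, min_length=1, max_length=5)`. The sketch below is a standard-library stand-in that checks the same one-to-five bound, to make the constraint explicit; it is not the paper's schema. The class name `SummarySketch` and the helper are hypothetical, and the schema's other fields are elided in the diff and deliberately not reproduced here.

```python
from dataclasses import dataclass
from typing import List

MIN_ITEMS = 1  # lower bound taken from the listing (min_length=1)
MAX_ITEMS = 5  # upper bound taken from the listing (max_length=5)


def check_bounded_str_list(name, values):
    """Raise if values is not a list of 1 to 5 strings."""
    if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
        raise TypeError(f"{name} must be a list of strings")
    if not MIN_ITEMS <= len(values) <= MAX_ITEMS:
        raise ValueError(
            f"{name} must have between {MIN_ITEMS} and {MAX_ITEMS} items, got {len(values)}"
        )


@dataclass
class SummarySketch:
    """Hypothetical stand-in holding only the field visible in the diff."""

    science_use_cases: List[str]

    def __post_init__(self):
        # Validate on construction, as a schema library would.
        check_bounded_str_list("science_use_cases", self.science_use_cases)


example = SummarySketch(science_use_cases=["measure cluster ages"])
```

With guided generation the bound is enforced while the model decodes, so an out-of-range list is never produced; the check above only validates after the fact.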