Issues with Replicating the Palomar Transients Studies

But if there are other things (or just one other thing) that could also explain the observations, and those/that thing/s are more Occam-friendly, as it were, that's good enough for me -- the onus is on Team Alien to show that it CAN'T be the other things if they want to prove the extraordinary claim that, in spite of all the immense difficulties caused by physics, Aliens are in fact flying around here.

Isn't Skepticism at its core about NOT magically jumping from hypothesis to belief? It's fine if you just don't care about the truth. But if you want to know, and not just believe, then it should not be good enough for you.
 
If you test 20 different null hypotheses at p < 0.05
I don't
you'd expect one to appear "significant" just by chance - that's the multiple testing problem.
you're confusing the hypothesis and the null hypothesis

The null hypothesis doesn't need testing. I already know it's true. In the vaccine example, I know that people naturally get infected or not infected. With the jet aircraft, I know it's there.

It's the person who defends the hypothesis who must show that it can't have been anything else we know. They need to get that "anything else" probability below the threshold (usually 5%).

Without pre-specification, hard to know if C is the real cause or just the one that happened to fit
it doesn't matter if C is the real cause. what matters is whether the person who wants the hypothesis to be true can show that it's not the cause.

Article:
The null hypothesis is a default hypothesis that a quantity to be measured is zero (null).

If the hypothesis is, "there's a shadow deficit caused by orbital objects", then that defines the null hypothesis to be, "there is no shadow deficit caused by orbital objects".

We don't need to pre-register that because it's already a given.

For "it's a UFO", the null hypothesis is "it's not a UFO", and as @JMartJr explained, it's enough to find one other explanation that can't be rejected to destroy the support for the hypothesis.

(It doesn't mean that the hypothesis is false, necessarily. But it means the data is not sufficient to prove it. I could invent a medicine that helps, but not a lot; the clinical trials would show an effect, but it might not be convincing enough.)

If the question is "to be or not to be", then if we can't reject "not to be", "to be" is not proven. ;)
 
Isn't Skepticism at its core about NOT magically jumping from hypothesis to belief? It's fine if you just don't care about the truth. But if you want to know, and not just believe, then it should not be good enough for you.
I am not following your point here at all. Perhaps you were unclear?

Can you explain how believing that a claim requires sufficient proof is "magically jumping from hypothesis to belief"? It is the opposite.

How is wanting proof of a claim somehow evidence of not caring about the truth?

I "want to know," but how is requiring proof somehow a problem?

Or perhaps I was unclear? So let me say this as clearly as I can -- in cases where an extraordinary claim is made, such as "this is (or might be) alien spacecraft," and evidence is presented to support that claim, and where the evidence could also be explained by other things like planes or Venus, I am not willing to "magically jump from hypothesis to belief" and embrace the claim that aliens are here.

Would you agree with that? If not, can you explain WHY not?

Edit: to correct an incorrect spelling of "believing" that was even worse than my normal i-e swap! ^_^
 
@Mendel
The null hypothesis doesn't need testing. I already know it's true
This misunderstands what null hypothesis testing is. You don't "know" the null is true. You assume it's true provisionally, then calculate the probability of observing your data under that assumption. If that probability is very low (p < 0.05), you reject the null.

Failing to reject the null doesn't prove it's true - it just means the data don't provide strong evidence against it. You never "know" the null is true through hypothesis testing.
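A minimal sketch of that logic, with made-up numbers rather than anything from the POSS data (a fair-coin null tested against a hypothetical 62 heads in 100 flips):

```python
# Minimal NHST sketch: assume the null provisionally, then ask how often data at least
# this extreme would occur under it. All numbers here are invented for illustration.
import random

def p_value_under_null(observed_heads, n_flips=100, p_null=0.5, n_sims=20_000):
    """Two-sided Monte Carlo p-value: P(result at least as extreme as observed | null)."""
    expected = n_flips * p_null
    observed_dev = abs(observed_heads - expected)
    extreme = 0
    for _ in range(n_sims):
        heads = sum(random.random() < p_null for _ in range(n_flips))
        if abs(heads - expected) >= observed_dev:
            extreme += 1
    return extreme / n_sims

p = p_value_under_null(62)
print(f"p ≈ {p:.3f}")  # small p: reject the null; large p: fail to reject (not "null proven")
```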

With the jet aircraft, I know it's there.
This isn't hypothesis testing, it's deterministic identification. Flight 1168 either was or wasn't at position X at time T. That's a yes/no question with a definitive answer. The shadow deficit is different: "What statistical model best explains the observed pattern in noisy data?" Multiple models might partially explain it. Which is why which model you test, and in what order, affects the inference.

If the hypothesis is, "there's a shadow deficit caused by orbital objects", then that defines the null hypothesis to be, "there is no shadow deficit caused by orbital objects"
The null hypothesis needs to be more specific. It should be, "The observed deficit is explained by [specific coverage model], not orbital objects." But there are multiple possible coverage models - uniform random distribution, plate-edge artifacts, exposure variations, etc.

If you test Coverage Model A (doesn't explain deficit), then Model B (doesn't explain it), then Model C (explains it), you've inflated your false positive rate. At p < 0.05, you expect 1 in 20 tests to show "significance" by chance. Testing multiple models without correction (or pre-specification) means finding one that works could be chance rather than explanation. This is formally known as the family-wise error rate (FWER) problem - the probability of at least one false positive increases with the number of tests.
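A quick back-of-the-envelope check of that inflation, assuming the tests are independent (a simplification):

```python
# Family-wise error rate for m independent tests at alpha = 0.05:
# P(at least one false positive) = 1 - (1 - alpha)^m
alpha = 0.05
for m in (1, 3, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:2d} tests -> FWER ≈ {fwer:.2f}")
# 20 tests give roughly a 64% chance of at least one spurious "significant" result.
```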

With the jet, you're checking whether a specific object was at a specific location (deterministic). With the shadow deficit, you're asking which of several possible statistical models best explains variance in noisy data (probabilistic). These are fundamentally different types of questions.

Your edge-cutting analysis is valuable. It shows Villarroel's uniform coverage assumption was probably wrong. When you test alternative coverage models, the question is: Were these models pre-specified based on theoretical reasons, or are you trying models post-hoc until one fits? The first provides stronger evidence than the second.

Pre-registration addresses this by requiring you to specify which model you'll test and why, before seeing whether it works. This distinguishes "we predicted Model C would explain it" from "we tried Models A-C and C happened to work."
 
This misunderstands what null hypothesis testing is. You don't "know" the null is true. You assume it's true provisionally, then calculate the probability of observing your data under that assumption.
I can only calculate that probability because the relationship I'm using to calculate it is already known and accepted.
It's the hypothesis that is new and unproven.
The null hypothesis needs to be more specific.
I have supported my take with a source. Can you support yours with a source?

This is formally known as the family-wise error rate (FWER) problem - the probability of at least one false positive increases with the number of tests.
Yes. "False positive" meaning "the hypothesis is declared true, although the null hypothesis is actually true". Wikipedia:
Article:
V is the number of false positives (Type I error) (also called "false discoveries"). [...] The FWER is the probability of making at least one type I error in the family,

The FWER is not concerned with a type II error, where a null hypothesis is accidentally accepted. See https://en.wikipedia.org/wiki/Family-wise_error_rate .

Please start sourcing your claims about how statistics work.
 
I am not following your point here at all. Perhaps you were unclear?
You implied the simple alternative is good enough for you. A skeptic may consider it a more likely to be true explanation, but not good enough. If you don't care to know, then good enough is fine. If you care to know, then good enough isn't.
 
Pre-registration addresses this by requiring you to specify which model you'll test and why, before seeing whether it works. This distinguishes "we predicted Model C would explain it" from "we tried Models A-C and C happened to work."
Yeah, I think you have raised your valid concerns enough times now. Thank you for the heads-up; I wish you had had the chance to do the same with Villarroel et al. Maybe you could start to do it now, in case they want to publish another paper. Bye bye.
 
You implied the simple alternative is good enough for you. A skeptic may consider it a more likely to be true explanation, but not good enough. If you don't care to know, then good enough is fine. If you care to know, then good enough isn't.
The simple alternative is to be preferred unless it can be falsified or another alternative can be shown to be simpler. Google "Occam's Razor."

Applying this does not show that the simpler alternative must be correct, but it requires the more improbable answer (it's aliens) to have better evidence than the simpler answer (it's a plane that can be shown to have been right there) if we are to embrace a more unlikely conclusion as being correct.

I'd LOVE to be able to be absolutely sure about everything (well, maybe not, life might get boring... ^_^) but that is not possible, especially in cases where the evidence is lacking. But if we can't be absolutely sure, I think it is better to go with the most likely explanation, the one that requires the least amount of unsupported multiplication of entities, as being THE MOST LIKELY answer. If you are saying you prefer to embrace the less likely answer, the one requiring a greater level of multiplication of entities, I'd be interested in knowing your thoughts about why that is preferable.

But perhaps this is not the thread for that, as we seem to be getting far afield from "issues with replicating the Palomar transients studies," so perhaps if you want to continue this you might start a thread about whether the more or less likely hypothesis that covers the data should be preferred...
 
I'd LOVE to be able to be absolutely sure about everything (well, maybe not, life might get boring... ^_^) but that is not possible, especially in cases where the evidence is lacking. But if we can't be absolutely sure, I think it is better to go with the most likely explanation, the one that requires the least amount of unsupported multiplication of entities, as being THE MOST LIKELY answer. If you are saying you prefer to embrace the less likely answer, the one requiring a greater level of multiplication of entities, I'd be interested in knowing your thoughts about why that is preferable.

The assertion that it's the most likely, or even has the least multiplication of entities is unsubstantiated to begin with. It's an opinion, which you are welcome to have. I've explained previously my own opinions about the likelihoods of different explanations. But for me, we have not yet found a "good enough" answer to this unresolved mystery, and various possible answers remain on the table.

Skeptical and don't care -> settle for low effort heuristic based guessing.
Skeptical and care -> don't settle for low effort heuristic based guessing.
 
It seems to me that the guess 'it's aliens' is almost always the lowest effort guess.

There are countless explanations for unidentified aerial phenomena; far too many people leap to the easy 'alien' explanation. To find a more likely and more mundane explanation takes much more effort (in many or most cases).
 
Please start sourcing your claims about how statistics work.

Multiple testing and inflated false-positive rates
  • Ioannidis (2005). Why Most Published Research Findings Are False. PLoS Medicine.
  • Gelman & Loken (2014). The Statistical Crisis in Science. American Scientist.
  • Any introductory statistics textbook section on multiple comparisons.
Pre-registration and model pre-specification
  • Nosek et al. (2018). The Preregistration Revolution. PNAS.
  • Benjamin et al. (2018). Redefine Statistical Significance. Nature Human Behaviour.
The issue isn't whether FWER involves Type I versus Type II errors - we agree it's about Type I. The point is that when you test several explanatory models in sequence, each additional test raises the chance of a false positive unless you adjust or declare your plan in advance. That's why science distinguishes between confirmatory tests (pre-registered) and exploratory model-fitting (valid, but should be labeled as such).

The shadow-deficit analysis falls in that second category unless the coverage model was specified beforehand. My concern is about inflated significance, not about rejecting or accepting anyone's particular hypothesis.

As for the null: to be testable it has to describe a specific, measurable state—something like "coverage pattern X produces no deficit." "Orbital objects don't exist" isn't a null you can evaluate statistically.

I think the distinction's clear enough at this point.

(And for anyone curious who has read to the bottom of this post, kudos to you, the Gelman & Loken article is a fascinating read. It touches on Daryl Bem's ESP study and why seemingly significant findings can appear from pure noise.)
 
Isn't Skepticism at its core about NOT magically jumping from hypothesis to belief? It's fine if you just don't care about the truth. But if you want to know, and not just believe, then it should not be good enough for you.
Do you want to check every one of a herd of hundreds of horses just on the off chance that one of them turns out to be a zebra? There is a point at which enough evidence is enough. If your personal point is too low, you're not very skeptical, but if your point is higher than almost anyone else, you begin to look too desperate for "the truth" to be "the truth that satisfies my existing point of view, and I'll accept no other".

You're taking advantage of the fact that we cannot conclusively prove a negative, because — take your pick — the evidence is too sparse or misleading, memory is fallible, the object is in the LIZ, witnesses contradict each other, or a hundred more. That's not our job to do so, no matter how much you want "the truth". The people who make the claim have to prove the affirmative, but until they do, we are crowd-sourcing plausible alternatives in the full realization that we will never have all the data.
 
Do you want to check every one of a herd of hundreds of horses just on the off chance that one of them turns out to be a zebra? There is a point at which enough evidence is enough. If your personal point is too low, you're not very skeptical, but if your point is higher than almost anyone else, you begin to look too desperate for "the truth" to be "the truth that satisfies my existing point of view, and I'll accept no other".

You're taking advantage of the fact that we cannot conclusively prove a negative, because — take your pick — the evidence is too sparse or misleading, memory is fallible, the object is in the LIZ, witnesses contradict each other, or a hundred more. That's not our job to do so, no matter how much you want "the truth". The people who make the claim have to prove the affirmative, but until they do, we are crowd-sourcing plausible alternatives in the full realization that we will never have all the data.

A lot of people are simply seeing things with their eyes or recording things on sensor systems and reporting their observations. It's not on them to prove they saw what they saw or recorded what they claim they recorded on classified sensor systems. It's up to anyone who is interested in figuring it out to participate in investigating what is happening. And it is up to anyone who wants to know the truth, rather than just settle for one belief or another, to stay open minded.
 
It's not on them to prove they saw what they saw or recorded what they claim they recorded on classified sensor systems.
I disagree. It's not up to anyone who attempts a scientific analysis to accept a visual sighting (along with the unproven breathless statements about shape, size, distance, or speed) as factually true, especially as we have seen so many times that they are incorrect. Science starts with FACTS, and we would be remiss to accept an excited utterance as a "fact" without a good deal more confirmatory evidence. They don't have to provide that evidence, but we don't have to accept it.

I'm suddenly reminded of an event more than half a century ago when my small son came running into the house gasping "There's a ...there's a ... there's a ... I don't know WHAT it is!", as he got his first view of the Goodyear blimp floating slowly over our house at a low altitude. Not enough UFO spotters are willing to admit that they don't know what they saw.
 
The issue isn't whether FWER involves Type I versus Type II errors - we agree it's about Type I.
Then please understand that a null hypothesis is not a prediction, and re-read Nosek in that light.
You cannot commit a type I error by trying to make the null hypothesis true.
That's why none of your sources apply.
We are talking about type II errors, but none of your sources do.

H1 is the prediction, it posits a relationship.
"Unknown objects in orbit cause a shadow deficit."
The H0 is the absence of that relationship.
"There is no shadow deficit caused by unknown objects."

External Quote:
Progress in science relies in part on generating hypotheses with existing observations and testing hypotheses with new observations. This distinction between postdiction and prediction is appreciated conceptually but is not respected in practice.
1) Villarroel does not make this distinction.
2) We can't, because a) we're not predicting a relationship, we're trying to disprove it, and b) we're using the existing observations.

External Quote:
The problem with this is understood as post hoc theorizing or hypothesizing after the results are known (12). It is an example of circular reasoning––generating a hypothesis based on observing data, and then evaluating the validity of the hypothesis based on the same data.
"validity of the hypothesis", not "validity of the H0". Looking at the H0 is the "evaluation".

External Quote:
In NHST, one usually compares a null hypothesis of no relationship among the variables and an alternate hypothesis in which the variables are related. Data are then observed that lead to rejection or not of the null hypothesis. Rejection of the null hypothesis at P < 0.05 is a claim about the likelihood that data as extreme or more extreme than the observed data would have occurred if the null hypothesis were true. It is underappreciated that the presence of "hypothesis testing" in the name of NHST is consequential for constraining its appropriate use to testing predictions.
We are doing that. The prediction is written down, published by Villarroel. There can be no "Garden of forking Paths" because Villarroel has pinned her prediction/postdiction down for us, flawed as it may be.

External Quote:
Even if there are no relationships to find, some of those tests will elicit apparent evidence––positive results––by chance (27). If researchers selectively report positive results more frequently than negative results, then the likelihood of false positives will increase (38–40).
False negatives are not a concern in this context.

(All quotes from Nosek.)

Please ask if anything is unclear.
 
I didn't say that.
Oops, I failed to quote the succeeding sentence:
External Quote:
It's up to anyone who is interested in figuring it out to participate in investigating what is happening.
Figuring WHAT out? Figuring out what somebody observed? We can't do that if we don't have verifiable facts. All we can provide are possibilities.
 
Figuring WHAT out? Figuring out what somebody observed? We can't do that if we don't have verifiable facts. All we can provide are possibilities.
The truth. You don't get verifiable facts without investigation. And you can't very well conduct a proper investigation without considering possibilities. Maybe finding the truth about the UFO phenomenon is too difficult and we will never succeed. Also maybe we will never know what dark matter is. It's not a problem that is unique to the UFO phenomenon.
 
@Mendel - You're misunderstanding which part of the analysis I'm talking about, and conflating several distinct issues.

What I'm NOT talking about--> Testing Villarroel's hypothesis ("orbital objects cause deficit") against her null ("no orbital objects"). You're correct that she published this claim and you're evaluating it. No argument there.

What I AM talking about-->When you test alternative explanations for the observed deficit, you're proposing multiple competing models. "It's coverage pattern A," "it's coverage pattern B," "it's plate edge artifacts," "it's exposure variation," etc. Each of these is a distinct hypothesis about what explains the data.

You claim:
"You cannot commit a type I error by trying to make the null hypothesis true."
This is wrong. If you test multiple alternative explanations sequentially and report the one that fits, you're doing multiple hypothesis tests. Each test has a false positive rate. Testing 20 alternatives at p < 0.05 means ~64% chance of at least one false positive. This IS Type I error inflation - falsely concluding an explanation works when it's just noise.
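For what it's worth, a sketch of one standard remedy (Holm-Bonferroni); the p-values below are placeholders, not anything computed from the plates:

```python
# Holm-Bonferroni step-down correction: controls the FWER across a family of tests.
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: which hypotheses are still rejected after correction."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    rejected = [False] * len(p_values)
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (len(p_values) - rank):
            rejected[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return rejected

print(holm_bonferroni([0.003, 0.04, 0.20, 0.01]))  # [True, False, False, True]
```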

You claim:
"There can be no 'Garden of forking Paths' because Villarroel has pinned her prediction down"
This is wrong. The garden of forking paths applies to YOUR analytical choices, not hers. Gelman & Loken's point is that even with a fixed dataset and research question, the analyst has many degrees of freedom: which models to test, which covariates to include, how to handle outliers, which statistical tests to use, etc. Each choice is a fork in the path. Villarroel's published claim doesn't constrain YOUR analysis choices when testing alternatives.

You claim:
"We are talking about type II errors"
This is wrong. Type II error is failing to reject a false null (missing a real effect). Type I error is rejecting a true null (false positive - claiming an effect exists when it doesn't). When you test multiple alternative explanations and one shows significance, the concern is Type I error - that you've found an apparent pattern that's actually noise. This is exactly what FWER addresses, and exactly what your own Wikipedia quote says: "V is the number of false positives (Type I error)."

The Nosek quotes you provided actually support my point:

"Even if there are no relationships to find, some of those tests will elicit apparent evidence––positive results––by chance"
This is describing exactly what I'm warning about: testing multiple models, finding one that appears to work, when it's actually chance. Nosek's solution is pre-registration to distinguish confirmatory testing (pre-specified) from exploratory analysis (post-hoc model fitting).

I'm going to repeat my previous point that your technical analysis of Villarroel's edge-cutting test is valuable. When you move to testing alternative coverage models to explain the deficit, you'll be doing multiple hypothesis tests. Pre-specifying which model you'll test (and why) before running the analysis is how you avoid the multiple testing problem. Without that, you risk doing exactly what you're critiquing Villarroel for - fitting explanations to observed patterns.

If you want to continue this discussion, feel free to DM me - I think we've taken enough of the public thread at this point and it doesn't seem like others are interested.
 
If you want to continue this discussion, feel free to DM me - I think we've taken enough of the public thread at this point
if you don't have sources, I'm not interested

we're in "your" thread, so we're really not in other people's way—they can always tell the forum to ignore this thread for them.
 
The assertion that it's the most likely, or even has the least multiplication of entities is unsubstantiated to begin with. It's an opinion, which you are welcome to have.
It's not just an opinion, it's a posterior informed by almost everyone's entire histories, and therefore a well-informed prior to take forwards.
 
Then please understand that a null hypothesis is not a prediction, and re-read Nosek in that light.

Logically, the null hypothesis can be considered: reflective objects in geosynchronous orbit are not causing a significant proportion of the transients.

However, in order to do statistics you need more than that, you need to estimate the distribution given the null hypothesis. In order to do that, you need to have a theory what those sources actually are if not glints, or at least what statistical properties they should have with respect to the Earth's shadow at GEO. If you assume they would be mostly astronomical objects as opposed to plate defects, or something else, and further if you have different theories what distribution those explanations would produce, then that can affect how you model the null distribution. If you keep estimating the null distribution and comparing with the actual distribution based on different assumptions or parameters, then you risk falling into the multiple comparison trap.

I think the authors' assumption was that the null distribution should look approximately like a Poisson distribution given the parameters? Maybe that assumption is too weak, or somehow faulty. That is worth looking into.
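To make the Poisson idea concrete, here is a toy calculation with placeholder counts (not the paper's numbers): if an assumed coverage model predicted 100 in-shadow transients and 70 were observed, the one-sided tail probability under a Poisson null would be:

```python
# Toy Poisson tail probability; both counts below are placeholders, not values from the paper.
from math import exp, factorial

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(lam**i * exp(-lam) / factorial(i) for i in range(k + 1))

expected_in_shadow = 100   # what the assumed null coverage model predicts
observed_in_shadow = 70    # a ~30% deficit
print(f"P(count <= 70 | mean 100) ≈ {poisson_cdf(observed_in_shadow, expected_in_shadow):.4f}")
# The whole inference hinges on how expected_in_shadow is modelled, which is the point above.
```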

If you try to reproduce their results using their assumptions and methodology, that is one thing, and not much you can do wrong except make a logic or coding error. If you want to see if the signal goes away when other factors are accounted for or other assumptions are made, then you should be careful not to fall into the multiple comparison trap. I don't know the risk of falling into that trap in this specific case. But it doesn't hurt to be extra rigorous, so I don't know what all of the fuss is about orianda's input.
 
you need to have a theory what those sources actually are if not glints, or at least what statistical properties they should have...
@beku-mant - Appreciate you explaining that clearly. That's exactly the concern. And as you note, being rigorous about this wouldn't be difficult - standard pre-registration in a GitHub repo is typically just a page or so of methodology specification.
 
Multiple testing and inflated false-positive rates
  • Ioannidis (2005). Why Most Published Research Findings Are False. PLoS Medicine.
  • Gelman & Loken (2014). The Statistical Crisis in Science. American Scientist.
  • Any introductory statistics textbook section on multiple comparisons.
Pre-registration and model pre-specification
  • Nosek et al. (2018). The Preregistration Revolution. PNAS.
  • Benjamin et al. (2018). Redefine Statistical Significance. Nature Human Behaviour.
Please use proper sourcing in accordance with the Posting Guidelines.

Article:
Posting source links to back up statements is a must on Metabunk. Statements made without a linked source, and especially statements that paraphrase a source, can be very misleading and will likely be deleted.

But while links are very important, they must be treated as additional references and not stand-alone content, so any content in the link that you refer to must also be in your post, quoted using "ex" tags.

If the information is visual, then screen grabs of relevant images must also be included in your comment.

A brief explanation for why you feel the quote you are quoting is relevant is also required.

The above also applies to video links. Timestamps (ie hour:minute:second), in text, are required also for
video links even if you "copy url at current time" in a video.

Do not paraphrase links unless you are commenting on something you have fully quoted in context.

Do not quote more than is necessary, the more focussed you are then more likely it is that someone will read what is there, and the more useful your post will be.

More details:

Links

The reader should not have to click on a link in order to understand what the post is about. When you link to something to back up something you are discussing then:
Describe what is in the link, and why it is relevant to the thread topic.
Quote relevant excerpts using EX tags,
Include images and screen-grabs from the link.
Do not use URL shorteners. They are unnecessary, hide the source, and may break.
Links themselves are not content, they are references.
 
Logically, the null hypothesis can be considered: reflective objects in geosynchronous orbit are not causing a significant proportion of the transients.

However, in order to do statistics you need more than that, you need to estimate the distribution given the null hypothesis. In order to do that, you need to have a theory what those sources actually are if not glints

Not really. If the data doesn't support the conclusion that there are reflective objects in equatorial orbit, there is no need to have a theory about what those objects are.
If the data isn't accurate, e.g. the identification of marks on the images being studied as "true" transients is flawed, the conclusion will not be reliable.
Hambly and Blair (2024) question Villarroel et al.'s identification of transients.

We can have a hypothesis that there are no blue-eyed unicorns without sister hypotheses that unicorns have brown or green eyes.
 
Not really. If the data doesn't support the conclusion that there are reflective objects in equatorial orbit, there is no need to have a theory about what those objects are.
If the data isn't accurate, e.g. the identification of marks on the images being studied as "true" transients is flawed, the conclusion will not be reliable.
Hambly and Blair (2024) question Villarroel et al.'s identification of transients.

We can have a hypothesis that there are no blue-eyed unicorns without sister hypotheses that unicorns have brown or green eyes.
The unicorn analogy doesn't quite work here because we're not looking for something that's absent - we're explaining something that's present.

There ARE transients in the POSS data. The question is: what are they? Villarroel says they're orbital glints, evidenced by shadow deficit pattern. To test this, you need a null hypothesis that specifies: What the transients are if NOT orbital glints, and what pattern that alternative would produce.
Different alternatives predict different patterns:
  • If they're plate edge artifacts → expect concentration at plate edges
  • If they're random defects → expect uniform distribution
  • If they're exposure variations → expect correlation with exposure parameters

@beku-mant is correct - you can't just say "not orbital objects" - you need to specify what you think they ARE, because different explanations predict different patterns. Testing multiple alternatives without pre-specification is the multiple comparison problem.

The unicorn example works for absence of evidence (looking for something not there). But when you have an observed pattern (shadow deficit), you need a specific alternative hypothesis to explain it.
 
There ARE transients in the POSS data.

There are almost certainly large numbers of transients captured on the NGS-POSS-1 plates.
Villarroel et al.'s data was gathered by studying copies of those plates; processing artefacts may have proliferated yet been classified as transients.

Hambly and Blair believe that some of the proposed transients of greatest interest to Villarroel et al. are processing/ reproduction artefacts.
(As a cheeky aside, Hambly and Blair's 2024 paper was published, but perhaps we should be allowed to critically discuss it without formally replicating it and pre-registering our intent to do so).
 
Not really. If the data doesn't support the conclusion that there are reflective objects in equatorial orbit, there is no need to have a theory about what those objects are. If the data isn't accurate, e.g. the identification of marks on the images being studied as "true" transients is flawed, the conclusion will not be reliable.

Hambly and Blair believe that some of the proposed transients of greatest interest to Villarroel et al. are processing/ reproduction artefacts.

They created an ML classifier with 3 classes: "highly reliable star", "highly reliable galaxy", and "likely spurious plate detections".

For reliable stars, the selection criteria are listed as,

A selection of highly reliable, isolated stars covering each field was created using Gaia DR3 (Gaia Collaboration et al. 2023). We made an astrometric reliability cut following Lindegren et al. (2021) via the 'renormalised unit–weight error' statistic (ruwe < 1.4; again, an example query is provided in the Appendix).

For reliable galaxies,

A selection of highly reliable galaxies covering each field was created using PanSTARRS PS1–DR2 (Flewelling et al. 2020). Tables ObjectThin and StackObjectThin were joined to a provide multiple–detection, multi–band catalogue including PSF and Kron magnitudes and source flags. Our selection required detection in both PS1 and and we applied the recommended star–galaxy separation

And for likely spurious detections,

Our approach to defining a training set of high likelihood spurious plate detections was to use highly complete star and galaxy catalogues as defined above but without the quality criteria, and negate the pair association with a relaxed proximity criterion. We defined a plate catalogue entry as likely spurious if there was no associated Gaia stellar or PS1 galaxy entry of any kind within 5 arcsec of its measured position.

https://arxiv.org/pdf/2402.00497

Then they use this model to predict class labels for a handful of VASCO's transients (9 of them). And they find the classifier predicts the "likely spurious plate detections" label at pretty high probabilities for most of the 9.

I wonder why they didn't simply try to distinguish and compare arbitrary entries with cross-matches vs. arbitrary entries without cross-matches? The selection criteria for each class may introduce confounding factors, and their machine learning model may be learning how to separate objects with a "'renormalised unit–weight error' statistic (ruwe < 1.4)" from "arbitrary entries of any kind without a Gaia stellar or PS1 galaxy cross-match" based on the confounding factors associated with those selection criteria, which have little to do with whether an entry is an artifact or not.

For their hyperparameter tuning, they also don't specify whether they did this only on the training data. If not, it would be problematic. The fact that they only performed a single train-test split, instead of cross-validation, is suspect, and indicative of a lack of experience and expertise in ML.

Also, Villarroel addresses Hambly & Blair's paper, saying,

Narrower FWHMs and rounder profiles. Hambly & Blair (2024) interpret slightly more concentrated, round profiles as signs of spurious detections and makes an example with Villarroel et al. (2021). However, atmospheric seeing and short-lived (sub-second to few-second) optical events are also expected to produce narrower FWHMs than long-exposed stars (Tokovinin 2002; Villarroel et al. 2025a). Thus, profile sharpness alone cannot conclusively distinguish between artifact and astrophysical origin.
...
In addition, the "artifact" sample in their study was selected using criteria that mirror the VASCO project's transient selection pipeline, which may introduce circular reasoning.

https://iopscience.iop.org/article/10.1088/1538-3873/ae0afe

Which both seem like valid points, especially the second one.

Lastly, if we assume Hambly & Blair's hypothesis is correct, it would still contradict some of the arguments made here, that sources without cross-matches in the 3 major catalogs, but possible cross-matches in some other catalogs, must be mostly astronomical sources. Hambly would classify them as spurious detections/plate defects.
 
Logically, the null hypothesis can be considered: reflective objects in geosynchronous orbit are not causing a significant proportion of the transients.
Yes. Thank you.
This encompasses all of the more specific null hypotheses I could come up with!
However, in order to do statistics you need more than that, you need to estimate the distribution given the null hypothesis. In order to do that, you need to have a theory what those sources actually are if not glints, or at least what statistical properties they should have with respect to the Earth's shadow at GEO. If you assume they would be mostly astronomical objects as opposed to plate defects, or something else, and further if you have different theories what distribution those explanations would produce, then that can affect how you model the null distribution. If you keep estimating the null distribution and comparing with the actual distribution based on different assumptions or parameters, then you risk falling into the multiple comparison trap.
The concepts are correct, but you're misapplying them.

I don't just "estimate" the null distribution, I show it.
Same as we don't "estimate" the aircraft (well, we do at first), we show it.

That's why I am qualifying some of the estimates I have made so far, as they contain elements of guesswork. They have to be shown to actually apply.

The null hypothesis is not the hypothesis. You can have as many as you want, but they must be based on what is known.

The hypothesis is the claim that's new. And we must make sure that it's not based on a random anomaly of the data, which is where the garden of forking paths comes in.

Villarroel has a choice of which altitude to consider. She picks two, but does not present any others, except to say that for low altitudes, there's an "overdensity", which is astounding and demands an explanation, but instead is handwaved away. She has a choice of which data set to use, and does not build on the Solano 2022 5399 data set, but rather expands it by removing steps from the process in that paper. She has a choice of how much to remove. These choices shape the hypothesis as presented by her paper, because it's nailing down the methodology to process the data, and the exact shadow she's considering.

And because the null hypothesis is the negation of the hypothesis, it is now nailed down as well. We don't have these choices.
 
There ARE transients in the POSS data. The question is: what are they? Villarroel says they're orbital glints, evidenced by shadow deficit pattern. To test this, you need a null hypothesis that specifies: What the transients are if NOT orbital glints, and what pattern that alternative would produce.
Different alternatives predict different patterns:
  • If they're plate edge artifacts → expect concentration at plate edges
  • If they're random defects → expect uniform distribution
  • If they're exposure variations → expect correlation with exposure parameters
These are not alternatives. We know that plate edge artifacts exist. We know that random defects exist. We know that misidentified astronomical objects exist. We know the plates are not uniformly distributed across the sky. The hypothesis is only true if all of these things couldn't collectively cause the shadow effect.
 
To be honest, we don't need to find out an alternative hypothesis to explain the shadow effect, because Villaroel's paper actively demonstrates that it is not an astronomical effect.

To quote Mendel, and (indirectly) to quote the paper itself:
Then she does a great thing:
External Quote:
As a quick check, nevertheless, we also test by masking edge transients (>2° from plate center) to remove all artifacts close to the plate edge. Removing the edge of the plate in the analysis, yields a similar ∼30% deficit in Earth's shadow,

This is motivated by the grid pattern in the data that we've also found (after Hambly & Blair pointed it out), see e.g. https://www.metabunk.org/threads/digitized-sky-survey-poss-1.14385/post-355943 . And it proves that the shadow effect is bogus.
Note that the paper is skimpy on numbers here. The 2° cut removes about half of the plate area (4° vs. 6° diameter, i.e. an area ratio of 16:36), but many more plate defects, since the grid pattern is caused by plate defects. So if there are orbital objects, the data should still have about 50% of these, but maybe only 10% of the plate defects. (I don't really know the number, but it's substantially less than 50%.) But the shadow deficit shrank from 39% to 30%! It should have done the opposite!

We now know that the edge half of the plates has a stronger shadow deficit than the center half. This falsifies Villarroel's finding.

The increase in 'shadow effect' in the (presumably spurious) edge anomalies means that the effect is caused by some other process than the location of the Earth's shadow. We can put forward any number of possible explanations for this increase, but they cannot be astronomical ones, so the effort is relatively pointless.
 
I've been reading this from the outside, as I'm (self-)studying statistics and this twists my brain.

I will just give my 2 cents.

It looks both possible to have the null being:

a. that the transients are not caused by objects in geosynchronous orbit

b. that the transients are random defects

To calculate the p-value of a, you need to calculate the probability of the data given it. But to calculate the probability of the data, you need a model. So this seems like it starts to fall into Bayesian statistics (as we establish priors for each model):

a.a. (P=0.5) they are random defects

a.b. (P=0.25) they're plate edge artifacts

a.c. (P=0.25) they're exposure variations

So, if P(data | random defects) = 0.1 and P(data | plate edge artifacts) = P(data | exposure variations) = 0.0, then P(data | null hypothesis) = 0.5 × 0.1 + 0.25 × 0.0 + 0.25 × 0.0 = 0.05. That, loosely speaking, is the p-value.
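Spelled out as the weighted sum over the component models (same illustrative numbers):

```python
# Mixture-model null: P(data | H0) = sum over models of P(data | model) * P(model).
# All probabilities here are the illustrative guesses from the post above.
priors = {"random defects": 0.50, "plate edge artifacts": 0.25, "exposure variations": 0.25}
likelihoods = {"random defects": 0.1, "plate edge artifacts": 0.0, "exposure variations": 0.0}

p_data_given_null = sum(priors[m] * likelihoods[m] for m in priors)
print(p_data_given_null)  # 0.05
```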

For b, we can consider various competing alternative hypotheses: edge artifacts, exposure variations, objects in orbit. There must be some way to deal with that, but none that I know of.

Now, Mendel seems to be suggesting another approach from both above. They want to prove a - that the transients are not objects in orbit - or at least prove it for the majority of transients.

Sincerely, I think this would make more sense if you invert the null hypothesis:

H_0: the transients are caused by objects in geosynchronous orbit
H_A: the transients are not caused by objects in geosynchronous orbit

Then, he has a model for H_0, and disproves it.

I think it also makes more sense if you take out the statistical framework altogether.

He's just proving, for most of the transients, that they're not in geosynchronous orbit.

This doesn't tell us exactly what they are yet, but now it seems like new hypotheses need to be considered (and previous work needs to be redone?).
 
It looks both possible to have the null being:

a. that the transients are not caused by objects in geosynchronous orbit

b. that the transients are random defects
Your (a) includes (b). Showing (b) also shows (a), but it's not necessary.


So this seems like it starts to fall into Bayesian statistics (as we establish priors for each model):
like almost any self-professed Bayesian I have ever seen, "establish" means "take a wild guess" for you


Now, Mendel seems to be suggesting another approach from both above. They want to prove a - that the transients are not objects in orbit - or at least prove it for the majority of transients.
because that is actually the proper null hypothesis, according to 2 sources I have quoted


He's just proving, for most of the transients, that they're not in geosynchronous orbit.
Not quite.
I'm showing that none of them have to be in orbit for the effect to occur.
I cannot disprove that there was a single alien satellite up there.
However, showing (a) means that Villarroel cannot prove it's up there.
So there's no reason to assume it is.

This doesn't tell us exactly what they are yet,
I would really like to find the shadow effect in data that Solano 2022 has already identified.
 
I wonder why they didn't simply try to distinguish and compare arbitrary entries with cross matches vs arbitrary entries without cross matches?

Because they (Hambly and Blair 2024) wanted to assess the reliability of Villarroel at al.'s identification of photographic features as transients.
Examining features that Villarroel et al. raise as "interesting" examples of transients would seem to be a promising (and obvious) approach.
 
One approach that I find useful is an analysis of the angles between any three points. This sort of analysis is commonplace in the study of ley lines, and has (for the most part) shown that ley lines are a chance association between points which have no causal connection. Any three points in a data set will form either a triangle or a straight line. The angle ε between the three points at the centre of the figure can be measured; if it is around 1 degree or less, the points will appear to form a straight line, whether the points are causally connected or not.
[Attached image: triangle.png, showing the angle ε formed at the central point of three points]

from here (an examination of quasar distribution, which is apparently random, yet still seems to show a number of alignments).
https://articles.adsabs.harvard.edu/full/1982MNRAS.201..179W

If you look at 180 samples of three random points, very roughly one of the samples will have an angle ε of about 1 degree or less. But because of random chance the number of apparent straight lines in a random data set can be much higher. This is why no respectable mathematician has any belief in ley lines. But sometimes there are other effects which can increase the incidence of apparent linear formations; if the sample is oblong rather than square, the sample will favour elongated triangles, and so on.
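A quick Monte Carlo sketch of that check; this is my own construction, assuming ε means the deviation from straightness at the most obtuse vertex, with points drawn uniformly in a unit square:

```python
# Estimate how often three random points look "aligned" (epsilon <= 1 degree) by chance.
import math
import random

def epsilon_deg(a, b, c):
    """Deviation from collinearity, in degrees: 180 minus the largest interior angle."""
    def angle_at(p, q, r):  # interior angle at vertex q
        v1 = (p[0] - q[0], p[1] - q[1])
        v2 = (r[0] - q[0], r[1] - q[1])
        cos_a = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
    largest = max(angle_at(b, a, c), angle_at(a, b, c), angle_at(a, c, b))
    return 180.0 - largest

trials = 50_000
hits = sum(
    epsilon_deg(*[(random.random(), random.random()) for _ in range(3)]) <= 1.0
    for _ in range(trials)
)
print(f"fraction of random triples within 1 degree of a straight line ≈ {hits / trials:.4f}")
```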

Ley lines don't exist, and neither do these chance alignments. As I've pointed out elsewhere, these are events that (if they existed at all) only lasted 0.5 seconds or less, and in a 50 minute exposure these events could occur at any time and in any order.
 
I'm showing that none of them have to be in orbit for the effect to occur.
I think I misunderstood your approach a bit.

I will make a comment in a similar vein as orianda, you can consider it as just some sort of statistical pedantry or maybe explanation.

Anything you consider useful can be a null hypothesis. There's one detail for it to be useful in statistics, though - it needs to give the probability of the data.

In this sense, a general "transients are not caused by objects in orbit", while it can be a starting point, is not directly useful as a statistical null hypothesis. How do you calculate the probability of data with it? It is not specific enough, that's the reason I gave the example of modeling it with multiple smaller models.

Another point: while I do agree with correcting Villarroel's null hypothesis if, for example, it was too simplistic, testing multiple null hypotheses - for example, considering (a.a), (a.b) and (a.c) as separate nulls - seems a little dubious to me. But I don't have enough background to know how, or whether, to correct that, except for the attempt above of putting them into one single null hypothesis.
 
Anything you consider useful can be a null hypothesis. There's one detail for it to be useful in statistics, though - it needs to give the probability of the data.
more precisely:
the null hypothesis must explain the effect that is claimed in the hypothesis, without the relationship claimed in the hypothesis.

effect: there are transients in this data set, and they align/have a shadow deficit.
hypothesis: this is caused by orbital objects
null hypothesis: this is not caused by orbital objects

you really have to know your data to examine the alternative explanations. that's why a lot of research statistics is concerned with identifying confounding factors in the data that align with the effect, or with gathering truly random, representative data.

for the shadow effect, Villarroel has already identified the plate geometry as a confounding factor.
Also:
You can also see the huge gouge taken out by the Galactic Plane, because they remove anything near a Gaia star. This patchiness is not accounted for in their estimate of the statistical significance of the shadow.
Essentially, what they needed to do was to determine the actual defect distribution (e.g. using recognized defects), then distribute random points across all plates according to that distribution, then match the 3 star catalogs, and then determine the shadow deficit.
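A toy, self-contained illustration of why that matters; the 1-D "plate", defect density, and "shadow" below are all invented, not the actual plate data:

```python
# If defects are concentrated away from the region the "shadow" covers, a null that assumes
# uniform coverage will report a deficit even though no orbital objects exist in this toy.
import random

def sample_defect_position():
    """Draw x in [0, 1] with density 2(1 - x): defects pile up toward one side of the plate."""
    return 1 - random.random() ** 0.5

def apparent_deficit(n_points=100_000, shadow_start=0.6):
    in_shadow = sum(sample_defect_position() > shadow_start for _ in range(n_points))
    expected_if_uniform = n_points * (1 - shadow_start)
    return 1 - in_shadow / expected_if_uniform

print(f"apparent 'shadow deficit' from the defect pattern alone ≈ {apparent_deficit():.0%}")
# Prints roughly 60% with these made-up densities. The real version of this check would
# resample from the measured defect distribution across the actual plates, as described above.
```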

Hopefully, this can also explain why the shadow deficit is less at 80km altitude, why there's an "overdensity" for lower altitudes, or why the plate centers have a smaller shadow deficit than the whole plates. "Orbital objects" explains none of this.
 
Just giving a heads-up, I've looked a little bit and the approach I mentioned previously of using a mixture model as null hypothesis is what's appearing dubious now haha

Using multiple nulls seems like standard practice (I think it's called intersection-union test), though it can be conservative. It looks like likelihood ratio tests are an alternative.
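For reference, the intersection-union decision rule is just "reject the overall null only if every component null is rejected", which is why it can be conservative (placeholder p-values below):

```python
# Intersection-union test: the overall null is "at least one component null is true",
# so it is rejected only when every component test rejects at level alpha.
def iut_reject(component_p_values, alpha=0.05):
    return max(component_p_values) < alpha

print(iut_reject([0.01, 0.03, 0.002]))  # True: every component rejected
print(iut_reject([0.01, 0.30, 0.002]))  # False: one component survives, the overall null stands
```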
 
Anything you consider useful can be a null hypothesis. There's one detail for it to be useful in statistics, though - it needs to give the probability of the data. In this sense, a general "transients are not caused by objects in orbit", while it can be a starting point, is not directly useful as a statistical null hypothesis. How do you calculate the probability of data with it?

The null hypothesis is not an alternative experimental hypothesis. A null hypothesis is the default position that a hypothesised relationship or observation (e.g. "A causes B", or "some stars' brightness varies") is not the case.
There is no requirement for the null hypothesis to explain or describe why the experimental hypothesis is incorrect/ unlikely.

The null hypothesis is not established by statistical testing, the experimental hypothesis is. The null hypothesis is retained unless significant evidence for the experimental hypothesis is demonstrated.

Villarroel says they're orbital glints, evidenced by shadow deficit pattern. To test this, you need a null hypothesis that specifies: What the transients are if NOT orbital glints, and what pattern that alternative would produce.

No, that's not correct. A hypothesis forwarding an explanation for transients (and/ or photographic artefacts) of previously undetermined origin(s) (and/or proposing a previously unobserved non-random distribution), i.e.
"...a null hypothesis that specifies: What the transients are if NOT orbital glints, and what pattern that alternative would produce"
is not a null hypothesis, even if it contradicts Villarroel et al.'s hypothesis.
It would be a different experimental hypotheses.

A null hypothesis does not provide an alternative explanation, it merely states that the relationship proposed by the researcher is not the case.
Examples might be:

Hypotheses
Astronomical transients can be reliably discriminated from photographic/ copying artefacts using a specific method
There are fewer transients visible in the Earth's shadow than in equivalent areas of sky elsewhere
There is a significant temporal relationship between the observations of transients and nuclear test explosions, 1949-1956

Null Hypotheses
Astronomical transients cannot be reliably discriminated from artefacts using the specified method
There is no relationship between reduced frequency of observed transients and the Earth's shadow
There isn't a significant temporal relationship between the observations of transients and nuclear tests

The null hypotheses do not state why the experimental/ observational hypotheses aren't statistically significant; they do not provide alternative explanations for observations made or results derived.
We should not expect a null hypothesis to specify "What the transients are if NOT orbital glints"; that wouldn't be a null hypothesis.

(NB, I'm not saying the "Hypotheses" examples above are accurate summaries of Villarroel et al.'s hypotheses, nor is it meant to be implied that either the hypotheses or null hypotheses are to be preferred).
 