Issues with Replicating the Palomar Transients Studies

orianda

[Regarding: https://www.metabunk.org/threads/digitized-sky-survey-poss-1.14385/]
Quick methodological question: is this a rigorous replication attempt or exploratory analysis? I ask because the approach will matter for interpreting findings.

If it's rigorous replication: the methodology should be pre-registered upfront (exact data sources, processing pipeline with parameters, statistical tests, and decision criteria). A GitHub repo with a timestamped protocol is standard for computational reproducibility. This allows findings to actually challenge or support Villarroel's claims.
If it's exploratory analysis, pre-registration isn't necessary, but findings should be framed as "here's what we found exploring the data" rather than "we attempted replication and here's the definitive result."

Either approach is valid, but they require different standards and produce different levels of confidence in conclusions. Which mode is this thread in?
 
Quick methodological question: is this a rigorous replication attempt or exploratory analysis? I ask because the approach will matter for interpreting findings.

If it's rigorous replication: the methodology should be pre-registered upfront (exact data sources, processing pipeline with parameters, statistical tests, and decision criteria). A GitHub repo with a timestamped protocol is standard for computational reproducibility. This allows findings to actually challenge or support Villarroel's claims.
If it's exploratory analysis, pre-registration isn't necessary, but findings should be framed as "here's what we found exploring the data" rather than "we attempted replication and here's the definitive result."

Either approach is valid, but they require different standards and produce different levels of confidence in conclusions. Which mode is this thread in?
I'm working on exploratory analysis based on the workflow description in the 2022 paper, https://academic.oup.com/mnras/article/515/1/1380/6607509?login=false, because it is the only paper that describes the software in enough detail.

The idea is to learn how this can be done, because I've not done this kind of stellar image analysis with software before. It seems doable: I can already download a single 30x30 arcmin FITS file and run it through SExtractor and stilts/cdsskymatch. But tessellating all of the POSS-I red images is proving difficult, probably due to my lack of experience. If I manage to get it working, a similar (or perhaps the same) pipeline could be used to replicate the findings of the two new papers (PASP, SciRep), including the Earth shadow and other calculations.
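For reference, here is a minimal two-step sketch of what I mean for a single tile, assuming the FITS cutout is already downloaded and SExtractor and STILTS are installed. The file names, config files, detection threshold, match radius and the choice of Gaia DR3 as the modern reference catalogue are placeholders (not the papers' actual settings), and the exact cdsskymatch parameters should be double-checked against the STILTS docs:
Code:
# 1. Extract sources from the plate cutout.
#    default.param must list ALPHA_J2000 / DELTA_J2000 so the catalogue carries
#    sky coordinates (the cutout's WCS provides the astrometry).
#    The binary may be installed as "sex" instead of "source-extractor".
source-extractor poss1_red_tile.fits -c default.sex \
    -PARAMETERS_NAME default.param \
    -CATALOG_NAME poss1_tile_cat.fits -CATALOG_TYPE FITS_1.0 \
    -DETECT_THRESH 5.0

# 2. Cross-match the extracted sources against a modern catalogue via the
#    CDS X-Match service (radius in arcsec). Sources without a counterpart
#    in the match output are the transient candidates.
stilts cdsskymatch in=poss1_tile_cat.fits \
    ra=ALPHA_J2000 dec=DELTA_J2000 \
    cdstable=I/355/gaiadr3 radius=5 \
    out=poss1_tile_xmatch.fits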
 
The methodology should be pre-registered upfront (exact data sources, processing pipeline with parameters, statistical tests, and decision criteria).
Villarroel didn't do this, did she?
It's kinda useless when working with historical data, because you can run your analysis, then register, and then run the analysis again.
Pre-registration is only valid when data gathering is involved.
 
Villarroel didn't do this, did she?
It's kinda useless when working with historical data, because you can run your analysis, then register, and then run the analysis again.
Pre-registration is only valid when data gathering is involved.
That's not quite right. Pre-registration prevents p-hacking, HARKing, and selective reporting - all of which are just as possible (arguably more so) with historical data as with new data collection. The concern isn't "someone might cheat and pre-register after running analysis." The concern is analysts running multiple tests until finding significance, then claiming that test was their hypothesis all along. Pre-registration creates a public record of what was planned before seeing results.

You're right that Villarroel didn't pre-register; that's a limitation of her work. But "she didn't do it either" doesn't make it less important for rigorous replication. If the goal is to demonstrate her findings don't hold up to proper scrutiny, that means applying standards she didn't meet - not matching her limitations.

So...if findings from this thread are framed as definitive replication/refutation rather than exploratory analysis, they'll be vulnerable to the same methodological critiques being made of Villarroel.
 
So...if findings from this thread are framed as definitive replication/refutation rather than exploratory analysis, they'll be vulnerable to the same methodological critiques being made of Villarroel.
I think [that] thread should aim for a trivially replicable and improvable analysis. Arguing about p-hacking is an off-topic distraction. If there's p-hacking, then a fully open approach will reveal it.
 
The concern is analysts running multiple tests until finding significance, then claiming that test was their hypothesis all along. Pre-registration creates a public record of what was planned before seeing results.
My concern is analysts running multiple tests until finding significance, registering the test that worked out, then running it again and claiming that test was their hypothesis all along.

When research gathers fresh data, gathering that data after registration removes this concern; but registration does nothing for analysis of historical data that you already have on file.
What helps here is being transparent about your choices.
 
I think this thread should aim for a trivially replicable and improvable analysis. Arguing about p-hacking is an off-topic distraction. If there's p-hacking, then a fully open approach will reveal it.

@Mick West - I'm not arguing you're p-hacking. I'm pointing out that without pre-registration, any findings will be vulnerable to that critique, and legitimately so. Open code shows what analysis was run, but not how many were tried first. If 10 approaches are tested and the one showing significance is reported, the code for that approach looks fine...but it's still p-hacking. Pre-registration distinguishes "this is what we predicted and tested" from "this is what we found after trying multiple things." That's not a distraction - it's the difference between exploratory analysis (which is fine if framed honestly) and rigorous replication (which requires committing to methodology upfront).
If findings are presented as definitive replication/refutation without pre-registration, they'll be eclipsed by methodological critiques - the same ones being made of Villarroel.

I realize the group here doesn't want to take this approach, and I'll rest my point here. But I reserve the right to say I told you so later. :D
 
@Mick West - I'm not arguing you're p-hacking.
No one said you were, [...]

I'm pointing out that without pre-registration, any findings will be vulnerable to that critique, and legitimately so.
You're overlooking the fact that in order to preregister this, you'd need a time machine to take you back 70 years, before the data was gathered. There's no new information content in recycled data, there's no unpredictability; every outcome, for any of our chosen inputs, is purely deterministic now. You're advocating a method that, misused the way you want us to misuse it (post facto), incentivises cherry-picking a preregistration that we know leads to the outcome that would support our priors. That's the opposite of good science. [...]
 
@Mick West - I'm not arguing you're p-hacking. I'm pointing out that without pre-registration, any findings will be vulnerable to that critique, and legitimately so. Open code shows what analysis was run, but not how many were tried first. If 10 approaches are tested and the one showing significance is reported, the code for that approach looks fine...but it's still p-hacking. Pre-registration distinguishes "this is what we predicted and tested" from "this is what we found after trying multiple things." That's not a distraction - it's the difference between exploratory analysis (which is fine if framed honestly) and rigorous replication (which requires committing to methodology upfront).
If findings are presented as definitive replication/refutation without pre-registration, they'll be eclipsed by methodological critiques - the same ones being made of Villarroel.

I realize the group here doesn't want to take this approach, and I'll rest my point here. But I reserve the right to say I told you so later. :D
I think a general concern of the group is that the paper's conclusions rely on a series of inferences based on a series of disputable (and often disputed) standards:
  • Their definition of what counts as a glint and their dismissal of other researchers' interpretations of those spots as potential emulsion defects. The inference here is that these spots can't be from other causes and must be from reflective objects high in the atmosphere or in orbit.
  • Their envelope for transients being "in a line" seems somewhat generous. Even if someone can reproduce their methodology, the ranges here are a choice. There is also an arbitrary choice that the intervals don't have to be regular; that there could be objects that glint at irregular intervals, so the spacing of dots and length of a line are arbitrary.
  • The statistical correlations are rather loose. Sure, they chose a +/- 1 day window for associating detonation dates with glints before running the numbers, but the lack of precision about time zones and what time during a day a glint or explosion happened means the windows cover ~96 to 120 hours. When we're talking about a period of time when detonations happened an average of every 528 hours (one per 22 days), that interval seems generous. If more precise times aren't available, that limits the value of any conclusions.
  • The analysis should also incorporate the dates when detonations were scheduled but not conducted (which happened for technical and weather reasons), since there is no meaningful difference. (It might be worth splitting the analysis into two groups -- did transients appear before the day of a scheduled test, or not? -- which would be a much larger set of dates.) If the data is not available, that limits the value of any conclusions.
  • It's even more generous when you consider they don't control for the external correlations between the data sets -- both the blasts and the survey dates are influenced by human schedules, and the survey plates are not a random or representative sampling of the skies during the years in question.
  • I'm not even going to delve into the messiness of the UAP report data, which is at best not a random or representative sample.
So another researcher, or someone here on Metabunk, might be able to replicate their identification of transients using their methods, but what people are questioning is whether those methods, and the association of those transients with other events, give too much weight to imprecise and incomplete data.
 
There is also an arbitrary choice that the intervals don't have to be regular; that there could be objects that glint at irregular intervals, so the spacing of dots and length of a line are arbitrary.
that's not an arbitrary choice, it's the only choice that yields a result.
We can surmise that it was made after data exploration.
83 satellite candidates, and apparently none of them rotates periodically.
If they had pre-registered on that, they'd fail.

Their definition of what counts as a glint and their dismissal of other researchers' interpretations of those spots as potential emulsion defects.
My problem is that Villarroel and Solano are dismissing their own identification of many of these transients in the 2022 paper.

I really like that older paper. It's straightforward and transparent, a lot of astronomical knowledge went into it, and it went very far. One could say it didn't go far enough, but they published their resulting dataset, so anyone else could take it from there; and I think that's ok.

The new papers are a big step backwards, in several aspects.
 
Since this is a spin-off of the digitizing transients thread, I think Mick missed one of my posts in the other thread that should have been bumped over here as well, so I'll just copy it here. I'll also note that, so as not to go OT, @orianda PMed me with some thoughtful responses. With her permission I will share them here or let her copy or reword them as she sees fit:

Post #156:
I'm not arguing you're p-hacking. I'm pointing out that without pre-registration, any findings will be vulnerable to that critique, and legitimately so

Isn't that what this thread is? It's public. It's transparent. It's a real-time public record of what various members are doing in an attempt to understand, and perhaps ultimately replicate, Villarroel's findings. They are trying to identify the exact dataset used, download it, and then recreate the software to replicate the workflow that identifies the transients.

Open code shows what analysis was run, but not how many were tried first. If 10 approaches are tested and the one showing significance is reported, the code for that approach looks fine...but it's still p-hacking.

Is any of Villarroel's code open? I'm assuming not, thus the people here are trying to recreate it. Do we know how many times she and others ran various bits of code on the datasets until they arrived at the final conclusions in their papers? If they ran multiple tests on the data until it showed what they were looking for, is that "still p-hacking"? I honestly don't know.

That's not a distraction - it's the difference between exploratory analysis (which is fine if framed honestly) and rigorous replication (which requires committing to methodology upfront).

Yes, but how does one "commit to methodology up front", i.e. the software workflow, if the study they are attempting to replicate does not share the software workflow used? It becomes "exploratory" by definition, it would seem, until one can create, and make multiple runs of, software workflow configurations that replicate Villarroel's claims. Then one can comment on why the workflow used is meaningful or not.
 
I think the idea of registering a study makes a lot of sense if it's a one-off analysis. Like if you are studying whether aspirin helps memory, so you do some double-blind tests with 1000 students over 12 months. That's not something that someone else can easily replicate, and if you tried, it would be difficult to get exactly the same results.

But they could replicate an analysis of the data gathered from a study. You could check the math, the code, and run the code again. You could write your own version of the code.

That's kind of what we are doing here. The POSS-1 DSS digitized plates (and other data) are an immutable data set. You can run exactly the same analysis on them twice. If someone else has sufficiently documented their steps, then you can replicate those steps.

Could you pick and choose what you publish? Sure, but that's not what we are doing here. We're basically trying to check their work by analyzing the data in the same way. Except we're (loosely) doing it open-source. What's to register, that we are trying to figure out what they did, and do it again to check if it was correct?
 
...
Could you pick and choose what you publish? Sure, but that's not what we are doing here. We're basically trying to check their work by analyzing the data in the same way. Except we're (loosely) doing it open-source. What's to register, that we are trying to figure out what they did, and do it again to check if it was correct?

@Mick West - I think you're mixing up whether data can be replicated with whether the analysis approach was decided upfront. Sure, POSS plates don't change and anyone can run the same code twice. But that's not the issue.

Here's an example:
Researchers collect aspirin/memory data but don't say ahead of time how they'll analyze it. They try overall memory scores (nothing), then morning doses only (nothing), then students who slept well (significant result). They publish that last one. You can perfectly replicate their code. But did they run 3 tests or 30 before finding something? With multiple tests, you expect some to show significance by chance alone. That's why the number of attempts matters. The same applies here - the plates are fixed but there are tons of ways to analyze them. Which altitude for shadow calculations? What filtering? Which statistical tests? What thresholds?
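To put a rough number on "3 tests or 30", here's a back-of-envelope sketch (assuming, for simplicity, independent tests each run at the usual 0.05 threshold; nothing here comes from the papers):
Code:
# chance that at least one of k independent tests comes up "significant"
# at alpha = 0.05 purely by luck
awk 'BEGIN {
  alpha = 0.05
  split("1 3 10 20 30", ks, " ")
  for (i = 1; i <= 5; i++) {
    k = ks[i]
    printf "k = %2d tests -> at least one false positive: %2.0f%%\n", k, 100 * (1 - (1 - alpha)^k)
  }
}'
With 20 quiet re-tries you already have roughly a 64% chance of a spurious "hit", which is why the number of attempts matters as much as the code for the attempt that got reported.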

This matters especially for subtle effects like the shadow deficit, where analytical choices can flip the result. If a finding is robust it should hold up across multiple reasonable approaches. If it only appears with very specific choices, that's worth questioning.

Now, if you're doing open exploration and will frame it as "here's what we found trying different approaches" rather than "we definitively refuted her work," then formal pre-registration isn't necessary. Exploratory work is valuable, especially if you document what you tried and why. But if the goal is definitive claims about whether her findings hold up, you need to commit to the approach upfront. Otherwise there's no way to tell if the methodology was chosen for good theoretical reasons or because it gave the expected answer.

Pre-registration just creates a public record of what you planned to test before seeing results. Same principle for new experiments and historical data.
 
that's not an arbitrary choice, it's the only choice that yields a result.
We can surmise that it was made after data exploration.
So... what you are describing here sounds a lot like p-hacking. Trying different approaches until finding one that "yields a result," then presenting that as if it was the hypothesis.

You're right that if they'd pre-registered on periodic rotation they would have failed. That's the point. The fact that they found a result by making post-hoc choices doesn't make it more credible, it makes it less credible.

I agree with you that the 2022 paper was better. It was more transparent about methods, published the dataset, and was appropriately cautious about conclusions. The newer papers seem like a step backwards in rigor.

Which raises the question: if you're critiquing Villarroel for making analytical choices "after data exploration" to get results, shouldn't the replication attempt here avoid doing the same thing? If Metabunk tries multiple altitude calculations, filtering approaches, and statistical tests until finding one that shows no deficit, that's the mirror image of what you're critiquing her for.
 
My main concern is that by reporting flaws or interpreting the results of a test, without having thought through the validity of the critique or test and how to interpret it beforehand, you risk introducing a lot of noise. Already lots of half baked arguments and false claims have been posted in these threads, and even if they are shown to be false later, they will tend to persist and get repeated over and over. Many of the people outside the loop, who want to believe the study is flawed (conversely that it is correct), may just pick and choose from all of the stuff thrown at the wall, valid or not, and then just repeat it over and over elsewhere. In the wild, they become rumors, and the general public ends up misinformed. The credibility of the accusation or validation claim should be established before making it.

This doesn't really apply when it comes to just trying to faithfully reproduce exactly what they did.
 
Already lots of half baked arguments and false claims have been posted in these threads, and even if they are shown to be false later,
If you can't show them to be false now, how do you know they're false?
I'm pretty sure that everything I've written is backed by the paper, so what specifically are those claims you are talking about?
Or are you simply slandering Metabunk?
 
My main concern is that by reporting flaws or interpreting the results of a test, without having thought through the validity of the critique or test and how to interpret it beforehand, you risk introducing a lot of noise. Already lots of half baked arguments and false claims have been posted in these threads, and even if they are shown to be false later, they will tend to persist and get repeated over and over. Many of the people outside the loop, who want to believe the study is flawed (conversely that it is correct), may just pick and choose from all of the stuff thrown at the wall, valid or not, and then just repeat it over and over elsewhere. In the wild, they become rumors, and the general public ends up misinformed. The credibility of the accusation or validation claim should be established before making it.

This doesn't really apply when it comes to just trying to faithfully reproduce exactly what they did.
This reads like you think that we should not be allowed to discuss hypotheses about what flaws the studies may have, because someone could read them and be convinced. Is this what you meant?
 
Why do you say it reads like that when that's not what it actually says?
I'm trying to understand your argument and what you implied.

Public discussion of the various hypotheses is a good thing. False claims and bad arguments can be refuted, and their public refutation is a good thing for the general understanding of the subject. Refuting bad arguments can even be the starting point for great new hypotheses and breakthroughs. Some mistakes might get spread around, but that is true for any public conversation about any subject. The discussion being public means more people get access to the refutation.
Do you disagree with this?
 
Public discussion of the various hypotheses is a good thing. False claims and bad arguments can be refuted, and their public refutation is a good thing for the general understanding of the subject. Refuting bad arguments can even be the starting point for great new hypotheses and breakthroughs. Some mistakes might get spread around, but that is true for any public conversation about any subject. The discussion being public means more people get access to the refutation.
Do you disagree with this?
I'd like to see an actual example.
Talking in the abstract often leads to going in circles.
 
I'm trying to understand your argument and what you implied.

Public discussion of the various hypotheses is a good thing. False claims and bad arguments can be refuted, and their public refutation is a good thing for the general understanding of the subject. Refuting bad arguments can even be the starting point for great new hypotheses and breakthroughs. Some mistakes might get spread around, but that is true for any public conversation about any subject. The discussion being public means more people get access to the refutation.
Do you disagree with this?

No, but that doesn't contradict what I said. Do you disagree with "thinking through the validity of the critique or test and how to interpret it before reporting flaws or interpreting the results of a test", or with "establishing the credibility of an accusation or validation claim" before making it? Or what is it you disagree with specifically?
 
But if the goal is definitive claims about whether her findings hold up, you need to commit to the approach upfront.

This was something we discussed in a PM, so help a dullard out here.

If I understand this correctly, in a simplified version: Villarroel & Solano took a dataset of images digitized from the POSS-I survey (1949-1956). They created an automated software workflow (the code) to sift through these digitized images, comparing them to more modern surveys to identify objects, or "transients" as they call them, captured in the original POSS-I survey that no longer appear in modern surveys.

Having identified ~100,000 objects in the images, the code further analyzed them, finding that all but ~5000 of the objects could be explained as celestial bodies, defects or other things. The remaining ~5000 were then hypothesized to be a number of things, including possibly physical structures like artificial satellites prior to humans having launched any.

So, what exactly is anyone supposed to be replicating? If the replication involves creating code that will analyze the given dataset and spit out the same ~5000 unexplained transients, that seems a bit problematic.

If the code used by Villarroel and Solano is unknown or not shared, someone pre-registering the exact code they intend to use is largely meaningless. Someone else's version of code to analyze the same dataset may give different results. Replication can only occur with the exact same code, right? At worst, using the same code on the same dataset gives a different result than reported, suggesting a problem somewhere. But I suspect that IF the same code used by Villarroel & Solano is used on the same dataset they used, it will again spit out ~5000 unexplained transients. That's what their code extracts from their dataset. It's what we would expect to happen.

Others may create more stringent code that possibly identifies many of these transients as things like dust and defects resulting from the copying of the plates prior to being digitized. But that doesn't refute that Villarroel & Solano's code found ~5000, it just says different code creates different results.

Everyone is working from a set dataset, assuming it's properly described. Each person's code will analyze the dataset in its own way. The only real issue would be if Villarroel & Solano's exact code and exact dataset gave different results when tried by others.
 
My main concern is that by reporting flaws or interpreting the results of a test, without having thought through the validity of the critique or test and how to interpret it beforehand, you risk introducing a lot of noise. Already lots of half baked arguments and false claims have been posted in these threads, and even if they are shown to be false later, they will tend to persist and get repeated over and over. Many of the people outside the loop, who want to believe the study is flawed (conversely that it is correct), may just pick and choose from all of the stuff thrown at the wall, valid or not, and then just repeat it over and over elsewhere. In the wild, they become rumors, and the general public ends up misinformed. The credibility of the accusation or validation claim should be established before making it.

This doesn't really apply when it comes to just trying to faithfully reproduce exactly what they did.

The only half baked claim I can remember from the thread was you saying that it was easy to determine which plates were used in the study. And then pages and pages of investigation which resulted in nobody being sure of exactly which plates were used and what the justification for their use was.

Were there other half baked claims that need supporting?
 
@NorCal Dave - You're hitting on exactly the right question: what does "replication" even mean here? You're right that without her exact code, reproducing her exact results isn't possible. Different code will give different results even on the same data. But there are different things people might mean by "replication" in this context.

- One approach is just running her exact code on her exact data to confirm you get her ~5000 transients. That confirms the code runs correctly but doesn't test whether the findings are real.
- Another approach is using the same data with independently written code to test the same hypothesis. If you get similar results, that's evidence the finding is robust. If you don't, it suggests the finding was fragile or specific to her implementation.
- A third approach is testing the underlying hypothesis directly: does a shadow deficit actually exist in this data? This is independent of any specific code implementation but requires deciding upfront what test you'll run and what would count as supporting or refuting the hypothesis.

The issue is what claims get made from the work. If someone builds different code and gets different results, they can reasonably say "we couldn't reproduce her results with our implementation" or "the finding appears sensitive to methodological choices." But they can't say "we definitively proved her findings are wrong" without actually testing the hypothesis with pre-specified methodology rather than just trying different approaches until getting a null result.

When I talk about pre-registration, I'm talking about that third type of work. If the goal is just exploring whether results hold up with different reasonable approaches, that's valuable but can't support definitive claims about whether she's right or wrong.

The level of rigor should match the strength of the claims.
 
The level of rigor should match the strength of the claims.
Yes.
But your suggestion is that we should exceed it.
- Another approach is using the same data with independently written code to test the same hypothesis. If you get similar results, that's evidence the finding is robust. If you don't, it suggests the finding was fragile or specific to her implementation.
Yes. Requires no pre-registration, as we're following the pre-established methodology.
But they can't say "we definitively proved her findings are wrong" without actually testing the hypothesis with pre-specified methodology rather than just trying different approaches until getting a null result.
The fun here is that Villarroel 2025 uses a flawed null hypothesis for her findings. If we can show that a different plausible null hypothesis yields the same outcome, then her outcome is no longer significant.

That's just the same approach we generally use for debunking.
"There's a weird light in the sky, must be aliens!"—"Umm, Flight 1168 to NYC was in that position at that time."
And suddenly the light in the sky is no longer an abnormal finding.

The paper argues that we shouldn't expect a shadow deficit, or that we shouldn't expect transient dates to correlate with nuclear test dates, but these expectations are not established with any kind of rigor. If we show these expectations to be wrong, then the paper's results are no longer supported.
At that point, it doesn't matter if we pre-registered our effort or not.
 
@NorCal Dave - You're hitting on exactly the right question: what does "replication" even mean here? You're right that without her exact code, reproducing her exact results isn't possible. Different code will give different results even on the same data. But there are different things people might mean by "replication" in this context.
"Different code will give different results even on the same data." is an utterly absurd claim, trivially disprovable:
Code:
$ echo "hello world" | (read x && printf "you too\n")
you too

$ echo "hello world" | (head -n 1 > /dev/null && echo 'you too')
you too
Same input, different code, same output.

If she has described accurately enough what processing she performed, then anyone else's implementation of that process should yield the same results. If it doesn't, and the reimplementation is validated, then her description is proved inadequate. However, even if the reimplementation does yield the same results, that doesn't validate whether it was the appropriate processing to perform. However, that's an evaluation that can, and should - to avoid claims of cheating - be done absent the data set itself (which is kinda impossible, as already mentioned, so must be done on trust).
 
No, but that doesn't contradict what I said. Do you disagree with "thinking through the validity of the critique or test and how to interpret it before reporting flaws or interpreting the results of a test", or with "establishing the credibility of an accusation or validation claim" before making it? Or what is it you disagree with specifically?

I specifically disagree with you saying it's bad that people you've arbitrarily decided haven't thought it through enough are discussing the study.
 
"Different code will give different results even on the same data." is an utterly absurd claim, trivially disprovable...
@FatPhil - Your bash example completely misses the point.

I meant: two astronomers analyzing the same POSS plates with different filtering thresholds, different altitude assumptions, different statistical tests - all valid implementations of "analyze for shadow deficit" - can get different results. @ThickTarget pointed this out in the other thread - even the choice of using Poisson statistics vs. other approaches matters when coverage is patchy. These are all valid methodological choices that affect outcomes.
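A concrete (and entirely hypothetical) illustration of the kind of choice I mean - the same plate, the same tool, two defensible detection thresholds, two different source lists feeding everything downstream. The file and config names are made up:
Code:
# same image, same extractor, two reasonable threshold choices, two catalogues
for thresh in 3.0 5.0; do
    source-extractor poss1_red_tile.fits -c default.sex \
        -CATALOG_NAME "cat_thresh_${thresh}.fits" -CATALOG_TYPE FITS_1.0 \
        -DETECT_THRESH "$thresh"
done
Neither run is "wrong", but they won't produce the same transient candidates - and the final code only shows the threshold that was kept, not the ones that were tried.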

That's why those choices need to be specified upfront - not to make code deterministic, but to prevent trying multiple valid approaches until one gives the expected answer.

Your example would be relevant if astronomical analysis was as simple as echoing a hardcoded string. It's not.
 
@Mendel - The Flight 1168 analogy doesn't work here.

In UFO debunking:
- Claim: Weird light, unexplained
- Debunk: Actually it's Flight 1168
Single definitive explanation accounts for observation

This case:
- Claim: Shadow deficit exists in data.
- You: We'll try different null hypotheses until one produces a similar pattern

The difference: if you try 10 different null hypotheses and the 8th one happens to produce a similar pattern, that's not the same as identifying Flight 1168. That's finding one explanation among many you tested. Pre-registration matters because it distinguishes "we predicted null hypothesis X would produce this pattern" from "we tried null hypotheses A through J until one produced this pattern." The first is rigorous refutation. The second is exploratory finding that suggests alternative explanations exist, but doesn't definitively refute the observation.

@ThickTarget has pointed out the original analysis has statistical problems (wrong assumptions for patchy coverage). If you're going to show those problems matter, you need to specify your approach upfront - otherwise you're potentially making the same mistake in reverse.
 
The difference: if you try 10 different null hypotheses and the 8th one happens to produce a similar pattern, that's not the same as identifying Flight 1168
What if I first check Flight UA263, Flight D1650, a police helicopter and Venus -- and it turns out that Flight 1168 matches the report? Does having checked the others first somehow mean 1168 is now not what was seen?

Finding multiple possible answers to "what could have caused the claimed effect?" seems to me to then put us into Fr. William of Occam's ballpark, where the simplest that requires the fewest multiplied entities is preferred unless it can be ruled out. No matter how many flights I also checked that did not replicate the time and position of the claimed UFO...
 
Finding multiple possible answers to "what could have caused the claimed effect?" seems to me to then put us into Fr. William of Occam's ballpark, where the simplest that requires the fewest multiplied entities is preferred unless it can be ruled out.
That makes a lot of sense... but it occurs to me that this is exactly what people have been doing with the Calvine photo, isn't it? Endless posts on how it could have been hoaxed, without anybody knowing if it WAS hoaxed, and none of the suggested methods getting us an inch closer to a determination of the truth.
 
Still talking about shell syntax. Still missing the point about methodological choices in scientific analysis.

Let's try to focus more on your points and less calling out people. This applies to everyone.
 
That makes a lot of sense... but it occurs to me that this is exactly what people have been doing with the Calvine photo, isn't it? Endless posts on how it could have been hoaxed, without anybody knowing if it WAS hoaxed, and none of the suggested methods getting us an inch closer to a determination of the truth.
Given the evidence to work with in many of these cases, proving THE answer is not going to be possible.

But it is also not necessary. If a given case COULD be aliens, or COULD be a piece of cardboard on a string, or COULD be a kite, the most parsimonious explanation might be the cardboard, or might be the kite, and that can be argued forever if the evidence is poor enough. But both would be more likely than aliens, and given that it could be something other than aliens, the evidence cannot prove aliens are behind it all! ^_^

It ain't on us to prove it is not aliens, it is up to somebody who thinks they have evidence of aliens to make and prove their case. That's hard -- given what we know so far, they'd have to prove that aliens exist, and can get here, and this case is them. That's a difficult thing to do. But that's not our fault, and it is not up to us to make their lives easier by accepting poor evidence!

(Which I figure you already know, I just felt the urge to orate!)
 
What if I first check Flight UA263, Flight D1650, a police helicopter and Venus -- and it turns out that Flight 1168 matches the report? Does having checked the others first somehow mean 1168 is now not what was seen?

Finding multiple possible answers to "what could have caused the claimed effect?" seems to me to then put us into Fr. William of Occam's ballpark, where the simplest that requires the fewest multiplied entities is preferred unless it can be ruled out. No matter how many flights I also checked that did not replicate the time and position of the claimed UFO...
@JMartJr - Great question, this gets at an important distinction.

Your UFO example works because you're identifying which specific thing explains a single observation. Flight 1168 either was or wasn't at that position and time. Checking other flights first doesn't change whether 1168 actually matches.

The statistical case is different. When testing whether a pattern is significant, how many tests you run affects the inference itself. Villarroel reports a shadow deficit with p < 0.05, roughly meaning "if this were just random noise, a result this strong would show up only 5% of the time." But if you test 20 different null hypotheses, you'd expect one to hit p < 0.05 just by luck. Finding one that produces a similar pattern doesn't tell you her finding is wrong - it tells you that among many things you tried, one happened to work.

The Flight 1168 analogy would work if there was one specific null hypothesis that should produce the pattern if she's wrong. Test that hypothesis. If it produces the pattern, that's real evidence against her. If not, evidence for her. What breaks the analogy is trying multiple different null hypotheses until one produces a similar pattern, then claiming that refutes her work. That's not the same as identifying which flight was actually there.

Does that clarify the difference? Subtle but important for whether you can make definitive claims.
 
The statistical case is different. When testing whether a pattern is significant, how many tests you run affects the inference itself. Villarroel reports a shadow deficit with p < 0.05, roughly meaning "if this were just random noise, a result this strong would show up only 5% of the time." But if you test 20 different null hypotheses, you'd expect one to hit p < 0.05 just by luck. Finding one that produces a similar pattern doesn't tell you her finding is wrong - it tells you that among many things you tried, one happened to work.
That's not how this works.

In statistics, you see an effect when you notice something unexpected: Here's this group of 21000 unvaccinated people, and 9 got Covid over 2 months; but in this other group of 21000 people, which we vaccinated, only 1 person did, which is less than we expected based on what happened to the control group. Hence, the vaccine has an effect. Without the comparison, there is no effect. But you have to compare the right things, which is why medical trials like this one are placebo-controlled and blinded (and, yes, ideally pre-registered).
Hypothesis: Vaccinated people are protected against Covid.
Null hypothesis: vaccinated people get Covid at the same rate as unvaccinated people.
The vaccine trial proved the hypothesis likely true, and the null hypothesis likely false.
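As a back-of-envelope check of why 1 case versus 9 is convincing (a rough Poisson sketch with the numbers above, not the trial's actual analysis): if the vaccine did nothing and the vaccinated arm really had the same ~9 expected cases as the control arm, seeing at most one case would be very unlikely.
Code:
# P(X <= 1) for a Poisson variable with mean 9:
# the chance of at most 1 case when ~9 were expected
awk 'BEGIN { lambda = 9; printf "%.5f\n", exp(-lambda) * (1 + lambda) }'
That prints 0.00123, i.e. about a 1-in-800 chance under "the vaccine does nothing", which is why the comparison carries weight.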

For the UFO report, we have a person seeing a light in the sky and incredulously thinking: "that shouldn't be there". They compare their observation with what they think they should see, which is not a bright light.
Hypothesis: there's a UFO emitting light in the sky here
Null hypothesis: there should be something in the sky here
The problem here is that not only has the observer failed to produce any evidence for what is producing the light (all he saw was the light), he merely believed the null hypothesis was false but didn't check properly. @flarkey proved the null hypothesis true, and that makes the effect vanish.

That's because we don't challenge the hypothesis. We're not in a parking lot in Long Island at midnight looking for lights in the sky, replicating the evidence for the hypothesis. We believe the officer saw a light in the sky. We simply don't believe the null hypothesis is false, because we're sceptical; and we have methods we apply to check it, including e.g. examining historical flight tracking data and a 3D simulation of the observer's view which includes the flight: it looks exactly like his video. The null hypothesis is likely true.

In Villarroel 2025a, we're dealing with two hypotheses:
External Quote:
We use images from the First Palomar Sky Survey to search for multiple (within a plate exposure) transients that, in addition to being point-like, are aligned along a narrow band. [..] These aligned transients remain difficult to explain with known phenomena, [..] We also find a highly significant (∼22σ) deficit of POSS-I transients within Earth's shadow when compared with the theoretical hemispheric shadow coverage at 42,164 km altitude. The deficit is still present though at reduced significance (∼7.6σ) when a more realistic plate-based coverage is considered.
Alignment effect:
Hypothesis: the transients are aligned because they are on a path
Null hypothesis: the transients are aligned randomly

Villarroel states she cannot reject the null hypothesis for most of her 3-point samples. She "doesn't show her work": she cites older work about quasars and claims it applies here, to reject the null hypothesis for the 4-point and 5-point alignments. The problem I see is that the data is not uniformly random, and that changes the statistical analysis. The grid pattern we're seeing in the data makes alignments more likely than a uniform random distribution would. But it's not caused by orbital paths, and it seems to correlate with the plate edges, i.e. the cause is on Earth and not in the sky.
This means Villarroel rejected the wrong null hypothesis.

That's similar in principle to saying "this darts player is better than average because his darts are significantly closer to the bullseye than a random uniform covering of the dart board" when instead you should've compared with random dart players.
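To make the grid-pattern point concrete, here's a toy Monte Carlo - made-up point counts, grid spacing, jitter and tolerance, nothing from the actual catalogue - showing that the same number of points produces far more chance "alignments" when they sit on a rough grid than when they are uniformly random:
Code:
awk 'BEGIN {
  srand(1); n = 150; tol = 0.001            # 150 points, alignment tolerance
  for (i = 1; i <= n; i++) {                # uniformly random points in a unit square
    ux[i] = rand(); uy[i] = rand()
  }
  for (i = 1; i <= n; i++) {                # points snapped to a 12x12 grid, tiny jitter
    gx[i] = (int(rand() * 12) + 0.5) / 12 + (rand() - 0.5) * 0.0004
    gy[i] = (int(rand() * 12) + 0.5) / 12 + (rand() - 0.5) * 0.0004
  }
  printf "near-aligned triples, uniform points: %d\n", triples(ux, uy, n, tol)
  printf "near-aligned triples, gridded points: %d\n", triples(gx, gy, n, tol)
}
# count triples where the third point lies within tol of the line through the other two
function triples(X, Y, n, tol,   i, j, k, d, c) {
  for (i = 1; i <= n; i++)
    for (j = i + 1; j <= n; j++)
      for (k = j + 1; k <= n; k++) {
        d = (X[j] - X[i]) * (Y[k] - Y[i]) - (Y[j] - Y[i]) * (X[k] - X[i])
        if (d < 0) d = -d
        d /= sqrt((X[j] - X[i])^2 + (Y[j] - Y[i])^2)   # perpendicular distance
        if (d < tol) c++
      }
  return c
}'
An "alignment" rate calibrated against a uniform random null will therefore look inflated on gridded data, which is exactly the problem with rejecting that null here.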


The shadow deficit has the same problem:
Hypothesis: there are fewer transients in the 40Mm shadow because transients are caused by the sun
Null hypothesis: the deficit is caused by a uniform random coverage of the sky

Villarroel finds 0.32% of transients in the shadow, against 1.15% expected.
The problem: the null hypothesis doesn't reflect the actual distribution of plate defects.

New null hypothesis: the deficit is caused by a uniform random coverage of the plates.

Villarroel finds 0.32% of transients in the shadow, versus 0.53% expected. This means that the actual plate coverage pattern deviates from a uniform sky coverage by more than a factor of 2. Yet table 4 shows only the values for the uniform coverage, which she herself showed to be false. Plus, with the 80Mm altitude, she finds even less of a deficit! No explanation is given.

Then she does a great thing:
External Quote:
As a quick check, nevertheless, we also test by masking edge transients (>2° from plate center) to remove all artifacts close to the plate edge. Removing the edge of the plate in the analysis, yields a similar ∼30% deficit in Earth's shadow,
This is motivated by the grid pattern in the data that we've also found (after Hambly & Blair pointed it out), see e.g. https://www.metabunk.org/threads/digitized-sky-survey-poss-1.14385/post-355943 . And it proves that the shadow effect is bogus.
Note that the paper is skimpy on numbers here. The 2° cut removes about half of the plate area (4° vs. 6° diameter, aka 16:36), but many more plate defects, since the grid pattern is caused by plate defects. So if there are orbital objects, the data should still have about 50% of these, but maybe only 10% of the plate defects. (I don't really know the number, but it's substantially less than 50%.) But the shadow deficit shrank from 39% to 30%! It should have done the opposite!

We now know that the edge half of the plates has a stronger shadow deficit than the center half. This falsifies Villarroel's finding.



Hypothetical example, with numbers instead of algebra:
Assume 106,339 data points. 349 points are found in Earth's shadow, but we expected 564 points. We have a deficit of 215 points. We can then split our sample into 65850 sources that are uniformly randomly distributed, and 40489 sources that are uniform everywhere but never in shadow.

What happens when we cut the edges off? If the edges are mostly plate artifacts, which are shadow-agnostic, we may be left with, say, 10% of the 65850 = 6585. The shadow avoiders are supposed not to favor edges, so we keep 16/36 of 40489 = 17995; we then have a total of 24580 transients, with 35 in the shadow. 35/24580 = 0.14% versus 0.53%, that's a 73% deficit. The hypothetical deficit has doubled! Yet the real deficit did not.

Instead, we saw a 30% deficit. We'd expect 0.53% of the 24580 samples (130) to be in shadow if there was no shadow effect. With a 30% deficit, we actually found 91. That means our sample is now split into 17170 uniformly distributed sources and 7410 shadow avoiders. The edge cut eliminated 48680 of 65850 uniformly distributed sources, and 33079 of 40489 shadow avoiders.

This means we have 33079 shadow avoiders in the edges, and 7410 shadow avoiders in the centers of the plates, although these areas are roughly equal. And that should not happen. (The only way to get this to work out is to assume that there are more defects near the centers of the plates, and that's clearly not the case.)
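The same arithmetic can be re-run mechanically; the inputs below are the illustrative numbers assumed above, not values from the paper's tables:
Code:
awk 'BEGIN {
  p = 0.0053                           # expected in-shadow fraction, plate-based coverage
  uniform = 65850; avoiders = 40489    # assumed split of the 106,339 transients

  # Edge cut: keep ~10 percent of the (artifact-dominated) uniform population
  # and 16/36 of the shadow avoiders (inner 4 deg vs. full 6 deg plate area).
  ku = 0.10 * uniform
  ka = (16 / 36) * avoiders
  kt = ku + ka
  printf "kept after the cut: %.0f sources, %.0f of them in shadow\n", kt, p * ku
  printf "predicted deficit after the cut: %.0f%%\n", 100 * (1 - ku / kt)

  # What the roughly 30 percent deficit actually observed after the cut implies:
  obs = int(0.70 * p * kt + 0.5)       # ~91 in-shadow transients actually found
  centre_avoiders = kt - obs / p
  printf "implied shadow avoiders in the plate centres: %.0f\n", centre_avoiders
  printf "implied shadow avoiders on the plate edges:   %.0f\n", avoiders - centre_avoiders
}'
It predicts a ~73% deficit after the cut; plugging in the ~30% deficit that was actually observed is what forces most of the "shadow avoiders" onto the plate edges.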

This is only possible if the data that Villarroel expects to be uniformly distributed is in fact not uniformly distributed. Her null hypothesis is wrong.

If we can replicate her data, we can hopefully show why that is.
And we won't be doing it by trying "random patterns". As with flight 1168, we'll try to show exactly what data causes the effect.
 
But if you test 20 different null hypotheses, you'd expect one to hit p < 0.05 just by luck. Finding one that produces a similar pattern doesn't tell you her finding is wrong - it tells you that among many things you tried, one happened to work.
I think another difference is, to me at least, my goal is not to prove that a UFO (whether flying over the countryside on Tuesday or orbiting decades ago) is NOT aliens. My interest is -- are there other things it could plausibly be, things that are orders of magnitude more likely. If so, aliens have not been proven no matter how many things were considered that could not have caused the observations. (Other people here of course have different goals and approaches!)

The jet example is much easier, of course, as it is simpler to find one plane that exactly matches what was reported, and the old "if that is not the plane, why aren't we seeing the plane as well, since it must also be right there?" thing kicks in. Or "Venus was right there, too, why is Venus not in the picture?" All of this is less likely in the study under discussion.

But if there are other things (or just one other thing) that could also explain the observations, and those/that thing/s are more Occam-friendly, as it were, that's good enough for me -- the onus is on Team Alien to show that it CAN'T be the other things if they want to prove the extraordinary claim that in spite of all the immense difficulties caused by physics, Aliens are in fact flying around here.

So, did they find something odd on the plates? Possibly, possibly not. The arguments against it seem pretty good, to me, your mileage may vary of course. But the gulf between that and "it's aliens" is huge.
 
That's not how this works.

In statistics...
@Mendel - I think we're talking about two different things. Your vaccine example is correct; that's how hypothesis testing works with ONE pre-specified hypothesis and ONE test.

My point is about what happens when you test multiple hypotheses sequentially. If you test 20 different null hypotheses at p < 0.05, you'd expect one to appear "significant" just by chance - that's the multiple testing problem.

In your vaccine example:
- Pre-specified hypothesis (vaccine protects)
- Single test
- No multiple testing issue

The concern I'm raising:

- Try explanation A for shadow deficit (doesn't fit data)
- Try explanation B (doesn't fit)
- Try explanation C (fits!)
- Without pre-specification, it's hard to know if C is the real cause or just the one that happened to fit

Your technical analysis is strong - the edge-cutting observation suggests Villarroel's null hypothesis (uniform coverage) was wrong. I'm saying: when testing alternative null hypotheses, pre-specifying which one you'll test (and why) prevents this issue.
 