Subpixel Motion Tracking: Methods, Accuracy, and Application to Video of Collapsing Buildings

You're correct, but remember it takes 1000 frames to traverse the circular path, which is only about 22 pixels long. If you check what I did, I rounded to 1/5 pixel per frame for the 100-frame video, then claimed a resolution more than 50 times finer than that, i.e. ~1/250th of a pixel... which is what you just said. Yes, indeed, 1/256th of a pixel is the gray-scale resolution.
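
A quick sanity check of the arithmetic, in Python for anyone playing along (this assumes the ~22 px figure is the length of the path):

    path_px = 22            # approximate path length in pixels (assumed above)
    print(path_px / 1000)   # ~0.022 px/frame in the 1000-frame version
    print(path_px / 100)    # ~0.22 px/frame, which I rounded to 1/5
    print((1 / 5) / 50)     # 0.004 px, i.e. ~1/250, close to the 1/256 gray-scale step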
 
I have to make my own blob animation first?

I don't see one in the Femr2 link. Bummer. Only if you have the time. No hurry.

I still expect comparable results. That is, mine is as good or better. No reason small displacement should alter that.


I do, too. That is why you've always been on my "A" list. You know that I always tap the knowledge base of the people on my 'A' list and provide that information to others. It would be interesting to zoom in on the limits of accuracy using multiple methods and tools. I'm going to scoop up your info later and form a tutorial to compare it with the Femr2 info (with appropriate credits, of course).

One researcher pushing the limits of another provides great reference material for other observers. And it's fun, too!

This lets us compare the limits of accuracy of the tool against the restrictions of the video, to show that the tool is not the limit; the video is.

>>>>>>>>>>>


Remember, almost all sub-pixel applications in WTC research provide displacement information only. Only a few require information on velocity, acceleration, or those infamous (missing?) jolts. This case is the only time acceleration plots are ever used in the mappings.

The displacement-only information is beautiful in itself. It has multiple applications.
 
Maybe we could summon the digital ghost of Femr2 for a few questions? You could pm me at the other forum or meet me in the pub.
 
How different would the height measurement have to be for "over-g" results not to be there?
I'll look at that. It is the lynchpin.

If we are doing a sub-pixel track-off, maybe some practical source reference video would be more useful than an artificially accurate source.
Agreed. But it may be useful to start with tightly controlled knowns and then introduce complications, to see what happens.

What you are doing is extracting information. The information is there in the moving blob animations, but is it actually there in the WTC collapse videos?
Yes, definitely. The blob looks a lot like the "little" dishes on WTC1's antenna, which is the first thing I measured about 8 years ago. This was the first quantified look at the downward creep of the antenna prior to initiation:

9c0c1871973e1ecc73f8d4b5d5eb5721.png


Vertical axis is pixels, horizontal is frames. What this shows is 0.75 m of total travel over a period of 4 seconds, with only 25 cm of that in the first 3 seconds! This is from the Sauret video, shot from more than 1400 m away (IIRC). The start of global collapse appears at the end, as the rate of motion increases sharply. Not only is the motion visible, the shape of the curve is evident too, and some inferences may be drawn from it. Most notably, this graph does not show the beginning of motion...

Some time later both femr2 and achimspok demonstrated detectable motion nearly 10 seconds before global collapse. This has profound implications for any argument which asserts that collapse began instantaneously.

The antenna blobs were the low-hanging fruit among all the possible things to measure, precisely because they so closely resemble the tidy example we've been playing with.
 
Yes, definitely. The blob looks a lot like the "little" dishes on WTC1's antenna, which is the first thing I measured about 8 years ago. This was the first quantified look at the downward creep of the antenna prior to initiation:

Right, but I was getting more at the limit of information. Obviously there's some information, certainly enough to create a noisy curve. But with the simulation you've got an absolutely fixed camera position, zero perspective distortion, zero lens distortion, a perfectly flat response curve, identical properties for each pixel, zero atmospheric distortion (and zero change in that over time, rippling air, etc.), and no digital processing artifacts.

So I'm saying you can't really expect 1/256th of a pixel accuracy.
 
I don't see one in the Femr2 link. Bummer. Only if you have the time. No hurry.
I started messing around with it last night. Odd, but I could not reproduce femr2's little blob no matter what I did. The first gif shows a solid disk of color RGB (192,192,192), approximately 66 pixels in diameter (it was antialiased already). I tried many reductions to a 5-pixel diameter using a variety of methods: plain resize, various filters at multiple strengths. No dice. It never looked like the little blob.

I picked a random frame of the little blob gif and found that the peak intensity was 212. I'm not aware of any resample operation which results in brighter pixels than are found in the original; nothing I tried did that or would be expected to. Does anyone know of such a thing? I don't think it's a matter of random noise added to the disk before the resize, because the trace turned out so smooth. It's no issue that femr2 apparently threw in another controlled variation to make it more realistic, but we don't know what that is.

I do, too. That is why you've always been on my "A" list.
Aww, thanks.

To be clear, I only take exception to mentioning filtering/upscaling as a necessary first part of the process. It may be that SynthEyes performs better when this is done, especially with messy real-world measurements. femr2 may well have confirmed this via experiment, given his tendency to check things every which way. And he specifically says "Get SynthEyes" as part of the process, so it's natural that his advice would be specific to optimizing the performance of SE. Therefore it's not really a criticism so much as a caution.

If it DOES make SE perform better, I'm not sure why. There's a difference between looking better subjectively to the eyes and performing better in terms of information extraction. I agree, smooth curves look better but smoothness always comes at a price. It's not unusual to impose that penalty right away; like I said earlier, filtering is a normal part of the data reduction for a lot of transducer data, but... I wouldn't in this case. It can always be done later. It only makes sense if it affects SE performance in actual tracking.

Fact is, my method DID do better than SE on the simple example. That surprised me. I would've expected both SE and AE to track this as well as simple weighted sums. I'm sure those programs excel at messy targets, where I have to work like a dog. Corners, for instance, are not so easy. I prefer closed target shapes (blobs). But why can't they do something like this as well? I don't think it was prefiltering since AE did it, and I believe there's a clue in Mick's two trials. The variable is having to tell the software what to track.
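
In case "simple weighted sums" sounds mysterious: it's just an intensity-weighted centroid. A rough Python/numpy sketch of the idea, illustrative only and not my actual code (the crop box and threshold are whatever suits the target):

    import numpy as np

    def blob_centroid(gray, x0, y0, x1, y1, threshold=0):
        """Intensity-weighted centroid of a blob inside a crop box.

        gray: 2D array of pixel intensities (0-255)
        (x0, y0)-(x1, y1): box known to contain the whole blob
        threshold: background level subtracted before weighting
        """
        crop = gray[y0:y1, x0:x1].astype(float) - threshold
        crop[crop < 0] = 0.0
        ys, xs = np.mgrid[y0:y1, x0:x1]
        total = crop.sum()
        return (xs * crop).sum() / total, (ys * crop).sum() / total

    # Track frame by frame, re-centering the box on the last centroid, e.g.:
    # cx, cy = blob_centroid(frame, 40, 40, 60, 60, threshold=12)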
 
Right, but I was getting more at the limit of information. Obviously there's some information, certainly enough to create a noisy curve. But with the simulation you've got an absolutely fixed camera position, zero perspective distortion, zero lens distortion, a perfectly flat response curve, identical properties for each pixel, zero atmospheric distortion (and zero change in that over time, rippling air, etc.), and no digital processing artifacts.

So I'm saying you can't really expect 1/256th of a pixel accuracy.
Oh, yes, absolutely. For starters, unless the blob utilizes the full dynamic range (0-255) of brightness in every frame, that cuts it down proportionally right there. And it never is full range; usually it's only a dim blob, sometimes difficult to distinguish from the background. So, if a threshold has to be applied to isolate the blob, that gets subtracted off and there's even less to work with. That's before any noise or distortion in the object itself.

However, you get some of that back with multi-pixel objects, which is pretty much everything. The multiplicity of possible values increases with the addition of each pixel in a group. It can be argued that a pure linear translation of a rigid body should result in the same delta value for all boundary pixels, and in a perfectly generated orthogonal scene like the one we've been working with, that's true. But that's really the only case I can think of where it is true. In reality, different portions of the object undergo transitions at different times and with steps of differing magnitude, which increases the number of possible values that can be assumed. Fewer holes in the number line.

How much of that is useful information, real signal, is open to much debate. But some of it is, because even a rotation plus a linear translation in an orthogonal scene would produce real additional information. When sampling every frame, a better picture can certainly emerge. In this sense, the uber-accuracy helps, because there's so much trash in there already.
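
If anyone wants to poke at the quantization argument directly, here's a toy 1D experiment along those lines (purely illustrative): render a smooth blob at a series of true sub-pixel positions, quantize to 8 bits, and see how well a weighted centroid recovers the shifts.

    import numpy as np

    x = np.arange(64)

    def render(center, width=4.0, peak=200.0):
        # 1D Gaussian blob sampled on a pixel grid, then quantized to 8 bits
        profile = peak * np.exp(-0.5 * ((x - center) / width) ** 2)
        return np.round(profile).clip(0, 255).astype(np.uint8)

    def centroid(samples):
        w = samples.astype(float)
        return (x * w).sum() / w.sum()

    true_shifts = np.linspace(0.0, 1.0, 101)        # sub-pixel steps of 0.01 px
    measured = np.array([centroid(render(32.0 + s)) for s in true_shifts])
    recovered = measured - measured[0]              # displacement relative to frame 0
    print(np.abs(recovered - true_shifts).max())    # worst-case recovery error, in px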
 
I picked a random frame of the little blob gif and found that the peak intensity was 212

Which brings up another limit, the maximum difference between pixels, in either luminosity or (r,g,b). In the Sauret footage it's only about 80 for the antenna blobs, so about a third of the theoretical resolution.
20150530-150117-r3q8i.jpg


Edit: or as you just said:
Oh, yes, absolutely. For starters, unless the blob utilizes the full dynamic range (0-255) of brightness in every frame, that cuts it down proportionally right there. And it never is full range; usually it's only a dim blob, sometimes difficult to distinguish from the background. So, if a threshold has to be applied to isolate the blob, that gets subtracted off and there's even less to work with. That's before any noise or distortion in the object itself.
 
Look at it this way: to a person who doesn't believe subpixel tracking is possible, none of the motion in the first three seconds of my WTC1 graph is real. So, 256, 80, 15... what does the actual dynamic range matter when you can get a trace that nice, even with smoke around? And from non-deinterlaced video... ugh.

For historical perspective: back then, very few people were acknowledging that there was any pre-release motion at all.
 
@Major_Tom, if I may sum up my position on noise and filtering in a brief and folksy manner, noise is the turd in a box of chocolates. Generally speaking, no one likes to see the turd there. Blending up the turd with the rest of the chocolates will make it so the presence of a turd is not so easily distinguished, but is that really what you want?

When the turd can be successfully distinguished as a turd, there exists the possibility of removing it, or some of it.
 
Mick, I think you can see how, if you aggregate the six dishes, either tracked together or measured individually and then composed, you can expect to get a more accurate, smoother trace. More pixels, more information, despite every individual pixel being subject to the aforementioned resolution limitations.
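
A rough sketch of the "measured individually and then composed" idea, assuming the traces are aligned frame-for-frame and the per-dish noise is more or less independent (illustrative only):

    import numpy as np

    def composite_trace(traces):
        """Average several aligned (frames x 2) x/y traces into one.

        Each trace is re-referenced to its own first frame before averaging,
        so per-dish offsets drop out. With roughly independent noise, the
        jitter should fall something like 1/sqrt(N).
        """
        rel = [np.asarray(t, dtype=float) for t in traces]
        stack = np.stack([t - t[0] for t in rel])
        return stack.mean(axis=0)

    # e.g. combined = composite_trace([dish1_xy, dish2_xy, dish3_xy])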
 
Mick, I think you can see how, if you aggregate the six dishes, either tracked together or measured individually and then composed, you can expect to get a more accurate, smoother trace. More pixels, more information, despite every individual pixel being subject to the aforementioned resolution limitations.

The red line here is the original, the black is tracing a wider area, and the green is an area in the foreground for shake comparison.

20150530-153500-azjg2.jpg


The camera shake in the green line would likely be magnified in the other lines.

Zoomed in a bit:
20150530-153756-xystd.jpg


The foreground feature is higher contrast, so more accurate.
 
Also, to clarify something from earlier... when I wrote the 5000 frames claim, Mick, I wasn't aware that femr2 originally used a gray circle. I assumed that at least one pixel preserved what I thought was originally a white disk. Still, I think it will be interesting to see what happens with an assemblage of mid-gray and below, in terms of limits of resolution.
 
You can see how difficult it would be to strictly quantify the true uncertainty associated with these measurements; it's per-frame and context-dependent.
 
Mick, the concentric squares in your image represent the tracking area which is included in some way or degree; can you nail down for me exactly what's represented there? Do the different squares delineate weighting regions?
 
Mick, the concentric squares in your image represent the tracking area which is included in some way or degree; can you nail down for me exactly what's represented there? Do the different squares delineate weighting regions?

The middle rectangle is the feature it will try to match (the "feature region"); the outer rectangle is the search area in which it will look for that feature in the next frame. See:
https://helpx.adobe.com/after-effec...zing-motion-cs5.html#motion_tracking_workflow
External Quote:

You specify areas to track by setting track points in the Layer panel. Each track point contains a feature region, a search region, and an attach point. A set of track points is a tracker.


a4334d7b9f712442c44c4366f900fac6.png

Layer panel with track point

A. Search region B. Feature region C. Attach point
Feature region

The feature region defines the element in the layer to be tracked. The feature region should surround a distinct visual element, preferably one object in the real world. After Effects must be able to clearly identify the tracked feature throughout the duration of the track, despite changes in light, background, and angle.

Search region

The search region defines the area that After Effects will search to locate the tracked feature. The tracked feature needs to be distinct only within the search region, not within the entire frame. Confining the search to a small search region saves search time and makes the search process easier, but runs the risk of the tracked feature leaving the search region entirely between frames.

Attach point

The attach point designates the place of attachment for the target —the layer or effect control point to synchronize with the moving feature in the tracked layer.
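
In other words, it's classic template matching: find where the feature region best matches within the search region, then refine. A bare-bones normalized cross-correlation sketch of the general idea (not a claim about AE's internals; AE evidently does something finer on top of this, since the Feature Center values come out with sub-pixel precision):

    import numpy as np

    def match_feature(search, feature):
        """Slide `feature` over `search`, return the (dy, dx) of the best match.

        search, feature: 2D float arrays, search larger than feature.
        Returns the whole-pixel offset of the best normalized cross-correlation;
        sub-pixel refinement (e.g. fitting a parabola to the correlation peak)
        would be the next step.
        """
        fh, fw = feature.shape
        f = feature - feature.mean()
        best, best_pos = -np.inf, (0, 0)
        for dy in range(search.shape[0] - fh + 1):
            for dx in range(search.shape[1] - fw + 1):
                window = search[dy:dy + fh, dx:dx + fw]
                w = window - window.mean()
                denom = np.sqrt((w * w).sum() * (f * f).sum())
                score = (w * f).sum() / denom if denom > 0 else 0.0
                if score > best:
                    best, best_pos = score, (dy, dx)
        return best_pos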

 
My workflow in AE/Excel:

  1. Drag the clip into AE
  2. In the project right click on the clip then "New Comp from selection"
  3. In the timeline, move to time you want to start at
  4. Window->Tracker (if it's not visible)
  5. (If you have a track already) Tracker Window -> Current Track -> None
  6. Tracker Window -> Track Motion
  7. Position and adjust track point and search regions
  8. Tracker Window -> Play button
  9. AE will track, wait until you've got what you need
  10. Tracker Window -> Stop button
  11. In timeline, expand the new tracker and click on the "Feature Center" line to select the data
  12. Copy (Cmd-C)
  13. In Excel, paste the data into a new sheet
  14. Make a chart with the Y data (or skip Excel entirely and chart it with a script; see the sketch below)
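
If you'd rather skip Excel, the copied keyframe data can be charted with a short script. A rough sketch (saved as, say, plot_track.py), assuming the pasted text ends up as rows of frame / x / y after a header; adjust the column handling to whatever AE actually gives you:

    import sys
    import matplotlib.pyplot as plt

    rows = []
    for line in sys.stdin:
        parts = line.split()
        if len(parts) == 3:                  # keep rows that look like: frame x y
            try:
                rows.append([float(p) for p in parts])
            except ValueError:
                pass                         # skip header / units lines

    frames = [r[0] for r in rows]
    ys = [r[2] for r in rows]
    plt.plot(frames, ys)
    plt.xlabel("frame")
    plt.ylabel("y (pixels)")
    plt.show()

On a Mac that's just pbpaste | python plot_track.py straight from the clipboard.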
 
Odd, but I could not reproduce femr2's little blob no matter what I did.
...
I picked a random frame of the little blob gif and found that the peak intensity was 212. I'm not aware of any resample operation which results in brighter pixels than are found in the original; nothing I tried did that or would be expected to.
I was wrong about this. I just got similar results out of a vanilla Lanczos filter. I didn't know it worked that way. Learn something new every day; today was several somethings. So I have a reasonable reproduction of femr2's original, though it doesn't look exactly the same to my eye. Time for some exploration.
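
For anyone who wants to see the effect for themselves, a quick Pillow check along the same lines (not the exact pipeline used here, and the exact peak depends on filter and sizes): the negative lobes of the Lanczos kernel ring the edge, and that ringing can push pixels just inside the edge above the source value.

    from PIL import Image, ImageDraw

    big = Image.new("L", (660, 660), 0)              # black background
    draw = ImageDraw.Draw(big)
    draw.ellipse((297, 297, 363, 363), fill=192)     # ~66 px gray disk
    small = big.resize((50, 50), Image.LANCZOS)      # strong reduction, ~5 px blob
    print(max(small.getdata()))                      # ringing can exceed 192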
 
Which brings up another limit, the maximum difference between pixels...
Which brings up another issue: you generally don't want full brightness in a target, since it means the object is at least saturated and more likely oversaturated, unless it's just one pixel every so often. Any change in apparent luminosity can cause a drift artifact. It's just like turning your stereo up past the point of clipping: it doesn't help faithful sound reproduction.

Edit: and it occurred to me we'd been measuring GIFs, which are limited to 216 colors.
 
Edit: and it occurred to me we'd been measuring GIFs, which are limited to 216 colors.
GIFs have a palette of 256 colors drawn from the 24-bit color space. The 216 is just a specific palette of "web safe" colors, not relevant here. You can have the full 256-level greyscale from #000000 to #FFFFFF, so it should make no difference in the simulation.
 

And just looking at this, I'd guess that the initial drift (up on the graph, so the image is moving down in the frame, so the camera is moving up) is due to the camera being mounted on a fluid head, not fully tightened, and being slightly back-heavy, so it's sinking back. I have similar effects with my long lens on my SLR on a fluid-head tripod. The motion can be very slow, as it's fluid damped.
 
Interestingly enough, having generated a 100-frame femr2-like blob, stored the sequence as both png and gif, dumped frames of the gif through VirtualDub, then tracked both: there is some significant difference. More is lost in a straightforward conversion than I thought.

Differences in x,y values between the two source files:

0823eaad48b0428f17975c7b3f7a4968.png


The traces weren't as smooth as those I got from femr2's gif, but the trajectory radius is smaller. Once the example is better matched and the 100/1000 frame comparison is done, then I'll go to something else.
 
@Major_Tom, if I may sum up my position on noise and filtering in a brief and folksy manner, noise is the turd in a box of chocolates. Generally speaking, no one likes to see the turd there. Blending up the turd with the rest of the chocolates will make it so the presence of a turd is not so easily distinguished, but is that really what you want?

When the turd can be successfully distinguished as a turd, there exists the possibility of removing it, or some of it.


I prefer my chocolates turd-free.


Very, very informative, guys. The Sauret antenna and NW corner washer early movement is a great way to test the limits of your measuring tool. That is the gold standard for displacement tracing when compared to a static point reference.

Two researchers were able to detect early motion from 9.5 seconds before the visible collapse began.

NW corner washer:

8461c480188adfcc80a886bde51f6099.gif


It experiences horizontal displacement beginning just after the camera shake, increasing up to the visible collapse, according to two perceptive researchers.
 
I prefer my chocolates turd-free.
As we all do, but the rule around here is - no turds, no chocolate.

I did the 100/1000 frame comparison. The 1000 frame run turned out super smooth and had only small deviation from a true mathematical circle. This was not a surprise. Here's the trace:

befe15c688d7338e2eca6aa7587f587e.png


and the deviation in both dimensions as a function of angle (in radians, starting at high noon to match femr2):

e24e281140296eedd3bc71658d6bb9c7.png


What was a surprise was that the 100-frame version had much more error than this. I generated the gif from scratch again and ran the tracking just to be sure, but it peaked out around 0.2 pixel deviation. As I mentioned way upstream, the encoding of the disk and its path isn't perfect either; in fact, it's not so different from this motion extraction process and is possibly subject to even more deviation. So we don't know what portion of the error is a badly drawn circle/path and what portion is image type conversion.*

This was to show the tracking method is fine at small displacements. Going forward, I won't be attempting to match femr2's example; instead I'll be a little more systematic. Still with blobs. No gifs (except to illustrate).

*Using Ruby/RMagick for drawing; the exact same code is used to draw and store both sequences, with the exception of total frame count. Stored as gif, then frame-dumped by VirtualDub. Same routine to track.
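
For anyone wanting to reproduce this sort of check, a rough Python sketch of the deviation-from-circle comparison using a least-squares circle fit (an illustration of the idea, not the code used here):

    import numpy as np

    def fit_circle(xs, ys):
        """Algebraic (Kasa) least-squares circle fit; returns (cx, cy, r)."""
        xs, ys = np.asarray(xs, float), np.asarray(ys, float)
        # circle: x^2 + y^2 = 2*cx*x + 2*cy*y + (r^2 - cx^2 - cy^2)
        A = np.column_stack([xs, ys, np.ones_like(xs)])
        b = xs**2 + ys**2
        (p, q, s), *_ = np.linalg.lstsq(A, b, rcond=None)
        cx, cy = p / 2, q / 2
        return cx, cy, np.sqrt(s + cx**2 + cy**2)

    def deviation_vs_angle(xs, ys):
        """x and y deviation of each tracked point from the fitted circle."""
        xs, ys = np.asarray(xs, float), np.asarray(ys, float)
        cx, cy, r = fit_circle(xs, ys)
        angles = np.arctan2(ys - cy, xs - cx)
        ideal_x = cx + r * np.cos(angles)
        ideal_y = cy + r * np.sin(angles)
        return angles, xs - ideal_x, ys - ideal_y   # plot each deviation vs angle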
 
And just looking at this, I'd guess that the initial drift (up on the graph, so the image is moving down in the frame, so the camera is moving up) is due to the camera being mounted on a fluid head, not fully tightened, and being slightly back-heavy, so it's sinking back. I have similar effects with my long lens on my SLR on a fluid-head tripod. The motion can be very slow, as it's fluid damped.


That is interesting. Femr2 was able to trace details of the camera shake:


159ddf3bb73c0879af0964a461dd8fbe.jpg


Larger size here


The vibration seems to show the natural frequency and damping of the camera-tripod setup you are describing.

>>>>>>>>>>

These effects are great ways to compare tracing tools for resolution.


>>>>>>>>>>>>>>>


Note: when applying tracing to actual video, there are other tricks, such as deinterlacing the video to get the highest detail possible in the trace. If interlaced video is not separated into its two fields, the trace can pick up extra jitter. Static point traces are also very important. A very good list is available in the subpixel tracing link provided earlier.
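
The field-splitting part in a nutshell; a rough numpy sketch, assuming top-field-first order (which you'd want to confirm for any given source):

    import numpy as np

    def split_fields(frame):
        """Split an interlaced frame (H x W, or H x W x C) into its two fields.

        Each field is half the frame height and, for NTSC material, was captured
        about 1/59.94 s apart, so treating the fields as separate samples
        doubles the temporal resolution of a trace.
        """
        top = np.asarray(frame)[0::2]      # even rows
        bottom = np.asarray(frame)[1::2]   # odd rows
        return top, bottom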
 
I'm starting with what I consider to be an ideal baseline. Like femr2's target example, this is a light circular blob which moves in a full circle. The numbers are a little more sanitary: a 1000x1000 px black background with a white 100 px diameter disk moving in a circular path about the center of the image, with a radius of 300 px. Each frame is generated, then resized to 100x100 with a quadratic filter using high blur, a uniform 10x reduction in size.

The first frame:
58a144d0f08f991aed822fd99335dcde.png


A false color 3D profile of the target:
e4575080caf429acaf004e8943ac4034.png


Ideal in that it smoothly and narrowly peaks to a maximum brightness of 255 and is a fair-sized object. The full width of the blob after resize and blur is around 16 pixels.

For the time being, 100 frames with path starting at angle 0 (directly to the right), moving counterclockwise.

a87dd19460f93844cc9b959b690c081f.gif
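
For anyone who wants to play along without Ruby, a rough Pillow sketch of this kind of generator. It stands in a Gaussian pre-blur plus a plain bilinear resize for the quadratic-filter/high-blur reduction, so it won't match these frames byte for byte:

    import math
    from PIL import Image, ImageDraw, ImageFilter

    FRAMES, BIG, SMALL = 100, 1000, 100
    DISK_R, PATH_R = 50, 300                  # 100 px diameter disk, 300 px path radius

    for i in range(FRAMES):
        theta = 2 * math.pi * i / FRAMES      # start at angle 0, counterclockwise
        cx = BIG / 2 + PATH_R * math.cos(theta)
        cy = BIG / 2 - PATH_R * math.sin(theta)   # minus because image y runs downward
        img = Image.new("L", (BIG, BIG), 0)
        ImageDraw.Draw(img).ellipse(
            (cx - DISK_R, cy - DISK_R, cx + DISK_R, cy + DISK_R), fill=255)
        img = img.filter(ImageFilter.GaussianBlur(radius=8))   # stand-in for the blur
        img.resize((SMALL, SMALL), Image.BILINEAR).save(f"blob_{i:03d}.png")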
 
I'm beginning to appreciate how well this method serves as a validation tool for all sorts of resize/resample/blur/etc operations by libraries and applications.
 
@Major_Tom, if I may sum up my position on noise and filtering in a brief and folksy manner, noise is the turd in a box of chocolates. Generally speaking, no one likes to see the turd there. Blending up the turd with the rest of the chocolates will make it so the presence of a turd is not so easily distinguished, but is that really what you want?

When the turd can be successfully distinguished as a turd, there exists the possibility of removing it, or some of it.
OWE, you should issue a disclaimer warning members of the risks of discussing turdology when one of the posting members is a former Chief Turdologist for Sydney.

I was once asked to speak at a Rotary Club in Western Sydney, and they jokingly asked me to bring samples. I did: fresh that day from the local plant, on display for the after-dinner talk. I said it was an opportunity to "get your own back".

The Club ran a waste paper recycling program (it was 35 years ago, when there was money in waste paper). I offered to provide them with 3 tonnes per day, only used once and only on one side.
 
I am reading and looking at the pictures with interest, and am almost entertained. It's the first time I've seen people discussing AND testing ways to approach a quantification of sub-pixel accuracy / error margins. I get the limit of >1/255 error for a single pixel due to brightness resolution, and that measuring multiple pixels tends to improve on that (though it's not yet clear by how much: is that a 1/n, 1/ln(n), or 1/sqrt(n) factor?). I see that velocity in delta-px/frame is critical, and I see experimental error scattered to an order of magnitude of n/100 px, with single-digit n, in the case of a well-controlled artificial blob.

Elsewhere I had offered someone estimates of output values quoted to 1/200 px with better than 1/10 px actual precision; glad that I got the gist of the previous discussions and demonstrations about right.

Mick, OWE, are you planning to experiment with video-recorded real falls? Like dropping a ball (something dense ideally, a shot-put sphere perhaps?) filmed from several distances? Max distance such that it occupies as few pixels as a typical feature that you'd track on the towers. Vary lighting, background contrast, etc.
 
I'm beginning to appreciate how well this method serves as a validation tool for all sorts of resize/resample/blur/etc operations by libraries and applications.


Yes, and this type of validation is not possible using prepackaged software. Methods can be compared, but you can't get down to the turd-chocolate level like you are doing. You are the only one who started from basic code. Creative approach.


>>>>>>>>

Two more images on the Sauret NW corner washer (used to test the limits of our tools):

584890dcd3575aff477b6430c8b86f31.gif



With a glimpse into SynthEyes tracing software:


3d6a2779c49cd1ad7084d7193bfc3f84.gif



Horizontal displacement of both the NW corner and antenna leading into the visible collapse shouldn't be surprising since it is visible to the naked eye in this gif:

http://www.sharpprintinginc.com/911_misc/sauret_120-220.gif
 
Mick, OWE, are you planning to experiment with video-recorded real falls? Like dropping a ball (something dense ideally, a shot-put sphere perhaps?) filmed from several distances? Max distance such that it occupies as few pixels as a typical feature that you'd track on the towers. Vary lighting, background contrast, etc.

I tried this a couple of days ago with a heavy silicone ball. There were problems of scale I had not considered:
20150531-085728-z4f16.jpg


No subpixel movement there of relevance.
 