So doing a Google search for Dylann Storm gives an unusual result:
This search was done at 5PM, June 18, 2015, but Storm was only named as a suspect about 12 hours earlier. So why was Huffington Post reporting it two days before the shooting even happened?
Here's the page.
http://www.huffingtonpost.com/2015/...leston-church-shooting-suspect_n_7612232.html
The answer is just in the way Google dates the pages it indexes. Google does not know when you put up a page; it only knows when it first sees it. We can check this by clicking on the "cached" link in the search results, which gives us:
http://webcache.googleusercontent.c...ect_n_7612232.html+&cd=10&hl=en&ct=clnk&gl=us
(The link above to the cached version will probably expire eventually.)
External Quote:
This is Google's cache of http://www.huffingtonpost.com/2015/...leston-church-shooting-suspect_n_7612232.html. It is a snapshot of the page as it appeared on Jun 18, 2015 20:38:27 GMT.
The current page could have changed in the meantime.
Notice that the page was indexed on Jun 18th at 20:38 GMT, which is 16:38 EDT, or 4:38 PM Eastern.
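As a quick check of that conversion, here is a minimal Python sketch. It assumes EDT is a fixed UTC-4 offset (true in June), and the helper name cache_time_in_edt is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# EDT (Eastern Daylight Time) is UTC-4. A fixed offset keeps this sketch
# independent of any installed time zone database.
EDT = timezone(timedelta(hours=-4), "EDT")


def cache_time_in_edt(gmt_stamp: str) -> datetime:
    """Parse a GMT stamp such as 'Jun 18, 2015 20:38:27' and convert it to EDT."""
    utc_time = datetime.strptime(gmt_stamp, "%b %d, %Y %H:%M:%S")
    utc_time = utc_time.replace(tzinfo=timezone.utc)
    return utc_time.astimezone(EDT)


# The stamp shown in Google's cache banner for the page.
indexed = cache_time_in_edt("Jun 18, 2015 20:38:27")
print(indexed.strftime("%Y-%m-%d %H:%M:%S %Z"))  # 2015-06-18 16:38:27 EDT
```

So the snapshot was taken shortly before the 5PM search, on the same day the story was published.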
Google relies on the web site to supply the date the page was made. It looks in various places, in order of preference:
1. The site map (usually sitemap.xml), which lists all the pages on a site and (optionally) the date they were created
2. The actual page itself, which can contain the date in a variety of ways:
   - "pubdate" attributes, preferably in <time> tags
   - an "article:published_time" meta tag
   - other ad-hoc dates
3. The time Google first crawled the page, as a fall-back date
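The order of preference above can be sketched as a simple fall-through. This is only an illustration of the order as described here, not Google's actual algorithm. The function name choose_page_date and the rule for choosing among several in-page dates (the earliest) are assumptions, and the crawl time in the example just reuses the cache snapshot time as a stand-in.

```python
from datetime import datetime
from typing import Optional, Sequence, Tuple


def choose_page_date(
    sitemap_date: Optional[datetime],
    in_page_dates: Sequence[datetime],
    first_crawl: datetime,
) -> Tuple[datetime, str]:
    """Return (date, source), trying the site map, then the page, then the crawl time."""
    if sitemap_date is not None:
        return sitemap_date, "sitemap"
    if in_page_dates:
        # Assumption: when the page offers several dates, the earliest one wins.
        return min(in_page_dates), "page"
    return first_crawl, "first crawl"


# Hypothetical inputs: no date in the site map, two dates found in the page.
chosen, source = choose_page_date(
    sitemap_date=None,
    in_page_dates=[datetime(2015, 6, 18, 11, 25, 29), datetime(2015, 6, 16, 17, 36, 54)],
    first_crawl=datetime(2015, 6, 18, 20, 38, 27),
)
print(chosen, source)  # 2015-06-16 17:36:54 page
```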
For #1, Huffington Post has a site map, but it does not include the date for each page. So Google next looks at data in the page to date it, first looking for "pubdate" attributes on <time> tags, which are meant to indicate the published date:
http://www.w3schools.com/tags/att_time_datetime_pubdate.asp
Have a look at the source for the Huffington Post page from the Google Cache (which I've also added as a text file). The following is what Google sees when it reads it and looks for anything that says "date":
Code:
<meta name="sailthru.date" content="Thu, 18 Jun 2015 10:14:00 -0400">
<meta name="sailthru.expire_date" content="Fri, 19 Jun 2015 10:14:00 -0400">
huff.v({"last_deploy_commit_id":"1df7a3900d9300b62c6988bcb4c4868966fd5e58","deploy_commit_id":"197441df9432e0486d7fd0b1a18746b4f5de8bbf","deploy_seq":1434556901,"last_deploy_seq":1434470743,"deploy_date":"Wed, 17 Jun 2015 12:01:41 -0400","last_deploy_date":"Tue, 16 Jun 2015 12:05:43 -0400"});
"datePublished": "2015-06-18T10:15:42-04:00",
date: '2015-06-18 10:14:00',
<!-- page branding and current date -->
<time datetime="2015-06-18">June 18, 2015</time>
<time class="off" datetime="2015-06-17 22:34:40" pubdate>2015-06-17 22:34:40</time>
<time class="off" datetime="2015-06-18 11:25:29" pubdate>2015-06-18 11:25:29</time>
<time class="off" datetime="2015-06-18 12:17:35" pubdate>2015-06-18 12:17:35</time>
<time class="off" datetime="2015-06-17 15:47:35" pubdate>2015-06-17 15:47:35</time>
<time class="off" datetime="2015-06-16 17:36:54" pubdate>2015-06-16 17:36:54</time>
<time class="off" datetime="2015-06-17 13:49:11" pubdate>2015-06-17 13:49:11</time>
<time class="off" datetime="2015-06-17 16:35:47" pubdate>2015-06-17 16:35:47</time>
<time class="off" datetime="2015-06-18 13:40:42" pubdate>2015-06-18 13:40:42</time>
<time class="off" datetime="2015-06-18 12:28:26" pubdate>2015-06-18 12:28:26</time>
<time class="off" datetime="2015-06-17 15:29:31" pubdate>2015-06-17 15:29:31</time>
<time class="off" datetime="2015-06-17 16:40:44" pubdate>2015-06-17 16:40:44</time>
<time class="off" datetime="2015-06-17 17:19:25" pubdate>2015-06-17 17:19:25</time>
<time class="off" datetime="2015-06-18 14:46:05" pubdate>2015-06-18 14:46:05</time>
<time class="off" datetime="2015-06-18 11:31:24" pubdate>2015-06-18 11:31:24</time>
Posted: <time datetime="2015-06-18T10:15:42-04:00">
Updated: <time datetime="2015-06-18T11:59:01-04:00">
Notice there are a lot of dates in there, and specifically a lot of "pubdate" attributes when really there should be just one. How does Google pick the correct one? If you don't explicitly tell Google, it seems like it simply picks the earliest date on the page. In this case it comes from the bit of code quoted below.
This comes from a section of the Huffington Post page called the "3up Carousel", which seems to be an element that shows previews of and links to other stories, rotating so that three stories are visible at once. However, it is switched off and does not actually appear in the visible page. It's either leftover code or a work in progress. Either way, it adds the date of the linked story, and Google picks that up as the date for the main story, as it's the oldest date on the page.
External Quote:
<!-- 3up carousel module -->
<!-- ads_threeup_edit_promo -->
<section id="carousel" class="three-up" data-beacon='{"p":{"mnid":"threeup_top_wrapper","mlid":"threeup"}}' >
  <div class="group three-up-list">
    <ul class="three-up-holder">
      <li class="three-up-item">
        <article data-beacon='{"p":{"plid":"7582374","mpid":4}}' >
          <figure>
            <a href="http://www.huffingtonpost.com/2015/06/16/baltimore-joe-crystal_n_7582374.html" data-beacon='{"p":{"lnid":"img"}}' onclick="HPTrack.trackPageview('');"><img data-img-path="http://i0.huffpost.com/gen/3079082/images/r-JOECRYSTALGRAPHIC-medium260.jpg" src="http://s1.huffpost.com/images/blank.gif" alt="Image for 'Rat Cop' Joe Crystal Shunned From Baltimore Police Department After Reporting Officer Brutality" class="async-img-load" /></a>
          </figure>
          <strong><a href="http://www.huffingtonpost.com/2015/06/16/baltimore-joe-crystal_n_7582374.html" data-beacon='{"p":{"lnid":"hdln"}}' onclick="HPTrack.trackPageview('');"> 'Rat Cop' Joe Crystal Shunned From Baltimore Police Department After Reporting Officer Brutality</a></strong>
          <!-- need php for times here -->
          <time class="off" datetime="2015-06-16 17:36:54" pubdate>2015-06-16 17:36:54</time>
        </article>
It's also possible that Google is just picking the first pubdate. The order of the included stories varies, and we can't tell what it was when the page was first indexed.
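To see how the two guesses differ, here is a rough Python sketch that collects every <time ... pubdate> value from a few of the lines listed earlier and compares "first on the page" with "earliest". The PubdateCollector class and the shortened SAMPLE are made up for illustration, and neither rule is confirmed to be what Google actually does.

```python
from datetime import datetime
from html.parser import HTMLParser


class PubdateCollector(HTMLParser):
    """Collect the datetime value of every <time ... pubdate> tag, in page order."""

    def __init__(self):
        super().__init__()
        self.pubdates = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # A bare attribute such as `pubdate` is reported with a value of None,
        # so test for the key rather than for a truthy value.
        if tag == "time" and "pubdate" in attr_map and attr_map.get("datetime"):
            self.pubdates.append(
                datetime.strptime(attr_map["datetime"], "%Y-%m-%d %H:%M:%S")
            )


# A shortened sample taken from the page source listed earlier in the post.
SAMPLE = """
<time datetime="2015-06-18">June 18, 2015</time>
<time class="off" datetime="2015-06-17 22:34:40" pubdate>2015-06-17 22:34:40</time>
<time class="off" datetime="2015-06-18 11:25:29" pubdate>2015-06-18 11:25:29</time>
<time class="off" datetime="2015-06-16 17:36:54" pubdate>2015-06-16 17:36:54</time>
Posted: <time datetime="2015-06-18T10:15:42-04:00">
"""

collector = PubdateCollector()
collector.feed(SAMPLE)

first = collector.pubdates[0]       # first pubdate in page order
earliest = min(collector.pubdates)  # oldest pubdate anywhere on the page
print("first:   ", first)
print("earliest:", earliest)
```

With this sample, the first pubdate is from June 17 and the earliest is from June 16. Neither matches the real "Posted" time of June 18, because that tag carries no pubdate attribute.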
So in summary, poor coding of the page led to the inclusion of multiple <time pubdate> tags for linked stories. Google apparently picked the oldest (or the first) of these dates as the date the page was created, even though it refers to a different story from a few days earlier.