Google Bard says the Earth is flat (LLM bunk)

I'd be interested to know what you folks find if you put a similar request to ChatGPT or Google Gemini about some other area where you have enough interest and knowledge to spot what they get right and what they get wrong.
I asked ChatGPT about the Schwimmwagen. It knew it featured a VW engine, but failed to mention that the whole car was produced by Volkswagen. It also claimed that one of its roles was to tow small artillery pieces, which it absolutely cannot do, and it missed its role as command transport for tank units. "Open top for easy access"—yeah, no, doors would provide easy access, if it had them.

Generally speaking, its response was mostly unspecific and rife with generalities, like a high schooler bullshitting their essay the night before it is due. The Wikipedia article on the subject is detailed, informative, and correct.
 
Google has more festivals, and given the search terms, I feel that's a valid result, including the videos it delivers.

Maybe Google is better at it. I was being lazy and using Bing, as that's what my browser defaults to. This is a low-budget PC running a locked version of Windows, so I can't install Chrome, though I can open a Google page and search from there. I wonder if you being in Germany makes a difference?

I get called 3-4 times a day by various telemarketers wanting to help me with my Google business listing and boost my search results. That is, move my company's name up in searches for contractors in my area, I'm guessing. I don't need it, but it'd be interesting to see what they can do and for how much. And is it localized? I would think a search optimization scheme for smaller local businesses would have to localize. Amazon wants to be a hit on every search across the country, but my name popping up in a search for contractors by @Ann K in Ohio would be useless to both of us.

Might be a fun experiment thread where we all type in the exact same search term to Google or Bing or Duck Duck Goose if someone has it, and see what different results we get around the world.
 
Article:
Amazon conducted the test earlier this year for Australia's corporate regulator, the Australian Securities and Investments Commission (ASIC), using submissions made to an inquiry.

The test involved testing generative AI models before selecting one to ingest five submissions from a parliamentary inquiry into audit and consultancy firms. The most promising model, Meta's open source model Llama2-70B, was prompted to summarise the submissions with a focus on ASIC mentions, recommendations, references to more regulation, and to include the page references and context.

Ten ASIC staff, of varying levels of seniority, were also given the same task with similar prompts. Then, a group of reviewers blindly assessed the summaries produced by both humans and AI for coherency, length, ASIC references, regulation references and for identifying recommendations. They were unaware that this exercise involved AI at all.

These reviewers overwhelmingly found that the human summaries beat out their AI competitors on every criterion and on every submission, scoring 81% on an internal rubric compared with the machine's 47%.

Reviewers told the report's authors that AI summaries often missed emphasis, nuance and context; included incorrect information or missed relevant information; and sometimes focused on auxiliary points or introduced irrelevant information. Three of the five reviewers said they guessed that they were reviewing AI content.

The reviewers' overall feedback was that they felt AI summaries may be counterproductive and create further work because of the need to fact-check and refer to original submissions which communicated the message better and more concisely.
 
https://en.m.wikipedia.org/wiki/Wikipedia:WikiProject_AI_Cleanup
External Quote:

  • Other indications include the presence of fake references or other obvious AI hallucinations. AI content sometimes takes a promotional tone, reading like a tourism website. Other times, the AI gets confused and will write about a hotel instead of a nearby village. Automatic AI detectors like GPTZero are unreliable and should not be used.
  • When missing more precise information, AI will often describe in detail very generic and common features, praising a village for its fertile farmlands, livestock and scenic countryside despite it being in an arid mountain range.
  • AI content is not always "unsourced" - sometimes it has real sources that are unrelated to the article's topic, sometimes it creates its own fake sources, and sometimes it uses legitimate sources to create the AI content. Be careful when removing bad AI content not to remove legitimate sources, and always check the cited sources for legitimacy.
    • Example: the article Leninist historiography was entirely written by AI and previously included a list of completely fake sources in Russian and Hungarian at the bottom of the page. Google turned up no results for these sources.
    • Other example: the article Estola albosignata, about a beetle species, had paragraphs written by AI sourced to actual German and French sources. While the sourced articles were real, they were completely off-topic, with the French one discussing an unrelated genus of crabs.
  • Sometimes entire articles are AI-generated, and in such a case, make sure to check that the topic is legitimate and notable. Occasionally, hoaxes have made it onto Wikipedia because AI-generated content created fake citations to appear legitimate.
    • Example: the article Amberlihisar was created in January 2023, passed articles for creation, and was not discovered to be entirely fictional until December 2023. It has since been deleted.
 
Going off-topic, I typed something like "how many high schools in usa" and got this (via Microsoft Bing):

[attached screenshot: us schools.JPG]

Got different responses on repeating the search later.
 
AI vs intelligence, part 2
[attached image 72483]
Why's there a picture of a culinary dressing next to the description of a medical dressing? You'd think that if it were running off into the medical field it would fetch its image from there as well as its text. Given how trivial it is to just 'shop one of these, I'm waiting for reliable confirmation before accepting that as a data point.
 
A new Apple study questions whether LLMs can actually reason and finds they are likely not reasoning, but just performing advanced pattern recognition with no real understanding of the content.

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
https://arxiv.org/pdf/2410.05229
External Quote:
Abstract
Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several state-of-the-art open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data. When we add a single clause that appears relevant to the question, we observe significant performance drops (up to 65%) across all state-of-the-art models, even though the added clause does not contribute to the reasoning chain needed to reach the final answer. Overall, our work provides a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.

One of the co-authors of the paper discusses their findings in this twitter thread.

x.com/MFarajtabar/status/1844456880971858028 (text included below, see the thread for the pictures and graphs, which are from the paper)
External Quote:
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source models like Llama, Phi, Gemma, and Mistral and leading closed models, including the recent OpenAI GPT-4o and o1-series.

2/ When OpenAI released GSM8K ~3 years ago, GPT-3 (175B) scored 35% on the GSM8K test. Today, models with ~3B parameters are surpassing 85%, and larger ones are hitting >95%. But has model 'reasoning' really improved? How much of this is genuine #logical/#symbolic reasoning? vs. #pattern_recognition, inadvertent data #contamination, or #overfitting?

3/ Introducing GSM-Symbolic—our new tool to test the limits of LLMs in mathematical reasoning. We create symbolic templates from the #GSM8K test set, enabling the generation of numerous instances and the design of controllable experiments. We generate 50 unique GSM-Symbolic sets, essentially like GSM8K examples but with different values and names. How do models handle these distinct sets?

4/ #Result 1: Current accuracies on GSM8K are not reliable! We observe LARGE performance variation: Llama 8B scores anywhere between 70% to 80%, Phi-3 scores between 75% and 90%, and so on. For most models, the average performance on GSM-Symbolic is lower than on GSM8K (indicated by the dashed line).

5/ #Result 2: The fragility of supposed LLM reasoning. LLMs remain sensitive to changes in proper names (e.g., people, foods, objects), and even more so when numbers are altered. Would a grade-school student's math test score vary by ~10% if we only changed the names?

6/ What if we adjust question difficulty? We introduce 3 new variants of GSM-Symbolic to study model behavior: removing one clause (GSM-M1), adding one clause (GSM-P1), or adding two clauses (GSM-P2).

7/ #Result 3: As questions increase in difficulty (M1 → Symbolic → P1 → P2), not only does performance drop, but variance also rises, making models increasingly unreliable.

8/ This begs the question: Do these models truly understand mathematical concepts? Introducing #GSM_NoOp! We add a single clause that seems relevant but doesn't contribute to the overall reasoning (hence "no-op"). Check out what happens next!

9/ #Result 4: A massive performance drop! All models, including o1 models, show significant declines. While it'll be interesting to see how grade-school students perform on similar datasets, I doubt the drop would be this severe.

10/ #Result 5: Can scaling data, models, or compute fundamentally solve this? We don't think so! #OpenAI's #o1-series is performing better but still suffers from slight performance variations. #o1_preview shows significant improvements, but...

11/ ... but even o1-preview shows the same silly mistakes like this. Either it doesn't understand what 'now' is, or it doesn't understand what 'last year' is, or a more likely explanation is that its training data with inflation has this pattern, and it's following that again.

12/ Understanding LLMs' true reasoning capabilities is crucial for deploying them in real-world scenarios where accuracy and consistency are non-negotiable—especially in #AI_safety, #alignment, #education, #health_care, and #decision_making systems. Our findings emphasize the need for more robust and adaptable evaluation methods. Developing models that move beyond pattern recognition to true logical reasoning is the next big challenge for the #AI #community.

13/ Overall, we found no evidence of formal reasoning in language models including open-source models like #Llama, #Phi, #Gemma, and #Mistral and leading closed models, including the recent #OpenAI #GPT-4o and #o1-series. Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%! We can scale data, parameters, and compute—or use better training data for Phi-4, Llama-4, GPT-5. But we believe this will result in 'better pattern-matchers,' not necessarily 'better reasoners'.

Check out the full paper to find out more: arxiv.org/pdf/2410.05229

Also stay tuned for the data release!
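
To make the paper's "symbolic template" idea a bit more concrete, here's a minimal sketch of how such a generator could work. The template wording, the names, and the no-op clause below are invented for illustration; they are not taken from the paper's actual templates or released data:

```python
import random

# Hypothetical GSM-Symbolic-style template: the same grade-school word problem
# is instantiated with different names and numbers, and the ground-truth answer
# is computed from the same variables, so every instance can be auto-graded.
NAMES = ["Sophie", "Liam", "Aiko", "Mateus"]

def make_instance(add_noop_clause: bool = False) -> tuple[str, int]:
    name = random.choice(NAMES)
    friday = random.randint(30, 60)
    saturday = random.randint(30, 60)
    question = f"{name} picks {friday} kiwis on Friday and {saturday} kiwis on Saturday. "
    if add_noop_clause:
        # GSM-NoOp-style distractor: sounds relevant but doesn't change the arithmetic.
        question += "Five of the kiwis were a bit smaller than average. "
    question += "How many kiwis did they pick in total?"
    return question, friday + saturday

if __name__ == "__main__":
    for _ in range(3):
        q, a = make_instance(add_noop_clause=True)
        print(q, "->", a)
```

The point of generating many instantiations like this is that a model which genuinely reasons should score about the same on every variant, while a pattern-matcher's accuracy can wobble when only the surface details (names, numbers, irrelevant clauses) change.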
 
One of the co-authors of the paper discusses their findings in this twitter thread.

x.com/MFarajtabar/status/1844456880971858028 (text included below, see the thread for the pictures and graphs, which are from the paper)
External Quote:
5/ #Result 2: The fragility of supposed LLM reasoning. LLMs remain sensitive to changes in proper names (e.g., people, foods, objects), and even more so when numbers are altered. Would a grade-school student's math test score vary by ~10% if we only changed the names?

They wouldn't have to in order to be comparable to what they found. Changing names was a far smaller cause of the variation than changing the numbers, as he admits. Check the graphs in the paper: much of the time there's no significant change at all from just changing names. Bizarrely, some of the AIs improved when the names were changed, which made me wonder whether the names in SAT-like questions have been chosen from a non-natural distribution in recent times.
 
They wouldn't have to in order to be comparable to what they found. Changing names was a far smaller cause of the variation than changing the numbers, as he admits. Check the graphs in the paper: much of the time there's no significant change at all from just changing names. Bizarrely, some of the AIs improved when the names were changed, which made me wonder whether the names in SAT-like questions have been chosen from a non-natural distribution in recent times.
I've just noticed that Sabine has done a quick vid on this paper:

Source: https://www.youtube.com/embed/TpfXFEP0aFs?si=UgCRbWA4WehD3ACQ

She points out that the AI researchers only have a nebulous definition of "reasoning" anyway when it comes to what humans do, and that in order to show that the AIs are doing something different from humans, they put the AIs through perturbed tests and noticed their output changed, but didn't put humans through the same perturbed tests to see whether their outputs would also change. However, the highlight of the vid is probably the spoof paper that's been released as a follow-up.
 
FWIW - I noticed a simple little question on what I believe to be the largest gaming forum on the internet some weeks ago.
External Quote:

A book costs $1 plus half its price. How much does it cost?

Only 31.1% of people got the correct answer

BTW - I just tried it on ChatGPT, it answered correctly
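
For reference, the intended reading is that the price \( p \) satisfies a one-line equation; the tempting wrong answer, $1.50, comes from taking half of the $1 instead of half of the unknown price:

\[ p = 1 + \tfrac{p}{2} \;\Rightarrow\; \tfrac{p}{2} = 1 \;\Rightarrow\; p = 2 \]

So the book costs $2.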
 
What if you change the object, the number, and reword it a bit?

It (4o) copes with what I consider to be fairly extensive changes. However, I think they've specifically trained it to manipulate arithmetic expressions better than the hysterical historical versions.

External Quote:
--> Spanxo the alien from the planet Norf is eighteen noobles (the unit of height on the planet Norf) plus a quarter of his height. How tall is Spanxo?

ChatGPT 4o mini said:
Let Spanxo's height be represented by \( h \) noobles. According to the problem, Spanxo's height can be expressed as: \[ h = 18 + \frac{1}{4}h \] To solve for \( h \), we can first eliminate the fraction by multiplying the entire equation by 4: \[ 4h = 72 + h \] Next, we can isolate \( h \) by subtracting \( h \) from both sides: \[ 4h - h = 72 \] This simplifies to: \[ 3h = 72 \] Now, divide both sides by 3: \[ h = 24 \] Thus, Spanxo is \( \boxed{24} \) noobles tall.
I should try some of my old tests on this new version...
 
It (4o) copes with what I consider to be fairly extensive changes. However, I think they've specifically trained it to manipulate arithmetic expressions better than the hysterical historical versions.

External Quote:
--> Spanxo the alien from the planet Norf is eighteen noobles (the unit of height on the planet Norf) plus a quarter of his height. How tall is Spanxo?

ChatGPT 4o mini said:
Let Spanxo's height be represented by \( h \) noobles. According to the problem, Spanxo's height can be expressed as: \[ h = 18 + \frac{1}{4}h \] To solve for \( h \), we can first eliminate the fraction by multiplying the entire equation by 4: \[ 4h = 72 + h \] Next, we can isolate \( h \) by subtracting \( h \) from both sides: \[ 4h - h = 72 \] This simplifies to: \[ 3h = 72 \] Now, divide both sides by 3: \[ h = 24 \] Thus, Spanxo is \( \boxed{24} \) noobles tall.
I should try some of my old tests on this new version...
My goodness, ChatGPT is very chatty, isn't it! Does it think it is paid by the paragraph?
 
My goodness, ChatGPT is very chatty, isn't it! Does it think it is paid by the paragraph?
It's probably been primed to be helpful by explaining mathematical arguments step by step.

An attempt to override its probable system prompt also works:
External Quote:
You said: Spanxo the alien from the planet Norf is eighteen noobles (the unit of height on the planet Norf) plus a quarter of his height. How tall is Spanxo? No need for an explanation, just the answer will do.

ChatGPT said:
Spanxo is 24 noobles tall.
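
If you wanted to run that kind of "answer only" instruction repeatedly (say, over many reworded variants), a minimal sketch using the OpenAI Python SDK might look like the following. The model name and the system-message wording are just placeholders, and the web UI's real system prompt is unknown, so treat this as an assumption-laden illustration rather than a reproduction of what ChatGPT actually does:

```python
# Minimal sketch, assuming the OpenAI Python SDK (`pip install openai`)
# and an API key in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        # Stand-in for "no need for an explanation, just the answer will do".
        {"role": "system", "content": "Reply with the final answer only. No working, no explanation."},
        {"role": "user", "content": (
            "Spanxo the alien from the planet Norf is eighteen noobles "
            "(the unit of height on the planet Norf) plus a quarter of his height. "
            "How tall is Spanxo?"
        )},
    ],
)

print(response.choices[0].message.content)  # e.g. "Spanxo is 24 noobles tall."
```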
 
What if you change the object, the number, and reword it a bit?
I tried this, which it solved correctly:
An apple weighs 16kg and a quarter of its weight. How much does it weigh?
Ha, by coincidence I see this is actually very close to FatPhil's example.
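
For reference, reading it the same way as the book puzzle (the weight is 16 kg plus a quarter of the total weight \( w \)), the answer works out to:

\[ w = 16 + \tfrac{w}{4} \;\Rightarrow\; \tfrac{3w}{4} = 16 \;\Rightarrow\; w = \tfrac{64}{3} \approx 21.3\ \text{kg} \]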

From my experience, though, it's terrible at coding.
A typical dialog is:
Me - "How do I blah blah blah?"
AI - "OK, here's some code."
Me - "That won't work because of X; you need to take that into consideration."
AI - "OK, I understand now, here's what you need to do."
Me - "That's also wrong because of Y."
...
After a couple more goes I give up, knowing it will never get the answer.
I've tried two or three times in the past, but it just gets stuck in a rut, loops, and starts repeating answers even though I've previously said they're wrong.

Now, I think it struggles because I'm asking difficult things in subjects I'm already very good in, and I'm only asking because I'm lazy: I don't know the answer straight away, so I want a shortcut to save myself 15 minutes.

But if I ask it something where I have only cursory knowledge, like HTML/CSS, it's very good: >80% correct on the first reply.
 
I wouldn't read too much into this until I've actually read the paper and seen the *entire* system prompt - there's still room for the mundane interpretation "AI did what we literally told it to do":
then someone is lying in the court case
 