Testing GPT-4.0's accuracy on Japanese

Regarding these sections

I don’t think I’ve ever seen 原型 being used in these contexts. The rest of the response is okay and makes sense to me.

2 Likes

Agreed, the only obstacle to creative solutions for an AI is the user. As for verifiability, I like to think the hallucination rate is on a favorable trend and will soon be substantially lower than that of humans.
Still, as things stand today, I think the problem more often than not is using prompts that are either incomplete or inaccurate, or both.

1 Like

lemme ask it for example sentences with sources :joy: happy to know we’re talking about a subtlety

2 Likes

Hmmm, no, it's more about how complex the model is and what it was trained on. For instance, if you ask ChatGPT for a list of 10 romcom light novel recommendations, the prompt is both complete and accurate (though here accuracy wouldn't really be a metric, because "romcom" + "light novel" + "recommendations" + "10" is an open question).

However, ChatGPT will still be wrong and will provide a mixed list of light novels and manga, because the dataset it was trained on contained manga, light novels, probably anime, etc. So it fails to see the clear distinction between those categories.

You can correct it as much as you like, but it will still keep on returning a mixed list of 10 manga + light novels + anime, instead of light novels alone.

4 Likes

OK, check this and tell me what you think. I didn't check all the words, but it seems like there are over a million results on a Google search.

Edit: stupid link not working again, here's a screenshot


1 Like

I swear I have an open chat where I asked it this exact question (yes, I’m into it :joy:)

Okay, I agree, but to me that wouldn't be enough to obtain the ideal answer; I chose the wrong words.

Anyway, you chose a pretty complex example, the infamous "recommendations".

2 Likes

It would be funny if ChatGPT took the Stack Exchange approach to dealing with requests for recommendations

2 Likes

I'm unaware of Stack Exchange's approach, tell me!

1 Like

It’s a good use of ChatGPT, so why not, right? :wink:

Yes, that can trip up ChatGPT as well if it assumes the information you give it is “correct”.

I think it's actually a fairly bog-standard user question. "ChatGPT, give me 10 of X in the style of Y".

If ChatGPT can’t do that, then whoops :smiley:

One other thing to watch out for is bias in the data. If you're aware there might be a common answer to your question that is not necessarily what you're after, you might get mostly that, and then it's hard to get the answer you're actually looking for.

For instance, if you’re looking for a certain C, which is related to B, but ChatGPT is more familiar with A + B which are often linked, you’ll get B with a flavor of A, instead of your C.

2 Likes


Yeah, the problem is deciphering what the "style of Y" looks like :grin: It could pick a random "example style" for its evaluations, but in the end it would be reduced to "here's a list of the 10 highest-ranking results", and I suppose that can vary hugely depending on the source…?

exactly my point!

In case you missed it, the last link wasn't working, so I shared a screenshot of GPT's line of defense against that critique.

1 Like

It's more that it will just make up plausible stuff, as usual. I asked Bard to write me a simple template implementation of something for a well-known open source project. It got a lot of the project-specific boilerplate right (which was surprising and pretty neat). But it happens that "read" and "write" are a bit complicated, so instead of implementing against the real APIs for reads and writes, it just hallucinated totally nonexistent simple 'read' and 'write' methods to call. I've also heard that if a library you want to use is missing useful functionality X, ChatGPT will tend to hallucinate that it has that function, with the API it "ought to have"…
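To illustrate the pattern with made-up names (this is not the actual project or its real API), the failure mode looks roughly like this: the real read path needs a bit of ceremony, but the model invents a convenient one-call method instead.

```python
# Hypothetical sketch of the hallucination pattern described above.
# "SomeStore" and its methods are invented for illustration; they are not
# any real project's API.

class SomeStore:
    """Stand-in storage engine whose read path needs a transaction handle."""

    def begin_read_txn(self):
        """Reads must go through a transaction object in this made-up API."""
        return {"snapshot": {"my_key": 42}}

    def get(self, txn, key):
        """Fetch a value through the transaction handle."""
        return txn["snapshot"].get(key)


store = SomeStore()

# What the model tends to generate: the one-liner it wishes existed.
# value = store.read("my_key")   # AttributeError: 'SomeStore' has no 'read'

# What the (made-up) real API actually requires.
txn = store.begin_read_txn()
value = store.get(txn, "my_key")
print(value)  # 42
```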

3 Likes

Yeah, I saw it. I think it looks fine. It's a simple dictionary search, so I'm not surprised it got it right. On the other hand, that's good. At least that much we can rely on :smiley:

Yes, one of the things I very often struggle with in Stable Diffusion, for instance, is getting a specific color of clothing on my characters. Usually the model will take the character's hair color as the lead for the image, even if I increase the prompt strength (not possible in ChatGPT, for obvious reasons). The only way out of such a bias is a counter-bias, where I'd either retrain the model on different images or train an extra lightweight "filter" on prompts for the colors I want.
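For reference, if that lightweight "filter" is something LoRA-like, applying it on top of the base model looks roughly like the sketch below. This is just a minimal sketch assuming the Hugging Face diffusers library; the checkpoint id, LoRA path, and prompts are placeholders, not a verified recipe.

```python
# Minimal sketch, assuming the Hugging Face `diffusers` library.
# The base checkpoint id, LoRA file path, and prompts are illustrative only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder base checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# A LoRA trained only on images with the clothing color you want acts as the
# "lightweight filter": it biases the base model without a full retrain.
pipe.load_lora_weights("path/to/red-clothing-lora")  # hypothetical file

image = pipe(
    "portrait of a blonde character wearing a red coat",
    negative_prompt="yellow coat, blonde coat",  # push back against the hair-color bias
    num_inference_steps=30,
).images[0]
image.save("out.png")
```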

1 Like

This isn't such a bad thing, given that the user is hardly fooled, no? At a certain level of knowledge of a field, I guess three things can happen: you recognize the blunder; you don't recognize it, but it magically works with everything else you know; or you don't recognize it until your house falls down. Isn't that the same way humans work? What did we do with bridges, cars, skyscrapers? We've been constantly hallucinating, but things worked out one way or another, and it's improving. The only difference is that language models are doing it faster.

For the rest, I can't comment; I'm not using AI for functions at that level of intricacy, but I suppose it's true, and it's a key reason why LLMs shouldn't be used for anything with real stakes at this point in time.

1 Like

Kind of yes and no. If the information is common/accessible enough, ChatGPT shouldn’t be making mistakes like that.

Or rather, people are using a generative AI model that's really good at sweet-talking the user for concrete requests it was not meant to handle.

1 Like

OK, trial #1 has an evaluation!

this is so interesting. Gonna come back to this in the future

1 Like

Here. This is where I consider the human approach to the machine to be the problem. Not always, but - in my limited comprehension of things and experience with it - probably in at least 60-70% of cases.

I imagine this plays a role all the time, and I also imagine that most of the attempts at reducing hallucinations are converging on this specifically…?

1 Like

Yup, engineers are trying to tackle the problem of an all-purpose, assistant-like model from the communication side, because it's usually more appealing if the AI is at least good at talking to the user :sweat_smile:

On the other hand, if making it factually correct had been a higher priority, then understanding the responses of the AI might’ve been an issue as well.

1 Like

I had the impression that the real problem is not creating a highly accurate model so much as creating a highly accurate model that can satisfy mass demand, and therefore a huge degree of variability in users' requests.

And so we circle back to the same point :grin:

1 Like

That’s usually the starting point. The original Stable Diffusion model was also trained to be as generic as possible and then later biased via retraining to draw anime characters, for instance.

1 Like