Testing GPT-4's accuracy on the Japanese language

Basically every machine-human interface that interacts with humans in a human-human style

Yes, here I totally agree: we want it to get progressively better at both accuracy and the user’s comprehension.

It may be nothing, it may even be stupid, but I always try not to use “could you” or “can you?”, which comes naturally to us since we’re so used to interacting kindly with humans. Here I just use a tougher and more severe “do X” or “provide X” and so on.

1 Like

Yeah, no problem.

The stuff about てみろ being a conditional, it got right. It’s typically used for something bringing about a certain hypothetical result. For example, I used it for “if it rains, it will hypothetically bring about the result of the game getting canceled”, iirc.

In the thing @WeebPotato linked, they use it like
彼に見つかってみろ。大変な目に会う

Which is saying that if you are found by him, it will bring the result of 大変な目に会う

And 彼と飲んでみろ。朝まで帰れない

Which is saying that if you drink with him, it will bring the result of 朝まで帰れない.

I think GPT did a poor job of choosing an example sentence for it, though. It could be used as a conditional with 食べる, but you really need to make it a full sentence to showcase that. The whole point is that there is some hypothetical outcome, so if one isn’t written, it’s kinda wrong to say it can be interpreted that way, since in that case you would interpret it as an order instead.

EDIT: Since this is something that looks like it’s harder to find online, and I just realized the grammar blog I linked in my test is no longer up, I’ll show a few more sentences from my books. I just searched for 言ってみろ.

context: MC got a stepsister and was expecting her to be an elementary schooler, but she was actually a high schooler who happens to be at his school, in the same grade… and beautiful, of course. MC is thinking to himself that if he let people know about that situation, it would lead to the latter sentence.

妹は小学生じゃなくて高校生だったんだ。それも同じ学校の、同学年。どこのクラスか知らないけど、美人の女子。──なんて言ってみろ、無駄に好奇心をくすぐって、あらぬ嫌疑をかけられかねない。

context: guy saying this to his crush, who is beautiful and perfect, of course. He’s not ordering her to tell guys to touch her lol.

男に触ってなんて言ってみろ、襲われるぞ

context: dunno, but we don’t need it

それ他の男子に言ってみろ、視線で殺されるぞ」

食べるだけならまだいい、残った皮を階段にポイ捨てしてみろ、僕はお前を絶対に許さない!

context: Pansy is the name of a girl. The person is thinking to themselves.

パンジーを俺の部屋になんざ入れてみろ。どうせロクでもないことにしかならない。

context: uhh, two dame ningen potentially living together being hypothesized about? I think MC was trying to rationalize why he and his girlfriend shouldn’t live together, iirc.

一緒に住むなんてことになってみろ。毎日コンビニ飯ばっかりで部屋は通常の2倍の速さでどんどん汚くなって、あっという間に住んでるだけで病気になりそうな瘴気漂う汚部屋のできあがりだ!

5 Likes

Heck, I just had an intuition. In Italian we may have something similar; it’s a bit like saying

Try eating it (do it). You will get a stomachache.

I wouldn’t even define this as a grammar point in my language, though, because it’s simply a particular, very seldom-used way to express a condition. (So perhaps it’s the exact same thing in Japanese?)

Also, if anyone has the will and time to check this

Yeaahh, I mean, I don’t know Italian so it’s hard to say how similar they are, but I just went ahead and added some example sentences you can read through. If you go through those, especially the last one, and feel like you can wrap your head around it, I’d say you’ve got it. If you have any questions about them, just lemme know.

1 Like

Can I be MC?
Jk, anyway I’d say that if I got it right, it’s pretty much what I was talking about in Italian, so this is perfectly clear now, thanks!
How accurate was GPT in its answer on てみろ overall, in your opinion? I’m thinking about typing it out every time on a scale of 1-5, or maybe 1-10, because at the end of the experiment, if you provide it this URL and ask it to interpret it and express its own accuracy rate, it’ll be able to do so haha

Side question for @WeebPotato : I was wondering, if LLMs are strictly limited by the data sets they were trained on, how do they interact so effectively with external content through the Bing function and plug-ins in general?
My guess is that the data set an LLM was trained on doesn’t determine the information the machine is competent about - it just improves the machine’s capacity to function logically and accurately. Does this make any sense to you?

Yes, it would in general be nice if the text encoding layer were better at actually comprehending the user’s text input, as opposed to just tokenizing it, which is currently a limiting factor. We’re still way further along than we were only a couple of years back, so there’s that :smiley:.
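
If you want to see what the tokenizer actually does to your input, here’s a minimal sketch using OpenAI’s tiktoken library (cl100k_base is the encoding used by GPT-4-era models; the exact splits are just whatever the tokenizer happens to produce):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

text = "彼に見つかってみろ。大変な目に会う"
tokens = enc.encode(text)

print(tokens)  # the list of integer IDs the model actually "sees"
# Decoding token by token shows the splits; Japanese characters often
# span multiple tokens, so some pieces come back as partial bytes (�).
print([enc.decode([t]) for t in tokens])
```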

Yes, that can also have an impact :joy:. Interacting with an LLM is way harder than with the textual layer of a Stable Diffusion model, so literally any keyword might lead to different results.

The plugins are very likely an add-on. What I can imagine is that the LLM generates a query to the Bing API, gets the results back, and then interprets them for the user. So it’s not strictly part of the model itself, which is limited to text input/output.
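
To make that concrete, here’s a sketch of the loop I’m imagining. `search_bing` and `llm_complete` are made-up placeholder functions (dummy stubs here), not the real plugin API:

```python
# Hypothetical sketch of a plugin-style retrieval loop.
# search_bing() and llm_complete() are placeholders, not real APIs.

def search_bing(query: str) -> list[str]:
    """Stand-in for a web search call; returns result snippets."""
    return [f"(dummy search result for {query!r})"]

def llm_complete(prompt: str) -> str:
    """Stand-in for a call to the language model itself."""
    return "(dummy model output)"

def answer_with_search(user_question: str) -> str:
    # 1. The model turns the question into a search query.
    query = llm_complete(f"Write a web search query for: {user_question}")
    # 2. External results are fetched - data the model was never trained on.
    snippets = search_bing(query)
    # 3. The results go back into the prompt so the model can interpret them.
    context = "\n".join(snippets)
    return llm_complete(
        f"Using only these search results:\n{context}\n\n"
        f"Answer the question: {user_question}"
    )

print(answer_with_search("What does てみろ mean?"))
```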

No, it actually limits both. An LLM can’t go beyond the data it was trained on. What’s deceptive is that LLMs and other generative AI models use a random number/seed when generating output, so the answer seems “creative”. It’s just a matter of how the model combines the symbols and their weights (not the same as probabilities, I think) to give you an answer to the query.
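
The “seems creative” part is basically seeded sampling over the model’s output scores. A toy numpy sketch of the idea (the vocabulary and the scores are made up):

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the random seed in question

vocab = ["cat", "dog", "bird"]        # toy vocabulary
logits = np.array([2.0, 1.5, 0.5])    # made-up scores for the next word

def sample_next_word(logits, temperature=1.0):
    # Softmax with temperature: higher temperature flattens the
    # distribution, so less likely words get picked more often.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(vocab, p=probs)

# Same model "weights", different-feeling output depending on sampling.
print([sample_next_word(logits, temperature=0.7) for _ in range(5)])
print([sample_next_word(logits, temperature=2.0) for _ in range(5)])
```

Change the seed and the “creative” answer changes, even though the model itself hasn’t.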

The model is trained on a varied enough data set to be general and to be able to provide some answer, but there are two outcomes regarding accuracy:

  • For sub-group requests like the light novel example from earlier, the answer will be biased by mentions of light novels alongside similar things like manga and anime.

  • For fringe things like てみろ and 誰何, it will in general be limited by the availability of information on these specific items vs. the ton of info about kanji, Japanese grammar, etc.

What I personally don’t know (and this might be the issue with not understanding the internals of LLMs) is that it’s difficult to predict how an LLM interprets text to decide which information is more reliable and therefore more accurate. Is it the general representation of the information, or does it also use specific keywords as qualifiers? Stuff like that.

2 Likes

On whether or not it can be used, it got it right - which was your question, after all.

On the explanation I would probably give it like a 2/10, however. It got almost all the explanation details wrong.

It is not for urgency. It’s not for immediacy. I’ve also never heard it in spoken Japanese, and my girlfriend said the same, yet GPT said it’s common in speech. The example sentence it gave wasn’t necessarily inaccurate, but it was insufficient imo. And, well, it’s true that context can change how you interpret it. So yeah, 2/10 for that part.

3 Likes

I don’t know about all that, but I studied CS with a small background in programming, and it wasn’t hard. Mostly because I was relatively good at math and programming, but you really can learn anything you need as you go. I would suggest giving it a try, as you lose nothing by trying.

CS in general is a good path, very lucrative and with a lot of future ahead of it.

2 Likes

They say GPT-5 will be an AGI haha. The speed at which this technology is improving truly scares me, because there is no time for regulation, but as a user I’m looking forward to it :grin:

happy to know it’s not (just) paranoia :smile:

I studied the subject today (as an amateur without technical knowledge, of course) and I’m happy to see that the conclusion I came to is exactly what you’re saying here. I learnt that when it comes to real-time data, or more generally data that wasn’t included in the LLM’s training, the LLM interacts with other layers that are responsible for generating queries, fetching data, and providing it to the LLM, which at that point can process it and generate a response that is understandable to the user. Correct me if this is wrong.

As a layperson, it made sense to me. What do you think?

I also believe that if someone has some self-teaching and research skills, almost anything can be learnt, so this shouldn’t be a problem.

Now it’s just a matter of figuring out whether that roadmap makes sense.

Thanks a lot for your feedback, hugely appreciated!

Yes, what ChatGPT wrote is correct. But it’s also sad, because it means ChatGPT isn’t exactly useful beyond fooling around with it. I was hoping there was some supervision involved in the training process, but apparently not, and all word-order patterns are derived purely from the training data.

Haha no, this is correct. I liked your summary :grin:.

1 Like

I’m not sure I understand what you mean here. Can you explain? I don’t want to be sad as well, but I want to understand.

Check the last 2 paragraphs in your screenshot.

1 Like

Yeah, I know, I’m just not sure what changes technically from what you believed, and what the practical outcomes are. One key goal for developers will remain accuracy, those patterns are going to be influenced by specific training, and in the end it’s possible to manipulate prompts to increase the chances that accuracy will be higher, no? Maybe I misunderstood your point or the technicalities.

Anyway, many people on Reddit complain that the post-update GPT-4 is significantly worse than the previous one, and it may well be due to the addition of training based on user interactions. How long have you been using it, and did you notice anything?

Also, in the last couple of days it seemed rather less accurate than before, for some reason.

1 Like

Accuracy is something difficult to measure in this case. Do you mean how accurate the information is? To properly measure that, you would have to have automated test cases, possibly using an adversarial neural network paired with the LLM to make sure it’s not spitting out misinformation.
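
As a rough idea of what such automated test cases could look like, a hypothetical harness (`ask_llm` is a dummy stand-in, and the substring check is far too naive for real use):

```python
# Hypothetical sketch of an automated factual-accuracy check.

def ask_llm(question: str) -> str:
    return "(dummy model answer)"  # placeholder for a real model call

# Reference questions paired with known-good answers.
test_cases = [
    ("What particle marks the direct object in Japanese?", "を"),
    ("What is the dictionary form of 食べてみろ?", "食べてみる"),
]

def run_accuracy_tests(cases):
    passed = 0
    for question, expected in cases:
        answer = ask_llm(question)
        # Naive check; a real harness needs semantic matching,
        # not substring matching.
        if expected in answer:
            passed += 1
    return passed / len(cases)

print(f"accuracy: {run_accuracy_tests(test_cases):.0%}")
```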

Regarding manipulating prompts - theoretically yes, but the LLM is still limited by the diversity and quality of its training data. If there is a topic X for which more than 50% of the data is opinions from uninformed users and less than 50% is actually useful data from textbooks, experts, etc., you’re quite likely to run into bollocks :sweat_smile:. And as mentioned in the screenshot, the LLM doesn’t actually distinguish between the data - it takes it all and tries to infer the textual relationships between words. So it doesn’t actually understand the data; it merely navigates through it, learns the language patterns, and then uses that knowledge to generate text responses.

How this is different from the way we acquire information is that we learn what an “expert opinion” or “evidence” is and can use these non-quantitative qualifiers to judge whether a piece of information is true or not. ChatGPT doesn’t do that.

I haven’t used it much lately, and even before that, once I noticed it’s not really good (meaning, reliable) at getting meaningful information out, I just used it for conversation practice. That, it’s really good at :smiley:.

3 Likes

I don’t know if that is the case, but I’d like to find out. I don’t know how they currently check the model’s accuracy…

Oh, now I see it clearly… yeah, it’s a bit depressing at first, but I believe there are various solutions or workarounds that developers are working on… even though I now realize that reducing hallucination and increasing accuracy are not exactly the same thing.
At this point, just considerably reducing hallucination would be a big win; they did it with GPT-4 and they’re doing it again. But it wouldn’t be enough to ensure it can be used for certain tasks and in certain contexts.
As for the process of establishing whether something is right or wrong (i.e., accuracy), I imagine that different layers would be used for this purpose, maybe? Like, in the end, LLMs are “just” a concerted effort to make a machine communicate (and nothing else) like a human.
The more I learn about it, the more I tend to imagine it as the brain area dedicated to language acquisition and production - and not the whole brain. I don’t know if you think this picture is accurate.

I have the impression that these machines now possess enormous amounts of data, but the biggest obstacle is giving them the capacity to judge the variable importance (weight) of that data in ambiguous contexts, such as when too large a share of the information on a subject is inaccurate, or when the pointers suggesting that a minority of the information is right are misleading or simply too few.

I think you’re missing out… the more I use it and study it, the more its limitations reveal themselves, but I still trust its potential as much as before.

1 Like

Not sure if that can be done inside a single neural network. Typically, the neural net works as a single entity on a single data set. There are means of excluding poorly scoring items, reducing noise, etc., but all of it relies on math that touches every element of the data set. To validate how well the network is doing, you need something external.
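
The most basic form of that “something external” is just data held out of training. A minimal sketch with scikit-learn on toy data (nothing LLM-specific, same principle):

```python
# Minimal sketch of external validation via a held-out split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data standing in for a real data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The validation split is never seen during training; it is the
# external yardstick for how well the model is doing.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_val, y_val))
```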

What exactly is “hallucination”? I’ve never heard of it in the context of machine learning or AI.

Something like this, I guess. From a fairly broad perspective, LLMs do a simple job of producing intelligible output in response to the user. They don’t do much more than that.

I don’t think weights are equivalent to variable importance. Model weights are more about the representation of patterns in data. Validation is a retroactive step.
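
A toy contrast of the two ideas, using scikit-learn’s permutation importance (the data is synthetic; the point is only that raw weights and importance need not line up):

```python
# Sketch contrasting raw model weights with variable importance.
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X[:, 1] *= 100                    # feature 1 lives on a much larger scale
y = X[:, 0] + 0.01 * X[:, 1]      # both features matter about equally

model = LinearRegression().fit(X, y)
print("weights:    ", model.coef_)  # wildly different magnitudes

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print("importance: ", result.importances_mean)  # roughly equal
```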

That being said, I now remember there are methods of excluding outliers during training… However, outliers are NOT the same as someone judging if the information provided by the model is subjectively correct/incorrect. That’s something different.
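
For reference, a common simple version of that is a z-score filter applied before training; purely illustrative numbers:

```python
# Sketch of z-score outlier exclusion on training data.
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=1000)
data[:10] = 50.0  # inject a few extreme outliers

# Keep only points within 3 standard deviations of the mean.
z_scores = np.abs((data - data.mean()) / data.std())
filtered = data[z_scores < 3.0]

print(f"kept {filtered.size} of {data.size} points")
```

And that’s exactly the limitation: this removes statistical extremes, but it has no opinion on whether a data point is factually correct.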

1 Like

Weird, I learnt it in this exact context (I think I saw it used on the OpenAI website as well). Basically, it refers to the machine blundering for no apparent reason. I also assume this is caused either by bugs or by inefficiencies of the model; correct me if my assumptions are wrong.
Edit: in saying this, I consider that all bugs are inefficiencies, but not all inefficiencies are bugs.

Oops, I misused the term “variable importance”, but that is kind of what I meant (just more ignorantly haha).

Interesting, what’s that?
(feel free not to answer :joy:)

1 Like

Hallucination is the usual term to describe the phenomenon where an LLM simply “makes something up” when asked for information about the real world, like producing nonexistent court case names when asked to produce a legal brief, or nonexistent research paper titles when asked to produce an essay with references.

It is not, in my view, a mere bug in the model, but a reflection of the underlying fact that an LLM has no understanding and no true connection to reality, and is just outputting plausible-looking text.

8 Likes

Hmm, hard to say what kind of bugs there might be. Perhaps in libs like Torch and their implementations? Something off with the numerical precision, for instance.

But in general, mathematical inefficiencies are also a thing. For instance, some time ago I downloaded someone’s Stable Diffusion-based model for testing and was getting NaN values when generating output. It would happen only for some random seeds, but often enough to make the model difficult to use.
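
A quick way to catch that kind of failure is to sweep seeds and flag NaNs in the output tensor, roughly like this (`generate` is a placeholder for the real pipeline call):

```python
# Sketch of a seed sweep that flags NaN-producing generations.
import torch

def generate(seed: int) -> torch.Tensor:
    # Placeholder for a real diffusion pipeline call.
    torch.manual_seed(seed)        # fix the random seed
    return torch.randn(3, 64, 64)  # stand-in for a generated image tensor

bad_seeds = []
for seed in range(100):
    image = generate(seed)
    if torch.isnan(image).any():
        bad_seeds.append(seed)

print(f"{len(bad_seeds)} of 100 seeds produced NaNs: {bad_seeds}")
```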

So it’s possible for the math in the neural net to break on edge cases; however, things like that are usually weeded out quite quickly, and the individual math functions used in the network’s neurons should account for edge cases.

Thanks for the explanation! Yes, this seems like a generative AI problem.

3 Likes