This sounds like the attempts at natural language programming, sort of?
Kind of. I'm importing a long set of rules for a tabletop game and it would be impractical to implement
all of them manually.
So here's what I did:
First I grabbed all the inputs and normalized away minor differences in phrasing that were easy to remove and could trip up the algorithm. I split each input into words and punctuation; this is partly to speed up the next step. I then applied a longest common subsequence algorithm to each pair of inputs, which gives me a cheap measure of how similar the two inputs are. All inputs that are closer to each other than an empirically chosen threshold are grouped together and excluded from comparisons in later loops (this is to speed up the process, since LCS is somewhat expensive and I'm doing it ~100M times).
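In code, the grouping step looks roughly like this (a simplified sketch; the tokenizer, the similarity normalization, and the 0.8 threshold are illustrative stand-ins, not the exact ones I used):

    import re

    def tokenize(text):
        # split into words and punctuation so trivial whitespace differences don't matter
        return re.findall(r"\w+|[^\w\s]", text.lower())

    def lcs_len(a, b):
        # length of the longest common subsequence, O(len(a)*len(b)) time, O(len(b)) space
        prev = [0] * (len(b) + 1)
        for x in a:
            cur = [0]
            for j, y in enumerate(b, 1):
                cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
            prev = cur
        return prev[-1]

    def similarity(a, b):
        # fraction of the longer input covered by the LCS (one possible normalization)
        return lcs_len(a, b) / max(len(a), len(b))

    def group_inputs(texts, threshold=0.8):
        # greedy grouping: the first ungrouped input seeds a group, anything similar
        # enough joins it and is skipped in later passes
        tokens = [tokenize(t) for t in texts]
        groups, assigned = [], set()
        for i in range(len(texts)):
            if i in assigned:
                continue
            bucket = [i]
            assigned.add(i)
            for j in range(i + 1, len(texts)):
                if j not in assigned and similarity(tokens[i], tokens[j]) >= threshold:
                    bucket.append(j)
                    assigned.add(j)
            groups.append(bucket)
        return groups

With ~16,100 inputs that's on the order of 16,100²/2 ≈ 130M pairwise comparisons in the worst case, which is why skipping already-grouped inputs matters.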
Using this algorithm, in the first group I get the merged string
Flip a coin. If heads, the other player takes {1|2|3|4} extra damage.
plus some other similar strings:
Flip a coin. If heads, the other player takes {1|2|3|4} extra damage {and can't attack the next turn}.
Flip a coin. If heads, the other player takes {1|2|3|4} extra damage{, if tails you can't attack in your next turn}.
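For illustration, here's roughly how a pairwise merge like that can be built on top of the same LCS machinery: shared tokens pass through, differing runs become {x|y}, and a run present on only one side becomes {x}. This is a simplified sketch of the idea, not the exact code I used; folding it across a whole group and collapsing alternatives like {1|2|3|4} needs more bookkeeping, and the token join here is crude about spacing and apostrophes.

    import re

    def tokenize(text):
        return re.findall(r"\w+|[^\w\s]", text)

    def lcs_pairs(a, b):
        # DP table plus backtracking to recover the matching token positions
        n, m = len(a), len(b)
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n):
            for j in range(m):
                dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
        i, j, pairs = n, m, []
        while i > 0 and j > 0:
            if a[i - 1] == b[j - 1]:
                pairs.append((i - 1, j - 1)); i -= 1; j -= 1
            elif dp[i - 1][j] >= dp[i][j - 1]:
                i -= 1
            else:
                j -= 1
        return pairs[::-1]

    def merge_pair(a, b):
        # shared tokens pass through; differing runs become {x|y} or {x}
        out, pi, pj = [], 0, 0
        for i, j in lcs_pairs(a, b) + [(len(a), len(b))]:
            gap_a, gap_b = " ".join(a[pi:i]), " ".join(b[pj:j])
            if gap_a or gap_b:
                out.append("{%s|%s}" % (gap_a, gap_b) if gap_a and gap_b else "{%s}" % (gap_a or gap_b))
            if i < len(a):
                out.append(a[i])
            pi, pj = i + 1, j + 1
        return " ".join(out)

    s1 = "Flip a coin. If heads, the other player takes 1 extra damage."
    s2 = "Flip a coin. If heads, the other player takes 2 extra damage and can't attack the next turn."
    print(merge_pair(tokenize(s1), tokenize(s2)))
    # -> Flip a coin . If heads , the other player takes {1|2} extra damage {and can ' t attack the next turn} .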
I managed to reduce the problem from ~16100 cases to ~1300 cases that can be handled automatically (some are probably too small to be worth handling this way) and ~1300 cases that must be handled specially. It's still a lot of work, but it's like an order of magnitude smaller.