MaCro Philosophy
8-July-2019

AAI: Submission, Full Test Details and Baselines

It is now possible to submit agents for evaluation! Registration form here, code here, submission documentation here, and submission website here. The purpose of this evaluation is to give you feedback on how well you're doing before the final submission. The evaluation runs your agent on all 300 tests and returns the total score for each category. All tests are pass-fail (based on achieving a score above a threshold), so the maximum possible score is 300. Whilst the tests are all relatively trivial for a human to solve, this is not the case for AI, and we expect them to form a long-term challenge. Having said this, we include a range of tests, so every reasonable idea should be able to pass a few. Don't be disheartened if your initial score is lower than expected. Even adding a few more points to the overall score could mean solving some very interesting tests from the perspective of artificial cognition.

For the mid-way evaluation (for those opting in for a share of the $10,000 AWS prizes) and the final evaluation we will (resources permitting) run more extensive testing with 3 variations per test (900 tests in total). The variations will include minor perturbations to the configurations. The agent will have to pass all 3 variations to pass each individual test, still giving a total score out of 300, but with stricter requirements. This means that your final test score will probably be lower than the score achieved in normal feedback, and that the competition leaderboard on EvalAI may not exactly match the final results.

Entry Checklist

Really Bad Baselines

We have also prepared some really bad baselines. These are either simple one-line agents, or agents trained using PPO on just one of the example config files (see examples/configs on GitHub). The PPO agents were run with the settings supplied in the examples. No hyperparameter search was performed and each was only run once for this initial release. Training lasted for 250,000 steps.

| Agent | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | Total (%) |
|-------|----|----|----|----|----|----|----|----|----|-----|-----------|
| Random | 2 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1.67 |
| Forwards (.8) or Left (.2) | 4 | 0 | 0 | 2 | 0 | 5 | 5 | 0 | 0 | 0 | 5.33 |
| PPO-Food | 5 | 4 | 4 | 3 | 0 | 2 | 2 | 2 | 0 | 1 | 7.67 |
| PPO-Preferences | 3 | 6 | 3 | 3 | 1 | 1 | 4 | 0 | 0 | 0 | 7 |
| PPO-Obstacles | 18 | 13 | 4 | 7 | 2 | 5 | 9 | 2 | 4 | 0 | 21.33 |
| PPO-SpatialReasoning | 10 | 6 | 3 | 3 | 1 | 5 | 7 | 1 | 6 | 2 | 14.67 |
| PPO-Generalization | 2 | 0 | 4 | 0 | 3 | 2 | 3 | 0 | 2 | 0 | 5.33 |
| Me (using higher resolution and having designed all the tests) | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 100 |

We should be careful not to read too much into these very bad baselines. They do not represent serious efforts to create intelligent agents that understand the properties of their environment. The agents trained on Food and Preferences did not fully learn to solve their training configurations and, with a little bit of tweaking, we expect to be able to post significantly improved scores for them (at least on categories 1 and 2).

The agent trained on obstacles performed best, and was able to seek out and retrieve food even when it was behind obstacles. Its score of 21% includes a number of the simpler problems in the environment (included to give a gradient of difficulty by which to compare agents), and also some 'lucky' successes in some of the more complex tasks. The tests are designed to be lenient in terms of the execution required to get a pass mark. This explains why I can achieve a score of 100%. There are only a few tasks that require a bit of finesse and could potentially be failed if I got a little careless (I expect my average over multiple attempts at the 300 tasks would be between 99% and 100%). The human results reported here were from tests performed at full resolution. All tests were checked and passed at 84x84 resolution (the same as for the agent) during the design phase, and the human baseline might be slightly (but not significantly) lower at this resolution. We plan to perform a proper experiment with human participants at a later date - especially including those not involved in the design of the environment or tests.

The reason the tests are lenient is that we want them to be simple to pass if the cognitive skill being tested for is understood, thereby ruling out failures due to reasons other than the skill being tested for. This means that it is possible to pass some tasks without possessing the relevant understanding (consider that a sequence of random actions could theoretically pass all tasks, and does in fact manage to pass 5 out of 300). For the purposes of the competition, such successes are counted in the feedback. However, the tests themselves are divided into subsets for specific cognitive skills (sometimes even within categories). Only if an agent passes all the tasks in a specific subset would we attribute the corresponding cognitive skill to it. Unfortunately, we can only provide this feedback once all the tests have been released after the competition.

Having said all this, the best strategy should still be to create an agent with the understanding required to solve entire subsets of tasks. Getting 'lucky' on a few tasks here and there should not be enough to win.

Testing Information

Finally, here is the collected information about the environment, tests, and categories now all in one place.

Objects

The set of objects in the environment has been simplified over the course of the design phase of the competition. Our philosophy is that we should include the minimal set required to set up all the tests, and where possible simplify the types of objects encountered to make the environment easier. We can always increase the complexity for later versions of the competition if good progress is made on this simplified set.

There are seven types of objects that can appear in the tests, which we have split into three categories. See the docs for further details. Note that there are a few minor exceptions (detailed in the category descriptions) to the information given below.

All the tests will have one of the following lengths (in steps):

This information is passed to the agent by the reset function.

Categories

1. Food

Most animals are motivated by food and this is exploited in animal cognition tests. The same is true here. Food items provide the only positive reward in the environment and the goal of each test is to get as much food as possible before the time runs out (usually this means just getting 1 piece of food). This introductory category tests the agent's ability to reliably retrieve food and does not contain any obstacles.

Allowed objects:

  • All goals
  • Nothing else

Suggested Basic Training:

  • Just an arena with food in it.
  • See e.g. examples/configs/1-Food.yaml
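
As a concrete illustration, a minimal config along these lines might look something like the sketch below. This assumes the standard ArenaConfig YAML format and the GoodGoal object name; check the docs and the bundled example configs for the exact syntax, and note that omitted positions and sizes are typically randomised.

```yaml
!ArenaConfig
arenas:
  0: !Arena
    t: 250            # episode length in steps (an assumed value)
    items:
    - !Item
      name: GoodGoal  # a single piece of green food; position and size randomised when omitted
```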

2. Preferences

This category tests an agent's ability to choose the most rewarding course of action. Almost all animals will display preferences for more food or easier to obtain food, although the exact details differ between species. Some animals possess the ability to make complex decisions about the most rewarding long-term course of action.

Allowed objects:

  • All except zones.

Suggested Basic Training:

  • An arena with different types and sizes of food.
  • See e.g. examples/configs/2-Preferences.yaml and the sketch below.
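
The sketch below is a minimal version of such an arena. It assumes the GoodGoal, GoodGoalMulti and BadGoal object names and that reward scales with food size; treat the names and values as placeholders and check them against the docs and example configs.

```yaml
!ArenaConfig
arenas:
  0: !Arena
    t: 250
    items:
    - !Item
      name: GoodGoal        # one large food (higher reward, assuming reward scales with size)
      sizes:
      - !Vector3 {x: 3, y: 3, z: 3}
    - !Item
      name: GoodGoalMulti   # several smaller foods; two size entries spawn two instances
      sizes:
      - !Vector3 {x: 1, y: 1, z: 1}
      - !Vector3 {x: 1, y: 1, z: 1}
    - !Item
      name: BadGoal         # food with negative reward, to be avoided
```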

3. Obstacles

This category contains immovable barriers that might impede the agent's navigation. To succeed in this category, the agent may have to explore its environment. Exploration is a key component of animal behaviour. Whilst the more complex tasks involving pushing objects all appear in later categories, the agent must be able to push some objects to solve all the tasks here.

Allowed objects:

  • All except zones.

Suggested Basic Training:

  • One food with multiple immovable and movable objects
  • See e.g. examples/configs/3-Obstacles.yaml
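
A sketch of a basic obstacle arena is given below; the Wall and Cardbox1 object names, positions, and sizes are assumptions for illustration only and should be checked against the docs.

```yaml
!ArenaConfig
arenas:
  0: !Arena
    t: 500
    items:
    - !Item
      name: Wall        # immovable barrier between the agent and the food
      positions:
      - !Vector3 {x: 20, y: 0, z: 20}
      sizes:
      - !Vector3 {x: 10, y: 3, z: 1}
    - !Item
      name: Cardbox1    # light box the agent can push out of the way
    - !Item
      name: GoodGoal    # food, spawned at a random position
```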

4. Avoidance

This category introduces the hot zones and death zones, areas which give a negative reward if they are touched by the agent. A critical capacity possessed by biological organisms is the ability to avoid negative stimuli. The red zones are our versions of these, creating no-go areas that reset the tests if the agent moves over them. This category of tests identifies an agent’s ability to detect and avoid such negative stimuli.

Allowed objects:

  • At this point all the objects have been introduced; this and later categories can contain any type of object.

Suggested Basic Training:

  • 1 green food (stationary) and 1-2 red zones
  • See e.g. examples/configs/4-Avoidance.yaml
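
A sketch along the lines of that suggestion is below, assuming DeathZone is the object name for the red no-go zones (and HotZone for the milder negative-reward zones); again, check the names and fields against the docs.

```yaml
!ArenaConfig
arenas:
  0: !Arena
    t: 250
    items:
    - !Item
      name: GoodGoal    # one stationary green food
    - !Item
      name: DeathZone   # red zone: touching it ends the test with a negative reward
      sizes:
      - !Vector3 {x: 10, y: 0, z: 10}
    - !Item
      name: HotZone     # optional second zone giving negative reward while the agent is inside
```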

5. Spatial Reasoning

This category tests an agent's ability to understand the spatial affordances of its environment. It tests for more complex navigational abilities and also knowledge of some of the simple physics by which the environment operates.

Suggested Basic Training:

  • One food with multiple immovable and movable objects, including ramps
  • See e.g. examples/configs/5-SpatialReasoning.yaml
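
For example, a ramp-and-platform arena might be sketched as below; the Ramp and Wall names, and the coordinates placing the food on top of the platform, are illustrative assumptions rather than values from the example configs.

```yaml
!ArenaConfig
arenas:
  0: !Arena
    t: 500
    items:
    - !Item
      name: Wall             # raised platform the agent must get on top of
      positions:
      - !Vector3 {x: 20, y: 0, z: 25}
      sizes:
      - !Vector3 {x: 8, y: 2, z: 8}
    - !Item
      name: Ramp             # ramp leading up onto the platform
      positions:
      - !Vector3 {x: 20, y: 0, z: 19}
    - !Item
      name: GoodGoal
      positions:
      - !Vector3 {x: 20, y: 2, z: 25}   # food placed on top of the platform
```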

6. Generalization

This category includes variations of the environment that may look superficially different to the agent even though the properties and solutions to problems remain the same. These are still all specified by the standard configuration files.

Allowed objects:

  • Note that this category may not stick to the colouring conventions used in the other tests (e.g. walls may not be grey).

Suggested Basic Training:

  • Ramps and walls without set colours.
  • See e.g. examples/configs/6-Generalization.yaml
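
A sketch of a training arena with randomised colours is below. It assumes that a channel value of -1 in an !RGB entry requests a random colour; verify this convention against the docs and the example config.

```yaml
!ArenaConfig
arenas:
  0: !Arena
    t: 250
    items:
    - !Item
      name: Wall
      colors:
      - !RGB {r: -1, g: -1, b: -1}   # assumed: -1 per channel means a randomised colour
    - !Item
      name: Ramp
      colors:
      - !RGB {r: -1, g: -1, b: -1}
    - !Item
      name: GoodGoal
```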

7. Internal Models

This category tests the agent's ability to store internal models of the environment. In these tests, the lights may turn off after a while and the agent must remember the layout of the environment to navigate it in the dark. Many animals are capable of this behaviour, but have access to more sensory input than our agents. Hence, the tests here are fairly simple in nature, designed for agents that must rely on visual input alone.

Allowed objects:

  • With a few exceptions, the blackout times used are either a single negative multiple of 20 (e.g. [-20]), or, if the lights turn out after a while, they first flicker off for 5 steps at a time at multiples of 25 (starting at either 25 or 50) before going out. So, for example, [-20], [25, 30, 50, 55, 75] and [50, 55, 75, 80, 100, 105, 125] are all valid settings.

Suggested Basic Training:

  • A simple environment with the lights set to go out at a regular interval or after a certain initial period.
  • See e.g. configs/lightsOff.yaml (Example showing possibilities)
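
To make the blackout settings above concrete, here is a sketch of two arenas using an (assumed) arena-level blackouts field: one with the flicker-then-off pattern and one with a regular toggle interval. The field name and its exact semantics should be checked against the docs and the lightsOff example.

```yaml
!ArenaConfig
arenas:
  0: !Arena
    t: 250
    blackouts: [25, 30, 50, 55, 75]   # lights off for steps 25-30 and 50-55, then off from 75 onwards
    items:
    - !Item
      name: GoodGoal
  1: !Arena
    t: 250
    blackouts: [-20]                  # assumed: lights toggle on/off every 20 steps
    items:
    - !Item
      name: GoodGoal
```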

8. Object Permanence

Many animals seem to understand that when an object goes out of sight it still exists. This is a property of our world, and of our environment, but is not necessarily respected by many AI systems. There are many simple interactions that aren't possible without understanding object permanence and it will be interesting to see how this can be encoded into AI systems.

Allowed objects:

  • Anything from the above

Suggested Basic Training:

  • You're on your own!

9. Advanced Preferences

This category tests the agent's ability to make more complex decisions to ensure it gets the highest possible reward. Expect tests with choices that lead to different achievable rewards.

Allowed objects:

  • Anything from the above

Suggested Basic Training:

  • You're on your own!

10. Causal Reasoning

Finally we test causal reasoning, which includes the ability to plan ahead so that the consequences of actions are considered before they are undertaken. All the tests in this category have been passed by some non-human animals, and these include some of the more striking examples of intelligence from across the animal kingdom.

Allowed objects:

  • Anything from the above

Suggested Basic Training:

  • You're on your own!


Good Luck!

