My notes from the UC Berkeley online course Intro to Artificial Intelligence
- The path to the goal is the important thing
- Paths have various costs, depths
- Heuristics give problem-specific guidance
- The goal itself is important, not the path
- All paths at the same depth (for some formulations)
- CSPs are specialized for identification problems
- All search algorithms are the same except for fringe strategies.
- Informed Search: introduces a heuristic that estimates the distance from each state to the nearest goal, and therefore speeds up search (solves the performance problem of UCS).
- A state space.
- A successor function (with actions, costs).
- A start state and a goal test.
- Traveling in Romania (Map with weighted edge)
- Pac-man game planning
- Pancake Problem
- Complete: Guaranteed to find a solution if one exists?
- Optimal: Guaranteed to find the least cost path?
- Time and space complexity.
- DFS: complete if we prevent cycles
- Not optimal
- Time O(b^m)
- Space O(bm)
- BFS: optimal only if costs are all 1
- Time O(b^s)
- Space O(b^s)
- UCS: complete and optimal
- Time O(b^(C*/ε)), where C* is the optimal solution cost and ε is the minimum arc cost
- Space O(b^(C*/ε))
- Problem: explores options in every “direction” because it has no information about the goal's location
- Greedy search: expand the node that seems closest to a goal (according to h)
- Can be wrong because of this (not optimal)
Uniform-cost orders by path cost, or backward cost g(n). Greedy orders by goal proximity, or forward cost h(n). A* Search orders by the sum: f(n) = g(n) + h(n).
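A minimal A* graph-search sketch under assumed helpers (`successors`, `heuristic`, and `is_goal` are illustrative, not course code):

```python
import heapq
from itertools import count

def a_star(start, successors, heuristic, is_goal):
    """A* graph search: pop the fringe node with the smallest f(n) = g(n) + h(n)."""
    tie = count()  # tie-breaker so states themselves never get compared
    fringe = [(heuristic(start), next(tie), 0, start, [])]
    closed = set()
    while fringe:
        _, _, g, state, path = heapq.heappop(fringe)
        if is_goal(state):
            return path
        if state in closed:
            continue
        closed.add(state)
        for action, next_state, step_cost in successors(state):
            g2 = g + step_cost
            heapq.heappush(fringe, (g2 + heuristic(next_state), next(tie),
                                    g2, next_state, path + [action]))
    return None  # no solution found
```

With `heuristic = lambda s: 0` this behaves like UCS; ordering by h alone instead gives greedy search.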
- Inadmissible (pessimistic) heuristics break optimality by trapping good plans on the fringe
- Admissible (optimistic) heuristics slow down bad plans but never outweigh true costs
- A heuristic h is admissible (optimistic) if: 0 ≤ h(n) ≤ h*(n), where h*(n) is the true cost to a nearest goal.
Consistency: heuristic “arc” cost ≤ actual cost for each arc: h(A) – h(C) ≤ cost(A to C)
A* tree search is optimal with admissible heuristics, and A* graph search is optimal with consistent heuristics. In practice, most natural admissible heuristics tend to be consistent, especially those derived from relaxed problems, so relaxed problems are the usual source of heuristics.
- A special subset of search problems
- State is defined by variables Xi with values from a domain D (sometimes D depends on i)
- Goal test is a set of constraints specifying allowable combinations of values for subsets of variables
- Map Coloring
Backtracking search is the basic uninformed algorithm for solving CSPs
Idea 1: One variable at a time: variable assignments are commutative, so we only need to consider assignments to a single variable at each step.
Idea 2: Check constraints as you go: consider only values which do not conflict with previous assignments. (Might have to do some computation to check the constraints: an “incremental goal test”.) See the sketch below.
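A minimal backtracking sketch combining the two ideas; the `domains` dict and the `consistent` incremental check are assumed representations, not the course's code:

```python
def backtracking_search(domains, consistent, assignment=None):
    """Assign one variable at a time (idea 1) and check constraints as we go (idea 2).

    domains: dict mapping each variable to a list of candidate values
    consistent: function (var, value, assignment) -> bool, the incremental goal test
    """
    if assignment is None:
        assignment = {}
    if len(assignment) == len(domains):
        return assignment                          # every variable assigned
    var = next(v for v in domains if v not in assignment)
    for value in domains[var]:
        if consistent(var, value, assignment):     # skip values that conflict with earlier choices
            assignment[var] = value
            result = backtracking_search(domains, consistent, assignment)
            if result is not None:
                return result
            del assignment[var]                    # undo and try the next value
    return None                                    # no value works: backtrack
```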
Filtering (forward checking): detect inevitable failure early by keeping track of domains for unassigned variables and crossing off bad options.
Cross off values that violate a constraint when added to the existing assignment.
Forward checking propagates information from assigned to unassigned variables, but doesn’t provide early detection for all failures.
So introduce arc consistency: an arc X -> Y is consistent iff for every x in the tail there is some y in the head which could be assigned without violating a constraint.
A simple form of propagation makes sure all arcs are consistent.
Disadvantage: runs slowly and still cannot detect all failures.
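A sketch of this propagation (essentially AC-3), assuming domains are sets, `neighbors` maps each variable to the variables it shares constraints with, and `allowed` is a binary constraint check (all assumed representations):

```python
from collections import deque

def ac3(domains, neighbors, allowed):
    """Enforce arc consistency by repeatedly pruning unsupported tail values."""
    queue = deque((x, y) for x in domains for y in neighbors[x])
    while queue:
        x, y = queue.popleft()
        # Remove tail values with no supporting value in the head.
        removed = {vx for vx in domains[x]
                   if not any(allowed(x, vx, y, vy) for vy in domains[y])}
        if removed:
            domains[x] -= removed
            if not domains[x]:
                return False               # a domain wiped out: failure detected early
            for z in neighbors[x]:
                if z != y:
                    queue.append((z, x))   # arcs into x must be rechecked
    return True
```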
1-Consistency (Node Consistency): Each single node’s domain has a value which meets that node’s unary constraints
2-Consistency (Arc Consistency): For each pair of nodes, any consistent assignment to one can be extended to the other
K-Consistency: For each k nodes, any consistent assignment to k-1 can be extended to the kth node.
Strong k-consistency: also k-1, k-2, …, 1 consistent; strong n-consistency means we can solve without backtracking.
Minimum remaining values (MRV): choose the variable with the fewest legal values left in its domain (“fail-fast” ordering)
Given a choice of variable, choose the least constraining value, i.e., the one that rules out the fewest values in the remaining variables and so leaves the most choices for the future. Note that it may take some computation to determine this (e.g., rerunning filtering)!
Independent subproblems are identifiable as connected components of the constraint graph. Solving the subproblems independently speeds things up, with the gain depending on the number of variables per subproblem.
If the constraint graph has no loops, the CSP can be solved in O(n d^2) time compared to general CSPs, where worst-case time is O(d^n)
Cutset conditioning: instantiate (in all ways) a set of c variables such that the remaining constraint graph is a tree, which reduces the runtime to O(d^c (n-c) d^2).
Local search (iterative improvement): take an assignment with unsatisfied constraints and reassign variable values until the problem is solved.
The same appears to be true for any randomly-generated CSP, except in a narrow critical range of the ratio of constraints to variables.
- Just like DFS: Time: O(b^m), Space: O(bm).
- Optimal against a perfect player, but does not exploit the mistakes of a suboptimal opponent.
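A minimal minimax sketch; the `game` interface (`is_terminal`, `utility`, `to_move`, `successors`) is an assumed representation for illustration:

```python
def minimax_value(state, game):
    """Minimax value of a state: exhaustive DFS over the game tree."""
    if game.is_terminal(state):
        return game.utility(state)
    values = [minimax_value(s, game) for s in game.successors(state)]
    return max(values) if game.to_move(state) == 'MAX' else min(values)
```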
In realistic games it is infeasible to search all the way to the leaves of the tree (we are bounded by time).
Search only to a limited depth in the tree, replace terminal utilities with an evaluation function for non-terminal positions.
- Guarantee of optimal play is gone
- The deeper in the tree the evaluation function is buried, the less the quality of the evaluation function matters
- Tradeoff between complexity of features and complexity of computation
- Typically a weighted linear sum of features (see the form after this list)
- Danger of replanning pathologies (e.g., thrashing between plans)
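For example, a typical evaluation function is a weighted linear sum of features: Eval(s) = w1*f1(s) + w2*f2(s) + … + wn*fn(s), where each fi(s) is a feature of the position (e.g., the count of each kind of piece) and wi is its weight.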
Pruning: reduce unnecessary computation where possible.
(MIN version, MAX version is symmetric)
- We’re computing the MIN-VALUE at some node n
- We’re looping over n’s children
- n’s estimate of the children’s min is dropping
- Who cares about n’s value? MAX
- Let a be the best value that MAX can get at any choice point along the current path from the root
- If n becomes worse than a, MAX will avoid it, so we can stop considering n’s other children (it’s already bad enough that it won’t be played)
- No effect on minimax value computed for the root!
- Values of intermediate nodes might be wrong
- Good child ordering improves effectiveness of pruning
- A simple example of meta-reasoning: computing about what to compute
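A sketch of the pruning described above, using the same assumed `game` interface as the minimax sketch:

```python
def alpha_beta(state, game, alpha=float('-inf'), beta=float('inf')):
    """Minimax with alpha-beta pruning: same root value, fewer nodes expanded.

    alpha: best value MAX can already guarantee on the current path
    beta:  best value MIN can already guarantee on the current path
    """
    if game.is_terminal(state):
        return game.utility(state)
    if game.to_move(state) == 'MAX':
        v = float('-inf')
        for s in game.successors(state):
            v = max(v, alpha_beta(s, game, alpha, beta))
            if v >= beta:                  # MIN above will never let play reach here
                return v
            alpha = max(alpha, v)
        return v
    else:
        v = float('inf')
        for s in game.successors(state):
            v = min(v, alpha_beta(s, game, alpha, beta))
            if v <= alpha:                 # MAX above will never allow this node
                return v
            beta = min(beta, v)
        return v
```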
Uncertain outcomes controlled by chance (EXP node), not an adversary (MIN node).
Replace MIN node with EXP node
- MAX nodes as in minimax search.
- Chance nodes are like MIN nodes but the outcome is uncertain.
- Calculate their expected utilities. I.e. take weighted average (expectation) of children.
- Hard to prune when taking averages, since any unexplored child can still shift the expected value.
- For evaluation functions, magnitudes matter, not just ordering (expectimax values are sensitive to the scale of the utilities).
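A sketch of expectimax under the same assumed `game` interface, plus an assumed `chance_outcomes(state)` that yields (probability, next state) pairs:

```python
def expectimax_value(state, game):
    """Expectimax: MAX nodes as in minimax; chance nodes take a weighted average."""
    if game.is_terminal(state):
        return game.utility(state)
    if game.to_move(state) == 'MAX':
        return max(expectimax_value(s, game) for s in game.successors(state))
    # chance node: expected value over the possible outcomes
    return sum(p * expectimax_value(s, game)
               for p, s in game.chance_outcomes(state))
```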
Expectiminimax: add an EXP node between the MAX and MIN nodes.
- Environment is an extra “random agent” player that moves after each min/max agent
- Each node computes the appropriate combination of its children
As depth increases, probability of reaching a given search node shrinks
- So usefulness of search is diminished
- So limiting depth is less damaging
- But pruning is trickier
Historic AI: TDGammon uses depth-2 search + very good evaluation function + reinforcement learning: world-champion level play.
- Terminals have utility tuples.
- Node values are also utility tuples.
- Each player maximizes its own component.
- Can give rise to cooperation and competition dynamically and naturally.
- A rational agent should choose the action that maximizes its expected utility, given its knowledge.
- Note: an agent can be entirely rational (consistent with MEU) without ever representing or manipulating utilities and probabilities. E.g., a lookup table for perfect tic-tac-toe, a reflex vacuum cleaner
The agent does not have full control over the outcomes of its actions. “Markov” generally means that given the present state, the future and the past are independent.
- A set of states s
- A set of actions a
- A transition function T(s, a, s’)
- Probability that a from s leads to s’, i.e., P(s’| s, a)
- Also called the model or the dynamics
- A reward function R(s, a, s’)
- Sometimes just R(s) or R(s’)
- A start state
- Maybe a terminal state
- Policy = map of states to actions
- Utility = sum of discounted rewards
- Values = expected future utility from a state (max node)
- Q-Values = expected future utility from a q-state (chance node)
- The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally
- The value (utility) of a q-state (s,a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
- The optimal policy: π*(s) = optimal action from state s
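These optimal quantities are related by the Bellman (one-step lookahead) equations:
- V*(s) = max_a Q*(s,a)
- Q*(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]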
- Noisy movement
- Racing car states
- It’s reasonable to maximize the sum of rewards
- It’s also reasonable to prefer rewards now to rewards later
- So use discounting: values of rewards decay exponentially
- U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 (with discount γ = 0.5)
- U([1,2,3]) < U([3,2,1])
What if the game lasts forever? Do we get infinite rewards?
Finite horizon: (similar to depth-limited search)
- Terminate episodes after a fixed T steps (e.g. life)
- Gives nonstationary policies (π depends on time left)
Discounting: use 0 < γ < 1
- Smaller γ means a smaller “horizon” – shorter-term focus
Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “overheated” for racing)
Value iteration: based on the Bellman equations, start from depth-0 values and repeatedly compute V_{k+1} from V_k with a one-step expectimax (max over all actions) from each state: V_{k+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ]. (A sketch follows the list of problems below.) Problems with value iteration:
- It’s slow – O(A*S^2) per iteration
- The “max” at each state rarely changes
- The policy often converges long before the values
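A minimal value-iteration sketch; the `mdp` interface (`states`, `actions(s)`, and `transitions(s, a)` yielding (probability, next state, reward) triples) is an assumed representation, not the course's code:

```python
def value_iteration(mdp, gamma=0.9, iterations=100):
    """Repeated batch Bellman updates: V_{k+1}(s) = max_a sum_{s'} T * (R + gamma * V_k(s'))."""
    V = {s: 0.0 for s in mdp.states}
    for _ in range(iterations):
        V = {s: max((sum(p * (r + gamma * V[s2])
                         for p, s2, r in mdp.transitions(s, a))
                     for a in mdp.actions(s)),
                    default=0.0)            # states with no actions keep value 0
             for s in mdp.states}
    return V
```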
Because of these problems with value iteration, consider another way to solve MDPs.
Policy evaluation: find a way to calculate the V’s for a fixed policy π.
If we fix some policy π(s), the tree becomes simpler: only one action per state instead of all actions.
Idea 1: Turn recursive Bellman equations into updates (like value iteration)
- Efficiency: O(S^2) per iteration
Idea 2: Without the maxes, the Bellman equations are just a linear system
- Good for parallelization (and amenable to standard linear-system solvers)
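A sketch of idea 2, solving (I − γ T^π) V^π = R^π directly; the `states`, `policy`, and `transitions` structures are assumed for illustration:

```python
import numpy as np

def evaluate_policy_linear(states, policy, transitions, gamma=0.9):
    """Exact policy evaluation by solving the linear Bellman system (no maxes)."""
    index = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))        # becomes (I - gamma * T^pi)
    b = np.zeros(len(states))      # expected immediate reward under the policy
    for s in states:
        i = index[s]
        for p, s2, r in transitions(s, policy[s]):
            A[i, index[s2]] -= gamma * p
            b[i] += p * r
    return dict(zip(states, np.linalg.solve(A, b)))
```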
Policy extraction (“reversed” value iteration): find a way to calculate the policy given the values.
Do a mini-expectimax (one step)
Completely trivial: just choose the action with the max Q-value!
Actions are easier to select from Q-Values than values
- Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
- Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values
- Repeat steps until policy converges
- It’s still optimal!
- Can converge (much) faster under some conditions
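A minimal policy-iteration sketch following the two steps above, with the same assumed `mdp` interface as the value-iteration sketch:

```python
def policy_iteration(mdp, gamma=0.9, eval_sweeps=50):
    """Alternate policy evaluation (fixed policy, no max) and one-step policy improvement."""
    policy = {s: next(iter(mdp.actions(s)), None) for s in mdp.states}
    while True:
        # Policy evaluation: iterate Bellman updates for the fixed policy
        V = {s: 0.0 for s in mdp.states}
        for _ in range(eval_sweeps):
            V = {s: (sum(p * (r + gamma * V[s2])
                         for p, s2, r in mdp.transitions(s, policy[s]))
                     if policy[s] is not None else 0.0)
                 for s in mdp.states}
        # Policy improvement: one-step lookahead with the evaluated values
        new_policy = {s: max(mdp.actions(s),
                             key=lambda a: sum(p * (r + gamma * V[s2])
                                               for p, s2, r in mdp.transitions(s, a)),
                             default=None)
                      for s in mdp.states}
        if new_policy == policy:
            return policy, V
        policy = new_policy
```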
- Value iteration and policy iteration are basically the same – both are variations of Bellman updates
- They all use one-step lookahead expectimax fragments
- They differ only in whether we plug in a fixed policy or max over actions
Specifically, reinforcement learning is still an MDP, but you can’t solve it with computation alone (the transition probabilities and rewards are unknown), so you need to actually act in order to figure them out.
- Exploration: you have to try unknown actions to get information
- Exploitation: eventually, you have to use what you know
- Regret: even if you learn intelligently, you make mistakes
- Sampling: because of chance, you have to try things repeatedly
- Difficulty: learning can be much harder than solving a known MDP
- Still assume a Markov decision process (MDP):
- A set of states s ∈ S
- A set of actions (per state) A
- A model T(s,a,s’)
- A reward function R(s,a,s’)
- Still looking for a policy π(s)
- New twist: don’t know T or R, must actually try actions and states out to learn
Learn an approximate model based on experience. Then solve for values as if the learned model were correct.
Learn empirical MDP model
- Count outcomes s’ for each s, a
- Normalize to give an estimate of T
- Discover each R when we experience (s, a, s’)
Solve the learned MDP
- For example, use value iteration, as before
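A sketch of the model-learning step, assuming each episode is a list of (s, a, r, s') tuples and rewards are deterministic; the learned T-hat and R-hat can then be handed to value iteration as above:

```python
from collections import defaultdict

def estimate_model(episodes):
    """Count outcomes for each (s, a), normalize into T_hat, and record observed rewards."""
    counts = defaultdict(lambda: defaultdict(int))
    R_hat = {}
    for episode in episodes:
        for s, a, r, s2 in episode:
            counts[(s, a)][s2] += 1      # count outcomes s' for each (s, a)
            R_hat[(s, a, s2)] = r        # discover R when we experience (s, a, s')
    T_hat = {sa: {s2: c / sum(outcomes.values()) for s2, c in outcomes.items()}
             for sa, outcomes in counts.items()}
    return T_hat, R_hat
```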
- Simplified task: policy evaluation
- Input: a fixed policy π(s)
- You don’t know the transitions T(s,a,s’)
- You don’t know the rewards R(s,a,s’)
- Goal: learn the state values
- Full reinforcement learning: optimal policies (like value iteration)
- You don’t know the transitions T(s,a,s’)
- You don’t know the rewards R(s,a,s’)
- You choose the actions now
- Goal: learn the optimal policy / values
- In this case:
- Learner makes choices!
- Fundamental tradeoff: exploration vs. exploitation
- This is NOT offline planning! You actually take actions in the world and find out what happens…
Compute values for each state under π
Average together observed sample values
- Act according to π
- Every time you visit a state, write down what the sum of discounted rewards turned out to be
- Average those samples
- It’s easy to understand
- It doesn’t require any knowledge of T, R
- It eventually computes the correct average values, using just sample transitions
- It wastes information about state connections
- Each state must be learned separately
- So, it takes a long time to learn
Temporal-difference (TD) value learning: sample-based policy evaluation; learn from every experience.
- Update V(s) each time we experience a transition (s, a, s’, r)
- Likely outcomes s’ will contribute updates more often
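A minimal sketch of the TD(0) update for one observed transition, assuming V is a plain dict keyed by state and α is the learning rate:

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Running exponential average: move V(s) toward the sample r + gamma * V(s')."""
    sample = r + gamma * V.get(s_next, 0.0)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
```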
TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages. However, if we want to turn the values into a (new) policy, we’re sunk. Solution: learn Q-values, not values.
- Q-learning converges to the optimal policy – even if you’re acting suboptimally (off-policy learning)
- Have to explore enough.
- Have to eventually make the learning rate small enough but not decrease it too quickly.
- Basically, in the limit, it doesn’t matter how you select actions.
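A minimal Q-learning sketch with ε-greedy action selection; the `env` interface (`reset`, `actions`, `step`) is an assumption for illustration, not the course framework:

```python
from collections import defaultdict
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Learn Q-values from interaction; the greedy policy w.r.t. Q converges to optimal."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            actions = list(env.actions(s))
            if random.random() < epsilon:                       # explore
                a = random.choice(actions)
            else:                                               # exploit current estimates
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(s, a)
            # Q-value update: running average toward the one-step sample
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q
```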