Machine learning applications have pushed artificial intelligence towards increasingly realistic results, largely thanks to research advances in neural networks, particularly deep neural networks. Progress in these networks has, in turn, driven growth in related areas of ML such as reinforcement learning (RL).
RL draws its inspiration from behavioural psychology: software entities known as ‘agents’ interact with an environment and learn by maximising numerical feedback signals called ‘rewards’. Although RL algorithms can use neural networks as function approximators, the combination is often unstable during training. This instability has impeded ML researchers for a while, and numerous techniques have been proposed to stabilise the performance of RL algorithms.
In this article, we focus on one such study by researchers at Google’s DeepMind, titled “Asynchronous Methods for Deep Reinforcement Learning”, which builds on gradient-descent optimisation.
Foundation For Asynchronous Applications
Online RL algorithms consume data as it is encountered: the reward depends on the action just taken, and the parameters are updated incrementally after each interaction. Because consecutive updates are strongly correlated, researchers improved on this process with a step called experience replay, in which past transitions are stored and re-sampled for training. However, replay takes a toll on computing resources such as memory and processing power, and it restricts learning to off-policy methods because the stored data was generated by an older policy.
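To see where the memory cost comes from, here is a minimal sketch of a replay buffer of the kind used with deep Q-networks. The capacity, transition format and sampling routine are illustrative assumptions, not the exact implementation used in any particular system.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions so they can be re-sampled for training."""

    def __init__(self, capacity=100_000):
        # Every stored transition occupies memory until it is evicted,
        # which is where the resource cost of experience replay comes from.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the correlation between consecutive
        # transitions, but the samples were generated by an older policy.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```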
To resolve this, asynchronous methods (independent computing processes that run in parallel) were developed. Instead of relying on replayed experience, the approach runs multiple agents in parallel, each on its own instance of the environment, which decorrelates the training data. In the study mentioned earlier, the asynchronous method is applied to typical RL algorithms such as Sarsa (state-action-reward-state-action), n-step Q-learning and actor-critic methods. The authors also emphasise the computational benefits the asynchronous methods provide, demonstrating that they run on a standard multi-core CPU instead of the powerful GPUs generally used in deep learning.
Asynchronous Method
The study develops its asynchronous algorithms against the backdrop of the standard RL setting, following a two-part design. First, the researchers use asynchronous actor-learners running as multiple CPU threads on a single machine, mainly to keep communication costs between learners low and make updates efficient. Second, they give each actor-learner its own exploration policy, so that the parallel online updates are less correlated with one another than the updates of a single agent would be. These parallel actor-learners bring manifold benefits, such as reducing training time and enabling stable online RL. A minimal sketch of this thread layout follows.
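The sketch below assumes shared parameters stored in a plain NumPy array and a placeholder per-thread loop; the thread count and epsilon values are illustrative choices, not the paper's settings.

```python
import threading
import numpy as np

# Globally shared parameter vector, updated asynchronously by every
# actor-learner thread (a stand-in for the shared network weights θ).
shared_theta = np.zeros(128)

def actor_learner(theta, thread_id, epsilon, steps=10_000):
    """Placeholder for one actor-learner thread.

    In the full algorithm (sketched after the pseudocode below) the thread
    interacts with its own copy of the environment using an ε-greedy policy
    with this thread's epsilon and accumulates gradients; here a tiny random
    update stands in for that work."""
    rng = np.random.default_rng(thread_id)
    for _ in range(steps):
        theta += 1e-4 * rng.standard_normal(theta.shape)  # in-place, lock-free update

# Each actor-learner gets its own exploration rate (illustrative values),
# so the threads follow different exploration policies.
epsilons = [0.5, 0.3, 0.1, 0.01]
threads = [threading.Thread(target=actor_learner, args=(shared_theta, i, eps))
           for i, eps in enumerate(epsilons)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("Shared parameters were updated by", len(threads), "threads")
```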
Here is the pseudocode the authors give for each actor-learner thread (asynchronous one-step Q-learning):
// Assume global shared θ, θ⁻, and counter T = 0
Initialize thread step counter t ← 0
Initialize target network weights θ⁻ ← θ
Initialize network gradients dθ ← 0
Get initial state s
repeat
    Take action a with ε-greedy policy based on Q(s, a; θ)
    Receive new state s′ and reward r
    y = r for terminal s′; y = r + γ max_a′ Q(s′, a′; θ⁻) for non-terminal s′
    Accumulate gradients w.r.t. θ: dθ ← dθ + ∂(y − Q(s, a; θ))²/∂θ
    s = s′
    T ← T + 1 and t ← t + 1
    if T mod I_target == 0 then
        Update the target network θ⁻ ← θ
    end if
    if t mod I_AsyncUpdate == 0 or s is terminal then
        Perform asynchronous update of θ using dθ
        Clear gradients dθ ← 0
    end if
until T > T_max
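To make the loop concrete, here is a hedged Python sketch of a single actor-learner performing asynchronous one-step Q-learning. The tabular Q-function, the toy chain environment and the hyperparameter values are illustrative assumptions chosen to keep the example self-contained; they are not the network architecture or settings used in the paper.

```python
import numpy as np

# Globally shared quantities from the pseudocode: parameters θ, target
# parameters θ⁻, and the global step counter T.
N_STATES, N_ACTIONS = 6, 2
theta = np.zeros((N_STATES, N_ACTIONS))        # tabular Q(s, a; θ)
theta_minus = theta.copy()                     # target network θ⁻
T = 0
T_MAX, I_TARGET, I_ASYNC_UPDATE = 50_000, 500, 5
GAMMA, ALPHA, EPSILON = 0.99, 0.1, 0.1

def step(s, a):
    """Toy chain environment (an illustrative assumption): action 1 moves
    right, action 0 moves left; reaching the last state ends the episode
    with reward 1."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    terminal = s_next == N_STATES - 1
    return s_next, (1.0 if terminal else 0.0), terminal

def actor_learner_thread(seed=0):
    """One actor-learner doing asynchronous one-step Q-learning. A single
    thread is run here for brevity; the paper launches many such threads
    in parallel, as sketched earlier."""
    global T, theta
    rng = np.random.default_rng(seed)
    t = 0
    d_theta = np.zeros_like(theta)             # accumulated gradients dθ
    s = 0                                      # get initial state s
    while T <= T_MAX:
        # ε-greedy action based on Q(s, ·; θ), breaking ties at random
        if rng.random() < EPSILON:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(rng.choice(np.flatnonzero(theta[s] == theta[s].max())))
        s_next, r, terminal = step(s, a)
        # One-step Q-learning target y
        y = r if terminal else r + GAMMA * np.max(theta_minus[s_next])
        # Accumulate the gradient of (y − Q(s, a; θ))² with respect to θ[s, a]
        d_theta[s, a] += 2.0 * (theta[s, a] - y)
        s = 0 if terminal else s_next          # restart the episode when it ends
        T += 1
        t += 1
        if T % I_TARGET == 0:
            theta_minus[:] = theta             # update the target network θ⁻ ← θ
        if t % I_ASYNC_UPDATE == 0 or terminal:
            theta -= ALPHA * d_theta           # asynchronous update of θ using dθ
            d_theta[:] = 0.0                   # clear gradients dθ ← 0

actor_learner_thread()
print("Greedy action per state after training:", np.argmax(theta, axis=1))
```

Running several such threads in parallel, each with its own exploration rate, gives the full asynchronous algorithm.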
With this template in place, asynchronous methods are developed for the existing algorithms given below:
- Asynchronous One-step Q-learning: Each thread interacts with its own copy of the environment and computes a gradient of the Q-learning loss at every step. Gradients are accumulated over several steps before being applied, which reduces the chance of different actor-learners overwriting each other’s updates.
- Asynchronous One-step Sarsa: Similar to the above algorithm, except that the target value for Q(s, a) is r + γ Q(s′, a′; θ⁻), where a′ is the action actually taken in the next state s′ (see the comparison of update targets after this list).
- Asynchronous n-step Q-learning: This variant computes n-step returns, taking a ‘forward view’ rather than the more common ‘backward view’. For a single update, the agent follows its exploration policy for up to n steps and then computes n-step Q-learning updates for each state-action pair visited along the way.
- Asynchronous Advantage Actor-Critic (A3C): This algorithm uses the same ‘forward view’ of n-step returns, but learns a policy directly: an actor maintains the policy π(a|s; θ) while a critic estimates the value function V(s; θ_v), and the advantage of the return over that estimate drives the policy update.
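For reference, the update targets that distinguish these variants can be written side by side, using the paper’s notation, with θ⁻ the target-network parameters and θ_v the critic’s parameters:

```latex
\begin{align*}
\text{One-step Q-learning:} \quad & y = r + \gamma \max_{a'} Q(s', a'; \theta^{-}) \\
\text{One-step Sarsa:} \quad & y = r + \gamma \, Q(s', a'; \theta^{-}) \\
n\text{-step Q-learning:} \quad & y = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1}
    + \gamma^{n} \max_{a} Q(s_{t+n}, a; \theta^{-}) \\
\text{A3C advantage:} \quad & A(s_t, a_t) = \sum_{i=0}^{k-1} \gamma^{i} r_{t+i}
    + \gamma^{k} V(s_{t+k}; \theta_v) - V(s_t; \theta_v) \\
\text{A3C policy gradient:} \quad & \nabla_{\theta'} \log \pi(a_t \mid s_t; \theta') \, A(s_t, a_t)
\end{align*}
```

In the paper, the A3C policy update also adds an entropy term to the objective to discourage premature convergence to a deterministic policy.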
These algorithms are tested on domains such as Atari 2600 games. They show a significant reduction in training time: as low as one day for the A3C algorithm on a multi-core CPU, compared to eight days for the deep Q-network on a GPU. The methods are also found to be more stable across a wide range of learning rates.
Conclusion
The asynchronous method in RL is resource-friendly and can be run in a small-scale learning environment. It shows improved data efficiency and faster training. Integrating it with existing RL algorithms can therefore reduce computing requirements while still achieving good accuracy when building large neural networks.