I'm running one service in Azure with 4 worker instances. When I scale up to 5 worker instances, the first instance that had started goes into the "busy" state. Why is that? What happens during scale-up? Does Azure re-run all the startup tasks? I'm very confused and can't find any documentation on this.
After scaling up to 5 instances the first instance changes its status to:
Busy (Waiting for role to start... Application startup tasks are running. [2014-08-12T18:36:52Z])
The Java process that was running on that instance also stops. Why would this happen?
Any help would be appreciated.
Startup.cmd
REM Log the startup date and time.
ECHO Startup.cmd: >> "%TEMP%\StartupLog.txt" 2>&1
ECHO Current date and time: >> "%TEMP%\StartupLog.txt" 2>&1
DATE /T >> "%TEMP%\StartupLog.txt" 2>&1
TIME /T >> "%TEMP%\StartupLog.txt" 2>&1
REM enable ICMP
netsh advfirewall firewall add rule name="ICMPv6 echo" dir=in action=allow enable=yes protocol=icmpv6:128,any
ECHO Starting WebService >> "%TEMP%\StartupLog.txt" 2>&1
tasklist /FI "IMAGENAME eq java.exe" 2>NUL | find /I /N "java.exe" >NUL 2>&1
if "%ERRORLEVEL%"=="0" GOTO running
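REM No Java process was found, so start the web service in the background.
START /B java -jar WEB-SERVICE-1_0--SNAPSHOT.jar app.properties >> "%TEMP%\StartupLog.txt" 2>&1
:running
REM Return exit code 0 so Azure treats the startup task as successful.
EXIT /B 0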
During a scale operation Azure will send a RoleEnvironmentTopologyChange via the Changing event to all existing instances. This lets those instances discover the new role instance in order to allow communication between the instances. Note that this only happens if you have an internal endpoint defined (if you turn on RDP then you implicitly get an internal endpoint).
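For illustration, an internal endpoint is declared in the role's ServiceDefinition.csdef roughly as in the fragment below. The role name `WorkerRole1` and the endpoint name `InternalEndpoint1` are placeholders, not values from the question's project.

```xml
<WorkerRole name="WorkerRole1">
  <Endpoints>
    <!-- Placeholder names; an internal endpoint like this makes the role
         receive topology change notifications when instances are added. -->
    <InternalEndpoint name="InternalEndpoint1" protocol="tcp" />
  </Endpoints>
</WorkerRole>
```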
By default these topology changes won't affect running instances. However, if you subscribe to the Changing event and set e.Cancel = true, the role instance will recycle and run your startup tasks again.
For more information on the topology change see http://azure.microsoft.com/blog/2011/01/04/responding-to-role-topology-changes/.
So there are two issues here:
- Why is your role not able to recover from a recycle? This is a significant issue and one you must fix in order to have a reliable service. You can start with the troubleshooting workflows at http://blogs.msdn.com/b/kwill/archive/2013/08/09/windows-azure-paas-compute-diagnostics-data.aspx, and in particular Scenario 3 at http://blogs.msdn.com/b/kwill/archive/2013/09/06/troubleshooting-scenario-3-role-stuck-in-busy.aspx.
- Why are you recycling your role instances in response to a topology change? Check your Changing event handler and make sure you aren't setting e.Cancel = true.
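As a rough sketch of what to look for, the handler below only requests a recycle when a configuration setting changed and leaves topology changes alone. The class name `WorkerRole`, the method name `OnChanging`, and the decision to recycle on setting changes are assumptions for illustration; your own handler may look different.

```csharp
using System.Linq;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Subscribe before the role starts handling work.
        RoleEnvironment.Changing += OnChanging;
        return base.OnStart();
    }

    private static void OnChanging(object sender, RoleEnvironmentChangingEventArgs e)
    {
        // Setting e.Cancel = true unconditionally here would recycle the
        // instance on every change, including the topology change that a
        // scale-out sends to existing instances.

        // Narrower check: request a recycle only for configuration setting changes.
        if (e.Changes.Any(change => change is RoleEnvironmentConfigurationSettingChange))
        {
            e.Cancel = true;
        }
    }
}
```

If your handler has an unconditional e.Cancel = true, or a check that also matches RoleEnvironmentTopologyChange, that would explain the recycle during scale-out.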
This is too long for a comment, so I'm just adding to what kwill has already said:
My ASP.NET Web Role didn't have e.Cancel = true anywhere but still got restarted after a scale-out (or rather recycled, with the environment being completely re-initialized for 10 minutes before OnStart() was even called, just like after a fresh deployment). So I went ahead and added an event handler that just sets what is already the default:
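using System.Linq;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        RoleEnvironment.Changing += (sender, e) =>
        {
            // Explicitly keep the default for topology changes: do not recycle.
            if (e.Changes.Any(change => change is RoleEnvironmentTopologyChange))
            {
                e.Cancel = false;
            }
        };

        // OnStart must return a value; defer to the base implementation.
        return base.OnStart();
    }
}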
And this helped! The role still becomes busy, but just for a few seconds instead of 15-20 minutes. It seems that only the website in the role restarts (or maybe the whole of IIS), but the role itself doesn't restart, nor is the whole environment reinitialized.