SGX securities market temporarily ceased trading as at 1138 hours

#31
Buying and installing the hardware for backup is but the first step. More important is to regularly test and rehearse failure scenarios, and this is where many organizations fail. This regular testing is a very costly exercise, especially if you want to do it properly and keep the plans updated after every change in software and infrastructure.

When considering mission criticality, you also have to weigh the cost of failure against the cost of ensuring that the DR actually works.
#32
(14-07-2016, 10:40 PM)tanjm Wrote: Buying and installing the hardware for backup is but the first step. More important is to regularly test and rehearse failure scenarios, and this is where many organizations fail. This regular testing is a very costly exercise, especially if you want to do it properly and keep the plans updated after every change in software and infrastructure.

When considering mission criticality, you also have to weigh the cost of failure against the cost of ensuring that the DR actually works.

These kinds of glitches shouldn’t be happening with such frequency in an international financial center.

The malfunction exceeded the regulator's maximum acceptable unscheduled downtime for financial institutions of four hours in any 12-month period.
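
To put that four-hour limit in perspective, here is a quick back-of-the-envelope calculation (a sketch only; the four-hours-per-12-months figure is the one quoted above):

[code]
# Availability implied by an unscheduled-downtime budget of 4 hours
# in any 12-month period (the threshold cited above).
HOURS_PER_YEAR = 365 * 24            # 8760 hours

downtime_budget_hours = 4
availability = 1 - downtime_budget_hours / HOURS_PER_YEAR

print(f"Implied availability: {availability:.4%}")                        # ~99.9543%
print(f"Downtime budget: {downtime_budget_hours * 60} minutes per year")
[/code]

A single outage lasting a few hours therefore uses up the entire year's budget.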
#33
This is (in theory) a non-political, value-investing forum. So unless you are an SGX investor (rather than someone with an axe to grind, a day trader, or someone working for the regulator), I would say "chill, dude, what's the big deal?"
#34
This thread is not about a stock counter, so I think it is good to hear opinions, not restrict freedom of views, and discuss things out.

Having said that, I have the impression SGX has been actively adding new earnings segments and operational complexity. It is part of growing pains. We are not happy, but it feels like a good problem to have.

Just my Diary
corylogics.blogspot.com/


#35
(14-07-2016, 09:37 PM)weijian Wrote:
(14-07-2016, 09:13 PM)thor666 Wrote:
(14-07-2016, 08:17 PM)weijian Wrote: There are many different kinds of backup for different systems.
For example, the Nov 2015 issue was due to a malfunction in a certain function of the backup generator.

COI found here: http://infopub.sgx.com/FileOpen/20150624...eID=357241

I would like to give my take on the previous incident. I agree with you that it is not acceptable. The explanation of a third-party DC failure did not address the concern that there is no disaster recovery (DR) backup data centre.

I do note that in Singapore we only have a single power grid, hence it is technically impossible to meet a Tier 4 DC standard. However, with the type of operations that SGX runs, I do not think an uptime service level of around 99.94% (roughly 5 hours of downtime out of 365*24 hours a year) or thereabouts is acceptable.

My guess is that cost is the reason for the lack of a DR data centre. My personal view is that this is the wrong priority.

(My opinion based on my limited work knowledge in the IT industry.)

Sent from my LG-H818 using Tapatalk

hi thor666,
There is an SDC (secondary data centre) that I reckon holds all the backups, and I reckon this secondary data centre is there for disaster recovery purposes, as you mentioned. What failed was (1) the backup generator, (2) followed by not reconsidering the decision to stick with the fail-safe method (recover from the PDC, the primary data centre) when new information became available that would have allowed a failover to the SDC... my guess is that failing over to the SDC would have allowed a much earlier recovery.

(14-07-2016, 10:40 PM)tanjm Wrote: Buying and installing the hardware for backup is but the first step. More important is to regularly test and rehearse failure scenarios, and this is where many organizations fail. This regular testing is a very costly exercise, especially if you want to do it properly and keep the plans updated after every change in software and infrastructure.

When considering mission criticality, you also have to weigh the cost of failure against the cost of ensuring that the DR actually works.

Nobody tests a backup until crap happens. Even those of us doing personal on-the-fly RAID backups rarely go and swap a drive to see whether it actually works.

It should not be called a backup. The solution is to run these two systems in parallel on alternate weeks or months. Then it becomes operationally ready and is built into the work routine. But humans will resist doing it, and yes, cost will be a factor, which is why I keep saying critical infrastructure and regulators should not be privatised and become P&L-focused.
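
To make the "alternate week or month" idea concrete, here is a minimal sketch (the site names and weekly rotation are assumptions for illustration, not SGX's actual setup) of choosing the active site by calendar week, so the standby regularly carries real production load:

[code]
# Hypothetical sketch: rotate the active site weekly so the "backup"
# is regularly exercised under real production load.
from datetime import date

SITES = ("PDC", "SDC")   # primary and secondary data centres (names assumed)

def active_site(today: date) -> str:
    """Even ISO weeks run on the first site, odd ISO weeks on the second."""
    week = today.isocalendar()[1]
    return SITES[week % 2]

if __name__ == "__main__":
    print(f"This week's active site: {active_site(date.today())}")
    # Traffic routing, monitoring and replication direction would follow this choice.
[/code]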
Before you speak, listen. Before you write, think. Before you spend, earn. Before you invest, investigate. Before you criticize, wait. Before you pray, forgive. Before you quit, try. Before you retire, save. Before you die, give. –William A. Ward

Think Asset-Business-Structure (ABS)
#36
(15-07-2016, 10:25 AM)specuvestor Wrote: Nobody tests a backup until crap happens. Even those of us doing personal on-the-fly RAID backups rarely go and swap a drive to see whether it actually works.

It should not be called a backup. The solution is to run these two systems in parallel on alternate weeks or months. Then it becomes operationally ready and is built into the work routine. But humans will resist doing it, and yes, cost will be a factor, which is why I keep saying critical infrastructure and regulators should not be privatised and become P&L-focused.

Exactly. DR is a crapshoot. And yes, running your DR as an active, parallel production system is indeed one of the more practical solutions. Unfortunately, you have to keep doing design reviews to ensure that your parallel system is truly independent of the first one! There are no easy solutions here.

Actually, I'm not sure I would call the SGX exchange "critical infrastructure" compared with water, electricity and telecoms (day traders might differ). Once again, the cost of DR versus the cost of failure needs to be weighed (along with the regulator's preferences). It is up to the regulator to insist on properly tested DR plans, which SGX would have to comply with to stay in business. That would in turn reduce SGX's ROE, or SGX would have to find ways to keep its profits up (e.g. raise fees).

btw, SGX is only a partial monopoly. In today's world, traders have easy access to practically any stock market in the world.

SGX has a supervisory function over companies, etc., but MAS certainly exerts regulatory influence over SGX operations. See http://www.mas.gov.sg/news-and-publicati...tages.aspx

SGX was actually penalised and had to comply with MAS instructions on follow-up.
#37
I believe that under the MAS mandate, all financial services companies in SG need to have proper redundancies in place, including DR and cluster-like systems, and restoration of tape media is tested at least twice a year.


It's very easy just to meet the MAS standard; money is not a problem for these big financial organizations. They just need to buy enough equipment, prove it works, and show it has been tested annually.
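
As a concrete (and deliberately simplified) illustration of "prove it works": restore a backup into a scratch location and compare checksums against the live copy. All paths and names below are placeholders, not anyone's actual procedure.

[code]
# Hypothetical restore-verification sketch: after restoring a backup into a
# scratch directory, compare SHA-256 checksums file by file against the source.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(source_dir: str, restored_dir: str) -> list:
    """Return the relative paths that are missing or differ after the restore."""
    src, dst = Path(source_dir), Path(restored_dir)
    mismatches = []
    for f in src.rglob("*"):
        if f.is_file():
            rel = f.relative_to(src)
            restored = dst / rel
            if not restored.exists() or sha256(f) != sha256(restored):
                mismatches.append(str(rel))
    return mismatches

# Example (placeholder paths):
# print(verify_restore("/data/prod", "/restore/scratch"))
[/code]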


The reality, and the biggest unknown I see, is the application. If your system is massive and has grown into such a complex beast, it is very hard to test all angles, and most of them go for a scaled-down DR test, not out of negligence, but because there is really not enough time, and time is money to these people; we are talking about millions a day in revenue, so every hour and minute of downtime counts.


Most financial trading companies offer access to different markets, and market opening times in the region overlap.
9am - 5pm: Singapore market; what about Malaysia, Indonesia, HK, the UK, Japan, etc.?


When Singapore and the regional markets close, they have a few hours to quickly process the backlog in time for the US market opening, which is the highlight of the day; after it closes, they have a few more hours to complete processing before the SG market opens again, and the whole cycle repeats.


If you have a system running literally 24/7, the only time all the markets are closed is the weekend, and even then you have to observe timezones and work to meet cut-off timings before markets reopen.


Where SGX differs, from what I heard, is that it also offers hosting services to other financial organizations, which may in turn offer their own customers access to the region. So it is complex enough that, even if SGX wanted to, it would have to consult these "customers" for approval before anything could be scheduled.


So with such a restrictive window, it is not humanly possible to test everything. I don't know how SGX does its DR, but most big financial organizations have this challenge and will go for a scaled-down test, and that is the biggest danger.


A few years ago we had a customer that was the commodity-trading arm of a major foreign bank which I will not name. Every year there would be a global DR exercise with all the regional stakeholders involved, where they would use a scaled-down system to simulate a "controlled" system crash, test recovery of the tape media and system functionality afterwards, and see how fast they could do it. It was a scaled-down version of the entire system, which was massive, yet it still took almost 12 hours just to run; and every year they just followed the well-rehearsed routine script, we would pass, and the big regional bosses would pat themselves on the back.


One day a routine maintenance went wrong, quickly spun out of control, and turned into a real system disaster, and they activated a real DR. They found out that the DR exercise done every year was pretty useless, because it did not match reality: when you try to use a scaled-down system for the real, full-sized thing, it cannot handle the load.


So to me a DR test is better than nothing, but I think it is not very practical, because it is likely done in a scaled-down, controlled environment. System recovery will work if tested regularly, but the biggest unknown factor is the application, because it has never been tested in a real-life situation.
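
The capacity mismatch behind that kind of failure is easy to show with toy numbers (purely illustrative, not real SGX or bank figures):

[code]
# Toy illustration of the scaled-down DR problem: a DR environment sized at a
# fraction of production cannot absorb the real peak load. Numbers are invented.
prod_peak_tps = 50_000        # assumed production peak, transactions per second
dr_scale_factor = 0.25        # assumed DR environment sized at 25% of production
dr_capacity_tps = prod_peak_tps * dr_scale_factor

if dr_capacity_tps < prod_peak_tps:
    shortfall = prod_peak_tps - dr_capacity_tps
    print(f"Shortfall at peak: {shortfall:,.0f} TPS "
          f"({shortfall / prod_peak_tps:.0%} of peak load unserved)")
[/code]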


For that disaster outage, going from live to DR and back to live took almost 3 days.
#38
"and every year they just follow the well rehearsed routine script and we will pass and the big regional bosses pat themselves on the back.One day there was a routine maintanenece which suddenly went wrong and quickly spun out of control and turned into a real system disaster and they activated a real DR. They found out the DR exercise that was done every year was pretty useless because it doesn't meet the reality when you try to use a scaled down system for the real upsized thing it cannot meet the load"

lolz! Sounds like SingPost to me!! Lack of governance?
1) Try NOT to LOSE money!
2) Do NOT SELL in BEAR, BUY-BUY-BUY! invest in managements/companies that does the same!
3) CASH in hand is KING in BEAR! 
4) In BULL, SELL-SELL-SELL! 
#39
(14-07-2016, 10:34 PM)thor666 Wrote:
Thank you for sharing this.
It seems they thought it might be riskier to trigger the SDC. Not sure if that is a good thing.


Sent from my iPad using Tapatalk


hi thor666,
From my understanding of the COI summary, it is riskier to trigger the SDC if only a partial blackout occurs at the PDC. For a layman like me, this implies that certain parts of the PDC will still be working while the SDC may not have the latest updates from the partially operating PDC. But if the PDC blacks out totally, the SDC will have an exact duplicate of the information, and in that case it is less risky to trigger the SDC.

The area of improvement (AOI) for SGX was in its communications and disaster recovery procedures, i.e. SGX chose the correct path in not triggering the SDC initially, but as more information became available (or perhaps was not properly disseminated), it should have failed over to the SDC, and it didn't.
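
A minimal sketch of that decision logic as I read the COI summary (an interpretation for illustration only, not SGX's actual runbook; the function and flag names are made up):

[code]
# Hypothetical sketch of the fail-over decision described above.
def choose_recovery(pdc_total_blackout: bool) -> str:
    if pdc_total_blackout:
        # The SDC holds the last fully replicated state, so failing over is the safer path.
        return "fail over to SDC"
    # Partial outage: parts of the PDC are still ahead of the SDC replica,
    # so failing over risks losing the latest state.
    return "recover in place at PDC, and re-evaluate as new information arrives"

print(choose_recovery(pdc_total_blackout=False))   # the initial Nov 2015 situation, as described above
[/code]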
#40
(15-07-2016, 03:10 PM)weijian Wrote:
(14-07-2016, 10:34 PM)thor666 Wrote:
Thank you for sharing this.
It seems they thought it might be riskier to trigger the SDC. Not sure if that is a good thing.


Sent from my iPad using Tapatalk

hi thor666,
From my understanding of the COI summary, it is riskier to trigger the SDC if only a partial blackout occurs at the PDC. For a layman like me, this implies that certain parts of the PDC will still be working while the SDC may not have the latest updates from the partially operating PDC. But if the PDC blacks out totally, the SDC will have an exact duplicate of the information, and in that case it is less risky to trigger the SDC.

The area of improvement (AOI) for SGX was in its communications and disaster recovery procedures, i.e. SGX chose the correct path in not triggering the SDC initially, but as more information became available (or perhaps was not properly disseminated), it should have failed over to the SDC, and it didn't.

The SDC is meant for disaster recovery/business continuity. A DR/BC system is normally not triggered for a system failure; it is only triggered in catastrophic events like an earthquake, severe flood, terrorist attack, etc. A DR/BC setup is not merely a redundant system or a high-availability (HA) system. Triggering a DR is a massive exercise: it is not just about hardware, but also impacts connectivity (some re-routing is involved), staff relocation, and process changes. The whole process of activating DR is not something that can be completed in a few minutes, and it is not a decision made lightly; it comes from the top echelon. Note that DR is never triggered automatically, unlike redundancy.

I believe companies that have a DR solution will normally do a dry run once or twice a year. Redundancy and HA, on the other hand, are tested during maintenance, upgrades, software patching, etc., so there are plenty of opportunities to test.
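
To illustrate the distinction drawn above, here is a rough sketch (hypothetical endpoint and thresholds) of the automatic side of the house: HA reacts to a failed health check by itself, while DR activation stays a deliberate, human decision.

[code]
# Hypothetical sketch: redundancy/HA fails over automatically on repeated
# failed health checks; DR/BC activation remains a human, top-level decision.
import time
import urllib.request

PRIMARY_HEALTH_URL = "https://primary.example.internal/health"   # placeholder endpoint

def healthy(url: str, timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def monitor(max_failures: int = 3, interval_s: float = 5.0) -> None:
    failures = 0
    while True:
        failures = 0 if healthy(PRIMARY_HEALTH_URL) else failures + 1
        if failures >= max_failures:
            print("HA: promoting local standby automatically")
            print("DR: paging the crisis team; a site-level failover needs "
                  "a management decision, not this script")
            return
        time.sleep(interval_s)

# monitor()   # loops until the primary fails its health check repeatedly
[/code]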

