The following is a sample report that ArsDigita's Scaling Team wrote for
one of its clients. It describes how ArsDigita was able to scale the
application-specific code on one of its more complicated Web sites
through a methodical process. ArsDigita has scrubbed the report of
company and URL names in order to protect the identity of its client.
ArsDigita presents the modified paper here as an example of how to solve
difficult scalability problems. For more information about building
scalable Web sites, see the related articles.
1 Introduction
ArsDigita's Scaling Team evaluated the Site X Web site in its Cambridge
Scaling Lab. This evaluation had two main goals:
- Determine the existing performance and scalability of Site X
- Ensure that Site X would scale well enough to handle expected
traffic increases coming from a COMPANY Y-Site X joint venture
1.1 Requirements
1.1.1 Throughput
Site X required that its Web site scale acceptably up to four times its
existing load levels in order to prepare for the COMPANY Y-generated
traffic. Based on Site X's current peak throughput of about 1.6
pages/second, the overall site would have to support a throughput of at
least 6.4 pages/second. Thus, ArsDigita and Site X set as a requirement
that Site X must support a throughput of at least 6.5 pages/second.
1.1.2 Page Performance
The industry-standard requirement for individual Web-page performance is
that 90th percentile page load times should be within eight seconds.
ArsDigita used this standard for determining whether Site X's Web pages
were performing acceptably during load tests.
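This pass/fail criterion can be expressed as a small check. The sketch below assumes page load times are collected in seconds; the helper names are illustrative and not part of any testing tool.

```python
def percentile(samples, p):
    """Return the p-th percentile of samples (nearest-rank method)."""
    ordered = sorted(samples)
    # Nearest rank: smallest value with at least p% of samples at or below it.
    rank = max(1, -(-len(ordered) * p // 100))  # ceiling of len * p / 100
    return ordered[rank - 1]

def page_passes(latencies_sec, threshold_sec=8.0):
    """A page passes if its 90th-percentile load time is within the threshold."""
    return percentile(latencies_sec, 90) <= threshold_sec
```

Note that a single slow outlier does not fail a page under this criterion; one request in ten may exceed eight seconds.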
1.2 Scalability Findings and Results
ArsDigita's Scaling Team worked iteratively with the Site X development
team in order to test Site X, find its bottlenecks, and tune its
architecture. From this work, the scaling team determined and
accomplished the following:
- Site X's initial AOLserver configuration was not optimal. For each
Netra, doubling the number of AOLserver instances and changing various
AOLserver initialization parameters increased Site X's scalability and
stability
- The page, /www/page3.adp, performed and scaled poorly. ArsDigita
improved this page so that it was no longer a bottleneck in the system
- The page, /www/page4.adp, performed and scaled poorly. ArsDigita
improved this page so that it was no longer a bottleneck in the system
- Many of the Netra front-end servers suffered from memory problems.
ArsDigita traced this memory issue to poor thread-handling and was able
to resolve the problem
- Under load testing in a lab environment, Site X scaled acceptably to
a throughput of about 49 pages/second, assuming the following
configuration of five Netras:
- Two consumer-site Netras
- Two XML Netras
- One Netra serving images, running the cache, and handling XML
bulk downloads
- Site X's AOLservers should be restarted twice a day to ensure
long-term stability of the site
Following is a detailed description of how ArsDigita arrived at these
findings:
2 Testing Approach
ArsDigita performed three rounds of testing while working on Site X:
- An initial assessment of Site X's performance and scalability
- A reassessment of Site X's performance and scalability after
eliminating bottlenecks found in round one. This included
troubleshooting and working alongside the Site X project team.
- A final test of Site X's performance and scalability based upon new
hardware purchased for the COMPANY Y-Site X joint venture
3 Round One Testing
3.1 Test Environment
ArsDigita performed all Round One testing within its Cambridge Scaling
Lab. In this lab, ArsDigita used the following testing setup:
- Sun e450 Server
- 9 Sun Netras with 256 MB of RAM
- BigIP F5 load balancer
- 100 Mbps Ethernet
- 6 PCs running Empirix e-Load for performing load tests
For load testing, ArsDigita exported a copy of Site X's production
database into the scaling lab's e450. It then set up the e450 and Netras
with Site X's code base and produced a mirror copy of Site X's
production environment for load testing.
3.2 Index+3 Script
Site X has a custom Web traffic profiler on its production site.
ArsDigita used this profiler to gather information about Site X's
traffic patterns. From analyzing this data, ArsDigita found that one
path through Site X, consisting of four pages, accounted for about 70% of the
site's traffic. Therefore, ArsDigita focused its Round One testing
efforts on this particular path.
ArsDigita wrote an e-Load script, called Index+3, to test this path.
This path consisted of Site X's index page, and then three additional
pages in the site:
- /index.adp
- /page2.adp
- /page3.adp
- /page4.adp
ArsDigita used a databank to vary values for things such as originating
cities when running Index+3. This allowed testing using a variety of
data.
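e-Load's databank feature is proprietary, but its effect, parameterizing each virtual user's requests from a table of values, can be sketched as follows (the URL template and field name are illustrative, not Site X's actual parameters):

```python
import csv
import io
import itertools

def databank_rows(csv_text):
    """Cycle through databank rows forever, as a load tool does when it
    runs more iterations than it has rows of test data."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return itertools.cycle(rows)

def next_request(rows, template="/page3.adp?city={city}"):
    """Build the next parameterized request path from the databank."""
    return template.format(**next(rows))
```

Each virtual-user iteration draws the next row, so successive page hits exercise different data rather than hammering one cached value.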
3.3 Test Results
ArsDigita ran Index+3 repeatedly, adding Netras for each successive run,
to see how Site X's performance would scale with more front-end servers.
ArsDigita tested in configurations ranging from three to nine Netras.
ArsDigita did not employ any delays between page clicks for these tests.
The initial run used the production setup of three Netras. Next,
ArsDigita tried a different configuration with a separate image server
and reduced main servers:
Netra 1         | Netra 2 | Netra 3
----------------+---------+--------
Cache-1, Main-1 | Main-2  | Image

Table 1 Test Configuration with Separate Image Server
This did not seem to produce any significantly different results from
the previous configuration.
Then, ArsDigita gave each server instance its own Netra:
Netra 1 | Netra 2 | Netra 3 | Netra 4 | Netra 5 | Netra 6
--------+---------+---------+---------+---------+--------
Cache   | Image   | Main    | Main    | Main    | Main

Table 2 Test Configuration with Six Netras
ArsDigita continued to add Netras until it reached a total of nine
Netras:
Netra 1 | Netra 2 | Netra 3 | Netra 4 | Netra 5 | Netra 6 | Netra 7 | Netra 8 | Netra 9
--------+---------+---------+---------+---------+---------+---------+---------+--------
Cache   | Image   | Main    | Main    | Main    | Main    | Main    | Main    | Main

Table 3 Test Configuration with Nine Netras
The following table summarizes the results of these tests:
                                          | 3 Netras   | 6 Netras   | 9 Netras
------------------------------------------+------------+------------+-----------
Max Users w/o Errors                      | 40         | 60         | 55
Max Users before 8 sec. Latency           |            |            |
    Page 1                                | n/a        | n/a        | n/a
    Page 2                                | n/a        | n/a        | n/a
    Page 3                                | 6          | 14         | 26
    Page 4                                | 20         | 32         | n/a
Performance at 40 Users                   | 72.75 sec. | 37.39 sec. | 21.04 sec.
Average Throughput (transactions/second)  | 0.37       | 0.66       | 1.14
Average Throughput (extrapolated pages/s) | 1.48       | 2.64       | 4.56
Throughput Bottleneck                     | 10 users   | 10 users   | 10 users

Table 4 Summary of Index+3 Results
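The "extrapolated pages/s" row follows from the fact that one Index+3 transaction walks four pages, so page throughput is four times transaction throughput. A quick check (a sketch, not part of e-Load):

```python
PAGES_PER_TRANSACTION = 4  # Index+3 visits /index.adp plus three more pages

def pages_per_second(transactions_per_second):
    """Convert transaction throughput to page throughput for Index+3."""
    return transactions_per_second * PAGES_PER_TRANSACTION

# The three Table 4 configurations: 3, 6, and 9 Netras.
for tps, expected in [(0.37, 1.48), (0.66, 2.64), (1.14, 4.56)]:
    assert abs(pages_per_second(tps) - expected) < 1e-9
```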
3.3.1 Page Latency
A common performance requirement for Web sites is that no page should
have 90th percentile load times greater than eight seconds. Therefore,
although a page may eventually load without errors, if it takes a long
time to load, it may still be considered broken.
In these tests, pages one and two (/index.adp and /page2.adp) always
performed within eight seconds before the Web site began to exhibit
errors. Page three, /page3.adp, however, was extremely slow; it
consistently took too long to load early in the testing ramp-ups.
Page four, /page4.adp, was also slow, although it performed within
acceptable limits when tested with nine Netras.
The following graphs illustrate the page latencies from these test runs:
Figure 1 Page Performance Vs Users (3 Netras)
Figure 2 Page Performance Vs Users (6 Netras)
Figure 3 Page Performance Vs Users (9 Netras)
3.3.2 Relative Performance
Adding Netras linearly improved the overall performance of the site. For
example, at 40 concurrent users, the index+3 script took 72.75 seconds
to complete with three Netras, 37.39 seconds with six Netras, and 21.04
seconds with nine Netras.
The following graph illustrates Site X's overall performance versus
users for Index+3 testing:
Figure 4 Round One Index+3 Performance Vs. Users
One striking characteristic of these performance versus users results is
that they are linear. A typical performance versus users graph should
take the following form:
Figure 5 Typical Performance Versus Users Graph
A Web site's performance should remain flat until it hits a bottleneck
in the system. Once the site encounters a bottleneck, it trades
performance for users: the more users the site handles, the slower it
performs.
Site X's performance-versus-users graphs are not flat anywhere, which
means that Index+3 testing encountered a fundamental bottleneck as soon
as testing began. The bottleneck most likely showed up so quickly
because the testing pace was too fast: the virtual users had no delay
between page clicks, so this test pushed the system as hard as possible
even with only one user.
3.3.3 Throughput
Adding Netras increased the overall throughput of the site. However,
even with nine Netras, the site did not scale up to 6.5 pages/second.
This indicated that ArsDigita would have to tune Site X further in order
to meet its scalability requirement.
A typical throughput curve for a Web site will take the following form:
Figure 6 Typical Statistics Vs Time Graph
As long as a Web site's throughput is increasing as the number of
concurrent users increases, it has not hit a major bottleneck. But,
once the site's throughput levels off, the site has hit a bottleneck as
it is no longer able to achieve further throughput despite the increased
number of users.
Site X's throughput graphs took the following form:
Figure 7 Index+3 Transactions/Second - 9 Netras
This graph shows that Site X hit its scalability limit almost
immediately after testing began. Adding Netras did, however, increase
overall throughput. This, like the performance-versus-users graphs,
indicates that the virtual user pace for Round One testing was too fast.
3.4 Round One Findings
From this first round of testing, ArsDigita determined the following:
- ArsDigita needed to tune Site X in order to improve the Web site's
scalability from 4.56 pages/second to beyond 6.5 pages/second
- Site X had two slow pages which needed tuning: /page3.adp and
/page4.adp
In addition, ArsDigita decided to modify its testing strategy for Round
Two. First, ArsDigita would introduce delays between virtual user
clicks in order to slow down the testing process. This would enable
ArsDigita to more easily identify at what points scalability bottlenecks
manifested themselves.
ArsDigita would also add additional pages for testing during Round Two.
This would serve two purposes: it would allow ArsDigita to test the
scalability and performance of new XML pages written for the COMPANY Y
integration, and it would allow ArsDigita to make a more complete
assessment of Site X's overall scalability.
4 Round Two Testing
4.1 Test Environment
ArsDigita performed all Round Two testing within its Cambridge Scaling
Lab.
4.2 AOLserver Configuration
The first thing that ArsDigita did in Round Two was to optimize Site X's
server setup. From previous experience and also through tests for
confirmation, ArsDigita determined that Site X should configure its
front-end servers so that:
- Each Netra runs two instances of AOLserver rather than one instance.
This helps increase throughput to the database and also provides some
measure of redundancy
- Each AOLserver should change its MaxThreads parameter from the
default of 100 to 10
- Each AOLserver should use eight main database handles
ArsDigita ran its major tests in Round Two with these configurations in
place.
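AOLserver reads a Tcl-style configuration file, so the settings above would be expressed along the following lines. This fragment is illustrative only: the server and pool names are placeholders, not Site X's actual configuration.

```tcl
# Illustrative AOLserver config fragment; "sitex" and "main" are placeholders.
ns_section "ns/server/sitex"
ns_param MaxThreads 10        ;# down from the default of 100

ns_section "ns/db/pool/main"
ns_param connections 8        ;# eight main database handles
```

With two AOLserver instances per Netra, each instance would carry its own copy of these settings.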
4.3 Tuning /page3.adp
During Round One Testing, ArsDigita found the page, /page3.adp, to be
slow and a bottleneck in overall scalability. Therefore, ArsDigita
spent some time optimizing this page. One of the principal
optimizations ArsDigita performed was to cache the query results for
this particular page. Once ArsDigita had finished its tuning, it load
tested the page to see how it would now perform.
ArsDigita used the following configuration for this test:
Number of Netras                  | 1
AOLservers/Netra                  | 1
MaxThreads Parameter              | 10
Number of Main Database Handles   | 8
Ramp Up                           | 5 users, every 3 iterations
Virtual User Delay Between Clicks | 10 seconds

Table 5 Test Configuration for /page3.adp Test
Note that ArsDigita added a ten-second delay between virtual user
page-clicks to slow down the testing pace.
The results of this load test are illustrated in the following graphs:
Figure 8 /page3.adp Performance Vs Users
Figure 9 page3.adp Pages/Second Throughput
As these two charts illustrate, /page3.adp performed quite well after
optimization: on a single Netra running one AOLserver instance, it
scaled to 250 virtual users and achieved a steady-state throughput of
about 36 pages/second. These results were quite acceptable, so
ArsDigita next turned to tuning the other slow page in the Index+3
script.
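The report does not show the caching mechanism itself, but the idea of caching query results for a hot page can be sketched as a simple time-bounded memo table. The function and cache names here are illustrative, not Site X code:

```python
import time

_cache = {}  # maps a query key to a (timestamp, result) pair

def cached_query(key, run_query, ttl_seconds=300):
    """Return a cached result for `key` if it is still fresh; otherwise run
    the query, store the result, and return it. `run_query` is a
    zero-argument callable that performs the expensive database work."""
    now = time.time()
    hit = _cache.get(key)
    if hit is not None and now - hit[0] < ttl_seconds:
        return hit[1]          # fresh hit: skip the database entirely
    result = run_query()
    _cache[key] = (now, result)
    return result
```

Under load, repeated requests within the TTL are served from memory instead of re-running the query, which is the effect that turned a slow page into a fast one.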
4.4 Tuning /page4.adp
ArsDigita spent several rounds optimizing /page4.adp and then re-testing
it to evaluate the results of the optimizations. At the beginning of
the optimization process, this page scaled to about 30 virtual users
with a throughput of about 1.12 pages/second on one Netra. Furthermore,
even at these user levels, it performed unacceptably beyond eight-second
load times.
After ArsDigita finished tuning /page4.adp, the page performed
acceptably up to around 40 users and scaled up to 90 users. Its average
throughput also increased fivefold to 5.56 pages/second. ArsDigita
deemed these figures acceptable because this page performs several
expensive computations.
ArsDigita used the following setup for evaluating this page:
Number of Netras                  | 1
AOLservers/Netra                  | 1
MaxThreads Parameter              | 10
Number of Main Database Handles   | 8
Ramp Up                           | 5 users, every 3 iterations
Virtual User Delay Between Clicks | 10 seconds

Table 6 Configuration for /page4.adp Test
The following graphs illustrate the improvements in this page:
Figure 10 Comparison of /page4.adp Performance Vs. Users
Figure 11 Comparison of /page4.adp Throughput: Pages/Second
4.5 Index+3 Comparison
Once ArsDigita had tuned /page3.adp and /page4.adp, it re-ran the
Index+3 test to see how the overall script performance would improve.
However, for this test, ArsDigita used a different server configuration
from the initial Round One Index+3 test. In Round One, ArsDigita had
used up to nine Netras, of which seven served primary content. For this
round, though, ArsDigita only used four Netras for serving primary
content due to server availability in the lab. Therefore, when
comparing Index+3 results between Round One and Round Two, keep in mind
that the Round Two tests ran on roughly half the front-end hardware of
Round One. Nevertheless, the results in Round Two were still significantly
better.
                                  | Round One | Round Two
----------------------------------+-----------+----------
Number of Main Netras             | Up to 7   | 4
Number of AOLservers/Netra        | 1         | 2
MaxThreads Parameter              | 100       | 10
Virtual User Delay Between Clicks | 0 seconds | 0 seconds

Table 7 Comparison of Index+3 Test Setup for Round One Vs. Round Two
Note that ArsDigita ran its Round Two Index+3 test with no virtual user
delays in order to compare with Round One results.
The following graphs show the Index+3 results following ArsDigita's
optimizations:
Figure 12 Comparison of Index+3 Results: Performance Vs. Users
Figure 13 Comparison of Index+3 Results: Throughput
As these graphs show, ArsDigita's tuning produced significantly improved
results while running on less hardware than in Round One testing.
4.6 XML Testing
A large part of the Site X-COMPANY Y joint venture involved ArsDigita's
development of various XML-generating pages. ArsDigita individually
tested the two particular XML pages expected to receive the bulk of the
new XML-based traffic.
4.6.1 XML-1
The first XML page ArsDigita examined in isolation was xml-1. ArsDigita
used the following setup to test this page:
Number of Netras                  | 1
AOLservers/Netra                  | 1
MaxThreads Parameter              | 10
Number of Main Database Handles   | 8
Ramp Up                           | 2 users, every 10 iterations
Virtual User Delay Between Clicks | 10 seconds

Table 8 Configuration for xml-1 Test
From this analysis, ArsDigita found that xml-1 scaled well up to about
45 users and an average throughput of about 37.1 pages/second for one
AOLserver on a single Netra:
Figure 14 xml-1 Performance Vs. Users
Figure 15 xml-1 Throughput: Pages/Second
Based on these results, ArsDigita concluded that xml-1 scaled
adequately.
4.6.2 xml-2
The other XML page which ArsDigita tested was xml-2. ArsDigita set up
this test as follows:
Number of Netras                  | 1
AOLservers/Netra                  | 1
MaxThreads Parameter              | 10
Number of Main Database Handles   | 8
Ramp Up                           | 5 users, every 10 iterations
Virtual User Delay Between Clicks | 10 seconds

Table 9 Configuration for xml-2 Test
From this test, ArsDigita found that xml-2 scaled acceptably to about 90
users and a steady-state throughput of about 20 pages/second on one
AOLserver running on a single Netra.
Figure 16 xml-2 Performance Vs. Users
Figure 17 xml-2 Throughput: Pages/Second
Based on these results, ArsDigita concluded that xml-2 scaled
adequately.
4.7 Site-Wide Test
The final test that ArsDigita performed during Round Two was a site-wide
test. ArsDigita used this test to gain an estimate of Site X's overall
scalability.
To mimic projected traffic patterns following the joint Site X-COMPANY Y
launch, ArsDigita wrote additional scripts and classified each script as
either background or foreground. Background scripts would consist mostly
of XML bulk-download and administration activity, whereas foreground
scripts would consist of pages that users would frequently visit.
ArsDigita further divided foreground scripts into two types: XML scripts
and consumer-site scripts. Because Site X was expecting that the
XML-generating pages would receive far more traffic than the regular
Site X pages, ArsDigita decided to deploy foreground scripts at a 5:1
XML:consumer-site ratio.
ArsDigita's e-Load license supports up to 500 virtual users. Therefore,
ArsDigita deployed its users using the background, XML-foreground, and
consumer-site foreground scripts as follows:
Script  | Number of Users | User Frequency
--------+-----------------+-----------------
xml-3   | 1               | every 10 minutes
xml-4   | 1               | every 10 minutes
xml-5   | 1               | every 10 minutes
xml-6   | 1               | every 10 minutes
xml-7   | 1               | every 10 minutes
xml-8   | 1               | every 10 minutes
xml-9   | 1               | every 10 minutes
xml-10  | 1               | every 10 minutes
xml-11  | 1               | every 10 minutes
xml-12  | 1               | every 10 minutes
trace-1 | 2               | every 5 minutes
trace-2 | 2               | every 5 minutes
trace-3 | 2               | every 5 minutes
--------+-----------------+-----------------
Total   | 16 Background Users

Table 10 Site-Wide Test Background Users
Script             | Number of Users | User Frequency
-------------------+-----------------+-----------------
/index.adp         | 20              | every 10 seconds
/page2.adp         | 20              | every 10 seconds
/page3.adp         | 20              | every 10 seconds
/page4.adp         | 11              | every 10 seconds
/page4-special.adp | 10              | every 10 seconds
-------------------+-----------------+-----------------
Total              | 81 Main Foreground Users

Table 11 Site-Wide Test Consumer-Site Foreground Users
Script | Number of Users | User Frequency
-------+-----------------+-----------------
xml-1  | 101             | every 10 seconds
xml-2  | 201             | every 10 seconds
xml-3  | 101             | every 10 seconds
-------+-----------------+-----------------
Total  | 403 XML Foreground Users

Table 12 Site-Wide Test XML Foreground Users
All of these scripts were one-page scripts, except for the trace-*
scripts. These trace scripts performed administration functions and
also involved logging into the Site X site.
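The user allocation above can be sanity-checked against the 500-user e-Load license and the intended 5:1 XML-to-consumer foreground ratio (a quick sketch, not test tooling):

```python
# User counts as deployed in the site-wide test.
background = 10 * 1 + 3 * 2          # ten xml-* users plus three trace-* pairs
consumer   = 20 + 20 + 20 + 11 + 10  # consumer-site foreground scripts
xml_fg     = 101 + 201 + 101         # XML foreground scripts

assert background == 16
assert consumer == 81
assert xml_fg == 403
assert background + consumer + xml_fg == 500   # fits the e-Load license
assert round(xml_fg / consumer) == 5           # roughly the 5:1 target
```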
The configuration ArsDigita used for testing was as follows:
Number of Netras                  | 2 Main, 2 XML, 1 Image, 1 Cache*
AOLservers/Netra                  | 2
MaxThreads Parameter              | 10
Number of Main Database Handles   | 8
Virtual User Delay Between Clicks | 10 seconds
Ramp Up                           | Background: 16 users, all at once
                                  | Foreground: 5 users, every 3 minutes

Table 13 Site-Wide Test Configuration
*The Cache did not actually run during this test.
During this test, the site achieved an overall throughput of about 20
pages/second, exceeding its scalability requirement. Despite these
figures, though, virtual users started encountering server errors such
as connection resets after a load of about 130 users. Upon analyzing
the test results, ArsDigita traced these errors to Netras running out
of physical memory. During load testing, the AOLserver processes on the
four Main and XML Netras eventually consumed more memory than the 256 MB
of physical RAM available to them. Once this happened, the Netras'
performance suffered, and the site began to experience errors.
ArsDigita noted this memory problem but did not address it until the
next round of testing.
One other problem that ArsDigita found during this test was that at
higher loads, many of the XML bulk-download pages were failing.
ArsDigita suspected this was because many of these pages did not have
access to the database handles they required. To address this problem,
ArsDigita decided to dedicate an AOLserver specifically to serving XML
bulk-downloads in its next test.
Following are graphs highlighting some of the site-wide test results:
Figure 18 Main Foreground: Performance Vs. Users
Figure 19 XML Foreground: Performance Vs. Users
Note that xml-1's performance became unacceptably slow beyond 150 users
(when Netra memory ran out), and the other XML foreground pages' performance became
unacceptable at around 250 users. The main foreground pages, however,
performed adequately throughout testing.
The next three graphs illustrate how errors started occurring once Netra
physical memory ran out:
Figure 20 Main Foreground Errors Vs. Users
Figure 21 XML Foreground Errors Vs. Users
Figure 22 XML Background Error Rate Vs. Users
Figure 23 Netra Memory Use Vs. Users
The point where each Netra crosses over the 25000.00 line is
approximately where it uses up its 256 MB of RAM. The two Netras that
do not cross over the 256 MB threshold are the image and cache servers.
4.8 Round Two Findings
In this round of testing and tuning, ArsDigita was able to successfully
tune two slow pages as well as improve the server configuration.
Furthermore, through an Index+3 comparison test against Round One
findings, ArsDigita found that these changes significantly improved
overall site performance. Finally, in preparation for COMPANY Y,
ArsDigita individually tested two XML pages and performed a site-wide
test. Although the site performed beyond requirements in these tests,
they exposed a memory-growth problem.
5 Round Three Testing
5.1 Test Environment
For its final round of scaling, ArsDigita moved its test equipment from
Cambridge to its space in the Waltham Exodus hosting site. There,
ArsDigita set up its machines to test against Site X's new e4500 server as
well as five Netras. This would help ArsDigita verify that on this new
production environment, Site X would still be able to meet its
scalability requirement of a 6.5 pages/second throughput.
ArsDigita was not able to test against a full complement of the nine
Netras that Site X planned on using for its site because four of the
Netras Site X planned on using with the e4500 were still deployed on the
existing e450-based site. Nevertheless, if Site X could pass
scalability requirements with five Netras, then it would certainly be
able to do so with nine.
5.2 Initial Site-Wide Tests
For this round of testing, ArsDigita added one more script to the XML
foreground, xml-13. It distributed its 500 e-Load virtual users
similarly to the Round Two site-wide test: 16 background users, and five
times as many XML foreground users as main foreground users.
Script  | Number of Users | User Frequency
--------+-----------------+-----------------
xml-3   | 1               | every 10 minutes
xml-4   | 1               | every 10 minutes
xml-5   | 1               | every 10 minutes
xml-6   | 1               | every 10 minutes
xml-7   | 1               | every 10 minutes
xml-8   | 1               | every 10 minutes
xml-9   | 1               | every 10 minutes
xml-10  | 1               | every 10 minutes
xml-11  | 1               | every 10 minutes
xml-12  | 1               | every 10 minutes
trace-1 | 2               | every 5 minutes
trace-2 | 2               | every 5 minutes
trace-3 | 2               | every 5 minutes
--------+-----------------+-----------------
Total   | 16 Background Users

Table 14 Round Three Background Users
Script             | Number of Users | User Frequency
-------------------+-----------------+----------------
/index.adp         | 20              | every 5 seconds
/page2.adp         | 20              | every 5 seconds
/page3.adp         | 20              | every 5 seconds
/page4.adp         | 11              | every 5 seconds
/page4-special.adp | 10              | every 5 seconds
-------------------+-----------------+----------------
Total              | 81 Main Foreground Users

Table 15 Round Three Consumer-Site Foreground Users
Script | Number of Users | User Frequency
-------+-----------------+----------------
xml-1  | 91              | every 5 seconds
xml-2  | 201             | every 5 seconds
xml-3  | 91              | every 5 seconds
xml-13 | 20              | every 5 seconds
-------+-----------------+----------------
Total  | 403 XML Foreground Users

Table 16 Round Three XML Foreground Users
For its first test, ArsDigita set up the site as follows:
Number of Netras                  | 1 Main, 1 XML, 1 XML bulk-download,
                                  | 1 Image, 1 Cache
AOLservers/Netra                  | 1 for Main; 2 for XML
MaxThreads Parameter              | 10
Number of Main Database Handles   | 8
Ramp Up                           | Background: 16 users, all at once
                                  | Foreground: 5 users, every 3 minutes
Virtual User Delay Between Clicks | 5 seconds

Table 17 Configuration for First Round-Three, Site-Wide Test
Note that ArsDigita, at Site X's request, reduced the virtual user delay
between clicks from ten seconds to five seconds.
5.2.1 Memory Issue
During this test, the Netras again ran out of memory as in Round Two.
This time, however, they ran out of RAM even earlier in testing than
before, despite several of the Netras in this test having 512 MB of RAM
rather than 256 MB.
One other thing that ArsDigita noticed during this test was that the XML
bulk-download, image, and cache Netras were not working hard at all.
ArsDigita decided that for its remaining tests in Round Three, it would
re-configure its Netras so that these three functions would reside on
one server:
Number of Netras                  | 2 Main, 2 XML,
                                  | 1 XML bulk-download/Image/Cache
AOLservers/Netra                  | 1 for Main; 2 for XML
MaxThreads Parameter              | 10
Number of Main Database Handles   | 8
Ramp Up                           | Background: 16 users, all at once
                                  | Foreground: 5 users, every 3 minutes
Virtual User Delay Between Clicks | 5 seconds

Table 18 Configuration for Remaining Round-Three Site-Wide Tests
With this configuration in place, ArsDigita proceeded to work on
isolating the cause for the problematic AOLserver memory growth. After
multiple rounds of testing and analysis, ArsDigita was able to trace the
primary source of memory growth to the page, xml-2.
5.2.2 dqd_threadpool
The page xml-2 spawned multiple threads to perform a reactive query
every time it was loaded. Thread creation within AOLserver is a CPU- and
memory-intensive process. Thus, every time a user requested xml-2,
AOLserver would spawn multiple threads and grow in memory size.
Once ArsDigita had determined that spawning threads was causing
AOLserver to grow in size, it installed a custom AOLserver module,
dqd_threadpool. With this module, ArsDigita created a pool of threads
dedicated to the reactive query launched by xml-2. AOLserver would
automatically create all the threads for this pool upon startup and keep
them for the life of the server.
Now, whenever xml-2 ran its reactive query, it would grab existing
threads from the thread pool rather than create new threads. If all the
threads within the thread pool were already being used by other
requests, then the reactive query would enter a queue to await the next
free thread.
Using this pool put a bound on the number of threads that xml-2 could
use, and it thus capped the memory size to which AOLserver could grow.
After setting up dqd_threadpool, ArsDigita tested xml-2 with a pool size
of ten without seeing any memory growth problems. It therefore set up
one final site-wide test for Site X to see how the site would scale with
the new e4500 in place and memory issues resolved.
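dqd_threadpool is an AOLserver C module, but the queueing behavior described above can be sketched in Python: a fixed number of workers created up front, with requests waiting in a queue when all workers are busy. The class and method names here are illustrative, not the module's API.

```python
import queue
import threading

class BoundedPool:
    """A fixed pool of worker threads created once at startup.

    Work submitted while every worker is busy waits in a queue rather
    than spawning a new thread, so the thread count (and therefore
    memory) stays bounded for the life of the server."""

    def __init__(self, size=10):
        self.tasks = queue.Queue()
        for _ in range(size):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        # Each worker loops forever, pulling queued work as it frees up.
        while True:
            func, args, result = self.tasks.get()
            result.append(func(*args))
            self.tasks.task_done()

    def submit(self, func, *args):
        # Queue the work; the caller gets a list a worker will fill in.
        result = []
        self.tasks.put((func, args, result))
        return result

    def wait(self):
        # Block until every queued task has been processed.
        self.tasks.join()
```

With a pool size of ten, at most ten reactive queries run concurrently; an eleventh request simply waits for the next free thread, exactly the bound that capped AOLserver's memory growth.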
5.3 Final Site-Wide Test
For this final test, ArsDigita used the same script setup as in the
initial Round Three site-wide tests, but increased the foreground
ramp-up pace to five users every 90 seconds:
Number of Netras                  | 2 Main (10.0.1.240-1), 2 XML (10.0.1.242-3),
                                  | 1 XML bulk-download/Image/Cache (10.0.1.244)
AOLservers/Netra                  | 1 for Main; 2 for XML
MaxThreads Parameter              | 10
Number of Main Database Handles   | 8
Ramp Up                           | Background: 16 users, all at once
                                  | Foreground: 5 users, every 90 seconds
Virtual User Delay Between Clicks | 5 seconds

Table 19 Final Site-Wide Test Configuration
During this test, the final site scaled acceptably to 500 users with a
throughput of about 49 pages/second. These results, on less hardware
than the production site, exceeded the scalability requirement of 6.5
pages/second by nearly an order of magnitude.
The Netras did not run out of memory throughout this test, indicating
that using dqd_threadpool had largely solved the AOLserver memory
problem. Furthermore, the database and consumer-site Netras were not
using their CPUs to full capacity. The XML servers were using their
entire CPU capacity, but this did not hurt Site X's ability to meet
scalability or performance requirements.
A couple of potential problems did reveal themselves in this test.
First, xml-13 performed poorly after about 300 concurrent users; by this
point, though, the site itself had nearly reached its throughput
bottleneck. Second, after about 200 users, the consumer-site pages
started experiencing errors, primarily in the form of connection resets
or timeouts. These problems occurred far above the required throughput
but may warrant investigation in the future.
Following are some graphs highlighting the results from this test:
Figure 24 Final Site-Wide Test Overall Throughput: Pages/Second
Figure 25 Main Foreground: Performance Vs. Users
Figure 26 XML Foreground: Performance Vs. Users
Figure 27 Main Foreground: Error Rate Vs. Users
Figure 28 XML Foreground: Error Rate Vs. Users
Figure 29 XML Background: Performance Vs. Users
Figure 30 Trace Background: Performance Vs. Users
Figure 31 Netras Memory Vs. Users
5.4 Long-Term Stability Test
Once ArsDigita had ascertained that Site X would scale and perform
acceptably, it ran a long-term stability test on the site. A long-term
stability test exerts a constant load upon a site for an extended period
of time and measures how that site degrades over time. With Site X,
ArsDigita performed an overnight, 11-hour test under a constant load of
226 users. These 226 users included the 16 background scripts and 210
foreground and XML-foreground scripts. Note that during the final
site-wide test, the servers still had throughput capacity left beyond
226 users; the system throughput at 226 users was about 35 pages/second.
Thus, a load of 226 users did not push the system into any bottlenecks.
The long-term test's scripts ran for a total of 1,469,016 iterations
with 5,152 total errors for a 0.35% error rate. Thus, for all intents
and purposes, the long-term test ran without any problems.
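The quoted error rate follows directly from the iteration and error counts (a quick check, not part of the test tooling):

```python
iterations = 1_469_016   # total script iterations in the 11-hour test
errors = 5_152           # total errors observed

error_rate_pct = 100.0 * errors / iterations
assert round(error_rate_pct, 2) == 0.35
```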
ArsDigita noted the memory on each Netra at the beginning and end of
the stability test. The results are as follows:
Netra       | Physical RAM (MB) | RAM Free at Start (MB) | RAM Free at End (MB) | RAM Consumed (MB)
------------+-------------------+------------------------+----------------------+------------------
100 (main)  | 256               | 80                     | 0                    | 80+
101 (main)  | 512               | 327                    | 202                  | 125
102 (XML)   | 512               | 317                    | 250                  | 67
103 (XML)   | 256               | 111                    | 91                   | 20
104 (mixed) | 512               | 283                    | 243                  | 40

Table 20 Long-Term Stability Test Memory Usage
Based upon these results, Site X proved fairly stable. Under constant
load, however, the AOLservers do slowly grow in memory usage,
indicating, perhaps, a small memory leak. ArsDigita recommends that Site
X do the following with its production setup to deal with this memory
growth:
- Use 512 MB of RAM on each Netra
- Restart the AOLserver instances twice a day
Following these steps should help ensure that Site X runs stably over
time.
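The twice-daily restart recommendation could be scheduled with cron; the entries below are illustrative only, and the restart script path is a placeholder rather than part of Site X's actual setup.

```
# Illustrative crontab entries: restart the AOLserver instances at
# 4:00 and 16:00 each day. The script path is a placeholder.
0 4,16 * * *  /usr/local/aolserver/bin/restart-aolservers.sh
```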
6 Conclusion
ArsDigita's Scaling Team worked iteratively with the Site X development
team to ensure that Site X would be prepared for its COMPANY Y
joint venture. During this effort, ArsDigita greatly improved
Site X's scalability by
- Changing Site X's server configuration
- Improving slow page performance
- Capping thread memory growth
- Determining measures to ensure long-term stability
As a result of this work, Site X currently far exceeds its scalability
requirement of 6.5 pages/second while running on less hardware than the
actual production site.