Printed from embedded-computing.com

Benchmarking wireless telecommunications infrastructure equipment

By Peter Carlston
Intel Corporation

Two and a half years ago I reported on the first phases of our project to measure the performance of our proof-of-concept Radio Network Controller (RNC) in an article entitled “How to use modular building blocks to develop an RNC system” in the Spring 2004 issue of Embedded Computing Design. At that time we had determined that Intel Pentium M processors were a better choice than Intel Pentium 4 processors based on the Intel NetBurst microarchitecture for the type of intense signaling plane workloads presented by an RNC.

In addition, processor cycle accurate modeling of the user plane software running on the Intel IXP2800 network processors had shown the user plane (sometimes termed data plane) for a 500K subscriber RNC would require only 12 Intel NetStructure IXB2800 3G boards with their integrated Universal Mobile Telecommunication System (UMTS) user plane software. This article discusses how we completed the rest of the benchmark efforts and summarizes the major findings.

The term benchmark is a bit misleading because there are no industry-standard procedures for measuring and comparing the performance of wireless infrastructure equipment as there are, say, for servers. Nevertheless, we had to give our customers guidance about how many AdvancedTCA boards would be required for RNCs of various sizes built using Intel products and third-party hardware and software.

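We used a three-step process to assess the actual performance of our reference RNC: user plane tests, control plane-only tests, and combined user and control plane tests. Each step is described in its own section below.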

Reference RNC
Intel’s proof-of-concept RNC utilizes the hardware and software shown in Table 1, which includes the load generation equipment involved in measuring the control plane performance.

| Hardware Component | Use | Manufacturer | Software |
|---|---|---|---|
| Chassis | Chassis | Intel NetStructure MPCH0001 | N/A |
| Chassis Management Modules | Controls blade bring-up and fan speeds. Other high availability features not used during tests. | Intel NetStructure MPCMM0001 | Intel NetStructure High Availability Middleware |
| GbE Backplane Switches | Intra-RNC communications | Znyx ZX5000-X3 | |
| Four User Plane Cards | User and Transport Planes | Intel NetStructure IXB2800 3G Boards (Intel IXP2800 Network Processors) | MontaVista CGE 3.1 Linux (on XScale core); Intel RNC data plane software (both core and microengines) |
| Four User Plane PrPMCs | User Plane-Control Plane Framework APIs | RadiSys EPC-6315 with Intel Pentium III processor @ 800 MHz; 2 GB DDR266 ECC RAM | MontaVista CGE 3.1 Linux; Intel NetStructure IXB3G Software |
| Kasumi FPGA | Ciphering | Intel NetStructure IXB2800 3G boards (included with IXB2800KAS) | Intel RNC data plane software |
| Two OC-3 Mezzanines | Iub, Iu-PS line interfaces | Intel NetStructure IXB2800 3G boards (included with IXB28004XOC3) | Intel RNC data plane software |
| Memory complement | | 32 MB QDR SRAM; 2 GB DDRAM; 10 MB flash | N/A |
| Signaling Plane SBC | Hosts Signaling/Control Plane Software | Motorola ATCA-717 | TietoEnator 3G RAN Signaling Middleware V1.0, running on MontaVista CGE 3.1 Linux; Intel C++ compiler version 9.0. Compile flags: /usr/local/intel_cceia_90/bin/icc -ansi -Wall -Ob2 -g -O3 -xB -tpp7 -ipo |
| BIOS | | Phoenix 4.0 R 6.0 | |
| Processor | | Intel Pentium M Processor at 1.6 GHz (can also use the Intel Pentium M Processor at 1.8 GHz) | |
| Memory complement | | 2 GB DDR SDRAM 425 (133 MHz/266 Mbps) | N/A |
| Hard Disk Storage | Boot disc | 30 GB HDD | N/A |
| Load Generator | Runs UE, NodeB, and SGSN User and Signaling Plane Stacks | Dell PowerEdge 1850 1U rack-mount server | MontaVista CGE 3.1 Linux; TietoEnator 3G RAN Signalling stacks; TietoEnator Load Generator with SGSN Application, Presentation Unit, RLC/MAC/FP Protocols, and RRM (part); Intel C++ compiler version 9.0. Compile flags: /usr/local/intel_cceia_90/bin/icc -ansi -Wall -Ob2 -g -O3 -xB -tpp7 -ipo |
| BIOS | | LSI Logic MPTBIOS-5.06.04 | |
| Processors | | 2x Intel Xeon processors @ 3.0 GHz; 1 MB L2 cache each, EM64T | |
| Memory | | 2 GB DDR2 SDRAM (1 GB on each of 2 memory channels) | |
| I/O Riser card | Holds Network Interface Cards | 2 PCI-X slots | |
| GbE Interface | Transport for control plane-only tests | Intel 1000MT Copper Dual-Port PCI-X Gigabit Adapter | |
| OC-3 ATM Line Card | ATM I/F for Combined User and Control Plane Tests | Xalyo XS155 ATM Card (Quad OC-3 STM1) | Linux driver |
| Hard Disk Storage | | 36 GB SCSI HDD (10K RPM, Ultra3 LVD, 1") | |

Table 1

Figure 1 shows the high-level software architecture and data flow within our reference UMTS R99/R5 RNC. The Iur line card has been omitted from this diagram since it was not used during the control plane-only and combined tests outlined later.  The upper portion of the figure shows the control plane software, which would normally be horizontally distributed across multiple AdvancedTCA SBCs that use Intel processors.

Intel network processors are used for the line cards and Radio Network Layer (RNL) blades shown in the lower part of the diagram. The user plane software runs in the network processors’ microengines. Interfaces to management middleware and the control plane subsystems run in the XScale core of each network processor. Line cards connect to downstream and upstream equipment; the RNL cards do not have external interfaces. Our modeling had previously shown the dedicated RNL (RNL-d) card presents the most challenging processing load since the Kasumi ciphering algorithm runs on that card. An FPGA performs the actual ciphering, which frees up microengines to handle the Radio Link Control (RLC), Medium Access Control (MAC), and frame protocol algorithms.

Figure 1. High-level software architecture and data flow within our reference UMTS R99/R5 RNC

User plane tests
Figure 2 shows the hardware setup for the user plane tests. Our engineers tested the user plane software products using ANVL network test software from IXIA running on four high-performance PCs. They extended the standard ANVL test capabilities to encompass the following:

Figure 2. Hardware setup for the user plane tests

These extensions required considerable expertise and work but were reused and extended further during the stress, duration, robustness, recoverability, and reliability testing phases of the product software development project.

It would have been impractical to replicate enough ANVL PCs to emulate thousands of sessions, so we used an ATM multicast switch to multiply the load by a factor of 80. We were thus able to simulate the load from 1,600 User Equipment (UEs) during the PS-only tests, 6,080 UEs during the CS-only tests, and 3,200 UEs during the combined CS + PS tests. This resulted in 1,600 PS-only sessions, 6,080 CS-only sessions, and 3,200 + 1,200 CS + PS sessions over the data channel.

We based our traffic model on the UMTS Forum Report Number 6 suggestions, but adjusted it in consultation with several of the largest equipment manufacturers. Table 2 shows its major parameters.

Traffic Model

| Parameter | User Plane Tests | Control Plane Tests |
|---|---|---|
| General Assumptions | | |
| Active Session Distribution | 30% PS; 70% CS | 100% PS |
| Macrodiversity | 45% two branches; 5% three branches | N/A |
| Soft Handoff Interval (adding/removing a branch) | 10 seconds | N/A |
| Average location updates/hour/subscriber | 4 | N/A |
| Node Bs | 20 | 24 |
| Cells | 60 | 60 |
| Traffic Penetration | | 0.1 |
| System Information Updates per second per Node B | | 1 |
| UE Measurement Reports per second per UE | | 0.5 |
| Common Measurement Reports per second per cell | | 1 |
| Dedicated Measurement Reports | | N/A |
| Circuit Switched Traffic | | |
| BHCA per subscriber | 0.7 | N/A |
| AMR coding | 12.2 Kbps DL/UL | |
| Average call duration | 90 seconds | |
| Average DL/UL packet size | 30.5 bytes | |
| Average OLPC adjustment interval | 200 ms | |
| Packet Switched Traffic | | |
| BHCA per subscriber | 0.3 | |
| DL/UL bit rates | 128 Kbps / 25.6 Kbps | |
| Average DL packet size | 600 bytes | |
| Size of DL user data over Iu-PS | | 1,316 bytes |
| Size of DL user data over Iub | | ≤ 336 bytes |
| Average UL packet size | 120 bytes | |
| Average Session Duration (seconds) | 310 | 120 |
| Average activity period | 30% | |
| OLPC adjustment interval | 200 ms | N/A |
| RLC retransmit factor | 10% | N/A |
| CCH to DCH to CCH transitions | 4 | N/A |
| Measured @ 3,000 parallel calls | | |
| Packet Service Attach | | 5 |
| Packet Service Detach | | 3 |
| Routing Area Update | | 1 |
| Radio Link Additions per second (Soft) | | 24 |
| Radio Link Deletions per second (Soft) | | 24 |
| Radio Link Additions per second (Softer) | | 24.5 |
| Radio Link Deletions per second (Softer) | | 24.5 |

According to 3GPP 34.108, Common test environments for User Equipment (UE) conformance testing, Release 1999

Table 2

The results of the user plane-only tests came in better than our modeling had suggested. The Beta tests showed it would require only nine user plane boards for a 600K subscriber UMTS R5 RNC. (An additional six user plane boards are required for high-availability protection.) Table 3 summarizes the Beta hardware and software test results.

Transport-User Plane Benchmark Summary, 600K Subscribers

| Board Type | External Interfaces | Per-Board DL Throughput (Mbps) | Quantity (Working + Protection) | DL Throughput Total for Boards per Type (Mbps) |
|---|---|---|---|---|
| ATM Line Cards* | 4 x OC-3 | 620 | 4 + 4 (1:1 Sparing) | 2,480 + 2,480* |
| RNL-d Card | none | 188 | 4 + 1 (n+1 Sparing) | 752 + 188** |
| RNL-c/sh Card | none | 44 | 1 + 1 (1 + 1 Sparing) | 80 + 80 |

* Each card hosts four bidirectional OC-3/STM-1 interfaces
** Each board supports 170K subscribers

Table 3

Control plane-only tests
While we were developing and testing our user plane hardware and software, we began working with the Telecom and Media Division of TietoEnator AB for a signaling/control plane software solution to be integrated with the Intel-based user plane so we could test the overall proof-of-concept RNC’s system performance. TietoEnator ported their UMTS R99/R5 control plane software stack modules to Linux and validated them on the Intel Pentium M processor-based SBCs from Motorola Embedded Communications Computing group, as listed in Table 1.

Unlike in the user plane testing, an ATM multiplexing switch could not be used to increase the effective load by merely duplicating traffic flows, because signaling stacks require all parallel calls to have unique Cellular IDs (CIDs), User Equipment IDs (UIDs), and so on. So TietoEnator built a dynamic load generator to emulate the UE/Iub and Iu-PS traffic. The load generator included all the signaling plane modules and both ATM and IP transport/user plane modules.

This software ran on a typical Linux rack-mount server. Ordinarily the cost to develop such a load generator would have been far beyond our budget for the project, but TietoEnator was able to reuse their nearly complete, highly modular user and control plane software stacks. These stacks already incorporate high-performance APIs, for example between NBAP/ALCAP/RRC/RANAP and the high-level RNC application (also termed Radio Resource Management or RRM). They were thus able to make the extensions and modifications necessary for our test environment, such as the parts of RRM we needed to run the tests, very cost effectively.

The only major problems surfaced during the integration of the ATM hardware, drivers, and Linux operating system. The first Linux distribution we tried, for example, didn’t have high enough timing/clock resolution, so MontaVista’s Carrier Grade Linux was used instead. It also took more time than anticipated to select ATM hardware – in fact, the Xalyo ATM card was the third one we tried.

We bounded the scope of the project in the following ways. First, we limited our tests to streaming video packet switched traffic. Setting up a channel between the UE/Node B/RNC and the Serving GPRS Support Node (SGSN) requires approximately the same amount of signaling whether that channel is used for voice or for data traffic, so the results obtained from PS calls can also be applied to voice capacity. Second, we limited the load generator’s setup/teardown capacity to 5,000 parallel calls. Third, we limited the size of the RNC and load generator platforms.

Since the user plane tests had shown the user plane software scaled almost linearly when more cards were added, we used a minimum user plane shelf consisting of single RNL-d, RNL-c/sh, Iub, and Iu-PS boards. Then we performed tests that exercised the load generator and RNC signaling blade by setting up and tearing down 5,000 parallel calls. Once that was working well, we applied the traffic model parameters to the load generator traffic.

Adding signaling to handle soft radio link additions/deletions and measurement reports increased the control plane traffic by two-fifths and thus reduced the number of parallel calls the load generator could support to 3,000. We took a series of measurements at 1,000, 2,000, and 3,000 parallel calls. The results showed processor load tracked control plane traffic very closely – 30 percent processor load correlated to 3K parallel calls, 20 percent load to 2K, 10 percent to 1K, and so on. We therefore were confident we could extrapolate meaningful test results by using a single load generator.
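The linear relationship described above can be checked with a short least-squares fit. This is a minimal sketch, not the tooling used in the project: the three data points are the approximate figures quoted in the text, and the function names are illustrative.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# (parallel calls, approximate processor load in percent), as quoted in the text
MEASURED = [(1000, 10.0), (2000, 20.0), (3000, 30.0)]

slope, intercept = fit_line(MEASURED)


def predicted_load(parallel_calls):
    """Extrapolate processor load (percent) for a given number of parallel calls."""
    return slope * parallel_calls + intercept


print(f"{slope:.4f} percent per call, {predicted_load(4000):.1f} percent at 4,000 calls")
```

With these rounded inputs the fit gives roughly 0.01 percent of processor load per parallel call and an intercept near zero, which is the property that makes extrapolating from a single load generator reasonable.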
    
We actually ended up using IP transport for these control plane-only tests. Even though the ATM card was the best available for our configuration, its device driver used a fixed number of memory buffers to store the CIDs. This limited the number of UE channels the Node B load generator software could support. We had also determined the load generator’s user plane software could support a total of 500 parallel video downloads. These limits meant the amount of signaling traffic during the combined user plane/control plane tests would have been too little to obtain statistically meaningful results. So TietoEnator replaced the ATM modules with User Datagram Protocol (UDP)/IP over GbE and was able to generate substantial signaling load using the setup shown in Figure 3. Again TietoEnator’s APIs (“CPCS I/F” in Figure 3) saved us considerable time and expense since no changes were required to upper layer protocols.

Figure 3. Control plane-only test setup with UDP/IP over GbE replacing the ATM modules

Figure 4 shows a representative graph obtained when running the load generator (with traffic model) at 3,000 parallel calls. The tests ran for about 30 minutes, during which the number of parallel calls built up to the target of 3,000. The Linux “ps” and “top” commands were used to capture total processor load during the test period. The Cpu_sys trace in Figure 4 shows the processor load attributable to running the UDP and IP stacks. Because transport plane protocols normally run on line cards, not control plane blades, this load needs to be deducted from the total “CPU_load.” The meaningful trace is therefore Cpu_user. It shows that 3,000 parallel calls (about 80K BHCA or 25 calls per second) consumed around 30 percent of the resources of the 1.6 GHz Intel Pentium M processor.[1]
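The article does not spell out how the calls-per-second figure was obtained. One plausible reading, assumed here, is the steady-state relationship between calls in progress and mean holding time (Little's law), using the 120-second average session duration listed for the control plane tests in Table 2. The function name is illustrative.

```python
def calls_per_second(parallel_calls, mean_hold_time_s):
    """Steady-state arrival rate: calls in progress divided by mean holding time."""
    return parallel_calls / mean_hold_time_s


# Assumption: holding time equals the 120 s average session duration in Table 2
AVERAGE_SESSION_DURATION_S = 120

rate = calls_per_second(3000, AVERAGE_SESSION_DURATION_S)
print(f"{rate:.1f} calls per second")
```

Under that assumption, 3,000 parallel calls correspond to 25 calls per second, matching the figure quoted above.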

Figure 4. Processor Load at 3,000 Parallel Calls

Combined user and control plane tests
TietoEnator had previously integrated their control plane software with the user plane software from Intel by using the Framework API supported by the Intel NetStructure IXB2800 3G Boards software. The Framework API is derived from the Network Processor Forum functional API coding guidelines and naming practices. This functionality was turned back on for the combined user and control plane tests. The final test setup is shown in Figure 5. Note the addition of the user plane line cards and RNL cards in dark blue, as well as the FAPI between the user and control plane blades.

Figure 5. Final test setup for the combined user and control plane tests

The network configuration for the combined user and control plane tests is shown in Figure 6.

Figure 6. Network configuration for the combined user and control plane tests

The final tests validated our assumptions and provided much additional information that can only be summarized here. First of all, a 500-call streaming video load running on the user plane cards added only a 4-5.5 percent load to the processor running the signaling stacks, as shown in Figure 7. Second, such a small number of parallel calls did not generate a measurable load on the XScale cores where the network processors’ FAPI interfaces run.

Figure 7. Processor load added to the signaling blade by a 500-call streaming video load on the user plane cards

Extrapolated results for combined user and control plane test
Figure 8 shows the signaling blade’s processor utilization extrapolated up to a 50 percent processor load. Again, onboard IP load has been deducted from the measured and extrapolated results. The results in Figure 8 have also been increased by 50 percent because the control plane-only and combined tests were both done with TietoEnator’s partial Radio Resource Management (RRM) application software running on the same blade as the signaling software. We estimated that even the reduced RRM module consumed about 50 percent of the processor’s resources. Putting the RRM software on a signaling blade is a highly unlikely architecture since RRM is one of the most intensive consumers of processor resources. It is usually placed on separate “RRM-only” boards since it often needs to be scaled independently of the signaling and user plane processing.

Figure 8. Extrapolated Control Plane Results

In addition, signaling boards in our reference RNC are specified at 40 percent processor load. This enables a high-availability protection board, also running at 40 percent load, to take over the entire load of a failed board without exceeding 80 percent processor load.
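The headroom rule above can be expressed as a small calculation. This is a sketch under the assumption that the load of a failed board is spread evenly over the surviving boards; the article only describes the two-board case, and the function name is illustrative.

```python
def load_after_failover(load_per_board_pct, boards_sharing_load):
    """Per-board load after one board fails, assuming its work is spread evenly.

    The even-redistribution model and the generalization beyond two boards
    are assumptions for illustration; the article describes only a pair.
    """
    if boards_sharing_load < 2:
        raise ValueError("need at least two boards for failover")
    total_load = load_per_board_pct * boards_sharing_load
    return total_load / (boards_sharing_load - 1)


# Two boards at 40 percent each: the survivor carries both loads
print(load_after_failover(40, 2))
```

For the pair described in the text, the surviving board ends up at 80 percent, which is why 40 percent is used as the per-board dimensioning target.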

So Figure 8 indicates a single Intel Pentium M blade could process about 9,250 parallel calls at a 40 percent processor load.

Figure 9 shows memory consumption extrapolated from the measurements taken during the control plane-only tests. This graph includes RRM and IP transport, which would not normally be running on a signaling board. But even with the extra memory consumed by RRM and IP, 9,250 parallel calls still require only about 1.15 GB of RAM. Other calculations not shown indicate that even at 80 percent or 100 percent processor utilization (theoretically about 24,000 parallel calls) only 1.375 GB of memory would be required by the TietoEnator control plane software.

Figure 9. Extrapolated memory consumption from the control plane-only tests (includes RRM and IP transport)

The control plane-only tests had indicated the backplane traffic tracked the number of parallel calls, so we extrapolated the results from those tests in Figure 10. Converting 1,215,056 bytes per second at 40 percent processor load works out to be 9.7 Mbps, or, during failover mode, 19.4 Mbps of backplane traffic per control plane blade. Again, the values are somewhat inflated since they include the RRM and IP Transport Tx traffic to the blade. Transmit traffic was always higher than received traffic, so Eth0_rx measurements have not been shown here.     
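The unit conversion in the preceding paragraph can be reproduced as follows. This is a minimal sketch that assumes decimal megabits (10^6 bits) and that failover simply doubles the traffic on the surviving blade, as the text implies; the names are illustrative.

```python
BITS_PER_BYTE = 8


def bytes_per_second_to_mbps(bytes_per_second):
    """Convert a byte rate to megabits per second (decimal, 10**6 bits)."""
    return bytes_per_second * BITS_PER_BYTE / 1_000_000


normal_mbps = bytes_per_second_to_mbps(1_215_056)  # traffic at 40 percent load
failover_mbps = 2 * normal_mbps                    # one blade carrying two loads

print(f"{normal_mbps:.1f} Mbps normal, {failover_mbps:.1f} Mbps in failover")
```

Either figure is a small fraction of a GbE backplane link, which is consistent with the article treating backplane bandwidth as a non-limiting factor.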

Figure 10. Extrapolated Backplane Usage Per Signaling Blade

One final adjustment must be made to the number of parallel calls a single signaling board can handle. Although these tests indicated an Intel Pentium M processor-based control plane board can process about 9,250 calls in parallel, the totals should actually be de-rated by 50 percent because we estimate about half of the control messages actually present in an operational RNC were not implemented in the load generator. We did not implement, for example, any of the soft or softer handoff control messages or those between RNCs on the Iur interface. We also did not implement any of the cell, URA, location area update, or paging control sequences. Basically all of these sequences are controlled by the RRM application, so implementing them would have measured the RRM module’s performance but not actually told us much more about the signaling plane’s performance.

So if we de-rate the numbers in Figure 8 by 50 percent, we conclude that a single Intel Pentium M processor-based blade running at 40 percent processor load can realistically handle 4,125 parallel calls. This works out to about 109K Busy Hour Call Attempts (BHCA), or 34 calls per second.

Calculating the number of Erlangs per blade is a bit more complicated. Erlangs are derived by the formula A = Δ*T, where Δ is the number of carried connections per unit of time (arrival rate, cell rate) and “T” is the mean duration of an average connection or hold time. In other words:

Erlangs = calls per second x (average call duration + call setup time + call release time) 

We had not measured the call setup and release times at 4,125 parallel calls. We had measured setup and release times at 3,000 calls, but these times were increasing in an unusual fashion and would have been too long by a factor of two or more at 4,100 calls. We estimate about two-thirds of this delay occurred because the load generator was running at very nearly 100 percent CPU load when handling 3,000 calls. So it was not responding to setup/teardown requests within the timeframes production network equipment would. It is also possible code optimization would have to be done. As mentioned, some stack optimization occurred naturally during the course of this project, but optimization per se was not a project goal.

So if we assume rather high call setup and release times of three and one seconds, respectively, Erlangs per blade would be 34 * (120 + 3 + 1), or 4,216.
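The Erlang estimate above follows directly from the formula given earlier. A short sketch, using the article's own inputs and an illustrative function name:

```python
def erlangs(calls_per_second, call_duration_s, setup_time_s, release_time_s):
    """Offered traffic in Erlangs: arrival rate times total holding time."""
    holding_time_s = call_duration_s + setup_time_s + release_time_s
    return calls_per_second * holding_time_s


# 34 calls/s, 120 s average duration, assumed 3 s setup and 1 s release
per_blade = erlangs(34, 120, 3, 1)
print(per_blade)
```

Because the setup and release times are small compared with the 120-second call duration, even the deliberately pessimistic three-second and one-second values change the result by only about 3 percent.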

Dimensioning
TietoEnator’s signaling software is designed to be distributed across a number of blades. They use a software scaling factor of 0.7 to calculate how signaling processing capacity scales as more boards are added to a system. Table 4 shows the estimated dimensioning matrix for a set of six (half of an AdvancedTCA shelf) Intel Pentium M processor-based signaling boards:

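0.7 Software Scaling Factor

| Board 1 | Board 2 | Board 3 | Board 4 | Board 5 | Board 6 | Total Parallel Calls |
|---|---|---|---|---|---|---|
| 4,125 | | | | | | 4,125 |
| 4,125 | 2,887 | | | | | 7,012 |
| 4,125 | 2,887 | 2,887 | | | | 9,899 |
| 4,125 | 2,887 | 2,887 | 2,887 | | | 12,786 |
| 4,125 | 2,887 | 2,887 | 2,887 | 2,887 | | 15,673 |
| 4,125 | 2,887 | 2,887 | 2,887 | 2,887 | 2,887 | 18,560 |

Table 4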
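The totals in Table 4 can be regenerated with a few lines of code. This sketch assumes, as the table values suggest, that the first board contributes its full capacity, each additional board contributes 70 percent of that, and the fractional call is dropped (2,887.5 becomes 2,887). The function and parameter names are illustrative.

```python
def dimension(first_board_calls, scaling_percent, boards):
    """Cumulative parallel-call capacity for 1..boards signaling boards.

    Integer arithmetic is used so the per-board increment is truncated,
    matching the 2,887 figure shown in Table 4.
    """
    additional = first_board_calls * scaling_percent // 100
    return [first_board_calls + additional * (n - 1) for n in range(1, boards + 1)]


totals = dimension(4125, 70, 6)
for board_count, total in enumerate(totals, start=1):
    print(f"{board_count} board(s): {total:,} parallel calls")
```

The last value is the six-board total used in the subscriber calculation that follows.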

Calculating the number of subscribers (rather than parallel calls) is normally done using the formula:

Total Subscribers = (seconds in an hour * active sessions)/(activity period * BHCA per subscriber)

Using this formula gives the total number of subscribers that can be supported by six boards based on various traffic model factors:

(3,600 * 18,560)/(120 * 1) = 66,816,000/120 = 556,800 CS+PS users

The figure of approximately 556.8K subscribers per six signaling boards assumes a BHCA per subscriber of one, which means we are dimensioning a six-slot signaling plane to handle the contingency of each subscriber making one call during the busiest hour of the day. The traffic model predicts the calls should be about 30 percent packet switched and 70 percent circuit switched.
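The subscriber formula lends itself to a small helper for trying other traffic model values. This is a sketch with illustrative names; the inputs are the six-board total from Table 4, the 120-second average session duration used in the calculation above, and a BHCA per subscriber of one.

```python
SECONDS_PER_HOUR = 3600


def total_subscribers(active_sessions, session_duration_s, bhca_per_subscriber):
    """Subscribers supported, per the formula quoted in the article."""
    return (SECONDS_PER_HOUR * active_sessions) / (
        session_duration_s * bhca_per_subscriber
    )


subscribers = total_subscribers(18_560, 120, 1)
print(f"{subscribers:,.0f} subscribers")
```

Since BHCA per subscriber sits in the denominator, a traffic model that assumes fewer call attempts per subscriber yields a proportionally larger subscriber count for the same six boards.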

Projections for Intel Core microarchitecture processors

Anecdotal information indicates the performance of protocols quite similar to UMTS signaling roughly doubles when they move from an Intel Pentium M processor to a Dual-Core Intel Xeon Processor 5100 Series, even without software changes to take full advantage of the second core. If these indications are validated for UMTS signaling stacks, single-threaded applications rewritten to utilize core affinity or the SMP capabilities of a dual-core processor could perform at rates greater than twice those reported in this article. And, since some of the Dual-Core Intel Xeon Processors are routable and coolable in an AdvancedTCA SBC, we could be on the cusp of a performance density breakthrough for this type of intensive signaling and control plane processing. Performance per SBC more than three times that summarized in this article may very well be possible today. We plan follow-up studies to find out.

Peter Carlston is a platform solutions architect with Intel’s Infrastructure Processor Division. His principal focus has been to ensure Intel’s embedded processor products meet the requirements derived from wireless infrastructure equipment system architectures.

To learn more, contact Peter at:

Intel
2200 Mission College Blvd.
Santa Clara, CA 95052
peter.carlston@intel.com
www.intel.com

References
[1] Note that some code optimization was done later that more than doubled these figures.

Online references
AdvancedTCA:
http://www.picmg.org/newinitiative.stm
http://www.intel.com/go/atca

ANVL Network Test Suites
http://www.ixiacom.com/

Intel RNC user plane white paper
www.intel.com/go/rnc

Intel NetStructure IXB2800 3G Boards:
http://www.intel.com/design/telecom/products/boards/rnc/8304/overview.htm

Intel IXP2800 Network Processor
http://www.intel.com/design/network/products/npfamily/ixp2800.htm

Intel Computer Boards and Platforms for AdvancedTCA
http://www.intel.com/design/network/products/cbp/atca/index.htm

Motorola Embedded Communications Computing
http://www.motorola.com/content.jsp?globalObjectId=5202

TietoEnator Signaling Solutions
http://www.tietoenator.com/default.asp?path=1;93;16080;124;16837;17193;16242

Disclaimers
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations (http://www.intel.com/performance/resources/limits.htm).

Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator. Any difference in system hardware or software design or configuration may affect actual performance.

Intel does not control or audit the design or implementation of third-party benchmarks or websites referenced in this document. Intel encourages all of its customers to visit the referenced websites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.

Intel, the Intel logo, Intel NetBurst, Intel NetStructure, Pentium, and Intel Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other names and brands may be claimed as the property of others.