Product Manager Practice: A Guide to Software System Server Planning and Selection

This guide will provide a systematic thinking framework and practical advice, covering server type classification, quantity estimation, configuration balancing, and process management methods, helping you make more scientific, pragmatic, and future-proof decisions amidst the complexity of options.

The server carries the core computing, storage, and network capabilities of the system, and the rationality of its planning and selection directly determines the upper limit, stability, expansion potential, and overall cost-effectiveness of the software system. Product managers do not have to delve into the details of technical implementation, but they must thoroughly understand how business requirements map to technical resource requirements, master the core framework of evaluation and decision-making, and improve their voice and leadership in the product technical team.

1. Server types

The choice of server type is not arbitrary, and must be closely aligned with the architecture and functional modules of the software system. Understanding the core responsibilities of different types of servers is the starting point for accurately matching requirements.

Application server

Core Responsibilities: Run application code, process user requests (such as API calls, page rendering), perform business logic calculations, and return responses after interacting with other components such as databases and caches. It is a direct back-end processor for user interactions.

Selection considerations

  • Architectural impact: A monolithic application may be supported by a few powerful application servers; a microservice architecture requires a dedicated, potentially smaller application server cluster for each independent service (such as user, order, and payment services) to achieve decoupling and independent scaling.
  • Performance requirements: General services (such as content management and internal systems) can be served by standard-performance servers. High-concurrency, low-latency scenarios (such as real-time game battle logic, live-stream comment distribution, and high-frequency trading systems) require high-performance servers with powerful CPUs (high clock frequency, many cores) and ample memory, and may even need workload-specific tuning (such as GC tuning for Java applications).
  • Technology stack association: The chosen programming language (Java, Go, Node.js, Python, etc.) and framework directly affect the server's resource requirements (especially CPU and memory) and need to be confirmed with the technical leader.

Data storage server

Core Responsibilities: Persists all data generated by system operation, ensuring data reliability, consistency, and accessibility.

Selection key – data type

1) Structured data (relational database, RDBMS): Such as user information, order records, and inventory — tabular data with strict formats and relationships. Mainstream choices include MySQL (open source, widely used), PostgreSQL (powerful and extensible), SQL Server (Windows ecosystem), and Oracle (large enterprises). Consider transaction consistency requirements, data volume, complex query support, and licensing costs.

2) Unstructured/semi-structured data: Such as images, videos, audio, documents, log files, JSON/XML data. Common Schemes:

  • Distributed file system (DFS): For example, Ceph or GlusterFS. Suitable for massive storage accessed through a file interface (such as network drives or video-on-demand source files). Provides high reliability and scale-out capability.
  • Object storage: For example, AWS S3, MinIO (an open-source, S3-compatible solution), and Alibaba Cloud OSS. Data objects are accessed through an HTTP/RESTful API, a natural fit for media assets such as images and videos, with high scalability and durability. It is the preferred solution in the cloud era.
  • NoSQL database: For example, MongoDB (document-oriented, flexible schema), Cassandra/ScyllaDB (wide-column, high write throughput), Redis (key-value, optionally persisted), and Elasticsearch (search and analytics). Used for large data volumes, flexible schemas, and high-throughput scenarios that an RDBMS struggles to support efficiently.

3) Extreme security requirements: For sensitive data in finance, healthcare, or national security, consider dedicated encrypted storage servers or hardware security modules (HSMs). HSMs provide physical-level key management and cryptographic operations, offering the highest level of protection.

Caching server

Core Responsibilities: Temporarily stores frequently accessed hot data (such as user session information, popular product detail page data, and frequently queried results) in ultra-high-speed memory (RAM). Minimize direct access to back-end databases, significantly improve response times (milliseconds) and reduce database stress.

Mainstream technologies: Redis (feature-rich: multiple data structures, persistence, clustering, Lua scripting) and Memcached (simple and efficient, pure in-memory, multi-threaded). Redis has become the de facto standard due to its versatility.

Necessity: For any medium-to-high-concurrency project with obvious hot data, or where database access is the bottleneck, a cache server is standard equipment rather than optional. Product managers need to understand its critical role in improving user experience (speed) and system capacity.
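The core pattern these servers implement is "cache-aside": read from the cache first, fall back to the database on a miss, and populate the cache with an expiry. A minimal Python sketch — an in-memory dict standing in for Redis, and `fake_db` a hypothetical stand-in for a real database query:

```python
import time

class CacheAside:
    """Minimal cache-aside: check the cache first, query the backend on a miss,
    then store the result with a TTL so hot data is served from memory."""

    def __init__(self, db_lookup, ttl_seconds=60):
        self.db_lookup = db_lookup      # the slow backend query function
        self.ttl = ttl_seconds
        self.store = {}                 # in-memory stand-in for Redis: key -> (value, expiry)

    def get(self, key):
        entry = self.store.get(key)
        if entry is not None and entry[1] > time.time():
            return entry[0]             # cache hit: no database round-trip
        value = self.db_lookup(key)     # cache miss: go to the database
        self.store[key] = (value, time.time() + self.ttl)
        return value

# Hypothetical backend: count how often the "database" is actually hit.
db_calls = []
def fake_db(key):
    db_calls.append(key)
    return f"row-for-{key}"

cache = CacheAside(fake_db, ttl_seconds=60)
first = cache.get("product:1")   # miss: hits the database
second = cache.get("product:1")  # hit: served from memory
```

The second lookup never touches the backend, which is exactly the database-pressure relief described above.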

Load balancing server

Core Responsibilities: Acts as the first entry point for user requests to intelligently and evenly distribute traffic across multiple application servers (or service instances) in the backend. The core value is to improve the overall throughput of the system, avoid single point overload, and enhance fault tolerance.

Selection path

  • Software load balancing (SLB): For example, Nginx (HTTP/HTTPS reverse proxy), HAProxy (TCP/HTTP), and LVS (Linux kernel level). Deployed on ordinary servers; low cost, flexible configuration, and easy to scale. The first choice for early-stage projects and small-to-medium scenarios.
  • Hardware load balancer (HLB): For example, F5 BIG-IP and Citrix ADC. Dedicated hardware with extremely high performance (especially SSL offload), powerful features (such as WAF integration), and strong stability, but expensive and more complex to operate. Suitable for extremely high traffic volumes and demanding performance and stability requirements (such as large financial core systems).
  • Cloud provider load balancer: For example, AWS ALB/NLB and Alibaba Cloud SLB. Ready to use out of the box, elastically scalable, and well integrated with the cloud ecosystem; the natural choice for cloud-native projects. Product managers should pay attention to the billing model (by traffic/bandwidth/connections) and feature set.

Strategy evolution: Start quickly with a software solution in the early stages, and transition smoothly to hardware or more powerful cloud load balancing as business grows and performance requirements increase.
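The core distribution idea is easy to see in miniature. A toy round-robin balancer in Python — illustrative only; real balancers such as Nginx or HAProxy add health checks, weights, and connection draining:

```python
import itertools

class RoundRobinBalancer:
    """Hand requests to backend servers in turn, skipping servers that have
    been marked unhealthy (a toy model of software load balancing)."""

    def __init__(self, backends):
        self.backends = backends
        self.healthy = set(backends)
        self._cycle = itertools.cycle(backends)   # endless round-robin ring

    def mark_down(self, backend):
        """Simulate a failed health check: stop routing to this backend."""
        self.healthy.discard(backend)

    def pick(self):
        # Advance the ring until a healthy backend turns up.
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")
```

Even this sketch shows the two values called out above: traffic spreads evenly across the pool, and a failed server is routed around rather than taking the service down.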

Security and gateway servers

Core Responsibilities: Build the system's security boundary, control network access, monitor abnormal behavior, and ensure data security and business continuity. Their importance is often underestimated, but when an incident occurs the cost can be enormous.

Key components

  • Firewall servers/devices: Enforce access control policies (ACLs) at the network perimeter to filter illegitimate traffic (such as DDoS attempts and malicious scans); the first line of defense. Can be a dedicated hardware firewall or a software firewall running on a server (such as iptables or firewalld).
  • Gateway/data exchange platform: Used for secure, controlled data exchange between physically or logically isolated security domains (such as intranet and extranet, or production and test networks). Prevents direct penetration into the high-security zone.
  • Log audit server: Centrally collects, stores, and analyzes logs from application servers, databases, network devices, and operating systems. Used for security incident tracing, compliance audits (such as classified protection), troubleshooting, and performance analysis. The ELK stack (Elasticsearch, Logstash, Kibana) is a common solution.

Product manager focus: For projects involving user privacy data (PII), financial transactions, or government regulatory requirements, security and gateway server planning must be prioritized and tightly aligned with compliance requirements. Its cost is a protective investment, not overhead.

2. Number of servers

The number of servers should not be decided by gut feeling; it must be derived from quantifiable business metrics, with redundancy and scalability designed in.

Anchor core business indicators

Peak number of concurrent users: This is the core capacity indicator: the number of users online and actively operating during the busiest business period (such as midnight on an e-commerce Double 11 sale, the moment an online class starts, or when breaking news erupts). Acquisition methods: historical data analysis, business growth modeling, competitor benchmarks, and market research. Be sure to identify the real peak scenarios.

Data growth: Estimates the amount of new data (units: GB/TB/PB) and the number of records (such as the number of orders and log entries) added by the system every day, week, and month. This is critical for capacity planning for storage servers (disk space) and database servers (processing power). Ignoring this can lead to storage overloads, drastic performance drops, and even service disruptions.

Business peak scenario model: Gain a deeper understanding of the business and identify special events that may trigger traffic surges (flash sales, rush sales, big promotions, breaking news pushes). Design the maximum load capacity of the server based on the requirements of these extreme scenarios to ensure that the system does not crash under pressure.
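Data growth estimates translate directly into storage budgets. A back-of-the-envelope sketch in Python — the replication factor and 30% headroom are illustrative assumptions, not universal constants:

```python
def storage_needed_gb(daily_growth_gb, retention_months, replicas=3, headroom=1.3):
    """Project raw storage demand: daily growth x retention period,
    multiplied by the number of replicated copies and a safety headroom."""
    logical_gb = daily_growth_gb * retention_months * 30   # ~30 days per month
    return logical_gb * replicas * headroom

# e.g. 20 GB/day kept for 12 months with 3 replicas and 30% headroom:
demand = storage_needed_gb(20, 12)   # about 28 TB of raw capacity
```

Running the projection for several growth scenarios (baseline, expected, peak) is a cheap way to see how quickly "comfortable" disk capacity disappears.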

Quantify the processing capacity of a single machine

Performance testing is the gold standard: Theoretical estimates require practical verification. Use professional performance testing tools (such as JMeter, LoadRunner, Locust, or k6) to run stress tests and load tests against typical business scenarios (user login, browsing products, placing orders, paying).

Key performance indicators (KPIs) to obtain

  • TPS (Transactions Per Second): The number of transactions successfully processed by the system per second (such as "place order" transactions).
  • QPS (Queries Per Second): The number of query requests processed by the database or API per second.
  • Maximum number of stable concurrent users: The number of concurrent users that a single server can support under the premise that the response time (RT) is guaranteed to meet the standard (e.g., 95% of requests <1s).
  • Resource utilization: Tests the usage of CPU, memory, disk IO, and network IO to find bottlenecks.
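These indicators are linked by Little's Law (users in the system = throughput × time per request cycle), a handy sanity check on stress-test numbers. A sketch in Python, with illustrative figures:

```python
def supported_users(tps, response_time_s, think_time_s=0.0):
    """Little's Law (N = X * R): the concurrent users a server sustains equal
    its throughput times the duration of one request-plus-pause cycle.
    think_time_s models the pause between a user's successive requests."""
    return tps * (response_time_s + think_time_s)

# A server measured at 500 TPS with 0.2 s responses and ~1.8 s of user
# think time sustains about 1,000 concurrent users.
capacity = supported_users(500, 0.2, think_time_s=1.8)
```

If a vendor's claimed TPS and the concurrency target don't reconcile through this formula, one of the numbers deserves scrutiny.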

Calculation example: Suppose stress testing shows a single application server can stably handle 1,000 concurrent users (with RT meeting the standard). If the estimated peak is 5,000 concurrent users, the theoretical minimum number of servers is 5,000 / 1,000 = 5.

Incorporate redundancy and flexible design

Redundancy coefficient: No server is 100% reliable (hardware failures, software bugs, maintenance). To avoid service disruption from a single point of failure, deploy more servers than the theoretical minimum. Industry practice is usually 1.5 to 2 times the theoretical value: if 5 are needed in theory, deploy 7-10 in practice. This provides N+1 or N+2 fault tolerance.
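Putting the stress-test result and the redundancy coefficient together, the sizing arithmetic is a one-liner. A Python sketch using the figures from the example above (ceiling rounding pushes 1.5 × 5 = 7.5 up to 8, inside the 7-10 range):

```python
import math

def servers_needed(peak_users, per_server_capacity, redundancy=1.5):
    """Theoretical minimum from stress testing, scaled by a redundancy
    factor (1.5-2x is common industry practice), rounded up."""
    theoretical = math.ceil(peak_users / per_server_capacity)
    deployed = math.ceil(theoretical * redundancy)
    return theoretical, deployed

# 5,000 peak concurrent users, 1,000 per server:
minimum, with_redundancy = servers_needed(5000, 1000)         # (5, 8)
_, upper_bound = servers_needed(5000, 1000, redundancy=2.0)   # (5, 10)
```

The rounding matters: 5,200 peak users needs 6 servers before redundancy, not 5.2, which is why the function uses `math.ceil` at both steps.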

Scalability considerations

  • Scale out: Increase overall processing power by adding more servers with the same (or similar) configuration. Microservices, stateless applications, and distributed storage naturally support scale-out. This is the preferred mode in the cloud era, and it is necessary to reserve sufficient expansion space (such as load balancer capacity, network bandwidth, and cluster management capabilities) when planning.
  • Scale up: Upgrade the configuration of a single server (e.g., replace the CPU, memory, or SSD) to improve the capabilities. It is suitable for applications where single-machine bottlenecks are obvious and scale-out is difficult (such as some strong consistency database masters). The cost is higher, and upgrades may involve downtime.

Product manager decision points: Communicate closely with the architect to determine whether the system design prioritizes horizontal or vertical scaling. This directly affects the initial procurement/leasing strategy (a few large machines vs. many small ones) and the long-term cost model.

3. Server configuration

Server configuration (CPU, memory, storage, networking) is the cornerstone of performance and the majority of costs. Product managers need to find the best balance between meeting performance needs, controlling budgets, and allowing room for future expansion.

CPU

Selection basis: CPU is the core of computing power, and the type of application is the decisive factor in its selection.

  • General-purpose computing (application servers, web servers): Choose a CPU with many cores (such as 8, 16, or 32) and a higher clock frequency (GHz). More cores help process requests concurrently; higher frequency speeds up each individual request. Intel Xeon Scalable and AMD EPYC are the mainstream choices.
  • Compute-intensive (big data batch analysis, scientific computing, AI model training/inference, video transcoding): Requires extremely high single-core or multi-core performance, or even specific instruction-set optimizations (e.g., AVX-512). Choose the highest-performing CPU model available, and consider adding GPUs (such as NVIDIA A100/V100/T4) for acceleration; CPU+GPU collaboration is standard in these scenarios.

Pragmatic strategy: In the early stage, choose a mainstream configuration based on the estimated load (avoiding both the waste of top-tier hardware and the inadequacy of entry-level hardware). Leverage the elasticity of cloud services to scale up or scale out instances as business grows or performance bottlenecks appear. Monitoring CPU utilization is the basis for adjustment.

Memory (RAM)

Core role: Holds the operating system, running application processes, and cached data. Insufficient memory forces the system into frequent, slow disk swapping, and performance plummets.

Configuration recommendations

  • Application server baseline: Modern applications (especially Java/.NET applications) consume substantial memory. 16GB is the lowest reasonable starting point today; 32GB-64GB is recommended for medium-load applications.
  • High-load/memory-based applications: In-memory databases (such as Redis), big data processing (such as Spark), and large monolithic applications (such as complex ERP) may require 128GB to 256GB or more.
  • Avoid bottlenecks: Match CPU and memory. A powerful CPU with too little memory sits idle waiting for data to load (memory bottleneck); conversely, large memory paired with a weak CPU goes underutilized (CPU bottleneck). The technical team can usually recommend a reasonable ratio based on experience or testing.

Storage

Media selection – performance first

  • Solid-state drives (SSDs): Highly recommended for operating systems, applications, database files (especially transaction logs), and caches. They deliver far higher IOPS (read/write operations per second) and far lower latency (microseconds) than HDDs, greatly improving system responsiveness. NVMe SSDs offer the best performance; SATA SSDs are cost-effective. The first choice for online production environments.
  • Mechanical Hard Drive (HDD): The advantage is the low cost per capacity. It is suitable for storing large amounts of cold data or backup data that do not require high access speed (such as historical log archiving and video source file backup).

Data Security & Reliability – RAID technology

Combining multiple physical disks into logical volumes provides redundancy and/or performance improvements. Common Levels:

  • RAID 1 (mirroring): Two disks fully mirrored. Write performance drops slightly, but read performance can improve. Provides 100% redundancy (tolerates 1 failed disk). Suitable for small-capacity, high-availability needs (such as system disks).
  • RAID 5 (distributed parity): Requires at least 3 disks. Data and parity information are distributed across all disks; tolerates the failure of 1 disk. Strikes a good balance between capacity utilization, performance, and redundancy, suitable for application servers and general databases.
  • RAID 10 (RAID 1+0): Mirror (RAID 1), then stripe (RAID 0). Requires at least 4 disks. High performance (fast reads and writes) and high redundancy (one disk per mirror pair can fail). The recommended choice for critical applications such as databases, but at a higher cost (50% usable capacity).
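The capacity trade-offs of these levels can be checked with simple arithmetic. A Python sketch (assumes identically sized disks, which is the normal deployment):

```python
def raid_usable_tb(level, disks, disk_tb):
    """Usable capacity for common RAID levels, given identical disks."""
    if level == "RAID1":
        assert disks == 2, "RAID1 mirrors exactly two disks"
        return disk_tb                   # full mirror: 50% usable
    if level == "RAID5":
        assert disks >= 3, "RAID5 needs at least 3 disks"
        return (disks - 1) * disk_tb     # one disk's worth holds parity
    if level == "RAID10":
        assert disks >= 4 and disks % 2 == 0, "RAID10 needs >=4 disks in pairs"
        return disks // 2 * disk_tb      # mirrored pairs: 50% usable
    raise ValueError(f"unsupported level: {level}")

# Four 4 TB disks: RAID 5 yields 12 TB usable, RAID 10 only 8 TB.
```

The gap between RAID 5 and RAID 10 on the same hardware is exactly the price of the extra redundancy and write performance, which is why it belongs in the budget discussion.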

Product manager notice: SSD costs have dropped significantly, and prioritizing SSDs is one of the most effective investments for improving user experience and system performance. RAID configuration is the basic guarantee of data security; its cost must be included in the budget.

Network

Bandwidth requirements

  • Internet access (external bandwidth): For public-facing services, bandwidth needs depend on user traffic and average page size/data transfer. 100Mbps is a common starting point for small applications; large applications, video streaming, and download services may require 1Gbps, 10Gbps, or more. Confirm the billing method with your cloud provider or IDC (fixed bandwidth, 95th-percentile peak billing, or pay-by-traffic).
  • Internal network (intranet bandwidth): Traffic inside the server cluster (e.g., web server -> application server -> database server, or between distributed storage nodes) tends to be huge. A gigabit NIC (1Gbps) is the baseline. High-performance computing clusters, distributed storage (such as Ceph or HDFS), and big data transfers require 10Gbps or higher (25G/40G/100G), or the network becomes the bottleneck.

Network latency: For applications with high real-time requirements (online transactions, games, real-time communication), network latency (ping) is crucial. Choosing a cloud region or IDC data center that is geographically close to users can significantly reduce latency.
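External bandwidth can be roughed out from request rate and response size. A Python sketch — the request rate, response size, and 1.5x headroom below are illustrative assumptions:

```python
def egress_mbps(peak_requests_per_s, avg_response_kb, headroom=1.5):
    """Rough public-bandwidth estimate: peak request rate x average response
    size, converted from kilobytes to megabits, with a safety headroom."""
    raw_mbps = peak_requests_per_s * avg_response_kb * 8 / 1000
    return raw_mbps * headroom

# 500 requests/s of ~20 KB responses: 80 Mbps raw, 120 Mbps with headroom,
# so the common 100 Mbps starting point would already be saturated.
needed = egress_mbps(500, 20)
```

Under 95th-percentile billing, it is this peak figure, not the average, that drives the bill, so the estimate should always use peak-scenario inputs.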

4. Application process

The acquisition of servers involves multiple links such as budgeting, procurement, and operation and maintenance, and product managers need to effectively promote the process to ensure that resources are in place on time.

Demand analysis and program preparation

In-depth discussion: Product manager leads, and jointly reviews project requirements documents and system architecture design with the architects, development leaders, and operation and maintenance leaders of the technical team.

Clarify the specifications: Jointly finalize the server type, quantity, detailed configuration (CPU model/core count, memory size/type, storage type/capacity/RAID, NIC requirements, operating system), deployment environment (physical machines, virtual machines, containers/Kubernetes), and hosting model (self-built IDC, public cloud, or private cloud).

Output documentation: Jointly compile the “Server Resource Requirements Specification” with the technical team. Content should include:

  • Clear project background and goals.
  • System architecture diagram (indicate server roles).
  • Detailed list of servers (type, quantity, configuration parameters).
  • Key performance indicator requirements (such as supported concurrency and data processing capacity).
  • Deployment time requirements.
  • Preliminary cost estimate (hardware purchase price/cloud service monthly fee estimate).
  • Comparison of optional solutions (such as different configuration tiers and different cloud service provider packages).
  • Brief description of technical feasibility.

Internal approval and budget requests

Target audience: The product manager promotes the plan and wins support from the technical director/CTO (technical feasibility approval), the finance department (budget review), and management (final decision).

Communication focus

  • Necessity: Clearly explain how the server configuration supports key business goals (such as ensuring stability, improving user-perceived speed, and meeting compliance storage requirements).
  • Benefit analysis: Quantify or qualitatively describe the value of the investment (reducing downtime losses, improving user satisfaction/retention, supporting the launch of new features).
  • Cost-effectiveness: For large expenditures, prepare a more detailed cost-benefit analysis (ROI analysis) to compare the TCO (total cost of ownership) of different options.
  • Risk statement: Performance risks, stability risks, and security compliance risks that may be caused by substandard configurations.
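A TCO comparison need not be elaborate to be persuasive. A minimal Python sketch; all dollar figures below are hypothetical placeholders, not real quotes:

```python
def tco(capex, monthly_opex, months):
    """Total cost of ownership over a horizon: up-front spend plus recurring cost."""
    return capex + monthly_opex * months

# Hypothetical 3-year comparison: buy hardware (high capex, low opex)
# versus rent cloud capacity (no capex, higher opex).
on_prem = tco(capex=50_000, monthly_opex=2_000, months=36)   # 122,000
cloud = tco(capex=0, monthly_opex=4_000, months=36)          # 144,000
```

Extending the horizon or changing the opex assumptions can flip the conclusion, which is precisely why the comparison belongs in the budget request rather than a single headline number.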

Document support: Submit the Server Resource Requirements Statement and supplement the demonstration report materials as needed.

Supplier selection and procurement implementation

Procurement/O&M leads, and product managers confirm whether the requirements match.

Supplier assessment

  • Hardware procurement: Evaluate brands (Dell, HPE, Lenovo, Inspur, etc.), the model's market reputation, after-sales service level (response time, spare parts supply), price competitiveness, and compliance.
  • Cloud service leasing: Evaluate the availability zones, service features, performance SLAs, billing models (reserved instances, on-demand, spot), technical support, ecosystem compatibility, and cost optimization tools of the major cloud providers (AWS, Azure, GCP, Alibaba Cloud, Tencent Cloud, Huawei Cloud) in the target region. The product manager must ensure the selected cloud plan (such as the EC2 instance type) meets the configuration requirements determined earlier.

Contract signing

  • Hardware: Specify the equipment's detailed specifications, quantity, delivery time, acceptance criteria, warranty terms (period, scope), and maintenance services.
  • Cloud services: Sign a service agreement that clarifies service level agreements (SLAs), data security and privacy terms, billing rules, and termination terms. Pay special attention to the feasibility of data migration and export.

Deployment, testing, and acceptance

The technical team executes, and the product manager organization participates in acceptance testing and confirms whether the requirements match.

Environment deployment: The O&M or development team is responsible for racking servers (physical machines), provisioning and configuring cloud resources, installing operating systems, configuring networks, and deploying base software.

System integration and debugging: Incorporate the new server into the overall system for joint debugging.

Acceptance test: Product managers should organize or participate in the acceptance process and verify based on the performance indicators and functional requirements in the Server Resource Requirements Specification. The test includes:

  • Basic functionality testing (whether the server is accessible and whether the service starts normally).
  • Performance stress testing (verifying whether the expected TPS/QPS/concurrent number of users support capacity is reached).
  • Stability test (whether it runs stably over an extended period).
  • Security configuration checks (firewall rules, access controls, etc.).
  • Backup recovery drill validation.

Official launch and O&M handover: After acceptance, the servers go into production. Establish a complete monitoring system (Zabbix, Prometheus + Grafana, or cloud monitoring), alerting mechanism, backup strategy, and daily O&M processes.
