[macOS] mss.grab() intermittently blocks for exactly 30 seconds under high concurrency #436

@gebilaoman

Description

I am encountering a severe performance issue where mss.grab() intermittently blocks the main thread for almost exactly 30 seconds (e.g., 30.04s) when running multiple capture tasks concurrently on macOS. This behavior resembles a system-level timeout (possibly related to CoreGraphics or WindowServer locking up).

The issue occurs randomly after running the capture loop for some time (ranging from 20 minutes to several hours).

To confirm this is not an isolated hardware issue, I have tested this on multiple machines with identical configurations, and the issue is reproducible on all of them.

Environment

  • OS: macOS on Apple M4, 16 GB (Apple Silicon)
  • Python: 3.11.9
  • MSS: 10.0.0
  • Context: Running inside a standard terminal (not a specialized env). Screen recording permissions are granted.

Reproduction Steps

I have isolated the issue using a minimal script that simulates my production workload (6 concurrent async tasks taking screenshots in a loop).

  1. Run the script below.
  2. Wait for an indefinite amount of time (in my case, it took about 22 minutes).
  3. Observe the logs. Eventually, one of the tasks will report a capture time of ~30,000 ms.

import asyncio
import time
import logging
import os
import sys
import cv2
import numpy as np
import mss
from PIL import Image
from datetime import datetime
import random

# Add the project root to the path so project modules can be imported
sys.path.append(os.getcwd())

# Configure logging
# Dedicated handler that writes only ERROR-level records to a separate file
error_handler = logging.FileHandler("debug_error.log", mode='w', encoding='utf-8')
error_handler.setLevel(logging.ERROR)
error_handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler(sys.stdout),
        logging.FileHandler("debug_test.log", mode='w', encoding='utf-8'),
        error_handler
    ],
    force=True  # Force reconfiguration of the root logger
)
logger = logging.getLogger("DebugTest")

# Make stdout line-buffered so log lines appear immediately
sys.stdout.reconfigure(line_buffering=True)

class SimpleEventLoopMonitor:
    """Minimal event-loop monitor: logs when a 1-second sleep overruns."""
    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.running = False
        self.last_tick = time.time()
        self._task = None

    async def start(self):
        self.running = True
        # Keep a reference so the monitor task cannot be garbage-collected
        self._task = asyncio.create_task(self._monitor_loop())

    async def stop(self):
        self.running = False

    async def _monitor_loop(self):
        logger.info("🟢 Event loop monitor started...")
        while self.running:
            start_time = time.time()
            await asyncio.sleep(1)  # Should take about 1 second
            actual_sleep = time.time() - start_time

            if actual_sleep > self.threshold + 1:  # Slept longer than threshold + 1 second
                logger.error(f"⚠️ [SEVERE BLOCK] Event loop stalled! Expected 1 s sleep, actual: {actual_sleep:.2f} s")
            elif actual_sleep > 1.5:
                logger.warning(f"⚠️ [MINOR LAG] Event loop delayed. Actual: {actual_sleep:.2f} s")
            else:
                # Heartbeat is normal; do not flood the log
                pass

class DebugTester:
    def __init__(self):
        logger.info("Initializing MSS capture module...")
        self.sct = mss.mss()
        logger.info("MSS initialized")

        # Simulate a standard window size (see capture_service.py)
        self.monitor_region = {"top": 100, "left": 100, "width": 850, "height": 615}

    def _process_image_sync(self, screenshot):
        """Simulate synchronous image processing."""
        t0 = time.time()

        # 1. Image conversion (PIL -> CV2)
        img_array = np.array(screenshot)
        img_cv2 = cv2.cvtColor(img_array, cv2.COLOR_RGB2BGR)
        t1 = time.time()

        # 2. Simulate heavy image comparison (e.g. resize, absdiff)
        # Mimics the preprocessing done in compare_images_direct
        resized = cv2.resize(img_cv2, (850, 615))
        gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)
        # Run some throwaway computation to simulate load
        for _ in range(5):
            cv2.GaussianBlur(gray, (5, 5), 0)
        t2 = time.time()

        return {
            "convert_time": (t1 - t0) * 1000,
            "cv2_time": (t2 - t1) * 1000,
            "total_time": (t2 - t0) * 1000,
        }

    async def run_capture_loop(self, window_id):
        logger.info(f"🚀 [Window {window_id}] Starting capture loop")

        # Stagger the start of each task
        await asyncio.sleep(random.uniform(0, 2))

        count = 0
        try:
            while True:
                loop_start = time.time()

                # Step 1: screenshot (MSS)
                shot_start = time.time()
                # Every task grabs the same region; the point is to measure the API call overhead
                screenshot_raw = self.sct.grab(self.monitor_region)
                screenshot = Image.frombytes('RGB', screenshot_raw.size, screenshot_raw.rgb)
                shot_time = (time.time() - shot_start) * 1000

                if shot_time > 1000:
                    logger.error(f"🛑 [Window {window_id}-{count}] MSS capture took too long: {shot_time:.2f}ms")

                # Step 2: image processing (synchronous, blocking)
                process_res = self._process_image_sync(screenshot)

                loop_time = (time.time() - loop_start) * 1000

                # Emit the performance log line
                log_msg = (
                    f"[Win{window_id}-{count}] total: {loop_time:.1f}ms | "
                    f"capture: {shot_time:.1f}ms | "
                    f"process: {process_res['total_time']:.1f}ms "
                    f"(CV2: {process_res['cv2_time']:.1f})"
                )

                if loop_time > 2000:  # Flag iterations longer than 2 seconds
                    logger.warning(f"⚠️ {log_msg} - slow loop detected")
                else:
                    # Log every iteration
                    logger.info(log_msg)

                count += 1
                # Simulate the production interval (about 0.5 s, with some jitter)
                await asyncio.sleep(random.uniform(0.4, 0.6))

                # === Simulated occasional stall (for testing only) ===
                # if random.random() < 0.05:  # 5% probability
                #     logger.warning(f"🔥 [Win{window_id}] Simulating a 3.5 s synchronous block...")
                #     time.sleep(3.5)  # Note: time.sleep blocks the entire event loop
                # ==========================

        except asyncio.CancelledError:
            logger.info(f"[Window {window_id}] Task cancelled")
        except Exception as e:
            logger.error(f"[Window {window_id}] Exception during test: {e}", exc_info=True)

async def main():
    # 1. Start the monitor
    monitor = SimpleEventLoopMonitor(threshold=2.0)  # 2-second threshold
    await monitor.start()

    # 2. Start several test windows (e.g. 6)
    tester = DebugTester()
    tasks = []
    for i in range(6):
        tasks.append(asyncio.create_task(tester.run_capture_loop(i)))

    try:
        # Wait for all tasks (in practice this runs until Ctrl+C)
        await asyncio.gather(*tasks)
    except asyncio.CancelledError:
        pass
    finally:
        await monitor.stop()

if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        pass

Logs

Here is the specific log segment when the freeze happened (after running normally for ~22 minutes):

2025-11-28 15:05:03,498 - INFO - [Win5-2381] total: 22.9ms | capture: 21.2ms | process: 1.7ms (CV2: 1.2)
2025-11-28 15:05:03,521 - INFO - [Win4-2366] total: 22.1ms | capture: 20.5ms | process: 1.3ms (CV2: 0.8)
2025-11-28 15:05:03,772 - INFO - [Win0-2373] total: 22.1ms | capture: 20.7ms | process: 1.5ms (CV2: 0.6)
2025-11-28 15:05:03,823 - INFO - [Win2-2368] total: 21.7ms | capture: 20.8ms | process: 0.9ms (CV2: 0.6)
2025-11-28 15:05:03,884 - INFO - [Win3-2376] total: 21.9ms | capture: 19.9ms | process: 1.6ms (CV2: 1.1)
2025-11-28 15:05:03,908 - INFO - [Win1-2366] total: 20.7ms | capture: 19.7ms | process: 1.0ms (CV2: 0.7)
2025-11-28 15:05:03,946 - INFO - [Win5-2382] total: 21.9ms | capture: 20.4ms | process: 1.5ms (CV2: 0.8)
2025-11-28 15:05:03,992 - INFO - [Win4-2367] total: 21.8ms | capture: 20.2ms | process: 1.5ms (CV2: 0.8)
2025-11-28 15:05:04,266 - INFO - [Win2-2369] total: 22.3ms | capture: 20.7ms | process: 1.2ms (CV2: 0.6)
2025-11-28 15:05:04,298 - INFO - [Win0-2374] total: 20.8ms | capture: 19.3ms | process: 1.1ms (CV2: 0.5)
2025-11-28 15:05:04,377 - INFO - [Win3-2377] total: 23.2ms | capture: 22.0ms | process: 1.2ms (CV2: 0.9)
2025-11-28 15:05:04,487 - INFO - [Win4-2368] total: 22.6ms | capture: 21.1ms | process: 0.9ms (CV2: 0.5)
2025-11-28 15:05:04,508 - INFO - [Win1-2367] total: 21.1ms | capture: 19.6ms | process: 0.9ms (CV2: 0.4)
2025-11-28 15:05:04,532 - INFO - [Win5-2383] total: 21.1ms | capture: 19.4ms | process: 1.5ms (CV2: 1.0)
2025-11-28 15:05:04,771 - INFO - [Win2-2370] total: 20.6ms | capture: 19.7ms | process: 0.9ms (CV2: 0.4)
2025-11-28 15:05:04,792 - INFO - [Win0-2375] total: 21.0ms | capture: 19.4ms | process: 1.5ms (CV2: 1.0)
2025-11-28 15:05:34,868 - ERROR - 🛑 [Window 3-2378] MSS capture took too long: 30047.48ms
2025-11-28 15:05:34,870 - WARNING - ⚠️ [Win3-2378] total: 30049.6ms | capture: 30047.5ms | process: 1.5ms (CV2: 0.9) - slow loop detected
2025-11-28 15:05:34,896 - INFO - [Win5-2384] total: 25.7ms | capture: 24.5ms | process: 1.0ms (CV2: 0.5)
2025-11-28 15:05:34,919 - INFO - [Win1-2368] total: 22.9ms | capture: 21.7ms | process: 0.9ms (CV2: 0.5)
2025-11-28 15:05:34,940 - INFO - [Win4-2369] total: 20.7ms | capture: 19.4ms | process: 1.3ms (CV2: 0.8)
2025-11-28 15:05:34,961 - INFO - [Win2-2371] total: 21.3ms | capture: 20.0ms | process: 1.0ms (CV2: 0.6)
2025-11-28 15:05:34,984 - INFO - [Win0-2376] total: 22.7ms | capture: 20.5ms | process: 1.8ms (CV2: 1.5)
2025-11-28 15:05:34,984 - ERROR - ⚠️ [SEVERE BLOCK] Event loop stalled! Expected 1 s sleep, actual: 30.42 s
2025-11-28 15:05:35,381 - INFO - [Win1-2369] total: 23.6ms | capture: 21.2ms | process: 1.0ms (CV2: 0.5)
2025-11-28 15:05:35,403 - INFO - [Win3-2379] total: 21.5ms | capture: 19.9ms | process: 1.2ms (CV2: 0.7)
2025-11-28 15:05:35,426 - INFO - [Win4-2370] total: 22.5ms | capture: 20.9ms | process: 1.4ms (CV2: 0.9)
2025-11-28 15:05:35,458 - INFO - [Win2-2372] total: 20.7ms | capture: 19.7ms | process: 1.0ms (CV2: 0.6)
2025-11-28 15:05:35,482 - INFO - [Win5-2385] total: 22.0ms | capture: 20.7ms | process: 1.3ms (CV2: 0.7)
2025-11-28 15:05:35,526 - INFO - [Win0-2377] total: 22.0ms | capture: 20.8ms | process: 1.2ms (CV2: 0.7)
2025-11-28 15:05:35,837 - INFO - [Win1-2370] total: 21.5ms | capture: 20.0ms | process: 1.1ms (CV2: 0.6)
2025-11-28 15:05:35,860 - INFO - [Win4-2371] total: 22.2ms | capture: 21.1ms | process: 1.2ms (CV2: 0.7)
2025-11-28 15:05:35,892 - INFO - [Win3-2380] total: 21.9ms | capture: 20.4ms | process: 1.6ms (CV2: 1.0)
2025-11-28 15:05:35,915 - INFO - [Win2-2373] total: 22.2ms | capture: 20.8ms | process: 1.4ms (CV2: 0.8)
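
For long runs it may be easier to locate such freezes automatically than to read the log by eye. The following is a minimal sketch, assuming every log line starts with the default logging timestamp shown above (YYYY-MM-DD HH:MM:SS,mmm). The helper name find_gaps and the 5-second threshold are illustrative choices, not part of the original script.

```python
from datetime import datetime


def find_gaps(lines, threshold_s=5.0):
    """Yield (previous_ts, current_ts, gap_seconds) for each pair of
    consecutive timestamped lines that are more than threshold_s apart."""
    prev = None
    for line in lines:
        try:
            # The first 23 characters hold "YYYY-MM-DD HH:MM:SS,mmm"
            ts = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue  # Skip lines without a leading timestamp (e.g. tracebacks)
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold_s:
                yield prev, ts, gap
        prev = ts
```

Running it over debug_test.log (for example `list(find_gaps(open("debug_test.log", encoding="utf-8")))`) should report the silent window between 15:05:04,792 and 15:05:34,868 in the excerpt above, during which no task logged anything.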

Analysis / Suspicion

  • The duration is consistently around 30 seconds, which suggests a timeout in the underlying macOS window server or graphics subsystem (CGWindowListCreateImage?).
  • Because grab() is a synchronous call made on the event-loop thread, the stall freezes the entire Python asyncio event loop. Even tasks that are only sleeping or waiting are not scheduled during this 30 s window.
  • It happens when multiple tasks call sct.grab frequently. Note that the tasks are asyncio coroutines running on a single thread, so the grab() calls are interleaved rather than truly parallel.
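
To check the suspicion that the time is spent inside grab() itself, one option is a watchdog built on the standard-library faulthandler module, which can dump the Python traceback of every thread if a block of code runs too long. This is a sketch only: the stall_watchdog name and the 5-second default are hypothetical, and the dump shows Python frames only, so it can confirm that the stall is inside grab() but not which native call is waiting.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def stall_watchdog(timeout=5.0, file=None):
    """Dump all threads' Python tracebacks if the with-block exceeds timeout seconds."""
    target = file if file is not None else sys.stderr
    # Arms a background timer; the dump is written from a watchdog thread,
    # so it still fires while the calling thread is stuck in a native call.
    faulthandler.dump_traceback_later(timeout, repeat=False, file=target)
    try:
        yield
    finally:
        # Disarm the timer once the block has finished
        faulthandler.cancel_dump_traceback_later()
```

In the reproduction script this would wrap the capture call, for example `with stall_watchdog(5.0): screenshot_raw = self.sct.grab(self.monitor_region)`.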

Question

Is there any known limitation with CoreGraphics concurrency on macOS, or a way to configure a lower timeout for grab() so it fails fast instead of blocking the entire application for 30s?
