In recent years, Vision-Language Models (VLMs) have exhibited powerful capacity of reasoning, decomposing long-horizon tasks and motion planning in robotic manipulation tasks. However, the current ...