I am a Deep Learning Expert at NEURA Robotics. I received an M.Sc. in Data Science and a bachelor's degree in Applied Mathematics from one of the best Russian universities. Before that, I worked as a Research Scientist at Samsung Research for 5 years. Overall, I have over 6 years of industrial and research experience, focusing on 2D and 3D computer vision throughout my career. I have co-authored 15+ research papers accepted to top-tier conferences, prepared a number of technical patents, and gained hands-on experience with various deep learning models (CNN, RNN, Transformer) and frameworks (PyTorch, TensorFlow).
Solved various 2D and 3D computer vision tasks in robotic scenarios. Adapted existing methods and developed new ones for 3D reconstruction, object segmentation, and antipodal and suction grasp generation. Generated data for training and benchmarking the developed methods. Contributed to documentation on AI Safety; wrote customer documentation and internal guides.
Developed state-of-the-art algorithms addressing 2D and 3D computer vision tasks: SLAM, visual and sensor-based localization, 3D reconstruction of indoor scenes, depth estimation, object segmentation, and 2D and 3D object detection. Formulated scientific hypotheses and conducted experiments to validate them. Wrote a number of academic papers accepted to top-tier CV and robotics conferences such as CVPR, ECCV, WACV, and IROS; overall, contributed to 16 papers. Outstanding Reviewer at the NeurIPS 2022 Datasets and Benchmarks track. Hold international patents on technical inventions. Developed demos and PoCs on visual odometry, visual indoor navigation, and fruit and vegetable weight measurement from RGB-D data. Collected, labeled, and prepared data for prototyping and research purposes: visual navigation, 3D reconstruction of indoor scenes, and visual analytics for retail. Mastered all kinds of writing: academic manuscripts, annual reports, patents, tasks for data annotators, documentation, and internal guides.
Contributed to a project on cinema visitor monitoring based on video surveillance data. Developed algorithms based on deep neural networks (segmentation, classification, detection, tracking). Collected, labeled, and prepared training data. Conducted experiments and presented the results in the form of reports and slides. What started as a small toy project run by one intern (me) proved so successful that it convinced top management to create a computer vision department, mostly to develop and maintain the cinema monitoring system. The implemented solution was used to collect statistics in over 700 cinema halls in Russia.
Completed courses: Bayesian Networks, Functional Analysis, Convex Optimization, Autonomous Driving
Thesis: Visual Odometry with Ego-motion Sampling
GPA: 4.5 (8.68 / 10)
Completed courses: Machine Learning, Deep Learning, Statistical Learning Theory, NLP, Computer Vision, Reinforcement Learning, Bayesian ML, Advanced Algorithms and Data Structures, Probability Theory and Statistics
Thesis: Person Re-identification Based on Visual Attributes
GPA: 4.69 (8.1 / 10)
We present OneFormer3D, a unified, simple, and effective model jointly solving semantic, instance, and panoptic segmentation of 3D point clouds. The model is trained end-to-end in a single run with panoptic annotations, and achieves top performance on all three tasks simultaneously, thereby setting a new state-of-the-art in several 3D segmentation benchmarks.
We conducted a user study of clicking patterns and found that the standard assumption made in the common evaluation strategy may not hold, which makes the accuracy and robustness of existing methods questionable. We propose a novel evaluation strategy providing a more comprehensive analysis of a model's performance. In addition, we introduce a novel benchmark for measuring the robustness of interactive segmentation and report the results of an extensive evaluation of numerous models.
Selfies captured from a short distance might look unnatural due to heavy distortions and improper posing. We propose SUPER, a novel method of correcting distortions and adjusting head poses in selfies. SUPER combines generative and rendering approaches to ensure correct geometry while preserving identity.
We propose FAWN, a modification of truncated signed distance function (TSDF) reconstruction methods. FAWN takes the standard scene structure into account by detecting walls and floor in a scene and penalizing their normals for deviating from the horizontal and vertical directions, respectively. We add FAWN to state-of-the-art TSDF reconstruction methods and demonstrate a quality gain on a number of indoor benchmarks.
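For illustration, here is a minimal sketch of what such a structural normal penalty could look like; the tensor names, the mask convention, and the choice of the up axis are assumptions made for this example, not the exact formulation used in the paper.

```python
import torch

def structural_normal_penalty(normals, wall_mask, floor_mask):
    """Penalize wall normals deviating from horizontal and floor normals
    deviating from vertical (illustrative sketch, not the exact paper loss).

    normals:    (N, 3) unit surface normals of scene points
    wall_mask:  (N,) boolean mask of points detected as walls
    floor_mask: (N,) boolean mask of points detected as floor
    """
    up = normals.new_tensor([0.0, 0.0, 1.0])  # assume z is the up axis
    loss = normals.new_zeros(())

    if wall_mask.any():
        # Wall normals should be horizontal, i.e. orthogonal to the up axis.
        loss = loss + (normals[wall_mask] @ up).abs().mean()
    if floor_mask.any():
        # Floor normals should be vertical, i.e. aligned with the up axis.
        loss = loss + (1.0 - (normals[floor_mask] @ up).abs()).mean()
    return loss
```

A term of this kind would typically be added to the main TSDF reconstruction loss with a small weight.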
Single-view depth estimation methods cannot guarantee consistency throughout a sequence of frames. Minimizing the discrepancy across multiple views takes hours, making such methods infeasible in practice. Our MeDEA takes RGB frames with camera parameters and outputs temporally consistent depth maps orders of magnitude faster than previous test-time optimization approaches. MeDEA sets a new state-of-the-art on indoor benchmarks and handles smartphone-captured data.
Most 3D instance segmentation methods are bottom-up and typically include resource-exhaustive post-processing. We address 3D instance segmentation with TD3D: a pioneering cluster-free, fully-convolutional approach trained end-to-end. This is the first top-down method outperforming bottom-up approaches in the 3D domain. It demonstrates outstanding accuracy while being up to 2.6x faster at inference than the current state-of-the-art grouping-based approaches.
We present the first neural inverse rendering approach capable of processing inter-reflections. We formulate a novel neural global illumination model, which estimates both direct environment light and indirect light as a surface light field, and build a Monte Carlo differentiable rendering framework. Our framework effectively handles complex lighting effects and facilitates the end-to-end reconstruction of physically-based spatially-varying materials.
We introduce a fast fully-convolutional 3D object detection model trained end-to-end that achieves state-of-the-art results on standard benchmarks. Moreover, to take advantage of both point cloud and RGB inputs, we propose an early fusion of 2D and 3D features. The versatile and efficient fusion module can be applied to make a conventional 3D object detection method multimodal, thereby improving its detection accuracy.
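As a rough illustration of the early-fusion idea, the sketch below projects 3D points onto the image plane, samples 2D backbone features at the projections, and concatenates them with the per-point features; the function name, feature shapes, and the pinhole projection convention are assumptions for this example rather than the exact module from the paper.

```python
import torch
import torch.nn.functional as F

def fuse_2d_3d_features(points, point_feats, image_feats, intrinsics):
    """Early fusion sketch: attach 2D image features to 3D points.

    points:      (N, 3) point coordinates in the camera frame
    point_feats: (N, C3) per-point features from the 3D backbone
    image_feats: (1, C2, H, W) feature map from the 2D backbone
    intrinsics:  (3, 3) pinhole camera matrix
    """
    # Project points onto the image plane (simple pinhole model).
    uvw = points @ intrinsics.T                     # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)    # (N, 2) pixel coords

    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    _, _, h, w = image_feats.shape
    grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)], dim=-1) * 2 - 1
    grid = grid.view(1, -1, 1, 2)                   # (1, N, 1, 2)

    # Bilinearly sample 2D features at the projected locations.
    sampled = F.grid_sample(image_feats, grid, align_corners=True)
    sampled = sampled.view(image_feats.shape[1], -1).T   # (N, C2)

    # Concatenate 2D and 3D features for the detection head.
    return torch.cat([point_feats, sampled], dim=-1)     # (N, C3 + C2)
```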
Interactive segmentation can be used to speed up and simplify image editing and labeling. Most approaches use clicks, which might be inconvenient when selecting small objects. We present a first-in-class contour-based interactive segmentation approach and demonstrate that a single contour provides the same accuracy as multiple clicks, thus reducing the number of interactions.
FCAF3D is a first-in-class fully convolutional anchor-free indoor 3D object detection method. FCAF3D can handle large-scale scenes with minimal runtime through a single feed-forward pass. Moreover, we propose a novel parametrization of oriented bounding boxes that consistently improves detection accuracy. State-of-the-art on ScanNet, SUN RGB-D, and S3DIS datasets.
A technical floorplan depicts walls, partitions, and doors, making it a valuable source of information about the general scene structure. We propose a novel floorplan-aware 3D reconstruction algorithm that extends bundle adjustment and show that using a floorplan improves 3D reconstruction quality on the Redwood dataset and our self-captured data.
ImVoxelNet is a fully convolutional 3D object detection method that operates in monocular and multi-view modes. ImVoxelNet takes an arbitrary number of RGB images with camera poses as inputs. General-purpose: state-of-the-art on outdoor (KITTI and nuScenes) and indoor (SUN RGB-D and ScanNet) datasets.
GP2 is a General-Purpose and Geometry-Preserving training scheme for single-view depth estimation models. GP2 allows training on a mixture of a small amount of geometrically correct depth data and voluminous stereo data. State-of-the-art results in general-purpose geometry-preserving single-view depth estimation.
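Below is a minimal sketch of the kind of mixed-supervision training step such a scheme implies; the specific loss choices (a scale-invariant loss for geometrically correct depth, a scale-and-shift-invariant loss for stereo data), the weighting, and all names are assumptions for illustration, not the exact losses from the paper.

```python
import torch

def scale_invariant_loss(pred, gt):
    """Scale-invariant log-depth loss, used here for the geometrically
    correct depth data in this illustrative sketch."""
    d = torch.log(pred.clamp(min=1e-6)) - torch.log(gt.clamp(min=1e-6))
    return (d ** 2).mean() - d.mean() ** 2

def shift_scale_invariant_loss(pred, gt):
    """Compare prediction and ground truth after normalizing both up to
    scale and shift; a common choice for stereo-derived (up-to-scale) depth."""
    pred_n = (pred - pred.median()) / (pred - pred.median()).abs().mean().clamp(min=1e-6)
    gt_n = (gt - gt.median()) / (gt - gt.median()).abs().mean().clamp(min=1e-6)
    return (pred_n - gt_n).abs().mean()

def mixed_training_step(model, geometric_batch, stereo_batch, alpha=0.5):
    """One hypothetical mixed step: a small geometrically correct batch
    anchors the geometry, a large stereo batch provides visual diversity."""
    img_g, depth_g = geometric_batch
    img_s, depth_s = stereo_batch
    return scale_invariant_loss(model(img_g), depth_g) \
        + alpha * shift_scale_invariant_loss(model(img_s), depth_s)
```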
A synthetic dataset for training and benchmarking semantic SLAM. Contains 200 sequences of 3000-5000 frames (RGB images generated using physically-based rendering, depth, IMU) and ground truth occupancy grids. In addition, we establish baseline results for SLAM, mapping, semantic and panoptic segmentation on our dataset.
A feasibility study of RGB-D SLAM. We extensively evaluate the popular ORB-SLAM2 on several benchmarks, perform a statistical analysis of the results, and find correlations between the metric values and the attributes of the trajectories. While the accuracy is high, robustness remains an issue.
Instead of ego-motion estimation, we address the dual problem of estimating the motion of a scene w.r.t. a static camera. Using optical flow and depth, we calculate the motion of each point of a scene in terms of 6DoF and create motion maps, each one addressing a single degree of freedom. Such a decomposition improves accuracy over naive stacking of depth and optical flow.
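To give a flavor of the underlying computation, here is a simplified sketch that back-projects pixels using depth, follows optical flow to the next frame, and obtains per-pixel 3D motion; the function name, the pinhole convention, and the reduction to plain 3D translation (rather than the full per-point 6DoF decomposition described above) are simplifications made for this example.

```python
import numpy as np

def per_pixel_scene_motion(depth0, depth1, flow, intrinsics):
    """Back-project pixels with depth at frame t, follow optical flow to
    frame t+1, and back-project again to get per-pixel 3D motion vectors
    (a simplified translational sketch, not the full 6DoF decomposition).

    depth0, depth1: (H, W) depth maps at frames t and t+1
    flow:           (H, W, 2) optical flow from frame t to t+1 (dx, dy)
    intrinsics:     (3, 3) pinhole camera matrix
    """
    h, w = depth0.shape
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]

    # Pixel grid at frame t and its flow-displaced positions at frame t+1.
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    u1, v1 = u + flow[..., 0], v + flow[..., 1]

    # Sample depth at frame t+1 at the displaced (rounded) positions.
    ui = np.clip(np.round(u1).astype(int), 0, w - 1)
    vi = np.clip(np.round(v1).astype(int), 0, h - 1)
    z1 = depth1[vi, ui]

    # Back-project both frames to 3D in the static camera frame.
    p0 = np.stack([(u - cx) * depth0 / fx, (v - cy) * depth0 / fy, depth0], axis=-1)
    p1 = np.stack([(u1 - cx) * z1 / fx, (v1 - cy) * z1 / fy, z1], axis=-1)

    # Per-pixel 3D motion of the scene w.r.t. the static camera.
    return p1 - p0
```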