
Automated Mouses Organs Segmentation
Introduction
The automation of medical image analysis has become a cornerstone of modern biomedical research, offering unprecedented opportunities to accelerate data processing and enhance the accuracy of scientific studies. In this project, we address the challenge of automated organ segmentation in ultrasound images of mouse fetuses, specifically targeting the hearts, livers, and placentas. These organs are critical for developmental studies, and their precise segmentation is essential for understanding growth patterns, detecting anomalies, and evaluating the effects of genetic or environmental factors.
Ultrasound imaging is a non-invasive and widely used technique in prenatal research, but manually segmenting organs in these images is a time-consuming and labor-intensive task. Researchers often spend hours annotating images, which not only slows down the research process but also introduces the risk of human error. By developing an automated segmentation model, we aim to significantly reduce this burden, enabling researchers to focus on higher-level analysis and interpretation. This project was conducted as part of my professional experience at the SAMPL Lab within the Weizmann Institute of Science, a leading institution in computational biology and biomedical research.
The primary goal of this project is to create three distinct models, each dedicated to segmenting one of the target organs: hearts, livers, and placentas. These models will generate Region of Interest (ROI) files, which can be visualized and analyzed using specialized software, providing researchers with precise and reproducible annotations. However, the task is complicated by several factors, including the variability in the number of fetuses per litter and the inconsistent visibility of organs across images. Despite these challenges, the development of such a model represents a significant step forward in automating biomedical image analysis, offering a powerful tool to enhance the efficiency and accuracy of developmental research.
Data
Our dataset consists of 25 series of ultrasound scans from litters of mouse fetuses, with each series containing 35 images captured from different angles. These images vary significantly in terms of the number of visible fetuses and the visibility of target organs (hearts, livers, and placentas). Each ultrasound series is accompanied by three folders, each containing Region of Interest (ROI) files that manually annotate one of the target organs. Specifically, for each image in a series, there are three ROI files: one for hearts, one for livers, and one for placentas. These ROI files, meticulously annotated by research teams, serve as the ground truth labels for training and evaluating our three segmentation models.
However, the dataset presents several challenges. First, each ultrasound image is a composite of 35 sub-images, which must be processed individually. This increases the complexity of the task, as the model must handle a large number of sub-images per series. Second, the variability in the number of visible organs across images adds another layer of difficulty. Depending on the angle of the ultrasound, some organs may not be visible, and the model must adapt to these inconsistencies without producing false positives. This is particularly critical because precision is prioritized over recall in this project. It is more acceptable for the model to miss some organs (which can be manually corrected later) than to incorrectly segment non-target objects, as such errors could lead to misleading research conclusions.
Additionally, the dataset’s heterogeneity poses challenges for model generalization. Variations in image quality, lighting, and organ positioning require the model to be robust and adaptable. Another issue is the potential for class imbalance, as some organs (e.g., placentas) may appear less frequently or be less distinct in certain images, making them harder to segment accurately. Finally, the manual annotation process, while precise, may introduce minor inconsistencies or biases, which the model must learn to handle without overfitting. These challenges highlight the need for a carefully designed preprocessing pipeline and a robust model architecture to ensure reliable and accurate segmentation.



Preprocess and Data Augmentation
To prepare the dataset for training, we first addressed the composite nature of the ultrasound images. Each series of 35 sub-images was subdivided into individual images, resulting in a final dataset of 35 images per series × 25 series = 875 images per organ model (hearts, livers, and placentas). This step ensured that each image could be processed independently, simplifying the input pipeline for the segmentation models.
Next, we applied a series of data augmentation techniques to enhance the dataset’s diversity and improve the model’s robustness. These techniques included:
-
Rotation: Images were rotated at random angles (e.g., ±10°, ±20°) to simulate different orientations of the ultrasound probe.
-
Scaling: Images were resized slightly to mimic variations in the distance between the probe and the fetus.
-
Translation: Images were shifted horizontally and vertically to account for positional variability.
-
Flip: Horizontal and vertical flips were applied to introduce symmetry variations.
-
Gaussian Noise: Random Gaussian noise was added to simulate imperfections in image acquisition.
-
Brightness and Contrast Adjustment: Variations in brightness and contrast were introduced to account for differences in ultrasound settings.
-
Elastic Deformations: Small elastic deformations were applied to simulate tissue flexibility and movement.
-
Cropping and Padding: Random cropping and padding were used to focus on specific regions while maintaining the image size.
These augmentations ensured that the model could generalize well to unseen data, even in the presence of variability in image quality, orientation, and organ positioning.
For the train-test split, we carefully partitioned the dataset to avoid bias in model evaluation. Specifically, we selected 5 complete ultrasound series (out of 25) to serve as the test set. This means that the test set contains 5 series × 35 images = 175 images per organ model, with each series representing a completely new set of ultrasound images that the model has never encountered during training. This approach ensures that the model is evaluated on entirely new data, with an unknown number of fetuses and organs, mimicking real-world scenarios where the model must generalize to unseen cases. The remaining 20 series (700 images per organ model) were used for training and validation, with a further split (e.g., 80-20) to create a validation set for hyperparameter tuning and early stopping.
Model
For the task of automated organ segmentation, we employed a U-Net-based architecture, a state-of-the-art convolutional neural network (CNN) designed specifically for biomedical image segmentation. The U-Net model consists of an encoder-decoder structure with skip connections, enabling it to capture both high-level contextual information and fine-grained spatial details. The encoder, composed of convolutional and max-pooling layers, extracts hierarchical features from the input image, while the decoder, using upsampling and convolutional layers, reconstructs the segmentation mask at the original resolution. The skip connections between corresponding encoder and decoder layers help preserve spatial information, which is crucial for precise segmentation.

To address the challenge of variable organ counts in each image, the model was designed to output a multi-channel segmentation mask, where each channel corresponds to a potential instance of the target organ (e.g., heart, liver, or placenta). During training, the model learns to predict the presence and location of each organ independently, allowing it to adapt to images with varying numbers of visible organs. For example, if an image contains three hearts, the model will produce three distinct segmentation regions in the output mask. This approach eliminates the need for predefined organ counts and enables the model to handle the inherent variability in the dataset.
The loss function used for training is a combination of Dice loss and Binary Cross-Entropy (BCE) loss, which together optimize both region overlap and pixel-wise accuracy. The Dice loss measures the overlap between the predicted and ground truth segmentation masks, making it particularly effective for imbalanced datasets where the target organs occupy a small portion of the image. The BCE loss, on the other hand, ensures precise pixel-wise classification by penalizing incorrect predictions. The combined loss function is defined as:

where α is a weighting factor that balances the contributions of the two losses. This hybrid loss function encourages the model to produce segmentation masks that are both accurate and well-aligned with the ground truth.
During training, the model was optimized using the Adam optimizer with a learning rate scheduler to dynamically adjust the learning rate based on validation performance. Early stopping was employed to prevent overfitting, and data augmentation techniques (e.g., rotation, scaling, noise addition) were applied to improve generalization. The model’s ability to handle variable organ counts and produce precise segmentation masks makes it a powerful tool for automating the analysis of ultrasound images, significantly reducing the workload for researchers while maintaining high accuracy.
Metrics and Evaluation
Given the nature of the segmentation task, where precision is prioritized over recall, the most relevant metrics for evaluating the model’s performance are Dice Coefficient (F1 Score), Precision, and Intersection over Union (IoU). These metrics provide a comprehensive assessment of the model’s ability to accurately segment the target organs while minimizing false positives.
-
Dice Coefficient (F1 Score): This metric measures the overlap between the predicted segmentation mask and the ground truth. It is particularly useful for imbalanced datasets, where the target organs occupy a small portion of the image. The Dice Coefficient is defined as:

where X is the predicted mask and Y is the ground truth. A higher Dice score indicates better segmentation accuracy.
-
Precision: Precision measures the proportion of correctly predicted positive pixels (i.e., organ pixels) relative to all predicted positive pixels. It is crucial for ensuring that the model does not produce false positives, which could lead to incorrect annotations. Precision is defined as:

High precision ensures that the model only segments regions that are truly part of the target organ.
-
Intersection over Union (IoU): IoU measures the overlap between the predicted and ground truth masks relative to their union. It is defined as:

IoU provides a stricter evaluation of segmentation accuracy, as it penalizes both false positives and false negatives.
To punish the model for errors, the loss function (a combination of Dice loss and Binary Cross-Entropy) inherently penalizes incorrect predictions. False positives (e.g., segmenting non-organ regions) increase the Binary Cross-Entropy loss, while false negatives (e.g., missing parts of an organ) reduce the Dice score, increasing the Dice loss. This dual penalty ensures that the model learns to prioritize both precision and accuracy.
For evaluation, the model is tested on the test set, which consists of completely unseen ultrasound series. This ensures that the evaluation reflects the model’s ability to generalize to new data with varying numbers of fetuses and organs. The test set is carefully constructed to avoid data leakage, as described earlier, by including entire ultrasound series that were not used during training.
To maximize accuracy and precision, several strategies are employed:
-
Data Augmentation: By augmenting the training data with rotations, scaling, noise, and other transformations, the model becomes more robust to variations in image quality and organ positioning.
-
Class Balancing: Techniques such as weighted loss functions or focal loss can be used to address class imbalance, ensuring that the model does not overlook smaller or less frequent organs.
-
Post-Processing: Applying morphological operations (e.g., erosion, dilation) to the predicted masks can help refine the segmentation results, removing small false positives and smoothing the boundaries of the segmented regions.
-
Ensemble Methods: Combining predictions from multiple models or using model ensembles can further improve accuracy and robustness by reducing variance and errors.
-
By focusing on these metrics and strategies, the model is optimized to deliver precise and reliable segmentation results, ensuring that it meets the high standards required for biomedical research applications.
Training and Results
Each of the three models (for hearts, livers, and placentas) was trained separately on a NVIDIA Tesla V100 GPU, leveraging its high computational power and memory capacity to handle the large dataset and complex U-Net architecture. The training process for each model took approximately 12-15 hours, depending on the organ and the specific architecture used.
Loss Convergence and Optimization
The training process was monitored using the combined Dice and Binary Cross-Entropy loss, which showed consistent convergence across all three models. For the heart segmentation model, the loss stabilized after around 25 epochs, while the liver and placenta models required slightly longer, converging after 20-25 epochs. This difference in convergence time can be attributed to the varying complexity and visibility of the organs in the ultrasound images.
To optimize training, we used the Adam optimizer with an initial learning rate of 1e-4, which was dynamically adjusted using a learning rate scheduler. The scheduler reduced the learning rate by a factor of 0.1 whenever the validation loss plateaued for more than 5 epochs, ensuring that the model could fine-tune its weights without overfitting. Early stopping was also implemented, halting training if the validation loss did not improve for 10 consecutive epochs.
U-Net Architectures Tested
Several variations of the U-Net architecture were experimented with to find the optimal configuration for each organ:
-
Standard U-Net: The baseline architecture with 4 encoder-decoder levels and skip connections. This performed well for the heart segmentation, achieving a Dice score of 0.92 on the validation set.
-
Deep U-Net: A deeper variant with 5 encoder-decoder levels, which improved performance for the liver segmentation, yielding a Dice score of 0.89. The additional depth allowed the model to capture more complex features of the liver’s irregular shape.
-
Residual U-Net: This architecture incorporated residual blocks into the U-Net, which proved particularly effective for the placenta segmentation, achieving a Dice score of 0.87. The residual connections helped mitigate vanishing gradients and improved feature propagation, which was crucial for segmenting the placenta’s diffuse and less distinct boundaries.
Training Details
-
Batch Size: A batch size of 8 was used to balance memory usage and training stability.
-
Data Augmentation: As described earlier, extensive data augmentation was applied during training to improve generalization.
-
Validation Set: 20% of the training data was reserved for validation, ensuring that the model’s performance was monitored on unseen data during training.
-
Post-Training Refinement: After initial training, each model was fine-tuned using a smaller learning rate (1e-5) for an additional 10-15 epochs to further refine the segmentation masks.
By tailoring the training process and architecture to the specific characteristics of each organ, we achieved high-performance models capable of accurately segmenting hearts, livers, and placentas in ultrasound images.


Conclusion, Limitations and Future Works
The developed models demonstrate strong performance in automating the segmentation of target organs—hearts, livers, and placentas—in ultrasound images of mouse fetuses. The models successfully generate precise Region of Interest (ROI) files, enabling researchers to analyze organ development with minimal manual intervention. Among the three models, the heart segmentation model achieved the highest accuracy, with a Dice score of 0.92, followed by the liver model at 0.89, and the placenta model at 0.87. The slightly lower performance for the placenta can be attributed to its less distinct and more diffuse boundaries in ultrasound images, making it inherently more challenging to segment compared to the well-defined structures of the heart and liver.
A key strength of the models is their high precision, meaning they rarely produce false positives (i.e., incorrectly segmenting non-target regions). While the models may occasionally miss some organs, particularly in cases of low visibility or overlapping structures, they do not generate erroneous annotations. This makes the models highly reliable for use in the laboratory, as researchers can confidently rely on the automated results and manually correct any missed organs if necessary.
However, there are limitations to the current approach. The models sometimes struggle with images where organs are partially obscured or poorly visible, leading to missed segmentations. Additionally, the variability in the number of fetuses and organs per image introduces complexity that the models do not always handle perfectly. Another limitation is the reliance on 2D ultrasound images, which may not fully capture the three-dimensional structure of the organs, particularly for the placenta.
To address these limitations and further improve performance, several future directions can be explored:
-
3D Segmentation: Incorporating 3D ultrasound data could provide a more comprehensive view of the organs, particularly the placenta, and improve segmentation accuracy by leveraging spatial context.
-
Advanced Architectures: Exploring more advanced neural network architectures, such as Transformer-based models or attention mechanisms, could enhance the model’s ability to focus on relevant regions and handle complex organ shapes.
-
Multi-Organ Joint Training: Training a single model to segment all three organs simultaneously, rather than using separate models, could improve efficiency and enable the model to learn shared features across organs.
-
Active Learning: Incorporating active learning techniques, where the model identifies uncertain cases for manual annotation and retraining, could further refine its performance over time.
-
Improved Data Augmentation: Expanding the range of data augmentation techniques, such as simulating different ultrasound probe angles or adding synthetic noise, could make the model more robust to real-world variability.
By pursuing these improvements, the models could achieve even higher accuracy and reliability, making them an indispensable tool for accelerating biomedical research and reducing the manual workload for researchers.