Pharmaceutical manufacturing monitoring data set
Source: R/pharma_bioreactors.R
data_pharma_bioreactors.Rd
Samples were collected each day from all bioreactors, and glucose was measured both by spectroscopy and by the traditional method. The goal is to build models on the data from the more numerous small-scale bioreactors and then evaluate whether those models can accurately predict what is happening in the large-scale bioreactors (see details below).
Source
Kuhn, Max, and Kjell Johnson. Feature engineering and selection: A practical approach for predictive models. Chapman and Hall/CRC, 2019.
https://bookdown.org/max/FES/illustrative-data-pharmaceutical-manufacturing-monitoring.html
Arguments
- ...
Arguments passed to pins::pin_read().
Details
Experimental Background
Pharmaceutical companies use spectroscopy measurements to assess critical process parameters during the manufacturing of a biological drug. Models built on this process can be used with real-time data to recommend changes that can increase product yield. In the example that follows, Raman spectroscopy was used to generate the data. These data were derived from real data but have been modified to preserve confidentiality and to serve the purposes of illustration.
To manufacture the drug being used for this example, a specific type of protein is required, and that protein can be created by a particular type of cell. A batch of cells is seeded into a bioreactor, which is a device designed to help grow and maintain the cells. In production, a large bioreactor would be about 2000 liters and is used to make large quantities of proteins in about two weeks.
Many factors can affect product yield. For example, because the cells are living, working organisms, they need the right temperature and sufficient food (glucose) to generate drug product. During the course of their work, the cells also produce waste (ammonia). Too much of the waste product can kill the cells and reduce the overall product yield. Typically, key attributes like glucose and ammonia are monitored daily to ensure that the cells are in optimal production conditions. Samples are collected and off-line measurements are made for these key attributes. If the measurements indicate a potential problem, the manufacturing scientists overseeing the process can tweak the contents of the bioreactor to optimize the conditions for the cells.
One issue is that conventional methods for measuring glucose and ammonia are time-consuming, and the results may not arrive in time to address any issues. Spectroscopy is a potentially faster way to obtain these results, provided an effective model can turn the output of the spectroscopy assay into predictions of the substances of interest (i.e., glucose and ammonia).
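As a rough illustration of what "a model that turns spectra into glucose predictions" could look like, the sketch below fits an ordinary linear regression to simulated data. Everything in it is an assumption made for illustration: the column names `intensity_407` and `intensity_408`, the simulated values, and the choice of `lm()` are not taken from the data set or from the reference.

```r
set.seed(1)
n <- 30

# Hypothetical wide-format spectra: one row per sample,
# one column per wave number (names are made up).
spectra <- data.frame(
  intensity_407 = rnorm(n),
  intensity_408 = rnorm(n)
)

# Simulated off-line glucose measurements; the coefficients are arbitrary.
spectra$glucose <- 20 + 3 * spectra$intensity_407 -
  2 * spectra$intensity_408 + rnorm(n, sd = 0.1)

# Fit a model that maps intensities to glucose.
fit <- lm(glucose ~ intensity_407 + intensity_408, data = spectra)

# Predict glucose for a new spectrum without waiting for the off-line assay.
new_sample <- data.frame(intensity_407 = 0.5, intensity_408 = -0.25)
predicted_glucose <- predict(fit, newdata = new_sample)
```

Real spectra contain many highly correlated wave numbers, so a plain linear regression like this one is unlikely to be adequate in practice; it only shows the shape of the workflow.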
However, it is not feasible to do experiments using many large-scale bioreactors. Two parallel experimental systems were used:
- 15 small-scale (5 liters) bioreactors were seeded with cells and were monitored daily for 14 days.
- Three large-scale bioreactors were also seeded with cells from the same batch and monitored daily for 14 days.
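The two experimental systems map naturally onto a model-building set and an evaluation set. The sketch below shows one way to make that split with the `size` column, using a small made-up data frame in place of `data_pharma_bioreactors()`. The reactor ID `"L_01"` and the value `"large"` are assumptions; only `"S_01"` and `"small"` appear in the printed output on this page.

```r
# Toy stand-in with the same columns as the real tibble;
# all values are made up for illustration.
toy <- expand.grid(
  wave_number = 407:409,
  day = 1:2,
  reactor_id = c("S_01", "S_02", "L_01"),  # "L_01" is an assumed ID
  stringsAsFactors = FALSE
)
toy$size <- ifelse(startsWith(toy$reactor_id, "S"), "small", "large")
toy$glucose <- 20 + toy$day
toy$intensity <- seq_len(nrow(toy)) / 100

# Model-building data: small-scale reactors only.
small_scale <- toy[toy$size == "small", ]

# Evaluation data: large-scale reactors only.
large_scale <- toy[toy$size == "large", ]

# Sanity check: no reactor should appear in both sets.
shared <- intersect(small_scale$reactor_id, large_scale$reactor_id)
```

Splitting on `size` rather than on individual rows keeps every measurement from a given reactor on one side of the split.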
Notes on Data
The intensity values have undergone signal processing up to smoothing. See the reference for more details.
License
data_pharma_bioreactors()
#> # A tibble: 664,524 x 6
#> reactor_id day glucose wave_number intensity size
#> <chr> <int> <dbl> <dbl> <dbl> <chr>
#> 1 S_01 1 24.7 407 0.909 small
#> 2 S_01 1 24.7 408 0.858 small
#> 3 S_01 1 24.7 409 0.766 small
#> 4 S_01 1 24.7 410 0.627 small
#> 5 S_01 1 24.7 411 0.448 small
#> 6 S_01 1 24.7 412 0.236 small
#> 7 S_01 1 24.7 413 0.00707 small
#> 8 S_01 1 24.7 414 -0.222 small
#> 9 S_01 1 24.7 415 -0.438 small
#> 10 S_01 1 24.7 416 -0.629 small
#> # i 664,514 more rows
glimpse()
tibble::glimpse(data_pharma_bioreactors())
#> Rows: 664,524
#> Columns: 6
#> $ reactor_id <chr> "S_01", "S_01", "S_01", "S_01", "S_01", "S_01", "S_01", "S~
#> $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ glucose <dbl> 24.74713, 24.74713, 24.74713, 24.74713, 24.74713, 24.74713~
#> $ wave_number <dbl> 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418~
#> $ intensity <dbl> 0.909439216, 0.857607637, 0.766150467, 0.626862221, 0.4480~
#> $ size <chr> "small", "small", "small", "small", "small", "small", "sma~
Examples
# \donttest{
data_pharma_bioreactors()
#> # A tibble: 664,524 × 6
#> reactor_id day glucose wave_number intensity size
#> <chr> <int> <dbl> <dbl> <dbl> <chr>
#> 1 S_01 1 24.7 407 0.909 small
#> 2 S_01 1 24.7 408 0.858 small
#> 3 S_01 1 24.7 409 0.766 small
#> 4 S_01 1 24.7 410 0.627 small
#> 5 S_01 1 24.7 411 0.448 small
#> 6 S_01 1 24.7 412 0.236 small
#> 7 S_01 1 24.7 413 0.00707 small
#> 8 S_01 1 24.7 414 -0.222 small
#> 9 S_01 1 24.7 415 -0.438 small
#> 10 S_01 1 24.7 416 -0.629 small
#> # ℹ 664,514 more rows
# }
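The tibble is in long format, with one row per reactor, day, and wave number. Many modeling functions expect one row per sample and one column per wave number instead. The sketch below shows one possible reshaping with base R's `reshape()`, applied to a small made-up data frame that mimics the columns above; the values are not real.

```r
# Toy stand-in with the same columns as the real tibble (values are made up).
toy <- expand.grid(
  wave_number = 407:409,
  day = 1:2,
  reactor_id = c("S_01", "S_02"),
  stringsAsFactors = FALSE
)
toy$glucose <- 20 + toy$day
toy$intensity <- seq_len(nrow(toy)) / 100
toy$size <- "small"

# One row per reactor and day, one intensity column per wave number.
wide <- reshape(
  toy,
  idvar = c("reactor_id", "day", "glucose", "size"),
  timevar = "wave_number",
  v.names = "intensity",
  direction = "wide"
)

names(wide)
```

With the default separator, the new columns are named `intensity.407`, `intensity.408`, and so on.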