Pharmaceutical manufacturing monitoring data set — data_pharma

Samples were collected each day from all bioreactors, and glucose was measured both by spectroscopy and by the traditional method. The goal is to build models on the data from the more numerous small-scale bioreactors and then evaluate whether those models can accurately predict what is happening in the large-scale bioreactors (see details below).

Usage

data_pharma_bioreactors(...)

Source

Kuhn, Max, and Kjell Johnson. Feature engineering and selection: A practical approach for predictive models. Chapman and Hall/CRC, 2019.

https://bookdown.org/max/FES/illustrative-data-pharmaceutical-manufacturing-monitoring.html

Arguments

...: Arguments passed to pins::pin_read().

Value

tibble

Details

Experimental Background

Pharmaceutical companies use spectroscopy measurements to assess critical process parameters during the manufacturing of a biological drug. Models built on this process can be used with real-time data to recommend changes that can increase product yield. In the example that follows, Raman spectroscopy was used to generate the data. These data were derived from real data but have been modified to preserve confidentiality and to serve as an illustration.

To manufacture the drug used for this example, a specific type of protein is required, and that protein can be created by a particular type of cell. A batch of cells is seeded into a bioreactor, which is a device designed to help grow and maintain the cells. In production, a large bioreactor holds about 2000 liters and is used to make large quantities of protein in about two weeks.

Many factors can affect product yield. For example, because the cells are living, working organisms, they need the right temperature and sufficient food (glucose) to generate drug product. During the course of their work, the cells also produce waste (ammonia). Too much of the waste product can kill the cells and reduce the overall product yield. Typically, key attributes like glucose and ammonia are monitored daily to ensure that the cells are in optimal production conditions. Samples are collected and off-line measurements are made for these key attributes. If the measurements indicate a potential problem, the manufacturing scientists overseeing the process can adjust the contents of the bioreactor to optimize the conditions for the cells.

One issue is that conventional methods for measuring glucose and ammonia are time-consuming, and the results may not arrive in time to address any issues. Spectroscopy is a potentially faster way of obtaining these results, provided an effective model can turn the output of the spectroscopy assay into predictions of the substances of interest (i.e., glucose and ammonia).

However, it is not feasible to run experiments on many large-scale bioreactors. Two parallel experimental systems were therefore used:

Fifteen small-scale (5-liter) bioreactors were seeded with cells and monitored daily for 14 days.
Three large-scale bioreactors were also seeded with cells from the same batch and monitored daily for 14 days.
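
As a rough consistency check, the design figures above can be compared with the row count printed for the tibble (664,524). The sketch below assumes that every reactor contributes exactly one spectrum per day with no missing days; that assumption is not stated in the source, so the implied number of wave numbers per spectrum is an inference rather than a documented fact.

```r
# Design figures quoted in the text above
n_small <- 15
n_large <- 3
n_days <- 14

# Row count reported in the printed tibble
n_rows <- 664524

# Assumption: one spectrum per reactor per day, none missing
n_spectra <- (n_small + n_large) * n_days
n_wave_numbers <- n_rows / n_spectra

n_spectra       # 252
n_wave_numbers  # 2637
```

Under that assumption the rows divide evenly into 252 spectra of 2,637 wave numbers each, which is consistent with a complete design.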

Notes on Data

The intensity values have undergone signal processing up to smoothing. See the reference for more details.
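
The data are stored in long form, with one row per reactor, day, and wave number. Many modeling functions expect one row per spectrum instead, so a reshape to wide form is often a first step. The following is a minimal sketch using base R `reshape()` on a small invented stand-in that copies only the column names of the real data; the values and the reactor id `L_01` are hypothetical.

```r
# Toy stand-in with the same columns as data_pharma_bioreactors().
# All values are invented for illustration; "L_01" is a hypothetical id.
toy <- expand.grid(
  wave_number = c(407, 408, 409),
  day = 1:2,
  reactor_id = c("S_01", "L_01"),
  stringsAsFactors = FALSE
)
toy$size <- ifelse(startsWith(toy$reactor_id, "S"), "small", "large")
toy$glucose <- 20 + toy$day
toy$intensity <- seq_len(nrow(toy)) / 10

# One row per reactor-day, one intensity column per wave number
wide <- reshape(
  toy,
  idvar = c("reactor_id", "day", "size", "glucose"),
  timevar = "wave_number",
  direction = "wide"
)

dim(wide)
names(wide)
```

On the full data set the same call would produce one very wide row per spectrum, so it may be worth subsetting wave numbers first.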

License

data_pharma_bioreactors()
#> # A tibble: 664,524 x 6
#>    reactor_id   day glucose wave_number intensity size 
#>    <chr>      <int>   <dbl>       <dbl>     <dbl> <chr>
#>  1 S_01           1    24.7         407   0.909   small
#>  2 S_01           1    24.7         408   0.858   small
#>  3 S_01           1    24.7         409   0.766   small
#>  4 S_01           1    24.7         410   0.627   small
#>  5 S_01           1    24.7         411   0.448   small
#>  6 S_01           1    24.7         412   0.236   small
#>  7 S_01           1    24.7         413   0.00707 small
#>  8 S_01           1    24.7         414  -0.222   small
#>  9 S_01           1    24.7         415  -0.438   small
#> 10 S_01           1    24.7         416  -0.629   small
#> # i 664,514 more rows

glimpse()

tibble::glimpse(data_pharma_bioreactors())
#> Rows: 664,524
#> Columns: 6
#> $ reactor_id  <chr> "S_01", "S_01", "S_01", "S_01", "S_01", "S_01", "S_01", "S~
#> $ day         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ glucose     <dbl> 24.74713, 24.74713, 24.74713, 24.74713, 24.74713, 24.74713~
#> $ wave_number <dbl> 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418~
#> $ intensity   <dbl> 0.909439216, 0.857607637, 0.766150467, 0.626862221, 0.4480~
#> $ size        <chr> "small", "small", "small", "small", "small", "small", "sma~

Examples

# \donttest{
data_pharma_bioreactors()
#> # A tibble: 664,524 × 6
#>    reactor_id   day glucose wave_number intensity size 
#>    <chr>      <int>   <dbl>       <dbl>     <dbl> <chr>
#>  1 S_01           1    24.7         407   0.909   small
#>  2 S_01           1    24.7         408   0.858   small
#>  3 S_01           1    24.7         409   0.766   small
#>  4 S_01           1    24.7         410   0.627   small
#>  5 S_01           1    24.7         411   0.448   small
#>  6 S_01           1    24.7         412   0.236   small
#>  7 S_01           1    24.7         413   0.00707 small
#>  8 S_01           1    24.7         414  -0.222   small
#>  9 S_01           1    24.7         415  -0.438   small
#> 10 S_01           1    24.7         416  -0.629   small
#> # ℹ 664,514 more rows
# }
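
The stated goal is to build models on the small-scale reactors and evaluate them on the large-scale ones, which suggests splitting on the `size` column rather than sampling rows at random. The sketch below runs on an invented stand-in so that it needs no download. The value `"large"` and the id `L_01` are assumptions, since the printed output only shows `"small"` and `S_01`.

```r
# Toy stand-in; values are invented and "large" / "L_01" are assumed labels.
toy <- data.frame(
  reactor_id = c("S_01", "S_01", "S_02", "L_01"),
  day = c(1L, 2L, 1L, 1L),
  glucose = c(24.7, 23.1, 25.0, 24.2),
  wave_number = 407,
  intensity = c(0.91, 0.88, 0.93, 0.90),
  size = c("small", "small", "small", "large"),
  stringsAsFactors = FALSE
)

# Build models on the small-scale reactors, hold out the large-scale ones
small_scale <- toy[toy$size == "small", ]
large_scale <- toy[toy$size == "large", ]

# No reactor should appear on both sides of the split
shared <- intersect(small_scale$reactor_id, large_scale$reactor_id)
length(shared)
```

With the real data, `toy` would be replaced by the result of `data_pharma_bioreactors()`, after checking the actual values of `size`.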