OpenRefine: extracting a number from a text column using regex


I'm trying to parse out a column of data from the OpenFoodFacts dataset, found via Kaggle. There is an attribute called "serving_size" that contains whatever serving size information is presented on the package of the food item. Most of the time the serving size is expressed in grams (g), but there is often other text as well. I'd like to be able to search through the string, find the number that corresponds to the number of grams, and extract that value into its own field. The value is not always an integer; it might have a decimal.

I'm new to regular expressions, but it seems like it ought to be possible to search for the "g" character and, if it is preceded by numeric values, extract them. I've found some recipes that suggest this is possible, but so far nothing I've tried has worked. The OpenRefine documentation gives an example of extracting decimal data using regex: /[-+]?[0-9]+(.[0-9]+)?/, but no variation of it works in our scenario. I've tried commands such as "value.match(/(.)?(/d+[g]).?/)". I'm finding that I don't understand how regex is supposed to work: when I tell it "/d" I'm expecting it to give me only numeric values, but that does not appear to be the case; it gives whatever is there regardless of character type.

Any help would be appreciated.

Here are some example text strings from the data:

serving_size
- 113.5g
- 20g
- 1 cup (227g)
- 4 cookies (15g)
- 13 pieces (39g)
- 1/4 packet (21g) makes 1/2 cup
- 0.75 oz (21g)
- 1 can (12 fl oz) 355g
- 15.2 fl oz (450g)
- 1 can (355ml)
- 1/4 tsp (1.4g)
- 10 fl oz 1 bottle.
- 20 fl oz
- 1 envelope (21g)
- 1 tbsp (4.5g)
- 45.2g
- 1/2 pack 142.5gms
- 1 carré de chocolat de 20g
- 4 biscottes (≈ 35g) ce paquet contient 8.5 portions de 4 biscottes.
- 0.33l
- 2galettes 10.5g
- 0.041649313g
- 1 package (79g)


In OpenRefine GREL (the language used to write transformations), the 'match' function requires the regular expression to match the entire string in the cell; you can't use a partial match.

The output of the 'match' function is an array of capture groups. To get a specific value, you have to select it from the array, or convert the array to a string.
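As a rough analogy only (this is Python, not GREL), the same two ideas can be sketched with Python's `re.fullmatch`, which also insists on matching the whole string. The sample cell and the simplified patterns are chosen here purely for illustration.

```python
import re

cell = "1 cup (227g)"

# A pattern that describes only the number and the unit does not cover
# the whole cell, so a full-string match fails.
partial = re.fullmatch(r"(\d+)g", cell)

# Padding the pattern with .* on both sides lets it span the whole cell.
full = re.fullmatch(r".*?(\d+)g.*", cell)

# The capture groups come back as a sequence; index 0 is the first group.
groups = list(full.groups())

print(partial)    # None
print(groups)     # ['227']
print(groups[0])  # 227
```

The point of the sketch is that the pattern has to account for the text before and after the number, and that the number is then picked out of the list of groups by index.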

So, for example, try:

value.match(/.*?(\d+\.?\d*)g(ram)?(s)?\b?.*/)[0] 

This will find strings where there is a number (with or without a decimal point) in front of the letter 'g', or 'gram' or 'grams', optionally followed by a word boundary (for example, before a space or bracket), and capture the number as the first member of the resulting array of capture groups.

The '?' is needed after the first '.*' to make it lazy, so that the capture group gets the whole number, not just the last digit.
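To see the effect of the lazy prefix outside OpenRefine, here is a minimal Python sketch that runs a lazy and a greedy version of the pattern over a few of the sample strings from the question. The names `LAZY`, `GREEDY` and `extract_grams` are made up for this illustration. Python's `re` module does not accept a quantified `\b`, so the optional `\b?` is left out; since it is optional and zero-width, that should not change what gets captured.

```python
import re

# Lazy prefix, mirroring the GREL expression (minus the optional \b?).
LAZY = re.compile(r".*?(\d+\.?\d*)g(ram)?(s)?.*")
# Same pattern with a greedy prefix, for comparison.
GREEDY = re.compile(r".*(\d+\.?\d*)g(ram)?(s)?.*")


def extract_grams(text, pattern=LAZY):
    """Return the captured number as a string, or None if nothing matches."""
    m = pattern.fullmatch(text)
    return m.group(1) if m else None


samples = ["113.5g", "1 cup (227g)", "1/4 tsp (1.4g)", "1 can (355ml)", "20 fl oz"]
for s in samples:
    print(s, "->", extract_grams(s), "| greedy:", extract_grams(s, GREEDY))
```

With the lazy prefix, "113.5g" yields "113.5"; with the greedy prefix it yields only "5", because the greedy `.*` swallows as much of the number as it can. One caveat with this pattern: a string such as "2galettes 10.5g" yields "2", since any 'g' directly after a number qualifies. Requiring a word boundary after the unit would avoid that, but it would also stop "142.5gms" from matching, so which behaviour is preferable depends on the data.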

