Python 2.7, comparing 3 columns of a CSV
What is the easiest/simplest way to iterate through a large CSV file in Python 2.7, comparing 3 columns?
I am a total beginner and have completed a few online courses. I have managed to use the csv reader to do some basic stats on a CSV file, but nothing that compares groups within each other.
The data set is as follows:
    group  sub-group  processed
    1      a          y
    1      a          y
    1      a          y
    1      b
    1      b
    1      b
    1      c          y
    1      c          y
    1      c
    2      d          y
    2      d          y
    2      d          y
    2      e          y
    2      e
    2      e
    2      f          y
    2      f          y
    2      f          y
    3      g
    3      g
    3      g
    3      h          y
    3      h
    3      h
Everything belongs to a group, and within each group are sub-groups of 3 rows (replicates). We are working through the samples and adding to the processed column, but we don't always do the full complement, so sometimes there are only 1 or 2 processed out of a potential 3.
I'm trying to work towards a statistic showing the % completeness of each group, with a sub-group being "complete" if it has at least 1 row processed (it doesn't have to have all 3).
I've managed to get halfway there, using the following:
    for row in reader:
        group, subgroup, processed = row  # column order as in the data set above
        all_groups[group] = all_groups.get(group, 0) + 1
        if not processed == "":
            processed_groups[group] = processed_groups.get(group, 0) + 1

    result = {}
    for group in (processed_groups.viewkeys() | all_groups.keys()):
        if group in processed_groups:
            result.setdefault(group, []).append(processed_groups[group])
        if group in processed_groups:
            result.setdefault(group, []).append(all_groups[group])

    for group, v1 in result.items():
        todo = float(v1[0])   # number of processed rows in the group
        done = float(v1[1])   # total number of rows in the group
        progress = round((100 / done * todo), 2)
        print group, "--", progress, "%"
The problem is that the above code doesn't take into account the fact that the sub-groups may not be totally processed. As a result, the statistic will never read 100% unless the processed column is complete.
What I get:
    group 1 -- 55.56%
    group 2 -- 77.78%
    group 3 -- 16.67%
What I want:
    group 1 -- 66.67%
    group 2 -- 100%
    group 3 -- 50%
How can I make it check whether at least one row of each sub-group is processed, and just use that, before continuing on to the next sub-group?
One way is to use a couple of defaultdicts of sets. The first keeps track of all of the subgroups seen, and the second keeps track of the subgroups that have been processed. Using a set simplifies the code somewhat, as does using a defaultdict when compared to using a standard dictionary (although it's still possible).
    import csv
    from collections import defaultdict

    subgroups = defaultdict(set)             # group -> every sub-group seen
    processed_subgroups = defaultdict(set)   # group -> sub-groups with a processed row

    with open('data.csv') as csvfile:
        for group, subgroup, processed in csv.reader(csvfile):
            subgroups[group].add(subgroup)
            if processed == 'y':
                processed_subgroups[group].add(subgroup)

    for group in sorted(processed_subgroups):
        print("group {} -- {:.2f}%".format(
            group,
            len(processed_subgroups[group]) / float(len(subgroups[group])) * 100))
Output:
    group 1 -- 66.67%
    group 2 -- 100.00%
    group 3 -- 50.00%
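For comparison, here is a rough sketch of the same idea using standard dictionaries with setdefault instead of defaultdict. It is only an illustration under a few assumptions: the SAMPLE string is a hypothetical stand-in for data.csv (comma-separated, with a header row and blank cells for unprocessed rows), and the completeness function and the seen/done names are made up for this example. Unlike the loop above, it also reports a group in which no sub-group has been processed (as 0.00%), and it should run on both Python 2.7 and Python 3.

```python
import csv

# Hypothetical stand-in for data.csv; the comma-separated layout, the header
# row and the blank "processed" cells are assumptions based on the question.
SAMPLE = """group,sub-group,processed
1,a,y
1,a,y
1,a,y
1,b,
1,b,
1,b,
1,c,y
1,c,y
1,c,
2,d,y
2,d,y
2,d,y
2,e,y
2,e,
2,e,
2,f,y
2,f,y
2,f,y
3,g,
3,g,
3,g,
3,h,y
3,h,
3,h,
"""


def completeness(rows):
    """Return {group: % of sub-groups having at least one processed row}."""
    seen = {}   # group -> set of every sub-group encountered
    done = {}   # group -> set of sub-groups with at least one processed row
    for group, subgroup, processed in rows:
        seen.setdefault(group, set()).add(subgroup)
        done.setdefault(group, set())  # make sure every group gets an entry
        if processed.strip() == 'y':
            done[group].add(subgroup)
    return {g: 100.0 * len(done[g]) / len(seen[g]) for g in seen}


reader = csv.reader(SAMPLE.splitlines())
next(reader)  # skip the header row
results = completeness(reader)

for group in sorted(results):
    print("group {} -- {:.2f}%".format(group, results[group]))
```

With the sample above this prints 66.67%, 100.00% and 50.00% for groups 1, 2 and 3. To read the real file, the reader could instead be built from open('data.csv') as in the defaultdict version.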